Apache Pinot: Import JSON data from a CSV file - Illegal Json Path: $['id'] does not match document
I’ve been working on an Apache Pinot dataset where I ingested a JSON document stored in a CSV file. I made a mistake with the representation of the JSON and it took me a while to figure out what I’d done wrong.
We’ll go through it in this blog post.
Setup
We’re going to spin up a local instance of Pinot using the following Docker Compose config:
version: '3.7'
services:
  zookeeper:
    image: zookeeper:3.5.6
    container_name: "zookeeper"
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  pinot-controller:
    image: apachepinot/pinot:0.11.0-SNAPSHOT-438c53b-20220719
    command: "StartController -zkAddress zookeeper:2181"
    container_name: "pinot-controller"
    volumes:
      - ./config:/config
      - ./data:/data
    restart: unless-stopped
    ports:
      - "9000:9000"
    depends_on:
      - zookeeper
  pinot-broker:
    image: apachepinot/pinot:0.11.0-SNAPSHOT-438c53b-20220719
    command: "StartBroker -zkAddress zookeeper:2181"
    restart: unless-stopped
    container_name: "pinot-broker"
    volumes:
      - ./config:/config
    ports:
      - "8099:8099"
    depends_on:
      - pinot-controller
  pinot-server:
    image: apachepinot/pinot:0.11.0-SNAPSHOT-438c53b-20220719
    command: "StartServer -zkAddress zookeeper:2181"
    restart: unless-stopped
    container_name: "pinot-server"
    volumes:
      - ./config:/config
    ports:
      - "8098:8098"
      - "8097:8097"
    depends_on:
      - pinot-broker
We can launch all the components by running the following command:
docker-compose up
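Once the containers are running, we can check that the controller is ready before going any further (its /health endpoint returns OK once it’s up):

# Wait for the Pinot controller to report healthy
curl http://localhost:9000/health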
Schema and Table
We’re going to be using the following schema:
{
  "schemaName": "users",
  "dimensionFieldSpecs": [
    {
      "name": "json_field",
      "dataType": "JSON"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestamp_field",
      "dataType": "TIMESTAMP",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
And this table config:
{
  "tableName": "users",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "replication": 1,
    "schemaName": "users",
    "timeColumnName": "timestamp_field"
  },
  "tenants": {},
  "tableIndexConfig": {},
  "ingestionConfig": {},
  "metadata": {}
}
We can create them both by running the following command:
docker exec -it pinot-controller bin/pinot-admin.sh AddTable \
  -tableConfigFile /config/table.json \
  -schemaFile /config/schema.json \
  -exec
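If we want to confirm that both were registered, the controller’s REST API will tell us:

# List registered schemas and fetch the config of the table we just created
curl http://localhost:9000/schemas
curl http://localhost:9000/tables/users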
Importing a CSV file
Next we’re going to import the following CSV file:
| timestamp_field | json_field |
|---|---|
| 760530903363 | "{\"id\": 7886, \"details\": {\"collaborator\": \"Brett Gill\", \"score\": 6056, \"address\": \"2882 Sheila Lakes Apt. 264\\nRhondaville, KS 09803\"}}" |
We’ll do this using this job spec:
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/data'
includeFileNamePattern: 'glob:**/output.csv'
outputDirURI: '/opt/pinot/data/users/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'users'
pinotClusterSpecs:
  - controllerURI: "http://localhost:9000"
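Our file uses the CSV reader’s defaults, i.e. a comma delimiter and a header row, so we don’t pass any reader configs. If your file is laid out differently, the reader accepts overrides via a configs map; a sketch, assuming a semicolon-delimited file:

recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    delimiter: ';'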
The following command will ingest the CSV file:
docker exec -it pinot-controller bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /config/job-spec.yml
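Once the job finishes, we can check that a segment has made it into the table:

# List the segments for the users table - we should see one segment listed
curl http://localhost:9000/segments/users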
Querying the JSON Field
Now it’s time to write a query that pulls out the id field from json_field, which we can do with this query:
select json_extract_scalar(json_field, '$.id', 'STRING') AS id,
       json_field
from users
[
  {
    "message": "QueryExecutionError:\njava.lang.IllegalArgumentException: Illegal Json Path: $['id'] does not match document\n\tat org.apache.pinot.core.common.evaluators.DefaultJsonPathEvaluator.throwPathNotFoundException(DefaultJsonPathEvaluator.java:613)\n\tat org.apache.pinot.core.common.evaluators.DefaultJsonPathEvaluator.processValue(DefaultJsonPathEvaluator.java:540)\n\tat org.apache.pinot.core.common.evaluators.DefaultJsonPathEvaluator.evaluateBlock(DefaultJsonPathEvaluator.java:250)\n\tat org.apache.pinot.core.common.DataFetcher$ColumnValueReader.readStringValues(DataFetcher.java:594)",
    "errorCode": 200
  }
]
Hmm, that didn’t quite work.
Let’s have a look at the contents of json_field:
select json_field
from users
| json_field |
|---|
| "{\"id\": 7886, \"details\": {\"collaborator\": \"Brett Gill\", \"score\": 6056, \"address\": \"2882 Sheila Lakes Apt. 264\\nRhondaville, KS 09803\"}}" |
We can see from the output that this field actually contains a JSON string rather than a JSON document: note the outer quotes and the backslash-escaped quotes inside the value. A JSON path like $.id can’t be applied to a scalar string, which is why the query fails with ‘does not match document’.
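As an aside, if re-ingesting isn’t immediately an option, json_extract_scalar accepts an optional fourth argument: a default value to return when the path doesn’t match the document. That turns the error into a placeholder value (a workaround rather than a fix):

select json_extract_scalar(json_field, '$.id', 'STRING', 'null') AS id,
       json_field
from users

The real fix, though, is in the data itself.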
We’ll need to reimport the data after fixing the JSON field, as shown in the CSV file below:
| timestamp_field | json_field |
|---|---|
| 760530903363 | {"id": 8360, "details": {"collaborator": "Mckenzie Brown", "score": 1384, "address": "68131 Robinson Vista\nChristianport, HI 60353"}} |
Once we’ve done that we can run the following query again:
select json_extract_scalar(json_field, '$.id', 'STRING') AS id,
       json_field
from users
| id | json_field |
|---|---|
| 8360 | {"id": 8360, "details": {"collaborator": "Mckenzie Brown", "score": 1384, "address": "68131 Robinson Vista\nChristianport, HI 60353"}} |
Success!
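Now that json_field contains an actual JSON document, nested paths work as well. For example, we can reach into the details object:

select json_extract_scalar(json_field, '$.details.collaborator', 'STRING') AS collaborator,
       json_extract_scalar(json_field, '$.details.score', 'INT') AS score
from users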