Apache Pinot: Import JSON data from a CSV file - Illegal Json Path: $['id'] does not match document
I’ve been working on an Apache Pinot dataset where I ingested a JSON document stored in a CSV file. I made a mistake with the representation of the JSON and it took me a while to figure out what I’d done wrong.
We’ll go through it in this blog post.
Setup
We’re going to spin up a local instance of Pinot using the following Docker Compose config:
version: '3.7'
services:
  zookeeper:
    image: zookeeper:3.5.6
    container_name: "zookeeper"
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  pinot-controller:
    image: apachepinot/pinot:0.11.0-SNAPSHOT-438c53b-20220719
    command: "StartController -zkAddress zookeeper:2181"
    container_name: "pinot-controller"
    volumes:
      - ./config:/config
      - ./data:/data
    restart: unless-stopped
    ports:
      - "9000:9000"
    depends_on:
      - zookeeper
  pinot-broker:
    image: apachepinot/pinot:0.11.0-SNAPSHOT-438c53b-20220719
    command: "StartBroker -zkAddress zookeeper:2181"
    restart: unless-stopped
    container_name: "pinot-broker"
    volumes:
      - ./config:/config
    ports:
      - "8099:8099"
    depends_on:
      - pinot-controller
  pinot-server:
    image: apachepinot/pinot:0.11.0-SNAPSHOT-438c53b-20220719
    command: "StartServer -zkAddress zookeeper:2181"
    restart: unless-stopped
    container_name: "pinot-server"
    volumes:
      - ./config:/config
    ports:
      - "8098:8098"
      - "8097:8097"
    depends_on:
      - pinot-broker
We can launch all the components by running the following command:
docker-compose up
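Once the containers are running, we can check that the controller is ready before going any further (its /health endpoint returns OK once it’s up):

# Wait for the Pinot controller to report healthy
curl http://localhost:9000/health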
Schema and Table
We’re going to be using the following schema:
{
  "schemaName": "users",
  "dimensionFieldSpecs": [
    {
      "name": "json_field",
      "dataType": "JSON"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestamp_field",
      "dataType": "TIMESTAMP",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
And this table config:
{
  "tableName": "users",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "replication": 1,
    "schemaName": "users",
    "timeColumnName": "timestamp_field"
  },
  "tenants": {},
  "tableIndexConfig": {},
  "ingestionConfig": {},
  "metadata": {}
}
We can create them both by running the following command:
docker exec -it pinot-controller bin/pinot-admin.sh AddTable \
  -tableConfigFile /config/table.json \
  -schemaFile /config/schema.json \
  -exec
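If we want to confirm that both were registered, the controller’s REST API will tell us:

# List registered schemas and fetch the config of the table we just created
curl http://localhost:9000/schemas
curl http://localhost:9000/tables/users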
Importing a CSV file
Next we’re going to import the following CSV file:
| timestamp_field | json_field |
|---|---|
| 760530903363 | "{\"id\": 7886, \"details\": {\"collaborator\": \"Brett Gill\", \"score\": 6056, \"address\": \"2882 Sheila Lakes Apt. 264\\nRhondaville, KS 09803\"}}" |
We’ll do this using this job spec:
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/data'
includeFileNamePattern: 'glob:**/output.csv'
outputDirURI: '/opt/pinot/data/users/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'users'
pinotClusterSpecs:
  - controllerURI: "http://localhost:9000"
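Our file uses the CSV reader’s defaults, i.e. a comma delimiter and a header row, so we don’t pass any reader configs. If your file is laid out differently, the reader accepts overrides via a configs map; a sketch, assuming a semicolon-delimited file:

recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    delimiter: ';'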
The following command will ingest the CSV file:
docker exec -it pinot-controller bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile /config/job-spec.yml
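Once the job finishes, we can check that a segment has made it into the table:

# List the segments for the users table - we should see one segment listed
curl http://localhost:9000/segments/users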
Querying the JSON Field
Now it’s time to write a query that pulls out the id field from json_field, which we can do with this query:
select json_extract_scalar(json_field, '$.id', 'STRING') AS id,
       json_field
from users
[
  {
    "message": "QueryExecutionError:\njava.lang.IllegalArgumentException: Illegal Json Path: $['id'] does not match document\n\tat org.apache.pinot.core.common.evaluators.DefaultJsonPathEvaluator.throwPathNotFoundException(DefaultJsonPathEvaluator.java:613)\n\tat org.apache.pinot.core.common.evaluators.DefaultJsonPathEvaluator.processValue(DefaultJsonPathEvaluator.java:540)\n\tat org.apache.pinot.core.common.evaluators.DefaultJsonPathEvaluator.evaluateBlock(DefaultJsonPathEvaluator.java:250)\n\tat org.apache.pinot.core.common.DataFetcher$ColumnValueReader.readStringValues(DataFetcher.java:594)",
    "errorCode": 200
  }
]
Hmm, that didn’t quite work.
Let’s have a look at the contents of json_field:
select json_field
from users
| json_field |
|---|
| "{\"id\": 7886, \"details\": {\"collaborator\": \"Brett Gill\", \"score\": 6056, \"address\": \"2882 Sheila Lakes Apt. 264\\nRhondaville, KS 09803\"}}" |
We can see from the output that this field actually contains a JSON string rather than a JSON document: note the outer quotes and the backslash-escaped quotes inside the value. A JSON path like $.id can’t be applied to a scalar string, which is why the query fails with ‘does not match document’.
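As an aside, if re-ingesting isn’t immediately an option, json_extract_scalar accepts an optional fourth argument: a default value to return when the path doesn’t match the document. That turns the error into a placeholder value (a workaround rather than a fix):

select json_extract_scalar(json_field, '$.id', 'STRING', 'null') AS id,
       json_field
from users

The real fix, though, is in the data itself.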
We’ll need to reimport the data after fixing the JSON field, as shown in the CSV file below:
| timestamp_field | json_field |
|---|---|
| 760530903363 | {"id": 8360, "details": {"collaborator": "Mckenzie Brown", "score": 1384, "address": "68131 Robinson Vista\nChristianport, HI 60353"}} |
Once we’ve done that we can run the following query again:
select json_extract_scalar(json_field, '$.id', 'STRING') AS id,
       json_field
from users
| id | json_field |
|---|---|
| 8360 | {"id": 8360, "details": {"collaborator": "Mckenzie Brown", "score": 1384, "address": "68131 Robinson Vista\nChristianport, HI 60353"}} |
Success!
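Now that json_field contains an actual JSON document, nested paths work as well. For example, we can reach into the details object:

select json_extract_scalar(json_field, '$.details.collaborator', 'STRING') AS collaborator,
       json_extract_scalar(json_field, '$.details.score', 'INT') AS score
from users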