3 Nov 2023

ClickHouse: Nested type Array(String) cannot be inside Nullable type (ILLEGAL_TYPE_OF_ARGUMENT)

I’ve been working with some data that’s in CSV format but has tab-separated values in some columns. In this blog post, we’re going to learn how to process that data in ClickHouse.

The CSV file that we’re working with looks like this:

Table 1. data.csv
value
foo bar

We’ll launch ClickHouse Local (clickhouse local) and then run the following:

FROM file('data.csv', CSVWithNames)
SELECT *;

Output

┌─value─────┐
│ foo   bar │
└───────────┘

Let’s try to split the value field on tab using the splitByString function:

FROM file('data.csv', CSVWithNames)
SELECT splitByString('\t', value);

Output

Received exception:
Code: 43. DB::Exception: Nested type Array(String) cannot be inside Nullable type: While processing splitByString('\t', value). (ILLEGAL_TYPE_OF_ARGUMENT)

The problem we’ve run into here is that ClickHouse’s CSV reader assumes that fields are nullable since it’s possible that there could be nulls. We can disable the schema_inference_make_columns_nullable setting to tell it to not treat fields as nullable.

FROM file('data.csv', CSVWithNames)
SELECT splitByString('\t', value)
SETTINGS schema_inference_make_columns_nullable = 0;

Output

┌─splitByString('\t', value)─┐
│ ['foo','bar']              │
└────────────────────────────┘

Alternatively, we can use the assumeNotNull function if we want it to keep all other fields nullable:

FROM file('data.csv', CSVWithNames)
SELECT splitByString('\t', assumeNotNull(value));

Output

┌─splitByString('\t', assumeNotNull(value))─┐
│ ['foo','bar']                             │
└───────────────────────────────────────────┘

But this function will return an arbitrary value if a null is found. So let’s say we update our CSV file to look like this:

Table 2. data.csv
value
foo bar
mark bar

If we run that last query, we’ll see the following output:

Output

┌─splitByString('\t', assumeNotNull(value))─┐
│ ['foo','bar']                             │
│ ['']                                      │
│ ['mark','bar']                            │
└───────────────────────────────────────────┘

So we might prefer to use ifNull, which lets us choose our own default value to use if a null value is encountered:

FROM file('data.csv', CSVWithNames)
SELECT splitByString('\t', ifNull(value, 'N/A'));

Output

┌─splitByString('\t', ifNull(value, 'N/A'))─┐
│ ['foo','bar']                             │
│ ['N/A']                                   │
│ ['mark','bar']                            │
└───────────────────────────────────────────┘

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.