ClickHouse: Float equality
I’ve been playing around with NumPy data in ClickHouse this week and wanted to share what I learnt when checking for equality of float values. Let’s get going!
Creating arrays
We’re going to use Python’s NumPy library to create 5 arrays containing 10 values each:
import numpy as np
rng = np.random.default_rng(seed=42)
rng.random(size=(5, 5))
array([[0.28138389, 0.29359376, 0.66191651, 0.55703215, 0.78389821],
[0.66431354, 0.40638686, 0.81402038, 0.16697292, 0.02271207],
[0.09004786, 0.72235935, 0.46187723, 0.16127178, 0.50104478],
[0.1523121 , 0.69632038, 0.44615628, 0.38102123, 0.30151209],
[0.63028259, 0.36181261, 0.08764992, 0.1180059 , 0.96189766]])
By default, the type of the values is float64
:
rng.random(size=(5, 10)).dtype
dtype('float64')
But we’re going to cast that to float32
since that’s the format of the data in the file that I’ve been working with.
We’ll then save the arrays to a file:
values = rng.random(size=(5, 10))
np.save('arrays.npy', values.astype(np.float32))
Querying in ClickHouse
Next, let’s download and launch ClickHouse Local:
curl https://clickhouse.com/ | sh
./clickhouse local -m
We can then read the arrays.npy
file:
FROM file('arrays.npy', Npy)
SELECT array, toTypeName(array);
┌─array────────────────────────────────────────────────────┬─toTypeName(array)─┐
│ [0.2813839,0.29359376,0.6619165,0.55703217,0.78389823] │ Array(Float32) │
│ [0.66431355,0.40638685,0.8140204,0.16697292,0.022712072] │ Array(Float32) │
│ [0.09004786,0.72235936,0.46187723,0.16127178,0.50104475] │ Array(Float32) │
│ [0.1523121,0.69632035,0.44615626,0.38102123,0.3015121] │ Array(Float32) │
│ [0.6302826,0.36181262,0.08764992,0.1180059,0.9618977] │ Array(Float32) │
└──────────────────────────────────────────────────────────┴───────────────────┘
Next, let’s see if we can return only the rows where the first item of the array is 0.99237555
.
My first attempt looked like this:
FROM file('arrays.npy', Npy)
SELECT array, toTypeName(array)
WHERE array[1] = 0.66431355
FORMAT Vertical;
0 rows in set. Elapsed: 0.002 sec.
Computer says no! Let’s write a query that returns the type of our float literal:
SELECT 0.66431355 AS literal, toTypeName(literal);
┌────literal─┬─toTypeName(0.66431355)─┐
│ 0.66431355 │ Float64 │
└────────────┴────────────────────────┘
Aha, it’s a Float64
, which explains why it didn’t match our array of Float32
values.
There are at least two ways that I know of to rectify this.
We can use the toFloat32
function to cast the float literal to the right type:
FROM file('arrays.npy', Npy)
SELECT array, toTypeName(array)
WHERE array[1] = toFloat32(0.66431355);
Or, we can cast the value to Float32
FROM file('arrays.npy', Npy)
SELECT array, toTypeName(array)
WHERE array[1] = CAST(0.66431355 AS Float32);
Either way, we get the following result.
Row 1:
──────
┌─array────────────────────────────────────────────────────┬─toTypeName(array)─┐
│ [0.66431355,0.40638685,0.8140204,0.16697292,0.022712072] │ Array(Float32) │
└──────────────────────────────────────────────────────────┴───────────────────┘
My takeaway when working with floats (or even decimals) is to make sure that you know the underlying data types that you’re working with.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.