Materialize: Querying JSON files
I recently learnt about Materialize, a SQL streaming database, via their Series B fundraising announcement, and thought I’d take it for a spin.
My go-to dataset for new databases is Strava, an app that I use to record my runs. It has an API that returns a JSON representation of each run, containing information like the distance covered, elapsed time, heart rate metrics, and more.
I’ve extracted my latest 30 activities to a file in the JSON lines format, and in this post we’re going to analyse that data using Materialize.
The data is stored in the activities-clean.json file in the mneedham/materialize-sandbox repository.
Below is an example of the first line in this file:
{"resource_state":2,"athlete":{"id":6958432,"resource_state":1},"name":"Morning Run","distance":14175.1,"moving_time":4222,"elapsed_time":4391,"total_elevation_gain":105.4,"type":"Run","workout_type":null,"id":4470124807,"upload_id":4776694342,"start_date":"2020-12-14T05:31:43Z","start_date_local":"2020-12-14T05:31:43Z","timezone":"(GMT+00:00) Europe/London","utc_offset":0,"location_city":null,"location_state":null,"location_country":"United Kingdom","achievement_count":0,"kudos_count":2,"comment_count":0,"athlete_count":1,"photo_count":0,"trainer":false,"commute":false,"manual":false,"private":false,"visibility":"everyone","flagged":false,"gear_id":"g7064038","from_accepted_tag":false,"upload_id_str":"4776694342","average_speed":3.357,"max_speed":4.3,"average_cadence":87.4,"has_heartrate":true,"average_heartrate":150.3,"max_heartrate":168,"heartrate_opt_out":false,"display_hide_heartrate_option":true,"elev_high":76.2,"elev_low":25,"pr_count":0,"total_photo_count":0,"has_kudoed":false,"suffer_score":116}
Connecting to Materialize
We’re going to query that data using Materialize, which we’ll set up using the following Docker Compose configuration:
version: '3'
services:
  materialize:
    image: materialize/materialized:v0.5.3
    container_name: "materialize-sandbox-docker-compose"
    volumes:
      - ./data:/data
    ports:
      - "6875:6875"
The data directory containing the activities-clean.json file is in the mneedham/materialize-sandbox/strava GitHub repository, which also contains setup instructions.
Once we’ve cloned that repository, we can launch Materialize by running the following command:
docker-compose up
We should see the following output:
Dec 16 16:13:15.027 INFO materialized: booting server
materialized v0.5.3 (f56c5a8a3)
OpenSSL 1.1.1g 21 Apr 2020
librdkafka v1.5.0
....
=======================================================================
Thank you for trying Materialize!
We are interested in any and all feedback you have, which may be able
to improve both our software and your queries! Please reach out at:
Web: https://materialize.com
GitHub issues: https://github.com/MaterializeInc/materialize/issues
Email: support@materialize.io
Twitter: @MaterializeInc
=======================================================================
materialized v0.5.3 (f56c5a8a3) listening on 0.0.0.0:6875...
We’re now ready to connect to Materialize, which we can do using the psql CLI tool:
psql -h localhost -p 6875 materialize
psql (12.5 (Ubuntu 12.5-0ubuntu0.20.04.1), server 9.5.0)
Type "help" for help.
materialize=>
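To check that the connection is working, we can run a quick query. SHOW SOURCES lists any sources we’ve created, which at this point is none:

SHOW SOURCES;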
Creating a materialized view
To query this file using Materialize we’ll need to create a source around the file and then a materialized view on top of that source. At a high level, the source reads raw lines from the file, and the materialized view parses those lines and maintains the results, ready to be queried with SQL.
Let’s now create a source on top of the file.
We can do this using the CREATE SOURCE statement, as shown below:
CREATE SOURCE activities
FROM FILE '/data/activities-clean.json' FORMAT TEXT;
We need to indicate where the file lives on the server, as well as the format of its contents, which in this case is TEXT. We’ll be able to describe the JSON structure in more detail when we create the view.
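As an aside, since Materialize is built for streaming, file sources can also watch a file for newly appended lines. Based on my reading of the docs for this version, that’s done with the tail option (a sketch; activities_tailed is just an illustrative name):

CREATE SOURCE activities_tailed
FROM FILE '/data/activities-clean.json' WITH (tail = true)
FORMAT TEXT;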
We can describe the source by running the following query:
SHOW COLUMNS FROM activities;
    name    | nullable | type
------------+----------+------
 mz_line_no | f        | int8
 text       | f        | text
So the main field we’re interested in is the text one, which contains our JSON documents.
Next we’re going to create a materialized view, in which we can use JSON functions to extract fields from the JSON documents and cast them to appropriate types.
By default fields are imported as the text type.
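As a quick illustration of the operators we’re about to use, here’s how ->> and a cast behave on a JSON literal. This is standard Postgres syntax, which Materialize follows:

-- ->> extracts the field as text; the cast then turns it into a number
SELECT ('{"distance": 14175.1}'::json ->> 'distance')::float AS distance;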
We can pull out the interesting data from the activities by creating the following materialized view:
CREATE MATERIALIZED VIEW activities_view AS
SELECT (val->>'id')::float::bigint AS id,
(val->>'distance')::float AS distance,
(val->>'moving_time')::float::int AS moving_time,
(val->>'elapsed_time')::float::int AS elapsed_time,
(val->>'total_elevation_gain')::float AS total_elevation_gain,
(val->>'elev_high')::float AS elev_high,
(val->>'elev_low')::float AS elev_low,
(val->>'average_speed')::float AS average_speed,
(val->>'max_speed')::float AS max_speed,
(val->>'average_heartrate')::float AS average_heartrate,
(val->>'max_heartrate')::float AS max_heartrate,
(val->>'start_date')::timestamp AS start_date
FROM (SELECT text::json AS val FROM activities);
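A quick way to check that the view has picked everything up is to count the rows. We’d expect 30, matching the number of activities we exported:

SELECT count(*) FROM activities_view;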
While I was working out which fields to extract, I found myself repeatedly creating and dropping the view.
We can drop the view using the DROP VIEW statement:
DROP VIEW activities_view;
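Materialize follows Postgres syntax here, so the IF EXISTS modifier should also work, which avoids an error on the first run when the view doesn’t exist yet:

DROP VIEW IF EXISTS activities_view;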
And once we’re happy with it, we can use the SHOW COLUMNS command to describe the view, as shown below:
SHOW COLUMNS FROM activities_view;
         name         | nullable |   type
----------------------+----------+-----------
 average_heartrate    | t        | float8
 average_speed        | t        | float8
 distance             | t        | float8
 elapsed_time         | t        | int4
 elev_high            | t        | float8
 elev_low             | t        | float8
 id                   | t        | int8
 max_heartrate        | t        | float8
 max_speed            | t        | float8
 moving_time          | t        | int4
 start_date           | t        | timestamp
 total_elevation_gain | t        | float8
Querying a materialized view
Ok, now it’s time for some fun: let’s query the materialized view that we’ve just created.
We’ll start with a SQL query that returns the latest 10 activities, sorted in descending order by start_date:
SELECT *
FROM activities_view
ORDER BY start_date DESC
LIMIT 10;
     id     | distance | moving_time | elapsed_time | total_elevation_gain | elev_high | elev_low | average_speed | max_speed | average_heartrate | max_heartrate |     start_date
------------+----------+-------------+--------------+----------------------+-----------+----------+---------------+-----------+-------------------+---------------+---------------------
 4470124807 |  14175.1 |        4222 |         4391 |                105.4 |      76.2 |       25 |         3.357 |       4.3 |             150.3 |           168 | 2020-12-14 05:31:43
 4460808499 |  13132.6 |        3794 |         3935 |                 81.4 |      67.4 |       25 |         3.461 |       4.6 |             147.3 |           163 | 2020-12-12 06:28:06
 4456998655 |  12244.8 |        3580 |         3643 |                 74.4 |      67.4 |       25 |          3.42 |       4.4 |             156.2 |           190 | 2020-12-11 04:15:32
 4448905703 |  12304.8 |        3567 |         3889 |                 87.7 |      69.9 |       25 |          3.45 |       4.4 |             153.5 |           182 | 2020-12-09 05:17:29
 4440554208 |  12239.9 |        3856 |         3987 |                   87 |      69.8 |     28.1 |         3.174 |         4 |             154.4 |           178 | 2020-12-07 05:38:44
 4431429475 |  13116.8 |        3992 |         4049 |                111.8 |     122.3 |       57 |         3.286 |       4.3 |             156.1 |           185 | 2020-12-05 08:41:17
 4427493338 |  12253.2 |        3756 |         3814 |                 73.3 |      67.4 |     29.1 |         3.262 |       4.5 |             158.8 |           182 | 2020-12-04 05:33:21
 4419097799 |    12305 |        3852 |         4013 |                 93.9 |      74.6 |     29.6 |         3.194 |       4.6 |             143.5 |           174 | 2020-12-02 05:36:33
 4410708776 |  10916.3 |        3456 |         3645 |                 87.8 |      67.3 |     22.9 |         3.159 |       4.1 |             145.7 |           178 | 2020-11-30 05:40:06
 4400404590 |  12926.9 |        4142 |         4243 |                123.1 |      74.6 |     28.5 |         3.121 |       6.4 |             150.6 |           193 | 2020-11-28 07:17:00
We can use all the SQL aggregation functions that we’re used to.
So if we wanted to work out the distance run, grouped by month, we could write the following query that uses the date_trunc function:
SELECT date_trunc('month', start_date) AS month,
sum(distance) AS totalDistance
FROM activities_view
GROUP BY month
ORDER BY month;
        month        |   totaldistance
---------------------+--------------------
 2020-10-01 00:00:00 |  75149.69999992847
 2020-11-01 00:00:00 | 194009.49999952316
 2020-12-01 00:00:00 | 101772.19999974966
I ran a lot more in November than in the other two months, but admittedly we don’t have complete data for October or December (so far). If we pulled in the full data for those months, the totals would likely be similar.
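Those raw sums are in metres, with a bit of floating point noise. A small tweak makes them easier to read; this sketch uses a plain integer cast to get whole kilometres rather than any version-specific rounding function:

SELECT date_trunc('month', start_date) AS month,
       (sum(distance) / 1000)::int AS totalKm
FROM activities_view
GROUP BY month
ORDER BY month;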
What about the fastest run, and the pace per mile on that run? We can compute both with a little help from the to_char and to_timestamp functions:
SELECT start_date, distance,
to_char(to_timestamp(moving_time / distance * 1609.34), 'MI:SS') AS pacePerMile,
to_char(to_timestamp(moving_time), 'HH24:MI:SS') AS time
FROM activities_view
ORDER BY moving_time / distance
LIMIT 10;
     start_date      | distance | pacepermile |   time
---------------------+----------+-------------+----------
 2020-12-12 06:28:06 |  13132.6 | 07:44       | 01:03:14
 2020-12-09 05:17:29 |  12304.8 | 07:46       | 00:59:27
 2020-10-21 04:34:38 |  12624.5 | 07:48       | 01:01:14
 2020-10-23 03:49:00 |  13762.7 | 07:49       | 01:06:57
 2020-12-11 04:15:32 |  12244.8 | 07:50       | 00:59:40
 2020-12-14 05:31:43 |  14175.1 | 07:59       | 01:10:22
 2020-10-24 06:04:41 |  13703.4 | 08:01       | 01:08:21
 2020-10-30 05:43:05 |   8172.7 | 08:02       | 00:40:49
 2020-10-28 05:33:27 |  14494.7 | 08:05       | 01:12:52
 2020-11-18 05:42:50 |  11968.1 | 08:06       | 01:00:16
It looks like my fastest run was just a few days ago, but I have had a few others that were at a similar pace.
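To sanity-check the pace arithmetic: moving_time / distance gives seconds per metre, multiplying by 1609.34 converts that to seconds per mile, and to_timestamp with to_char renders the result as minutes and seconds. Plugging in the numbers from the 2020-12-14 run:

-- 4222 seconds over 14175.1 metres is ~479 seconds per mile, i.e. 07:59
SELECT to_char(to_timestamp(4222 / 14175.1 * 1609.34), 'MI:SS') AS pacePerMile;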
We could combine the previous two queries to find the fastest pace per month, as shown in the following query:
SELECT to_char(date_trunc('month', start_date), 'YYYY-MM') AS month,
to_char(to_timestamp(min(moving_time / distance * 1609.34)), 'MI:SS') AS pacePerMile,
to_char(to_timestamp(min(moving_time)), 'HH24:MI:SS') AS time
FROM activities_view
GROUP BY month;
  month  | pacepermile |   time
---------+-------------+----------
 2020-10 | 07:48       | 00:40:49
 2020-11 | 08:06       | 00:39:39
 2020-12 | 07:44       | 00:59:27
If we wanted to also show the average pace per mile, we can add the avg function alongside the min function, as shown below:
SELECT to_char(date_trunc('month', start_date), 'YYYY-MM') AS month,
to_char(to_timestamp(min(moving_time / distance * 1609.34)), 'MI:SS') AS bestPacePerMile,
to_char(to_timestamp(avg(moving_time / distance * 1609.34)), 'MI:SS') AS averagePacePerMile
FROM activities_view
GROUP BY month;
  month  | bestpacepermile | averagepacepermile
---------+-----------------+--------------------
 2020-10 | 07:48           | 08:00
 2020-11 | 08:06           | 08:23
 2020-12 | 07:44           | 08:04
In summary
I’ve only scratched the surface of what we can do with Materialize. From my understanding, the more common use case is executing SQL queries over streaming data, whereas what we have here is a flat file with static content. I guess querying streaming data is the next thing for me to explore!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.