· materialize

Materialize: Creating multiple views on one source

This is another post describing my exploration of Materialize, a SQL streaming database. In this post we’re going to learn how to create multiple views on top of the same underlying source.

materialize multi views banner

We’re still going to be using data extracted from Strava, an app that I use to record my runs, but this time we have more detailed information about each run. As in the previous blog posts, each run is represented as JSON document and store in the activities-detailed-all.json file in JSON lines format. Below is an example of one line in this file:

head -n1 data/activities-detailed-all.json |
jq -c '{id, distance, moving_time, elapsed_time,
        total_elevation_gain, elv_high, elev_low, average_speed,
        max_speed, average_heartrate, max_heartrate, start_date,
        best_efforts, splits_metric, splits_standard, segments_efforts}'
Results
{"id":259558400,"distance":3116.1,"moving_time":"0:17:48","elapsed_time":"0:17:55","total_elevation_gain":0,"elv_high":null,"elev_low":6,"average_speed":2.918,"max_speed":7.2,"average_heartrate":null,"max_heartrate":null,"start_date":"2015-02-25T06:18:39+00:00","best_efforts":[{"effort_name":"400m","moving_time":131,"elapsed_time":130,"effort_distance":400,"average_heartrate":null,"max_heartrate":null},{"effort_name":"1/2 mile","moving_time":266,"elapsed_time":265,"effort_distance":805,"average_heartrate":null,"max_heartrate":null},{"effort_name":"1k","moving_time":336,"elapsed_time":336,"effort_distance":1000,"average_heartrate":null,"max_heartrate":null},{"effort_name":"1 mile","moving_time":548,"elapsed_time":547,"effort_distance":1609,"average_heartrate":null,"max_heartrate":null}],"splits_metric":[{"distance":1002.9,"elapsed_time":"0:05:48","elevation_difference":3.2,"moving_time":"0:05:41","average_heartrate":null,"split":1,"pace_zone":0,"average_speed":2.94},{"distance":999.4,"elapsed_time":"0:05:48","elevation_difference":-1.9,"moving_time":"0:05:48","average_heartrate":null,"split":2,"pace_zone":0,"average_speed":2.87},{"distance":1001.7,"elapsed_time":"0:05:39","elevation_difference":-2.1,"moving_time":"0:05:39","average_heartrate":null,"split":3,"pace_zone":0,"average_speed":2.95},{"distance":112.1,"elapsed_time":"0:00:40","elevation_difference":0.7,"moving_time":"0:00:40","average_heartrate":null,"split":4,"pace_zone":0,"average_speed":2.8}],"splits_standard":[{"distance":1612.2,"elapsed_time":"0:09:21","elevation_difference":4.7,"moving_time":"0:09:14","average_heartrate":null,"split":1,"pace_zone":0,"average_speed":2.91},{"distance":1503.9,"elapsed_time":"0:08:34","elevation_difference":-4.8,"moving_time":"0:08:34","average_heartrate":null,"split":2,"pace_zone":0,"average_speed":2.93}],"segments_efforts":null}

Connecting to Materialize

We’re going to query that data using Materialize, which we’ll set up using the following Docker Compose configuration:

docker-compose.yml
  version: '3'
  services:
    materialize:
      image: materialize/materialized:v0.5.3
      container_name: "materialize-sandbox-docker-compose"
      volumes:
        - ./data:/data
      ports:
        - "6875:6875"

The data directory containing the activities-detailed-all.json file is in the mneedham/materialize-sandbox/strava GitHub repository. The repository also contains setup instructions.

Once we’ve cloned that repository, we can launch Materialize by running the following command:

docker-compose up

We’re now ready to connect to Materialize, which we can do using the psql CLI tool:

psql -h localhost -p 6875 materialize
Results
psql (12.5 (Ubuntu 12.5-0ubuntu0.20.04.1), server 9.5.0)
Type "help" for help.

materialize=>

Creating a materialized view

We’re going to start by creating a source around the file:

CREATE SOURCE
CREATE SOURCE activities_detailed_source
FROM FILE '/data/activities-detailed-all.json'
WITH(tail=true)
FORMAT TEXT;

Not it’s time to create some views on top of the source. We’ll start with the activites_detailed view, which returns the raw values from the JSON file:

CREATE VIEW
CREATE VIEW activities_detailed AS
SELECT  (val->>'id')::float::bigint AS id,
        (val->>'start_date')::timestamp AS start_date,
        (val->'best_efforts') AS best_efforts,
        (val->'splits_metric') AS splits_metric,
        (val->'splits_standard') AS splits_standard,
        (val->'segment_efforts') AS segment_efforts
FROM (SELECT text::json AS val FROM activities_detailed_source);

We’re not going to do any direct querying against this view, so we create a VIEW rather than a MATERIALIZED VIEW. If we want to query that view, we’ll need to make sure we append the AS OF clause to the end of our query, as shown in the following example:

SELECT *
FROM activities_detailed
LIMIT 1
AS OF 1;
Table 1. Results
id start_date best_efforts splits_metric splits_standard segment_efforts

3217642015

2020-03-25 18:27:10

[]

[{"average_heartrate":136.11428571428573,"average_speed":3.09,"distance":108.1,"elapsed_time":"0:00:41","elevation_difference":-0.3,"moving_time":"0:00:35","pace_zone":2.0,"split":1.0}]
[{"average_heartrate":136.11428571428573,"average_speed":3.09,"distance":108.1,"elapsed_time":"0:00:41","elevation_difference":-0.3,"moving_time":"0:00:35","pace_zone":2.0,"split":1.0}]

[]

This view returns JSON arrays for splits_metric, splits_standard, best_efforts, and segment_efforts. It’d be good if we could explode those array elements into individual rows so that we could query them, which we can do with the jsonb_array_elements function that I wrote about in Materialize: Querying JSON arrays.

We’d generally be querying each of these things separately, so we’ll create one materialized view for each one, as shown in the following diagram:

materialize views

Let’s see how to create a couple of those materialized views. We can do this by running the following queries:

CREATE MATERIALZIED VIEW best_efforts
CREATE MATERIALIZED VIEW best_efforts AS
SELECT id, start_date,
       (effort->>'effort_name') AS effort_name,
       (effort->>'effort_distance')::float AS distance,
       (effort->>'elapsed_time')::float AS elapsed_time,
       (effort->>'moving_time')::float AS moving_time,
       (effort->>'average_heartrate')::float AS average_heartrate,
       (effort->>'max_heartrate')::float AS max_heartrate
FROM  (
  SELECT id, start_date, best_efforts
  FROM activities_detailed
), jsonb_array_elements(best_efforts) AS effort;
CREATE MATERIALIZED VIEW segment_efforts
CREATE MATERIALIZED VIEW segment_efforts AS
SELECT id, start_date,
       (effort->>'segment_id') AS segment_id,
       (effort->>'segment_name') AS segment_name,
       (effort->>'distance')::float AS distance,
       (effort->>'elapsed_time')::float AS elapsed_time,
       (effort->>'moving_time')::float AS moving_time
FROM  (
  SELECT id, start_date, segment_efforts
  FROM activities_detailed
), jsonb_array_elements(segment_efforts) AS effort;

Querying the materialized views

Now that we’ve created those views, it’s time to query them.

We’ll start by finding the fastest 10km runs that I’ve done

select start_date, to_char(to_timestamp(moving_time), 'HH24:MI:SS') AS time
from best_efforts
WHERE effort_name = '10k'
ORDER BY moving_time
LIMIT 10;
Table 2. Results
start_date time

2018-08-18 09:50:40

00:44:03

2018-07-11 04:24:10

00:44:17

2019-06-14 10:14:39

00:44:32

2018-10-20 06:14:46

00:44:38

2018-05-19 08:55:00

00:44:43

2018-10-27 09:48:01

00:44:54

2018-07-20 11:05:28

00:44:57

2018-06-01 04:17:44

00:45:00

2019-05-25 06:45:53

00:45:03

2018-08-08 04:28:36

00:45:06

All my fastest times were in 2018 or 2019! I haven’t been anywhere near that sort of pace for a long while.

If we have the id for a specific run we can write a query to find all the best efforts for that run, as shown below:

select start_date,
       effort_name,
       to_char(to_timestamp(moving_time), 'MI:SS') AS time
from best_efforts
WHERE id=2448908609
ORDER BY moving_time
LIMIT 10;
Table 3. Results
start_date effort_name time

2019-06-14 10:14:39

400m

01:39

2019-06-14 10:14:39

1/2 mile

03:23

2019-06-14 10:14:39

1k

04:15

2019-06-14 10:14:39

1 mile

06:52

2019-06-14 10:14:39

2 mile

13:57

2019-06-14 10:14:39

5k

21:53

2019-06-14 10:14:39

10k

44:32

Now let’s move onto segment efforts. We can find the most popular segments by running the following query:

select segment_id, segment_name, count(*) AS count
from segment_efforts
GROUP BY segment_id, segment_name
ORDER BY count DESC
LIMIT 5;
Table 4. Results
segment_id segment_name count

17236468

Lap of the track

260

17875147

York to Vet

226

17875143

Stanley to Bridge

186

23952234

The misdiseaveing incline

96

23952158

5 gym dash

93

Lap of the track is a running track that I’ve clearly done a lot of laps on. Let’s explore those laps in more detail:

SELECT start_date,
       to_char(to_timestamp(min(moving_time)), 'MI:SS') AS fastestTime,
       to_char(to_timestamp(max(moving_time)), 'MI:SS') AS slowestTime,
       to_char(to_timestamp(avg(moving_time)), 'MI:SS') AS averageTime,
       count(*) AS lapsDone
FROM segment_efforts
WHERE segment_id = '17236468'
GROUP BY start_date
ORDER BY min(moving_time)
LIMIT 5;
Table 5. Results
start_date fastesttime slowesttime averagetime lapsdone

2018-04-02 16:05:31

01:42

02:08

01:55

18

2018-03-26 02:23:47

01:46

02:10

02:07

15

2016-07-22 05:18:37

01:48

02:10

02:01

9

2016-08-03 12:16:23

01:49

02:06

02:01

13

2020-03-11 02:35:09

01:50

02:24

02:11

25

Again my fastest laps were in 2018! I clearly need to work on my speed.

In summary

One of the things that I really like about Materialize is that you can defer data transformation. I’m used to having to decide up front how I want to shape the data and then having the queries I can write be limited by that initial transformation. With Materialize you can instead create a single source and then later on create as many views as you want to analyse the data in all different ways.

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket