Materialize: Querying JSON arrays
In a blog post I wrote a couple of weeks ago, we learned how to analyse JSON files using the Materialize SQL streaming database.
In this post we’re going to build on that knowledge to analyse a JSON file of tweets that contain arrays of hashtags. It took me a while to figure out to do this, so I wanted to share what I learnt along the way.
The JSON file that we’re going to analyse looks like this and we’ll save that file in a data
directory locally.
{"id": "1341755954614861826", "conversation_id": "1341755954614861826", "created_at": "2020-12-23 14:42:02 GMT", "date": "2020-12-23", "time": "14:42:02", "timezone": "+0000", "user_id": 856240505826496513, "username": "suriyasubraman", "name": "Suriya Subramanian", "place": "", "tweet": "Impact of COVID-19 On Internet of Things (IoT) Networks Market 2020 Industry Challenges ... https://t.co/ndGN2xRKzv #iot #data #internetofthings", "language": "en", "mentions": [], "urls": ["http://dlvr.it/RpCyyv"], "photos": [], "replies_count": 0, "retweets_count": 0, "likes_count": 0, "hashtags": ["iot", "data", "internetofthings"], "cashtags": [], "link": "https://twitter.com/SuriyaSubraman/status/1341755954614861826", "retweet": false, "quote_url": "", "video": 0, "thumbnail": "", "near": "London", "geo": "", "source": "", "user_rt_id": "", "user_rt": "", "retweet_id": "", "reply_to": [], "retweet_date": "", "translate": "", "trans_src": "", "trans_dest": ""}
{"id": "1341755827158441984", "conversation_id": "1341755827158441984", "created_at": "2020-12-23 14:41:32 GMT", "date": "2020-12-23", "time": "14:41:32", "timezone": "+0000", "user_id": 26450334, "username": "nickkeca", "name": "Nick Keca", "place": "", "tweet": "Non-compliance is the only thing ordinary people have to fight against the agenda hiding behind this #covid cloak. BUT, we are fighting against financial interests that are so powerful that only strength of numbers & people power can prevail", "language": "en", "mentions": [], "urls": [], "photos": [], "replies_count": 0, "retweets_count": 0, "likes_count": 0, "hashtags": ["covid"], "cashtags": [], "link": "https://twitter.com/nickkeca/status/1341755827158441984", "retweet": false, "quote_url": "", "video": 0, "thumbnail": "", "near": "London", "geo": "", "source": "", "user_rt_id": "", "user_rt": "", "retweet_id": "", "reply_to": [], "retweet_date": "", "translate": "", "trans_src": "", "trans_dest": ""}
{"id": "1341755981420703744", "conversation_id": "1341755981420703744", "created_at": "2020-12-23 14:42:09 GMT", "date": "2020-12-23", "time": "14:42:09", "timezone": "+0000", "user_id": 147562101, "username": "jasonjamesstone", "name": "Jason Stone 🔶 🇪🇺🇬🇧🏴🇧🇷", "place": "", "tweet": "We can't afford to have a damaging #Brexit and an out of control Coronavirus pandemic at the same time. Sign the petition: Extend the #BrexitTransition Period until the virus is under control https://t.co/yyI3miEJLg", "language": "en", "mentions": [], "urls": ["https://www.londonlibdems.org.uk/extendbrexit?e=13bee6003ca0b15761a0a8e71e926169&utm_source=ldlondon&utm_medium=email&utm_campaign=brexit_extension&n=3"], "photos": [], "replies_count": 0, "retweets_count": 0, "likes_count": 0, "hashtags": ["brexit", "brexittransition"], "cashtags": [], "link": "https://twitter.com/jasonjamesstone/status/1341755981420703744", "retweet": false, "quote_url": "", "video": 0, "thumbnail": "", "near": "London", "geo": "", "source": "", "user_rt_id": "", "user_rt": "", "retweet_id": "", "reply_to": [], "retweet_date": "", "translate": "", "trans_src": "", "trans_dest": ""}
{"id": "1341754919896948736", "conversation_id": "1341754919896948736", "created_at": "2020-12-23 14:37:56 GMT", "date": "2020-12-23", "time": "14:37:56", "timezone": "+0000", "user_id": 14508711, "username": "beecee", "name": "Brigid Coady 🌈", "place": {"type": "Point", "coordinates": [51.49594393, -0.13355317]}, "tweet": "Finish work. Give blood. My Xmas present to the world! #morningcommute #xmas #giveblood #plasma #platelets #covid19 #coronavirus #tier4 #lockdown #London @ Westminster https://t.co/Tq42OaN811", "language": "en", "mentions": [], "urls": ["https://www.instagram.com/p/CJJNIidgG4M/?igshid=1hu20v1xohrce"], "photos": [], "replies_count": 0, "retweets_count": 0, "likes_count": 0, "hashtags": ["morningcommute", "xmas", "giveblood", "plasma", "platelets", "covid19", "coronavirus", "tier4", "lockdown", "london"], "cashtags": [], "link": "https://twitter.com/beecee/status/1341754919896948736", "retweet": false, "quote_url": "", "video": 0, "thumbnail": "", "near": "London", "geo": "", "source": "", "user_rt_id": "", "user_rt": "", "retweet_id": "", "reply_to": [], "retweet_date": "", "translate": "", "trans_src": "", "trans_dest": ""}
{"id": "1341753938333237253", "conversation_id": "1341753938333237253", "created_at": "2020-12-23 14:34:02 GMT", "date": "2020-12-23", "time": "14:34:02", "timezone": "+0000", "user_id": 820537716798619648, "username": "amprouk", "name": "AMPro", "place": "", "tweet": "Motability provides #coronavirus update for customers https://t.co/dRX6xXEtOM", "language": "en", "mentions": [], "urls": ["http://dlvr.it/RpCxYZ"], "photos": [], "replies_count": 0, "retweets_count": 0, "likes_count": 0, "hashtags": ["coronavirus"], "cashtags": [], "link": "https://twitter.com/amprouk/status/1341753938333237253", "retweet": false, "quote_url": "", "video": 0, "thumbnail": "", "near": "London", "geo": "", "source": "", "user_rt_id": "", "user_rt": "", "retweet_id": "", "reply_to": [], "retweet_date": "", "translate": "", "trans_src": "", "trans_dest": ""}
Connecting to Materialize
We’re going to query that data using Materialize, which we’ll set up using the following Docker Compose configuration:
version: '3'
services:
materialize:
image: materialize/materialized:v0.5.3
container_name: "materialize-sandbox-docker-compose"
volumes:
- ./data:/data
ports:
- "6875:6875"
The data
directory containing the covid_sample.json
file is in the mneedham/materialize-sandbox/twitter GitHub repository.
The repository also contains setup instructions.
Once we’ve cloned that repository, we can launch Materialize by running the following command:
docker-compose up
We’re now ready to connect to Materialize, which we can do using the psql
CLI tool:
psql -h localhost -p 6875 materialize
psql (12.5 (Ubuntu 12.5-0ubuntu0.20.04.1), server 9.5.0)
Type "help" for help.
materialize=>
Creating a materialized view
Now we’re to create a source around the file and then a materialized view on top of that source.
CREATE SOURCE covid_sample_source
FROM FILE '/data/covid_sample.json'
WITH(tail=true)
FORMAT TEXT;
CREATE MATERIALIZED VIEW covid_sample AS
SELECT (val->>'id')::float::bigint AS id,
(val->>'username')::text AS username,
(val->>'hashtags')::text AS hashtags,
(val->>'tweet')::text AS tweet
FROM (SELECT text::jsonb AS val FROM covid_sample_source);
We can describe the view by running the following query:
SHOW COLUMNS
FROM covid_sample;
name | nullable | type |
---|---|---|
hashtags |
t |
text |
id |
t |
int8 |
tweet |
t |
text |
username |
t |
text |
The interesting thing here is that the hashtags
column is storing data in the text
type.
Querying a materialized view
Now it’s time to write some queries.
Let’s start with a query that returns all columns and rows in the covid_sample
view:
SElECT *
FROM covid_sample;
id | username | hashtags | tweet |
---|---|---|---|
1341753938333237248 |
amprouk |
["coronavirus"] |
Motability provides #coronavirus update for customers https://t.co/dRX6xXEtOM |
1341755954614861824 |
suriyasubraman |
["iot","data","internetofthings"] |
Impact of COVID-19 On Internet of Things (IoT) Networks Market 2020 Industry Challenges … https://t.co/ndGN2xRKzv #iot #data #internetofthings |
1341755981420703744 |
jasonjamesstone |
["brexit","brexittransition"] |
We can’t afford to have a damaging #Brexit and an out of control Coronavirus pandemic at the same time. Sign the petition: Extend the #BrexitTransition Period until the virus is under control https://t.co/yyI3miEJLg |
1341755827158441984 |
nickkeca |
["covid"] |
Non-compliance is the only thing ordinary people have to fight against the agenda hiding behind this #covid cloak. BUT, we are fighting against financial interests that are so powerful that only strength of numbers & people power can prevail |
1341754919896948736 |
beecee |
["morningcommute","xmas","giveblood","plasma","platelets","covid19","coronavirus","tier4","lockdown","london"] |
Finish work. Give blood. My Xmas present to the world! #morningcommute #xmas #giveblood #plasma #platelets #covid19 #coronavirus #tier4 #lockdown #London @ Westminster https://t.co/Tq42OaN811 |
I wanted to write a query that shows how many tweets each hashtag appears in, so we’ll need to 'explode' the values in hashtags
column into rows, which we can do using the jsonb_array_elements
function.
My first attempt was the following:
SELECT jsonb_array_elements(hashtags) AS ht
FROM covid_sample
LIMIT 5;
ERROR: table function (jsonb_array_elements) in scalar position not yet supported, see https://github.com/MaterializeInc/materialize/issues/1546 for more details
This error indicates that we need to use the json_array_elements
function in the FROM
part of the query.
Let’s try that:
SELECT ht
FROM covid_sample,
jsonb_array_elements(hashtags) AS ht
LIMIT 5;
ERROR: Cannot call function jsonb_array_elements(string): arguments cannot be implicitly cast to any implementation's parameters; try providing explicit casts
Hmmm, still no good.
The problem this time is that the jsonb_array_elements
function excepts to receive values of type jsonb
and the hashtags
column returns values of type text
.
We can fix that by casting to the correct type, like this:
SELECT ht
FROM covid_sample,
jsonb_array_elements(hashtags :: jsonb) AS ht
LIMIT 5;
ht |
---|
"iot" |
"data" |
"xmas" |
"covid" |
"tier4" |
Success! Now we can write a query that shows how many times each hashtag has been used:
SELECT ht, count(*) AS count
FROM covid_sample,
jsonb_array_elements(hashtags :: jsonb) AS ht
GROUP BY ht
ORDER BY count DESC
LIMIT 5;
ht | count |
---|---|
"coronavirus" |
2 |
"iot" |
1 |
"data" |
1 |
"xmas" |
1 |
"covid" |
1 |
We could even go further than this and create a view that returns hashtags and their counts:
CREATE MATERIALIZED VIEW covid_hashtags AS
SELECT ht, count(*) AS count
FROM covid_sample,
jsonb_array_elements(hashtags :: jsonb) AS ht
GROUP BY ht
ORDER BY count DESC;
Which we can query like this:
SELECT *
FROM covid_hashtags
LIMIT 5;
ht | count |
---|---|
"iot" |
1 |
"data" |
1 |
"xmas" |
1 |
"covid" |
1 |
"tier4" |
1 |
Interestingly the ORDER BY
doesn’t seem to be reflected in the results.
I learnt from Frank McSherry that it’s actually a feature of SQL that the ORDER BY
clause isn’t inherited from the view.
Now let’s add a couple of extra tweets to the end of covid_sample.json
:
{"id": "1341759529520926722", "conversation_id": "1341759529520926722", "created_at": "2020-12-23 14:56:15 GMT", "date": "2020-12-23", "time": "14:56:15", "timezone": "+0000", "user_id": 368587842, "username": "piterk68", "name": "Peter Lewis", "place": "", "tweet": "A quick thread reflecting on two themes of my year : #COVID & #Inequality particularly race inequality, with a few thank yous at the end:", "language": "en", "mentions": [], "urls": [], "photos": [], "replies_count": 1, "retweets_count": 0, "likes_count": 1, "hashtags": ["covid", "inequality"], "cashtags": [], "link": "https://twitter.com/piterk68/status/1341759529520926722", "retweet": false, "quote_url": "", "video": 0, "thumbnail": "", "near": "London", "geo": "", "source": "", "user_rt_id": "", "user_rt": "", "retweet_id": "", "reply_to": [], "retweet_date": "", "translate": "", "trans_src": "", "trans_dest": ""}
{"id": "1341759301908631554", "conversation_id": "1341759301908631554", "created_at": "2020-12-23 14:55:20 GMT", "date": "2020-12-23", "time": "14:55:20", "timezone": "+0000", "user_id": 57581622, "username": "lamuscle", "name": "LA Muscle", "place": "", "tweet": "Fitter & Stronger With Minimal Equipment Try This Excellent Pump Workout Full article here: https://t.co/HY3e9wfX0P #lamuscle #bodyweight #coronavirus #covid19 #isolation #training #fitness #health #muscle #exercise #workout #homeworkout #outdoors #healthy #dumbbells #lean #diet https://t.co/91wU7DcZRA", "language": "en", "mentions": [], "urls": ["https://www.lamuscle.com/knowledge/lose-fat/fitter-stronger-pump-workout"], "photos": ["https://pbs.twimg.com/media/Ep7i9fvXYAIb5Ay.jpg"], "replies_count": 0, "retweets_count": 0, "likes_count": 1, "hashtags": ["lamuscle", "bodyweight", "coronavirus", "covid19", "isolation", "training", "fitness", "health", "muscle", "exercise", "workout", "homeworkout", "outdoors", "healthy", "dumbbells", "lean", "diet"], "cashtags": [], "link": "https://twitter.com/LAMuscle/status/1341759301908631554", "retweet": false, "quote_url": "", "video": 1, "thumbnail": "https://pbs.twimg.com/media/Ep7i9fvXYAIb5Ay.jpg", "near": "London", "geo": "", "source": "", "user_rt_id": "", "user_rt": "", "retweet_id": "", "reply_to": [], "retweet_date": "", "translate": "", "trans_src": "", "trans_dest": ""}
And if we run the query against covid_hashtags
again:
SELECT *
FROM covid_hashtags
ORDER BY count DESC
LIMIT 5;
The hashtags from those tweets will be reflected in the results returned:
ht | count |
---|---|
"coronavirus" |
3 |
"covid" |
2 |
"covid19" |
2 |
"iot" |
1 |
"data" |
1 |
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.