Materialize: Unable to automatically determine a timestamp for your query; this can happen if your query depends on non-materialized sources
This is another post describing my exploration of Materialize, a SQL streaming database. In this post I’m going to explain a confusing (to me at least) error message that you might come across when you’re getting started.
As I mentioned in my first post about Materialize, the general idea is that you create a source around a data resource and then a view on top of that. Those views can either be materialized or non-materialized.
-
A materialized view embeds a query like a traditional SQL view, but—unlike a SQL view—compute and incrementally update the results of the embedded query.
-
A non-materialized view stores a verbatim query, and provides a shorthand for performing the query.
Before we decide what query the view is going to embed, we need to create a source. In this post, we’ll use the following file that contains one JSON document:
{"id": "1341755954614861826", "conversation_id": "1341755954614861826", "created_at": "2020-12-23 14:42:02 GMT", "date": "2020-12-23", "time": "14:42:02", "timezone": "+0000", "user_id": 856240505826496513, "username": "suriyasubraman", "name": "Suriya Subramanian", "place": "", "tweet": "Impact of COVID-19 On Internet of Things (IoT) Networks Market 2020 Industry Challenges ... https://t.co/ndGN2xRKzv #iot #data #internetofthings", "language": "en", "mentions": [], "urls": ["http://dlvr.it/RpCyyv"], "photos": [], "replies_count": 0, "retweets_count": 0, "likes_count": 0, "hashtags": ["iot", "data", "internetofthings"], "cashtags": [], "link": "https://twitter.com/SuriyaSubraman/status/1341755954614861826", "retweet": false, "quote_url": "", "video": 0, "thumbnail": "", "near": "London", "geo": "", "source": "", "user_rt_id": "", "user_rt": "", "retweet_id": "", "reply_to": [], "retweet_date": "", "translate": "", "trans_src": "", "trans_dest": ""}
Note
|
For brevity’s sake, I’m not including instructions for how to install or connect to Materialize in this post. If you want to learn how to do that, see my post about querying JSON files with Materialize. |
We can create a source for this file by running the following query:
CREATE SOURCE covid_sample_source
FROM FILE '/data/covid_sample_small.json'
WITH(tail=true)
FORMAT TEXT;
Next, we’re going to create a materialized view on top of this source so that we can query the data. At the moment we don’t know which fields to include in that view, so let’s see what we’ve got to work with, by running the following query:
SELECT *
FROM covid_sample_source;
ERROR: Unable to automatically determine a timestamp for your query; this can happen if your query depends on non-materialized sources |
Hmmm, that’s not what I expected to happen! I couldn’t find any documentation explaining what to do about this error, so a trip to the Materialize User Slack was needed!
In a message on there, Frank McSherry pointed out that Materialize needs to determine a time for the query since the source represents a changing source of data.
If we wrap the query in a materialized view this problem is taken care for us, otherwise we need to specify a timestamp using the AS OF
syntax.
Let’s give that a try:
SELECT *
FROM covid_sample_source
AS OF 1;
text | mz_line_no |
---|---|
{"id": "1341755954614861826", "conversation_id": "1341755954614861826", "created_at": "2020-12-23 14:42:02 GMT", "date": "2020-12-23", "time": "14:42:02", "timezone": "+0000", "user_id": "856240505826496513", "username": "suriyasubraman", "name": "Suriya Subramanian", "place": "", "tweet": "Impact of COVID-19 On Internet of Things (IoT) Networks Market 2020 Industry Challenges … https://t.co/ndGN2xRKzv #iot #data #internetofthings", "language": "en", "mentions": [], "urls": ["http://dlvr.it/RpCyyv"], "photos": [], "replies_count": 0, "retweets_count": 0, "likes_count": 0, "hashtags": ["iot", "data", "internetofthings"], "cashtags": [], "link": "https://twitter.com/SuriyaSubraman/status/1341755954614861826", "retweet": false, "quote_url": "", "video": 0, "thumbnail": "", "near": "London", "geo": "", "source": "", "user_rt_id": "", "user_rt": "", "retweet_id": "", "reply_to": [], "retweet_date": "", "translate": "", "trans_src": "", "trans_dest": ""} |
1 |
That’s more like it! Now we can update our query to extract just the fields that we want using the JSON operators:
SELECT val->>'id' AS id,
val->>'tweet' AS tweet,
val->'hashtags' AS hashtags,
val->>'date' AS date
FROM (SELECT text::jsonb AS val FROM covid_sample_source)
AS OF 1;
id | tweet | hashtags | date |
---|---|---|---|
1341755954614861826 |
Impact of COVID-19 On Internet of Things (IoT) Networks Market 2020 Industry Challenges … https://t.co/ndGN2xRKzv #iot #data #internetofthings |
["iot","data","internetofthings"] |
2020-12-23 |
Once we’re happy with our query, we can create a materialized view around the query, making sure to remove the AS OF
bit:
CREATE MATERIALIZED VIEW covid_sample AS
SELECT val->>'id' AS id,
val->>'tweet' AS tweet,
val->'hashtags' AS hashtags,
val->>'date' AS date
FROM (SELECT text::jsonb val FROM covid_sample_source);
We can now query the materialized view like so:
SELECT *
FROM covid_sample;
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.