Loading and analysing Strava runs using PostgreSQL JSON data type
In my last post I showed how to map Strava runs using data that I'd extracted from their activities API (https://strava.github.io/api/v3/activities/), but the API returns a lot of other data that I discarded because I wasn't sure what I should keep.
The API returns a nested JSON structure, so the easiest solution would be to save each run as an individual file. However, I've always wanted to try out PostgreSQL's JSON data type and this seemed like a good opportunity.
Creating a JSON-ready PostgreSQL table
First up we need to create a database in which we’ll store our Strava data. Let’s name it appropriately:
create database strava;
\connect strava;
Now we can create a table with one field that uses the jsonb data type:
CREATE TABLE runs (
    id INTEGER NOT NULL,
    data jsonb
);

ALTER TABLE runs ADD PRIMARY KEY(id);
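Before we load the real data it's worth sanity checking that the jsonb column and the ->> operator behave as expected. Here's a quick check using a made-up document (the values are entirely hypothetical):

-- insert a dummy document to confirm jsonb storage works
INSERT INTO runs (id, data)
VALUES (1, '{"start_date": "2017-01-01T10:00:00Z", "distance": 5000.0}');

-- ->> extracts a field from the document as text
SELECT data->>'start_date' AS start_date
FROM runs
WHERE id = 1;

-- clean up the dummy row before the real import
DELETE FROM runs WHERE id = 1;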
Easy enough. Now we’re ready to populate the table.
Importing data from the Strava API
We can partially reuse the script from the last post, except rather than saving to a CSV file we'll write to PostgreSQL using the psycopg2 library.
The script relies on a TOKEN environment variable. If you want to try this on your own Strava account you'll need to create an application, which will give you an access token.
extract-runs.py
import requests
import os
import json
import psycopg2

# Personal access token for the Strava API
token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}

with psycopg2.connect("dbname=strava user=markneedham") as conn:
    with conn.cursor() as cur:
        page = 1
        while True:
            # Page through the athlete's activities until we get an empty page
            r = requests.get("https://www.strava.com/api/v3/athlete/activities?page={0}".format(page), headers=headers)
            response = r.json()

            if len(response) == 0:
                break
            else:
                for activity in response:
                    # Fetch the full nested JSON document for each activity
                    r = requests.get("https://www.strava.com/api/v3/activities/{0}?include_all_efforts=true".format(activity["id"]), headers=headers)
                    json_response = r.json()
                    cur.execute("INSERT INTO runs (id, data) VALUES(%s, %s)",
                                (activity["id"], json.dumps(json_response)))
                conn.commit()
            page += 1
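One caveat: if you run the script a second time the INSERT will fail on the duplicate primary key. A sketch of one way to make it idempotent, assuming PostgreSQL 9.5 or later, is to swap the INSERT statement in the script for an upsert-style version:

-- drop-in replacement for the INSERT statement in extract-runs.py
INSERT INTO runs (id, data)
VALUES (%s, %s)
ON CONFLICT (id) DO NOTHING;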
Querying Strava
We can now write some queries against our newly imported data.
My quickest runs
SELECT id, data->>'start_date' as start_date,
       (data->>'average_speed')::float as speed
FROM runs
ORDER BY speed DESC
LIMIT 5;
id | start_date | speed
-----------+----------------------+-------
649253963 | 2016-07-22T05:18:37Z | 3.736
914796614 | 2017-03-26T08:37:56Z | 3.614
653703601 | 2016-07-26T05:25:07Z | 3.606
548540883 | 2016-04-17T18:18:05Z | 3.604
665006485 | 2016-08-05T04:11:21Z | 3.604
(5 rows)
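average_speed is reported in metres per second, which isn't the most intuitive unit for runners. A variant of the query that converts to km/h by multiplying by 3.6 (output not shown) might look like this:

SELECT id, data->>'start_date' as start_date,
       round((data->>'average_speed')::numeric * 3.6, 2) as kph
FROM runs
ORDER BY kph DESC
LIMIT 5;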
My longest runs
SELECT id, data->>'start_date' as start_date,
       (data->>'distance')::float as distance
FROM runs
ORDER BY distance DESC
LIMIT 5;
id | start_date | distance
-----------+----------------------+----------
840246999 | 2017-01-22T10:20:33Z | 10764.1
461124609 | 2016-01-02T08:42:47Z | 10457.9
467634177 | 2016-01-10T18:48:47Z | 10434.5
471467618 | 2016-01-16T12:33:28Z | 10359.3
540811705 | 2016-04-10T07:26:55Z | 9651.6
(5 rows)
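distance is measured in metres, so these are all runs of around 10km. The same cast works inside aggregates too; for example, a quick sketch of the total distance covered across all runs, in kilometres:

SELECT round(sum((data->>'distance')::numeric) / 1000, 1) AS total_km
FROM runs;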
Runs this year
SELECT COUNT(*)
FROM runs
WHERE data->>'start_date' >= '2017-01-01 00:00:00';
count
-------
62
(1 row)
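This works because ISO 8601 timestamps sort in chronological order when compared as strings, but it does depend on every start_date having the same format. A slightly more robust sketch casts the value to a proper timestamp first:

SELECT COUNT(*)
FROM runs
WHERE (data->>'start_date')::timestamptz >= '2017-01-01';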
Runs per year
SELECT EXTRACT(year from to_date(data->>'start_date', 'YYYY-mm-dd')) AS year,
       count(*)
FROM runs
GROUP BY year
ORDER BY year;
year | count
------+-------
2014 | 18
2015 | 139
2016 | 166
2017 | 62
(4 rows)
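Once we're extracting the year we can fold other fields into the same aggregation. For example, a sketch that also computes the average run distance per year, in kilometres (output not shown):

SELECT EXTRACT(year FROM (data->>'start_date')::timestamptz) AS year,
       count(*),
       round(avg((data->>'distance')::numeric) / 1000, 1) AS avg_km
FROM runs
GROUP BY year
ORDER BY year;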
That's all for now. Next I'm going to learn how to query segments, which are stored in a nested array within the JSON document. Stay tuned for that in a future post.