Neo4j: Querying the Strava Graph using Py2neo
Last week Nigel released v4 of Py2neo, and since I was just getting ready to write some queries against my Strava activity graph, I thought I’d give it a try.
If you want to learn how to create your own Strava graph, you should read my previous post, but just to recap, this is the graph model that we created:
Let’s get to it!
tl;dr
The code in this post is available as a Jupyter notebook, so if you want the code and nothing but the code, head over there!
Importing libraries
Importing and creating a Graph object in Py2neo is as simple as executing the following lines of code:
from py2neo import Graph
graph = Graph("bolt://localhost:7687", auth=("neo4j", "neo"))
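If you want to check that the connection is working before going any further, a quick sanity check along these lines should do the trick (this assumes the graph from the previous post has already been imported):
# Sanity check: count the Run nodes we imported in the previous post.
# Cursor.evaluate() returns the first value of the first record.
run_count = graph.run("MATCH (r:Run) RETURN count(r)").evaluate()
print("Number of runs in the graph:", run_count)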
Find the most recent activities
We’ll start with a simple query where we find the most recent activities that I’ve done.
recent_runs_df = graph.run("""\
MATCH (r:Run)
WITH r { .id, .startDate, .name, .movingTime, .distance,
pace: duration({seconds: r.movingTime.seconds / r.distance * 1000}) }
RETURN r.name,
apoc.date.format(r.startDate.epochSeconds, 's', 'MMM d yyyy') AS dateOfRun,
r.distance,
apoc.date.format(r.movingTime.milliseconds, 'ms', 'HH:mm:ss') AS time,
apoc.date.format(r.pace.milliseconds, "ms", "mm:ss") AS pacePerKm,
r.startDate AS startDate
""").to_data_frame()
recent_runs_df \
.sort_values(by = ["startDate"], ascending=False) \
.drop(["startDate"], axis=1) \
.head(10)
The to_data_frame function converts the output of the Cypher query into a Pandas DataFrame.
There are other formats that we can use instead, but I want to post-process the results using Pandas, so this format works best for me.
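For reference, here are a couple of the other formats that py2neo v4 cursors can return, if I remember the API correctly:
# A couple of the other output formats that py2neo v4 cursors support.
rows = graph.run("MATCH (r:Run) RETURN r.name AS name, r.distance AS distance LIMIT 5").data()
print(rows[0])   # a plain dictionary, e.g. {'name': 'Morning Run', 'distance': 10780.2}

# to_table() returns a py2neo Table, which renders nicely in a Jupyter notebook
graph.run("MATCH (r:Run) RETURN r.name AS name, r.distance AS distance LIMIT 5").to_table()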
If we run the recent activities query we’ll get the following output:
| | dateOfRun | pacePerKm | r.distance | r.name | time |
|---|---|---|---|---|---|
| 605 | Jun 13 2018 | 04:33 | 10780.2 | Morning Run | 00:49:11 |
| 604 | Jun 11 2018 | 04:34 | 10035.8 | Morning Run | 00:45:53 |
| 603 | Jun 9 2018 | 04:23 | 11043.2 | Morning Run | 00:48:29 |
| 602 | Jun 8 2018 | 04:36 | 11281.2 | Morning Run | 00:51:57 |
| 586 | Jun 6 2018 | 04:28 | 10884.9 | Morning Run | 00:48:38 |
| 585 | Jun 4 2018 | 04:24 | 10030.5 | Morning Run | 00:44:14 |
| 584 | Jun 2 2018 | 04:25 | 13039.8 | Morning Run | 00:57:41 |
| 601 | Jun 1 2018 | 04:29 | 10701.7 | Morning Run | 00:48:02 |
| 600 | May 30 2018 | 04:41 | 9828.4 | Morning Run | 00:46:06 |
| 599 | May 28 2018 | 04:34 | 10193.0 | Morning Run | 00:46:42 |
We could also do the sorting of the data directly in Cypher, but I wanted to have all the activities available in the DataFrame so that we could play around with it directly in Pandas without having to re-run the query each time.
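For completeness, if we did want the database to do the work, a version of the query with the sorting and limiting pushed down into Cypher might look like this:
# The same query, but with ORDER BY and LIMIT handled by Cypher so that only
# the 10 most recent runs are returned to the client.
graph.run("""\
MATCH (r:Run)
RETURN r.name,
       apoc.date.format(r.startDate.epochSeconds, 's', 'MMM d yyyy') AS dateOfRun,
       r.distance,
       apoc.date.format(r.movingTime.milliseconds, 'ms', 'HH:mm:ss') AS time
ORDER BY r.startDate DESC
LIMIT 10
""").to_data_frame()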
Find the longest activities
For example, we could find the activities ordered by distance:
recent_runs_df \
.sort_values(by = ["r.distance"], ascending=False) \
.drop(["startDate"], axis=1) \
.head(10)
If we run that code we’ll get the following output:
| | dateOfRun | pacePerKm | r.distance | r.name | time |
|---|---|---|---|---|---|
| 594 | May 19 2018 | 04:30 | 13635.2 | Morning Run | 01:01:27 |
| 584 | Jun 2 2018 | 04:25 | 13039.8 | Morning Run | 00:57:41 |
| 590 | May 12 2018 | 04:40 | 12963.9 | Morning Run | 01:00:33 |
| 597 | May 25 2018 | 04:37 | 11690.0 | Morning Run | 00:54:02 |
| 572 | May 5 2018 | 04:31 | 11674.4 | Morning Run | 00:52:48 |
| 531 | Mar 17 2018 | 04:37 | 11299.0 | Afternoon Run | 00:52:19 |
| 602 | Jun 8 2018 | 04:36 | 11281.2 | Morning Run | 00:51:57 |
| 548 | Jan 27 2018 | 04:37 | 11064.7 | Afternoon Run | 00:51:05 |
| 603 | Jun 9 2018 | 04:23 | 11043.2 | Morning Run | 00:48:29 |
| 586 | Jun 6 2018 | 04:28 | 10884.9 | Morning Run | 00:48:38 |
We can also visualise the distances over time using matplotlib:
%matplotlib inline
recent_runs_df.plot(x="startDate", y="r.distance")
Estimated runs
In our graph we also have estimated attempts at different distances.
When you submit an activity, Strava generates predicted best times for a range of distances shorter than the total distance of that activity. For example, if you submit an activity of 6,000 metres, Strava will generate estimated times for 5k, 2 miles, 1 mile, 1k, and 400m.
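Those estimates hang off Distance nodes in the graph, so if you want to see which distances you have estimated efforts for, a query along these lines (using the same name and distance properties that appear in the estimated effort query below) will list them:
# List the Distance nodes that estimated efforts are attached to, e.g. '5k' and '10k'.
# Assumes the 'name' and 'distance' properties used in the estimated effort query below.
graph.run("""\
MATCH (d:Distance)
RETURN d.name AS name, d.distance AS metres
ORDER BY d.distance
""").to_data_frame()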
We can write the following query to find my best 10k attempts:
estimated_effort_query = """\
MATCH (distance:Distance {name: {distance}})<-[:DISTANCE]-(effort),
(effort)<-[:DISTANCE_EFFORT]-(run)
WITH run { .id, .startDate, .distance, pace: duration({
seconds: run.elapsedTime.seconds / run.distance * 1000})},
effort { .elapsedTime, pace: duration({
seconds: effort.elapsedTime.seconds / distance.distance * 1000 })}
RETURN apoc.date.format(run.startDate.epochSeconds, 's', 'MMM d yyyy') AS dateOfRun,
apoc.date.format(effort.elapsedTime.milliseconds, 'ms', 'mm:ss') AS time,
apoc.date.format(effort.pace.milliseconds, "ms", "mm:ss") AS pacePerKm,
apoc.math.round(toFloat(effort.pace.seconds) / 60, 2) AS pacePerKmFloat,
apoc.date.format(run.pace.milliseconds, "ms", "mm:ss") AS overallPacePerKm,
run.distance AS totalDistance,
run.startDate AS startDate
ORDER BY effort.elapsedTime
LIMIT {limit}
"""
df_10k = graph \
.run(estimated_effort_query, {"distance": "10k", "limit": 1000}) \
.to_data_frame()
df_10k \
.drop(["startDate", "pacePerKmFloat"], axis=1) \
.head(10)
| | dateOfRun | overallPacePerKm | pacePerKm | time | totalDistance |
|---|---|---|---|---|---|
| 0 | Jun 9 2018 | 04:24 | 04:24 | 44:05 | 11043.2 |
| 1 | Jun 2 2018 | 04:25 | 04:25 | 44:12 | 13039.8 |
| 2 | Jun 4 2018 | 04:27 | 04:27 | 44:35 | 10030.5 |
| 3 | Jun 6 2018 | 04:28 | 04:28 | 44:49 | 10884.9 |
| 4 | Jun 1 2018 | 04:29 | 04:29 | 44:59 | 10701.7 |
| 5 | May 19 2018 | 04:31 | 04:30 | 45:02 | 13635.2 |
| 6 | May 5 2018 | 04:33 | 04:32 | 45:26 | 11674.4 |
| 7 | May 21 2018 | 04:33 | 04:34 | 45:41 | 10147.0 |
| 8 | Jun 13 2018 | 04:34 | 04:34 | 45:48 | 10780.2 |
| 9 | May 16 2018 | 04:34 | 04:34 | 45:49 | 10148.8 |
We can also create a matplotlib plot to see my 10k runs over time:
%matplotlib inline
df_10k.sort_values(by=["startDate"]).plot(x="startDate", y="pacePerKmFloat")
Combining estimated runs
What if we want to combine my efforts for multiple distances in a single row?
The pandas merge function comes in handy here.
For example, we could write the following code to create a DataFrame that combines my 10k and 5k runs:
import pandas as pd
cols = ["run.id", "dateOfRun_x", "time_x", "pacePerKm_x", "time_y", "pacePerKm_y"]
df_5k = graph \
.run(estimated_effort_query, {"distance": "5k", "limit": 1000}) \
.to_data_frame()
df_10k = graph \
.run(estimated_effort_query, {"distance": "10k", "limit": 1000}) \
.to_data_frame()
pd.merge(df_5k, df_10k, on="run.id")[cols].head(10)
| | run.id | dateOfRun_x | time_x | pacePerKm_x | time_y | pacePerKm_y |
|---|---|---|---|---|---|---|
| 0 | 1611168962 | Jun 2 2018 | 21:24 | 04:16 | 44:12 | 04:25 |
| 1 | 1626636004 | Jun 9 2018 | 21:34 | 04:18 | 44:05 | 04:24 |
| 2 | 1620188065 | Jun 6 2018 | 22:00 | 04:24 | 44:49 | 04:28 |
| 3 | 1616050069 | Jun 4 2018 | 22:01 | 04:24 | 44:35 | 04:27 |
| 4 | 1581401227 | May 19 2018 | 22:15 | 04:27 | 45:02 | 04:30 |
| 5 | 1609355524 | Jun 1 2018 | 22:16 | 04:27 | 44:59 | 04:29 |
| 6 | 1635264102 | Jun 13 2018 | 22:20 | 04:28 | 45:48 | 04:34 |
| 7 | 1585759604 | May 21 2018 | 22:27 | 04:29 | 45:41 | 04:34 |
| 8 | 1566256778 | May 12 2018 | 22:30 | 04:30 | 46:11 | 04:37 |
| 9 | 1624450599 | Jun 8 2018 | 22:30 | 04:30 | 46:11 | 04:37 |
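The _x and _y suffixes that merge generates aren’t very descriptive, so one small improvement would be to pass the suffixes argument so that the column names tell us which distance they belong to:
# Pass explicit suffixes so the merged columns say which distance they refer to.
combined_df = pd.merge(df_5k, df_10k, on="run.id", suffixes=("_5k", "_10k"))
combined_df[["run.id", "dateOfRun_5k",
             "time_5k", "pacePerKm_5k",
             "time_10k", "pacePerKm_10k"]].head(10)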
That’s all for now. Hopefully that’s given you some ideas of how you can use the new version of Py2neo to query your graph data.
I only covered some examples in this blog post. If you want to see the full set, you can look at the accompanying Jupyter notebook.