5 Jun 2018

Neo4j APOC: Importing data from Strava's paginated JSON API

Over the weekend I’ve been playing around with loading data from the Strava API into Neo4j. I started with the following Python script, which creates a node with a Run label for each of my activities.

If you want to follow along on your own data you’ll need to get an API key via the 'My API Application' section of the website. Once you’ve got that, put it in the TOKEN environment variable. The script also reads your Neo4j password from the NEO4J_PASSWORD environment variable, so set that too and you should be good to go.

import os
from neo4j.v1 import GraphDatabase

password = os.environ["NEO4J_PASSWORD"]

driver = GraphDatabase.driver("bolt://localhost", auth=("neo4j", password))
with driver.session() as session:
    page = 1
    while True:
        result = session.run("""\
        WITH "https://www.strava.com/api/v3/athlete/activities?page=" + $page AS uri
        CALL apoc.load.jsonParams(uri, {Authorization: $stravaToken}, null)
        YIELD value

        MERGE (run:Run {id: value.id})
        SET run.distance = toFloat(value.distance),
            run.startDate = datetime(value.start_date_local),
            run.elapsedTime = duration({seconds: value.elapsed_time})

        RETURN count(*) AS count
        """, {"page": page, "stravaToken": "Bearer {0}".format(os.environ["TOKEN"])})

        runs_imported = result.peek()["count"]
        print("Runs imported:", runs_imported)
        if runs_imported == 0:
            break
        else:
            page += 1

The Strava API is a bit unusual in that it doesn’t return any meta data indicating whether there are more pages to come - you get the data and only the data! We’ll receive an empty array once we reach the end, so we have to check for that condition and exit our loop when it’s met.
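The stop condition can be tried out without touching the real API. The following is a minimal sketch, assuming a hypothetical fake_fetch_page function that stands in for the Strava call and serves 65 made-up activities, 30 per page:

```python
def fetch_all(fetch_page):
    """Collect items page by page until an empty page signals the end."""
    items, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:  # empty array: there are no more pages
            break
        items.extend(batch)
        page += 1
    return items


# Hypothetical stand-in for the API: 65 fake activities, 30 per page.
FAKE_ACTIVITIES = [{"id": i} for i in range(65)]


def fake_fetch_page(page, per_page=30):
    start = (page - 1) * per_page
    return FAKE_ACTIVITIES[start:start + per_page]


all_runs = fetch_all(fake_fetch_page)
print("Runs imported:", len(all_runs))
```

The loop always makes one extra request, the one that comes back empty, because that is the only signal that we’ve reached the end.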

Most of the work is being done by APOC's apoc.load.jsonParams procedure - we’re only handling the pagination stuff in Python. Much as I love a good Python script, I was curious whether I could write the whole import script using just Cypher and get rid of the Python code completely.

Attempt 1: Using an `Import` meta data node

Let’s get started! Before we do anything we’ll create a parameter containing our Strava token:

:params {stravaToken: "Bearer <insert-strava-token>"}

My first solution for handling this pagination is to create a separate meta data node which can keep track of the page we’re up to. We should then be able to increment a page property on that node after every call to the Strava API. We’ll wrap our call to apoc.load.jsonParams inside one of APOC’s periodic commit procedures.

First let’s create our meta data node:

CREATE (:Import {page: 1})

Now we need to work out which of the periodic commit procedures is the best fit. There are several to pick from:

apoc.periodic.commit - runs the given statement in separate transactions until it returns 0
apoc.periodic.rock_n_roll_while - run the action statement in batches over the iterator statement’s results in a separate thread.
apoc.periodic.rock_n_roll - run the action statement in batches over the iterator statement’s results in a separate thread.
apoc.periodic.iterate - run the second statement for each item returned by the first statement.

After playing around with a solution in my head and then sketching out a toy example in the Neo4j browser (as well as doubting that this was even possible), I figured out that apoc.periodic.commit would be best suited.
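To make the "until it returns 0" contract concrete, here is a toy model of it in Python. This is only a sketch of the one-line description above, not APOC’s implementation: periodic_commit and toy_statement are made-up names, and a plain dictionary plays the role of the database.

```python
def periodic_commit(statement, state):
    """Call statement(state) repeatedly until it returns 0.

    Returns the number of times the statement was run.
    """
    runs = 0
    while True:
        runs += 1
        if statement(state) == 0:
            return runs


def toy_statement(state):
    # Handle up to 10 pending items per "transaction" and report how many.
    batch = min(10, state["pending"])
    state["pending"] -= batch
    return batch


state = {"pending": 25}
executions = periodic_commit(toy_statement, state)
print(executions, state["pending"])
```

The point to take away is that the statement itself has to decide when it’s finished, by returning 0, and that any progress it wants to carry over to the next run has to live in the state it reads and writes.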

We can wrap our call to apoc.load.jsonParams in the following call to apoc.periodic.commit:

CALL apoc.periodic.commit("
  MATCH (import:Import)
  WITH 'https://www.strava.com/api/v3/athlete/activities?page=' + import.page AS uri, import.page AS initialPage, import
  CALL apoc.load.jsonParams(uri, {Authorization: $stravaToken}, null)
  YIELD value

  MERGE (run:Run {id: value.id})
  SET run.distance = toFloat(value.distance),
      run.startDate = datetime(value.start_date_local),
      run.elapsedTime = duration({seconds: value.elapsed_time})

  WITH initialPage, import, CASE WHEN count(*) < 30 THEN 0 ELSE count(*) END AS count
  FOREACH(ignoreMe in CASE WHEN count = 0 THEN [] ELSE [1] END |
    MERGE (import)
    SET import.page = initialPage+1
  )
  RETURN count
", {stravaToken: $stravaToken})

This script will read the page property from our Import node and call the Strava API starting at that page via apoc.load.jsonParams. We create a node with the label Run for each activity and then check how many rows were returned. The default number of activities per page is 30, so if we receive fewer than that we know we’ve reached the end of the stream and can return a count of 0. Otherwise we use the FOREACH hack to increment the page property on the Import node. Finally we return the count, and once the value returned is 0 apoc.periodic.commit stops running the statement.
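The same decision logic is easier to poke at in Python. This is a rough sketch rather than a translation of the Cypher: import_step and fake_fetch_page are hypothetical names, a dictionary stands in for the Import node, and another dictionary keyed on id stands in for the MERGE on Run nodes.

```python
PAGE_SIZE = 30  # Strava's default number of activities per page


def import_step(import_node, fetch_page, runs):
    """One iteration: load a page, store its runs, move on if more may follow."""
    activities = fetch_page(import_node["page"])
    for activity in activities:
        runs[activity["id"]] = activity  # plays the role of MERGE on id

    # A short page means we've reached the end, so report 0 to stop.
    count = 0 if len(activities) < PAGE_SIZE else len(activities)
    if count != 0:
        import_node["page"] += 1  # only advance when there may be more
    return count


# Hypothetical stand-in for the API: 65 fake activities.
fake = [{"id": i} for i in range(65)]


def fake_fetch_page(page):
    return fake[(page - 1) * PAGE_SIZE:page * PAGE_SIZE]


import_node, runs = {"page": 1}, {}
while import_step(import_node, fake_fetch_page, runs) != 0:
    pass
print(import_node["page"], len(runs))
```

Note that the runs on the final, short page are still stored before the step reports 0, and that the page number is left pointing at that last page rather than past it.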

This works but it’s a bit annoying to have to create the Import node to store our page number. We also can’t easily reuse the code to pick up any new runs. We’d have to set the page back to 1 and then it would go through everything again which is a bit of a waste.

Attempt 2: The `after` parameter and timestamps

While trying to come up with a cleaner way to paginate I realised that the Strava API allows you to pass in an after parameter. The after parameter indicates a minimum timestamp for the activities that should be returned. We can use this to simplify our Cypher statement!

The Run nodes that we’re creating contain a startDate which we can convert into a timestamp and pass to the API. If we haven’t loaded any runs yet we can use the coalesce function to start from 0.
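As a quick illustration of that fallback, here is a small Python sketch. The after_param and activities_uri helpers and the sample dates are made up for the example; the only things taken from the post are the URL and the idea of defaulting to 0.

```python
from datetime import datetime, timezone

BASE_URI = "https://www.strava.com/api/v3/athlete/activities?after="


def after_param(start_dates):
    """Epoch seconds of the latest start date, or 0 if nothing is loaded yet."""
    if not start_dates:
        return 0  # same role as coalesce(..., 0)
    return int(max(start_dates).timestamp())


def activities_uri(start_dates):
    return BASE_URI + str(after_param(start_dates))


# Made-up start dates standing in for the startDate of existing Run nodes.
loaded = [
    datetime(2018, 6, 1, 7, 30, tzinfo=timezone.utc),
    datetime(2018, 6, 3, 18, 0, tzinfo=timezone.utc),
]

print(activities_uri([]))
print(activities_uri(loaded))
```

With no runs loaded the request starts from the beginning of time; otherwise it asks only for activities newer than the most recent one we already have.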

We now end up with a much simpler script:

CALL apoc.periodic.commit("
  OPTIONAL MATCH (run:Run)
  WITH run ORDER BY run.startDate DESC LIMIT 1
  WITH coalesce(run.startDate.epochSeconds, 0) AS after
  WITH 'https://www.strava.com/api/v3/athlete/activities?after=' + after AS uri
  CALL apoc.load.jsonParams(uri, {Authorization: $stravaToken}, null)
  YIELD value

  MERGE (run:Run {id: value.id})
  SET run.distance = toFloat(value.distance),
      run.startDate = datetime(value.start_date_local),
      run.elapsedTime = duration({seconds: value.elapsed_time})

  RETURN CASE WHEN count(*) < 30 THEN 0 ELSE count(*) END AS count
", {stravaToken: $stravaToken})

We no longer need to create a meta data node, and we can easily execute this to pick up new activities.

Thank you APOC!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.
