· quickgraph neo4j apoc

QuickGraph #4: UK Official Singles Chart 2019

For our first QuickGraph of the new decade we’re going to explore data from the Official UK Top 40 Chart. This chart ranks the top 100 songs of the week based on official sales of sales of downloads, CD, vinyl, audio streams and video streams. Every week BBC Radio 1 broadcast the top 40 songs, which explains the name of the chart.

official charts header
Figure 1. The Official UK Charts

Scraping the Official Charts

I couldn’t find a dump of the dataset, so we’re going to use our web scraping skills again. Below is a printscreen showing the page that we’re going to scrape.

uk charts raw data
Figure 2. UK Charts Raw Data

If we take a look at the underlying HTML, we’ll see that each chart entry is conveniently represented by a row in a table:

uk charts one row
Figure 3. HTML of one chart entry in the UK Charts

I wrote a couple of Python scripts to download each of the charts and then for each chart entry, extract the position, track, artist, and record label. We generate one JSON document per chart entry, and below is an example of one of these documents:

{
  "start": "01 March 2019",
  "end": "07 March 2019",
  "position": 2,
  "track_name": "Giant",
  "track_uri": "/search/singles/giant/",
  "track_file_name": "search-singles-giant",
  "artist_name": "Calvin Harris & Rag'N'Bone Man",
  "artist_uri": "/artist/54916/calvin-harris-and-ragnbone-man/",
  "label": "Columbia"
}

For many of the tracks the data scraped from the Official Charts website is sufficient, but for some of them I found it difficult to figure out the names of all the collaborating artists. I thought this would be a good opportunity to use the Spotify API to enrich the dataset.

Enriching the data with the Spotify API

We’ll use Spotify’s Search API to retrieve the names of all collaborating artists for a track. Instructions explaining how how to download and enrich the data are described in the uk-charts-neo4j repository.

Exploring the data

Let’s have a quick look at the first few lines of this file using command line tools:

head -n3 import/items.json | jq '.'

If we run that command, we’ll see the following output:

{
  "start": "01 March 2019",
  "end": "07 March 2019",
  "position": 1,
  "track_name": "Someone You Loved",
  "track_uri": "/search/singles/someone-you-loved/",
  "track_file_name": "search-singles-someone-you-loved",
  "artist_name": "Lewis Capaldi",
  "artist_uri": "/artist/51369/lewis-capaldi/",
  "label": "Emi",
  "duration": 182160,
  "artists": [
    {
      "id": "4GNC7GD6oZMSxPGyXy4MNB",
      "name": "Lewis Capaldi"
    }
  ]
}
{
  "start": "01 March 2019",
  "end": "07 March 2019",
  "position": 2,
  "track_name": "Giant",
  "track_uri": "/search/singles/giant/",
  "track_file_name": "search-singles-giant",
  "artist_name": "Calvin Harris & Rag'N'Bone Man",
  "artist_uri": "/artist/54916/calvin-harris-and-ragnbone-man/",
  "label": "Columbia",
  "duration": 229184,
  "artists": [
    {
      "id": "7CajNmpbOovFoOoasH2HaY",
      "name": "Calvin Harris"
    },
    {
      "id": "4f9iBmdUOhQWeP7dcAn1pf",
      "name": "Rag'n'Bone Man"
    }
  ]
}
{
  "start": "01 March 2019",
  "end": "07 March 2019",
  "position": 3,
  "track_name": "Break Up With Your Girlfriend I'M Bored",
  "track_uri": "/search/singles/break-up-with-your-girlfriend-i'm-bored/",
  "track_file_name": "search-singles-break-up-with-your-girlfriend-i'm-bored",
  "artist_name": "Ariana Grande",
  "artist_uri": "/artist/26221/ariana-grande/",
  "label": "Republic Records",
  "duration": 190440,
  "artists": [
    {
      "id": "66CXWjxzNUsdJxJ2JdwvnR",
      "name": "Ariana Grande"
    }
  ]
}

We can also use the APOC Library’s apoc.load.json procedure to explore the data. The following query returns the 1st chart entry:

CALL apoc.load.json("https://github.com/mneedham/uk-charts-neo4j/raw/master/import/items.json")
YIELD value
RETURN value
LIMIT 1

If we run that query we’ll see the following output:

Table 1. Exploring the data
value

{"duration":182160,"track_file_name":"search-singles-someone-you-loved","artist_uri":"/artist/51369/lewis-capaldi/","artist_name":"Lewis Capaldi","artists":[{"name":"Lewis Capaldi","id":"4GNC7GD6oZMSxPGyXy4MNB"}],"start":"01 March 2019","end":"07 March 2019","position":1,"label":"Emi","track_uri":"/search/singles/someone-you-loved/","track_name":"Someone You Loved"}

Next we’re going to use the apoc.load.json procedure to load the data into Neo4j.

Importing the data

We’re going to import the data into the following graph model:

uk chart model
Figure 4. Graph Model for the UK Charts

The following statements set up indexes and constraints for our graph:

CREATE INDEX ON :Chart(start);
CREATE CONSTRAINT ON (l:Label) ASSERT l.name IS UNIQUE;
CREATE CONSTRAINT ON (s:Song) ASSERT s.uri IS UNIQUE;
CREATE CONSTRAINT ON (a:Artist) ASSERT a.id IS UNIQUE;

The following statement imports the charts, songs, artists, and labels into the graph:

CALL apoc.load.json("file://items.json")
YIELD value

MERGE (chart:Chart {
  start: date(datetime({epochmillis: apoc.date.parse(value.start, "ms", "dd MMMM yyyy")})),
  end: date(datetime({epochmillis: apoc.date.parse(value.end, "ms", "dd MMMM yyyy")}))
})
MERGE (label:Label {name: value.label})
MERGE (song:Song {uri: value.track_uri})
SET song.title = value.track_name,
    song.duration = CASE WHEN value.duration is null THEN null
                         ELSE duration({milliseconds:value.duration})
                    END

MERGE (song)-[:LABEL]->(label)
MERGE (song)-[inChart:IN_CHART]->(chart)
SET inChart.position = value.position

FOREACH(artist IN value.artists |
  MERGE (a:Artist {id: artist.id})
  SET a.name = artist.name
  MERGE (song)-[:ARTIST]->(a)
);

We can see a sample of the imported graph in the Neo4j Browser visualisation below:

uk chart preview
Figure 5. Sample of the UK Charts Graph

Querying the graph

Now let’s explore the data using Neo4j’s Cypher query language.

Which song was number 1 for the most weeks?

MATCH (song:Song)-[inChart:IN_CHART {position: 1}]->(chart),
      (song)-[:ARTIST]->(artist)

// order the chart dates so that they show in order when
// we collect them in the RETURN statement
WITH song, collect(artist.name) AS artists, inChart, chart
ORDER BY chart.start

RETURN song.title AS song, artists,
       apoc.date.format(song.duration.milliseconds, 'ms', 'mm:ss') AS duration,
       count(*) AS count, collect(toString(chart.end)) AS weeks
ORDER By count DESC;
Table 2. Which song was number 1 for the most weeks?
song artists duration count weeks

"Dance Monkey"

["Tones and I"]

"03:29"

11

["2019-10-10", "2019-10-17", "2019-10-24", "2019-10-31", "2019-11-07", "2019-11-14", "2019-11-21", "2019-11-28", "2019-12-05", "2019-12-12", "2019-12-19"]

"I Don’T Care"

["Ed Sheeran", "Justin Bieber"]

"03:39"

8

["2019-05-23", "2019-05-30", "2019-06-06", "2019-06-13", "2019-06-20", "2019-06-27", "2019-07-04", "2019-07-11"]

"Someone You Loved"

["Lewis Capaldi"]

"03:02"

7

["2019-03-07", "2019-03-14", "2019-03-21", "2019-03-28", "2019-04-04", "2019-04-11", "2019-04-18"]

"Senorita"

["Shawn Mendes", "Camila Cabello"]

"03:10"

6

["2019-07-18", "2019-08-01", "2019-08-08", "2019-08-15", "2019-08-22", "2019-08-29"]

"Take Me Back To London"

["Ed Sheeran", "Stormzy"]

"03:09"

5

["2019-09-05", "2019-09-12", "2019-09-19", "2019-09-26", "2019-10-03"]

"Sweet But Psycho"

["Ava Max"]

"03:07"

4

["2019-01-03", "2019-01-10", "2019-01-17", "2019-01-24"]

"7 Rings"

["Ariana Grande"]

"02:58"

4

["2019-01-31", "2019-02-07", "2019-02-14", "2019-02-28"]

"Old Town Road"

["Billy Ray Cyrus", "Lil Nas X"]

"02:37"

2

["2019-04-25", "2019-05-02"]

"Vossi Bop"

["Stormzy"]

"03:16"

2

["2019-05-09", "2019-05-16"]

"Break Up With Your Girlfriend I’M Bored"

["Ariana Grande"]

"03:10"

1

["2019-02-21"]

"Beautiful People"

["Khalid", "Ed Sheeran"]

"03:17"

1

["2019-07-25"]

"I Love Sausage Rolls"

["LadBaby"]

"03:23"

1

["2019-12-26"]

The extremely catchy Dance Monkey was number 1 for 11 weeks, or just less than 3 months of the year. We can see Ed Sheeran show up in the artists column a few times as well, so let’s next see which artist was number 1 for the most weeks.

Which artist was number 1 for the most weeks?

MATCH (song:Song)-[inChart:IN_CHART {position: 1}]->(chart),
      (song)-[:ARTIST]->(artist:Artist)
WITH artist, song, count(*) AS weeks
RETURN artist.name,
       apoc.map.fromPairs(collect([song.title, weeks])) AS songs,
       count(*) AS count,
       sum(weeks) AS weeks
ORDER BY weeks DESC;
Table 3. Which artist was number 1 for the most weeks?
artist songs count weeks

"Ed Sheeran"

{Take Me Back To London: 5, I Don’T Care: 8, Beautiful People: 1}

3

14

"Tones and I"

{Dance Monkey: 11}

1

11

"Justin Bieber"

{I Don’T Care: 8}

1

8

"Lewis Capaldi"

{Someone You Loved: 7}

1

7

"Stormzy"

{Take Me Back To London: 5, Vossi Bop: 2}

2

7

"Shawn Mendes"

{Senorita: 6}

1

6

"Camila Cabello"

{Senorita: 6}

1

6

"Ariana Grande"

{7 Rings: 4, Break Up With Your Girlfriend I’M Bored: 1}

2

5

"Ava Max"

{Sweet But Psycho: 4}

1

4

"Billy Ray Cyrus"

{Old Town Road: 2}

1

2

"Lil Nas X"

{Old Town Road: 2}

1

2

"Khalid"

{Beautiful People: 1}

1

1

"LadBaby"

{I Love Sausage Rolls: 1}

1

1

Ed Sheeran had 3 number 1’s last year, giving him a total of 14 weeks at number 1. I was surprised not to see what I thought was his most popular song of the year, South Of The Border. Let’s see what happened there.

How did South Of The Border do on the charts?

MATCH (song:Song {title: "South Of The Border"})
MATCH (song)-[inChart:IN_CHART]->(chart)

WITH song, chart, inChart ORDER BY chart.start

RETURN inChart.position AS position,
       collect(toString(chart.end)) AS charts
ORDER BY position
Table 4. How did South Of The Border do on the charts?
position charts

4

["2019-10-31", "2019-11-14", "2019-11-21"]

5

["2019-11-07"]

7

["2019-10-24", "2019-11-28"]

9

["2019-12-05", "2019-12-12"]

13

["2019-12-19"]

24

["2019-10-17"]

40

["2019-10-10"]

55

["2019-12-26"]

Hmmm, it peaked at number 4 back in October and November, before falling down the chart after that. I wonder if Ed Sheeran released any other songs last year?

How did Ed Sheeran’s other songs do?

MATCH (artist:Artist {name: "Ed Sheeran"})
MATCH (artist)<-[:ARTIST]-(song)-[inChart:IN_CHART]->(chart)

WITH song, artist,
     min(inChart.position) AS bestPosition,
     toString(min(chart.end)) AS chart
OPTIONAL MATCH (artist)<-[:ARTIST]-(song)-[:ARTIST]->(collaborator)

RETURN song.title AS song,
       collect(collaborator.name) AS collaborators,
       bestPosition, chart
ORDER BY bestPosition
Table 5. How did Ed Sheeran’s other songs do?
song collaborators bestPosition chart

"I Don’T Care"

["Justin Bieber"]

1

"2019-05-23"

"Take Me Back To London"

["Stormzy"]

1

"2019-07-25"

"Beautiful People"

["Khalid"]

1

"2019-07-11"

"Own It"

["Burna Boy", "Stormzy"]

2

"2019-12-05"

"South Of The Border"

["Camila Cabello", "Cardi B"]

4

"2019-10-10"

"Cross Me"

["Chance the Rapper", "PnB Rock"]

4

"2019-06-06"

"Perfect"

[]

40

"2019-01-10"

"Shape Of You"

[]

82

"2019-01-10"

The last two on this list are hangovers from 2018. I expect they charted much higher during that year.

We can create a visual representation of this data by tweaking our query slightly:

MATCH (artist:Artist {name: "Ed Sheeran"})
MATCH (artist)<-[:ARTIST]-(song)-[inChart:IN_CHART]->(chart)

WITH song, artist,
     min(inChart.position) AS bestPosition,
     toString(min(chart.end)) AS chart

// Create virtual nodes and relationships to represent
// the best chart position for a song
CALL apoc.create.vNode(["Chart"], {end: chart}) YIELD node AS bestChart
CALL apoc.create.vRelationship(song, "IN_CHART", {position: bestPosition}, bestChart) YIELD rel

OPTIONAL MATCH (artist)<-[:ARTIST]-(song)-[:ARTIST]->(collaborator)
RETURN *

We could probably achieve the same result by tweaking the first part of the query, but I never like to miss an opportunity to use virtual nodes and relationships.

ed sheeran songs
Figure 6. Ed Sheeran songs in the UK Charts of 2019

It’s interesting that he’s had a collaborator on all his other songs in 2019, and all of his songs charted in the top 5. Let’s explore artist collaborations a bit more.

How many collaborators did artists have during the year?

The following query computes the number of collaborators each artist had and then creates a histogram of those counts:

MATCH (artist:Artist)
OPTIONAL MATCH (artist)<-[:ARTIST]-(song)-[:ARTIST]->(otherArtist)
WITH artist, count(otherArtist) AS count
RETURN count, count(*)
ORDER BY count

This data is easiest to interpret as a chart, which we can create using chartgo.com.

collaborations count
Figure 7. Artist Collaborations

Although the majority of artists mostly worked alone or had only one collaborator, there are a reasonable number who had 6 or more collaborators over the year.

In all the queries that we’re written so far, we’ve been looking at the chart position for a song in a single week. We haven’t written any queries that look at how a song’s chart position changes over the weeks. What if we want to do this?

Which song had the biggest climb between weeks?

We’re going to work out which song had the biggest climb between weeks, but first we’ll create relationships between consecutive charts to make it easier to write this query. So for example we’ll create a NEXT relationship between:

  • The Chart nodes with end properties 2019-02-14 and 2019-02-21

  • The Chart nodes with end properties 2019-02-21 and 2019-02-28

And so on.

The following query collects all the Chart nodes in order by date and then creates NEXT relationships between consecutive nodes using the apoc.nodes.link procedure:

MATCH (chart:Chart)
WITH chart ORDER BY chart.start
WITH collect(chart) AS charts
CALL apoc.nodes.link(charts, "NEXT")
RETURN count(*)

The Neo4j Browser visualisation below shows the linked list that we’ve now created:

charts linked list
Figure 8. Linked list of consecutive Chart nodes

We can now write the following query to return the 5 songs that made the biggest climb in the charts in consecutive weeks:

MATCH (song:Song)-[inChart:IN_CHART]->(chart)<-[:NEXT]-(previousChart),
      (previousChart)<-[inChartPrevious:IN_CHART]-(song),
      (song)-[:ARTIST]->(artist)
WITH song, collect(artist.name) AS artists,
     inChartPrevious.position AS firstPosition,
     inChart.position AS secondPosition,
     previousChart, chart
RETURN song.title AS song, artists, firstPosition, secondPosition,
       firstPosition - secondPosition AS change,
       toString(previousChart.end) AS firstChart,
       toString(chart.end) AS secondChart
ORDER BY change DESC
LIMIT 5
Table 6. Which song had the biggest climb between weeks?
song artists firstPosition secondPosition change firstChart secondChart

"Bruises"

["Lewis Capaldi"]

82

11

71

"2019-10-03"

"2019-10-10"

"Playing Games"

["Summer Walker", "Bryson Tiller"]

93

25

68

"2019-10-10"

"2019-10-17"

"Girls Like You"

["Maroon 5", "Cardi B"]

97

34

63

"2019-01-03"

"2019-01-10"

"Lose You To Love Me"

["Selena Gomez"]

65

3

62

"2019-10-31"

"2019-11-07"

"Play"

["Years & Years", "Jax Jones"]

80

19

61

"2019-01-03"

"2019-01-10"

There are some big climbers there, but I think it’ll be more interesting to look at number 1 songs in this context.

Which songs went straight in at number 1?

Let’s start by seeing whether any of our number 1 songs went straight in at number 1. The following query will tell us the answer to this question:

// Find songs that charted at position 1
MATCH (song:Song)-[inChart:IN_CHART {position: 1}]->(chart)

// And didn't have an entry on a chart before this one
WHERE not((chart)<-[:NEXT*]-()<-[:IN_CHART]-(song))

WITH song, chart ORDER BY chart.end
RETURN song.title AS song,
       [(song)-[:ARTIST]->(artist) | artist.name] AS artists,
       collect(toString(chart.end)) AS chart
Table 7. Which songs went straight in at number 1?
song artists chart

"Sweet But Psycho"

["Ava Max"]

["2019-01-03"]

"7 Rings"

["Ariana Grande"]

["2019-01-31"]

"Break Up With Your Girlfriend I’M Bored"

["Ariana Grande"]

["2019-02-21"]

"Vossi Bop"

["Stormzy"]

["2019-05-09"]

"I Don’T Care"

["Ed Sheeran", "Justin Bieber"]

["2019-05-23"]

"I Love Sausage Rolls"

["LadBaby"]

["2019-12-26"]

6 songs went straight in at the top. That means we still have 6 songs that came in at a lower position and then climbed to number 1.

Which songs didn’t go straight to number 1?

The following query will help us figure out what position they charted on their first entry:

// Find the first week that the song got to number 1
MATCH (song:Song)-[inChart:IN_CHART {position: 1}]->(chart)
WITH song, chart ORDER BY song, chart.end
WITH song, collect(chart)[0] AS firstWeekAtNumber1

// Find the earliest week that the song charted
MATCH (firstWeekAtNumber1)<-[:NEXT*]-(previous)<-[previousInChart:IN_CHART]-(song)
WITH song, firstWeekAtNumber1, previous, previousInChart
ORDER BY song, previous.end
WITH song, firstWeekAtNumber1,
     collect(previousInChart.position)[0] AS firstPosition,
     collect(previous)[0] AS firstWeek

RETURN song.title AS song,
       [(song)-[:ARTIST]->(artist) | artist.name] AS artists,
       toString(firstWeekAtNumber1.end) AS firstWeekAtNumber1,
       firstPosition AS firstPosition,
       toString(firstWeek.end) AS firstWeek,
       [path = (firstWeekAtNumber1)<-[:NEXT*]-(firstWeek) | length(path)][0] AS weeksToReachNumber1
Table 8. Which songs didn’t go straight to number 1?
song artists firstWeekAtNumber1 firstPosition firstWeek weeksToReachNumber1

"Someone You Loved"

["Lewis Capaldi"]

"2019-03-07"

66

"2019-01-17"

7

"Dance Monkey"

["Tones and I"]

"2019-10-10"

82

"2019-08-15"

8

"Take Me Back To London"

["Ed Sheeran", "Stormzy"]

"2019-09-05"

3

"2019-07-25"

6

"Senorita"

["Shawn Mendes", "Camila Cabello"]

"2019-07-18"

2

"2019-07-04"

2

"Beautiful People"

["Khalid", "Ed Sheeran"]

"2019-07-25"

3

"2019-07-11"

2

"Old Town Road"

["Billy Ray Cyrus", "Lil Nas X"]

"2019-04-25"

67

"2019-04-04"

3

Interestingly the longest running number 1, Dance Monkey, started at the lowest position. And not only that, it took almost 2 months for it to get up to the number 1 spot. So if you start with a low chart position, all is not lost!

What’s interesting about this QuickGraph?

This post has gone on for much longer than I intended, but I’ll conclude with some thoughts on what makes this QuickGraph interesting.

There’s a lot of analysis of the charts in the UK, but it’s often done along one dimension, be that a song or an artist. By modelling the data as a graph we can easily query the data along multiple dimensions. I especially enjoyed writing the queries that looked at how songs did across multiple charts.

We could certainly extend our analysis further by adding in Wiki data, as Jesús Barrasa did in his QuickGraph #10, but that can wait for another blog post!

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket