· neo4j

Neo4j vs Relational: Refactoring - Extracting node/table

In my previous blog post I showed how to add a new property/field to a node with a label/record in a table for a football transfers dataset that I've been playing with.

After introducing this 'nationality' property I realised that I now had some duplication in the model:

2016 05 22 10 15 15

players.nationality and clubs.country are referring to the same countries but they've both got them stored as strings so we can't ensure the integrity of our countries and ensure that we're referring to the same country.

We have the same issue in the graph model as well:

2016 05 22 10 40 40

This time Player.nationality and Club.country refer to the same countries.

We can solve our problem by introducing a countries table in the relational model and a set of nodes with a 'Country' label in the graph model. Let's start with relational.

This is the model we're driving towards:

2016 05 22 10 50 43

The first thing we need to do is create a countries table and populate it:


CREATE TABLE countries (
    "code" character varying(3) NOT NULL PRIMARY KEY,
    "name" character varying(50) NOT NULL
);

INSERT INTO countries VALUES('MNE', 'Montenegro');
INSERT INTO countries VALUES('SWZ', 'Swaziland');
...

Next let's update the clubs table to reference the countries table:


ALTER TABLE clubs
ADD COLUMN country_id character varying(3)
REFERENCES countries(code);

And let's run a query to populate that column:


UPDATE clubs AS cl
SET country_id = c.code
FROM clubs
INNER JOIN countries AS c 
ON c.name = clubs.country
WHERE cl.id = clubs.id;

This query iterates over all the clubs, queries the country table to find the country id for that row and then stores it in the 'country_id' field. Finally we can remove the 'country' field:


ALTER TABLE clubs
DROP COLUMN country;

Now we do the same drill for the players table:


ALTER TABLE players
ADD COLUMN country_id character varying(3)
REFERENCES countries(code);

UPDATE players AS p
SET country_id = c.code
FROM players
INNER JOIN countries AS c 
ON c.name = players.nationality
WHERE p.id = players.id;

ALTER TABLE players
DROP COLUMN nationality;

Now it's time for the graph. This is the model we want to get to:

2016 05 22 10 51 49

First we'll create the countries:


CREATE CONSTRAINT ON (c:Country)
ASSERT c.id IS UNIQUE

LOAD CSV WITH HEADERS FROM "file:///countries.csv"
AS row
MERGE (country:Country {id: row.countryCode})
ON CREATE SET country.name = row.country

And now let's get clubs and players to point at those countries nodes and get rid of their respective nationality/country properties:


MATCH (club:Club)
MATCH (country:Country {name: club.country})
MERGE (club)-[:PART_OF]->(country)
REMOVE club.country

MATCH (player:Player)
MATCH (country:Country {name: player.nationality})
MERGE (player)-[:PLAYS_FOR]->(country)
REMOVE player.nationality

And that's it, we can now write queries against our new model.

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket