
Predicting movie genres with node2Vec and Tensorflow

In my previous post we looked at how to get up and running with the node2Vec algorithm, and in this post we’ll learn how we can feed graph embeddings into a simple Tensorflow model.

Recall that node2Vec takes in a list of edges (or relationships) and gives us back an embedding (array of numbers) for each node.

This time we’re going to run the algorithm over a movie recommendations dataset from the Neo4j Sandbox. If you want to follow along you’ll need to launch the 'recommendations' sandbox. You should see something like this once it’s started:

(screenshot of the running recommendations sandbox)

Building an edge list

Our first job is to generate a file containing edges that we can feed to node2Vec. The following code will do the trick:

import csv
from neo4j.v1 import GraphDatabase, basic_auth

host = "bolt://localhost" # replace this with your Sandbox host
password = "neo" # replace this with your Sandbox password

driver = GraphDatabase.driver(host, auth=basic_auth("neo4j", password))

with driver.session() as session, open("graph/movies.edgelist", "w") as edges_file:
    result = session.run("""\
    MATCH (m:Movie)--(other)
    RETURN id(m) AS source, id(other) AS target
    """)

    writer = csv.writer(edges_file, delimiter=" ")

    for row in result:
        writer.writerow([row["source"], row["target"]])

We’re getting back all the relationships from movies to any other node. The id() function returns Neo4j’s internal identifier for each of the nodes returned in the MATCH clause, and those identifiers are what we’ll use to tie the embeddings back to nodes later on.

We end up with a file that looks like this:

$ cat graph/movies.edgelist | head -n10
0 1
0 2
0 3
0 4
0 6
0 9774
0 9780
0 9785
0 9791
0 9790

Running node2Vec

Now we’re ready to run node2Vec to get embeddings for each of our nodes. I explained how to install the Stanford implementation of the algorithm in the previous post, so revisit that one if you want to follow along.

Running the following command will create an embedding of 100 elements for each node, using random walks of 80 hops. These numbers are guesswork, so other values might well work better.

./node2vec -i:graph/movies.edgelist -o:emb/movies.emb -l:80 -d:100 -p:0.3 -dr -v

Our output should look like this:

$ cat emb/movies.emb  | head -n5
32314 100
14243 0.002886 -0.0009521 0.0224149 0.0211797 -0.0234808 0.0465503 -0.0242661 -0.0108606 0.0124484 -0.0342895 0.0115387 0.0243817 0.00805514 0.0080248 0.0051944 0.025798 -0.00788376 -0.0251952 -0.048099 0.0127707 0.0194209 0.00763978 -0.0130131 -0.0230401 0.0147994 -0.00373403 0.00196932 -0.000321203 0.0118537 0.00496018 -0.0114329 0.00832536 0.00903396 -0.00277039 0.0143092 0.0031493 -0.016161 -0.0124357 0.00809057 0.0129928 -0.0158231 0.0282883 -0.0114194 0.00480747 -0.000219177 0.0172819 -0.0402172 -0.0281593 -0.00179042 0.0272349 -0.00990981 -0.00709573 -0.0323773 0.0208203 -0.0316696 0.00452456 -0.0253643 0.00837943 -0.0234769 0.00504737 0.0120566 -0.00537971 -0.00255093 0.0209391 0.020711 0.016786 0.0271043 0.00830118 0.0131981 0.00775244 -0.00629482 -0.0149914 0.0269024 -0.00378804 0.00938015 0.0264441 0.00919089 0.0158234 0.00182008 0.00721888 0.0101987 -0.00948434 -0.00220668 0.00522284 -0.00246156 0.0209852 -0.0178339 -0.0028442 0.00806226 -0.0066889 -0.00723828 0.0461259 0.00875541 0.0062631 0.0104947 -0.0140804 0.0241079 -0.026269 0.0136609 -0.00429287
0 0.14308 0.573671 0.696905 0.382039 -0.63165 0.690953 -0.127411 -0.325989 0.219356 -0.810672 0.0769492 0.69869 0.475756 0.56377 0.0867368 0.243866 -0.495199 0.10085 -0.759924 0.500157 0.693239 0.508205 -0.861352 -0.919493 0.537185 0.176292 0.146458 -0.413557 0.408049 0.0471144 -0.14392 0.20072 0.470296 0.197094 0.238148 -0.296311 -0.275903 0.248415 0.00279996 0.204512 0.415326 0.658922 0.576313 0.364277 -0.515041 0.387918 -0.566269 -0.46252 0.125421 0.734148 -0.0903596 -0.313846 -0.741623 0.428222 -0.534838 0.0441242 -0.6724 0.513446 -0.71289 -0.0973926 0.269651 0.179949 0.495639 0.56465 0.79435 0.708972 0.947383 -0.124569 0.420107 0.559592 -0.19149 -0.440302 0.657272 -0.0282963 -0.100281 0.251455 0.251974 0.646931 -0.29645 -0.175667 -0.459531 -0.269747 -0.182972 0.345404 0.0459218 0.676589 -0.38854 0.0952728 0.623066 0.323733 0.426018 0.569366 0.670057 0.796834 0.778748 -0.850252 0.674652 -0.857651 0.466597 -0.184641
6881 -0.0069373 -0.0182842 -0.00250641 -0.014827 0.0238218 -0.0157951 0.0105609 0.0134582 -0.000478559 0.0159776 0.0134371 -0.0187747 -0.0202883 -0.0192247 0.00997851 -0.0037014 0.0228678 -0.0110792 0.0168576 -0.00694089 -0.0279912 -0.00846954 0.032771 0.0334678 -0.00983781 -0.0208949 -0.00198547 0.0237212 -0.00677631 -0.000485477 -0.00211323 0.00385676 -0.00488553 -0.00991322 0.00616214 0.0130523 -0.00123142 -0.00466244 -0.000123816 -0.00937801 -0.0159647 -0.0217866 -0.0255627 -0.00232269 0.00521148 0.00464528 0.00146591 0.0133423 -0.0116764 -0.0231797 0.001886 0.0194517 0.00968319 -0.013389 0.00238838 -0.00150677 0.0145389 -0.00695154 0.0159505 0.0100921 -0.0126679 -0.00703088 -0.0191637 -0.0116676 -0.0216134 -0.0155858 -0.0348786 0.00824809 -0.0160402 -0.0156108 0.00222316 0.00565494 -0.0203139 -0.00169975 0.0151738 0.000105578 -0.00196151 -0.0137246 0.00352464 0.00306386 0.03115 -0.000200159 0.00276833 -0.013697 -0.00134814 -0.0108099 0.014244 -0.00102111 -0.016965 -0.00954926 -0.0113966 -0.0187811 -0.0243734 -0.032553 -0.0277708 0.0208017 -0.0119544 0.0282654 -0.00253335 0.00411899
9192 -0.0418675 -0.0748957 -0.0349762 -0.0232418 0.0997403 -0.080086 0.0267859 0.00144486 -0.0365908 0.109965 0.0383626 -0.0673113 -0.0882356 -0.0720804 0.0537256 0.012303 0.0671034 -0.00490223 0.094077 -0.0517411 -0.100004 -0.0413149 0.182336 0.156172 -0.0471807 -0.052455 -0.00689152 0.115385 -0.0458524 0.0432888 -0.000982105 -0.00449935 -0.0189335 -0.0173315 -0.011708 0.0074369 -0.000439496 -0.0179856 0.0218096 -0.046571 -0.0639089 -0.146104 -0.0937443 0.0258947 0.049675 0.011432 0.0380184 0.08526 -0.0637881 -0.134194 -0.0146248 0.125638 0.0744331 -0.0212793 0.0790646 0.0457259 0.0857937 -0.0283405 0.0413922 0.0467135 -0.0517976 -0.032048 -0.105124 -0.0454983 -0.133484 -0.0705029 -0.170737 0.0544806 -0.0954872 -0.108356 0.0558612 0.0175526 -0.0754161 -0.0118317 0.0332381 -0.016054 0.0207899 -0.0376508 0.0160112 0.0250085 0.140147 0.0043331 -0.0227054 -0.0986077 -0.0284764 -0.0404766 0.0583791 -0.00450137 -0.0354385 -0.0644349 -0.0747603 -0.0906304 -0.128325 -0.175729 -0.0978647 0.119311 -0.0859975 0.137831 0.00438728 0.000827815

Storing node2Vec embeddings

Next we’re going to store these embeddings back into Neo4j so that we can use them later. The following code will do the trick:

with open("movies.emb", "r") as movies_file, driver.session() as session:
    next(movies_file)
    reader = csv.reader(movies_file, delimiter=" ")

    params = []
    for row in reader:
        movie_id = row[0]
        params.append({
            "id": int(movie_id),
            "embedding": [float(item) for item in row[1:]]
        })

    session.run("""\
    UNWIND {params} AS param
    MATCH (m:Movie) WHERE id(m) = param.id
    SET m.embedding = param.embedding
    """, {"params": params})

Now we’re ready to do some machine learning.

Loading our data into Pandas

Each of the movies in our dataset has one or more genres and we’re going to create a Tensorflow model that tries to predict the genres for a given movie.

We’ll first load our dataset into a Pandas DataFrame. Each row will represent a movie, with one column containing its 100-element embedding (we’ll expand that into 100 separate columns shortly) and another column containing an array indicating which genres it has. That array will be a variant of a one hot encoding (a multi hot encoding?!).
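
To make the multi hot idea concrete, here’s a tiny standalone example (the genre names are just for illustration):

all_genres = ["Action", "Comedy", "Drama", "Romance"]
movie_genres = ["Comedy", "Romance"]

# 1 if the movie has that genre, 0 otherwise -> [0, 1, 0, 1]
encoding = [1 if genre in movie_genres else 0 for genre in all_genres]
print(encoding)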

The following code will construct our DataFrame:

import pandas as pd

movies_genres_query = """\
MATCH (genre:Genre)
WITH genre ORDER BY genre.name
WITH collect(id(genre)) AS genres
MATCH (m:Movie)-[:IN_GENRE]->(genre)
WITH genres, id(m) AS source, m.embedding AS embedding, collect(id(genre)) AS target
RETURN source, embedding, [g in genres | CASE WHEN g in target THEN 1 ELSE 0 END] AS genres
"""

with driver.session() as session:
    result = session.run(movies_genres_query)
    df = pd.DataFrame([dict(row) for row in result])

What does our DataFrame look like at the moment?

>>> df.head()
                                           embedding  source  \
0  [0.14308, 0.573671, 0.696905, 0.382039, -0.631...       0
1  [-0.0069373, -0.0182842, -0.00250641, -0.01482...    6881
2  [-0.000760393, -0.00128913, -0.0139862, -0.006...    8435
3  [-0.00199855, -0.00427414, -0.0113419, -0.0027...    4009
4  [0.00647937, -0.00590963, -0.00502976, 0.00046...    3057

                                              genres
0  [0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
1  [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
2  [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
3  [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
4  [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, ...

At the moment the embeddings are in an array within one column. We’ll fix that in the next step.

Splitting training/test sets

We’re going to split our DataFrame into training and test sets. We’ll just do a simple 90/10 split here, but if we were doing this for real we’d want to do something more sophisticated such as K-Fold validation (there’s a rough sketch of that at the end of this section).

import numpy as np

train_index = int(len(df) * 0.9)
train_data = df[:train_index]
test_data = df[train_index:]

# expand each embedding array into one column per element
train_x = train_data.loc[:, "embedding"]
train_x = pd.DataFrame(np.array([np.array(item) for item in train_x.values]))
train_x.columns = [str(col) for col in train_x.columns]

train_y = train_data.loc[:, "genres"]

# separate test data
test_x = test_data.loc[:, "embedding"]
test_x = pd.DataFrame(np.array([np.array(item) for item in test_x.values]))
test_x.columns = [str(col) for col in train_x.columns]

test_y = test_data.loc[:, "genres"]

The reason for this slightly insane data massaging is so that we can easily feed our data into a Tensorflow Dataset in the next section. If there’s an easier way to do this, please let me know in the comments and I’ll update the example.
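
As an aside, if we did want the K-Fold approach mentioned above, scikit-learn does the index bookkeeping for us. A rough sketch (assuming scikit-learn is installed) would look something like this:

from sklearn.model_selection import KFold

# 5 folds: each fold takes a turn as the test set while the rest is used for training
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(df):
    fold_train, fold_test = df.iloc[train_idx], df.iloc[test_idx]
    # ...build train_x/train_y and test_x/test_y from each fold as above...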

Building Tensorflow model

Now we’re ready to create our Tensorflow model. First up, our imports:

import tensorflow as tf
from tensorflow.python.feature_column import feature_column_lib
from tensorflow.python.training import training_util
from tensorflow.python.training.ftrl import FtrlOptimizer

We’re going to create a simple multi label classifier because a movie can have more than one genre associated with it.

feature_columns = [tf.feature_column.numeric_column(key=key)
                   for key in train_x.keys()]

LEARNING_RATE = 0.3
loss_reduction = tf.losses.Reduction.SUM_OVER_BATCH_SIZE

# 20 output classes, one per genre
head = tf.contrib.estimator.multi_label_head(20, weight_column=None,
                                             label_vocabulary=None,
                                             loss_reduction=loss_reduction)

classifier = tf.estimator.Estimator(model_fn=model_fn)

classifier.train(
    input_fn=lambda: train_input_fn(train_x, train_y, 100),
    steps=2000)

eval_result = classifier.evaluate(
    input_fn=lambda: eval_input_fn(test_x, test_y, 100))

print(eval_result)
print('\nTest set AUC: {auc:0.3f}\n'.format(**eval_result))

We called a few functions in the previous section so let’s look into those in more detail. We have two functions for passing data into the train and eval functions of our classifier:

def train_input_fn(features, labels, batch_size):
    # convert the column of genre arrays into a dense 2D tensor
    labels = tf.constant(np.array([np.array(item)
                                   for item in labels.values]))
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    return dataset.shuffle(1000).repeat().batch(batch_size)


def eval_input_fn(features, labels, batch_size):
    labels = tf.constant(np.array([np.array(item)
                                   for item in labels.values]))

    features = dict(features)
    inputs = (features, labels)
    dataset = tf.data.Dataset.from_tensor_slices(inputs)

    assert batch_size is not None, "batch_size must not be None"
    return dataset.batch(batch_size)

I took most of this directly from the Tensorflow documentation and tests, except the labels conversion at the start of each function, which I ended up with after a lot of trial and error.

We also have a model function:

def model_fn(features, labels, mode, config):
    def train_op_fn(loss):
        # minimise the loss with FTRL, the same optimiser LinearClassifier uses by default
        opt = FtrlOptimizer(learning_rate=LEARNING_RATE)
        return opt.minimize(loss,
                            global_step=training_util.get_global_step())

    def logit_fn(features):
        # a single linear layer over our 100 embedding feature columns
        return feature_column_lib.linear_model(
            features=features,
            feature_columns=feature_columns,
            units=head.logits_dimension)

    return head.create_estimator_spec(
        features=features,
        mode=mode,
        logits=logit_fn(features=features),
        labels=labels,
        train_op_fn=train_op_fn)

I got most of this code from reverse engineering the LinearClassifier that I wrote about in a previous post.

Now it’s time to actually run this thing! This is the output we get:

$ python movies_multi_label_custom_estimator.py
{'auc': 0.7855524, 'auc_precision_recall': 0.354057, 'average_loss': 0.29183877, 'loss': 0.2946557, 'global_step': 2000}

Test set AUC: 0.786

The auc_precision_recall is probably the most interesting metric here, and we’re not doing very well in that respect, which is perhaps not surprising given the simplicity of our model. The next step would be to replace the model with a more complex Deep Neural Network that might be able to find some patterns in the embeddings. I’ll leave that for another post, but we’ll conclude by looking at how we could use this model to predict which genres a movie should have.
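
As a taster for that follow-up post, the same multi_label_head could be reused with one of the canned deep estimators in tf.contrib.estimator. A rough, untuned sketch (the hidden layer sizes here are pure guesswork) might look something like this:

dnn_classifier = tf.contrib.estimator.DNNEstimator(
    head=head,
    hidden_units=[64, 32],  # two hidden layers; sizes are guesswork
    feature_columns=feature_columns)

dnn_classifier.train(
    input_fn=lambda: train_input_fn(train_x, train_y, 100),
    steps=2000)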

Back to our simple linear model. The following code does the trick:

genres_query = """\
MATCH (genre:Genre)
WITH genre ORDER BY genre.name
RETURN collect(genre.name) AS genres
"""

with driver.session() as session:
    result = session.run(genres_query)
    genres = result.peek()["genres"]

movies_genres_predict_query = """\
MATCH (genre:Genre)
WITH genre ORDER BY genre.name
WITH collect(id(genre)) AS genres
MATCH (m:Movie)-[:IN_GENRE]->(genre)
WITH genres, m.title AS movie, id(m) AS source, m.embedding AS embedding, collect(id(genre)) AS target
RETURN source, movie, embedding, [g in genres | CASE WHEN g in target THEN 1 ELSE 0 END] AS genres
ORDER BY rand()
LIMIT 3
"""

with driver.session() as session:
    result = session.run(movies_genres_predict_query)
    predict_df = pd.DataFrame([dict(row) for row in result])

expected_df = predict_df[["genres", "source", "movie"]]

predict_x = predict_df.loc[:, "embedding"]
predict_x = pd.DataFrame(np.array([np.array(item) for item in predict_x.values]))
predict_x.columns = [str(col) for col in predict_x.columns]

predictions = classifier.predict(
    input_fn=lambda: predict_input_fn(predict_x, 100))

for pred_dict, expec in zip(predictions, expected_df.values):
    expected, source, movie = expec
    print("Movie: {0}".format(movie))
    for idx, label in enumerate(expected):
        print(label, genres[idx], pred_dict["probabilities"][idx])
    print("--")

And if we run the prediction code we’ll see this output:

Movie: Someone Like You
0 (no genres listed) 0.013719821
0 Action 0.16868685
0 Adventure 0.119703874
0 Animation 0.049339224
0 Children 0.06293545
1 Comedy 0.3655233
0 Crime 0.119378164
0 Documentary 0.05467411
0 Drama 0.47860712
0 Fantasy 0.07273513
0 Film-Noir 0.021085216
0 Horror 0.09513151
0 IMAX 0.022415701
0 Musical 0.044788927
0 Mystery 0.060783025
1 Romance 0.16936173
0 Sci-Fi 0.08559853
0 Thriller 0.18862277
0 War 0.042249877
0 Western 0.024629481
--
Movie: Cyrus
0 (no genres listed) 0.013714926
0 Action 0.16862866
0 Adventure 0.11968786
0 Animation 0.04933899
0 Children 0.06292239
1 Comedy 0.36660013
0 Crime 0.1192929
0 Documentary 0.05456238
1 Drama 0.47862184
0 Fantasy 0.072761126
0 Film-Noir 0.021083448
0 Horror 0.09523407
0 IMAX 0.022410033
0 Musical 0.044797856
0 Mystery 0.060746994
1 Romance 0.1696445
0 Sci-Fi 0.085707955
0 Thriller 0.18862824
0 War 0.042215005
0 Western 0.024633398
--
Movie: Hugo
0 (no genres listed) 0.013695647
0 Action 0.16891125
0 Adventure 0.11970882
0 Animation 0.049290065
1 Children 0.062867686
0 Comedy 0.36590067
0 Crime 0.11941985
0 Documentary 0.05452773
1 Drama 0.4786238
0 Fantasy 0.07270951
0 Film-Noir 0.021058151
0 Horror 0.09517987
0 IMAX 0.022390224
0 Musical 0.04475778
1 Mystery 0.06075114
0 Romance 0.169593
0 Sci-Fi 0.08562877
0 Thriller 0.18874636
0 War 0.04222341
0 Western 0.024609018
--

The movies we’ve used to test the model already have a node2Vec embedding. I’m not sure how we’d make predictions for a brand new movie that wasn’t included in our initial embedding calculation. That’s my next thing to research but let me know if you have any ideas.

If you want to play with this, all the code is in my tensorflow-playground repository.
