12 Feb 2015

Python/gensim: Creating bigrams over How I met your mother transcripts

As part of my continued playing around with How I met your mother transcripts I wanted to identify plot arcs and as a first step I wrote some code using the gensim and nltk libraries to identify bigrams (two word phrases).

There’s an easy to follow tutorial in the gensim docs showing how to go about this but I needed to do a couple of extra steps to get my text data from a CSV file into the structure gensim expects.

Let’s first remind ourselves what the sentences CSV file looks like:

$ head -n 15 data/import/sentences.csv  | tail
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"
10,1,1,1,"Narrator: It was way back in 2005. I was twenty-seven just starting to make it as an architect and living in New York with my friend Marshall, my best friend from college. My life was good and then Uncle Marshall went and screwed the whole thing up."
11,1,1,1,Marshall: (Opens ring) Will you marry me.
12,1,1,1,"Ted: Yes, perfect! And then you're engaged, you pop the champagne! You drink a toast! You have s*x on the kitchen floor... Don't have s*x on our kitchen floor."
13,1,1,1,"Marshall: Got it. Thanks for helping me plan this out, Ted."
14,1,1,1,"Ted: Dude, are you kidding? It's you and Lily! I've been there for all the big moments of you and Lily. The night you met. Your first date... other first things."

We need to transform those sentences into an array of words for each line and feed it into gensim’s models.Phrase object:

import nltk
import csv
import string
from gensim.models import Phrases
from gensim.models import Word2Vec
from nltk.corpus import stopwords

sentences = []
bigram = Phrases()
with open("data/import/sentences.csv", "r") as sentencesfile:
    reader = csv.reader(sentencesfile, delimiter = ",")
    reader.next()
    for row in reader:
        sentence = [word.decode("utf-8")
                    for word in nltk.word_tokenize(row[4].lower())
                    if word not in string.punctuation]
        sentences.append(sentence)
        bigram.add_vocab([sentence])

We’re used nltk’s word_tokezine function to create our array of words and then we’ve got a clause to make sure we remove any words which are punctuation otherwise they will dominate our phrases.

We can take a quick peek at some of the phrases that have been created like so:

>>> list(bigram[sentences])[:5]
[[u'pilot'], [u'scene', u'one'], [u'title', u'the', u'year_2030'], [u'narrator_kids', u'i', u"'m", u'going', u'to', u'tell', u'you', u'an_incredible', u'story.', u'the', u'story', u'of', u'how', u'i', u'met_your', u'mother'], [u'son', u'are', u'we', u'being', u'punished', u'for', u'something']]

gensim uses an underscore character to indicate when it’s joined two words together and in this sample we’ve got three phrases - 'narrator_kids', 'met_you' and 'an_incredible'.

We can now populate a Counter with our phrases and their counts and find out the most common phrases. One thing to note is that I’ve chosen to get rid of stopwords at this point rather than earlier because I didn’t want to generate 'false bigrams' where there was actually a stop word sitting in between.

bigram_counter = Counter()
for key in bigram.vocab.keys():
    if key not in stopwords.words("english"):
        if len(key.split("_")) > 1:
            bigram_counter[key] += bigram.vocab[key]

for key, counts in bigram_counter.most_common(20):
    print '{0: <20} {1}'.format(key.encode("utf-8"), counts)

i_'m                 4607
it_'s                4288
you_'re              2659
do_n't               2436
that_'s              2044
in_the               1696
gon_na               1576
you_know             1497
i_do                 1464
this_is              1407
and_i                1389
want_to              1071
it_was               1053
on_the               1052
at_the               1035
we_'re               1033
i_was                1018
of_the               1014
ca_n't               1010
are_you              994

Most of the phrases aren’t really that interesting and I had better luck feeding the phrases into a Word2Vec model and repeating the exercise:

bigram_model = Word2Vec(bigram[sentences], size=100)
bigram_model_counter = Counter()
for key in bigram_model.vocab.keys():
    if key not in stopwords.words("english"):
        if len(key.split("_")) > 1:
            bigram_model_counter[key] += bigram_model.vocab[key].count

for key, counts in bigram_model_counter.most_common(50):
    print '{0: <20} {1}'.format(key.encode("utf-8"), counts)

do_n't               2436
gon_na               1576
ca_n't               1010
did_n't              704
come_on              499
end_of               460
kind_of              396
from_2030            394
my_god               360
they_'re             351
'm_sorry             349
does_n't             341
end_flashback        327
all_right            308
've_been             303
'll_be               301
of_course            289
a_lot                284
right_now            279
new_york             270
look_at              265
trying_to            238
tell_me              196
a_few                195
've_got              189
wo_n't               174
so_much              172
got_ta               168
each_other           166
my_life              157
talking_about        157
talk_about           154
what_happened        151
at_least             141
oh_god               138
wan_na               129
supposed_to          126
give_me              124
last_night           121
my_dad               120
more_than            119
met_your             115
excuse_me            112
part_of              110
phone_rings          109
get_married          107
looks_like           105
'm_sorry.            104
said_``              101

The first 20 phrases or so aren’t particularly interesting although we do have 'new_york' in there which is good as that’s where the show is set. If we go further we’ll notice phrases like 'my_dad', 'get_married' and 'last_night' which may all explain interesting parts of the plot.

Having the data in the Word2Vec model allows us to do some other fun queries too. e.g.

>>> bigram_model.most_similar(['marshall', 'lily'], ['ted'], topn=10)
[(u'robin', 0.5474381446838379), (u'go_ahead', 0.5138797760009766), (u'zoey', 0.505358874797821), (u'karen', 0.48617005348205566), (u'cootes', 0.4757827818393707), (u'then', 0.45426881313323975), (u'lewis', 0.4510520100593567), (u'natalie.', 0.45070385932922363), (u'vo', 0.4189065098762512), (u'players', 0.4149518311023712)]

>>> bigram_model.similarity("ted", "robin")
0.51928683064927905

>>> bigram_model.similarity("barney", "robin")
0.62980405583219112

>>> bigram_model.most_similar(positive=['getting_married'])
[(u'so_glad', 0.780311107635498), (u'kidding', 0.7683225274085999), (u'awake', 0.7682262659072876), (u'lunch.', 0.7591195702552795), (u'ready.', 0.7372316718101501), (u'single.', 0.7350872755050659), (u'excited_about', 0.725479006767273), (u'swamped', 0.7252731323242188), (u'boyfriends', 0.7127221822738647), (u'believe_this.', 0.71015864610672)]

>>> bigram_model.most_similar(positive=['my_dad'])
[(u'my_mom', 0.7994954586029053), (u'somebody', 0.7758427262306213), (u'easier', 0.7305313944816589), (u'hot.', 0.7282992601394653), (u'pregnant.', 0.7103987336158752), (u'nobody', 0.7059557437896729), (u'himself.', 0.7046393156051636), (u'physically', 0.7044381499290466), (u'young_lady', 0.69412761926651), (u'at_bernie', 0.682607889175415)]

I’m not quite at the stage where I can automatically pull out the results of a gensim model and do something with it but it is helping me to see some of the main themes in the show.

Next up I’ll try out trigrams and then TF/IDF over the bigrams to see which are the most important on a per episode basis. I also need to dig into Word2Vec to figure out why it comes up with different top phrases than the Phrases model.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.