Python NLTK/Neo4j: Analysing the transcripts of How I Met Your Mother
After reading Emil's blog post about dark data a few weeks ago I became intrigued by the idea of finding some structure in free text data, and I thought How I Met Your Mother's transcripts would be a good place to start.
I found a website which has the transcripts for all the episodes, manually downloaded the two pages which list all the episodes, and then wrote a script to grab each of the transcripts so I could use them on my machine.
I wanted to learn a bit of Python and my colleague Nigel pointed me towards the requests and BeautifulSoup libraries to help me with my task. The script to grab the transcripts looks like this:
import requests
from bs4 import BeautifulSoup
from soupselect import select

episodes = {}

# Build a dictionary of (episode number -> title/link) from the two saved listing pages
for i in range(1, 3):
    page = open("data/transcripts/page-" + str(i) + ".html", 'r')
    soup = BeautifulSoup(page.read())

    for row in select(soup, "td.topic-titles a"):
        parts = row.text.split(" - ")
        episodes[parts[0]] = {"title": parts[1], "link": row.get("href")}

# Download each transcript and write it to disk
for key, value in episodes.iteritems():
    parts = key.split("x")
    season = int(parts[0])
    episode = int(parts[1])
    filename = "data/transcripts/S%d-Ep%d" % (season, episode)
    print filename

    with open(filename, 'wb') as handle:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        response = requests.get("http://transcripts.foreverdreaming.org" + value["link"], headers=headers)

        if response.ok:
            for block in response.iter_content(1024):
                if not block:
                    break
                handle.write(block)
The files containing the lists of episodes are named 'page-1' and 'page-2'.
The code is reasonably simple - we find all the links inside the table, put them in a dictionary and then iterate through the dictionary and download the files to disk. The code to save the file is a bit of a monstrosity but there didn’t seem to be a 'save' method that I could use.
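As an aside, requests doesn't have a 'save to file' method, but writing response.content to the file in one go is a simpler alternative if the pages comfortably fit in memory. This is just a sketch (the save_page helper is for illustration), not the approach used above:

import requests

def save_page(url, filename, headers=None):
    # Fetch the page and write the whole body to disk in one go.
    # Assumes the response is small enough to hold in memory.
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    with open(filename, 'wb') as handle:
        handle.write(response.content)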
Having downloaded the files, I thought through all sorts of clever things I could do, including generating a bag of words model for each episode or performing sentiment analysis on each sentence, which I'd learnt about from a Kaggle tutorial.
In the end I decided to start simple and extract all the words from the transcripts and count how many times each word occurred in a given episode.
I ended up with the following script, which creates a dictionary of (episode -> words + occurrences):
import csv
import nltk
import re

from bs4 import BeautifulSoup
from soupselect import select
from nltk.corpus import stopwords
from collections import Counter
from nltk.tokenize import word_tokenize

def count_words(words):
    tally = Counter()
    for elem in words:
        tally[elem] += 1
    return tally

episodes_dict = {}
with open('data/import/episodes.csv', 'r') as episodes:
    reader = csv.reader(episodes, delimiter=',')
    reader.next()

    for row in reader:
        print row
        transcript = open("data/transcripts/S%s-Ep%s" % (row[3], row[1])).read()
        soup = BeautifulSoup(transcript)
        rows = select(soup, "table.tablebg tr td.post-body div.postbody")

        # strip out the ads and footer links before extracting the text
        raw_text = rows[0]
        [ad.extract() for ad in select(raw_text, "div.ads-topic")]
        [ad.extract() for ad in select(raw_text, "div.t-foot-links")]

        # keep only letters, tokenise and drop English stop words
        text = re.sub("[^a-zA-Z]", " ", raw_text.text.strip())
        words = [w for w in nltk.word_tokenize(text) if not w.lower() in stopwords.words("english")]

        episodes_dict[row[0]] = count_words(words)
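Incidentally, Counter can build the tally in one call, and its most_common method gives a quick way to sanity check the results. A minimal sketch with a made-up token list:

from collections import Counter

# Counter builds the same tally as count_words in one call
# (the tokens here are made up for illustration)
words = ["Ted", "Barney", "Ted", "umbrella", "Ted"]
word_counts = Counter(words)
print word_counts.most_common(1)
# => [('Ted', 3)]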
Next I wanted to explore the data a bit to see which words occurred across episodes or which word occurred most frequently and realised that this would be a much easier task if I stored the data somewhere.
s/somewhere/in Neo4j
Neo4j’s query language, Cypher, has a really nice ETL-esque tool called 'LOAD CSV' for loading in CSV files (as the name suggests!) so I added some code to save my words to disk:
with open("data/import/words.csv", "w") as words_file:
    writer = csv.writer(words_file, delimiter=",")
    writer.writerow(["EpisodeId", "Word", "Occurrences"])

    for episode_id, word_counts in episodes_dict.iteritems():
        for word in word_counts:
            writer.writerow([episode_id, word, word_counts[word]])
This is what the CSV file contents look like:
$ head -n 10 data/import/words.csv
EpisodeId,Word,Occurrences
165,secondly,1
165,focus,1
165,baby,1
165,spiders,1
165,go,4
165,apartment,1
165,buddy,1
165,Exactly,1
165,young,1
Now we need to write some Cypher to get the data into Neo4j:
// words
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (word:Word {value: row.Word})
// episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (episode:Episode {id: TOINT(row.EpisodeId)})
// words to episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MATCH (word:Word {value: row.Word})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (word)-[:USED_IN_EPISODE {times: TOINT(row.Occurrences) }]->(episode);
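For completeness, the same statements could also be run from Python with the official neo4j driver. This is only a sketch - the driver, URI and credentials are assumptions rather than anything used in this post:

from neo4j import GraphDatabase

# Assumed connection details - adjust the URI and credentials for your own setup
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

words_query = """
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (word:Word {value: row.Word})
"""

with driver.session() as session:
    session.run(words_query)

driver.close()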
Having done that, we can write some simple queries to explore the words used in How I Met Your Mother:
MATCH (word:Word)-[r:USED_IN_EPISODE]->(episode)
RETURN word.value, COUNT(episode) AS episodes, SUM(r.times) AS occurrences
ORDER BY occurrences DESC
LIMIT 10
==> +-------------------------------------+
==> | word.value | episodes | occurrences |
==> +-------------------------------------+
==> | "Ted" | 207 | 11437 |
==> | "Barney" | 208 | 8052 |
==> | "Marshall" | 208 | 7236 |
==> | "Robin" | 205 | 6626 |
==> | "Lily" | 207 | 6330 |
==> | "m" | 208 | 4777 |
==> | "re" | 208 | 4097 |
==> | "know" | 208 | 3489 |
==> | "Oh" | 197 | 3448 |
==> | "like" | 208 | 2498 |
==> +-------------------------------------+
==> 10 rows
The main 5 characters occupy the top 5 positions, which is probably what you'd expect. I'm not sure why 'm' and 're' are in the next two positions - most likely the cleaning step is to blame: stripping out non-alphabetic characters turns contractions like "I'm" and "they're" into separate 'm' and 're' tokens.
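A quick check of that cleaning step (the same regex and tokeniser used above) shows where those tokens come from:

import re
import nltk

# The regex replaces every non-letter (including apostrophes) with a space,
# so "I'm" becomes "I m" and "they're" becomes "they re"
text = re.sub("[^a-zA-Z]", " ", "I'm sure they're here")
print nltk.word_tokenize(text)
# => ['I', 'm', 'sure', 'they', 're', 'here']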
Our next query might focus on checking which character is referred to the most in each episode:
WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters
MATCH (word:Word) WHERE word.value IN mainCharacters
MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word)
WITH episode, word, r
ORDER BY episode.id, r.times DESC
WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord
RETURN episode.id, topWord.word AS word, topWord.times AS occurrences
LIMIT 10
==> +---------------------------------------+
==> | episode.id | word | occurrences |
==> +---------------------------------------+
==> | 72 | "Barney" | 75 |
==> | 143 | "Ted" | 16 |
==> | 43 | "Lily" | 74 |
==> | 156 | "Ted" | 12 |
==> | 206 | "Barney" | 23 |
==> | 50 | "Marshall" | 51 |
==> | 113 | "Ted" | 76 |
==> | 178 | "Barney" | 21 |
==> | 182 | "Barney" | 22 |
==> | 67 | "Ted" | 84 |
==> +---------------------------------------+
==> 10 rows
If we dig into it further there’s actually quite a bit of variety in the number of times the top character in each episode is mentioned which again probably says something about the data:
WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters
MATCH (word:Word) WHERE word.value IN mainCharacters
MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word)
WITH episode, word, r
ORDER BY episode.id, r.times DESC
WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord
RETURN MIN(topWord.times), MAX(topWord.times), AVG(topWord.times), STDEV(topWord.times)
==> +-------------------------------------------------------------------------------------+
==> | MIN(topWord.times) | MAX(topWord.times) | AVG(topWord.times) | STDEV(topWord.times) |
==> +-------------------------------------------------------------------------------------+
==> | 3 | 259 | 63.90865384615385 | 42.36255207691068 |
==> +-------------------------------------------------------------------------------------+
==> 1 row
Obviously this is a very simple way of deriving structure from text. Here are some of the things I want to try out next:

- Detecting common phrases/memes used in the show (e.g. the yellow umbrella) - this should be possible by creating n-grams of different lengths and then searching for those phrases across the corpus (there's a sketch of this after the list).
- Pulling out scenes - some of the transcripts use the keyword 'scene' to denote this although some of them don't. Depending on how many transcripts contain scene demarcations, perhaps we could train a classifier to detect where scenes should be in the transcripts which don't have them.
- Analysing who talks to each other or who talks about each other most frequently.
- Creating a graph of conversations, as my colleagues Max and Michael have previously blogged about.
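For the n-gram idea, something along these lines might be a starting point - a minimal sketch using nltk.ngrams with made-up tokens rather than the real transcript data:

import nltk

# Made-up tokens standing in for one episode's transcript
tokens = ["under", "the", "yellow", "umbrella", "she", "waited"]

# Generate bigrams and trigrams; repeating this per episode and counting
# the results across the corpus would surface common phrases
for n in range(2, 4):
    for gram in nltk.ngrams(tokens, n):
        print " ".join(gram)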