3 Apr 2015

How I met your mother: Story arcs

After weeks of playing around with various algorithms to extract story arcs in How I met your mother I’ve come to the conclusion that I don’t yet have the skills to completely automate this process so I’m going to change my approach.

The new plan is to treat the outputs of the algorithms as suggestions for possible themes but then have a manual step where I extract what I think are interesting themes in the series.

A theme can consist of a single word or a phrase and the idea is that once a story arc is identified we’ll search over the corpus and find the episodes where that phrase occurs.

We can then generate a CSV file of (story arc) -> (episodeId), store that into our HIMYM graph and use the story arc as another factor for episode similarity.

I ended up with the following script to work out which episodes contained a story arc:

#!/bin/bash

find_term() {
  arc=${1}
  searchTerm=${2}
  episodes=$(grep --color -iE "${searchTerm}" data/import/sentences.csv | awk -F"," '{ print $2  }' | sort | uniq)
  for episode in ${episodes}; do
    echo ${arc},${episode}
  done
}

find_term "Bro Code" "bro code"
find_term "Legendary" "legen(.*)ary"
find_term "Slutty Pumpkin" "slutty pumpkin"
find_term "Magician's Code" "magician's code"
find_term "Thanksgiving" "thanksgiving"
find_term "The Playbook" "playbook"
find_term "Slap Bet" "slap bet"
find_term "Wrestlers and Robots" "wrestlers"
find_term "Robin Sparkles" "sparkles"
find_term "Blue French Horn" "blue french horn"
find_term "Olive Theory" "olive"
find_term "Thank You Linus" "thank you, linus"
find_term "Have you met...?" "have you met"
find_term "Laser Tag" "laser tag"
find_term "Goliath National Bank" "goliath national bank"
find_term "Challenge Accepted" "challenge accepted"
find_term "Best Man" "best man"

If we run this script we’ll see something like the following:

$ ./scripts/arcs.sh
Bro Code,14
Bro Code,155
Bro Code,170
Bro Code,188
Bro Code,201
Bro Code,61
Bro Code,64
Legendary,117
Legendary,120
Legendary,122
Legendary,136
Legendary,137
Legendary,15
Legendary,152
Legendary,157
Legendary,162
Legendary,171
...
Best Man,208
Best Man,30
Best Man,32
Best Man,41
Best Man,42

I pulled out these themes by eyeballing the output of the following scripts:

TF/IDF - calculates TF/IDF scores for ngrams. This helps find important themes in the context of a single episode. I then did some manual searching to see how many of those themes existed in other episodes
Weighted Term Frequency - this returns a weighted term frequency for ngrams of different lengths. The weights are determined by the skewed random discrete distribution I wrote about earlier in the week. I ran it with different skews and ngram lengths.
Named entity extraction - this pulls out any phrases that are named entities. It mostly pulled out names of people (which I used as a stop word list in some other algorithms) but also revealed a couple of themes.
Topic modelling - I used mallet to extract topics across the corpus. Most of them didn’t make much sense to me but there were a few which identified themes that I recognised.

I can’t remember off the top of my head if any obvious themes have been missed so if you know HIMYM better than me let me know and I’ll try and work out why those didn’t surface.

Next I want to see how these scripts fare against some other TV shows and see how quickly I can extract themes for those. It’d also be cool if I can make the whole process a bit more automated.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.