Python: Detecting the speaker in HIMYM using Parts of Speech (POS) tagging
Over the last couple of weeks I’ve been experimenting with different classifiers to detect speakers in HIMYM transcripts, and so far the only features I’ve used have been the words themselves.
This led to classifiers that overfitted the training data, so I wanted to generalise them by also introducing the parts of speech of the words in each sentence, which are more generic.
First I changed the function which generates the features for each word to also contain the parts of speech of the previous and next words as well as the word itself:
def pos_features(sentence, sentence_pos, i):
    features = {}
    features["word"] = sentence[i]
    features["word-pos"] = sentence_pos[i][1]

    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-word-pos"] = "<START>"
    else:
        features["prev-word"] = sentence[i - 1]
        features["prev-word-pos"] = sentence_pos[i - 1][1]

    if i == len(sentence) - 1:
        features["next-word"] = "<END>"
        features["next-word-pos"] = "<END>"
    else:
        features["next-word"] = sentence[i + 1]
        features["next-word-pos"] = sentence_pos[i + 1][1]

    return features
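As a quick sanity check, this is the kind of feature dictionary the function produces. The function is repeated here so the snippet runs standalone, and the sentence and its (word, tag) tuples are made up, in the same format that nltk.pos_tag returns:

```python
# Standalone sanity check for pos_features. The sentence and its
# (word, tag) tuples are hand-written stand-ins for nltk.pos_tag output.
def pos_features(sentence, sentence_pos, i):
    features = {}
    features["word"] = sentence[i]
    features["word-pos"] = sentence_pos[i][1]
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-word-pos"] = "<START>"
    else:
        features["prev-word"] = sentence[i - 1]
        features["prev-word-pos"] = sentence_pos[i - 1][1]
    if i == len(sentence) - 1:
        features["next-word"] = "<END>"
        features["next-word-pos"] = "<END>"
    else:
        features["next-word"] = sentence[i + 1]
        features["next-word-pos"] = sentence_pos[i + 1][1]
    return features

sentence = ["Barney", ":", "Suit", "up"]
sentence_pos = [("Barney", "NNP"), (":", ":"), ("Suit", "NN"), ("up", "RP")]

print(pos_features(sentence, sentence_pos, 1))
# {'word': ':', 'word-pos': ':', 'prev-word': 'Barney', 'prev-word-pos': 'NNP',
#  'next-word': 'Suit', 'next-word-pos': 'NN'}
```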
Next we need to tweak our calling code to calculate the part of speech tags for each sentence and pass them in:
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    sentence_pos = nltk.pos_tag(untagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, sentence_pos, i), tag))
I’m using nltk to do this and although it’s slower than some alternatives, the data set is small enough that it’s not an issue.
Now it’s time to train a Decision Tree with the new features. I created three variants - one with both words and POS; one with only words; one with only POS.
I took a deep copy of the training/test data sets and then removed the appropriate keys:
def get_rid_of(entry, *keys):
    for key in keys:
        del entry[key]

import copy

# Word based classifier
tmp_train_data = copy.deepcopy(train_data)
for entry, tag in tmp_train_data:
    get_rid_of(entry, 'prev-word-pos', 'word-pos', 'next-word-pos')

tmp_test_data = copy.deepcopy(test_data)
for entry, tag in tmp_test_data:
    get_rid_of(entry, 'prev-word-pos', 'word-pos', 'next-word-pos')

c = nltk.DecisionTreeClassifier.train(tmp_train_data)
# classify expects a single feature dict, so classify each test entry in turn
predictions = [c.classify(entry) for entry, tag in tmp_test_data]

# POS based classifier
tmp_train_data = copy.deepcopy(train_data)
for entry, tag in tmp_train_data:
    get_rid_of(entry, 'prev-word', 'word', 'next-word')

tmp_test_data = copy.deepcopy(test_data)
for entry, tag in tmp_test_data:
    get_rid_of(entry, 'prev-word', 'word', 'next-word')

c = nltk.DecisionTreeClassifier.train(tmp_train_data)
predictions = [c.classify(entry) for entry, tag in tmp_test_data]
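To turn the classifier's predictions into the per-label precision/recall figures reported below, you compare them against the gold labels. Here's a minimal sketch of that calculation, with made-up `predictions` and `gold` lists (True meaning "speaker"):

```python
# Per-label precision/recall for a binary speaker / non-speaker classifier.
# predictions and gold are hypothetical lists of booleans (True = speaker).
def precision_recall(predictions, gold, label):
    tp = sum(1 for p, g in zip(predictions, gold) if p == label and g == label)
    fp = sum(1 for p, g in zip(predictions, gold) if p == label and g != label)
    fn = sum(1 for p, g in zip(predictions, gold) if p != label and g == label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

gold        = [True, True, False, False, False]
predictions = [True, False, False, False, True]

print(precision_recall(predictions, gold, True))   # speaker:     (0.5, 0.5)
print(precision_recall(predictions, gold, False))  # non-speaker: (0.667, 0.667)
```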
The full code is on my GitHub, but these were the results I saw:
$ time python scripts/detect_speaker.py
Classifier speaker precision speaker recall non-speaker precision non-speaker recall
-------------------- ------------------- ---------------- ----------------------- --------------------
Decision Tree All In 0.911765 0.939394 0.997602 0.996407
Decision Tree Words 0.911765 0.939394 0.997602 0.996407
Decision Tree POS 0.90099 0.919192 0.996804 0.996008
There’s still not much in it - the POS one has slightly more false positives and false negatives when classifying speakers, but on other runs it performed better.
If we take a look at the decision tree that’s been built for the POS one we can see that it’s all about POS now as you’d expect:
>>> print(c.pseudocode(depth=2))
if next-word-pos == '$': return False
if next-word-pos == "''": return False
if next-word-pos == ',': return False
if next-word-pos == '-NONE-': return False
if next-word-pos == '.': return False
if next-word-pos == ':':
  if prev-word-pos == ',': return False
  if prev-word-pos == '.': return False
  if prev-word-pos == ':': return False
  if prev-word-pos == '<START>': return True
  if prev-word-pos == 'CC': return False
  if prev-word-pos == 'CD': return False
  if prev-word-pos == 'DT': return False
  if prev-word-pos == 'IN': return False
  if prev-word-pos == 'JJ': return False
  if prev-word-pos == 'JJS': return False
  if prev-word-pos == 'MD': return False
  if prev-word-pos == 'NN': return False
  if prev-word-pos == 'NNP': return False
  if prev-word-pos == 'NNS': return False
  if prev-word-pos == 'POS': return False
  if prev-word-pos == 'PRP': return False
  if prev-word-pos == 'PRP$': return False
  if prev-word-pos == 'RB': return False
  if prev-word-pos == 'RP': return False
  if prev-word-pos == 'TO': return False
  if prev-word-pos == 'VB': return False
  if prev-word-pos == 'VBD': return False
  if prev-word-pos == 'VBG': return False
  if prev-word-pos == 'VBN': return True
  if prev-word-pos == 'VBP': return False
  if prev-word-pos == 'VBZ': return False
if next-word-pos == '<END>': return False
if next-word-pos == 'CC': return False
if next-word-pos == 'CD':
  if word-pos == '$': return False
  if word-pos == ',': return False
  if word-pos == ':': return True
  if word-pos == 'CD': return True
  if word-pos == 'DT': return False
  if word-pos == 'IN': return False
  if word-pos == 'JJ': return False
  if word-pos == 'JJR': return False
  if word-pos == 'JJS': return False
  if word-pos == 'NN': return False
  if word-pos == 'NNP': return False
  if word-pos == 'PRP$': return False
  if word-pos == 'RB': return False
  if word-pos == 'VB': return False
  if word-pos == 'VBD': return False
  if word-pos == 'VBG': return False
  if word-pos == 'VBN': return False
  if word-pos == 'VBP': return False
  if word-pos == 'VBZ': return False
  if word-pos == 'WDT': return False
  if word-pos == '``': return False
if next-word-pos == 'DT': return False
if next-word-pos == 'EX': return False
if next-word-pos == 'IN': return False
if next-word-pos == 'JJ': return False
if next-word-pos == 'JJR': return False
if next-word-pos == 'JJS': return False
if next-word-pos == 'MD': return False
if next-word-pos == 'NN': return False
if next-word-pos == 'NNP': return False
if next-word-pos == 'NNPS': return False
if next-word-pos == 'NNS': return False
if next-word-pos == 'PDT': return False
if next-word-pos == 'POS': return False
if next-word-pos == 'PRP': return False
if next-word-pos == 'PRP$': return False
if next-word-pos == 'RB': return False
if next-word-pos == 'RBR': return False
if next-word-pos == 'RBS': return False
if next-word-pos == 'RP': return False
if next-word-pos == 'TO': return False
if next-word-pos == 'UH': return False
if next-word-pos == 'VB': return False
if next-word-pos == 'VBD': return False
if next-word-pos == 'VBG': return False
if next-word-pos == 'VBN': return False
if next-word-pos == 'VBP': return False
if next-word-pos == 'VBZ': return False
if next-word-pos == 'WDT': return False
if next-word-pos == 'WP': return False
if next-word-pos == 'WRB': return False
if next-word-pos == '``': return False
I like that it’s identified the ':' pattern - a word at the start of a sentence followed by a colon is classified as a speaker, which matches the 'Speaker: dialogue' structure of the transcripts.
Next I need to drill into the types of sentence structures that it’s failing on and work out some features that can handle those. I also still need to see how well a random forest of decision trees would perform.