Python: Simplifying the creation of a stop word list with defaultdict
I’ve been playing around with topic models again and recently read a paper by David Mimno that suggested the following heuristic for working out which words should go onto the stop list:
A good heuristic for identifying such words is to remove those that occur in more than 5-10% of documents (most common) and those that occur fewer than 5-10 times in the entire corpus (least common).
I decided to try this out on the HIMYM dataset that I’ve been working on over the last couple of months.
I started out with the following code to build a dictionary of words, their total occurrences and the episodes they’d been used in:
import csv
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict

episodes = defaultdict(str)
with open("sentences.csv", "r") as file:
    reader = csv.reader(file, delimiter = ",")
    reader.next()  # skip the header row
    for row in reader:
        # concatenate each sentence onto its episode's running text
        episodes[row[1]] += row[4]
vectorizer = CountVectorizer(analyzer='word', min_df = 0, stop_words = 'english')
matrix = vectorizer.fit_transform(episodes.values())
features = vectorizer.get_feature_names()
words = {}
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        if not words.get(word):
            words[word] = {}
        if not words[word].get("score"):
            words[word]["score"] = 0
        words[word]["score"] += score
        if not words[word].get("episodes"):
            words[word]["episodes"] = set()
        if score > 0:
            words[word]["episodes"].add(doc_id)
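Once that’s run, each entry in 'words' maps a word to its total number of occurrences and the set of episode ids it showed up in. A hypothetical entry (the word and numbers here are made up) looks like this:

>>> words["umbrella"]
{'score': 27, 'episodes': set([12, 44, 101])}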
This works fine, but the code inside the last for block is ugly: most of it handles the case where parts of a dictionary haven’t been initialised yet, which is exactly the problem defaultdict exists to solve. You’ll notice I’m already using defaultdict in the first part of the code but not in the second, as I’d struggled to get it working there.
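As a refresher on why defaultdict helps here: with a plain dictionary, touching a missing key raises a KeyError, whereas a defaultdict creates the value on first access using whatever factory function you give it. That’s what makes the defaultdict(str) in the first block work:

>>> plain = {}
>>> plain["himym"] += "hello"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'himym'

>>> from collections import defaultdict
>>> texts = defaultdict(str)
>>> texts["himym"] += "hello"
>>> texts["himym"]
'hello'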
This was my first attempt at building the 'words' variable with it:
>>> words = defaultdict({})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: first argument must be callable
We can see why this doesn’t work if we try to evaluate '{}' as a function, which is what defaultdict does internally:
>>> {}()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'dict' object is not callable
Instead we need to pass in 'dict', which is callable and returns an empty dictionary:
>>> dict()
{}
>>> words = defaultdict(dict)
>>> words
defaultdict(<type 'dict'>, {})
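Now a brand new key gives us back an empty dictionary that we can immediately assign into:

>>> words["ted"]["score"] = 5
>>> words["ted"]
{'score': 5}
>>> words["barney"]
{}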
That simplifies the first bit of the loop:
words = defaultdict(dict)
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        if not words[word].get("score"):
            words[word]["score"] = 0
        words[word]["score"] += score
        if not words[word].get("episodes"):
            words[word]["episodes"] = set()
        if score > 0:
            words[word]["episodes"].add(doc_id)
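We can’t get rid of the remaining checks yet because, although the outer dictionary now creates entries on demand, the value it creates is a plain empty dict, so incrementing a key inside it still blows up:

>>> words = defaultdict(dict)
>>> words["ted"]["score"] += 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'score'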
We’ve still got a couple of other places to simplify, though, which we can do by defining a custom function and passing that into defaultdict:
def default_dict_function():
    return {"score": 0, "episodes": set()}
>>> words = defaultdict(default_dict_function)
>>> words
defaultdict(<function default_dict_function at 0x10963fcf8>, {})
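Now every new word starts out with a zero score and an empty set of episodes, so none of the initialisation checks are needed any more:

>>> words["robin"]["score"]
0
>>> words["robin"]["episodes"]
set([])
>>> words["robin"]["score"] += 2
>>> words["robin"]["episodes"].add(42)
>>> words["robin"]["score"]
2

If you don’t like having a standalone function for this, a lambda does the same job: words = defaultdict(lambda: {"score": 0, "episodes": set()}).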
And here’s the final product:
def default_dict_function():
    return {"score": 0, "episodes": set()}

words = defaultdict(default_dict_function)
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        words[word]["score"] += score
        if score > 0:
            words[word]["episodes"].add(doc_id)
After this we can write out the words that match either criterion to our stop list:
with open("stop_words.txt", "w") as file:
writer = csv.writer(file, delimiter = ",")
for word, value in words.iteritems():
# appears in > 10% of episodes
if len(value["episodes"]) > int(len(episodes) / 10):
writer.writerow([word.encode('utf-8')])
# less than 10 occurences
if value["score"] < 10:
writer.writerow([word.encode('utf-8')])
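One way to make use of the file afterwards (a sketch, assuming stop_words.txt ends up with one word per line) is to read it back in and hand it to CountVectorizer, whose stop_words parameter accepts a list as well as the 'english' string:

with open("stop_words.txt", "r") as file:
    custom_stop_words = [line.strip() for line in file]

vectorizer = CountVectorizer(analyzer='word', stop_words=custom_stop_words)
matrix = vectorizer.fit_transform(episodes.values())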