Python: Learning about defaultdict's handling of missing keys
While reading the scikit-learn code I came across a snippet that puzzled me for a while but, in retrospect, is quite neat.
This is the code snippet that intrigued me:
vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__
Let’s quickly see how it works by adapting an example from scikit-learn:
>>> from collections import defaultdict
>>> vocabulary = defaultdict()
>>> vocabulary.default_factory = vocabulary.__len__
>>> vocabulary["foo"]
0
>>> vocabulary.items()
dict_items([('foo', 0)])
>>> vocabulary["bar"]
1
>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1)])
What seems to happen is that when we look up a key that doesn’t exist in the dictionary, an entry gets created whose value is the number of items already in the dictionary.
Let’s check if that assumption is correct by explicitly adding a key and then trying to find one that doesn’t exist:
>>> vocabulary["baz"] = "Mark"
>>> vocabulary["baz"]
'Mark'
>>> vocabulary["python"]
3
Now let’s see what the dictionary contains:
>>> vocabulary.items()
dict_items([('foo', 0), ('bar', 1), ('baz', 'Mark'), ('python', 3)])
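One detail worth keeping in mind: as far as I can tell from the Python documentation, only square-bracket lookups trigger the factory. Membership tests and get() don't go through __getitem__, so they leave the dictionary untouched. A small sketch to illustrate (the variable names has_bar, bar and size_after_checks are just mine):

```python
from collections import defaultdict

vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

vocabulary["foo"]                    # missing key, so it is stored with value 0

# Neither of these goes through __getitem__, so nothing gets inserted
has_bar = "bar" in vocabulary        # membership test only
bar = vocabulary.get("bar")          # get() falls back to None, not the factory

size_after_checks = len(vocabulary)  # "foo" is still the only entry
```

So if you only want to check whether a key is known without growing the dictionary, in or get() is the safer choice.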
That all makes sense so far. If we look at the docstring of defaultdict’s __missing__ method we can see that this is exactly what’s going on:
def __missing__(self, key):
    """
    __missing__(key) # Called by __getitem__ for missing key; pseudo-code:
      if self.default_factory is None: raise KeyError((key,))
      self[key] = value = self.default_factory()
      return value
    """
    pass
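To convince myself that the pseudo-code really explains the behaviour, here is a minimal sketch of the same idea written as a plain dict subclass. CountingDict is a hypothetical name of my own, not something from scikit-learn or the standard library, and it hard-codes len(self) instead of using a configurable default_factory:

```python
class CountingDict(dict):
    """Hypothetical dict subclass that mimics the defaultdict trick above."""

    def __missing__(self, key):
        # dict.__getitem__ calls this hook only when the key is absent
        value = len(self)   # number of entries before this one is added
        self[key] = value   # store it so later lookups find the key directly
        return value


counting = CountingDict()
first = counting["foo"]    # missing, so it gets index 0
second = counting["bar"]   # missing, so it gets index 1
again = counting["foo"]    # already present, __missing__ is not called
```

The defaultdict version gets the same result without a subclass, because a bound method such as vocabulary.__len__ is a perfectly good zero-argument factory.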
scikit-learn uses this code to store a mapping of features to their column position in a matrix, which is a perfect use case.
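Here is a rough sketch of how that kind of mapping could be built. This is my own simplified illustration rather than scikit-learn's actual code: the documents list is made-up sample data, and converting to a plain dict at the end is my own choice so that later lookups of unknown tokens raise a KeyError instead of silently adding columns.

```python
from collections import defaultdict

# Hypothetical, already tokenised documents
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "down"],
]

vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

# Each new token is assigned the next free column index on first lookup
rows = []
for tokens in documents:
    rows.append([vocabulary[token] for token in tokens])

# Freeze the mapping so unknown tokens no longer create new columns
vocabulary = dict(vocabulary)
```

After this runs, rows holds the column positions for each document and vocabulary maps every distinct token to a unique index in order of first appearance.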
All in all, very neat!