Python: Sorting lists of dictionaries with sortedcontainers
I was recently working on a Kafka streams data generator, where I only wanted to publish events once the time on those events had been reached. To solve this problem I needed a sorted list and in this blog post we’re going to explore how I went about doing this.
Note
|
I’ve created a video showing how to do this on my YouTube channel, Learn Data with Mark, so if you prefer to consume content through that medium, I’ve embedded it below: |
Unsorted lists
We’re going to start out by created a list of dictionaries that contain a timestamp and name, as shown below:
import datetime as dt
now = dt.datetime.now()
items = [
{"timestamp": now, "name": "Mark"},
{"timestamp": now + dt.timedelta(hours=1), "name": "Dunith"},
{"timestamp": now - dt.timedelta(minutes=27), "name": "Michael"}
]
If we want to print out this list, we could do so using a for loop:
for item in items:
print(item)
We’d see the following output:
{'timestamp': datetime.datetime(2022, 11, 24, 7, 36, 31, 757235), 'name': 'Mark'}
{'timestamp': datetime.datetime(2022, 11, 24, 8, 36, 31, 757235), 'name': 'Dunith'}
{'timestamp': datetime.datetime(2022, 11, 24, 7, 9, 31, 757235), 'name': 'Michael'}
As you’d probably expect, this is sorted in insertion order.
If we want to sort the results we could wrap items
in the sorted
function, like this:
for item in sorted(items, key=lambda x: x["timestamp"]):
print(item)
In this code sample we’re sorting by the timestamp
property, and if we run this code, we’ll see the following output:
{'timestamp': datetime.datetime(2022, 11, 24, 7, 9, 31, 757235), 'name': 'Michael'}
{'timestamp': datetime.datetime(2022, 11, 24, 7, 36, 31, 757235), 'name': 'Mark'}
{'timestamp': datetime.datetime(2022, 11, 24, 8, 36, 31, 757235), 'name': 'Dunith'}
This time Michael comes first, then Mark, and finally Dunith.
When you want to do some adhoc sorting, this is a reasonable approach, but I wanted a list that’s always sorted.
sortedcontainers
Enter the sortedcontainers library, a collections library written in Python that’s super fast and supports sorted lists, maps, and dictionaries.
We’re only going to use the list, so let’s import that:
from sortedcontainers import SortedList
When we create an instance of SortedList
we need to pass in a lamba expression that describes our sorting key:
sorted_items = SortedList(key=lambda x: x["timestamp"])
We can have it sort an existing list using the update
function:
sorted_items.update(items)
Let’s print that list:
for item in sorted_items:
print(item)
{'timestamp': datetime.datetime(2022, 11, 24, 7, 9, 31, 757235), 'name': 'Michael'}
{'timestamp': datetime.datetime(2022, 11, 24, 7, 36, 31, 757235), 'name': 'Mark'}
{'timestamp': datetime.datetime(2022, 11, 24, 8, 36, 31, 757235), 'name': 'Dunith'}
So far so good. We can also add items to the list:
sorted_items.add(
{"timestamp": now + dt.timedelta(minutes=12), "name": "Jennifer"}
)
for item in sorted_items:
print(item)
{'timestamp': datetime.datetime(2022, 11, 24, 7, 9, 31, 757235), 'name': 'Michael'}
{'timestamp': datetime.datetime(2022, 11, 24, 7, 36, 31, 757235), 'name': 'Mark'}
{'timestamp': datetime.datetime(2022, 11, 24, 7, 48, 31, 757235), 'name': 'Jennifer'}
{'timestamp': datetime.datetime(2022, 11, 24, 8, 36, 31, 757235), 'name': 'Dunith'}
And as we’d expect, the Jennifer list item gets sorted between Mark and Dunith.
The classic list operations are sorted as well.
{"timestamp": now, "name": "Mark"} in sorted_items
True
for item in (sorted_items + sorted_items):
print(item)
{'timestamp': datetime.datetime(2022, 11, 24, 7, 9, 31, 757235), 'name': 'Michael'}
{'timestamp': datetime.datetime(2022, 11, 24, 7, 9, 31, 757235), 'name': 'Michael'}
{'timestamp': datetime.datetime(2022, 11, 24, 7, 36, 31, 757235), 'name': 'Mark'}
{'timestamp': datetime.datetime(2022, 11, 24, 7, 36, 31, 757235), 'name': 'Mark'}
{'timestamp': datetime.datetime(2022, 11, 24, 7, 48, 31, 757235), 'name': 'Jennifer'}
{'timestamp': datetime.datetime(2022, 11, 24, 7, 48, 31, 757235), 'name': 'Jennifer'}
{'timestamp': datetime.datetime(2022, 11, 24, 8, 36, 31, 757235), 'name': 'Dunith'}
{'timestamp': datetime.datetime(2022, 11, 24, 8, 36, 31, 757235), 'name': 'Dunith'}
sorted_items.remove({"timestamp": now, "name": "Mark"})
for item in sorted_items:
print(item)
{'timestamp': datetime.datetime(2022, 11, 24, 7, 9, 31, 757235), 'name': 'Michael'}
{'timestamp': datetime.datetime(2022, 11, 24, 7, 48, 31, 757235), 'name': 'Jennifer'}
{'timestamp': datetime.datetime(2022, 11, 24, 8, 36, 31, 757235), 'name': 'Dunith'}
In Summary
So far my experience with the sortedcontainers library is great - it’s done everything I expected and is really easy to use. Give it a try!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.