Pandas: Filter column value in array/list - ValueError: The truth value of a Series is ambiguous
The UK government publishes Coronavirus vaccinations data on coronavirus.data.gov.uk, but I wanted to create some different visualisations so I downloaded the data and have been playing with it in the mneedham/covid-vaccines GitHub repository.
I massaged the data so that I have rows in a Pandas DataFrame representing the numbers of first doses, second doses, and total doses done each day. I then wanted to filter this DataFrame based on the type of dose, but initially got a bit stuck. In this blog post we’ll explore how I got around this problem.
The DataFrame that we’re working with looks like the one below:
df = pd.DataFrame({
"date": ["2021-03-25", "2021-03-25", "2021-03-25"],
"dose": ["firstDose", "secondDose", "totalDoses"],
"vaccinations": [324942, 234382, 559324]
})
date | dose | vaccinations | |
---|---|---|---|
0 |
2021-03-25 |
firstDose |
324942 |
1 |
2021-03-25 |
secondDose |
234382 |
2 |
2021-03-25 |
totalDoses |
559324 |
I wanted to filter the DataFrame to return only the rows with a dose
value of firstDose
or secondDose
.
One way of doing this would be to combine filters for firstDose
or secondDose
.
I initially tried to do that using the Python or
clause:
df[(df.dose == "firstDose") or (df.dose == "secondDose")]
But that throws the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/markhneedham/.local/share/virtualenvs/covid-vaccines-xEbcGJTy/lib/python3.8/site-packages/pandas/core/generic.py", line 1442, in __nonzero__
raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Instead we have to use the |
operator, as shown below:
df[(df.dose == "firstDose") | (df.dose == "secondDose")]
date | dose | vaccinations | |
---|---|---|---|
0 |
2021-03-25 |
firstDose |
324942 |
1 |
2021-03-25 |
secondDose |
234382 |
That works fine if we’re filtering on two values, but if we wanted to add a 3rd or 4th filter, our expression would get more complicated. Instead it would be cool if we could filter on a list of values. I initially tried to do this using the following code:
df[df.dose in ["firstDose", "secondDose"]]
But that takes us back to the same error that we had before:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/markhneedham/.local/share/virtualenvs/covid-vaccines-xEbcGJTy/lib/python3.8/site-packages/pandas/core/generic.py", line 1442, in __nonzero__
raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This time we can use Pandas' isin
function to solve the problem, as shown below:
df[df.dose.isin(["firstDose", "secondDose"])]
date | dose | vaccinations | |
---|---|---|---|
0 |
2021-03-25 |
firstDose |
324942 |
1 |
2021-03-25 |
secondDose |
234382 |
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.