Pandas: Compare values in DataFrame to previous days
I’m still playing around with Covid vaccine data, this time exploring how the number of doses varies week by week. I want to know how many more (or less) vaccines have been done on a given day compared to that same day last week.
We’ll be using Pandas in this blog post, so let’s first install that library and import it:
pip install pandas
import pandas as pd
And now let’s create a DataFrame containing a subset of the data that I’m working with:
df = pd.DataFrame([
{'date': '2021-01-11', 'vaccinations': 165844.0},
{'date': '2021-01-18', 'vaccinations': 208641.0},
{'date': '2021-01-25', 'vaccinations': 281725.0},
{'date': '2021-02-01', 'vaccinations': 352935.0},
{'date': '2021-02-08', 'vaccinations': 356291.0}
])
date | vaccinations |
---|---|
2021-01-11 |
165844.0 |
2021-01-18 |
208641.0 |
2021-01-25 |
281725.0 |
2021-02-01 |
352935.0 |
2021-02-08 |
356291.0 |
I’ve filtered the data to include the data for 5 Mondays start from the beginning of January. In the real data set we have the data for every day from January until today.
I want to add a column that shows the number of vaccinations done on the previous week. So:
-
for
2021-01-18
, we’d have the value for2021-01-11
-
for
2021-01-25
, we’d have the value for2021-02-18
and so on.
It took me a while to find the function to do this, but it turns out that pandas.Series.shift
is what we want.
This function shifts the index by the desired number of periods.
We’ll pass in periods=1
because we want to shift every row down by 1:
df["vaccinations"].shift(periods=1)
Note
|
I should note that for the real DataFrame I had data for every day and therefore wanted to compare the data from 7 rows earlier in the DataFrame,which meant I had to use |
vaccinations |
---|
NaN |
165844.0 |
208641.0 |
281725.0 |
352935.0 |
We can then add a new column to our DataFrame using the following code:
df.loc[:, "lastWeekVaccinations"] = df["vaccinations"].shift(periods=1)
date | vaccinations | lastWeekVaccinations |
---|---|---|
2021-01-11 |
165844.0 |
NaN |
2021-01-18 |
208641.0 |
165844.0 |
2021-01-25 |
281725.0 |
208641.0 |
2021-02-01 |
352935.0 |
281725.0 |
2021-02-08 |
356291.0 |
352935.0 |
If we then wanted to compute the difference between this week and last week, we could compute this by writing the following code:
df["vaccinations"] - df["lastWeekVaccinations"]
Or we could use the pandas.Series.diff
function, which achieves the same thing:
df.loc[:, "lastWeekVaccinationsDiff"] = df["vaccinations"].diff(periods=1)
date | vaccinations | lastWeekVaccinations | lastWeekVaccinationsDiff |
---|---|---|---|
2021-01-11 |
165844.0 |
NaN |
NaN |
2021-01-18 |
208641.0 |
165844.0 |
42797.0 |
2021-01-25 |
281725.0 |
208641.0 |
73084.0 |
2021-02-01 |
352935.0 |
281725.0 |
71210.0 |
2021-02-08 |
356291.0 |
352935.0 |
3356.0 |
We can also compute the percentage change between the weeks using pandas.Series.pct_change
:
df.loc[:, "lastWeekVaccinationsChange"] = df["vaccinations"].pct_change(periods=1)
date | vaccinations | lastWeekVaccinations | lastWeekVaccinationsDiff | lastWeekVaccinationsChange |
---|---|---|---|---|
2021-01-11 |
165844.0 |
NaN |
NaN |
NaN |
2021-01-18 |
208641.0 |
165844.0 |
42797.0 |
0.2580557632473892 |
2021-01-25 |
281725.0 |
208641.0 |
73084.0 |
0.3502858977861494 |
2021-02-01 |
352935.0 |
281725.0 |
71210.0 |
0.25276422042772206 |
2021-02-08 |
356291.0 |
352935.0 |
3356.0 |
0.009508833071245393 |
With these functions I’ve been able to do exactly what I wanted and could then create a chart based on this data.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.