Pandas: Exclude columns using regex
After a few months of using ClickHouse, I’ve got quite used to using the SELECT <expr> EXCEPT
modifier, which lets you remove columns based on a regular expression.
I wanted to do something similar when working with some data in Pandas and in this blog we’ll explore how to do that.
We’re gonna be working with a CSV file of UK energy and gas tariffs for one of the energy providers. The CSV file looks like this:
energyType | standingCharge | unitRate | startDate | endDate |
---|---|---|---|---|
electricity |
47.56 |
29.38 |
2024-01-01 |
2024-03-30 |
gas |
29.60 |
7.45 |
2024-01-01 |
2024-03-30 |
electricity |
44.961 |
26.758 |
2023-10-01 |
2023-12-31 |
gas |
27.726 |
6.594 |
2023-10-01 |
2023-12-31 |
electricity |
44.961 |
29.428 |
2023-07-01 |
2023-09-30 |
gas |
27.726 |
7.179 |
2023-07-01 |
2023-09-30 |
electricity |
44.961 |
49.117 |
2023-04-01 |
2023-06-30 |
gas |
27.726 |
11.990 |
2023-04-01 |
2023-06-30 |
electricity |
40.646 |
65.555 |
2023-01-01 |
2023-03-30 |
gas |
27.128 |
16.229 |
2023-01-01 |
2023-03-30 |
electricity |
40.646 |
50.630 |
2022-10-01 |
2022-12-31 |
gas |
27.128 |
14.021 |
2022-10-01 |
2022-12-31 |
We can read it with Pandas like this:
import pandas as pd
tariffs = pd.read_csv("tariffs.csv")
print(tariffs.to_string(index=False))
energyType standingCharge unitRate startDate endDate
electricity 47.560 29.380 2024-01-01 2024-03-30
gas 29.600 7.450 2024-01-01 2024-03-30
electricity 44.961 26.758 2023-10-01 2023-12-31
gas 27.726 6.594 2023-10-01 2023-12-31
electricity 44.961 29.428 2023-07-01 2023-09-30
gas 27.726 7.179 2023-07-01 2023-09-30
electricity 44.961 49.117 2023-04-01 2023-06-30
gas 27.726 11.990 2023-04-01 2023-06-30
electricity 40.646 65.555 2023-01-01 2023-03-30
gas 27.128 16.229 2023-01-01 2023-03-30
electricity 40.646 50.630 2022-10-01 2022-12-31
gas 27.128 14.021 2022-10-01 2022-12-31
Now, let’s say we only want to return fields that contain the term Date
.
We can use the loc
function where we only include columns that contain this term as the column argument:
print(
(tariffs
.loc[:, tariffs.columns.str.contains("Date")]
.to_string(index=False)
)
)
startDate endDate
2024-01-01 2024-03-30
2024-01-01 2024-03-30
2023-10-01 2023-12-31
2023-10-01 2023-12-31
2023-07-01 2023-09-30
2023-07-01 2023-09-30
2023-04-01 2023-06-30
2023-04-01 2023-06-30
2023-01-01 2023-03-30
2023-01-01 2023-03-30
2022-10-01 2022-12-31
2022-10-01 2022-12-31
And what about if we want to get rid of the Date
fields?
We can add the ~
prefix:
print(
(tariffs
.loc[:, ~tariffs.columns.str.contains("Date")]
.to_string(index=False)
)
)
energyType standingCharge unitRate
electricity 47.560 29.380
gas 29.600 7.450
electricity 44.961 26.758
gas 27.726 6.594
electricity 44.961 29.428
gas 27.726 7.179
electricity 44.961 49.117
gas 27.726 11.990
electricity 40.646 65.555
gas 27.128 16.229
electricity 40.646 50.630
gas 27.128 14.021
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.