5 Jan 2024

Pandas: Exclude columns using regex

After a few months of using ClickHouse, I’ve got quite used to using the SELECT <expr> EXCEPT modifier, which lets you remove columns based on a regular expression. I wanted to do something similar when working with some data in Pandas and in this blog we’ll explore how to do that.

We’re gonna be working with a CSV file of UK energy and gas tariffs for one of the energy providers. The CSV file looks like this:

Table 1. tariffs.csv
energyType	standingCharge	unitRate	startDate	endDate
electricity	47.56	29.38	2024-01-01	2024-03-30
gas	29.60	7.45	2024-01-01	2024-03-30
electricity	44.961	26.758	2023-10-01	2023-12-31
gas	27.726	6.594	2023-10-01	2023-12-31
electricity	44.961	29.428	2023-07-01	2023-09-30
gas	27.726	7.179	2023-07-01	2023-09-30
electricity	44.961	49.117	2023-04-01	2023-06-30
gas	27.726	11.990	2023-04-01	2023-06-30
electricity	40.646	65.555	2023-01-01	2023-03-30
gas	27.128	16.229	2023-01-01	2023-03-30
electricity	40.646	50.630	2022-10-01	2022-12-31
gas	27.128	14.021	2022-10-01	2022-12-31

We can read it with Pandas like this:

import pandas as pd

tariffs = pd.read_csv("tariffs.csv")
print(tariffs.to_string(index=False))

Output

 energyType  standingCharge  unitRate  startDate    endDate
electricity          47.560    29.380 2024-01-01 2024-03-30
        gas          29.600     7.450 2024-01-01 2024-03-30
electricity          44.961    26.758 2023-10-01 2023-12-31
        gas          27.726     6.594 2023-10-01 2023-12-31
electricity          44.961    29.428 2023-07-01 2023-09-30
        gas          27.726     7.179 2023-07-01 2023-09-30
electricity          44.961    49.117 2023-04-01 2023-06-30
        gas          27.726    11.990 2023-04-01 2023-06-30
electricity          40.646    65.555 2023-01-01 2023-03-30
        gas          27.128    16.229 2023-01-01 2023-03-30
electricity          40.646    50.630 2022-10-01 2022-12-31
        gas          27.128    14.021 2022-10-01 2022-12-31

Now, let’s say we only want to return fields that contain the term Date. We can use the loc function where we only include columns that contain this term as the column argument:

print(
    (tariffs
      .loc[:, tariffs.columns.str.contains("Date")]
      .to_string(index=False)
    )
)

Output

 startDate    endDate
2024-01-01 2024-03-30
2024-01-01 2024-03-30
2023-10-01 2023-12-31
2023-10-01 2023-12-31
2023-07-01 2023-09-30
2023-07-01 2023-09-30
2023-04-01 2023-06-30
2023-04-01 2023-06-30
2023-01-01 2023-03-30
2023-01-01 2023-03-30
2022-10-01 2022-12-31
2022-10-01 2022-12-31

And what about if we want to get rid of the Date fields? We can add the ~ prefix:

print(
    (tariffs
      .loc[:, ~tariffs.columns.str.contains("Date")]
      .to_string(index=False)
    )
)

Output

 energyType  standingCharge  unitRate
electricity          47.560    29.380
        gas          29.600     7.450
electricity          44.961    26.758
        gas          27.726     6.594
electricity          44.961    29.428
        gas          27.726     7.179
electricity          44.961    49.117
        gas          27.726    11.990
electricity          40.646    65.555
        gas          27.128    16.229
electricity          40.646    50.630
        gas          27.128    14.021

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.