DuckDB: Ingest a bunch of CSV files from GitHub
Jeff Sackmann’s tennis_atp repository is one of the best collections of tennis data and I wanted to ingest the ATP Tour singles matches using the DuckDB CLI. In this blog post we’ll learn how to do that.
Usually when I’m ingesting data into DuckDB I’ll specify the files that I want to ingest using the wildcard syntax. In this case that would mean running a query like this:
CREATE OR REPLACE TABLE matches AS
SELECT *
FROM "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_*.csv"
But we can’t use that technique for files on GitHub because it’s not a file system. If we run the query above, we’ll get the following error message:
Error: Invalid Error: HTTP Error: Unable to connect to URL "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_*.csv": 404 (Not Found)
So, instead, we need to construct a list of URLs, which we can do using a combination of the generate_series
function that I covered in my last blog post and the list_transform
function.
We’ll use generate_series
to create a list of the years, like this:
SELECT range(1968, 1970)
range(1968, 1970) |
---|
[1968, 1969] |
And then we can use list_transform
to map or project over each value to construct a URL:
SELECT list_transform(
range(1968, 1970),
y -> 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_' || y || '.csv'
) AS files;
files |
---|
[https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1968.csv, https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1969.csv] |
And then finally we can bring it all together to ingest the files from 1968 until 2023:
CREATE OR REPLACE TABLE matches AS
SELECT * FROM read_csv_auto(
list_transform(
range(1968, 2023),
y -> 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_' || y || '.csv'
),
types={'winner_seed': 'VARCHAR', 'loser_seed': 'VARCHAR'}
);
It’ll take a few seconds, but once it’s done we can do an exploratory query to make sure everything was ingested:
SELECT count(*)
FROM matches;
count(*) |
---|
188191 |
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.