DuckDB/Python: Cannot combine LEFT and RIGHT relations of different connections!
I’ve been playing around with DuckDB over the weekend and ran into an interesting problem while using the Relational API in the Python package. We’re going to explore that in this blog post.
Set up
To get started, let’s install DuckDB:
pip install duckdb
And now let’s open a Python shell and import the package:
import duckdb
Next, let’s create a DuckDB connection and import the httpfs
module, which we’ll use in just a minute:
con = duckdb.connect('file.db')
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")
Importing ATP matches
We’re going to import some matches from the ATP tennis tour using the following code:
csv_files = [
f"https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{year}.csv"
for year in range(1968,2024)
]
con.execute("""
CREATE OR REPLACE TABLE matches AS
SELECT * FROM read_csv_auto($1, types={'winner_seed': 'VARCHAR', 'loser_seed': 'VARCHAR'})
""", [csv_files])
We can check that the table has been populated by running the following command:
con.table("matches").project("tourney_date, winner_name, loser_name, score").limit(5)
┌──────────────┬─────────────────┬─────────────────┬─────────┐
│ tourney_date │ winner_name │ loser_name │ score │
│ int64 │ varchar │ varchar │ varchar │
├──────────────┼─────────────────┼─────────────────┼─────────┤
│ 19680708 │ Douglas Smith │ Peter Ledbetter │ 6-1 7-5 │
│ 19680708 │ Louis Pretorius │ Maurice Pollock │ 6-1 6-1 │
│ 19680708 │ Cecil Pedlow │ John Mulvey │ 6-2 6-2 │
│ 19680708 │ Tom Okker │ Unknown Fearmon │ 6-1 6-1 │
│ 19680708 │ Armistead Neely │ Harry Sheridan │ 6-2 6-4 │
└──────────────┴─────────────────┴─────────────────┴─────────┘
Nadal vs Djokovic
I wanted to find the matches between Rafael Nadal and Novak Djokovic and in order to keep using the Relational API, I needed to use the sql
function to do this, which meant I couldn’t pass in any parameters.
I therefore ended up with the following code:
players_df = pd.DataFrame({"player1": ["Rafael Nadal"], "player2": ["Novak Djokovic"]})
players = duckdb.from_df(players_df).set_alias("players")
I then wrote the following code to find the matches won by Nadal:
matches = con.table("matches")
matches.join(players,
condition="matches.winner_name = players.player1 AND matches.loser_name = players.player2",
how="inner"
)
Unfortunately, this threw an error:
duckdb.Error: Cannot combine LEFT and RIGHT relations of different connections!
The mistake I made here is that the players
and matches
relations were created on different DuckDB connections, as the error suggests.
If we want to get them on the same connection, we need to update the following line:
players = duckdb.from_df(players_df)
To instead read:
players = con.from_df(players_df).set_alias("players")
This will make sure that the players
relation is part of the same connection as matches
.
I hadn’t realised that calling duckdb.from_df
actually adds the relation to the default DuckDB connection.
If we then run the join query again:
(matches
.join(players,
condition="matches.winner_name = players.player1 AND matches.loser_name = players.player2",
how="inner")
.project("tourney_name, tourney_date, round")
.limit(5)
)
We’ll get the following output:
┌──────────────────────┬──────────────┬─────────┐
│ tourney_name │ tourney_date │ round │
│ varchar │ int64 │ varchar │
├──────────────────────┼──────────────┼─────────┤
│ Roland Garros │ 20060529 │ QF │
│ Indian Wells Masters │ 20070305 │ F │
│ Rome Masters │ 20070507 │ QF │
│ Roland Garros │ 20070528 │ SF │
│ Wimbledon │ 20070625 │ SF │
└──────────────────────┴──────────────┴─────────┘
Success!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.