dbt-duckdb: KeyError: "'winner_seed'"
I’ve been building a little demo with dbt and DuckDB to transform CSV files from Jeff Sackmann’s tennis dataset and ran into an error that initially puzzled me. In this blog post, we’ll learn how to deal with it.
But first things first, we’re going to install dbt-duckdb
as well as the latest version of DuckDB, which at the time of writing is 0.9.0.
pip install dbt-duckdb duckdb
I then cloned Mehdi Ouazza’s demo project and adjusted it to work with my dataset.
I want to create a pipeline that takes one CSV file and removes all the columns that start with w_
or l_
.
The following query returns one row from the 2023 CSV file with the winner_seed
and loser_seed
fields explicitly coerced to a VARCHAR
type:
.mode line
FROM read_csv_auto('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2023.csv',
types={'winner_seed': 'VARCHAR', 'loser_seed': 'VARCHAR'}
)
LIMIT 1;
tourney_id = 2023-9900
tourney_name = United Cup
surface = Hard
draw_size = 18
tourney_level = A
tourney_date = 20230102
match_num = 300
winner_id = 126203
winner_seed = 3
winner_entry =
winner_name = Taylor Fritz
winner_hand = R
winner_ht = 193
winner_ioc = USA
winner_age = 25.1
loser_id = 126610
loser_seed = 5
loser_entry =
loser_name = Matteo Berrettini
loser_hand = R
loser_ht = 196
loser_ioc = ITA
loser_age = 26.7
score = 7-6(4) 7-6(6)
best_of = 3
round = F
minutes = 135
w_ace = 15
w_df = 2
w_svpt = 85
w_1stIn = 52
w_1stWon = 45
w_2ndWon = 16
w_SvGms = 12
w_bpSaved = 0
w_bpFaced = 0
l_ace = 7
l_df = 2
l_svpt = 97
l_1stIn = 62
l_1stWon = 47
l_2ndWon = 15
l_SvGms = 12
l_bpSaved = 9
l_bpFaced = 9
winner_rank = 9
winner_rank_points = 3355
loser_rank = 16
loser_rank_points = 2375
You can see that there are a lot of fields that begin with w_
and l_
, but I’m not particularly interested in them for my use case.
So, I created a source that contained the read_csv_auto
part of the query as the external_location
:
version: 2
- name: github_2023
meta:
external_location: >
read_csv_auto(
'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2023.csv',
types={'winner_seed': 'VARCHAR', 'loser_seed': 'VARCHAR'}
)
tables:
- name: matches_file
Once I’d done that, I created a model to query that CSV file, removed the fields, and wrote the output to a new CSV file. The model file looks like this:
{{ config(materialized='external', location='output/matches_2023.csv', format='csv')}} (1)
SELECT
COLUMNS(col -> NOT regexp_matches(col, 'w_.*') AND NOT regexp_matches(col, 'l_.*'))
FROM {{ source('github', 'matches_file') }} (2)
1 | Write the results to output/matches_2023.sql |
2 | Query the CSV file that we defined in the sources file |
I then ran the following command:
dbt run
Which resulted in the following error:
16:03:38 Running with dbt=1.6.4
16:03:38 Registered adapter: duckdb=1.6.0
16:03:38 Encountered an error:
"'winner_seed'"
16:03:38 Traceback (most recent call last):
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/cli/requires.py", line 87, in wrapper
result, success = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/cli/requires.py", line 72, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/cli/requires.py", line 143, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/cli/requires.py", line 172, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/cli/requires.py", line 219, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/cli/requires.py", line 246, in wrapper
manifest = ManifestLoader.get_full_manifest(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/parser/manifest.py", line 316, in get_full_manifest
manifest = loader.load()
^^^^^^^^^^^^^
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/parser/manifest.py", line 505, in load
patcher.construct_sources()
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/parser/sources.py", line 85, in construct_sources
parsed = self.parse_source(patched)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/parser/sources.py", line 189, in parse_source
parsed_source.relation_name = self._get_relation_name(parsed_source)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/parser/sources.py", line 290, in _get_relation_name
return str(relation_cls.create_from(self.root_project, node))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/adapters/base/relation.py", line 259, in create_from
return cls.create_from_source(node, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/data-pipeline-fzeuMXbM-py3.11/lib/python3.11/site-packages/dbt/adapters/duckdb/relation.py", line 34, in create_from_source
ext_location = ext_location_template.format_map(source_config.as_dict())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: "'winner_seed'"
It seems to be treating the {}
in the external_location
as if it were a Jinja template variable, which isn’t what we want.
Luckily the docs describe a way to work around this, by specifying formatter: oldstyle
:
version: 2
- name: github_2023
meta:
external_location: >
read_csv_auto(
'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2023.csv',
types={'winner_seed': 'VARCHAR', 'loser_seed': 'VARCHAR'}
)
formatter: oldstyle
tables:
- name: matches_file
Once we make that change the pipeline runs as expected and writes the results to output/matches_2023.csv
.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.