DuckDB: Generate dummy data with user defined functions (UDFs)
In the 0.8 release of DuckDB, they added functionality that lets you add your own functions when using the Python package I wanted to see if I could use it to generate dummy data so that’s what we’re going to do in this blog post.
Note
|
I’ve created a video showing how to do this on my YouTube channel, Learn Data with Mark, so if you prefer to consume content through that medium, I’ve embedded it below: |
We’re going to do this using DuckDB’s Python package. We’ll install that, along with the Faker library, by running the following:
pip install duckdb faker
I then created a function to generate a fake person:
import faker
fake = faker.Faker()
def generate_person():
person = {
'name': fake.name(),
'city': fake.city(),
'state': fake.state(),
'zip_code': fake.zipcode(),
'country': fake.country(),
'email': fake.email(),
'job': fake.job(),
'company': fake.company(),
'ssn': fake.ssn(),
'birthdate': fake.date_of_birth(),
'phone_number': fake.phone_number()
}
return person
Now we need to create a DuckDB database and register the function, which we’ll do with the following code:
import duckdb
from duckdb.typing import *
con.create_function(
'generate_person',
generate_person,
[],
duckdb.struct_type({
'name': 'VARCHAR',
'city': 'VARCHAR',
'state': 'VARCHAR',
'zip_code': 'VARCHAR',
'country': 'VARCHAR',
'email': 'VARCHAR',
'job': 'VARCHAR',
'company': 'VARCHAR',
'ssn': 'VARCHAR',
'birthdate': 'DATE',
'phone_number': 'VARCHAR'
})
)
A dictionary in Python maps to the duckdb.struct_type
type in DuckDB.
We can then pass in a map of the fields and their data types.
We can then use that function in a query like this:
con.sql("""
CREATE OR REPLACE TABLE people AS
SELECT person.* FROM (
SELECT generate_person(random()) AS person
FROM generate_series(1,10000)
)
""")
When I ran this script it was returning the same person repeatedly, which makes me think the function was being cached.
To work around that, I add a seed
parameter to the function, resulting in the following code:
def generate_person(seed):
person = {
'name': fake.name(),
'city': fake.city(),
'state': fake.state(),
'zip_code': fake.zipcode(),
'country': fake.country(),
'email': fake.email(),
'job': fake.job(),
'company': fake.company(),
'ssn': fake.ssn(),
'birthdate': fake.date_of_birth(),
'phone_number': fake.phone_number()
}
return person
con.create_function(
'generate_person',
generate_person,
[DOUBLE],
duckdb.struct_type({
'name': 'VARCHAR',
'city': 'VARCHAR',
'state': 'VARCHAR',
'zip_code': 'VARCHAR',
'country': 'VARCHAR',
'email': 'VARCHAR',
'job': 'VARCHAR',
'company': 'VARCHAR',
'ssn': 'VARCHAR',
'birthdate': 'DATE',
'phone_number': 'VARCHAR'
})
)
con.sql("""
CREATE OR REPLACE TABLE people AS
SELECT person.* FROM (
SELECT generate_person(random()) AS person
FROM generate_series(1,10000)
)
""")
And now it works!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.