A newbie's guide to querying Wikidata
After reading one of Jesús Barrasa’s recent QuickGraph posts about enriching a knowledge graph with data from Wikidata, I wanted to learn how to query the Wikidata API so that I could pull in the data for my own QuickGraphs.
I want to look up information about tennis players, and one of my favourite players is Nick Kyrgios, so this blog post is going to be all about him.
So what is Wikidata?
Wikidata is a collaboratively edited knowledge base. It is a source of open data that you may want to use in your projects. Wikidata offers a query service for integrations.
QuickGraph#10 Enrich your Neo4j Knowledge Graph by querying Wikidata
The query service can be accessed by navigating to query.wikidata.org. If we go there, we’ll see the following screen:
There are a bunch of examples that we can pick from, but we’re going to start with something even simpler than that. Wikidata stores data in triples of the form (subject, predicate, object), so we’ll start with a query that returns one such triple:
SELECT * WHERE {
?subject ?predicate ?object
}
LIMIT 1
subject | predicate | object |
---|---|---|
Admittedly it’s not a very interesting triple, but we’re off and running. What we actually want to do is find triples about Nick Kyrgios, so let’s update our query to do that:
SELECT * WHERE {
?subject ?predicate "Nick Kyrgios"@en
}
This query finds any triples in Wikidata that have an object that matches the English language string "Nick Kyrgios". If we run this query, we’ll get the following results:
subject | predicate |
---|---|
So there are two triples that match Nick.
I think we’ll filter our query further to keep the one that returns a Wikidata URI, which means that we need to update the predicate in our query to be rdfs:label.
Let’s do that:
SELECT *
WHERE {
?person rdfs:label 'Nick Kyrgios'@en
}
person |
---|
We can navigate to that URI to see all the statements about Nick Kyrgios. One piece of information that we’d like to extract is his date of birth:
We can see from this screenshot that date of birth can be accessed via the property P569.
To do this we’ll construct the predicate wdt:P569, where:
In line with the SPARQL model of everything as a triple, the wdt: namespace contains manifestations of properties as simple predicates that can directly connect an item to a value.
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries
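To make the prefix idea concrete, here is a small Python sketch that expands a prefixed name such as wdt:P569 into the full IRI that it abbreviates. The prefix IRIs are the ones that the Wikidata query service predefines; the `PREFIXES` table and the `expand` helper are names made up for this illustration and are not part of any library.

```python
# Prefixes that the Wikidata query service declares for us (only the
# three used in this post; the service predefines many more).
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name like 'wdt:P569' into a full IRI.

    Hypothetical helper, for illustration only.
    """
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return PREFIXES[prefix] + local


if __name__ == "__main__":
    for name in ("wdt:P569", "rdfs:label"):
        print(name, "->", expand(name))
```

So when we write wdt:P569 in a query, the service treats it as the IRI of the "direct" form of the date of birth property.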
Let’s update our query to return date of birth:
SELECT *
WHERE {
?person rdfs:label 'Nick Kyrgios'@en .
?person wdt:P569 ?dateOfBirth
}
We can simplify this query by using the ; syntax to construct multiple statements around ?person.
A more concise version is shown below:
SELECT *
WHERE {
?person rdfs:label 'Nick Kyrgios'@en ;
wdt:P569 ?dateOfBirth
}
| person | dateOfBirth |
|---|---|
| | 1995-04-27T00:00:00Z |
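If we fetch these results as JSON rather than reading them in the browser, each row arrives as a "binding" that maps variable names to typed values. Here is a minimal Python sketch of unpacking one, assuming the standard SPARQL JSON results layout. The sample payload is hand-written for illustration: the entity URI in it is a placeholder rather than Nick's real identifier, and `parse_rows` is a made-up helper name.

```python
import json
from datetime import date, datetime

# Hand-written sample in the SPARQL JSON results layout.
# "Q_PLACEHOLDER" is NOT a real Wikidata ID; it stands in for the person URI.
SAMPLE = """
{
  "head": {"vars": ["person", "dateOfBirth"]},
  "results": {"bindings": [
    {
      "person": {"type": "uri",
                 "value": "http://www.wikidata.org/entity/Q_PLACEHOLDER"},
      "dateOfBirth": {"type": "literal",
                      "datatype": "http://www.w3.org/2001/XMLSchema#dateTime",
                      "value": "1995-04-27T00:00:00Z"}
    }
  ]}
}
"""


def parse_rows(payload: str) -> list[tuple[str, date]]:
    """Return (person URI, date of birth) pairs from a JSON results payload."""
    rows = []
    for binding in json.loads(payload)["results"]["bindings"]:
        person = binding["person"]["value"]
        # Values look like 1995-04-27T00:00:00Z; we keep only the date part.
        born = datetime.strptime(
            binding["dateOfBirth"]["value"], "%Y-%m-%dT%H:%M:%SZ"
        ).date()
        rows.append((person, born))
    return rows


if __name__ == "__main__":
    for person, born in parse_rows(SAMPLE):
        print(person, born.isoformat())
```

Turning the timestamp into a plain date is a choice made for this sketch; the midnight time component adds nothing for a date of birth.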
Next we’d like to pull in the country of citizenship, which is property P27.
SELECT *
WHERE {
?person rdfs:label 'Nick Kyrgios'@en ;
wdt:P569 ?dateOfBirth;
wdt:P27 ?country
}
| person | dateOfBirth | country |
|---|---|---|
| | 1995-04-27T00:00:00Z | |
This query returns the entity representing Australia, but what if we want to return the name of the country rather than its URI?
We can return this by using the rdfs:label predicate:
SELECT *
WHERE {
?person rdfs:label 'Nick Kyrgios'@en ;
wdt:P569 ?dateOfBirth;
wdt:P27 ?country .
?country rdfs:label ?countryName
}
| person | dateOfBirth | country | countryName |
|---|---|---|---|
| | 1995-04-27T00:00:00Z | | Australia |
| | 1995-04-27T00:00:00Z | | Awıstralya |
| | 1995-04-27T00:00:00Z | | Awstralska |
| | 1995-04-27T00:00:00Z | | अस्ट्रेलिया |
| … | | | |
| | 1995-04-27T00:00:00Z | | Аѵстралїꙗ |
| | 1995-04-27T00:00:00Z | | Австрали |
| | 1995-04-27T00:00:00Z | | Awstralia |
Wow, that returned a lot more rows than we were expecting! The problem is that we’ve returned country names in every single language when actually we only want the English version.
We can fix that by applying a filter on the language of countryName:
SELECT *
WHERE {
?person rdfs:label 'Nick Kyrgios'@en ;
wdt:P569 ?dateOfBirth;
wdt:P27 ?country .
?country rdfs:label ?countryName
filter(lang(?countryName) = "en")
}
| person | dateOfBirth | country | countryName |
|---|---|---|---|
| | 1995-04-27T00:00:00Z | | Australia |
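For intuition, the filter is doing roughly the same job as the Python sketch below, which keeps only the labels tagged as English. The list of labels is a small hand-written sample (not the real, much longer result set), the `xml:lang` key assumes the standard SPARQL JSON results layout for language-tagged literals, and `english_only` is a made-up helper name.

```python
# Hand-written sample of language-tagged labels, shaped like the literals
# in SPARQL JSON results. Not the real, much longer result set.
LABELS = [
    {"type": "literal", "xml:lang": "en", "value": "Australia"},
    {"type": "literal", "xml:lang": "fr", "value": "Australie"},
    {"type": "literal", "xml:lang": "de", "value": "Australien"},
]


def english_only(labels: list[dict]) -> list[str]:
    """Mimic filter(lang(?countryName) = "en") on a list of literals."""
    return [
        label["value"]
        for label in labels
        # Literals without a language tag have no "xml:lang" key at all.
        if label.get("xml:lang") == "en"
    ]


if __name__ == "__main__":
    print(english_only(LABELS))
```

Back in the query service, the filter is what leaves us with the single English row in the results above.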
That’s more like it! But we’re still returning the URI for Australia when we only want the country name.
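We can fix that by changing the fields returned in our SELECT statement, or we could use the [] (blank node) syntax to go from the person to the country name in one statement, without needing to bind the country variable.
The following query does this. It also adds a wdt:P106 wd:Q10833314 pattern, which restricts the match to people whose occupation (P106) is tennis player (Q10833314):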
SELECT *
WHERE { ?person wdt:P106 wd:Q10833314 ;
rdfs:label 'Nick Kyrgios'@en ;
wdt:P569 ?dateOfBirth ;
wdt:P27 [ rdfs:label ?countryName ] .
filter(lang(?countryName) = "en")
}
| person | dateOfBirth | countryName |
|---|---|---|
| | 1995-04-27T00:00:00Z | Australia |
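Since the goal is to pull this data into other projects, here is a hedged Python sketch of sending that final query to the service from code. It assumes the SPARQL endpoint at https://query.wikidata.org/sparql and asks for JSON via the `format=json` parameter. The names `player_query`, `build_request` and `fetch_rows`, along with the User-Agent string, are made up for this example, and nothing is sent over the network unless `fetch_rows` is called.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"  # assumed endpoint URL

# Same shape as the final query in this post, but with a double-quoted
# name so that the escaping below applies. Doubled braces are str.format
# escapes for literal { and }.
QUERY_TEMPLATE = """
SELECT * WHERE {{
  ?person wdt:P106 wd:Q10833314 ;
          rdfs:label "{name}"@en ;
          wdt:P569 ?dateOfBirth ;
          wdt:P27 [ rdfs:label ?countryName ] .
  filter(lang(?countryName) = "en")
}}
"""


def player_query(name: str) -> str:
    """Fill a player's name into the query, escaping backslashes and quotes."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.format(name=escaped)


def build_request(query: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the query service."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        # Placeholder User-Agent; replace with something identifying your app.
        headers={"User-Agent": "wikidata-newbie-example/0.1"},
    )


def fetch_rows(name: str) -> list[dict]:
    """Send the query (a real network call) and return the result bindings."""
    with urllib.request.urlopen(build_request(player_query(name))) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `fetch_rows("Nick Kyrgios")` would then return one dictionary per result row, keyed by the variable names in the query.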
That’s all the data that we want to extract for now, but if we wanted to get more stuff it wouldn’t be too difficult to extend our query.
And thanks to Jesús for his help with understanding the SPARQL syntax enough to get my queries working.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.