SPARQL: OR conditions in a WHERE clause using the UNION clause
This is part 4 of my series of posts about querying the Wikidata API, in which I learn how to use SPARQL’s UNION
clause to handle an OR condition in a WHERE
clause.
But first, some context!
After running queries against the Wikidata SPARQL API to pull the date of birth and nationality of tennis players into the Australian Open Graph, I noticed that several players hadn’t actually been updated. I ran some exploratory queries to work out why, and realised that the problem was that some players had anglicised names in my dataset, whereas Wikidata uses their real name.
For example in my dataset we have 'Nicolas Escude', whereas in Wikidata we have 'Nicolas Escudé'.
I wanted to figure out if there was a way to find players by an alternate name, or another name that they were known by, and found a StackOverflow answer that suggested using the skos.altLabel
predicate.
If we wanted to return the date of birth and nationality for the alternate name 'Nicolas Escude', we can write the following query:
SELECT *
WHERE {
?person wdt:P106 wd:Q10833314 ;
skos:altLabel "Nicolas Escude"@en;
rdfs:label ?playerName ;
wdt:P569 ?dateOfBirth;
wdt:P27 [ rdfs:label ?countryName ] .
filter(lang(?countryName) = "en")
filter(lang(?playerName) = "en")
}
If we run this query, we’ll see the following output:
person | playerName | dateOfBirth | countryName |
---|---|---|---|
Nicolas Escudé |
1976-04-03T00:00:00Z |
France |
So far so good. But not every player has an alternate name and we don’t know whether the name that we have is an alternate name or actual name.
For example, let’s try and find the data of birth and nationality for the alternate name 'Roger Federer':
SELECT *
WHERE {
?person wdt:P106 wd:Q10833314 ;
skos:altLabel "Roger Federer"@en;
rdfs:label ?playerName ;
wdt:P569 ?dateOfBirth;
wdt:P27 [ rdfs:label ?countryName ] .
filter(lang(?countryName) = "en")
filter(lang(?playerName) = "en")
}
If we run that query, we’ll see the following output:
person | playerName | dateOfBirth | countryName |
---|---|---|---|
No matching records found |
Not so good.
So we actually want to be able to match triples if either rdfs:label
or skos:altLabel
equal the name of the player.
I thought there might be a way to write an OR
clause within the WHERE
block, but as I understand it, the way to achieve this in SPARQL is via a UNION
statement.
I found Query #8 in SPARQL By Example helpful for understanding how to write such a query.
Let’s write a query that does this:
SELECT *
WHERE {
{ ?person rdfs:label "Nicolas Escude"@en }
UNION
{?person skos:altLabel "Nicolas Escude"@en}
?person wdt:P106 wd:Q10833314 ;
rdfs:label ?playerName ;
wdt:P569 ?dateOfBirth;
wdt:P27 [ rdfs:label ?countryName ] ;
skos:altLabel ?alternateName
filter(lang(?alternateName) = "en")
filter(lang(?countryName) = "en")
filter(lang(?playerName) = "en")
}
And now let’s try it out for 'Nicolas Escude':
person | playerName | dateOfBirth | countryName | alternateName |
---|---|---|---|---|
Nicolas Escudé |
1976-04-03T00:00:00Z |
France |
Nicolas Escude |
Cool, that works. And what about for 'Roger Federer'?
SELECT *
WHERE {
{ ?person rdfs:label "Roger Federer"@en }
UNION
{?person skos:altLabel "Roger Federer"@en}
?person wdt:P106 wd:Q10833314 ;
rdfs:label ?playerName ;
wdt:P569 ?dateOfBirth;
wdt:P27 [ rdfs:label ?countryName ] ;
skos:altLabel ?alternateName
filter(lang(?alternateName) = "en")
filter(lang(?countryName) = "en")
filter(lang(?playerName) = "en")
}
person | playerName | dateOfBirth | countryName | alternateName |
---|---|---|---|---|
No matching records found |
D’oh, still no good.
The problem this time is that we have the statement ?player skos:altLabel ?alternateLabel
in our WHERE
clause, which means we’ll only get back results where the player has an alternate name.
So if we ran this query for 'Rafael Nadal', it would actually work because he has several alternate names:
SELECT *
WHERE {
{ ?person rdfs:label "Rafael Nadal"@en }
UNION
{?person skos:altLabel "Rafael Nadal"@en}
?person wdt:P106 wd:Q10833314 ;
rdfs:label ?playerName ;
wdt:P569 ?dateOfBirth;
wdt:P27 [ rdfs:label ?countryName ] ;
skos:altLabel ?alternateName
filter(lang(?alternateName) = "en")
filter(lang(?countryName) = "en")
filter(lang(?playerName) = "en")
}
person | playerName | dateOfBirth | countryName | alternateName |
---|---|---|---|---|
Rafael Nadal |
1986-06-03T00:00:00Z |
Spain |
Rafa |
|
Rafael Nadal |
1986-06-03T00:00:00Z |
Spain |
Rafa Nadal |
|
Rafael Nadal |
1986-06-03T00:00:00Z |
Spain |
Rafael Nadal Parera |
But let’s go back to Federer. We’re going to update the query to make the alternate name predicate optional, as shown below:
SELECT *
WHERE {
{ ?person rdfs:label "Roger Federer"@en }
UNION
{?person skos:altLabel "Roger Federer"@en}
?person wdt:P106 wd:Q10833314 ;
rdfs:label ?playerName ;
wdt:P569 ?dateOfBirth;
wdt:P27 [ rdfs:label ?countryName ] .
OPTIONAL {
?person skos:altLabel ?alternateName
filter(lang(?alternateName) = "en")
}
filter(lang(?countryName) = "en")
filter(lang(?playerName) = "en")
}
person | playerName | dateOfBirth | countryName | alternateName |
---|---|---|---|---|
Roger Federer |
1981-08-08T00:00:00Z |
Switzerland |
Sweet! And if we run this query for 'Nicolas Escude' or 'Rafael Nadal' we get the same results as before - good times!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.