· wikidata sparql

SPARQL: OR conditions in a WHERE clause using the UNION clause

This is part 4 of my series of posts about querying the Wikidata API, in which I learn how to use SPARQL’s UNION clause to handle an OR condition in a WHERE clause.

sparql or conditions
Figure 1. Using SPARQL’s UNION clause

But first, some context!

After running queries against the Wikidata SPARQL API to pull the date of birth and nationality of tennis players into the Australian Open Graph, I noticed that several players hadn’t actually been updated. I ran some exploratory queries to work out why, and realised that the problem was that some players had anglicised names in my dataset, whereas Wikidata uses their real name.

For example in my dataset we have 'Nicolas Escude', whereas in Wikidata we have 'Nicolas Escudé'.

I wanted to figure out if there was a way to find players by an alternate name, or another name that they were known by, and found a StackOverflow answer that suggested using the skos.altLabel predicate. If we wanted to return the date of birth and nationality for the alternate name 'Nicolas Escude', we can write the following query:

SELECT *
WHERE {
  ?person wdt:P106 wd:Q10833314 ;
          skos:altLabel "Nicolas Escude"@en;
          rdfs:label ?playerName ;
          wdt:P569 ?dateOfBirth;
          wdt:P27 [ rdfs:label ?countryName ] .

  filter(lang(?countryName) = "en")
  filter(lang(?playerName) = "en")
}

If we run this query, we’ll see the following output:

Table 1. Results
person playerName dateOfBirth countryName

http://www.wikidata.org/entity/Q470879

Nicolas Escudé

1976-04-03T00:00:00Z

France

So far so good. But not every player has an alternate name and we don’t know whether the name that we have is an alternate name or actual name.

For example, let’s try and find the data of birth and nationality for the alternate name 'Roger Federer':

SELECT *
WHERE {
  ?person wdt:P106 wd:Q10833314 ;
          skos:altLabel "Roger Federer"@en;
          rdfs:label ?playerName ;
          wdt:P569 ?dateOfBirth;
          wdt:P27 [ rdfs:label ?countryName ] .

  filter(lang(?countryName) = "en")
  filter(lang(?playerName) = "en")
}

If we run that query, we’ll see the following output:

Table 2. Results
person playerName dateOfBirth countryName

No matching records found

Not so good. So we actually want to be able to match triples if either rdfs:label or skos:altLabel equal the name of the player.

I thought there might be a way to write an OR clause within the WHERE block, but as I understand it, the way to achieve this in SPARQL is via a UNION statement. I found Query #8 in SPARQL By Example helpful for understanding how to write such a query.

Let’s write a query that does this:

SELECT *
WHERE {
  { ?person rdfs:label "Nicolas Escude"@en }
  UNION
  {?person skos:altLabel "Nicolas Escude"@en}

  ?person wdt:P106 wd:Q10833314 ;
          rdfs:label ?playerName ;
          wdt:P569 ?dateOfBirth;
          wdt:P27 [ rdfs:label ?countryName ] ;
          skos:altLabel ?alternateName

  filter(lang(?alternateName) = "en")
  filter(lang(?countryName) = "en")
  filter(lang(?playerName) = "en")

}

And now let’s try it out for 'Nicolas Escude':

Table 3. Results
person playerName dateOfBirth countryName alternateName

http://www.wikidata.org/entity/Q470879

Nicolas Escudé

1976-04-03T00:00:00Z

France

Nicolas Escude

Cool, that works. And what about for 'Roger Federer'?

SELECT *
WHERE {
  { ?person rdfs:label "Roger Federer"@en }
  UNION
  {?person skos:altLabel "Roger Federer"@en}

  ?person wdt:P106 wd:Q10833314 ;
          rdfs:label ?playerName ;
          wdt:P569 ?dateOfBirth;
          wdt:P27 [ rdfs:label ?countryName ] ;
          skos:altLabel ?alternateName

  filter(lang(?alternateName) = "en")
  filter(lang(?countryName) = "en")
  filter(lang(?playerName) = "en")
}
Table 4. Results
person playerName dateOfBirth countryName alternateName

No matching records found

D’oh, still no good. The problem this time is that we have the statement ?player skos:altLabel ?alternateLabel in our WHERE clause, which means we’ll only get back results where the player has an alternate name. So if we ran this query for 'Rafael Nadal', it would actually work because he has several alternate names:

SELECT *
WHERE {
  { ?person rdfs:label "Rafael Nadal"@en }
  UNION
  {?person skos:altLabel "Rafael Nadal"@en}

  ?person wdt:P106 wd:Q10833314 ;
          rdfs:label ?playerName ;
          wdt:P569 ?dateOfBirth;
          wdt:P27 [ rdfs:label ?countryName ] ;
          skos:altLabel ?alternateName

  filter(lang(?alternateName) = "en")
  filter(lang(?countryName) = "en")
  filter(lang(?playerName) = "en")
}
Table 5. Results
person playerName dateOfBirth countryName alternateName

http://www.wikidata.org/entity/Q10132

Rafael Nadal

1986-06-03T00:00:00Z

Spain

Rafa

http://www.wikidata.org/entity/Q10132

Rafael Nadal

1986-06-03T00:00:00Z

Spain

Rafa Nadal

http://www.wikidata.org/entity/Q10132

Rafael Nadal

1986-06-03T00:00:00Z

Spain

Rafael Nadal Parera

But let’s go back to Federer. We’re going to update the query to make the alternate name predicate optional, as shown below:

SELECT *
WHERE {
  { ?person rdfs:label "Roger Federer"@en }
  UNION
  {?person skos:altLabel "Roger Federer"@en}

  ?person wdt:P106 wd:Q10833314 ;
          rdfs:label ?playerName ;
          wdt:P569 ?dateOfBirth;
          wdt:P27 [ rdfs:label ?countryName ] .

  OPTIONAL {
    ?person skos:altLabel ?alternateName
            filter(lang(?alternateName) = "en")
  }

  filter(lang(?countryName) = "en")
  filter(lang(?playerName) = "en")
}
Table 6. Results
person playerName dateOfBirth countryName alternateName

http://www.wikidata.org/entity/Q1426

Roger Federer

1981-08-08T00:00:00Z

Switzerland

Sweet! And if we run this query for 'Nicolas Escude' or 'Rafael Nadal' we get the same results as before - good times!

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket