Panini Linked Data - Using SPARQL

NOTE: This post was originally published as part of a series of posts on the blog of Talis Systems. However, since that website is eventually going to go offline, I have decided to move the post over here. New parts of the series will be published on datalysator.com. All references and links to Kasabi are no longer working (the service has unfortunately been taken offline), and will be replaced with alternatives.

[Image: “Again”]

Well, the English penalty curse persists, despite all those practice sessions – England is out of the EURO. That’s quite disappointing (if you’re an England supporter), but at least it looks like we have some interesting semi-finals ahead of us later this week!

Where were we?

In the previous post, we created some RDF data about EURO2012 Panini stickers from tab-separated value (TSV) files gathered from various websites. We used the Vertere conversion tool to do this. The data we get from this is already a good starting point, but we could add some more data to have a richer dataset to use for an interesting app. On the one hand, there is some implicit information in the data which we should make explicit to make it usable. On the other hand, there is other data out there on the Web of Data which we might want to link to, in order to enrich our own dataset. In this post, we’re going to address the first issue: taking the dataset as it is and deriving some additional data from it. This may sound like reasoning to some of you, but in fact we’re going to do something much simpler: we’re going to use a couple of SPARQL queries to create new data based on the old. Also in this post, we’re going to publish our dataset online, using the Kasabi data marketplace.

Reviewing what we have

Let’s have a look at the data we have at this stage. We have:

  • a resource for each sticker, with
    • a label (from the original Panini data),
    • the sticker type (also derived from the Panini source data), and
    • a code (from the stickermanager.com data), most often a country code. Since it’s not always a country code, we use the generic skos:notation property for this.
  • a resource for each sticker type, with
    • a label, and
    • an rdfs:subClassOf link to the generic panini:Sticker class

Here is an example of our data in Turtle notation:

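@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix panini: <http://data.kasabi.com/dataset/panini-stickers/schema/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://data.kasabi.com/dataset/panini-stickers/euro/2012/sticker/241>
    a panini:Player_sticker ;
    rdfs:label "Mesut Özil" ;
    skos:notation "GER" ;
.

panini:Player_sticker
    a owl:Class ;
    rdfs:label "Player Sticker"@en ;
    rdfs:subClassOf panini:Sticker ;
.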

What could we add?

Of course, this data is all about the stickers themselves. However, each sticker actually depicts something in the real world: a player, a team, a stadium, etc. What I’d like to do is create resources for these things as well and link the stickers to them. So, for each player sticker, I would like to have a resource for the actual player itself. For each team sticker, I want a resource for the team itself, etc. The figure below shows what I have in mind; the next paragraphs will show how to achieve this using SPARQL.

[Figure: Panini Sticker Linked Data]

Creating New Data with SPARQL

SPARQL is mostly known as a query language for RDF. However, by using the CONSTRUCT keyword, we can also create new RDF based on our existing RDF graph. So how would we create the additional triples shown above? First, I need to decide which vocabulary to use for the real-world objects I want to model.

Players

For the players, I’ll use the FOAF vocabulary, since that is widely used as a starting point for modelling people. FOAF also provides the foaf:depicts property, which says that something (e.g., a Panini sticker) depicts something else (e.g., a football player). Note that the queries in this post express this link as panini:depicts, a property in the dataset’s own schema namespace. With that out of the way, here is the query we need:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX panini: <http://data.kasabi.com/dataset/panini-stickers/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
    ?player a foaf:Person ;
        foaf:name ?sticker_label .

    ?sticker panini:depicts ?player .    
} WHERE {
    ?sticker a panini:Player_sticker ;
        rdfs:label ?sticker_label .

    BIND (
        IRI(
            CONCAT(
                "http://data.kasabi.com/dataset/panini-stickers/player/",
                ENCODE_FOR_URI(?sticker_label)
            )
        ) AS ?player
    ) .
}

Let’s have a look at what’s going on here. The WHERE clause is where we grab the bits of our data that we need to construct the new triples. We’re looking for things of type panini:Player_sticker (we don’t want to do anything with the teams, stadiums or other kinds of stickers in this query) and their labels. Then I’m using a nice new addition in SPARQL 1.1: BIND, which assigns the result of an expression to a variable using the AS keyword. This is really useful, as it greatly increases your possibilities for creating new RDF with SPARQL. In fact, I don’t think there is a way to do the same thing in SPARQL 1.0.

Working from the centre of the expression outwards, I take the sticker label (e.g., “Fernando Torres”) and URI-encode it (to turn the space into a %20). Then I concatenate that with the URI that I want to use as the namespace for players (http://data.kasabi.com/dataset/panini-stickers/player/), turn the result into an IRI (a fancy URI that can contain non-ASCII characters) and finally bind the whole thing to the ?player variable. That’s the URI I want to use for my new player resource.

The CONSTRUCT clause is very straightforward – all we’re doing is using the variables that have been bound in the WHERE clause. I’m saying that ?player is a foaf:Person and that their name is the same as the sticker’s label. Then I’m saying that the sticker depicts the player. That’s all!
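If it helps to see the same logic outside SPARQL, here is a minimal Python sketch of what the query does for a single sticker. It is only an illustration, not part of the actual conversion: the function names are made up, and I am assuming that `urllib.parse.quote` with `safe=""` is a close enough stand-in for ENCODE_FOR_URI (both leave only letters, digits, "-", ".", "_" and "~" unescaped).

```python
from urllib.parse import quote

# Namespaces copied from the query above.
PLAYER_NS = "http://data.kasabi.com/dataset/panini-stickers/player/"
PANINI_NS = "http://data.kasabi.com/dataset/panini-stickers/schema/"
FOAF_NS = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def player_iri(sticker_label: str) -> str:
    """Mimic IRI(CONCAT(namespace, ENCODE_FOR_URI(label))) (hypothetical helper)."""
    # safe="" percent-encodes everything except unreserved characters,
    # so the space in a name becomes %20.
    return PLAYER_NS + quote(sticker_label, safe="")


def construct_player_triples(sticker_iri: str, sticker_label: str):
    """Return the three triples the CONSTRUCT template emits for one sticker."""
    player = player_iri(sticker_label)
    return [
        (player, RDF_TYPE, FOAF_NS + "Person"),
        (player, FOAF_NS + "name", sticker_label),
        (sticker_iri, PANINI_NS + "depicts", player),
    ]


# Sticker 241 is the example from the Turtle snippet earlier in this post.
sticker = "http://data.kasabi.com/dataset/panini-stickers/euro/2012/sticker/241"
for triple in construct_player_triples(sticker, "Mesut Özil"):
    print(triple)
```

One thing the sketch makes obvious: two stickers with the same label would end up pointing at the same player resource, because the player URI is derived from the label alone.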

Teams

For the teams, I could do the same. However, I want the labels of the team resources to be something like “Germany national football team”, with the name of the country always in English. The reason for this is that I think it will make it easier to find a matching resource in an external dataset like DBpedia (DBpedia covers all of the English-language Wikipedia, but not every other language edition). However, the Panini sticker labels use the national language of the team they are showing – e.g., the German team sticker is called “Deutschland” and the Swedish one “Sverige”. I’m sure there is a clever way of automatically translating these names to English – however, since we’re only talking about 16 teams, I have simply written the basic team data by hand. E.g., for the German team I wrote this Turtle, consisting of a type (I use schema.org for this), the desired label in English and a country code:

team:Germany a schema:SportsTeam ;
    rdfs:label "Germany national football team"@en ;
    skos:notation "GER" ;
.

Now I can use the country code to match team stickers with teams, and create the depiction link, as shown in the query below:

PREFIX panini: <http://data.kasabi.com/dataset/panini-stickers/schema/>
PREFIX schema: <http://schema.org/> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT {
    ?sticker panini:depicts ?team .
} WHERE {
    ?sticker a panini:Team_sticker_puzzle ;
        skos:notation ?country_code .

    ?team a schema:SportsTeam ;
        skos:notation ?country_code .
}
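The query above is really just a join on the country code. As a rough Python sketch of the same idea (the function name and all example.org URIs below are invented for illustration; they are not the identifiers used in the real dataset):

```python
PANINI_NS = "http://data.kasabi.com/dataset/panini-stickers/schema/"


def match_stickers_to_teams(team_stickers, teams):
    """Pair each team sticker with every team sharing its skos:notation code.

    Both arguments are lists of (iri, country_code) tuples.
    """
    # Index the hand-written teams by country code.
    teams_by_code = {}
    for team_iri, code in teams:
        teams_by_code.setdefault(code, []).append(team_iri)

    depicts = PANINI_NS + "depicts"
    # Stickers whose code matches no team produce no triple,
    # just like a failed join in the WHERE clause.
    return [
        (sticker_iri, depicts, team_iri)
        for sticker_iri, code in team_stickers
        for team_iri in teams_by_code.get(code, [])
    ]


# Hypothetical input data, for illustration only.
stickers = [
    ("http://example.org/sticker/a", "GER"),
    ("http://example.org/sticker/b", "SWE"),
    ("http://example.org/sticker/c", "XXX"),  # no matching team
]
teams = [
    ("http://example.org/team/Germany", "GER"),
    ("http://example.org/team/Sweden", "SWE"),
]
for triple in match_stickers_to_teams(stickers, teams):
    print(triple)
```

Since a team is spread over several puzzle stickers, several stickers carrying the same code simply each get their own depiction link to the same team.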

And that’s it for using some very simple SPARQL queries to create RDF resources for the real-world objects our stickers depict. I know I have only dealt with players and teams, but we could write similar queries for the stadiums or for the events shown on the history stickers in the back of the album. I’ll leave that as an exercise for you.

Tools

I have used the arq command line tool (part of the Jena framework) to run all these queries locally, without having to set up my own RDF store. Even if you don’t use Jena for development, this little tool in itself is incredibly useful for data conversion processes such as the one described in this blog post. I really recommend you check it out, if you don’t already use it! Another important tool in my Linked Data toolbox is of course rapper, the command line utility of the Raptor parser library from the Redland (librdf) project. rapper is great for converting RDF from one format into another. You can find all queries and examples of how to use Vertere, arq and rapper to convert the Panini TSV data from start to finish in the Examples folder of the Vertere GitHub project.
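To give a rough idea of what such a conversion step looks like, here is a small Python sketch that only builds and prints the two command lines. It assumes arq and rapper are installed and on your PATH; the file names are placeholders I made up, and the real invocations are the ones in the Vertere examples folder.

```python
import shlex


def arq_command(data_file: str, query_file: str) -> list:
    """Command line for running a SPARQL query file over a local RDF file."""
    return ["arq", f"--data={data_file}", f"--query={query_file}"]


def rapper_command(input_format: str, output_format: str, rdf_file: str) -> list:
    """Command line for converting an RDF file from one syntax to another."""
    return ["rapper", "-i", input_format, "-o", output_format, rdf_file]


# Placeholder file names, for illustration only.
print(shlex.join(arq_command("stickers.ttl", "players.rq")))
print(shlex.join(rapper_command("turtle", "ntriples", "players.ttl")))
```

Both tools write their result to standard output, so in practice you would redirect it to a file, or pass the lists to `subprocess.run` and capture the output there.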

Publishing our Data with Kasabi

Now that we have our data – some converted with Vertere from TSV files, some constructed using SPARQL, and some hand-written – we can publish it. The Kasabi Information Marketplace is a neat way of doing this, as it gives you a nice landing page for your data, a place to document it and specify a license, and a couple of useful APIs out of the box. Most importantly for the next episode in this series of blog posts, it gives us a fully featured SPARQL endpoint for our data! You also get a keyword search API, a lookup API, a reconciliation API and the ability to add custom APIs to your dataset.

[Figure: Panini Dataset on Kasabi]

Once you have registered for a free Kasabi account, you can go ahead and create a new dataset, add a description, logo (if you have one), documentation, etc. For uploading the data, you have a number of options: you can point Kasabi to a URI on the Web, use one of the Kasabi client libraries (currently Ruby, JavaScript, PHP and Python), or simply copy & paste the data directly from your text editor into the browser. I used the last approach, since it is the most convenient one when all you have is a small dataset that isn’t going to change much.

Conclusion

That’s it – in this episode we have enriched our dataset with some additional triples about players and teams and we published it to provide it with a home on the Web – and a SPARQL endpoint. In the next episode, just in time for the EURO2012 final, I will show you how to add some additional links to other datasets (after all, this is supposed to be Linked Data!) using the tools provided by the LATC Project. Until then, enjoy the semis!


Again by an untrained eye on Flickr, licensed under CC BY-NC 2.0.
