Panini Linked Data - Using SPARQL

NOTE: This post was originally published as part of a series of posts on the blog of Talis Systems. However, since that website is eventually going to go offline, I have decided to move the post over here. New parts of the series will be published on datalysator.com. All references and links to Kasabi are no longer working (the service has unfortunately been taken offline), and will be replaced with alternatives.

Again

Well, the English penalty curse persists, despite all those practice sessions – England is out of the EURO. That’s quite disappointing (if you’re an England supporter), but at least it looks like we have some interesting semi-finals ahead of us later this week!

Where were we?

In the previous post, we created some RDF data about EURO2012 Panini stickers from tab-separated value (TSV) files gathered from various websites. We used the Vertere conversion tool to do this. The data we get from this is already a good starting point, but we could add some more data to have a richer dataset to use for an interesting app. On the one hand, there is some implicit information in the data which we should make explicit to make it usable. On the other hand, there is other data out there on the Web of Data which we might want to link to, in order to enrich our own dataset. In this post, we’re going to address the first issue: taking the dataset as it is and deriving some additional data from it. This may sound like reasoning to some of you, but in fact we’re going to do something much simpler: we’re going to use a couple of SPARQL queries to create new data based on the old. Also in this post, we’re going to publish our dataset online, using the Kasabi data marketplace.

Reviewing what we have

Let’s have a look at the data we have at this stage. We have:

  • a resource for each sticker, with
    • a label (from the original Panini data),
    • the sticker type (also derived from the Panini source data), and
    • a code (from the stickermanager.com data), most often a country code. Since it’s not always a country code, we use the generic skos:notation property for this.
  • a resource for each sticker type
    • a label for each type
    • each type is a subclass of a generic panini:Sticker class

Here is an example of our data in Turtle notation:

<http://data.kasabi.com/dataset/panini-stickers/euro/2012/sticker/241>
    a panini:Player_sticker ;
    rdfs:label "Mesut Özil" ;
    skos:notation "GER" ;
.

panini:Player_sticker
    a owl:Class ;
    rdfs:label "Player Sticker"@en ;
    rdfs:subClassOf panini:Sticker ;
.

What Could We Add?

Of course, this data is all about the stickers themselves. However, each sticker actually depicts something in the real world: a player, a team, a stadium, etc. What I’d like to do is create resources for these things as well and link the stickers to them. So, for each player sticker, I would like to have a resource for the actual player; for each team sticker, a resource for the team itself, etc. The figure below shows what I have in mind; the next paragraphs will show how to achieve this using SPARQL.

Panini Sticker Linked Data

Creating New Data with SPARQL

SPARQL is mostly known as a query language for RDF. However, by using the CONSTRUCT keyword, we can also create new RDF based on our existing RDF graph. So how would we create the additional triples shown above? First, I need to decide which vocabulary to use for the real-world objects I want to model.

Players

For the players, I’ll use the FOAF vocabulary, since that is widely used as a starting point for modelling people. FOAF also provides the foaf:depicts property, which says that something (e.g., a Panini sticker) depicts something else (e.g., a football player); in the query below I use a panini:depicts property modelled on it, so that all sticker-specific terms live in the dataset’s own namespace. With that out of the way, here is the query we need:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX panini: <http://data.kasabi.com/dataset/panini-stickers/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
    ?player a foaf:Person ;
        foaf:name ?sticker_label .

    ?sticker panini:depicts ?player .    
} WHERE {
    ?sticker a panini:Player_sticker ;
        rdfs:label ?sticker_label .

    BIND (
        IRI(
            CONCAT(
                "http://data.kasabi.com/dataset/panini-stickers/player/",
                ENCODE_FOR_URI(?sticker_label)
            )
        ) AS ?player
    )
}

Let’s have a look at what’s going on here. The WHERE clause is where we grab the bits of our data that we need to construct the new triples. We’re looking for things that are panini:Player_stickers (we don’t want to do anything with the teams, stadiums or other kinds of stickers in this query) and their labels. Then I’m using a nice new addition to SPARQL 1.1: the BIND … AS feature. It greatly increases your options for creating new RDF with SPARQL – in fact, I don’t think there is a way to do the same thing in SPARQL 1.0.

Working from the centre of the expression outwards: I take the sticker label (e.g., “Fernando Torres”) and URI-encode it (to turn the space into a %20), concatenate that with the URI I want to use as the namespace for players (http://data.kasabi.com/dataset/panini-stickers/player/), turn the result into an IRI (a generalised URI that can contain non-ASCII characters) and finally bind the whole thing to the ?player variable. That’s the URI I want to use for my new player resource.
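To see that expression in isolation, here is a minimal standalone query – the VALUES clause just supplies a sample label, so you can run this against an empty graph:

```sparql
SELECT ?player WHERE {
    # Supply a sample label instead of matching it from real data:
    VALUES ?sticker_label { "Fernando Torres" }

    # Same inner-to-outer steps as in the CONSTRUCT query above:
    BIND (
        IRI(
            CONCAT(
                "http://data.kasabi.com/dataset/panini-stickers/player/",
                ENCODE_FOR_URI(?sticker_label)
            )
        ) AS ?player
    )
}
# ?player is bound to
# <http://data.kasabi.com/dataset/panini-stickers/player/Fernando%20Torres>
```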

The CONSTRUCT clause is very straightforward – all we’re doing is using the variables that were bound in the WHERE clause. I’m saying that ?player is a foaf:Person and that their name is the same as the sticker’s label. Then I’m saying that the sticker depicts the player. That’s all!

Teams

For the teams, I could do the same. However, I want the names of the team resources to be something like “Germany national football team”, with the name of the country always in English. The reason is that I think it will make it easier to find a matching resource in an external dataset like DBpedia this way (the main DBpedia is extracted from the English-language Wikipedia, not from the other language editions). However, the Panini sticker labels use the national language of the team they are showing – e.g., the German team sticker is called “Deutschland” and the Swedish one “Sverige”. I’m sure there is a clever way of automatically translating these names to English – however, since we’re only talking about 16 teams, I have simply written the basic team data by hand. E.g., for the German team I wrote this Turtle, consisting of a type (I use schema.org for this), the desired label in English and a country code:

team:Germany a schema:SportsTeam ;
    rdfs:label "Germany national football team"@en ;
    skos:notation "GER" ;
.

Now I can use the country code to match team stickers with teams, and create the depiction link, as shown in the query below:

PREFIX panini: <http://data.kasabi.com/dataset/panini-stickers/schema/>
PREFIX schema: <http://schema.org/> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT {
    ?sticker panini:depicts ?team .
} WHERE {
    ?sticker a panini:Team_sticker_puzzle ;
        skos:notation ?country_code .

    ?team a schema:SportsTeam ;
        skos:notation ?country_code .
}

And that’s it for using some very simple SPARQL queries to create RDF resources for the real-world objects our stickers depict. I have only dealt with players and teams here, but we could write similar queries for the stadiums, or for the historical events shown on the stickers at the back of the album. I’ll leave that as an exercise for you.
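As a starting point for that exercise, here is a sketch of what the stadium query might look like. Note that the panini:Stadium_sticker class name, the schema:StadiumOrArena type and the …/stadium/ namespace are my assumptions – check them against the actual type URIs your own conversion produced:

```sparql
PREFIX panini: <http://data.kasabi.com/dataset/panini-stickers/schema/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>

CONSTRUCT {
    # Create a resource for the real-world stadium and link the sticker to it.
    ?stadium a schema:StadiumOrArena ;
        rdfs:label ?sticker_label .

    ?sticker panini:depicts ?stadium .
} WHERE {
    # Assumed class name – verify against your generated sticker types.
    ?sticker a panini:Stadium_sticker ;
        rdfs:label ?sticker_label .

    # Mint a stadium URI from the sticker label, as for the players.
    BIND (
        IRI(
            CONCAT(
                "http://data.kasabi.com/dataset/panini-stickers/stadium/",
                ENCODE_FOR_URI(?sticker_label)
            )
        ) AS ?stadium
    )
}
```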

Tools

I have used the arq command line tool (part of the Jena framework) to run all these queries locally, without having to set up my own RDF store. Even if you don’t use Jena for development, this little tool is incredibly useful for data conversion processes such as the one described in this blog post. I really recommend you check it out if you don’t already use it! Another important tool in my Linked Data toolbox is of course rapper, which is part of the Raptor library from the Redland (librdf) suite. rapper is great for converting RDF from one format into another. You can find all queries and examples of how to use Vertere, arq and rapper to convert the Panini TSV data from start to finish in the Examples folder of the Vertere GitHub project.
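For reference, the two tools can be invoked roughly like this – the file names are placeholders, and you should check each tool’s --help output for the options available in your installed version:

```shell
# Run a CONSTRUCT query over a local Turtle file with Jena's arq,
# without needing a triple store; the result is written as Turtle:
arq --data stickers.ttl --query create_players.rq > players.ttl

# Convert the result from Turtle to N-Triples with rapper:
rapper -i turtle -o ntriples players.ttl > players.nt
```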

Publishing our Data with Kasabi

Now that we have our data – some converted with Vertere from TSV files, some constructed using SPARQL, and some hand-written – we can publish it. The Kasabi Information Marketplace is a neat way of doing this, as it gives you a nice landing page for your data, a place to document it, a way to specify a license, and a couple of useful APIs out of the box. Most importantly for the next episode in this series of blog posts, it gives us a fully featured SPARQL endpoint for our data! In addition, you get a keyword search API, a lookup API, a reconciliation API and the ability to add custom APIs to your dataset.

Panini Dataset on Kasabi

Once you have registered for a free Kasabi account, you can go ahead and create a new dataset and add a description, a logo (if you have one), documentation, etc. For uploading the data, you have a number of options: you can point Kasabi to a URI on the Web, use one of the Kasabi client libraries (currently Ruby, JavaScript, PHP and Python), or simply copy & paste the data directly from your text editor into the browser. I used the last approach, since it is the most convenient one when all you have is a small dataset that isn’t going to change much.

Conclusion

That’s it – in this episode we have enriched our dataset with some additional triples about players and teams and we published it to provide it with a home on the Web – and a SPARQL endpoint. In the next episode, just in time for the EURO2012 final, I will show you how to add some additional links to other datasets (after all, this is supposed to be Linked Data!) using the tools provided by the LATC Project. Until then, enjoy the semis!


Again by an untrained eye on Flickr, licensed under CC BY-NC 2.0.

Categories: slider, Tutorial

Linked Data for Panini Stickers


Stick it like Beckham…

Danish Fan at EURO2012

The EURO2012 is in full swing – with the group stage almost over (I’m watching England v Ukraine as I’m writing this), we have seen most of the usual suspects proceed into the knockout stage (Spain, Germany, Italy), some unusual suspects (Greece!), some unexpected eliminations (vaarwel Oranje, do svidanya Russia), we have had to say goodbye to hosts Poland, to Denmark and to the Boys in Green, who at least gave us some wonderful singing.

But of course, for the little kid in (some of) us, an event like the EURO2012 also means something else – Panini stickers! Little packs of five random stickers – all the players, teams, stadiums, etc., 540 stickers in total – waiting to be carefully put into the collector’s album. Collectors sort their stickers, make and exchange “have” and “need” lists, rejoice when they complete a team and when they finally have the whole set. Numbers, names of players, teams, countries, cities, stadiums – at the end of the day, it’s all data. So, in the spirit of Leigh Dodds’ recent post on Open Data for (Big) Kids, what better subject to choose for a little EURO2012-themed Linked Data project! This is the first in a four-part series of posts about how to do such a project. Meant as a tutorial for creating Linked Data, it is going to illustrate all the steps, from finding and converting the source data, through linking and enriching it with the wider Web of Data, to finally creating a little app on top of the data. I’m going to work on this as I go along, but if you want to take a peek at the dataset already, you can do so over here on Kasabi.

Getting the Data

So, the first step in creating a Linked Data application is figuring out where to get our source data from – obviously, we don’t want to just type it in by hand. There are plenty of collector sites on the Web which have lists of all stickers with their names and numbers, but the main and authoritative source of information would be the Panini site itself. There is a nice list of stickers with numbers, names and types of stickers here. Unfortunately, it’s only an HTML page, not something more useful like a CSV file (2- or 3-star data would have been lovely). Also, the whole list is spread over several pages, rather than being on one single page. Of course we could write a fancy screen scraper using one of the libraries available on ScraperWiki, but with just a handful of pages that aren’t going to change, we’ll simply copy and paste from the browser into a text editor. A couple of pages later, we have a nice tab-separated list of stickers. The nice thing about the original Panini list is that it has all names in their original language. We can get another, similar list from Stickermanager, which has the nice additional feature of a country code for each sticker. Again, getting the data out is a matter of copying it from the browser into a text file.

Copying data from an HTML table into a text editor

I have named the output files from the two web sites euro2012_paninigroup.tsv and euro2012_stickermanager.tsv. Below are some example rows from our euro2012_paninigroup.tsv file, which I’ll use to explain the conversion process in the next paragraph (I have added some additional tabs for readability).

50	Polska			Glitter - Badge
74	Robert Lewandowski	Player Sticker
250	Mesut Özil		Sticker - In action
486	England			Team Sticker (puzzle)
535	2004 Hellas		Uefa European Championship History

Converting the Data with Vertere

Now that we have the data, we need to think about how to convert it. Copying directly from an HTML table in a browser window into a text file gets us tab-separated values, or TSV. This format, just like its close relative CSV, is essentially tabular – a spreadsheet. There are plenty of useful tools for converting spreadsheet-like data into RDF. I have chosen Rob Styles’ PHP-based Vertere (from the Latin verb for “to turn” or “to translate”), which takes a mapping template in the Turtle RDF language and uses that to drive the conversion. The nice thing about it is that the mapping template already looks pretty close to the RDF we want to get as the output. Vertere is available on GitHub.

So how does Vertere work? In a nutshell, it looks at each row of our source table (TSV or CSV file) in sequence and uses the information in the mapping file to turn it into an RDF resource, using the values in the columns to create statements about the resource. We define both attributes with literal values, and relationships linking to other resources.

Each row in our Panini file has the unique number of the sticker (74), the label (Robert Lewandowski) and its type (Player Sticker). Let’s see how we would convert this using Vertere. Before I can start saying anything about how to map the individual rows, I need to give some house-keeping facts about the conversion in general. Those are needed to get the process going (I’m ignoring the prefix definitions for now):

<#> a :Spec ; 
    :format :TSV ;
    :header_rows 0 ;
    :resource <#sticker>, <#sticker_type> ;
    :base_uri "http://data.kasabi.com/dataset/panini-stickers/euro/2012/sticker/" ;
.

What is happening here? I’m saying that this document (<#>) is a mapping specification (:Spec), that the input file is in TSV format, that it has no header rows, what the base URI is that I want to use for all resources this conversion will create, and that I will specify two rules to create resources from each row (<#sticker> and <#sticker_type>). These conversion rules are the heart of the Vertere process, and as you can see, several of them can be applied to each row. So what does one of these rules look like?

<#sticker> a :Resource ;
    :identity [ 
        :source_column 1 ; 
    ] ;
    :attribute [
        :property rdfs:label ; 
        :source_column 2 ;
    ] ;
    :relationship [ 
        :property rdf:type ; 
        :object_from <#sticker_type> ;
    ] ;
.

The <#sticker> rule is pretty straightforward. By using :identity, I’m telling Vertere to use the value in column 1 (column numbering starts at 1, not 0!) for creating the URI for this resource. If we look at our source data, we see that column 1 contains the sticker number, so the URI for the Robert Lewandowski sticker would be http://data.kasabi.com/dataset/panini-stickers/euro/2012/sticker/74 (the base URI comes from the spec definition above). For each sticker resource, we create two triples. First, we want to use the value in column 2 (the sticker label) to say what the rdfs:label of the sticker is. This is a literal value, so we use the :attribute property. We also want to say what the type of this sticker is (a player sticker, a team sticker, etc.). The easiest way to specify a type for a resource is to use Vertere’s :type property. However, this only works if the type is always the same. In this example, we want to assign different types to different kinds of stickers, so we use :relationship to link this resource to the resource created by a different rule (<#sticker_type>). Let’s look at that rule:

<#sticker_type> a :Resource ;
    :identity [
        :source_column 3 ;
        :base_uri "http://data.kasabi.com/dataset/panini-stickers/schema/" ;
        :process ( :regex :normalise :title_case ) ;
        :regex_match "[() -]+";
        :regex_output " ";
    ] ;
    :type owl:Class ;
    :attribute	[ 
        :property rdfs:label; 
        :source_column 3 ;
        :language "en" ;
    ] ;
    :relationship [ 
        :property rdfs:subClassOf ; 
        :object panini:Sticker ;
    ] ;
.

This rule creates another resource for each row, only this time it’s not for the sticker, but for its type. The identity is based on the value in column 3 this time. There are a number of things happening in the :identity statement: first of all, we want to use a different base URI for the sticker types. This isn’t strictly speaking necessary, but it’s a good idea to keep your instances and classes in different namespaces. Then, we don’t want to just take the column 3 value as it is and append it to the base URI (we’d end up with an invalid URI), but instead :process it into something more appropriate. We apply three operations to the value: a :regex search and replace, a :normalise operation and a :title_case operation (class names should start with an upper-case letter), in that order. The :regex is further specified to match any sequence of one or more parentheses, spaces or hyphens and replace it with a single space. As an example, these three operations will turn “Team Sticker (puzzle)” into “Team_sticker_puzzle”. In combination with the new base URI, the URI of the resource created by this rule will be http://data.kasabi.com/dataset/panini-stickers/schema/Team_sticker_puzzle. By the way, since Vertere is written in PHP, it’s very simple to extend its functionality by adding such new operations (or other functionality).

The rest of the rule is pretty simple again: we’re saying that the type of each resource will be owl:Class, we assign the value of column 3 as an rdfs:label (also assigning a language tag – datatypes work similarly), and say that each new class is an rdfs:subClassOf a panini:Sticker class.

That’s it! If we run this template through Vertere, each row in the source TSV file will result in two RDF resources. E.g., the row for the Mesut Özil sticker (250 Mesut Özil Sticker - In action) will result in the following RDF:

<http://data.kasabi.com/dataset/panini-stickers/euro/2012/sticker/250>
    a panini:Sticker_in_action ;
    rdfs:label "Mesut Özil" ;
.

panini:Sticker_in_action
    a owl:Class ;
    rdfs:label "Sticker - In action"@en ;
    rdfs:subClassOf panini:Sticker ;
.

You may have noticed that, by applying both rules to every row, the different sticker types will be created over and over again – that’s no problem, however, because any duplicate triples in the output will be removed by Vertere. The complete mapping file is available on GitHub, as is the mapping file for the second sticker list (for getting the country codes).

Conclusion

So, in this post I have shown you how to take some tabular data from the Web and use the Vertere conversion tool to map it to RDF. But we’re not done yet, so in the next post I will explain how to add some additional data using SPARQL queries, and how to publish the whole dataset as Linked Data using Kasabi.

PS: The game is over and England has beaten Ukraine 1:0 – congratulations! Though who knows what would have happened if Ukraine hadn’t been denied that goal… Anyway, on to the quarter-finals!


Danish Fan picture by Comrade Foot on Flickr, licensed under CC BY-SA 2.0.


I have been active in the Linked Data community for a good number of years now, since back when it was still called the “Semantic Web community”. I started toying with a flavour of semantic technologies called Topic Maps in 2003, when I was writing my master’s thesis at the University of Cologne, and then really dived into the field when I started a PhD at DERI in early 2004. When I joined, DERI was still a very small institute, with just about 10 people working and researching away in the little town of Galway on the west coast of Ireland. When I left in 2011, it had turned into the world’s largest Semantic Web research institute, with about 130 members.

In October 2011 I left academia behind and joined Linked Data pioneers Talis Systems as a technical consultant, eager to put my research background to practical use. After about a year’s worth of data modelling, development and training for a number of clients, Talis unfortunately decided to focus all their efforts on their Talis Aspire education platform and wind down their generic Linked Data consulting, hosting and development services.

So, as of September 2012, I am running Datalysator, my own freelance consulting business around Open Data, Linked Data and Web Data in general, with a good measure of semantic technologies in the mix. Open Data (and I’m convinced Linked Data is a key ingredient here) is really taking off at the moment, so now is the perfect time for a venture like this. Also, Berlin, where I am now based, seems to be a perfect location: the city is at the forefront of the Open Government Data movement in Germany (Berlin was the first German city to have its own Open Data portal), a lot of the community around open, linked and semantic data is gathered here (in different mixes and with different emphases: OKFN, Wikidata, DBpedia, Fraunhofer FOKUS, etc.), and the city is full of exciting startups in the space (just look at uberblic or lokaler).

Exciting times!

Categories: News