Stick it like Beckham…
The EURO2012 is in full swing – with the group stage almost over (I’m watching England v Ukraine as I’m writing this), we have seen most of the usual suspects proceed into the knockout stage (Spain, Germany, Italy), some unusual suspects (Greece!), some unexpected eliminations (vaarwel Oranje, do svidanya Russia), we have had to say goodbye to hosts Poland, to Denmark and to the Boys in Green, who at least gave us some wonderful singing.
But of course, for the little kid in (some) of us, an event like the EURO2012 also means something else – Panini stickers! Little packs of five random stickers, all the players, teams, stadiums, etc. – 540 stickers in total, waiting to be carefully put into the collector’s album. Collectors sort their stickers, make and exchange “have” and “need” lists, rejoice when they complete a team and when they finally have the whole set. Numbers, names of players, teams, countries, cities, stadiums – at the end of the day, it’s all data. So, in the spirit of Leigh Dodds’ recent post on Open Data for (Big) Kids, what better subject to choose for a little EURO2012-themed Linked Data project! This is the first in a four-part series of posts about how to do such a project. Meant as a tutorial for creating linked data, it is going to illustrate all the steps from finding and converting the source data, linking and enriching it with the wider Web of Data and finally creating a little app on top of the data. I’m going to work on this as I go along, but if you want to take a peek at the dataset already now, you can do so over here on Kasabi.
Getting the Data
So, the first step in creating a Linked Data application is figuring out where to get our source data from – obviously, we don’t want to just type it in by hand. There are plenty of collector sites on the Web which have lists of all stickers with their names and numbers, but the main and authoritative source of information would be the Panini site itself. There is a nice list of stickers with numbers, names and types of stickers here. Unfortunately, it’s only an HTML page, not something more useful like a CSV file (2 or 3-star data would have been lovely). Also, the whole list is dispersed over several pages, rather than on one single page. Of course we could write a fancy screen scraper using on of the libraries available on ScraperWiki, but with just a handful of pages that aren’t going to change, we’ll simply copy and paste from the browser into a text editor. A couple of pages later, we have a nice tab-separated list of stickers. The nice thing about the original Panini list is that it has all names in their original language. We can get another, similar list from Stickermanager, which has the nice additional feature of a country code for each sticker. Again, getting the data out is a matter of copying it from the browser into a text file.
I have named the output files from the two web sites
euro2012_stickermanager.tsv. Below are some example rows from our
euro2012_paninigroup.tsv file, which I’ll use to explain the conversion process in the next paragraph (I have added some additional tabs for readability).
50 Polska Glitter - Badge 74 Robert Lewandowski Player Sticker 250 Mesut Özil Sticker - In action 486 England Team Sticker (puzzle) 535 2004 Hellas Uefa European Championship History
Converting the Data with Vertere
Now that we have the data, we need to think about how to convert it. Copying directly from an HTML table in browser window into a text file gets us tab-separated values, or TSV. This is a format which, just like its close relative CSV, is essentially a tabular format – a spreadsheet. There are plenty of useful tools for converting spreadsheet-like data into RDF. I have chosen Rob Styles‘ PHP-based Vertere (from the Latin verb for turn or translate), which takes a mapping template in the Turtle RDF language and uses that to drive the conversion. The nice thing about it is that the mapping template already looks pretty close to the RDF we want to get as the output. Vertere is available on Github.
So how does Vertere work? In a nutshell, it looks at each row of our source table (TSV or CSV file) in sequence and uses the information in the mapping file to turn it into an RDF resource, using the values in the columns to create statements about the resource. We define both
attributes with literal values, and
relationships linking to other resources.
Each row in our Panini file has the unique number of the sticker (
74), the label (
Robert Lewandowski) and its type (
Player Sticker). Let’s see how we would convert this using Vertere. Before I can start saying anything about how to map the individual rows, I need to give some house-keeping facts about the conversion in general. Those are needed to get the process going (I’m ignoring the prefix definitions for now):
<#> a :Spec ; :format :TSV ; :header_rows 0 ; :resource <#sticker>, <#sticker_type> ; :base_uri "http://data.kasabi.com/dataset/panini-stickers/euro/2012/sticker/" ; .
What is happening here? I’m saying that this document (
<#>) is a mapping specification (
:Spec), that the input file is in TSV format, that it has no header rows, what the base URI is that I want to use for all resources this conversion will create, and that I will specify two rules to create resources from each row (
<#sticker_type>). These conversion rules are the heart of the Vertere process, and as you can see I can apply multiple of them to each row. What does one of these rules looks like?
<#sticker> a :Resource ; :identity [ :source_column 1 ; ] ; :attribute [ :property rdfs:label ; :source_column 2 ; ] ; :relationship [ :property rdf:type ; :object_from <#sticker_type> ; ] ; .
<#sticker> rule is pretty straight-forward. By using
:identity, I’m telling Vertere to use the value in column 1 (column numbering starts at 1, not 0!) for creating the URI for this resource. If we look at our source data, we see that column 1 contains the sticker number, so the URI for the Robert Lewandowski sticker would e.g. be http://data.kasabi.com/dataset/panini-stickers/euro/2012/sticker/74 (the base URI comes from the spec definition above). For each sticker resource, we create two triples. First we want to use the value in column 2 (the sticker label) to say what the
rdfs:label of the sticker is. This is a literal value, so we use the
:attribute property. We also want to say what the type of this sticker is (a player sticker, a team sticker, etc.). The easiest way to specify a type for a resource is to use Vertere’s
:type property. However, this only works if the type is always the same. In this example, we want to assign different types for different kinds of stickers, so we use
:relationship to link this resource to the resource created in a different rule (
<#sticker_type>). Let’s look at that rule:
<#sticker_type> a :Resource ; :identity [ :source_column 3 ; :base_uri "http://data.kasabi.com/dataset/panini-stickers/schema/" ; :process ( :regex :normalise :title_case ) ; :regex_match "[() -]+"; :regex_output " "; ] ; :type owl:Class ; :attribute [ :property rdfs:label; :source_column 3 ; :language "en" ; ] ; :relationship [ :property rdfs:subClassOf ; :object panini:Sticker ; ] ; .
This rule creates another resource for each row, only this time it’s not for the sticker, but for its type. The identity is based on the value in column 3 this time. There are a number of things happening in the
:identity statement: first of all, we want to use a different base URI for the sticker types. This isn’t strictly speaking necessary, but it’s a good idea to keep your instances and classes in different namespaces. Then, we don’t want to just take the column 3 value as it is and append it to the base URI (we’d end up with an invalid URI), but we want to
:process it into something more appropriate. We want to apply three operations on the value: a
:regex search and replace, a
:normalise operation and a
:title_case (class names should start with an upper case letter), in that order. The
:regex is further specified to match any sequence of one or more parentheses, spaces or hyphens and replace them with a single space. As an example, these three operations will turn “
Team Sticker (puzzle)” into “
Team_sticker_puzzle“. In combination with the new base URI, the URI of the resource created by this rule will be http://data.kasabi.com/dataset/panini-stickers/schema/Team_sticker_puzzle. By the way, since Vertere is in PHP, it’s very simple to extend its functionality by adding new such operations (or other functionality).
The rest of the rule is pretty simple again: we’re saying that the type of each resource will be
owl:Class, we assign the value of column 3 as an
rdfs:label (also assigning a language tag – datatypes work similarly), and say that each new class is an
That’s it! If we run this template through Vertere, each row in the source TSV file will result in two RDF resources. E.g., the row for the Mesut Özil sticker (
250 Mesut Özil Sticker - In action) will result in the following RDF:
<http://data.kasabi.com/dataset/panini-stickers/euro/2012/sticker/250> a panini:Sticker_in_action ; rdfs:label "Mesut Özil" ; . panini:Sticker_in_action a owl:Class ; rdfs:label "Sticker - In action"@en ; rdfs:subClassOf panini:Sticker ; .
You may have noticed that, by applying both rules to every row, the different sticker types will be created over and over – that’s no problem, however, because any duplicate triples in the output will be removed by Vertere. The complete mapping file is available on github, as well as the mapping file for the second sticker list (for getting the country codes).
So, in this post I have shown you how to take some tabular data from the Web and use the Vertere conversion tool to map it to RDF. But we’re not done yet, so in the next post I will explain how to add some additional data using SPARQL queries, and how to publish the whole dataset as Linked Data using Kasabi.
PS: The game is over and England has beaten Ukraine 1:0 – congratulations! Though who knows what would have happened if the Ukraine hadn’t been denied that goal… Anyway, on to the quarter finals!