In the last few weeks I’ve been putting together a dataset for NoTube partners (and others) to use, based on the BBC /programmes data but in a downloadable chunk. We are hoping that this will be a useful dataset for testing our recommendation approaches on, allowing easy comparison, and perhaps also useful for others who’d like a medium-sized RDF dataset for testing things on. At the same time, danbri’s been working on an archive.org-based dataset so we can try similar techniques on different datasets.
The BBC dataset is not quite ready yet, but here I’m going to briefly describe a technique my NoTube colleague Véronique Malaisé has used to enhance it with DBPedia links.
The initial dataset I’ve chosen is quite basic, consisting of programme url, title, channel, description, scheduled datetime, plus the Lonclass categories for the programme. Lonclass is a very large and complex classification system used by the BBC archives; it is based on UDC, the Universal Decimal Classification. It plays a key role in a demonstrator we’ve been working on, which I’ll describe in another post soon.
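Just to make that concrete, here is roughly the shape of a single record once loaded into Python. This is purely illustrative: the field names and values are my own invention rather than the dataset’s actual schema (only the programme id is real).

```python
# Illustrative only: field names and values are invented, not the
# released dataset's schema; b00gd41j is a real programme id though.
programme = {
    'id': 'b00gd41j',
    'url': 'http://www.bbc.co.uk/programmes/b00gd41j',
    'title': 'An Example Programme',
    'channel': 'BBC One',
    'description': 'A short synopsis of the programme.',
    'scheduled': '2008-11-20T21:00:00Z',
    'lonclass': ['Machine guns', 'Documentary films'],  # expert-assigned terms
}
```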
It’s interesting here because these are manually added classifications and there are lots of them, so you can think of them as tags on the content, created by experts. Véronique and I were interested in mapping them to Wikipedia / DBPedia, both because it makes an interesting additional information source to show to the end user, and because it enriches the dataset for the various kinds of semantic recommendation techniques we want to use. Our belief is that it will complement other work going on in NoTube by Ontotext, which uses Lupedia to identify ‘named entities’ (basically proper nouns) in text, and so we’ll end up with an increasingly interlinked and interesting dataset.
The core BBC dataset is 23,684 programmes covering BBC 1-4 over 2.5 years (mid-2007 to 2009). There are about 300,000 programme-to-classification pairs and 36,761 distinct classification terms.
Véro’s experiments used DBPedia because of its conveniently available dumps of titles and URLs. You could equally use the Wikipedia title dumps (look for ‘enwiki’ on the Wikipedia dumps page and then find ‘all-titles-in-ns0.gz’), but they do need tidying up, and with DBPedia that’s already done; plus there’s the benefit of being able to get RDF versions of parts of the Wikipedia data back for semantic applications.
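If you do go the raw Wikipedia route, the tidying is roughly along these lines. This is a minimal sketch, assuming you have downloaded the ‘all-titles-in-ns0’ dump; the filename and normalisation choices are mine, not part of Véro’s setup.

```python
import gzip

# Build a lookup from a normalised title form to the original Wikipedia title.
# The filename is just a guess at what you saved the dump as.
def load_titles(path='enwiki-latest-all-titles-in-ns0.gz'):
    titles = {}
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            title = line.strip()
            if not title:
                continue
            # Titles use underscores for spaces; index on a lowercased,
            # space-separated form so lookups are case-insensitive.
            key = title.replace('_', ' ').lower()
            titles.setdefault(key, title)
    return titles
```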
Véro is an expert on natural language parsing, among other things, but as a first pass her script just looks up exact matches and simple plurals on the titles. It’s worth reporting these early results because they’re great: this simple exact-match script matched 12,904 of the 36,761 Lonclass terms, i.e. just over one third got a hit.
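Roughly, the lookup goes like this. It is a simplified reconstruction rather than Véro’s actual script, and it assumes the titles table from the sketch above.

```python
def match_term(term, titles):
    # Try the term as-is, then a few naive singular/plural variants.
    # Returns the matched Wikipedia title, or None if nothing hits.
    key = term.strip().lower()
    candidates = [key]
    if key.endswith('es'):
        candidates.append(key[:-2])
    if key.endswith('s'):
        candidates.append(key[:-1])
    else:
        candidates.extend([key + 's', key + 'es'])
    for candidate in candidates:
        if candidate in titles:
            return titles[candidate]
    return None
```

Looping something like this over the 36,761 distinct Lonclass terms is the shape of the experiment behind the hit count above.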
The next stage is to see whether this hit rate can be improved by parsing subterms. We also need to evaluate whether the matches are to the correct senses of the terms, and whether it’s even useful to have them available, either for human readers or for semantic analysis.
If you are interested, here’s a list of programme ids and their DBPedia terms, generated using this method.
The format is tab-separated, one programme-to-term pair per line:
b00gd41j http://en.wikipedia.org/wiki/Machine_Gun
and contains programme ids (here’s an example in html and in rdf) matched to Wikipedia URLs.
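Reading it back in is trivial. A minimal sketch, with an assumed filename:

```python
# Parse the tab-separated file: programme id, tab, Wikipedia URL.
# 'programmes_wikipedia.tsv' is an assumed filename, not the real one.
def load_mappings(path='programmes_wikipedia.tsv'):
    pairs = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            programme_id, wikipedia_url = line.split('\t', 1)
            pairs.append((programme_id, wikipedia_url))
    return pairs
```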
The equivalent DBPedia URL can be found by replacing part of the URL, e.g.
http://en.wikipedia.org/wiki/Machine_Gun becomes http://dbpedia.org/data/Machine_gun.n3 (view the head of http://dbpedia.org/page/Machine_gun for more formats).
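Here is that substitution as a small helper. Note it only does the naive string swap, so a capitalisation variant like Machine_Gun would still need redirect resolution to reach the canonical Machine_gun resource:

```python
# Turn a Wikipedia article URL into a DBPedia data URL by swapping the
# prefix and adding a format extension. Redirects (e.g. Machine_Gun vs
# the canonical Machine_gun) are not resolved here.
def wikipedia_to_dbpedia(wikipedia_url, fmt='n3'):
    prefix = 'http://en.wikipedia.org/wiki/'
    if not wikipedia_url.startswith(prefix):
        raise ValueError('not an English Wikipedia article URL: ' + wikipedia_url)
    return 'http://dbpedia.org/data/%s.%s' % (wikipedia_url[len(prefix):], fmt)

print(wikipedia_to_dbpedia('http://en.wikipedia.org/wiki/Machine_Gun'))
# -> http://dbpedia.org/data/Machine_Gun.n3
```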
Update – the dataset is available here.