LAK11 – Week 3: The Semantic Web: the Web as Database?

The keyword for week 3 of the LAK11 course is the Semantic Web (or, a bit less informatively, Web 3.0). A range of reading materials, from the very accessible to the more technical, was available to help us grasp these sprawling concepts.

Tim Berners-Lee, the inventor of the World Wide Web, recalls in his TED talk (highly recommended, by the way) that he wrote his 1989 proposal for a linked information system out of frustration with his work as a software engineer at CERN. Confronted with all kinds of different data formats, information systems and isolated information, he wrote the proposal and the code for the web. Today he experiences a similar frustration: not finding what he's looking for. Out of this frustration, he advocates the creation of a Semantic Web on top of the current web.

[Image] Source: The Economist

The number of web pages runs into the billions, and may soon reach trillions. HTML, the format in use today, describes how documents should be presented to human readers rather than what the data in them means, which makes it poorly suited to data analysis. Search engines like Google, whose algorithms offer a diminishing advantage at this scale, struggle to return meaningful results. The Semantic Web is meant to bring more structure to the web, making it a bit more like a database. Tim Berners-Lee defines the Semantic Web as “a web of data that can be processed directly and indirectly by machines.”

The idea is that people publish their data in a more standardized format. The way to do this is by using ontologies: agreed vocabularies for describing concepts and the relations between them (a kind of shared metadata). For example, if everyone used the same term to describe “rice”, it would be easier to connect information from different websites. Moreover, and that's another important rationale for the Semantic Web, it would make it easier for machines to access the information and run all kinds of queries on it.

This extract from Wolfgang Greller’s blog illustrates nicely the difference between a traditional search engine and a semantic web query.

What semantic web technologies can do is relatively simple to show. Take this example: You know that your friend John has a brother living in South America, but you can’t remember his name. Typing “brother of John” into a traditional search engine won’t work. All it will return is documents that contain the words ‘brother’ and ‘John’ or the exact phrase ‘brother of John’. The Semantic Web “knows” about relations, hence it would return a result saying ‘brother of John’ = ‘Kendon’. It works in exactly the same way for ‘capital of France’ = ‘Paris’; or ‘other words for red’ = ‘crimson’, ‘ruby’, etc. Semantic search engines can do this, based on a vocabulary of relations. This not only stores the words themselves, but also the way in which they relate to each other, i.e. ‘goose’ is a sub-item to ‘bird’.
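To make the idea of a “vocabulary of relations” a bit more concrete, here is a minimal sketch in Python. It is not how a real semantic search engine works; the facts, the relation names (`brother`, `capital`, `synonym`, `subclass_of`) and the helper functions are all made up for illustration, using the examples from the quote.

```python
from collections import defaultdict

# Hypothetical facts, stored as (subject, relation, object) statements.
FACTS = [
    ("John", "brother", "Kendon"),
    ("France", "capital", "Paris"),
    ("red", "synonym", "crimson"),
    ("red", "synonym", "ruby"),
    ("goose", "subclass_of", "bird"),
    ("bird", "subclass_of", "animal"),
]

# Index the facts by (relation, subject) so a lookup is a dictionary access.
INDEX = defaultdict(list)
for subject, relation, obj in FACTS:
    INDEX[(relation, subject)].append(obj)


def lookup(relation, subject):
    """Return every object linked to `subject` by `relation`."""
    return list(INDEX.get((relation, subject), []))


def is_a(item, category):
    """Follow 'subclass_of' links upward to test whether item falls under category."""
    seen = set()
    frontier = [item]
    while frontier:
        current = frontier.pop()
        if current == category:
            return True
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(lookup("subclass_of", current))
    return False


print(lookup("brother", "John"))    # ['Kendon']
print(lookup("capital", "France"))  # ['Paris']
print(lookup("synonym", "red"))     # ['crimson', 'ruby']
print(is_a("goose", "animal"))      # True
```

The point of the sketch is that the query asks for a relation, not for a string of words: “brother of John” is answered from the stored statement rather than by matching text in documents.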

Of course, if only you or I were to put our data online in a standardized format, it wouldn't make much sense, since there would be nothing to link to. A critical mass of data published in a standardized format is necessary for scale advantages to come into play. The more connected data become, the more powerful the Semantic Web gets. This concept is called Linked Data. In a linked data model, things are uniquely identified, can be looked up, provide useful information when looked up, and themselves link to other uniquely identified things. Because this interconnected data is structured, it allows computers to make complex connections.
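The four properties of Linked Data listed above can be sketched in a few lines of Python. This is only a toy model under loud assumptions: the two “sites”, the `example.org` URIs, the property names (`same_as`, `grown_in`) and the facts themselves are invented, and a dictionary stands in for the web, where each URI would really be fetched over HTTP.

```python
# Two hypothetical datasets published by different sites. Every thing has a
# unique identifier (a URI), and descriptions may point at URIs held elsewhere.
SITE_A = {
    "http://example.org/a/rice": {
        "label": "rice",
        "same_as": ["http://example.org/b/oryza-sativa"],
    },
}
SITE_B = {
    "http://example.org/b/oryza-sativa": {
        "label": "Oryza sativa",
        "grown_in": ["http://example.org/b/asia"],
    },
    "http://example.org/b/asia": {"label": "Asia"},
}

# Stand-in for the web: in reality each URI would be fetched over HTTP.
WEB = {**SITE_A, **SITE_B}


def dereference(uri):
    """Look up a URI and return its description (empty if unknown)."""
    return WEB.get(uri, {})


def crawl(start_uri):
    """Follow links from start_uri and collect the labels of everything reached."""
    labels, seen, queue = [], set(), [start_uri]
    while queue:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.add(uri)
        description = dereference(uri)
        if "label" in description:
            labels.append(description["label"])
        # Every property other than the label holds links to other URIs.
        for key, value in description.items():
            if key != "label":
                queue.extend(value)
    return labels


print(crawl("http://example.org/a/rice"))  # ['rice', 'Oryza sativa', 'Asia']
```

Starting from a single identifier on one site, the program ends up with information that lives on another site, simply because the first description links to it. With only one site publishing this way, the crawl would stop immediately, which is the critical-mass problem in miniature.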

A range of government-funded initiatives are being built; some of them can be found on Freebase. The astronomy database, for example, not only lets you retrieve information about galaxies, but also lets you plot the distribution of galaxy distances. Openstreetmap.org lets you edit geographical information in a standard format and contribute to an online mapping system. DBpedia aims to structure the information on Wikipedia in a Semantic Web-friendly format, enabling queries and linking it to other data sources on the web.

The DBpedia knowledge base currently describes more than 3.5 million things, out of which 1.67 million are classified in a consistent Ontology, including 364,000 persons, 462,000 places, 99,000 music albums, 54,000 films, 17,000 video games, 148,000 organisations, 169,000 species and 5,200 diseases. The DBpedia data set features labels and abstracts for these 3.5 million things in up to 97 different languages. (DBpedia homepage)
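As a rough idea of what “enabling queries” looks like in practice, the sketch below builds (but does not send) a request asking DBpedia for the capital of France. It is written in SPARQL, the standard query language for this kind of data. I am assuming that DBpedia's public endpoint lives at `https://dbpedia.org/sparql`, that it accepts a `format` parameter next to the standard `query` parameter, and that the resource and property URIs below are the ones DBpedia uses; treat all of these as assumptions to check rather than as documentation.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed address of DBpedia's public SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# "What is the capital of France?" expressed as one statement with a gap (?capital).
QUERY = """
SELECT ?capital WHERE {
  <http://dbpedia.org/resource/France> <http://dbpedia.org/ontology/capital> ?capital .
}
"""


def build_request_url(endpoint, query):
    """Encode the query into a GET URL; nothing is sent over the network here."""
    params = {
        "query": query.strip(),
        # Assumed parameter for asking the endpoint to answer in JSON.
        "format": "application/sparql-results+json",
    }
    return f"{endpoint}?{urlencode(params)}"


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, would be the next step. Note how the query has the same shape as the stored data: a thing, a relation, and a blank to be filled in.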

The Semantic Web is built on top of the current web and uses a data model called RDF, which is often (though not necessarily) written down in XML, to formally define and connect the data.

The Semantic Web is generally built on syntaxes which use Uniform Resource Identifiers (URIs), similar to URLs, to represent data, usually in triple-based structures: i.e. many triples of URI data that can be held in databases, or interchanged on the World Wide Web using a set of particular syntaxes developed especially for the task. These syntaxes are called “Resource Description Framework” (RDF) syntaxes. (paraphrased from The Semantic Web: An Introduction)
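To show what such triples look like on the page, here is a small Python sketch that writes two statements in N-Triples, one of the plain-text RDF syntaxes. The `http://example.org/` namespace, the names in it and the helper functions are hypothetical, and the escaping is deliberately minimal (it handles backslashes and double quotes only), so this is an illustration rather than a complete serializer.

```python
EX = "http://example.org/"  # hypothetical namespace, for illustration only


def uri(name):
    """Wrap a name from the example namespace in N-Triples URI brackets."""
    return f"<{EX}{name}>"


def literal(text):
    """Write a plain string value, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    (uri("John"), uri("brother"), uri("Kendon")),
    (uri("Kendon"), uri("name"), literal("Kendon")),
]


def to_ntriples(triples):
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


print(to_ntriples(TRIPLES))
# <http://example.org/John> <http://example.org/brother> <http://example.org/Kendon> .
# <http://example.org/Kendon> <http://example.org/name> "Kendon" .
```

Subjects and relations are always URIs; the object is either another URI (a link to a further thing) or a literal value such as a name.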

For learning, the Semantic Web holds the potential for retrieving more relevant information more easily, both by people and by intelligent agents and tutors. Tagging systems could evolve into ontologies, with everyone using an identical set of tags to describe websites’ content.

However, will the Semantic Web concept succeed in bringing more order and structure to the web? Will people without database expertise be convinced to spend time entering their data in a standardized format? Data descriptions may have different meanings to different people. A small molecule can mean a few molecules to a chemist and 10,000 molecules to a biochemist (Laurence Cuffe on the course's Moodle forum). Constructing queries on Linked Data requires proficiency in a query language such as SPARQL. As Tanya Elias put it on the forum:


Computers do some things very well: gather data, remember stuff, calculate and crunch data – Syntax-related stuff.  People tend to do other things well: thinking, analyzing, sorting and determining – semantics-related stuff.

I may simply be out in left field, but it seems to me that people spending a whole lot of time and effort developing a syntax that enables people to code semantics in a simple enough way for a computer to understand really has both the people and the machines playing to their weaknesses – in my experience not usually a recipe for a successful outcome.

Will the web retain its open character? Is there a single right way to categorize information, and does it not change continuously? How does the Semantic Web relate to the social web, for example the collaborative tagging of Delicious and Diigo? I guess we're still waiting for more powerful illustrations of the Semantic Web's potential. In the meantime, however, search engines such as Google already incorporate Semantic Web techniques into their search algorithms. Maybe the introduction of the Semantic Web will go largely unnoticed by most users.


One comment on “LAK11 – Week 3: The Semantic Web: the Web as Database?”

  1. Stefaan this is a great round up of last week's LAK! I especially like the creative way in which you link all of the theory into one really inspiring post.
