#H800 Does the Semantic Web create a ‘filter bubble’?

Week 21 makes a little detour from Web 2.0 to Web 3.0 (more aptly called the Semantic Web), looking at whether the latter term signals a break with Web 2.0 or rather a logical evolution. The Semantic Web was a core topic in the Learning Analytics open course (LAK11) earlier this year, and I discussed the concept in an earlier blog post (here).
An interesting angle is the relation between the Semantic Web and the information overload experienced by many users, who struggle to find information with search engines and soldier on through all kinds of ‘interesting’ and ‘must-read’ articles, blog posts and videos. Clay Shirky doesn’t use the term ‘information overload’, but prefers to speak of a lack of filtering. Tools that enable better filtering seem like a good thing. I use tools such as Twitter, Google Reader and Diigo to filter interesting content from the abundance around me. These filtering tools are based on social networks, using others’ preferences and selections to help me make my own.
This improved filtering is one of the alleged hallmarks of the Semantic Web. Instead of HTML pages, which store layout information but are unsuitable for data analysis, a new standard of data storage is proposed that would enable machines to read, interpret and use data from websites more easily. There are some notable examples of the Semantic Web right under our noses.
Google search results, for example, differ from person to person when you are logged in with your Google account. And online retailers such as Amazon change their homepage, suggestions and (notoriously) their prices depending on who’s visiting the website.
Eli Pariser has pointed out that much of this filtering happens without our being aware of it, at the risk of trapping us in ‘content bubbles’, with search engines and recommendation systems systematically selecting for us what we ‘want’ to see instead of giving an ‘objective’ account of the information available.
The result is a “filter bubble”, which he defines as “a unique universe of information for each of us”, meaning that we are less likely to encounter information online that challenges our existing views or sparks serendipitous connections.
It’s a worthwhile warning, but I wonder whether the problem is new to the internet. Do people not tend to select friends, information sources and books that correspond with the views they already have?

The Filter Bubble: What the Internet is Hiding From You. By Eli Pariser. Penguin Press; 294 pages.

LAK11 – Week 3: The Semantic Web: the Web as Database?

The keyword during week 3 of the LAK11 course is the Semantic Web (or, a bit less informatively, Web 3.0). A bunch of reading materials, ranging from the very accessible to the more technical, were out there to help us grasp these sprawling concepts.

Tim Berners-Lee, the inventor of the World Wide Web, recalls in his TED talk (highly recommended, btw) that he wrote his 1989 proposal for a linked information system out of frustration with his work as a software engineer at CERN. Confronted with all kinds of different data formats, information systems and isolated information, he wrote the proposal and the code for the web. Today he experiences a similar kind of frustration: not finding what he’s looking for. Out of this frustration, he advocates the creation of a semantic web on top of the current web.

[Figure; source: The Economist]

The number of web pages runs into the billions and will soon reach trillions. The HTML format in current use stores data as text files, which makes it unsuitable for data analysis. Search engines like Google, confronted with a diminishing advantage of their search algorithms, struggle to return meaningful results at these quantities. The Semantic Web is meant to bring more structure to the web, making it a bit more like a database. Tim Berners-Lee defines the Semantic Web as “a web of data that can be processed directly and indirectly by machines.”

The idea is that people publish their data in a more standardized format. The way to do this is by using ontologies, fixed ways of describing concepts (much like metadata). For example, if everyone used the same words to describe “rice”, it would be easier to connect information from different websites. Moreover, and that’s another important rationale for the Semantic Web, it would make it easier for machines to access the information and perform all kinds of queries on it.
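To make the “rice” example a bit more tangible, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two websites, their records, the figures and the `concept:rice` identifier are invented, and a real ontology is far richer than a lookup table. It only shows why agreeing on one identifier makes data from different sources easy to combine.

```python
# Hypothetical shared vocabulary: different local words map to one agreed concept.
SHARED_VOCABULARY = {
    "rice": "concept:rice",
    "riz": "concept:rice",    # French label
    "arroz": "concept:rice",  # Spanish label
    "wheat": "concept:wheat",
}

# Invented records from two imaginary websites that describe crops differently.
site_a = [{"crop": "rice", "tonnes": 120}, {"crop": "wheat", "tonnes": 50}]
site_b = [{"crop": "Riz", "tonnes": 80}]


def total_by_concept(*datasets):
    """Add up tonnes per shared concept across all datasets."""
    totals = {}
    for dataset in datasets:
        for record in dataset:
            concept = SHARED_VOCABULARY[record["crop"].lower()]
            totals[concept] = totals.get(concept, 0) + record["tonnes"]
    return totals


print(total_by_concept(site_a, site_b))
```

Without the shared vocabulary, a machine would have no way of knowing that “rice” and “Riz” describe the same thing.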

This extract from Wolfgang Greller’s blog illustrates nicely the difference between a traditional search engine and a semantic web query.

What semantic web technologies can do is relatively simple to show. Take this example: You know that your friend John has a brother living in South America, but you can’t remember his name. Typing “brother of John” into a traditional search engine won’t work. All it will return is documents that contain the words ‘brother’ and ‘John’ or the exact phrase ‘brother of John’. The Semantic Web “knows” about relations, hence it would return a result saying ‘brother of John’ = ‘Kendon’. It works in exactly the same way for ‘capital of France’ = ‘Paris’; or ‘other words for red’ = ‘crimson’, ‘ruby’, etc. Semantic search engines can do this, based on a vocabulary of relations. This not only stores the words themselves, but also the way in which they relate to each other, i.e. ‘goose’ is a sub-item to ‘bird’.
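The difference described in the extract can be sketched in a few lines of Python. This is a toy illustration, not how a real semantic search engine is built: the documents, the relation names (`has_brother`, `is_a`, and so on) and both functions are made up for the purpose, with the facts taken from the quoted example.

```python
# A toy "vocabulary of relations": each fact is a (subject, relation, object) triple.
TRIPLES = [
    ("John", "has_brother", "Kendon"),
    ("France", "has_capital", "Paris"),
    ("red", "has_synonym", "crimson"),
    ("red", "has_synonym", "ruby"),
    ("goose", "is_a", "bird"),
]

# Invented documents for the keyword search to look through.
DOCUMENTS = [
    "John wrote to his brother last week",
    "The brother of John lives in South America",
]


def keyword_search(documents, query):
    """Return the documents that contain every word of the query."""
    words = query.lower().split()
    return [d for d in documents if all(w in d.lower().split() for w in words)]


def relation_lookup(subject, relation):
    """Return every object linked to the subject by the given relation."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


print(keyword_search(DOCUMENTS, "brother of John"))  # documents, not an answer
print(relation_lookup("John", "has_brother"))        # the answer itself
print(relation_lookup("red", "has_synonym"))
```

The keyword search can only hand back texts that mention the words; the relation lookup returns the name, because the relation itself is stored.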

Of course, if only you or I were to put our data online in a standardized format, it wouldn’t make much sense, since there would be nothing to link to. A critical mass of data published in a standardized format is necessary for scale advantages to come into play. The more connected data become, the more powerful the Semantic Web gets. This concept is called Linked Data. In a linked data model, things are uniquely identified, can be looked up, provide useful information when looked up, and themselves link to other uniquely identified things. Because this interconnected data is structured, it allows computers to make complex connections.
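The four properties of Linked Data can be mimicked with a small Python sketch. It is only an analogy under stated assumptions: the `example.org` identifiers, the `brother` and `livesIn` link names and the in-memory dictionary are all hypothetical, whereas on the real web a look-up would be an HTTP request to the identifier.

```python
# Each "thing" has a unique identifier, can be looked up, returns useful
# information, and links to other identified things. All URIs are hypothetical.
DATA = {
    "http://example.org/id/john": {
        "label": "John",
        "links": {"brother": "http://example.org/id/kendon"},
    },
    "http://example.org/id/kendon": {
        "label": "Kendon",
        "links": {"livesIn": "http://example.org/id/south-america"},
    },
    "http://example.org/id/south-america": {
        "label": "South America",
        "links": {},
    },
}


def look_up(uri):
    """Stand-in for fetching the description published at a URI."""
    return DATA[uri]


def follow(uri, *relations):
    """Start at a URI, hop along the named links, and return the final label."""
    for relation in relations:
        uri = look_up(uri)["links"][relation]
    return look_up(uri)["label"]


print(follow("http://example.org/id/john", "brother", "livesIn"))
```

Following two links from John answers “where does John’s brother live?”, a connection that neither record holds on its own.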

A range of government-funded initiatives are being built; some of them can be found on Freebase. The astronomy database, for example, not only lets you retrieve information about galaxies, but also lets you plot a distance distribution of galaxies. Openstreetmap.org lets you edit geographical information in a standard format and contribute to an online mapping system. DBpedia aims at structuring the information on Wikipedia in a semantic web-friendly format, enabling queries and linking it to other data sources on the web.

The DBpedia knowledge base currently describes more than 3.5 million things, out of which 1.67 million are classified in a consistent Ontology, including 364,000 persons, 462,000 places, 99,000 music albums, 54,000 films, 17,000 video games, 148,000 organisations, 169,000 species and 5,200 diseases. The DBpedia data set features labels and abstracts for these 3.5 million things in up to 97 different languages (DBpedia homepage).

The Semantic Web is built on top of the current web and uses RDF, a data model commonly written in an XML-based syntax, to formally define and connect the data.

The Semantic Web is generally built on syntaxes which use Uniform Resource Identifiers (URIs) – similar to URLs – to represent data, usually in triple-based structures: i.e. many triples of URI data that can be held in databases, or interchanged on the World Wide Web using a set of particular syntaxes developed especially for the task. These syntaxes are called “Resource Description Framework” (RDF) syntaxes. (paraphrased from The Semantic Web: An Introduction)
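To give an idea of what such triples look like when written down, the sketch below prints two of them in N-Triples, one of the plain-text RDF syntaxes. The `example.org` identifiers are hypothetical; the `name` and `knows` properties come from the FOAF vocabulary. The helper `to_ntriples` is my own simplification: it treats any value starting with `http://` or `https://` as a URI and everything else as a plain text literal, which a real RDF library would handle far more carefully.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# (subject, predicate, object) triples; the example.org URIs are made up.
TRIPLES = [
    ("http://example.org/id/john", FOAF + "name", "John"),
    ("http://example.org/id/john", FOAF + "knows", "http://example.org/id/kendon"),
]


def to_ntriples(triples):
    """Write triples as N-Triples lines: <subject> <predicate> object ."""
    lines = []
    for subject, predicate, obj in triples:
        if obj.startswith(("http://", "https://")):
            # Simplifying assumption: anything that looks like a URL is a URI.
            rendered = f"<{obj}>"
        else:
            # Otherwise a plain literal, with backslashes and quotes escaped.
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


print(to_ntriples(TRIPLES))
```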

For learning, the Semantic Web holds the potential for retrieving more relevant information more easily, both by people and by intelligent agents and tutors. Tagging systems could evolve into ontologies, with everyone using an identical set of tags to describe websites’ content.

However, will the Semantic Web concept succeed in bringing more order and structure to the web? Will people without database expertise be convinced to spend time entering their data in a standardized format? Data descriptions may have different meanings to different people. A small molecule can mean a few molecules to a chemist and 10,000 molecules to a biochemist (Laurence Cuffe on the course’s Moodle forum). Constructing queries on Linked Data requires proficiency in a query language such as SPARQL. As Tanya Elias put it on the forum:

Computers do some things very well: gather data, remember stuff, calculate and crunch data – Syntax-related stuff.  People tend to do other things well: thinking, analyzing, sorting and determining – semantics-related stuff.

I may simply be out in left field, but it seems to me that people spending a whole lot of time and effort developing a syntax that enables people to code semantics in a simple enough way for a computer to understand really has both the people and the machines playing to their weaknesses – in my experience not usually a recipe for a successful outcome.
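To give a feel for the kind of query proficiency mentioned above, here is a toy Python sketch of what a SPARQL engine does conceptually: it matches patterns containing variables (written `?name`) against stored triples. The data and the functions `match_pattern` and `query` are invented for illustration and cover only a tiny fraction of what SPARQL offers. In SPARQL itself, and assuming hypothetical `example.org` identifiers, the same question would read roughly: `SELECT ?place WHERE { <http://example.org/id/john> <http://example.org/vocab/brother> ?b . ?b <http://example.org/vocab/livesIn> ?place . }`

```python
# Invented triples, using short names instead of full URIs for readability.
TRIPLES = [
    ("john", "brother", "kendon"),
    ("kendon", "livesIn", "south_america"),
    ("john", "friend", "mary"),
]


def match_pattern(pattern, triple, bindings):
    """Return bindings extended so that pattern matches triple, or None."""
    new = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable: bind it, or check it against its earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def query(patterns, triples):
    """Find every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match_pattern(pattern, triple, bindings)) is not None
        ]
    return solutions


# "Where does John's brother live?" expressed as two linked patterns.
print(query([("john", "brother", "?b"), ("?b", "livesIn", "?place")], TRIPLES))
```

Even in this stripped-down form, the person asking has to know the exact relation names and how to chain them, which is the hurdle for non-experts raised above.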

Will the web retain its open character? Is there a single right way to categorize information, and does it not change continuously? How does the Semantic Web relate to the social web, for example the collaborative tagging of Delicious and Diigo? I guess we are still waiting for more powerful illustrations of the Semantic Web’s potential. In the meantime, however, search engines such as Google already incorporate the Semantic Web into their search algorithms. Maybe the introduction of the Semantic Web will go largely unnoticed by most users.