LAK11 – Week 3: The Semantic Web: the Web as Database?

The keyword during week 3 of the LAK11 course is the Semantic Web (or, a bit less informatively, Web 3.0). A range of reading materials, from the very accessible to the more technical, was out there to help us grasp these sprawling concepts.

Tim Berners-Lee, the father of the World Wide Web, recalls in his TED talk (highly recommended, btw) that he wrote his 1989 proposal to set up a linked information system out of frustration with his work as a software engineer at CERN.  Confronted with all kinds of different data formats, information systems and isolated information, he wrote the proposal and the code for the Web.  Today he experiences a similar kind of frustration: the frustration of not finding what he’s looking for.  From this frustration, he advocates the creation of a Semantic Web, on top of the current web.

(Image source: The Economist)

The number of web pages runs into the billions, soon even trillions.  The currently dominant HTML format stores data as text documents, making it unsuitable for data analysis.  Search engines like Google, confronted with the diminishing advantage of their search algorithms, struggle to render meaningful results at these quantities.  The Semantic Web is a tool to bring more structure to the internet, actually making it a bit more like a database.  Tim Berners-Lee defines the Semantic Web as “a web of data that can be processed directly and indirectly by machines.”

The idea is that people publish their data in a more standardized format.  The way to do this is by using ontologies, a fixed way of describing concepts (like metadata).  For example, if everyone were using the same words to describe “rice”, it would be easier to connect information from different websites.  Moreover, and that’s another important rationale for the Semantic Web, it would make it easier for machines to access the information and perform all kinds of queries on it.

This extract from Wolfgang Greller’s blog illustrates nicely the difference between a traditional search engine and a semantic web query.

What semantic web technologies can do is relatively simple to show. Take this example: You know that your friend John has a brother living in South America, but you can’t remember his name. Typing “brother of John” into a traditional search engine won’t work. All it will return is documents that contain the words ‘brother’ and ‘John’ or the exact phrase ‘brother of John’. The Semantic Web “knows” about relations, hence it would return a result saying ‘brother of John’ = ‘Kendon’. It works in exactly the same way for ‘capital of France’ = ‘Paris’; or ‘other words for red’ = ‘crimson’, ‘ruby’, etc. Semantic search engines can do this, based on a vocabulary of relations. This not only stores the words themselves, but also the way in which they relate to each other, i.e. ‘goose’ is a sub-item to ‘bird’.
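Greller’s example can be sketched in a few lines of code.  This is only a toy model with invented names and data (no real semantic web stack involved): store explicit subject–predicate–object triples and query the relation instead of matching keywords.

```python
# A toy illustration of the idea behind semantic queries: instead of
# matching keywords, we store explicit (subject, predicate, object)
# triples and query the *relations*. All names and data are invented;
# real systems use RDF and SPARQL.
TRIPLES = {
    ("John", "brother", "Kendon"),
    ("Kendon", "lives_in", "South America"),
    ("France", "capital", "Paris"),
    ("goose", "subclass_of", "bird"),
}

def query(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

print(query("John", "brother"))      # ['Kendon']
print(query("France", "capital"))    # ['Paris']
```

A traditional keyword search over documents cannot answer “brother of John”; a relation lookup can, because the relation itself is part of the data.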

Of course, if only you or I were to put our data online in a standardized format, it wouldn’t make much sense, since there would be nothing to link to.  A critical mass of data published in a standardized format is necessary for scale advantages to come into play.  The more connected data become, the more powerful the Semantic Web gets. This concept is called Linked Data.  In a linked data model, things are uniquely identified, capable of being looked up, provide useful information when looked up, and themselves link to other uniquely identified things. Because this interconnected data is structured, it allows computers to make complex connections.

A range of government-funded initiatives is being built; some of them can be found on Freebase. The astronomy database, for example, not only lets you retrieve information about galaxies, but also lets you make a graph of the distance distribution of galaxies. Another initiative lets you edit geographical information in a standard format and contribute to an online mapping system.  DBpedia aims at structuring the information on Wikipedia in a Semantic Web-friendly format, enabling queries and linking it to other data sources on the web.

The DBpedia knowledge base currently describes more than 3.5 million things, out of which 1.67 million are classified in a consistent ontology, including 364,000 persons, 462,000 places, 99,000 music albums, 54,000 films, 17,000 video games, 148,000 organisations, 169,000 species and 5,200 diseases. The DBpedia data set features labels and abstracts for these 3.5 million things in up to 97 different languages. (DBpedia homepage)

The Semantic Web is built on top of the current web and uses an XML-based language called RDF to formally define and connect the data.

The Semantic Web is generally built on syntaxes which use Uniform Resource Identifiers (URIs) – similar to URLs – to represent data, usually in triple-based structures: i.e. many triples of URI data that can be held in databases, or interchanged on the World Wide Web using a set of particular syntaxes developed especially for the task. These syntaxes are called “Resource Description Framework” (RDF) syntaxes.  (paraphrased from The Semantic Web: An Introduction)
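To make the role of URIs concrete, here is a minimal sketch (invented example URIs and data, not real RDF tooling) of why shared identifiers let independently published datasets merge mechanically:

```python
# Two hypothetical datasets that happen to use the same URI for "Paris".
# Because the identifier is shared, merging their facts is a plain set
# union -- no fuzzy name matching needed. URIs and figures are invented.
dbpedia_like = {
    ("http://example.org/resource/Paris", "isCapitalOf",
     "http://example.org/resource/France"),
}
geo_like = {
    ("http://example.org/resource/Paris", "population", "2100000"),
}

merged = dbpedia_like | geo_like   # same URI, so the facts combine

# Everything now known about Paris, across both sources:
paris = sorted((p, o) for s, p, o in merged
               if s == "http://example.org/resource/Paris")
print(paris)
```

This is the mechanical heart of Linked Data: agree on identifiers, and integration becomes a union instead of a reconciliation project.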

For learning, the Semantic Web holds the potential for retrieving more relevant information more easily, both by people and by intelligent agents and tutors. Tagging systems could evolve into ontologies, with everyone using an identical set of tags to describe websites’ content.

However, will the Semantic Web concept succeed in bringing more order and structure to the web?  Will people without database expertise be convinced to spend time entering their data in a standardized format?  Data descriptions may have different meanings to different people: a small molecule can mean a few molecules to a chemist and 10,000 molecules to a biochemist (Laurence Cuffe on the course’s Moodle forum).  Constructing queries on Linked Data requires proficiency in query syntax such as SPARQL.  As Tanya Elias put it on the forum:

Computers do some things very well: gather data, remember stuff, calculate and crunch data – Syntax-related stuff.  People tend to do other things well: thinking, analyzing, sorting and determining – semantics-related stuff.

I may simply be out in left field, but it seems to me that people spending a whole lot of time and effort developing a syntax that enables people to code semantics in a simple enough way for a computer to understand really has both the people and the machines playing to their weaknesses – in my experience not usually a recipe for a successful outcome.

Will the web retain its open character?  Is there a single right way to categorize information, and does it not change continuously?  How does the Semantic Web relate to the social web, for example the collaborative tagging of Delicious and Diigo?  I guess we’re still waiting for more powerful illustrations of the Semantic Web’s potential.  In the meantime, however, search engines such as Google already incorporate Semantic Web techniques into their search algorithms.  Maybe the introduction of the Semantic Web will go largely unnoticed by most users.

#LAK11 – Week 3 – Signal of the Future?

In this week’s guest lecture Kimberly Arnold from Purdue University talked about the Signals project.  It is a concrete example of learning analytics (or academic analytics, the term she used), trying to make sense of the heaps of available data and making it accessible to learners, tutors and administrators in order to improve learning.

They aim to go beyond capturing loads of data and reporting on them, instead using the data to build a predictive model for student success and acting upon it.  The model is based on linear regression, using both real-time data from technologies (clicks, log-in times, etc.) and static socio-economic data (former grades, age, etc.).  They obtained a decent correlation of approximately 60%.

Note: there were some concerns about the validity of using linear regression here, since some variables may have a non-normal distribution.  Non-parametric methods may be better.

An important rationale for building the model is to improve student learning by providing timely and accurate warnings for students at risk (yellow and red lights) and suggesting actions to tutors and learners.  A crisis of under-prepared freshmen in the States (that sounds familiar) underscores the need for such a warning system.  Economic reasons are not far away either, since dropped-out students cost the university recruitment and marketing money.
Preliminary data from 2 years show a positive impact on retention rates. Students seem to like the additional feedback, although students at risk do not always respond to the warning signals provided.  Privacy doesn’t appear to be an issue for many students. They may well turn out to be less worried about providing personal details, and about technology making use of their information, than older generations.
The tool was acquired by SunGard and can only be integrated with Blackboard, reducing its potential use.  The system also assumes that most learning takes place within a learning-management system (LMS), making it less useful in a distributed environment where learners use multiple online and offline tools for their learning.  Getting all types of data from various departments and sources together in a harmonized format also proves challenging, and makes one curious about a cost-benefit analysis of the programme.

#LAK11 – Week 2 – Educational Data Mining

Week 2 of the Learning Analytics (LAK11) course focuses on the emerging field of educational data mining (EDM).
Data mining tries to make sense of huge amounts of data.  These can be in tabular format, but more and more data are in a messier or text-like format, like status updates (Facebook), tweets (Twitter) or the detailed monitoring data of an LMS.  Educational data mining takes into account the specificities of educational data, such as their multi-level hierarchy and non-independence.  This is similar to how spatial data mining addresses the specifics of spatial information, such as spatial autocorrelation.  In an interesting talk on YouTube, the CEO of Cloudera, Mike Olson, lays out the potential of distributed computing frameworks such as Hadoop and MapReduce (Google) to deal with machine-produced data.  Distributed computing means that data and computing are spread out over many servers.
“Terabytes are not hard to get, they are hard not to get” (Mike Olson, CEO of Cloudera)
The objective of EDM is to use data mining methods to better understand student learning. This can be both from an institutional perspective (increasing efficiency, grouping students, predicting student performance, etc.) and from a learner perspective (providing better tutoring and a more personalized learning experience). Teachers can see whether a learning community is being formed, or can monitor the effect of learning interventions, for example the introduction of e-portfolio-like instruments.
A fair amount of discussion evolved around the difference between learning analytics and educational data mining, without yielding a clear consensus (even Ryan Baker admitted to not having a clear distinction).  It seems that learning analytics is not limited to data mining methods, but also involves more qualitative methods.  In this view EDM is a part of learning analytics. Other authors argue that learning analytics includes feeding the information back in order to improve learning.
For example, social network analysis (SNA) visualizes social network activity.  In an educational context it can provide information on the degree distribution of the network (a measure of how centralized it is), on possible outliers (students who are not connected to others and are at risk of dropping out) and on the existence of subgroups in the network.  A software tool to map forum activity on Moodle, Blackboard and the like is SNAPP, a free software programme developed by the University of Wollongong.  SNAPP allows visualizing the network of interactions resulting from discussion forum posts and replies. It’s a simple browser plugin and doesn’t require manually loading data.  Below are a screenshot from a forum discussion and another, more integrated network from the homepage.  The difference in interaction is very clear.
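A crude sketch of the kind of computation behind such visualizations (this is not SNAPP’s code; names and reply pairs are invented): count each student’s interactions and flag the unconnected ones.

```python
# Count how often each student interacts (replies made or received)
# from forum reply pairs, and flag students with no connections at all
# as potential at-risk outliers. All names and edges are invented.
from collections import Counter

# (replier, original poster) pairs from a hypothetical forum
replies = [("Ana", "Ben"), ("Ben", "Ana"), ("Cid", "Ana"),
           ("Ana", "Cid"), ("Ben", "Cid")]
students = {"Ana", "Ben", "Cid", "Dee"}        # Dee never interacts

degree = Counter()
for replier, poster in replies:
    degree[replier] += 1
    degree[poster] += 1

isolated = sorted(s for s in students if degree[s] == 0)
print(dict(degree), isolated)
```

A visual tool draws this as a graph; the underlying signal is the same: isolated nodes are the students a tutor may want to reach out to.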
Data for EDM come from online learning management systems (LMS) such as Moodle or Blackboard.  More standardized exams and the increased use of LMS and internet-based software result in more, and more interoperable, data being available.  When learning becomes more distributed, with learners using a variety of places and platforms, collecting and integrating all the data becomes more challenging.  A particular point of interest is the purpose of EDM.  Ryan Baker classified work in EDM as follows:
  • Prediction
    • Classification
    • Regression
    • Density estimation
  • Clustering
  • Relationship mining
    • Association rule mining
    • Sequential pattern mining
    • Causal data mining
  • Distillation of data for human judgment
  • Discovery with models
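One item from the list, association rule mining, can be illustrated in a few lines.  The records and the rule below are invented; real EDM work would use far larger data and algorithms such as Apriori.

```python
# Association rule mining in miniature: compute support and confidence
# for the invented rule "watched the video -> passed the quiz" over
# invented per-student activity records.
records = [
    {"video", "quiz_pass"}, {"video", "quiz_pass"}, {"video"},
    {"quiz_pass"}, {"video", "quiz_pass"}, set(),
]

def support(itemset):
    """Fraction of records containing every item in `itemset`."""
    return sum(itemset <= r for r in records) / len(records)

def confidence(antecedent, consequent):
    """P(consequent | antecedent), estimated from the records."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"video", "quiz_pass"}))        # 3/6 = 0.5
print(confidence({"video"}, {"quiz_pass"}))   # 3/4 = 0.75
```

Support says how often the pattern occurs at all; confidence says how predictive the antecedent is, which is what makes a rule actionable.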

Discovery with models is perhaps most typical for EDM. It is described by Baker as:
In discovery with models, a model of a phenomenon is developed through any process that can be validated in some fashion, and this model is then used as a component in another analysis, such as prediction or relationship mining. (…) supporting sophisticated analyses such as which learning material sub-categories of students will most benefit from (Beck & Mostow, 2008), how different types of student behavior impact students’ learning in different ways (Cocea et al., 2009) and how variations in intelligent tutor design impact students’ behavior over time (Jeong & Biswas, 2008).
An interesting case study presented by Baker was an investigation of “gaming” and “off-task behavior” (like surfing on Twitter or Facebook) during online learning tasks.  Gaming means abusing the system, such as systematic guessing (clicking quickly on all the answer categories without thinking) or immediately hitting the “help” function.  They built an algorithm to predict when students were gaming or engaging in off-task behavior, and tuned the model by observing students.  Then they related the frequency of gaming to characteristics of a learning environment for a mathematics course.  They included 76 variables in the analysis.  Some variables, such as the use of non-mathematical words in the question or the use of abstract language, proved to significantly influence the degree of off-task behavior and gaming.  The next step is to improve the online tasks based on the results of the analysis, and to verify whether gaming and off-task behavior drop.
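As a hedged sketch of what such a detector might look for (Baker’s actual detector is a trained model, not this fixed rule; the thresholds and data here are invented): flag runs of implausibly fast answers as possible systematic guessing.

```python
# A toy, rule-based "gaming" detector: if several consecutive answers
# arrive faster than a human could plausibly read the question, flag
# the sequence as possible systematic guessing. Thresholds invented.
def looks_like_gaming(answer_gaps, min_seconds=3.0, streak=3):
    """True if `streak` consecutive answer gaps are under `min_seconds`."""
    fast = 0
    for gap in answer_gaps:
        fast = fast + 1 if gap < min_seconds else 0
        if fast >= streak:
            return True
    return False

print(looks_like_gaming([12.0, 1.1, 0.8, 0.9, 15.0]))  # True: rapid run
print(looks_like_gaming([12.0, 8.5, 6.2, 9.9]))        # False
```

A trained model would replace the fixed thresholds with parameters fitted against human observations of students, as described above.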
EDM methods don’t provide answers to the why question, but offer clues that can be validated by complementary qualitative analysis.  A hypothesis for the above example could be that abstract language or non-mathematical terms confuse the students or bore them, triggering off-task behavior.

# LAK11 Week 2 The Rise of Big Data in Education

Learning analytics is frequently hailed as the ultimate arrival of a data-driven paradigm in education.  The successful application of randomized trials in medicine, for example, could be transplanted to education.  Randomized trials mean that a sample is randomly divided into a group that gets the treatment and a control group that doesn’t (but gets a placebo).  Analysis of variance reveals whether the treatment works or not.  Ian Ayres, a law professor at Yale Law School, shows in his book Super Crunchers that randomized trials already have wide appeal outside medicine.  eHarmony, a matching agency, uses large data sets and regression analysis to analyze which combinations of personal traits match in order to present people with a better match.  Internet firms such as Amazon and Pandora provide entertainment advice based on existing tastes or purchases.  IBM’s “Smarter Planet” programme provides plenty of examples of using analytics in city planning or environmental management.
Drivers are bigger and better data sets, since data are more frequently collected in real time by machines, making them more reliable than data collected by questionnaires.  Collecting and storing them also becomes cheaper.  In education, randomized trials can analyze the effect of certain learning resources, like a video or an animation, on student learning.  Changes in curriculum structure could be analyzed quantitatively.
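The analysis of a randomized trial can be sketched with a standard-library permutation test on invented scores (real studies would use larger samples and methods such as the analysis of variance mentioned above):

```python
# Permutation test for a hypothetical randomized trial: did students
# who saw an animation score higher than the control group? Scores
# are invented; the test asks how often random relabelling of the
# pooled scores produces a difference at least as large as observed.
import random

treatment = [72, 78, 81, 69, 75, 80]   # saw the animation
control   = [65, 70, 68, 72, 66, 71]   # did not

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

random.seed(0)                          # reproducible resampling
pooled = treatment + control
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[:6]) / 6 - sum(pooled[6:]) / 6
    if diff >= observed:
        count += 1

p_value = count / trials
print(round(observed, 2), p_value)
```

A small p-value means a difference this large rarely arises from random group assignment alone, which is the whole point of randomizing in the first place.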
Ayres points out that randomized trials and regression often do a better job than experts.  Quantitative analysis of Supreme Court verdicts proved a better predictor than the judgments of experts.  A regression model with three variables, developed by Orley Ashenfelter, an economics professor at Princeton, for predicting the quality of Bordeaux wines, outperformed wine experts.
The main reason is that experts can’t quantify the role of variables.  Of course, experts also know that winter rainfall and temperature affect the quality of wine, but they can’t accurately assign weights to their influence.  The more complex the situation, the better the model performs and the worse experts are at predicting.

For all its successes, though, statistical analysis continues to face tremendous skepticism and even animosity. For one thing, Ayres notes, statistics threaten the “informational monopoly” of experts in various fields. But even to many people without a vested interest, relying on cold, hard numbers rather than human instinct seems soulless.

Learning analytics raises other critiques as well.  Privacy is a major issue.  Do learners have the right to access the data that are gathered about them, or do they have the right to refuse that data are collected about them?  For example, students could reasonably be skeptical about data being collected on their off-task behavior. Another issue is that there might be a tendency to take only those elements into account that can be measured.  This is called the McNamara fallacy, and in its original form it says:
The first step is to measure whatever can be easily measured. This is ok as far as it goes. The second step is to disregard that which can’t be easily measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can’t be measured easily really isn’t important. This is blindness. The fourth step is to say that what can’t be easily measured really doesn’t exist. This is suicide.

Some elements in education are difficult to measure, like whether deeper learning takes place, how learners are engaged with the materials or what the longer term effects are of learning interventions.
Predictive models will also do a worse job when the field or subject is evolving quickly.  In Cambodia, for example, characteristics of student populations are changing rapidly.  They become richer, and more technologically savvy and literate.  Predictive models in this context would need to be updated very frequently.
Other critiques formulated on the Moodle forum, and summarized by George Siemens, seemed to be more emotionally inspired, and I found them less grounded.  Learning analytics doesn’t mean that all complexity is reduced to numbers, nor that ambiguity is no longer accepted.  Ayres pointed out that models not only predict, but also indicate the precision of their predictions.  If the phenomenon is very difficult to predict, the correlation will say so.  This counters another critique, that models don’t take the uniqueness of humans into account.  Models and correlations can be wrongly interpreted, and uncertainty in the data can be ignored, but this is hardly a problem specific to learning analytics alone.  It is a misconception that expressing things in numbers automatically conceals uncertainty.
This is not to say that statistics cannot be misused, intentionally or unintentionally.  Viplav Baxi provides an excellent overview on the LAK11 forum:
At the very basic level, there are many arguments for or against statistical analyses and other forms of analytics (such as those generated by “intelligent” systems). The arguments address generalizability (do the analytics imply that we can take general actions and predict outcomes), appropriateness (are the analytics appropriate to generate for the domain under consideration), accuracy (did we have enough information, did we choose the right sample), interpretation (can we rely on automated analytics or do we need manual intervention or both), bias (analytics used to support an underlying set of beliefs), method (were the methods and assumptions correct), predictive power (can the analytics give us sufficient predictive power) and substantiation (are the analytics supported by other empirical evidences). 

An interesting quote came from Chris Lott, who fears that learning analytics will become the new buzzword on the block and may spawn another “cottage industry of repetitive pundits”.  Indeed, who will interpret all the data gathered during an online course?  Teachers, tutors, administrators, external companies or… the learners themselves?  Tony Hirst expresses the desire that learning analytics be used creatively to give learners more control over their learning.
To me, two weeks of reading about learning analytics have offered a tantalizing glimpse of its potential, without forgetting some concerns.  Or, to end with a quote from Bill Fitzgerald, “I would hope we could outgrow our pursuit of silver bullets”.

#LAK11 Learning Analytics – Week 1

The first week of the LAK11 course was spent exploring the field of learning analytics.  Its relations with domains such as business intelligence and web analytics were outlined (Elias, 2011), as well as its close relationship with academic analytics and educational data mining (EDM).
Learning analytics could be defined as the use of educational data to improve learning.  Course facilitator George Siemens (TEKRI, Athabasca University) puts it this way:
“Learning analytics is the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs.”

Online learning produces a wealth of student data, such as time spent on certain modules, number of logins, number of posts on forums and social interactions with other students and tutors.  Academic institutions can use these data to improve learning.  For example, they can provide an “early warning system” and predict which students are likely to fail their exams.  These students could then get extra support (Macfadyen & Dawson, 2010).  By coupling learning data with socio-economic data and demographics, predictions about student success can be made.  Purdue University’s Signals project is a flagship case in this regard.

Recommender systems, such as those used by e-commerce firms like Amazon, are regularly mentioned as one of the potential applications of learning analytics.  For example, based on the sources accessed or the links in their social network, students could get recommendations about potentially interesting articles, blogs or people.

Looking for meaningful patterns in large sets of educational data is called educational data mining.  Learning analytics, however, includes using these data to intervene in the learning process, like altering the course content, the provision of support or the use of tools.

(Figure: Siemens, 2010)

Learning analytics could (and should) also be student-centered.  This means that students could be granted access to course data. For instance, they could see how much time they’ve spent on various course activities and compare it with their peers.  The question of what students want generated discussion on the course forums.  The idea, outlined by John Fritz in his presentation, was that students take more responsibility for their own learning and strengthen their meta-cognitive abilities.  They could get access to the data, but it would be their responsibility to interpret them and act upon them.  However, most institutions are still in the phase of collecting heaps of data and analyzing them, without really predicting and modeling behavior, or using them to optimize learning.
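The peer-comparison idea can be sketched very simply (all numbers invented; a real dashboard would add context and think carefully about how comparisons affect motivation):

```python
# A student-facing view in miniature: show a learner where their
# time-on-task sits relative to peers, as a percentile rank.
def percentile_rank(value, peers):
    """Share of peers with a value at or below `value`, in percent."""
    return 100 * sum(p <= value for p in peers) / len(peers)

# Hypothetical weekly study hours for a small course
peer_hours = [2.5, 4.0, 1.0, 6.5, 3.0, 5.5, 7.0, 2.0]
print(percentile_rank(4.0, peer_hours))   # this learner is mid-pack
```

Handing students this number, rather than only giving it to administrators, is exactly the shift toward learner responsibility that Fritz describes.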

“Institutions can’t absolve students from at least partial responsibility for their own education. To do so denies both the right of the individual to refuse education and the right of the institution to be selective in its judgments as to who should be further educated. More importantly, it runs counter to the essential notion that effective education requires that individuals take responsibility for their own learning.” (p. 144)

Vincent Tinto, Leaving College: Rethinking the causes and cures of student attrition (1993)

An interesting discussion focused on the issue of distributed analytics.  It is easier to collect data when all student activity is concentrated in one platform or learning management system (LMS).  However, when students are encouraged to use a variety of tools, sources and interaction platforms, gathering meaningful data becomes more difficult.  A second issue is privacy, in particular when other students also get access to the data.

My first impressions of the MOOC: overwhelming, chaotic and high-quality.  The amount of e-mails and forum posts is staggering, and different discussions are taking place simultaneously.  However, the purpose of a MOOC is not really to participate in everything, but rather to be selective.  In that way, a MOOC is a great way to get lectures, information and feedback from some of the leading researchers in the field.  You are stimulated to read the materials and try to make sense of them at your own pace and on your own knowledge level.  A next step is then to create something (like a forum post), share it and get into contact with “like-minded souls”.  We’ll see how that plays out.

What a MOOC is like

A MOOC is a Massive (in various degrees of massiveness), Open and Online Course.  One MOOC has started this week on Learning Analytics & Knowledge (LAK11).  Another one, on Connectivism & Connective Knowledge (CCK11) is starting next week.  MOOCs are offered in various domains from education to ICT to biology.  MOOCs are definitely on the rise.  
MOOCs, what are they and where do they come from?

In 2008, George Siemens and Stephen Downes were teaching a class on learning theory at the University of Manitoba. Rather than limit access to the lectures to the 25 students registered for the course, they allowed the general public to attend virtually. The result was that more than 2,300 people participated in the course.

First, they are massive.  They tend to attract hundreds or, for some courses, even more than a thousand participants, although some may participate only passively or drop out before the end of the course.
Second, they are open.  This means that they are free, that there are no entry requirements, that there is no formal trajectory that needs to be followed, and that all activity is voluntary.  There is also no accreditation, apart from the appreciation of fellow learners; taking a course for credit is sometimes offered optionally for a fee. The courses are very participatory, without fixed assignments, but with an invitation to engage in discussions and build networks.
Finally, they are online.  All activity takes place online, usually through a combination of synchronous activity (online lectures and discussions using software platforms such as Elluminate) and asynchronous activity (blog posts, forums, e-mail newsletters, Twitter messages, status updates).   Software programmes like Moodle and tools like Netvibes allow keeping track of all the activity going on.
Below is a short video from Dave Cormier on the essence of a MOOC.

Are they successful?  I’m trying it out, and keeping you posted.

Another blog in the wall

The direct occasion for starting this blog is my decision to enroll in the course “Technology-Enhanced Education” (H800 for insiders) at the Open University in the UK.   The course fits within its Master in Online and Distance Education programme and investigates the many ways technology can influence education and how it interacts with classroom pedagogy.  The course starts on February 5 and ends on October 31.  60 credit points of the required 180 are at stake.
To get myself ready for the course, I’ve taken a couple of actions:
  • Start up a blog, which you’re now reading;
  • Get myself a Twitter account (@stefaanvw) and get familiar with it (not to say addicted to);
  • Enroll in two MOOCs.  MOOCs are Massive Open and Online Courses.  I’ll discuss them in more detail below.  Here and here are a few examples.
  • Get familiar with some popular tools in education, such as Prezi (for presentations, great), Hunch (for buying suggestions, not so great) and Slideshare (sharing PowerPoint presentations, very useful).
The course at the Open University is completely online, so there are no sessions “on location”.  Grading is based on continuous assessment through contributions on forums and blogs, and on 4 major assignments.
Blogging is required for the Open University course and recommended for the MOOCs, for various reasons:
–  It’s an instrument for personal reflection and remixing of the sources.  Content in open courses and MOOCs tends to be offered not through a centrally structured unit such as a course text, but through a tsunami of links, reports, articles and blogs, and it’s up to the learner to select, read, filter and make sense of it.  A blog can be helpful with this and invite outsiders to read, share and comment.
–  It’s a communication tool with other students.  Learners come from all over the world.  A blog offers insight into how other learners are trying to make sense of the information and an opportunity to engage in meaningful discussion.
–  It’s an inventory of learning materials read and processed.  It enables you to trace back what you have read before and what your thoughts were on it.

The blog is intended for fellow learners and tutors at the H800 course, fellow learners at the MOOCs, and of course anyone who might be interested.  Although most of the posts will be related to technology and education, I do want to throw in regular posts on science education and the state of education in Cambodia (and by extension South-East Asia).