Week 2 of the Learning Analytics (LAK11) course focuses on the emerging field of educational data mining (EDM).
Data mining tries to make sense of huge amounts of data. These can be in tabular format, but increasingly data come in a messier, text-like format, such as status updates (Facebook), tweets (Twitter) or the detailed monitoring data of an LMS. Educational data mining takes into account the specificities of educational data, such as their multi-level hierarchy and non-independence. This is similar to how spatial data mining addresses the specifics of spatial information, such as spatial autocorrelation. In an interesting talk on YouTube, Mike Olson, CEO of Cloudera, lays out the potential of distributed computing frameworks such as Hadoop (an open-source implementation of Google's MapReduce) to deal with machine-produced data. Distributed computing means that data and computation are spread out over many servers.
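To make the MapReduce idea concrete, here is a minimal sketch of the programming model in plain Python: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. A real framework like Hadoop runs these steps in parallel across many servers; the toy log lines below are invented for illustration.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group values by key, as the framework would do across servers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical machine-produced activity logs.
logs = ["student opened quiz", "student submitted quiz", "teacher opened forum"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts["student"])  # 2
```

The same three-step structure scales from this toy word count to terabytes of LMS logs, because each step only needs a slice of the data at a time.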
“Terabytes are not hard to get, they are hard not to get” (Mike Olson, CEO of Cloudera)
The objective of EDM is to use data mining methods to better understand student learning. This can be both from an institutional perspective (increasing efficiency, grouping students, predicting student performance etc.) and from a learner perspective (providing better tutoring and a more personalized learning experience). Teachers can see whether a learning community is being formed, or can monitor the effect of learning interventions, for example the introduction of e-portfolio-like instruments.
A fair amount of discussion revolved around the difference between learning analytics and educational data mining, without yielding a clear consensus (even Ryan Baker admitted he has no clear distinction). It seems that learning analytics is not limited to data mining methods but also involves more qualitative methods; in this view, EDM is a part of learning analytics. Other authors argue that learning analytics also includes feeding the information back in order to improve learning.
For example, social network analysis (SNA) visualizes social network activity. In an educational context it can provide information on the degree distribution of the network (a measure of how centralized it is), on possible outliers (students who are not connected to others and are at risk of dropping out) and on the existence of subgroups in the network. A software tool to map forum activity in Moodle, Blackboard and the like is SNAPP, a free software programme developed by the University of Wollongong. SNAPP visualizes the network of interactions resulting from discussion forum posts and replies. It’s a simple browser plugin and doesn’t require manually loading data. Below are a screenshot of a forum discussion and another, more integrated network from the homepage. The difference in interaction is very clear.
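The network measures mentioned above can be sketched with the NetworkX library. The reply network below is invented: each edge means one student replied to another in a forum, and the isolated node stands for a student who never interacted, the kind of outlier an SNA tool would flag.

```python
import networkx as nx

# Hypothetical forum reply network: an edge means "replied to".
G = nx.Graph()
G.add_edges_from([
    ("ann", "ben"), ("ann", "cy"), ("ann", "dee"),  # ann is a hub
    ("ben", "cy"),
])
G.add_node("ed")  # ed never posted or received a reply

# Degree per student: how many classmates each one interacted with.
degrees = dict(G.degree())

# Isolated students are candidates for a drop-out risk flag.
isolates = list(nx.isolates(G))
print(degrees["ann"], isolates)  # 3 ['ed']
```

A skewed degree distribution (one hub, several peripheral students) would show up here as one large degree value and many small ones, which is what the SNAPP visualizations make visible at a glance.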
Data for EDM come from online learning management systems (LMS) such as Moodle or Blackboard. More standardized exams and the increased use of LMS and other internet-based software result in more, and more interoperable, data being available. When learning becomes more distributed, with learners using a variety of places and platforms, collecting and integrating all the data becomes more challenging. A particular point of interest is the purpose of EDM. Ryan Baker classified work in EDM as follows:
- Density estimation
- Relationship mining
- Association rule mining
- Sequential pattern mining
- Causal data mining
- Distillation of data for human judgment
- Discovery with models
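To give one of these categories some substance, association rule mining looks for rules of the form "students who accessed X also accessed Y", scored by support and confidence. The session data below are invented; the calculation itself is the standard one.

```python
# Hypothetical session logs: the set of resources each student accessed.
sessions = [
    {"video", "quiz", "forum"},
    {"video", "quiz"},
    {"video", "forum"},
    {"quiz", "forum"},
]

def support(itemset):
    """Fraction of sessions that contain every item in the itemset."""
    return sum(itemset <= s for s in sessions) / len(sessions)

# Rule "video -> quiz": of the students who watched the video,
# what fraction also took the quiz?
confidence = support({"video", "quiz"}) / support({"video"})
print(round(confidence, 2))  # 0.67
```

Sequential pattern mining is the ordered cousin of this idea (quiz *after* video rather than merely *with* it), and causal data mining asks the harder question of whether the video access actually drives the quiz-taking.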
Discovery with models is perhaps most typical for EDM. It is described by Baker as:
In discovery with models, a model of a phenomenon is developed through any process that can be validated in some fashion, and this model is then used as a component in another analysis, such as prediction or relationship mining. (…) supporting sophisticated analyses such as which learning material sub-categories of students will most benefit from (Beck & Mostow, 2008), how different types of student behavior impact students’ learning in different ways (Cocea et al., 2009) and how variations in intelligent tutor design impact students’ behavior over time (Jeong & Biswas, 2008).
An interesting case study presented by Baker was an investigation of “gaming” and “off-task behavior” (like surfing on Twitter or Facebook) during online learning tasks. Gaming means abusing the system, for example by systematic guessing (clicking quickly through all the answer categories without thinking) or by immediately hitting the “help” function. They built an algorithm to predict when students were gaming or engaging in off-task behavior, and tuned the model by observing students. Then they related the frequency of gaming to characteristics of a learning environment for a mathematics course. They included 76 variables in the analysis. Some variables, such as the use of non-mathematical words in the question or the use of abstract language, proved to significantly influence the degree of off-task behavior and gaming. The next step is to improve the online tasks based on the results of the analysis, and to verify whether gaming and off-task behavior drop.
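A gaming detector in this spirit could start from simple behavioral rules before being tuned against human observation. The sketch below is not Baker's actual model; the thresholds, log format and function are invented to show the shape of the idea: systematic guessing looks like many distinct answers tried in quick succession.

```python
# Hypothetical attempt log for one item: (answer_chosen, seconds_taken).
attempts = [("A", 1.2), ("B", 0.8), ("C", 0.9), ("D", 1.1)]

def looks_like_gaming(attempts, fast_threshold=2.0, min_fast_guesses=3):
    """Flag systematic guessing: several distinct answers, each tried very quickly."""
    fast_answers = {ans for ans, secs in attempts if secs < fast_threshold}
    return len(fast_answers) >= min_fast_guesses

print(looks_like_gaming(attempts))  # True
```

In practice such rule-based labels would be validated against classroom observations and then fed into a trained classifier over the full set of 76 variables, as in the study described above.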
EDM methods don’t answer the “why” question, but they offer clues that can be validated by complementary qualitative analysis. A hypothesis for the above example could be that abstract language or non-mathematical terms confuse or bore the students, pushing them toward off-task behavior.