#H809 Learning Analytics: The Arnold and Pistilli (2012) paper

In a paper for the 2012 Learning Analytics and Knowledge conference, Arnold and Pistilli explore the value of learning analytics in the Course Signals (CS) product, a pioneering learning analytics programme at Purdue University. The researchers used three years of data from a variety of modules. For some modules, learning analytics was used to identify students ‘at risk of failing’, based on a proprietary algorithm that took into account course-related factors such as login data, but also prior study results and demographic factors. Students ‘at risk’ were confronted with a yellow or red traffic light on their LMS dashboard. Based on this information, tutors could decide to contact the student by e-mail or phone. The researchers compared retention rates for cohorts of students who entered university from 2007 until 2009. They complemented this analysis with feedback from students and instructors.

Modules that used CS showed increased retention rates, likely due to the use of CS. These courses also showed lower-than-average test results, possibly a consequence of the higher retention. Student feedback indicated that 58% of students wanted to use CS in every course, not a totally convincing number.

The research paper raised the following issues and questions:


  • The correlation doesn’t necessarily point to a causal link (although the relation seems quite intuitive)
  • It’s unclear how courses were selected to be used with CS or not. Possibility of bias?
  • The qualitative side of the research seems neglected.   Interesting information such as the large group of students who are apparently not eager to use CS in every course is not further explored.


  • The underlying algorithm is proprietary and is thus a black box for outsiders, which severely limits its applicability and relevance for others.
  • It’s unclear exactly what the use of CS is compared with. If students in non-CS modules get little personal learner support, CS may look like a real improvement.
  • The previous point relates to the need to clearly articulate what the objective(s) of CS, or of learning analytics in general, are. Including an analysis of tutor time saved or money saved through higher retention would have given a more honest and complete overview of the benefits that are likely perceived as important, instead of a rather naive focus on retention rates.

Ethical issues

  • It’s unclear if and how informed consent of students is obtained. Is it part of the ‘small print’ that comes with enrolment?
  • How about false positives and negatives? Some students may get a continuous red light or face a bombardment of e-mails if they belong to a demographic or socio-economic group ‘at risk’. Others may complain when they don’t receive any warnings despite having trouble staying in the course.
  • The authors have been closely involved in the development of the learning analytics programme at Purdue University.  This raises questions about objectivity and underlying motives of the paper.
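
The false positive and false negative concern above can be made concrete with a small count. The sketch below is only an illustration: the function names and the sample data are made up, and it says nothing about how Course Signals itself performs, since the paper's algorithm is proprietary. It compares who was flagged ‘at risk’ with who actually struggled.

```python
def confusion_counts(flagged, struggled):
    """Compare who was flagged 'at risk' with who actually struggled."""
    counts = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for was_flagged, did_struggle in zip(flagged, struggled):
        if was_flagged and did_struggle:
            counts["true_pos"] += 1
        elif was_flagged:
            counts["false_pos"] += 1   # warned, but was doing fine
        elif did_struggle:
            counts["false_neg"] += 1   # struggled, but never warned
        else:
            counts["true_neg"] += 1
    return counts


def error_rates(counts):
    """Return (false positive rate, false negative rate)."""
    not_struggling = counts["false_pos"] + counts["true_neg"]
    struggling = counts["false_neg"] + counts["true_pos"]
    fpr = counts["false_pos"] / not_struggling if not_struggling else 0.0
    fnr = counts["false_neg"] / struggling if struggling else 0.0
    return fpr, fnr


# Hypothetical data for six students (not taken from the paper).
flagged = [True, True, False, False, True, False]
struggled = [True, False, True, False, True, False]

counts = confusion_counts(flagged, struggled)
fpr, fnr = error_rates(counts)
print(counts, fpr, fnr)
```

Reporting both rates per demographic group would show whether some groups receive warnings they don’t need more often than others.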


Arnold, K.E. and Pistilli, M.D. (2012) ‘Course signals at Purdue: using learning analytics to increase student success’, in Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, LAK ’12, New York, NY, USA, ACM, pp. 267–270, [online] Available from: http://doi.acm.org/10.1145/2330601.2330666.

#LAK11 – Week 2 – Educational Data Mining

Week 2 of the Learning Analytics (LAK11) course focuses on the emerging field of educational data mining (EDM).
Data mining tries to make sense of huge amounts of data. These can be in tabular format, but more and more data are in a messier or text-like format, like status updates (Facebook), tweets (Twitter) or the detailed monitoring data of an LMS. Educational data mining takes into account the specificities of educational data, such as their multi-level hierarchy and non-independence. This is similar to how spatial data mining addresses the specifics of spatial information such as spatial autocorrelation. In an interesting talk on YouTube the CEO of Cloudera, Mike Olson, lays out the potential of distributed computing frameworks such as Hadoop and MapReduce (Google) to deal with machine-produced data. Distributed computing means that data and computing are spread out over many servers.
“Terabytes are not hard to get, they are hard not to get” (Mike Olson, CEO Cloudera)
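
To make the MapReduce idea a bit more tangible, here is a minimal single-machine sketch in Python. It only mimics the three phases (map, shuffle, reduce) on a handful of made-up LMS log lines. The function names and the log format are assumptions for illustration, not part of Hadoop’s API, and a real job would spread these phases over many servers.

```python
from collections import defaultdict


def map_phase(record):
    """Turn one log line 'student,action' into (action, 1) pairs."""
    _student, action = record.split(",")
    yield (action, 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical LMS log lines.
log = ["ann,login", "bob,login", "ann,post", "ann,login"]
print(run_job(log))
```

Because each map call looks at one record and each reduce call looks at one key, both phases can run in parallel on different servers, which is what makes the approach scale.
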
The objective of EDM is to use data mining methods to better understand student learning. This can be both from an institutional perspective (increasing efficiency, grouping students, predicting student performance etc.) and from a learner perspective (providing better tutoring, a more personalized learning experience). Teachers can see whether a learning community is being formed or can monitor the effect of learning interventions, for example the introduction of e-portfolio-like instruments.
A fair amount of discussion revolved around the difference between learning analytics and educational data mining, without yielding a clear consensus (even Ryan Baker admitted that he doesn’t have a clear distinction). It seems that learning analytics is not limited to data mining methods, but also involves more qualitative methods. In this view EDM is a part of learning analytics. Other authors argue that learning analytics includes feeding the information back in order to improve learning.
For example, social network analysis (SNA) visualizes social network activity. In an educational context it can provide information on the degree distribution of the network (a measure of how centralized it is), on possible outliers (students who are not connected to others and are at risk of dropping out) and on the existence of subgroups in the network. A software tool to map forum activity on Moodle, Blackboard and the like is SNAPP, a free software programme developed by the University of Wollongong. SNAPP visualizes the network of interactions resulting from discussion forum posts and replies. It’s a simple browser plugin and doesn’t require manually loading data. Below are a screenshot from a forum discussion and another, more integrated network from the homepage. The difference in interaction is very clear.
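
The three SNA measures mentioned above (degree, unconnected students, subgroups) can be sketched in a few lines of plain Python. This is not how SNAPP works internally. The function names and the reply data are invented for illustration, and connected components are used here as a rough stand-in for subgroups.

```python
def build_network(replies, students):
    """Undirected graph: two students are linked when one replied to the other."""
    neighbours = {s: set() for s in students}
    for author, replied_to in replies:
        if author != replied_to:
            neighbours.setdefault(author, set()).add(replied_to)
            neighbours.setdefault(replied_to, set()).add(author)
    return neighbours


def degrees(neighbours):
    """Number of distinct discussion partners per student."""
    return {s: len(n) for s, n in neighbours.items()}


def isolated(neighbours):
    """Students with no links at all: possible drop-out risks."""
    return sorted(s for s, n in neighbours.items() if not n)


def subgroups(neighbours):
    """Connected components with more than one member (depth-first search)."""
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        if len(group) > 1:
            groups.append(sorted(group))
    return groups


# Hypothetical forum data: (author of reply, author replied to).
students = ["ann", "bob", "cat", "dan", "eve", "fay"]
replies = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("dan", "eve")]

network = build_network(replies, students)
print(degrees(network))
print(isolated(network))
print(subgroups(network))
```

In this made-up forum, one student has no links at all and the others split into two separate groups, which is the kind of pattern a tutor might want to follow up on.
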
Data for EDM come from online learning management systems (LMS) such as Moodle or Blackboard. More standardized exams and the increased use of LMS and internet-based software mean that more, and more interoperable, data are available. When learning becomes more distributed, with learners using a variety of places and platforms, collecting and integrating all data becomes more challenging. A particular point of interest is the purpose of EDM. Ryan Baker classified work in EDM as follows:
  • Prediction
    • Classification
    • Regression
    • Density estimation
  • Clustering
  • Relationship mining
    • Association rule mining
    • Sequential pattern mining
    • Causal data mining
  • Distillation of data for human judgment
  • Discovery with models

Discovery with models is perhaps most typical for EDM. It is described by Baker as:
“In discovery with models, a model of a phenomenon is developed through any process that can be validated in some fashion, and this model is then used as a component in another analysis, such as prediction or relationship mining. (…) supporting sophisticated analyses such as which learning material sub-categories of students will most benefit from (Beck & Mostow, 2008), how different types of student behavior impact students’ learning in different ways (Cocea et al., 2009) and how variations in intelligent tutor design impact students’ behavior over time (Jeong & Biswas, 2008).”
An interesting case study presented by Baker was an investigation of “gaming” and “off-task behavior” (like surfing on Twitter or Facebook) during online learning tasks. Gaming means abusing the system, for example through systematic guessing (clicking quickly on all the answer categories without thinking) or immediately hitting the “help” function. The researchers built an algorithm to predict when students were gaming or engaging in off-task behavior, and tuned the model by observing students. They then related the frequency of gaming to characteristics of a learning environment for a mathematics course, including 76 variables in the analysis. Some variables, such as the use of non-mathematical words in the question or the use of abstract language, proved to significantly influence the degree of off-task behavior and gaming. The next step is to improve the online tasks based on the results of the analysis, and to verify whether gaming and off-task behavior drop.
EDM methods don’t answer the why question, but offer clues that can be validated by complementary qualitative analysis. A hypothesis for the above example could be that abstract language or non-mathematical terms confuse or bore students, prompting off-task behavior.
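
As a rough illustration of what flagging “gaming” from log data could look like, here is a deliberately simple rule-based sketch. It is not the detector Baker described, which was a trained model tuned against classroom observations. The thresholds, the tuple layout and the function name are all assumptions made up for this example.

```python
# Hypothetical thresholds, chosen only for illustration.
FAST_SECONDS = 2.0       # an action this quick is unlikely to involve thinking
MIN_FAST_WRONG = 3       # this many quick wrong answers counts as guessing


def looks_like_gaming(attempts):
    """Flag one task as possible gaming.

    Each attempt is a tuple (seconds_before_action, action, correct),
    where action is "answer" or "help".
    """
    fast_wrong = sum(
        1
        for seconds, action, correct in attempts
        if action == "answer" and not correct and seconds < FAST_SECONDS
    )
    # Asking for help almost immediately, before trying anything.
    instant_help = (
        bool(attempts)
        and attempts[0][1] == "help"
        and attempts[0][0] < FAST_SECONDS
    )
    return fast_wrong >= MIN_FAST_WRONG or instant_help


# Made-up interaction logs for three tasks.
guessing = [(0.8, "answer", False), (0.6, "answer", False),
            (0.7, "answer", False), (0.5, "answer", True)]
thoughtful = [(25.0, "answer", False), (40.0, "answer", True)]
help_first = [(1.0, "help", False)]

print(looks_like_gaming(guessing), looks_like_gaming(thoughtful),
      looks_like_gaming(help_first))
```

A rule like this would itself need validating against observation, which is exactly where the complementary qualitative analysis comes in.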