Understandings and Misunderstandings about RCTs

Policy makers and the media have shown a remarkable preference for Randomized Controlled Trials (RCTs) in recent times. After their breakthrough in medicine, they are increasingly hailed as a way to bring the human sciences into the realm of ‘evidence’-based policy. RCTs are believed to be accurate, objective and independent of the expert knowledge that is so widely distrusted these days. Policy makers are attracted by the seemingly ideology-free and theory-free focus on ‘what works’ in the RCT discourse.

Part of the appeal of RCTs lies in their simplicity. Trials are easily explained: random selection generates two otherwise identical groups, one treated and one not, and all we need to do is compare the two averages. Unlike other methods, RCTs don’t seem to require specialized understanding of the subject matter or prior knowledge. As such, they appear to be a truly general tool that works in the same way in agriculture, medicine, economics and education.

Deaton cautions against this view of RCTs as the magic bullet of social research. In a lengthy but very readable NBER paper he outlines a range of misunderstandings about RCTs. These broadly fall into two categories: problems with the running of RCTs and problems with their interpretation.

Firstly, RCTs require minimal assumptions, prior knowledge or insight into the context. They are non-parametric: no information is needed about the underlying nature of the data (no assumptions about covariates, heterogeneous treatment effects or the shape of the statistical distributions of the variables). A crucial disadvantage of this simplicity is reduced precision, because no prior knowledge or theory can be used to design a more refined research hypothesis. And precision is not the same as a lack of bias. In RCTs, treatment and control groups come from the same underlying distribution, and randomization guarantees that the net average balance of other causes (the error term) is zero, but only on average across many repetitions of the trial on the same population (which are rarely done). I hadn’t realized this before and it’s almost never mentioned in reports, but it makes sense: in any one trial, the difference in means will equal the average treatment effect plus a term that reflects the imbalance in the net effects of the other causes. We do not know the size of this error term, and there is nothing in the randomization that limits its size.
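A minimal simulation, with invented numbers, may make this concrete: any single randomization leaves some imbalance in the other causes, so one trial’s estimate can sit well away from the true effect, even though the average across many repeated randomizations is on target.

```python
# Sketch (hypothetical numbers): one randomization vs. many repetitions.
import numpy as np

rng = np.random.default_rng(0)
n = 50               # subjects per arm
true_effect = 1.0    # treatment adds 1.0 to the outcome

def one_trial() -> float:
    # 'other causes' vary across individuals: this is the error term
    baseline = rng.normal(loc=10.0, scale=5.0, size=2 * n)
    treated_mask = rng.permutation(2 * n) < n      # random assignment
    treated = baseline[treated_mask] + true_effect
    control = baseline[~treated_mask]
    return treated.mean() - control.mean()         # estimated effect

estimates = np.array([one_trial() for _ in range(10_000)])
print(f"one trial:            {estimates[0]:+.2f}")      # can be far from +1.0
print(f"mean over 10k trials: {estimates.mean():+.2f}")  # close to +1.0
```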

RCTs are based on the fact that the difference of two means is the mean of the individual differences, i.e. of the treatment effects; the same is not true of medians. This focus on the mean makes RCTs sensitive to outliers in the data and to asymmetric distributions. Deaton shows how an RCT can yield completely different results depending on whether an outlier falls in the treatment or the control group. Many treatment effects are asymmetric, especially when money or health is involved. In a micro-financing scheme, a few talented but credit-constrained entrepreneurs may experience a large, positive effect while there is no effect for the majority of borrowers. Similarly, a health intervention may have no effect on the majority but a large effect on a small group of people.
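A toy example, with made-up numbers, of that outlier sensitivity: the true treatment effect below is zero, yet the estimate flips sign depending on which arm a single extreme subject happens to land in.

```python
# Sketch (hypothetical numbers): one outlier flips the estimated effect.
import numpy as np

outcomes = np.zeros(100)   # 99 ordinary subjects, no treatment effect at all
outcomes[0] = 500.0        # one outlier, e.g. an already-wealthy household

def estimate(outlier_treated: bool) -> float:
    idx = np.arange(100)
    if not outlier_treated:
        idx = idx[::-1]    # put the outlier in the control arm instead
    treated, control = outcomes[idx[:50]], outcomes[idx[50:]]
    return treated.mean() - control.mean()

print(estimate(True))    # +10.0: the intervention looks like a success
print(estimate(False))   # -10.0: identical data, now it looks harmful
```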

A key argument in favour of randomized trials is the ability to blind both those receiving the treatment and those administering it. In social science, however, blinding is rarely possible. Subjects usually know whether they are receiving the treatment and can react to their assignment in ways that affect the outcome through channels other than the treatment itself. This is problematic, and not only because of selection bias: concerns about the placebo, Pygmalion, Hawthorne and John Henry effects are serious.

Deaton recognizes that RCTs have their uses within the social sciences. Firstly, when combined with other methods, including conceptual and theoretical development, they can contribute to discovering not “what works,” but why things work.

Unless we are prepared to make assumptions, and to stand on what we know, making statements that will be incredible to some, all the credibility of RCTs is for naught.

Secondly, in cases where there is good reason to doubt the good faith of experimenters, as in some pharmaceutical trials, randomization may well be the appropriate response; however, ignoring the prior knowledge in the field should be resisted as a general prescription for scientific research. Thirdly, an RCT may disprove a general theoretical proposition by providing a counterexample to it. Finally, an RCT, by demonstrating causality in some population, can be thought of as a proof of concept that the treatment is capable of working somewhere.

Economists and other social scientists know a great deal, and there are many areas of theory and prior knowledge that are jointly endorsed by large numbers of knowledgeable researchers.  Such information needs to be built on and incorporated into new knowledge, not discarded in the face of aggressive know-nothing ignorance.

The conclusions of RCTs are often wrongly applied to other contexts. RCTs do not have external validity: establishing causality does nothing in and of itself to guarantee generalizability, and the results are not automatically applicable outside the trial population. That doesn’t mean RCTs are useless in other contexts. We can often learn much from understanding why a replication failed, and use that knowledge to make appropriate use of the original findings by asking how the factors that caused the original result might operate differently in different settings. Generalizability, however, can only be obtained by thinking through the causal chain that generated the RCT result, the underlying structures that support this chain, and whether and how that chain might operate in a new setting with different joint distributions of the causal variables. We need to know why, and whether that why will apply elsewhere.

Bertrand Russell’s chicken provides an excellent example of the limitations to straightforward extrapolation from repeated successful replication.

The bird infers, from repeated evidence, that when the farmer comes in the morning, he feeds her. The inference serves her well until Christmas morning, when he wrings her neck and serves her for Christmas dinner. Of course, our chicken did not base her inference on an RCT. But had we constructed one for her, we would have obtained exactly the same result.

The results of RCTs must be integrated with other knowledge, including the practical wisdom of policy makers, if they are to be usable outside the context in which they were constructed.

Another limitation of RCT results relates to their scalability. As with other research methods, failure of trial results to replicate at a larger scale is likely to be the rule rather than the exception. Using RCT results is not the same as assuming the same result holds in all circumstances. Giving one child a voucher to go to private school might improve her future, but doing so for everyone can decrease the quality of education for the children who are left in the public schools.

Knowing “what works” in a trial population is of limited value without understanding the political and institutional environment in which it is set. Jean Drèze notes, based on extensive experience in India, “when a foreign agency comes in with its heavy boots and suitcases of dollars to administer a ‘treatment,’ whether through a local NGO or government or whatever, there is a lot going on other than the treatment.” There is also the suspicion that a treatment that works does so because of the presence of the “treators,” often from abroad, rather than because of the people who will be called on to make it work in reality. Unfortunately, few RCTs are replicated after the pilot on the scaled-up version of the experiment.

This readable paper from one of the foremost experts in development economics provides a valuable counterweight to the often unnuanced admiration for all things RCT. In a previous post, I discussed Poor Economics by “randomistas” Duflo and Banerjee. For those who want to know more, there is an excellent debate online between Abhijit Banerjee (J-PAL, MIT) and Angus Deaton on the merits of RCTs.

#H809 Can Technology ‘Improve’ Learning? And can we find out?

In education and learning we cannot isolate our research objects from outside influences, as we can in the positive sciences. In a physics experiment we would carefully select the variables we want to measure (dependent variables) and the variables we believe could influence them (independent variables). In education this is not possible. Even in Randomized Controlled Trials (RCTs), put forward by researchers such as Duflo and Banerjee (see my post on their wonderful book ‘Poor Economics’) as a superior way to investigate policy effects, we cannot, in my opinion, fully exclude context.

This is why, according to Diana Laurillard, many studies talk about the ‘potential’ of technology in learning: it conveniently avoids dealing with the messiness of the context. Other studies present positive results that were obtained under favourable external circumstances. Laurillard argues that the question of whether technology improves education is senseless, because the answer depends on so many factors:

There is no way past this impasse. The only sensible answer to the question is ‘it depends’, just as it would be for any X in the general form ‘do X’s improve learning?’. Try substituting teachers, books, schools, universities, examinations, ministers of education – any aspect of education whatever, in order to demonstrate the absurdity of the question. (Laurillard, 1997)

In H810 we discussed theories of institutional change and authors such as Douglass North and Ozcan Konur, who highlighted the importance of formal rules, informal constraints and enforcement characteristics in explaining policy effects in education. Laurillard talks about ‘external layers of influence’. A first layer surrounding student and teacher (student motivation, assessment characteristics, perceptions, available hardware and software, student prior knowledge, teacher motivation to use technology etc.) lies within their own sphere of influence. Wider layers (organisational and institutional policies, the culture of education in society, perceived social mobility…) are much harder to influence directly.

That doesn’t mean she believes educational research is impossible.  She dismisses the ‘cottage industry’ model of education (See this article from Sir John Daniel on the topic), in which education is seen as an ‘art’, best left to the skills of the teacher as artist.  Rather, she argues for a change in direction of educational research.

Laurillard dismisses much educational research as ‘replications’ rather than ‘findings’, a statement that echoes Clayton Christensen’s plea to focus more on deductive, predictive research than on descriptive, correlational studies. He argues for focusing less on detecting correlations and more on theory formation and on categorising the circumstances in which individual learners can benefit from certain educational interventions. A body of knowledge advances by testing hypotheses derived from theories. To end with a quote from the great Richard Feynman (courtesy of the fantastic ‘Starts with a Bang‘ blog):

“We’ve learned from experience that the truth will come out. Other experimenters will repeat your experiment and find out whether you were wrong or right. Nature’s phenomena will agree or they’ll disagree with your theory. And, although you may gain some temporary fame and excitement, you will not gain a good reputation as a scientist if you haven’t tried to be very careful in this kind of work.” -Richard Feynman

References

Konur, O. (2006) ‘Teaching disabled students in higher education’, Teaching in Higher Education, 11(3), pp. 351–363.
Laurillard, D. (1997) ‘How Can Learning Technologies Improve Learning?’, Law Technology Journal, 3(2).
North, D.C. (1994) ‘Institutional Change: A Framework of Analysis’, Economic History, EconWPA.

Poor Education

I’ve been reading Poor Economics by Abhijit Banerjee and Esther Duflo, the book that racked up awards and recommendations in 2011 (The Economist, Financial Times & Goldman Sachs, The Guardian, De Tijd). It’s an engrossing read, packed with interesting information, that makes you want to pick it up again as soon as you put it down, for fear of losing all the insights it contains.

The authors, affiliated with MIT and its impact-evaluation spin-off J-PAL, take an evidence-based approach to poverty reduction, providing an overview of recent research in various domains of development economics (health, demography, finance, food, entrepreneurship…). There’s a separate chapter on education.

The book highlights the recurring disagreement between “supply wallahs”, experts who focus on supplying goods and services to combat poverty, and “demand wallahs”, who favour creating demand for goods and services among the poor themselves and establishing free-market conditions. The two viewpoints are contrasted across the different topics of the book. In education, for example, supply wallahs focus on providing financial support to build schools, pay teachers and fund conditional cash transfers to parents who send their children to school. Demand wallahs, conversely, see more benefit in increasing the (perceived) benefit to parents of sending their children to school, by providing relevant skills, informing parents or increasing job opportunities. When the benefits of education become high enough, enrolment will rise without the state having to push it: people will send their children to cut-throat private schools (as in Cambodia) or, if that is too expensive, demand that local governments set up schools.

The authors refer to the three I’s (ideology, ignorance and inertia) as the enemies of an evidence-based approach: many policies are driven by ideology, clash with ignorance of ground-level realities and run into inertia at the level of the implementer. Instead of starting from a grand vision of poverty reduction, they focus on evidence collected from (but not exclusively from) randomized trials in developing countries. This evidence forms pieces of a puzzle that can inform the design of sensible development policies and create incremental improvements in poor people’s lives.

“it is possible to make very significant progress against the biggest problem in the world through the accumulation of a set of small steps, each well thought out, carefully tested, and judiciously implemented…The political constraints are real, and they make it difficult to find big solutions to big problems. But there is considerable slack to improve institutions and policy at the margin…These changes will be incremental, but they will sustain and build on themselves. They can be the start of a quiet revolution”.

Interventions are often based on the intuition and experience of local aid workers, on accepted wisdom and on (cherry-picked) academic research. A monitoring and evaluation programme is put in place, but it is often geared more towards satisfying donors’ reporting needs than towards creating sound evidence for making informed changes to the project design.

Could we use randomized trials in our education programme in Cambodia?  Our main objective is to reduce the number of drop-outs from primary and lower secondary schools.  For example, we could try out various strategies and measure their effect on the drop-out rate in similar school clusters across the country (a toy sketch of such an assignment follows the list):

  • School cluster 1: We provide cash to parents who keep their children at school.
  • School cluster 2: We train teachers in using student-centred pedagogies to make lessons more relevant and interactive.
  • School cluster 3: We provide schools with ICT and multimedia.
  • School cluster 4: We give teachers a top-up to their salaries if a certain percentage of students pass their exams at the end of the year and enrol for the next one.
  • School cluster 5: We focus on outreach activities to parents and mass organisations to make them aware of the benefits of education.
  • School cluster 6: This is our control group, where no measures are taken.
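As a purely hypothetical sketch of what such a design could look like in code (cluster names, arm sizes and drop-out figures are all invented), the random assignment and the eventual comparison might run along these lines:

```python
# Hypothetical sketch: assign school clusters to arms, then compare
# average drop-out rates per arm. All names and numbers are invented.
import random

ARMS = ["cash for parents", "pedagogy training", "ICT and multimedia",
        "teacher salary top-up", "parent outreach", "control"]

clusters = [f"cluster_{i:02d}" for i in range(1, 31)]  # 30 school clusters
random.seed(42)
random.shuffle(clusters)                               # random assignment

# five clusters per arm, so every strategy is tried in several places
assignment = {arm: clusters[i::len(ARMS)] for i, arm in enumerate(ARMS)}

# placeholder data standing in for drop-out rates measured years later
dropout = {c: random.uniform(0.05, 0.25) for c in clusters}

for arm, arm_clusters in assignment.items():
    mean_rate = sum(dropout[c] for c in arm_clusters) / len(arm_clusters)
    print(f"{arm:22s} mean drop-out rate: {mean_rate:.1%}")
```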
[Photo: Cambodian teacher trainers measuring the Sun’s apparent movement]

This kind of programme design would allow us to compare various measures to address high drop-out rates.  After a few years we could compare results in an objective way and scale up the most successful solution.  Or not?

When I think about the potential for applying similar rigorous testing in the education programme in Cambodia I see some obstacles:

  1. The field of development partners is very crowded in ‘donor darling’ Cambodia. This makes it difficult to create a level playing field in which measures can be compared with each other and with the status quo.
  2. Interventions in education aim at mid- and long-term effects. Some strategies, such as focusing on teachers’ or teacher trainers’ teaching skills, might take years to show effects, while other policies, like building more schools, may create immediate effects, making strategies hard to compare within the limited lifetime of most development programmes.
  3. Response and culture bias are important challenges in Cambodia, which is characterised by high power distance and the importance of avoiding ‘losing face’. Honest evaluations of a programme are hard to achieve and require trust and strong facilitation skills. People often say, write or do what they think you want to hear or what they think will yield them the most benefit, ready to switch back to old habits as soon as the intervention stops.
  4. Even if a randomized trial that took into account long-term effects, culture and response bias and the crowded development field were to show that conditional cash transfers are more effective than improving teachers’ pedagogical skills, would that imply that we, as VVOB, should switch our attention to conditional cash transfer programmes? Various strategies can be complementary. Results from randomized trials demonstrate a measurable effect at a given time and place in a particular culture, but do not prove that the same effect will occur elsewhere. In other words, they don’t always have much predictive value in other contexts.

I find Poor Economics an invitation to look more closely at development interventions and to avoid the lazy thinking that reduces every problem to the same set of principles.  Details matter.  The poor seem to be trapped by the same kinds of problems that afflict the rest of us: lack of information, weak beliefs and procrastination.  We need to force ourselves to understand the logic of people’s choices, tailor interventions accordingly and be prepared to learn.

Some interesting (development economics) concepts from the book with relevance for education

Time inconsistency

 “In the present we are impulsive, governed in large part by emotions and immediate desire: Small losses of time or petty discomforts that have to be endured right now feel much more unpleasant in the moment than when we think about them without a sense of immediacy.  The reverse, of course, goes for small rewards that we really crave in the present; when we plan for the future, the pleasure for these treats seems less important.” (p. 65)

Nudging

The use of incentives to give people a reason to act today, instead of convincing them first that the action is the ‘right thing to do’.  The key challenge is to design nudges tailored to the environment of developing countries.

Elite bias among teachers and parents

Teachers tend to pay attention only to their best students, ignoring children who have fallen behind and focusing on preparing the top students for the final exam.  Education systems in developing countries thus generally fail at their two basic tasks: providing children with a sound set of basic skills and identifying talent.  As a consequence, many parents stop taking an interest in their children’s education, and this behaviour creates a poverty trap even where none existed in the first place.  Because many parents believe that the returns to education are low at low levels and high only at higher levels, and that their children are unlikely ever to reach those higher levels, they may not invest in the lower levels at all, or they hedge all their bets on a single child.  Parents seem to see education primarily as a way for their children to acquire wealth: a lottery ticket, not a safe investment.  Research indicates, however, that each year of education adds a similar value.  The combination of a lack of information and incorrect expectations creates an illusory poverty trap.
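A back-of-the-envelope illustration, with entirely invented numbers, of why a perceived ‘lottery ticket’ pay-off pushes parents to bet everything on one child, while the roughly linear returns the research points to would make spreading education across children just as worthwhile:

```python
# Toy numbers only: perceived threshold returns vs. roughly linear returns.
def perceived_wage(years: int) -> float:
    # parents' belief: education only pays off after finishing school
    return 5.0 if years >= 12 else 1.0

def linear_wage(years: int) -> float:
    # evidence cited in the book: each extra year adds a similar amount
    return 1.0 + 0.33 * years

# Two children, twelve school-years of resources to divide between them.
print(perceived_wage(12) + perceived_wage(0))  # 6.00: bet it all on one child
print(perceived_wage(6) + perceived_wage(6))   # 2.00: splitting looks wasteful
print(linear_wage(12) + linear_wage(0))        # 5.96: same total either way...
print(linear_wage(6) + linear_wage(6))         # 5.96: ...so educate them both
```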

On Technology in education

“The current view of the use of technology in teaching in the education community is, however, not particularly positive. But this is based mainly on experience from rich countries, where the alternative to being taught by a computer is, to a large extent, being taught by a well-trained and motivated teacher.  This is not always the case in poor countries.  And the evidence from the developing world, though sparse, is quite positive.” (p. 100) 

This highlights what is particularly good about the computer as a learning tool: each child is able to set his or her own pace through the program.

On the poor as so-called natural entrepreneurs

Enterprises of the poor often seem more a way to buy a job when a more conventional employment opportunity is not available than the reflection of a particular entrepreneurial urge.  The emphasis on government jobs suggests a desire for stability, which can itself be transformational: access to loans, a higher value assigned to education and the ‘mental space’ that comes with reduced uncertainty.

PS The Economist has run a discussion of Poor Economics on its “Free Exchange” blog.