Policy makers and the media have shown a remarkable preference for randomized controlled trials (RCTs) in recent years. After their breakthrough in medicine, they are increasingly hailed as a way to bring the human sciences into the realm of ‘evidence’-based policy. RCTs are believed to be accurate, objective and independent of the expert knowledge that is so widely distrusted these days. Policy makers are attracted by the seemingly ideology-free and theory-free focus on ‘what works’ in the RCT discourse.
Part of the appeal of RCTs lies in their simplicity. Trials are easily explained along the lines that random selection generates two otherwise identical groups, one treated and one not. All we need is to compare two averages. Unlike other methods, RCTs don’t require specialized understanding of the subject matter or prior knowledge. They therefore seem a truly general tool that works the same way in agriculture, medicine, economics and education.
Deaton cautions against this view of RCTs as the magic bullet in social research. In a lengthy but very readable NBER paper he outlines a range of misunderstandings about RCTs. These fall broadly into two categories: problems with running RCTs and problems with interpreting their results.
First, RCTs require minimal assumptions, prior knowledge or insight into the context. They are non-parametric: no information is needed about the underlying nature of the data (no assumptions about covariates, heterogeneous treatment effects or the shape of the statistical distributions of the variables). A crucial disadvantage of this simplicity is reduced precision, because no prior knowledge or theory is used to refine the design. And a lack of bias is not the same as precision. In an RCT, treatment and control groups are drawn from the same underlying distribution. Randomization guarantees that the net average balance of other causes (the error term) is zero, but only on average across many repetitions of the RCT on the same population (which is rarely done). I hadn’t realized this before, and it’s almost never mentioned in reports. But it makes sense. In any single trial, the difference in means equals the average treatment effect plus a term that reflects the imbalance in the net effects of the other causes. We do not know the size of this error term, and nothing in the randomization limits its size.
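This single-trial imbalance is easy to see in a small simulation. The numbers below are invented for illustration: a true treatment effect of 1.0 and a noisy baseline standing in for the net effect of all the other causes.

```python
import random

random.seed(1)

N = 100      # subjects per arm
ATE = 1.0    # true average treatment effect (assumed for illustration)

def one_trial():
    # Each subject has an idiosyncratic baseline: the net effect of
    # all the other causes the trial does not control.
    baselines = [random.gauss(0, 5) for _ in range(2 * N)]
    random.shuffle(baselines)                  # randomized assignment
    treated, control = baselines[:N], baselines[N:]
    # Observed difference = ATE + imbalance in the other causes.
    return sum(b + ATE for b in treated) / N - sum(control) / N

estimates = [one_trial() for _ in range(5000)]
mean_est = sum(estimates) / len(estimates)

print(round(mean_est, 2))                      # close to the true 1.0
print(round(min(estimates), 2), round(max(estimates), 2))
```

Averaged over thousands of repetitions the estimates are centered on the true effect, but any single trial can miss in either direction, and nothing in the randomization itself bounds that error.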
RCTs are based on the fact that the difference in two means is the mean of the individual differences, i.e. the treatment effects. This is not valid for medians. The focus on the mean makes RCTs sensitive to outliers in the data and to asymmetrical distributions. Deaton shows how an RCT can yield completely different results depending on whether an outlier falls in the treatment or the control group. Many treatment effects are asymmetric, especially when money or health is involved. In a micro-financing scheme, a few talented but credit-constrained entrepreneurs may experience a large, positive effect, while there is no effect for the majority of borrowers. Similarly, a health intervention may have no effect on the majority, but a large effect on a small group of people.
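A toy version of the micro-finance example makes the outlier sensitivity concrete. The figures are hypothetical: nineteen borrowers with zero treatment effect and one entrepreneur who gains 100.

```python
import statistics

n = 10  # borrowers per arm

# Hypothetical individual treatment effects: nothing for 19 of the
# 20 borrowers, a large gain for one credit-constrained entrepreneur.
effects = [0.0] * 19 + [100.0]
true_ate = statistics.mean(effects)   # 5.0

def trial(outlier_treated):
    # Baseline outcome is 0 for everyone; a treated subject's outcome
    # shifts by their individual effect.
    treated = ([100.0] if outlier_treated else [0.0]) + [0.0] * (n - 1)
    control = [0.0] * n
    return statistics.mean(treated) - statistics.mean(control)

print(trial(True))    # outlier drawn into treatment: estimate 10.0
print(trial(False))   # outlier drawn into control:   estimate 0.0
```

Both assignments are equally likely under randomization, yet neither estimate equals the true average effect of 5.0, and a comparison of medians would show no effect in either case.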
A key argument in favour of randomization is the ability to blind both those receiving the treatment and those administering it. In social science, however, blinding is rarely possible. Subjects usually know whether they are receiving the treatment or not, and can react to their assignment in ways that affect the outcome through channels other than the treatment itself. This is problematic, and not only because of selection bias: concerns about placebo, Pygmalion, Hawthorne and John Henry effects are serious.
Deaton recognizes that RCTs have their use within social sciences. When combined with other methods, including conceptual and theoretical development, they can contribute to discovering not “what works,” but why things work.
As Deaton puts it: “Unless we are prepared to make assumptions, and to stand on what we know, making statements that will be incredible to some, all the credibility of RCTs is for naught.”
Randomization is also the appropriate response where there is good reason to doubt the good faith of experimenters, as in some pharmaceutical trials. However, ignoring prior knowledge in the field should be resisted as a general prescription for scientific research. An RCT can, moreover, disprove a general theoretical proposition by providing a counterexample. Finally, an RCT, by demonstrating causality in some population, can serve as proof of concept that the treatment is capable of working somewhere.
In Deaton’s words: “Economists and other social scientists know a great deal, and there are many areas of theory and prior knowledge that are jointly endorsed by large numbers of knowledgeable researchers. Such information needs to be built on and incorporated into new knowledge, not discarded in the face of aggressive know-nothing ignorance.”
The conclusions of RCTs are often wrongly applied to other contexts. RCTs do not have external validity: establishing causality does nothing in and of itself to guarantee generalizability, and the results are not automatically applicable outside the trial population. That doesn’t mean RCTs are useless in other contexts. We can often learn much from understanding why a replication failed, and use that knowledge to apply the original findings appropriately, by asking how the factors that caused the original result might operate differently in different settings. But generalizability can only be obtained by thinking through the causal chain that generated the RCT result, the underlying structures that support this chain, and whether and how the chain might operate in a new setting with different joint distributions of the causal variables. We need to know why, and whether that why will apply elsewhere.
Bertrand Russell’s chicken provides an excellent example of the limitations to straightforward extrapolation from repeated successful replication.
The bird infers, from repeated evidence, that when the farmer comes in the morning, he feeds her. The inference serves her well until Christmas morning, when he wrings her neck and serves her for Christmas dinner. Of course, our chicken did not base her inference on an RCT. But had we constructed one for her, we would have obtained exactly the same result.
The results of RCTs must be integrated with other knowledge, including the practical wisdom of policy makers, if they are to be usable outside the context in which they were constructed.
Another limitation of RCT results relates to their scalability. As with other research methods, failure of trial results to replicate at a larger scale is likely to be the rule rather than the exception. Using RCT results is not the same as assuming the same result holds in all circumstances. Giving one child a voucher to go to private school might improve her future, but doing so for everyone can decrease the quality of education for those children who are left in the public schools.
Knowing “what works” in a trial population is of limited value without understanding the political and institutional environment in which it is set. Jean Drèze notes, based on extensive experience in India, “when a foreign agency comes in with its heavy boots and suitcases of dollars to administer a ‘treatment,’ whether through a local NGO or government or whatever, there is a lot going on other than the treatment.” There is also the suspicion that a treatment that works does so because of the presence of the “treators,” often from abroad, rather than because of the people who will be called on to work it in reality. Unfortunately, few RCTs are replicated after the pilot on the scaled-up version of the experiment.
This readable paper from one of the foremost experts in development economics provides a valuable counterweight to the often unnuanced admiration for all things RCT. In a previous post, I discussed Poor Economics by “randomistas” Duflo and Banerjee. For those who want to know more, there is an excellent debate online between Abhijit Banerjee (J-PAL, MIT) and Angus Deaton on the merits of RCTs.