Understandings and Misunderstandings about RCTs

Policy makers and the media have shown a remarkable preference for randomized controlled trials (RCTs) in recent times. After their breakthrough in medicine, they are increasingly hailed as a way to bring the human sciences into the realm of ‘evidence’-based policy. RCTs are believed to be accurate, objective and independent of the expert knowledge that is so widely distrusted these days. Policy makers are attracted by the seemingly ideology-free and theory-free focus on ‘what works’ in the RCT discourse.

Part of the appeal of RCTs lies in their simplicity.  Trials are easily explained along the lines that random selection generates two otherwise identical groups, one treated and one not. All we need is to compare two averages.  Unlike other methods, RCTs don’t require specialized understanding of the subject matter or prior knowledge. As such, they seem a truly general tool that works in the same way in agriculture, medicine, economics and education.

Deaton cautions against this view of RCTs as the magic bullet in social research. In a lengthy but very readable NBER paper he outlines a range of misunderstandings about RCTs. These broadly fall into two categories: problems with the running of RCTs and problems with their interpretation.

Firstly, RCTs require minimal assumptions, prior knowledge or insight into the context. They are non-parametric and no information is needed about the underlying nature of the data (no assumptions about covariates, heterogeneous treatment effects or the shape of the statistical distributions of the variables).  A crucial disadvantage of this simplicity is that precision is reduced, because no prior knowledge or theories can be used to design a more refined research hypothesis.  Precision is not the same as a lack of bias.  In RCTs treatment and control groups come from the same underlying distribution. Randomization guarantees that the net average balance of other causes (error term) is zero, but only when the RCT is repeated many times on the same population (which is rarely done). I hadn’t realized this before and it’s almost never mentioned in reports.  But it makes sense. In any one trial, the difference in means will be equal to the average treatment effect plus a term that reflects the imbalance in the net effects of the other causes. We do not know the size of this error term, but there is nothing in the randomization that limits its size.
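
To make that last point concrete, here is a minimal simulation sketch (my own illustrative numbers, not from Deaton’s paper): it assumes a constant treatment effect of 1 and a skewed “other causes” term, and shows that the difference in means is only unbiased on average over many hypothetical re-randomizations, while any single trial can land far from the truth.

```python
# Minimal sketch: in one trial, difference in means = true ATE + an imbalance term
# from all other causes. Only the average of that imbalance over many hypothetical
# re-randomizations is zero. All numbers below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

n = 100                      # hypothetical sample size
true_ate = 1.0               # assumed constant treatment effect
other_causes = rng.lognormal(mean=0.0, sigma=1.5, size=n)  # skewed "error" term

def one_trial():
    """Randomize half the sample to treatment and return the difference in means."""
    treated = rng.permutation(n) < n // 2
    outcome = other_causes + true_ate * treated
    return outcome[treated].mean() - outcome[~treated].mean()

estimates = np.array([one_trial() for _ in range(10_000)])

print("true ATE:", true_ate)
print("mean estimate over 10,000 re-randomizations:", estimates.mean())  # close to the true ATE
print("range of single-trial estimates:", estimates.min(), "to", estimates.max())  # can be far off
```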

RCTs are based on the fact that the difference in two means is the mean of the individual differences, i.e. the treatment effects.  This is not valid for medians. This focus on the mean makes RCT estimates sensitive to outliers in the data and to asymmetric distributions. Deaton shows how an RCT can yield completely different results depending on whether an outlier falls in the treatment or control group.  Many treatment effects are asymmetric, especially when money or health is involved. In a micro-financing scheme, a few talented, but credit-constrained entrepreneurs may experience a large and positive effect, while there is no effect for the majority of borrowers. Similarly, a health intervention may have no effect on the majority, but a large effect on a small group of people.
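
A toy numerical sketch (again my own made-up numbers, not Deaton’s) of both points: the difference in group means recovers the mean of the individual treatment effects while the analogous identity fails for medians, and with no true effect at all a single outlier can flip the estimate depending on which arm it happens to land in.

```python
import numpy as np

# (1) Means average individual effects; medians do not. For illustration we pretend
#     both potential outcomes are visible for everyone; in a real trial each person
#     is observed in only one arm.
untreated = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
effects   = np.array([0.0, 0.0, 5.0, 0.0, 0.0])   # one person with a large effect
treated   = untreated + effects

print(treated.mean() - untreated.mean(), effects.mean())              # 1.0 and 1.0 -- identical
print(np.median(treated) - np.median(untreated), np.median(effects))  # 1.0 and 0.0 -- not identical

# (2) Two groups drawn to be essentially identical, plus one extreme observation.
group_a = np.array([10.0, 11.0, 9.0, 10.0])
group_b = np.array([10.0, 9.0, 11.0, 10.0])
outlier = 100.0                                   # e.g. one unusually successful borrower

print(np.append(group_a, outlier).mean() - group_b.mean())   # +18.0 if the outlier is treated
print(group_a.mean() - np.append(group_b, outlier).mean())   # -18.0 if the outlier is a control
```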

A key argument in favour of randomization is the ability to blind both those receiving the treatment and those administering it.  In the social sciences, however, blinding is rarely possible. Subjects usually know whether they are receiving the treatment or not and can react to their assignment in ways that affect the outcome other than through the operation of the treatment itself. This is problematic, and not only because of selection bias: concerns about the placebo, Pygmalion, Hawthorne and John Henry effects are serious.

Deaton recognizes that RCTs have their use within social sciences. When combined with other methods, including conceptual and theoretical development, they can contribute to discovering not “what works,” but why things work.

Unless we are prepared to make assumptions, and to stand on what we know, making statements that will be incredible to some, all the credibility of RCTs is for naught.

Secondly, in cases where there is good reason to doubt the good faith of experimenters, as in some pharmaceutical trials, randomization will be the appropriate response. However, ignoring prior knowledge in the field should be resisted as a general prescription for scientific research.  Thirdly, an RCT may disprove a general theoretical proposition to which it provides a counterexample. Finally, an RCT, by demonstrating causality in some population, can be thought of as proof of concept that the treatment is capable of working somewhere.

Economists and other social scientists know a great deal, and there are many areas of theory and prior knowledge that are jointly endorsed by large numbers of knowledgeable researchers.  Such information needs to be built on and incorporated into new knowledge, not discarded in the face of aggressive know-nothing ignorance.

The conclusions of RCTs are often wrongly applied to other contexts. RCTs do not have external validity.  Establishing causality does nothing in and of itself to guarantee generalizability. Their results are not applicable outside the trial population. That doesn’t mean that RCTs are useless in other contexts. We can often learn much from coming to understand why replication failed, and use that knowledge to make appropriate use of the original findings by looking for how the factors that caused the original result might be expected to operate differently in different settings. However, generalizability can only be obtained by thinking through the causal chain that has generated the RCT result, the underlying structures that support this causal chain, whether that causal chain might operate in a new setting, and how it would do so with different joint distributions of the causal variables; we need to know why, and whether that why will apply elsewhere.

Bertrand Russell’s chicken provides an excellent example of the limitations of straightforward extrapolation from repeated successful replication.

The bird infers, based on multiple repeated evidence, that when the farmer comes in the morning, he feeds her. The inference serves her well until Christmas morning, when he wrings her neck and serves her for Christmas dinner. Of course, our chicken did not base her inference on an RCT. But had we constructed one for her, we would have obtained exactly the same result.

The results of RCTs must be integrated with other knowledge, including the practical wisdom of policy makers, if they are to be usable outside the context in which they were constructed.

Another limitation of the results of RCTs relates to their scalability. As with other research methods, failure of trial results to replicate at a larger scale is likely to be the rule rather than the exception. Using RCT results is not the same as assuming that the same result holds in all circumstances.  Giving one child a voucher to go to private school might improve her future, but doing so for everyone can decrease the quality of education for those children who are left in the public schools.

Knowing “what works” in a trial population is of limited value without understanding the political and institutional environment in which it is set. Jean Drèze notes, based on extensive experience in India, “when a foreign agency comes in with its heavy boots and suitcases of dollars to administer a ‘treatment,’ whether through a local NGO or government or whatever, there is a lot going on other than the treatment.” There is also the suspicion that a treatment that works does so because of the presence of the “treators,” often from abroad, rather than because of the people who will be called to work it in reality. Unfortunately, few RCTs are replicated after the pilot, on the scaled-up version of the experiment.

This readable paper from one of the foremost experts in development economics provides a valuable counterweight to the often unnuanced admiration for all things RCT.  In a previous post, I discussed Poor Economics by “randomistas” Duflo and Banerjee. For those who want to know more, there is an excellent debate online between Abhijit Banerjee (J-PAL, MIT) and Angus Deaton on the merits of RCTs.

4 comments on “Understandings and Misunderstandings about RCTs”

  1. Hamish Chalmers says:

    Is this actually a critique of *all* research that compares alternative interventions to help determine causal relationships? I am struggling to determine what in Deaton’s critique is applicable only to RCTs and what is universal in research of this kind, with the one obvious exception that RCTs use an unbiased allocation schedule to generate comparison groups.

    Random allocation to comparison groups is the only unique characteristic of an RCT. So why has Deaton conflated a raft of other methodological and design issues with this one defining feature? For example, Deaton talks of replication as if replication is applicable only to RCTs (it isn’t, replicating QEDs and observational studies is also good practice and helpful), he talks of blinding as if blinding is somehow part of the definition of an RCT (it isn’t, blinding is not necessary in an RCT any more than it is unique to RCTs). In the above we have the remarkably bold assertion that RCTs do not have external validity. Even if this were universally true (it’s not), it would not be unique to RCTs – all research is done in the past and is intended to inform the future, therefore all research must account, at the very least, for temporal differences in context.

    There are other straw men in this argument.

    Why Deaton has singled out RCTs for special treatment when his argument seems actually to be with all research is beyond me.

    • vandewst says:

      Hello Hamish,
      Thanks for your thoughtful comments. I agree that many arguments can be extended to other methodologies. However, it seems to me that Deaton’s arguments relate to the claims made by “randomistas” that the results of their research are superior to those obtained with other methods. Friendly regards, Stefaan

      • Hamish Chalmers says:

        Thank you. I think this is definitely an area of widespread misunderstanding and so I welcome your friendly debate.

        You may be correct about how Deaton views ‘randomistas’, but if so, he really needs to give examples of people claiming that the results of RCTs are superior to results obtained using other methods. I am a proud ‘randomista’ and I work with a lot of people who might be classified as such, and the idea that people like me say that the results of RCTs are always superior to alternative methods is just not a familiar one. In fact when reading reports of RCTs it is common to find loads of caveats about the findings.

        People who understand what RCTs are and what they are not know that the only unique feature of the design is that they generate comparison groups by randomly allocating cases to conditions. That’s it.

        I don’t think it is controversial for ‘randomistas’ to argue that this is the best way of generating comparison groups that differ only as a result of the play of chance, rather than as a result of some systematic (non-random) characteristic. In any population there will be things that we know and can measure (so for example we could deliberately match cases based on these factors – say age, gender, or test scores). But there are also things that might be relevant that we don’t or can’t know about our participants and therefore can’t take into account when generating comparison groups. If we accept that there are things that we don’t or can’t know about our participants, then the only way around it, if you want to create probabilistically similar groups, is to use random allocation. Random allocation thus acknowledges and accounts for the limitations of our knowledge.

        So, the notion of ‘superiority’ centres around the question ‘how confident am I that the groups being compared were similar in all important known and unknown (and possibly unknowable) characteristics?’

        Of course, if your research question is one that does not involve comparisons and causal description then RCTs are not appropriate. You would be hard pressed to find a ‘randomista’ arguing that you need an RCT to help understand the views or opinions of a population of interest, for example. In addition you will be unlikely to find a ‘randomista’ arguing that you need an RCT when observational studies have reported very dramatic effects. Take for example the tired old chestnut about not needing an RCT to find out if parachutes work. 99.9% of people who do not open their parachutes after jumping out of a plane die. This is a highly statistically significant finding and is extremely dramatic. There is no need to go beyond observation here.

        Unfortunately for us, the effects of interventions in the social sciences are rarely so dramatic. Therefore, one key element in making causal inferences is ensuring that when we compare alternative interventions or approaches we are, in the best way we know how, comparing like with like. This means that any differences in outcome that we observe between groups can be more confidently attributed to the interventions being compared rather than to an effect of non-random differences between groups.

        That’s the strength of an RCT.

