Understandings and Misunderstandings about RCTs

Policy makers and the media have shown a remarkable preference for Randomized Controlled Trials or RCTs in recent times. After their breakthrough in medicine, they are increasingly hailed as a way to bring human sciences into the realm of ‘evidence’-based policy. RCTs are believed to be accurate, objective and independent of the expert knowledge that is so widely distrusted these days. Policy makers are attracted by the seemingly ideology-free and theory-free focus on ‘what works’ in the RCT discourse.

Part of the appeal of RCTs lies in their simplicity. Trials are easily explained along the lines that random selection generates two otherwise identical groups, one treated and one not. All we need is to compare two averages. Unlike other methods, RCTs don’t require specialized understanding of the subject matter or prior knowledge. As such, the RCT seems a truly general tool that works in the same way in agriculture, medicine, economics and education.

Deaton cautions against this view of RCTs as the magic bullet in social research. In a lengthy but very readable NBER paper he outlines a range of misunderstandings about RCTs. These broadly fall into two categories: problems with the running of RCTs and problems with their interpretation.

Firstly, RCTs require minimal assumptions, prior knowledge or insight into the context. They are non-parametric and no information is needed about the underlying nature of the data (no assumptions about covariates, heterogeneous treatment effects or shape of statistical distributions of the variables). A crucial disadvantage of this simplicity is that precision is reduced, because no prior knowledge or theories can be used to design a more refined research hypothesis. Precision is not the same as a lack of bias. In RCTs treatment and control groups come from the same underlying distribution. Randomization guarantees that the net average balance of other causes (error term) is zero, but only when the RCT is repeated many times on the same population (which is rarely done). I hadn’t realized this before and it’s almost never mentioned in reports. But it makes sense. In any one trial, the difference in means will be equal to the average treatment effect plus a term that reflects the imbalance in the net effects of the other causes. We do not know the size of this error term, and there is nothing in the randomization that limits its size.
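To make the point about the error term concrete, here is a minimal simulation sketch in Python. All numbers are made up for illustration (100 units, a true effect of 1, ‘other causes’ with a standard deviation of 10); none of them come from Deaton’s paper. The sketch re-randomizes the same population many times: the average of the estimates sits close to the true effect, but a single trial can land far from it.

```python
import random
import statistics

def run_trial(baseline, effect, rng):
    """Randomly split the units in half and return the difference in mean outcomes."""
    units = list(range(len(baseline)))
    rng.shuffle(units)
    half = len(units) // 2
    treated, control = units[:half], units[half:]
    # Treated outcome = baseline (other causes) + treatment effect.
    mean_treated = statistics.fmean([baseline[i] + effect for i in treated])
    mean_control = statistics.fmean([baseline[i] for i in control])
    return mean_treated - mean_control

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical population: the net effect of all "other causes" per unit.
baseline = [rng.gauss(0, 10) for _ in range(100)]
TRUE_EFFECT = 1.0

# Repeat the randomization many times on the same population.
estimates = [run_trial(baseline, TRUE_EFFECT, rng) for _ in range(5000)]

mean_estimate = statistics.fmean(estimates)  # close to TRUE_EFFECT (unbiased)
spread = statistics.pstdev(estimates)        # typical error of one single trial

print(f"average over repetitions: {mean_estimate:.2f}")
print(f"spread of single-trial estimates: {spread:.2f}")
```

With these made-up settings the spread of single-trial estimates comes out larger than the true effect itself, which is exactly the situation in which one trial on its own says very little.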

RCTs are based on the fact that the difference in two means is the mean of the individual differences, i.e. the treatment effects.  This is not valid for medians. This focus on the mean makes them sensitive to outliers in the data and to asymmetrical distributions. Deaton shows how an RCT can yield completely different results depending on whether an outlier falls in the treatment or control group.  Many treatment effects are asymmetric, especially when money or health is involved. In a micro-financing scheme, a few talented, but credit-constrained entrepreneurs may experience a large and positive effect, while there is no effect for the majority of borrowers. Similarly, a health intervention may have no effect on the majority, but a large effect on a small group of people.
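A toy illustration of the outlier problem, with invented numbers rather than Deaton’s own example: the two groups below are identical, so the true effect is zero, and a single extreme unit is then assigned to one group or the other.

```python
import statistics

# Two identical groups: the true treatment effect is zero by construction.
control = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
treated = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
OUTLIER = 1000  # one hypothetical extreme unit, e.g. a very successful entrepreneur

def diff_means(t, c):
    return statistics.fmean(t) - statistics.fmean(c)

def diff_medians(t, c):
    return statistics.median(t) - statistics.median(c)

# The same outlier lands in the treatment group or in the control group.
mean_if_treated = diff_means(treated + [OUTLIER], control)
mean_if_control = diff_means(treated, control + [OUTLIER])
median_if_treated = diff_medians(treated + [OUTLIER], control)
median_if_control = diff_medians(treated, control + [OUTLIER])

print(mean_if_treated, mean_if_control)      # large, with opposite signs
print(median_if_treated, median_if_control)  # unaffected by the outlier
```

The difference in means flips from strongly positive to strongly negative depending on where the outlier falls, while the medians don’t move. The catch, as noted above, is that the difference in medians is not the median of the individual treatment effects, so switching to medians changes the question being answered.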

A key argument in favour of randomization is the ability to blind both those receiving the treatment and those administering it.  In social science, blinding is rarely possible though. Subjects usually know whether they are receiving the treatment or not and can react to their assignment in ways that can affect the outcome other than through the operation of the treatment. This is problematic, not only because of selection bias. Concerns about the placebo, Pygmalion, Hawthorne and John Henry effects are serious.

Deaton recognizes that RCTs have their use within social sciences. When combined with other methods, including conceptual and theoretical development, they can contribute to discovering not “what works,” but why things work.

Unless we are prepared to make assumptions, and to stand on what we know, making statements that will be incredible to some, all the credibility of RCTs is for naught.

Also in cases where there is good reason to doubt the good faith of experimenters, as in some pharmaceutical trials, randomization will be the appropriate response. However, ignoring the prior knowledge in the field should be resisted as a general prescription for scientific research.  Thirdly, an RCT may disprove a general theoretical proposition to which it provides a counterexample. Finally, an RCT, by demonstrating causality in some population can be thought of as proof of concept, that the treatment is capable of working somewhere.

Economists and other social scientists know a great deal, and there are many areas of theory and prior knowledge that are jointly endorsed by large numbers of knowledgeable researchers.  Such information needs to be built on and incorporated into new knowledge, not discarded in the face of aggressive know-nothing ignorance.

The conclusions of RCTs are often wrongly applied to other contexts. RCTs do not have external validity. Establishing causality does nothing in and of itself to guarantee generalizability. Their results are not applicable outside the trial population. That doesn’t mean that RCTs are useless in other contexts. We can often learn much from coming to understand why replication failed and use that knowledge to make appropriate use of the original findings by looking for how the factors that caused the original result might be expected to operate differently in different settings. However, generalizability can only be obtained by thinking through the causal chain that has generated the RCT result, the underlying structures that support this causal chain, whether that causal chain might operate in a new setting and how it would do so with different joint distributions of the causal variables; we need to know why and whether that why will apply elsewhere.

Bertrand Russell’s chicken provides an excellent example of the limitations to straightforward extrapolation from repeated successful replication.

The bird infers, based on multiple repeated evidence, that when the farmer comes in the morning, he feeds her. The inference serves her well until Christmas morning, when he wrings her neck and serves her for Christmas dinner. Of course, our chicken did not base her inference on an RCT. But had we constructed one for her, we would have obtained exactly the same result.

The results of RCTs must be integrated with other knowledge, including the practical wisdom of policy makers, if they are to be usable outside the context in which they were constructed.

Another limitation of the results of RCTs relates to their scalability. As with other research methods, failure of trial results to replicate at a larger scale is likely to be the rule rather than the exception. Using RCT results is not the same as assuming the same results hold in all circumstances. Giving one child a voucher to go to private school might improve her future, but doing so for everyone can decrease the quality of education for those children who are left in the public schools.

Knowing “what works” in a trial population is of limited value without understanding the political and institutional environment in which it is set. Jean Drèze notes, based on extensive experience in India, “when a foreign agency comes in with its heavy boots and suitcases of dollars to administer a ‘treatment,’ whether through a local NGO or government or whatever, there is a lot going on other than the treatment.” There is also the suspicion that a treatment that works does so because of the presence of the “treators,” often from abroad, rather than because of the people who will be called to work it in reality. Unfortunately, few RCTs are replicated after the pilot on the scaled-up version of the experiment.

This readable paper from one of the foremost experts in development economics provides a valuable counterweight to the often unnuanced admiration for all things RCT. In a previous post, I discussed Poor Economics from “randomistas” Duflo and Banerjee. For those who want to know more, there is an excellent debate online between Abhijit Banerjee (J-PAL, MIT) and Angus Deaton on the merits of RCTs.

Kicking Away The Ladder

Developed countries urge developing countries to adopt the “good” institutions and “good” policies which will bring them economic growth and prosperity. These are promoted by institutions such as the WTO, the IMF and the World Bank. Recipes such as abolishing trade tariffs, an independent central bank and adhering to intellectual property rights feature high on their agendas.

In his book “Kicking away the ladder” Ha-Joon Chang shows that these policies are not so beneficial for developing countries. Through historical analysis he shows that developed countries actively pursued all types of interventionist policies to achieve economic growth, contradicting the recipes they are now prescribing. A case of poachers turned gamekeepers.

Policies that were intensively used by the USA and European countries include tariff protection, import and export bans, direct state involvement in key industries, refusal to adopt patent laws, R&D support, granting monopoly rights, smuggling and poaching expert workers.  Chang points out that alleged free trade champions, the UK and USA, were the most protective of all and only switched to liberalisation after World War II when and as long as their hegemony was safe (see table below).  Asian tigers such as South Korea and Taiwan did the same, which explains their success.  Ha-Joon Chang shows that, in comparison, current developing countries offer relatively limited protection to their economies.

[Table: chang_tariffs]

What does it imply for development cooperation? Developed countries often expect developing countries to adopt world-class institutions and policies in next to no time. However, the path to these kinds of institutions was long and winding for developed countries, a slow process that took decades, with frequent reversals. We sometimes forget that universal suffrage was only achieved as recently as 1970 (in Canada) or 1971 (Switzerland). It took the USA until 1938 to ban child labour. Switzerland was notoriously late to adopt patent laws (explaining its success with pharmaceutical companies). Imposing world-class institutions or policies on developing countries can be harmful because they take a lot of human and financial resources, which may be better spent elsewhere. In fact, adopting such institutions and policies mainly benefits the developed countries, not the developing ones.

Ha-Joon Chang calls this practice of using successful strategies for economic development and then preventing other countries from applying the same strategy “kicking away the ladder”. The WTO negotiation rounds or regional trade agreements have a lot in common with the “unequal” treaties between colonisers and colonised countries.

Why is institutional development so slow? Are there no last-mover benefits? Chang gives the following reasons:

  1. Institutional development is firmly linked with the state’s capacity to collect taxes. This capacity is linked to its ability to command political legitimacy and its capacity to organize the state (see blog post on Thinking like a State). That’s also another reason why tariffs are so important for developing countries: they are some of the taxes that are easiest to collect. Institutional development is also linked to the development of human capacity within a country by its education system. Setting up “good” institutions in countries that don’t have the human capital for them will see those institutions undermined or functioning badly, or will draw scarce resources away from other sectors.
  2. Well-functioning institutions and policies need to fight initial resistance and prejudice. Chang points to the resistance to introducing an income tax at the beginning of the 20th century in western countries. It can take years and gradual policy changes to overcome this. The struggle to raise the retirement age in western countries is another illustration of the double standards we sometimes apply to developing countries.
  3. Many institutions are more the result of economic development than a condition for it. This is contentious, but Chang points to democracy as an example.

Chang advocates for developing countries to pursue an active interventionist economic policy.  His thesis confirms the importance of supporting developing countries in the strengthening of their education systems.  However, it also illustrates that the financial harm to developing countries as a result of unequal trade policies can be much higher than the aid flows to these countries.

Piketty and Inequality in South Africa

[Photo: Every month, hundreds of children are fleeing abject poverty in Zimbabwe and heading to South Africa. It’s a dangerous journey, but many take the risk in the hope of a better life. But once on the other side, there is help. With the support of UKaid from the Department for International Development, there is food, shelter and the chance to go to school. Find out more in our feature: www.dfid.gov.uk/musina]

South Africa is a very unequal country. It has one of the highest Gini coefficients in the world, particularly since various Latin American countries managed to bring their coefficients down in recent years. It should be noted that the Gini coefficient is an indicator of inequality in income, not in wealth. However, given South Africa’s history of Apartheid and colonialism, including wealth in the equation is not likely to reduce inequality. Admittedly, the Gini coefficient also ignores progress that has been made in the provision of basic services to the poor in housing, electricity provision, healthcare delivery and education infrastructure.

Why does inequality matter? A certain degree of inequality may well be positive for society. It stimulates people to find their talents and get the best out of them. However, too much inequality poses various problems. It’s morally indefensible that some people earn orders of magnitude more than others, whatever their skills. There’s also research that points to negative political effects of high inequality. In unequal societies democracy tends to be hollowed out as decision processes are captured by a tiny elite, the masses become powerless and disengaged, and the social state is dismantled. No longer “having skin in the game”, they vote for extremists. Economically, high inequality reduces consumption, compared to a more even distribution of means. High inequality also reduces social mobility, wasting talent.

Economists disagree on the evolution of inequality. Kuznets argued that in the initial stages of development, a country becomes more unequal. Some people move from poor to rich and compared to (almost) everyone being poor, this constitutes more inequality. As more people grow rich, inequality would drop. This view was challenged by Piketty in his book Capital. Piketty’s central thesis is that inequality naturally rises within a capitalist system, because the rate of return on wealth exceeds that of income (or economic growth). Rather than focusing only on equality of opportunity, Piketty shows that we should also worry about the inequality of outcomes. Piketty’s thesis has drawn both praise and criticism. Most critics acknowledge that inequality is rising, but dispute whether it’s an inherent characteristic of capitalism or whether there are other factors at play, such as globalisation, with its tendency towards delocalisation and winner-takes-all markets, and automation, which threatens many low-skilled and medium-skilled jobs. Piketty favours the ‘utopian’ solution of a global, progressive wealth tax. Awaiting utopia, progressively taxing income and property may help. Piketty argues that insufficiently progressive tax rates are at the basis of skyrocketing top wages.

How relevant is Piketty’s analysis of inequality for developing countries? South Africa, with 1% of the population earning 15% of total labour income and with two thirds of the population living in poverty, seems like a good illustration of Piketty’s thesis. Economic growth has been anaemic for years, whereas income from property and assets has been rising. High youth unemployment and a lack of unemployment benefits are one driver of inequality. A second is the high wage gap within the workplace. The low-quality education system churns out too many unqualified people and too few qualified ones. For maths, only 3% of Grade 9 learners achieve a score higher than 50% at the latest Annual National Assessments (ANAs) and 90% remain stuck in the lowest category, which indicates a total lack of basic numeracy. As a result, skilled people can command a premium while the unqualified remain stuck in menial, poorly-paid jobs. High inequality gradually erodes democratic institutions and public services are steadily privatized.

In other developing countries the situation is the opposite. High inequality in countries such as Cambodia is the result rather than the cause of weak public institutions. An effective administration to collect taxes, regulators to deal with monopolies, anti-corruption watchdogs and an impartial justice system are all absent, which favours a corrupt elite. In this case, taxing the rich more will not help. Only building more effective institutions can address this. This extends beyond nation states.

Solutions need to be found on a global scale.  Unfortunately, global governance institutions such as the WTO, WHO and the IMF provide global public goods, but suffer from a lack of democratic legitimacy, especially in developing countries.  Strengthening legitimate and global governance may help to address global inequalities.

Piketty’s book focuses on advanced countries, but the wealth of discussion it has triggered includes plenty of analysis of its relevance for developing countries.  Rising inequality within and between states is one of the defining themes of our times, partly causing and caused by Piketty’s work.

More information on the relevance of Piketty’s book for developing countries in general and South Africa in particular can be found here and here. Both articles are highly recommended.

The picture at the top of this post is courtesy of DFID and is released under an Attribution-NonCommercial-NoDerivs 2.0 Generic license.

Income Inequality in the Developing World

Science recently published a theme issue on income inequality in the developing world (free access, with registration).  It includes contributions from, among others, Thomas Piketty, Martin Ravallion and Angus Deaton.

The main idea from Piketty’s bestseller, Capital, is that inequality has been rising since the 19th century because yields on wealth are higher than those on income. This trend was only interrupted by the two world wars. Piketty’s thesis rests on historical data from the US and Europe. This theme issue examines whether the conclusions are valid for developing countries as well.

Has the strong economic growth in developing countries since 2000 resulted in falling levels of inequality?  And what has been the effect on poverty?  The main findings from the article of Ravallion:

Science 2014 May 344(6186) 851-5, Fig. 1

Science 2014 May 344(6186) 851-5, Fig. 4

Inequality has fallen between 1981 and 2010.  However, the period between 2005 and 2010 shows an increase.  The variance over time is mainly attributable to inequality between countries.  Again, most recent data indicate that the component between countries has fallen, whereas the component within countries has risen.

  • Economic growth has led to falling inequality between countries, but to rising inequality within countries (although the latter trend has weakened in recent years).
  • The effect of economic growth on poverty depends on the initial level of inequality.  The higher that level, the lower the share of economic growth that flows to the poor and the lower the poverty reduction resulting from that growth.
  • Even if inequality has not been rising overall, there are still worries about high levels of inequality in developing countries:
    • capital tends to have diminishing returns, implying it’s more ‘useful’ when more equally spread;
    • high inequality means that many poor, talented people cannot reach their full potential;
    • high inequality tends to erode democracy, as a small group of people may hijack the democratic process and turn ‘inclusive institutions’ into ‘extractive ones’ (see Acemoglu and Robinson’s work);
    • low inequality and a strong middle class tend to create a more diversified and robust economy, as a result of a stronger focus on consumption goods and support for pro-growth policies.
  • Three cautionary remarks on the data:
    • The data, using the Gini or related MLD indicators, represent relative inequality. This means that inequality is the same whether incomes are $1 and $2 or $1,000 and $2,000. This implies that even with constant relative inequality, the absolute differences in income and wealth can grow much larger.
    • Data on inequality in developing countries are notoriously unreliable.  The main data sources are the national accounts (household consumption item) and household surveys.  In the latter, the rich either don’t participate or tend to under-report their income and wealth.
    • Developing countries are a mixed bag: there are countries with rising inequality from a low base (India, China), countries with rising inequality from a high base (South Africa, with Gini = 0.7!) and countries with decreasing inequality (most countries in Latin America).
  • Falling inequality is not something which happens ‘automatically’ as countries grow rich, as was postulated by Simon Kuznets.  It’s the result of pro-equity policies, such as investments in health and education (Bolsa Familia in Brazil) and job creation.
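The first cautionary remark, that the Gini coefficient only measures relative inequality, is easy to check with a few lines of Python. This is a minimal sketch using the textbook mean-absolute-difference definition of the Gini coefficient and the two-person incomes from the example above; it is not the estimation procedure used in the Science articles.

```python
def gini(incomes):
    """Gini coefficient: mean absolute difference over all pairs, divided by twice the mean."""
    n = len(incomes)
    mean = sum(incomes) / n
    total_abs_diff = sum(abs(a - b) for a in incomes for b in incomes)
    return total_abs_diff / (2 * n * n * mean)

low = [1, 2]         # incomes of $1 and $2
high = [1000, 2000]  # incomes of $1,000 and $2,000

gini_low = gini(low)
gini_high = gini(high)
gap_low = max(low) - min(low)     # absolute gap: $1
gap_high = max(high) - min(high)  # absolute gap: $1,000

print(gini_low, gini_high)  # the same relative inequality
print(gap_low, gap_high)    # very different absolute gaps
```

Multiplying every income by 1,000 leaves the Gini untouched, while the absolute gap between the two people grows from $1 to $1,000.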

 

Eradication of Poverty on Hold?

The Economist recently ran a long piece on the evolution of purchasing power parity between the economies of developed and emerging countries. Up until a few years ago, it looked as if convergence would be reached within 30 years, even when excluding Chinese growth. Hundreds of millions of people were lifted out of poverty. Voices have been calling for the post-2015 global development goals to include the eradication of poverty by 2030.

However, the pace of economic growth has been slowing in emerging economies, not just in China, which is managing a difficult transition from low-wage, export-based manufacturing towards an economy dominated by services and internal consumption. At the current pace, it will take 150 years to catch up (using as indicator GDP per person in PPP as % of US GDP).
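The arithmetic behind such catch-up horizons is simple to sketch. Assuming constant growth rates (a strong simplification), the income ratio closes at the pace set by the growth gap. The starting ratio and the growth rates below are hypothetical, chosen only to show how sensitive the horizon is to that gap; they are not The Economist’s figures.

```python
import math

def years_to_converge(income_ratio, growth_emerging, growth_rich):
    """Years until income per person catches up, assuming constant growth rates.

    income_ratio: emerging-economy income as a fraction of rich-economy income.
    """
    if growth_emerging <= growth_rich:
        return math.inf  # the gap never closes
    gap_to_close = math.log(1 / income_ratio)
    yearly_gain = math.log((1 + growth_emerging) / (1 + growth_rich))
    return gap_to_close / yearly_gain

# Hypothetical scenario: income at 20% of the rich-country level,
# rich-country growth of 1.5% per person per year.
fast = years_to_converge(0.20, 0.060, 0.015)  # boom-years growth gap
slow = years_to_converge(0.20, 0.026, 0.015)  # much narrower growth gap

print(f"fast scenario: {fast:.0f} years")
print(f"slow scenario: {slow:.0f} years")
```

In this made-up scenario a growth gap of about 4.5 percentage points closes the income gap within a generation or so, while a gap of about one point stretches the horizon to roughly a century and a half.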

[Figure: emerging markets catching up]

Convergence was foreseen by economists like Robert Solow. As the main drivers he identified capital influx (as a result of higher interest rates offered by developing countries) and technological progress (enabling emerging economies to leapfrog development stages). Pietra Rivoli saw a ‘race to the bottom’ by poor countries as a way to attract labour-intensive industries, which allows people to abandon agriculture and get access to better services, creating a virtuous spiral.

The main reasons why the convergence has ground to a near standstill are:

  1. The peak of manufacturing in a country’s development occurs earlier and is lower than previously. Dani Rodrik attributes this to the growing role of technology, reducing demand for low-wage manufacturing jobs, lowering the incentive for companies to seek out regions with low wages and lowering the share of manufacturing in the total value chain of a product.
  2. The previous decade was a period of exceptional hyperglobalisation, spurred by strong demand for natural resources, China’s accession to the WTO and strong growth in trade (also outside China).

Rather than the optimistic scenario foreseeing income convergence within a generation, it looks like we’re back at the slow grind towards convergence, driven by incremental progress in geography (infrastructure, see work of Jared Diamond), institutions (see work of Daron Acemoglu) and trade (e.g. regional agreements on trade in services).

The article is rather pessimistic in tone, as it considers the gains in poverty reduction an exceptional feat not likely to be repeated soon. It raises critical questions for countries like India and Bangladesh which are looking to benefit from their demographic dividend and take over some of China’s low-wage industry. It also underlines the need for investments in education.

#WorldSTE2013 Conference – Day 2: Peer instruction (E. Mazur) and Visible Learning (J. Hattie)

Day 2 of the WorldSTE conference centred on the keynote sessions of two educational ‘rock stars’, Prof. Eric Mazur and Prof. John Hattie.  Both delivered a polished, entertaining presentation, but with little new information for those already familiar with their work.  The conference organizers provided little time for discussion which was, certainly in Hattie’s keynote, a pity.

Mazur’s presentation was a shortened version of the ‘Confessions of a Lecturer’ talk which is available on YouTube in various lengths and colours (recent one). Concept Tests combined with voting and peer discussion are a powerful way to activate students in lectures. He referred to Pinker’s ‘curse of knowledge’ as one reason why fellow students are often better than lecturers at explaining new stuff to each other. We have introduced the methodology in Cambodia as well, using voting cards rather than electronic clickers. From my experience, the main challenge for teacher trainers is to get the questions right. Questions should address a conceptual problem, should preferably relate it to an unfamiliar context and should neither be too easy nor too difficult.

Hattie’s keynote was based on the results of his meta-meta-analysis to determine what makes good learning. It is based on more than 800 meta-analyses into which more than 50 000 individual studies have been integrated. The starting point is the falsely authoritative claims many teachers and educators make about what works in education, often in conflict with each other. Extensive reviews of Hattie’s work have been written elsewhere (1, 2). Here I just write down some personal reflections on his talk:

  1. Hattie likes to unsettle people by listing some of the factors that don’t make a difference, such as teachers’ content knowledge, teacher training, class size, school structures, ability grouping, inquiry-based methods and ICT. However, I believe that many aspects of teaching quality are interrelated and strengthen or weaken each other. Content knowledge as such doesn’t make a good teacher, but is a necessary condition for teachers to engage in class discussion or provide meaningful feedback, which are factors that do make a difference in Hattie’s study. Similarly, class size doesn’t make a difference if the teacher doesn’t adapt his/her teaching. However, class size may affect the strategies and possibilities of teachers, as it affects factors such as class management, available space and time. In the same way school structures in themselves don’t change teaching quality, but may affect the opportunities for teachers to engage in collaborative lesson preparation, which is strongly endorsed by Hattie.
  2. Similarly, Hattie seemed to admit that many relations are non-linear and that there are threshold effects. Research on pedagogical content knowledge showed that teachers need to have a good understanding of the concepts they are teaching, but additional specialised subject courses don’t make an additional difference. In Cambodia, limited content knowledge does inhibit teachers from promoting deep learning, which also makes a difference in Hattie’s research.

Overview of effect sizes of variables on learning outcomes

3. This relates to the question of how valid results are across countries and cultures. Hattie’s research is mainly based on research from developed countries and Western cultures, and I wonder how applicable these effect sizes are in other countries and cultures. The threshold effect size value of 0.4 is based on the typical progression of a student in a developed country. In a developing country, an effect size of 0.4 may be actually quite high. Hattie does recognize that the teacher factor is stronger in schools with low economic status, implying that having a good teacher matters more for those pupils than for well-off kids. Banerjee and Duflo have suggested that, unlike the disappointing results in developed countries, ICT may have stronger benefits in developing countries:

“The current view of the use of technology in teaching in the education community is, however, not particularly positive. But this is based mainly on experience from rich countries, where the alternative to being taught by a computer is, to a large extent, being taught by a well-trained and motivated teacher.  This is not always the case in poor countries.  And the evidence from the developing world, though sparse, is quite positive.” (Duflo & Banerjee, Poor Economics, p. 100)

4. Hattie’s research doesn’t take into account factors that lie outside the influence of the school. However, many of the strongest factors in Hattie’s list, such as collaborative lesson preparation and evaluation, class discussions and setting student expectations, have been well known for quite some time. Why haven’t they been applied more? This question has been better addressed by researchers such as North and Konur, who focus on the institutional and organisational analysis of education quality.

5. The concept of effect sizes is statistically shaky. In a recent paper, Angus Deaton writes about effect sizes:

The effect size—the average treatment effect expressed in numbers of standard deviations of the original outcome—though conveniently dimensionless, has little to recommend it. It removes any discipline on what is being compared. Apples and oranges become immediately comparable, as do treatments whose inclusion in a meta-analysis is limited only by the imagination of the analysts in claiming similarity. Beyond that, restrictions on the trial sample will reduce the baseline standard deviation and inflate the effect size. More generally, effect sizes are open to manipulation by exclusion rules. It makes no sense to claim replicability on the basis of effect sizes, let alone to use them to rank projects.
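Deaton’s remark that restricting the trial sample inflates the effect size can be shown with a toy calculation. The baseline scores and the raw effect below are invented for illustration: the treatment adds the same five points in both cases, and only the spread of the admitted sample differs.

```python
import statistics

def effect_size(baseline_scores, raw_effect):
    """Average treatment effect expressed in baseline standard deviations."""
    return raw_effect / statistics.pstdev(baseline_scores)

# Hypothetical baseline test scores.
full_sample = [20, 30, 40, 50, 60, 70, 80]
restricted_sample = [40, 50, 60]  # exclusion rules admit only mid-range pupils
RAW_EFFECT = 5                    # identical gain in test points in both trials

es_full = effect_size(full_sample, RAW_EFFECT)
es_restricted = effect_size(restricted_sample, RAW_EFFECT)

print(f"full sample:       {es_full:.2f}")
print(f"restricted sample: {es_restricted:.2f}")
```

The same five-point gain counts as an effect size of 0.25 in the full sample, but clears the 0.4 bar comfortably once the sample is restricted, without the intervention having become any better.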

Hattie’s research is wildly ambitious, and has therefore attracted a great deal of scrutiny and criticism:

  • sole focus on quantitative research at the expense of qualitative studies (Terhart, 2011, login);
  • the statistics underlying effect sizes are controversial, as is the premise that effect sizes can be aggregated and compared (blog post on statistics used in Hattie’s research);
  • the quality of the studies underlying the meta-analysis varies wildly and they shouldn’t simply be aggregated due to publication bias (Higgins and Simpson, 2011, login). An extract:

“VL [Visible Learning] seems to suffer from many of the criticisms levelled at meta-analyses and then adds more problems derived from the meta-meta-analysis level. It combines studies across some areas with little apparent conceptual connection; averages results from experimental, nonexperimental, manipulable and non-manipulable studies; effectively ignores subtleties such as implementation cost, additive effects, arbitrary signs and longevity, even when many of the meta-analyses it relies upon carefully highlight these issues. It then combines all the effect sizes by simply adding them together and dividing by the number of studies with no weighting. In this way it develops a simple figure, 0.40, above which, it argues, interventions are ‘worth having’ and below which interventions are not ‘educationally significant’. We argue that the process by which this number has been derived has rendered it effectively meaningless.” (Higgins and Simpson, 2011)
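To see why averaging ‘with no weighting’ matters, here is a small sketch with three invented studies. Weighting by sample size is used as a simple stand-in; it is my assumption for the illustration, and real meta-analyses typically use inverse-variance weights instead.

```python
# Hypothetical studies as (effect size, number of participants).
studies = [
    (0.9, 20),    # small study, large effect
    (0.8, 30),    # small study, large effect
    (0.1, 2000),  # large study, small effect
]

# Unweighted: add the effect sizes and divide by the number of studies.
unweighted = sum(d for d, _ in studies) / len(studies)

# Weighted by sample size: large studies count for more.
total_n = sum(n for _, n in studies)
weighted = sum(d * n for d, n in studies) / total_n

print(f"unweighted mean effect size: {unweighted:.2f}")
print(f"sample-size-weighted mean:   {weighted:.2f}")
```

With these made-up numbers the unweighted average lands above the 0.40 threshold and the weighted one well below it, so the verdict ‘worth having’ depends on the averaging rule rather than on the evidence.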

Despite the claim on Hattie’s website, I don’t believe Hattie has finally found the ‘holy grail’ of education research and settled the question of what makes quality education. Partly this is due to skepticism about whether such a definitive generalized answer across cultures, education levels and economies is possible. Partly it is due to methodological concerns about the reliability of aggregating aggregations of effect sizes and the validity of excluding qualitative research and all factors that lie outside the influence of the school.

Finally, the Hattie keynote made me nostalgic about the H809 course in MAODE during which papers would be turned inside out until you would be convinced that each constituted the worst kind of educational research ever conducted.  Hattie’s research would fit excellently in such a context.