Annex 7: A Range of Quantitative, Qualitative and Theory-based Approaches for Defining the Counterfactual
Many development agencies have decided that, given the complexity of strategic interventions, and the fact that most of them are intended to cover the total target population (at the country or sector level), it is not possible to define a conventional statistical counterfactual. However, while the many difficulties of defining a counterfactual are acknowledged, without some way to estimate what the situation would have been had the intervention not taken place it is extremely difficult to assess the effectiveness of the intervention or the degree, if any, to which it contributed to its intended objectives. Consequently, there is a demand for creative approaches that development agencies can use in real-world contexts to assess what would have been the situation if the program or programs had not taken place: in other words, to define alternatives to the conventional counterfactual.
In recent years a number of options have been proposed for defining an alternative counterfactual, or comparison group. While it is recognized that many of the proposed approaches are methodologically weak, it is argued that in many circumstances they provide sufficient control to permit an approximate assessment of program impact. Counterfactual designs can be categorized into five main groups.
A. Theory driven designs: Designs based on a program theory and usually represented graphically through a logic model. The theory describes the nature of the problem that the program is designed to address, the objectives of the program, the steps through which the objectives will be achieved, and the critical assumptions that must be tested. Conventionally, the counterfactual is tested by assessing how closely outcomes and processes conform to, or deviate from, the model. Sometimes the counterfactual is defined more explicitly by formulating an alternative model describing outcomes and processes. Often the alternative model (counterfactual) is formulated on the basis of criticisms of the program theory on which the program design is based.
B. Quantitative approaches: These include experimental and quasi-experimental designs, which can be considered the conventional approaches to the use of counterfactuals, as well as a number of other quantitative techniques.
C. Qualitative approaches: The counterfactual may be derived from asking individuals or groups what the situation would have been if the project had not taken place, or asking them what things were like before the program began. Participatory group consultation techniques such as participatory rural appraisal (PRA) include exercises whereby groups reconstruct explanations of what changes the program produced and how.
D. Mixed methods designs: These designs integrate quantitative and qualitative techniques, building on the strengths of both approaches.
E. Rating scales: These are widely used for assessing the effectiveness and outcomes of complex, country-level programs.
A. Theory Driven Approaches
a. Program theory models: A fully articulated program theory model can describe the process through which a program is intended to produce changes, how the changes will be measured, the contextual factors that might explain variations in outcomes in different locations, and (through the use of results-chain modeling) some of the potential negative outcomes that should be monitored. The counterfactual can be defined in several ways.
First, the baseline conditions can be assessed through the initial diagnostic study describing the pre-intervention situation. The program theory describes the process of change that will occur if the underlying theory is valid and the outcomes that will be produced. The validity of the model is tested both by comparing actual outcomes to the theoretically expected outcomes, and by using process analysis to assess how closely the actual process of change conforms to the model. The analysis of the process strengthens the explanatory power of the model because a situation can occur in which the expected outcomes are achieved but the process of change does not correspond to the model. In this case, further analysis is needed to determine the validity of the model: whether it just needs minor refinements or whether the actual process of causality is significantly different from the model. If the latter, the experimental hypothesis may not have been supported by the evidence.
A second refinement of the model comes with the introduction of contextual analysis. The model can hypothesize how outcomes will be affected by contextual factors such as the local economy, the political context, the characteristics of participating communities and the level of support from local institutions. Contextual variables can either be analyzed descriptively, or, when the project operates in a large number of different locations and quantitative surveys are conducted, it may be possible to incorporate contextual variables into the analysis using dummy variables. In this way it is possible to test the validity of the model’s explanations of the role of these contextual factors.
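To illustrate, the minimal Python sketch below shows one way dummy variables for contextual factors could be entered into an outcome regression; all variable names and figures are hypothetical, and the statsmodels library is used purely for illustration.

```python
# Minimal sketch: testing whether contextual factors explain outcome
# variation, using dummy (indicator) variables in a regression.
# All variable names and figures are hypothetical illustrations.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical survey data: one row per community.
df = pd.DataFrame({
    "outcome": [52, 61, 58, 70, 47, 66, 55, 73],  # e.g. enrollment rate
    "treated": [0, 1, 0, 1, 0, 1, 0, 1],          # program participation
    "region":  ["N", "N", "S", "S", "N", "S", "S", "N"],
    "urban":   [0, 1, 0, 1, 1, 0, 0, 1],          # contextual factor
})

# C(region) expands the categorical context variable into dummies, so
# the model tests whether outcomes differ by context after controlling
# for program participation.
model = smf.ols("outcome ~ treated + C(region) + urban", data=df).fit()
print(model.summary())
```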
A third refinement is to define one or more alternative models describing what changes will occur if the experimental model is not true. One alternative is to define a set of outcome indicators on the assumption that the program has produced no changes; this no-change scenario can be defined as the counterfactual. Another model might describe the hypothesized processes and outcomes based on a different set of assumptions. For example, it might be hypothesized that a program providing scholarships to encourage children to enroll in secondary school, instead of benefiting low-income families as intended, would in fact be co-opted by the local elite. In this case the hypothesized outcomes would be increased enrollment by higher-income families but no change, or even negative change, for low-income families.
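The following minimal Python sketch illustrates, with hypothetical figures and thresholds, how observed changes by income group might be scored against the predictions of the original program theory and the rival elite-capture model.

```python
# Minimal sketch: scoring two rival program theories against observed
# changes in enrollment, by income group. All data and model
# predictions are hypothetical illustrations.

# Predicted direction of enrollment change under each model:
# +1 = increase expected, 0 = no change, -1 = decrease expected.
program_theory = {"low_income": +1, "high_income": 0}
elite_capture = {"low_income": 0, "high_income": +1}

# Observed change in enrollment rates (percentage points).
observed = {"low_income": 0.4, "high_income": 6.2}

def direction(x, tolerance=1.0):
    """Classify an observed change as +1, 0 or -1."""
    if x > tolerance:
        return +1
    if x < -tolerance:
        return -1
    return 0

for name, model in [("program theory", program_theory),
                    ("elite capture", elite_capture)]:
    matches = sum(direction(observed[g]) == model[g] for g in model)
    print(f"{name}: {matches}/{len(model)} predictions match")
```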
It should be noted that this approach has been criticized on theoretical grounds (including the assumption of linear causality), because the models are often too general to falsify and because they rarely identify and test all of the plausible rival hypotheses (Bamberger, Rugh and Mabry 2006, pp. 187-88; Cook 2000).
b. Historical analysis: Economic historians and political historians have often addressed the question of what would have happened if things had worked out differently in the past. In his 1964 book, Railroads and American Economic Growth: Essays in Econometric History, Fogel used quantitative methods to estimate what the United States would have been like in 1890 if there had been no railroads. He hypothesized that, in the absence of the railroad, America's large canal system would have been expanded and its roads would have been upgraded and paved. As canals and paved roads were already being developed, and as both were economically and technically feasible, it is very likely that they would have been expanded in the absence of railroads. Consequently, the estimated impact of the absence of railroads was much smaller than it would have been had the canal and road options not been available. He estimated that the level of per capita income achieved by January 1, 1890 would have been reached by March 31, 1890 if railroads had never been invented.
Similar approaches could potentially be applied today to evaluate infrastructure projects such as the construction of major roads where no statistical counterfactual is available. The evaluation design would involve the identification of the available options and defining these as the counterfactual. In some cases investment in railways or water transport might be viable options, whereas in other cases the alternative would be to continue with the previous means of transport, which will usually be an inferior road. However, in this case the counterfactual would have to take into consideration the increased volume of traffic that would have occurred, even on the inferior road, in response to economic growth.
c. General elimination theory: This approach, developed by Michael Scriven (1976), is similar to the methods often used in crime investigations. It is assumed that observed outcomes have one or more causes. The evaluation begins by preparing a List of Possible Causes (LOPC). Each candidate cause has a modus operandi that is analogous to a trail of evidence or a set of footprints. The trail is short if there is a proximate cause, and longer if there is a remote cause. The facts of the case are documented and the evidence is compared with the steps in the modus operandi for each option on the LOPC. Are all of the steps in the modus operandi present for any of the candidate causes? The credibility of each candidate cause is then compared, and alternative explanations are considered for all potentially credible options. As in crime solving, the credibility of the evidence, the logical consistency of the sequence, and the correct temporal order are all assessed. And, again as in crime solving, there is rarely incontestable scientific proof, so it is a question of assessing the credibility of the evidence and the argument.
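The elimination logic can be illustrated in a minimal Python sketch; the causes, modus operandi steps and evidence below are hypothetical, and in practice judgments of credibility are far more nuanced than a simple set comparison.

```python
# Minimal sketch: comparing documented evidence against the modus
# operandi of each candidate cause on the List of Possible Causes.
# Causes, steps and evidence are hypothetical illustrations.

modus_operandi = {
    "training program": {"staff attended", "skills applied", "output rose"},
    "new equipment":    {"equipment delivered", "equipment used", "output rose"},
    "staff turnover":   {"new hires", "output rose"},
}

# Facts documented by the evaluation.
evidence = {"staff attended", "skills applied", "output rose", "new hires"}

# A cause stays credible only if every step of its modus operandi is
# present in the evidence; otherwise it is eliminated. Remaining
# credible causes are then compared against each other.
for cause, steps in modus_operandi.items():
    missing = steps - evidence
    status = "credible" if not missing else f"eliminated (missing: {sorted(missing)})"
    print(f"{cause}: {status}")
```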
B. Quantitative Approaches
a. Experimental and quasi-experimental designs: These designs all use the conventional counterfactual based on a variation of the project-comparison group design. They are the designs typically used to evaluate the impacts of a project when a comparison (control) group can be identified and measured.
b. Pipeline design: This is a type of quasi-experimental design, but it is discussed separately here because it is widely used as an alternative way to define the counterfactual when an independent control group cannot be identified or measured. Pipeline designs are often used at the project level where a project such as road construction, installation of water supply or urban renewal is implemented in clearly defined phases over a period of years. The sections of the population that are not scheduled to receive services until the second or later phases can be used as control groups to assess the changes in output/impact indicators for the families/communities receiving benefits under phase 1. Despite some methodological challenges (Bamberger 2006, p. 11), this design is attractive because it avoids the ethical and other problems, and the costs, of having to select and interview a separate control group that will receive no benefits. It is possible, but usually more difficult, to apply a similar approach at the country or sector level:
- Use regions where the program has not yet been implemented as the comparison group. For example, in Guatemala a new pension benefit was to be provided to all people over the age of 60, but the government did not have the administrative capacity to introduce the program in all regions at the same time.
- For a program that will cover different agencies, use agencies not yet covered as the comparison group. For example, in Colombia the government was planning to implement an anti-corruption program in a number of different ministries and agencies, but a number of administrative procedures had to be completed before the program began in each ministry, so the program was launched at different points in time in different agencies.
The methodological challenge in the use of the pipeline approach at the country or sector level is that there are likely to be differences between the regions or agencies that enter the program at different points in time and these can weaken or invalidate the comparison. For example, the regions that do not enter the pension program in Phase 1 may be poorer regions with less capacity to comply with the administrative requirements, or they may be controlled by an opposition party so that government may deliberately delay their entry. Obviously it would be necessary to ascertain relevant characteristics of the comparison groups to determine how much those factors affect the kinds of changes being measured.
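As an illustration, the minimal Python sketch below compares mean changes for phase 1 communities with communities still in the pipeline; the data are hypothetical and the simple difference-in-differences calculation is only one of several ways the comparison could be made.

```python
# Minimal sketch: a pipeline comparison in which communities scheduled
# for phase 2 serve as the comparison group for phase 1 communities.
# All data are hypothetical illustrations.
import pandas as pd

df = pd.DataFrame({
    "community": ["A", "B", "C", "D", "E", "F"],
    "phase":     [1, 1, 1, 2, 2, 2],          # phase 2 = not yet served
    "baseline":  [40.0, 38.5, 42.0, 39.5, 41.0, 40.5],
    "followup":  [55.0, 52.5, 58.0, 43.0, 44.5, 42.0],
})
df["change"] = df["followup"] - df["baseline"]

# Difference-in-differences: mean change in phase 1 communities minus
# mean change in the not-yet-served phase 2 communities.
gain = df.groupby("phase")["change"].mean()
print(f"phase 1 change: {gain[1]:.1f}, phase 2 change: {gain[2]:.1f}")
print(f"estimated program effect: {gain[1] - gain[2]:.1f}")
```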
c. Concept mapping: Concept mapping is a technique that uses interviews with stakeholders or experts to obtain an approximate estimate of program effectiveness, outcomes or impacts. A comparison of the average ratings for areas receiving different levels of intervention, combined with a comparison of ratings before and after the intervention, can provide a counterfactual. The approach can be applied to evaluate a wide range of programs, including capacity development and technical assistance as well as programs providing more easily measured services such as health or education, and it can be used either in a multi-country program or within a single country.
A variation of this approach can be used in the very common situation where the evaluation is not commissioned until late in the program. In this case, after defining the key characteristics of a successful strategy (for example, gender mainstreaming), the stakeholders/experts are asked to:
- Rate each country on each scale as of the time the program began. Normally this must rely on recall, which introduces a potential source of bias.
- Rate the present situation of each country on the same scales. The difference between the two ratings provides an estimate of effectiveness or impact.
- Alternatively, rate the changes that have occurred in each country on each of the scales. Some researchers consider that asking people to rate the amount of change is more reliable than asking them to make two separate pre-program/post-program ratings.
The reliance on subjective judgment raises issues of validity and reliability and a key requirement is to include statistical tests for inter-rater reliability. These are built into the statistical packages used for concept mapping.
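Where such packages are not used, a basic check can be computed directly. The minimal Python sketch below computes a one-way intraclass correlation, ICC(1), on a hypothetical matrix of country ratings; it illustrates the idea and is not the specific procedure built into concept-mapping software.

```python
# Minimal sketch: a one-way intraclass correlation (ICC(1)) as a basic
# inter-rater reliability check on country ratings. The ratings matrix
# (rows = countries, columns = raters) is a hypothetical illustration.
import numpy as np

ratings = np.array([
    [3, 4, 3],   # country 1, rated by 3 experts
    [5, 5, 4],
    [2, 2, 3],
    [4, 5, 5],
], dtype=float)

n, k = ratings.shape
grand = ratings.mean()
row_means = ratings.mean(axis=1)

# Between-country and within-country mean squares (one-way ANOVA).
ms_between = k * np.sum((row_means - grand) ** 2) / (n - 1)
ms_within = np.sum((ratings - row_means[:, None]) ** 2) / (n * (k - 1))

# ICC(1) = (MSB - MSW) / (MSB + (k - 1) * MSW)
icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"ICC(1) = {icc1:.2f}")  # values near 1 indicate high agreement
```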
In addition to data sets that present carefully constructed statistical indicators that can be compared across countries, there are other data sets that compile all of the available surveys and data within a particular country. These are sometimes combined to construct an index ranking all countries on a particular question. For example, Transparency International compiles the available studies and data on corruption and publishes an annual ranking of most countries on a corruption scale.
d. Citizen report cards and consumer surveys: First used in Bangalore, India, these studies are based on a survey administered by a respected independent organization to a large random sample of residents in a particular city. Respondents are asked which service agencies they have had to contact (usually during the past 12 months) to resolve a problem. For each agency contacted they are asked a number of questions, including how they were treated by the agency, how many visits they had to make, and whether they had to pay bribes. The responses provide a baseline that is used to estimate progress in a follow-up study conducted several years later (Bamberger, MacKay and Ooi 2004, pp. 8-9; 2005, pp. 13-21). The study can be used as a comparison either for a similar program in another city or for assessing a different program in the original city. It would also be possible to combine data from a number of citizen report card studies to provide comparison data for assessing a program in a different country. The comparison could be based either on calculating average baseline data from a number of cities or on calculating the range of improvements that occurred between the first and second studies, to provide a yardstick for assessing the amount of change produced in the city or country being evaluated.
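As an illustration, the minimal Python sketch below pools hypothetical baseline and follow-up results from several cities to construct such a yardstick.

```python
# Minimal sketch: pooling baseline/follow-up results from citizen
# report card studies in several cities to build a benchmark for
# judging the change observed in the city being evaluated.
# All figures are hypothetical illustrations.

# Percentage of respondents reporting they had to pay a bribe.
studies = {
    "city_1": {"baseline": 32.0, "followup": 21.0},
    "city_2": {"baseline": 28.0, "followup": 24.0},
    "city_3": {"baseline": 40.0, "followup": 27.0},
}

improvements = [s["baseline"] - s["followup"] for s in studies.values()]
print(f"benchmark range of improvement: {min(improvements):.0f} to "
      f"{max(improvements):.0f} percentage points "
      f"(mean {sum(improvements) / len(improvements):.1f})")

# The observed improvement in the evaluated city can then be read
# against this yardstick rather than against a formal control group.
```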
e. Social network analysis: Social network analysis techniques ask a sample of respondents to name, for example, the people they know in the community, the people they would go to for help, the people who engage in high-risk behavior, or the people they dislike or are afraid of; or to indicate which institutions they know about or have used, and to rate their opinions of these institutions (Carrington, Scott and Wasserman 2005; Knoke and Yang 2008). The information is combined with socio-economic information on the respondents to draw community maps and to calculate, for example, the density of community networks, friendship patterns and lines of conflict, and knowledge of and opinions about community institutions, and to identify opinion leaders.
The techniques have been widely used in the fields of HIV/AIDS, health and family planning to identify patterns of spread of infection, to identify groups who are particularly active or at risk of unhealthy behavior, and to understand how information is disseminated. There are a number of potential applications for defining counterfactuals (an illustrative sketch follows the list below), for example:
- Creating indices of social conflict or solidarity that can be used to assess the impacts of community development programs. Project and comparison groups can be scored on the index or pre- and post-comparisons can be made for the project population.
- Baseline data on attitudes to or knowledge about community institutions that can be used to assess the effectiveness of information campaigns.
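As an illustration of the simpler indices mentioned above, the minimal Python sketch below computes network density and identifies a likely opinion leader for hypothetical project and comparison communities, using the open-source networkx library.

```python
# Minimal sketch: computing simple network indices for a project and a
# comparison community. The "who would you go to for help" ties are
# hypothetical illustrations.
import networkx as nx

project = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"),
                    ("c", "d"), ("d", "e"), ("b", "e")])
comparison = nx.Graph([("p", "q"), ("r", "s"), ("q", "r")])

for name, g in [("project", project), ("comparison", comparison)]:
    # Density: share of possible ties that actually exist. Degree
    # centrality identifies the best-connected (likely opinion leader).
    centrality = nx.degree_centrality(g)
    leader = max(centrality, key=centrality.get)
    print(f"{name}: density = {nx.density(g):.2f}, "
          f"likely opinion leader = {leader}")
```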
C. Qualitative Approaches
a. Realist evaluations: This approach addresses questions such as: what might work, for whom and under what circumstances? (Pawson and Tilley 1997; Pawson 2006). It also focuses on the evaluand, understanding exactly what project implementation means in practice. What exactly are the treatments and services that are being delivered and how are they delivered? The popularity of this approach comes from the recognition that much of the experimental and quasi-experimental design literature has focused on assessing outcomes, implicitly assuming that the program treatment is clearly understood and is implemented as planned. Pawson and Tilley present a methodology, supported with many examples, to illustrate how to understand what really happens during project implementation.
b. PRA and other participatory group consultation techniques: There are a wide variety of participatory group consultation tools and methods that can be used to identify important changes that have taken place in a community or organization and to explain the factors that contributed to the changes. Kumar (2002, Chapter 4) provides examples of how relational techniques such as Cause and Effect Diagrams and Impact Diagrams can be used for this purpose. Similarly, techniques such as Process Maps and Venn Diagrams can assess the contribution of a particular project or external agency to achieving important changes. While these techniques are rarely used to define a counterfactual, the methodology could be applied for this purpose by, for example, asking groups to identify differences in the conditions of communities and households where projects have been implemented and in communities without projects. The value of these techniques is that group participants can both assess the importance of different changes for the welfare of the communities and also describe the processes of change.
c. Qualitative analysis of comparator countries: In contrast to the quantitative country comparisons discussed above, qualitative comparisons normally involve only a small number of countries, typically selected from the same geographical region. They are selected judgmentally, either to find countries that are similar (for example, in terms of their education level or the quality of their road network) or to capture differences. Often the comparison is made on broad qualitative indicators such as "the quality of education management" or "the level of corruption." As the evaluation often studies broad system change or the effectiveness of particular policy measures, the comparison tends to be descriptive, seeking to detect broad patterns of change or quality. (See the discussions of expert judgment and key informants below for sources of such information.)
d. Comparison with other sectors: When the evaluation is assessing broad organizational changes or policy reforms (such as anti-corruption policies, decentralization or citizen participation in decision-making) the unit of analysis is often a ministry or government agency. Often the best available comparison group will be another ministry or a sample of ministries not affected, or not yet affected, by the policy or reform. As each ministry has very different characteristics, it is very difficult to make a comparison between one ministry that has been affected and one that has not. A better option is to make the comparison with a sample of ministries that have not been affected to try to construct an average baseline or performance indicator.
e. Expert judgment: These techniques use the opinions and judgment of experts to assess both the changes that have been produced by the project and what would have been the situation if the project had not taken place (the counterfactual). Sometimes methodologies such as concept mapping (discussed earlier) are used to conduct a systematic analysis of the opinions of large numbers of experts, but in the majority of cases opinions are obtained more informally through structured, semi-structured or unstructured interviews. When using expert judgment the goal should be to solicit the views of a relatively large number of representative experts, as experts tend to disagree among themselves.
f. Key informants: The approach is similar to consultation with experts but in this case the intention is to obtain opinions from a broad range of individuals who have different experiences and different perspectives on the project. Most of the respondents would not be considered “experts” in the conventional sense, but rather people who know the project and the communities in which it operates. So an evaluation of a program to control illegal substance abuse might include among key informants: illegal substance users, their spouses and sexual partners, drug distributors, people in the same demographic groups who do not use illegal substances, neighbors, teachers, community development workers, religious leaders and the police. All of these respondents can help with the difficult task of assessing what would have been the situation of the community and of the different groups using and affected by substance abuse if the program had not taken place.
g. Public expenditure tracking (PET) studies: Originating in Uganda, these studies use a very detailed and rigorous methodology to track the flow of approved expenditures for the education, health or other sectors from the ministry to the intended users (individual schools, health clinics, etc.). The studies monitor both the amount of time taken for the funds to reach the final user and the percentage "leakage" at each stage. The proportion of funds reaching the final user provides a baseline for assessing improvement when the study is repeated a few years later (Bamberger, MacKay and Ooi 2004, pp. 16-17; 2005, pp. 45-49). PET studies could be used in several ways to construct a counterfactual (a simple leakage calculation is sketched after the list below):
- If time and resources permit, a PET study could be conducted in the country and sector being studied to establish a baseline.
- Existing PET studies conducted at two points in time in other sectors in the country could be used to provide a rough benchmark of the level of improvement that can be achieved. Obviously the estimates would be more robust if information were available on several sectors.
- Data from other countries could be pooled to provide a benchmark both for initial levels of leakage and also to provide parameters for the level of improvement that can be achieved.
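The leakage calculation referred to above can be illustrated in a minimal Python sketch; the funding amounts and stage names are hypothetical.

```python
# Minimal sketch: computing percentage leakage at each stage of the
# flow of funds from ministry to schools. All amounts (in millions)
# and stage names are hypothetical illustrations.

stages = [("ministry", 100.0), ("district office", 82.0),
          ("sub-district", 70.0), ("schools", 61.0)]

# Leakage at each stage: share of funds lost between adjacent stages.
for (prev_name, prev_amt), (name, amt) in zip(stages, stages[1:]):
    leakage = 100 * (prev_amt - amt) / prev_amt
    print(f"{prev_name} -> {name}: {leakage:.1f}% leakage")

# The share reaching the final user is the baseline against which a
# repeat study a few years later can be compared.
reaching = 100 * stages[-1][1] / stages[0][1]
print(f"share of approved funds reaching schools: {reaching:.0f}%")
```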
h. Holistic analysis: Holistic assessment of complex programs can be used when one of the objectives of the evaluation is to assess the catalytic role (plausible contribution) of one of the agencies participating in a multi-donor funded program in strengthening coordination, improving the delivery of services and promoting capacity development for the whole program. This design will usually combine one or more of the following approaches:
- Interviews with international and national partners to assess the organizational, technical and other contribution of the agency being studied.
- Focus groups with different partners, key informants and civil society.
- Direct observation of meetings, project locations and other project activities to observe the style of the agency being studied.
- Review of records of agency meetings and other secondary sources that might provide information on agency performance.
- Self-assessment by agency staff based on interviews or responses to self-administered questionnaires.
A variant involves “reverse construction” of a macro logic model. Whereas typical logic models used in project design begin with the proposed interventions and then predict what outcomes are to be achieved, this approach assesses what change occurred in impact indicators (e.g. improved well-being of the target population) and then uses a variety of means (as discussed above) to assess the relative contributions of various program interventions.
D. Mixed Methods Designs
Mixed methods designs are a powerful approach when working with a number of different sources of information, none of which on its own produces a credible counterfactual. Mixed method designs combine the strengths of both quantitative and qualitative data collection and analysis methods and hence can maximize the efficiency with which available sources of data are used. One of the useful tools of mixed methods is triangulation, which uses two or more independent estimates to assess the validity of each data source and provides different perspectives for interpreting both the data and the findings of the analysis.
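As an illustration, the minimal Python sketch below triangulates two hypothetical independent estimates of the same outcome, checks whether they agree within their combined uncertainty, and pools them using inverse-variance weighting, one simple pooling rule among several.

```python
# Minimal sketch: triangulating two independent estimates of the same
# outcome (e.g. a survey-based and a key-informant-based estimate of a
# change in school enrollment). All values are hypothetical.

survey = {"estimate": 12.0, "std_error": 2.0}       # percentage points
informants = {"estimate": 9.0, "std_error": 4.0}

# Do the two sources agree within their combined uncertainty?
gap = abs(survey["estimate"] - informants["estimate"])
combined_se = (survey["std_error"] ** 2 + informants["std_error"] ** 2) ** 0.5
print(f"gap = {gap:.1f}, ~2 x combined SE = {2 * combined_se:.1f}")
print("sources agree" if gap < 2 * combined_se else "sources conflict")

# One simple way to pool agreeing estimates: inverse-variance weights.
w1 = 1 / survey["std_error"] ** 2
w2 = 1 / informants["std_error"] ** 2
pooled = (w1 * survey["estimate"] + w2 * informants["estimate"]) / (w1 + w2)
print(f"pooled estimate: {pooled:.1f} percentage points")
```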
E. Tools for Rating Performance of Complex Programs
Given the large number of activities included in many complex programs, and the need to ensure consistency and comparability in the assessment of many different kinds of interventions, most development agencies base their evaluations on a set of rating criteria. The majority of the rating systems are based on the guidelines developed by the OECD/DAC Network on Development Evaluation (OECD/DAC 2010). These assess development programs in terms of five criteria:
- Relevance: The extent to which the aid activity is suited to the priorities and policies of the target group, recipient and donor.
- Effectiveness: The extent to which an aid activity attains its objectives.
- Efficiency: Measures the outputs – qualitative and quantitative – in relation to the inputs.
- Impact: The positive and negative changes produced by a development intervention, directly or indirectly, intended or unintended.
- Sustainability: Whether the benefits of an activity are likely to continue after donor funding has been withdrawn. Projects need to be environmentally as well as financially sustainable.
Many agencies add additional criteria. For example, AusAID also uses the criteria of:
- Gender equality and equity.
- Effectiveness of the M&E system: how effectively does the M&E framework measure progress towards meeting objectives?
- Analysis and learning: how effectively are findings and lessons analyzed and used to improve future activities?