Impact evaluation is often taught as a sequence of method labels: randomized controlled trials, difference-in-differences, regression discontinuity, matching, and so on. That overview is useful for orientation, but it can hide the real task. Before choosing a design, you need to ask what effect is being estimated, how treatment enters people’s lives, and what comparison could plausibly stand in for the counterfactual.

Every evaluation design earns credibility from a different source. Randomization earns it from controlled assignment. Difference-in-differences earns it from trend assumptions. Regression discontinuity earns it from a credible cutoff rule. Matching earns it only from observed comparability, which gives it sharper limits. The right method is the one whose identifying logic matches the way treatment is actually assigned and measured.

In other words, method is not decoration around the research question. It is the translation of the assignment process into an estimable design.

Start With the Estimand

Before discussing software or model specification, define the estimand. At minimum, clarify:

  • what counts as treatment
  • what outcome is being affected
  • which population is of interest
  • what comparison is supposed to represent the counterfactual
  • over what time horizon effects should appear

This sounds basic, but many evaluation problems begin here. Studies often describe an intervention clearly while leaving the estimand vague. Are we trying to estimate the average effect on all eligible households, on those actually treated, or on units near a threshold? Are we measuring immediate response, medium-run adjustment, or post-implementation equilibrium?

When the estimand is underspecified, it becomes easy to move between interpretations that the design does not actually support.

Choose a Design Based on Assignment Logic

A practical way to choose among evaluation designs is to ask how treatment is assigned in the real world.

If assignment can be controlled

An experimental design may be feasible if ethical and operational constraints allow randomized assignment.

If treatment rolls out across time or place

A difference-in-differences or other quasi-experimental comparison-of-changes design may be possible if untreated units provide a credible comparison and pre-treatment dynamics are informative.

If eligibility follows a sharp rule

Regression discontinuity may be appropriate when a score, threshold, or ranking determines treatment and manipulation around the cutoff is limited.

If no strong design feature exists

Observational approaches such as matching or regression adjustment may still be informative, but the interpretation should be more cautious because hidden selection remains a live risk.

This assignment-first framing reduces a common mistake: selecting the most impressive-looking method before checking whether the underlying institutional process supports it.

Randomized Controlled Trials: Strong Design, Demanding Implementation

Randomized controlled trials are powerful because treatment assignment is deliberately separated from baseline characteristics. If randomization is implemented correctly, treatment and control groups should be comparable on average, which makes causal interpretation stronger than in most observational designs.

But strong identification does not mean automatic validity. Experimental studies still face practical threats:

  • imperfect compliance
  • attrition that differs across groups
  • spillovers between treated and untreated units
  • weak measurement of outcomes
  • treatment variation that is not tracked clearly

RCTs are best suited when the intervention can actually be randomized, the unit of assignment is clear, and the team can manage implementation discipline. They are not automatically the right choice just because a program has not launched yet.

The most common misuse of RCT language is to act as though random assignment solves every later problem. It does not. Randomization helps with baseline comparability; it does not rescue poor implementation, weak outcomes, or careless interpretation.
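
In Stata, the core experimental analysis is often a short regression of the outcome on assigned treatment, preceded by balance checks on baseline characteristics. The sketch below is illustrative: the variable names (treated, baseline_outcome, cluster_id) are placeholders, and the clustering option applies only when assignment happens at a group level.

* Baseline balance checks: regress pre-treatment characteristics on assignment
reg age i.treated, vce(cluster cluster_id)
reg baseline_outcome i.treated, vce(cluster cluster_id)

* Intent-to-treat estimate: outcome regressed on assigned treatment
reg outcome i.treated, vce(cluster cluster_id)

With imperfect compliance, the last regression recovers the effect of being assigned to treatment, not the effect of actually receiving it, which is one reason compliance needs to be tracked rather than assumed.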

Difference-in-Differences: Useful When Timing Carries Information

Difference-in-differences (DiD) is attractive because many policy and program settings create before-and-after variation across treated and untreated groups. The core idea is familiar: compare the change in outcomes for treated units to the change for comparison units.

What makes DiD credible is not the regression command. It is the parallel trends assumption: in the absence of treatment, would the two groups have moved similarly enough that the difference in their changes is informative?

That requirement does not mean treated and control groups must look identical. It means the untreated trend for the comparison group should be a reasonable proxy for the counterfactual trend of the treated group. This is partly a substantive question about institutions and exposure patterns, not only a statistical question.

DiD is often most useful when:

  • treatment timing varies across units
  • multiple pre-treatment periods exist
  • the policy shock is plausibly exogenous to short-run outcomes
  • comparable untreated units can be defined clearly

It becomes weaker when treated units are already on visibly different trajectories, when anticipation effects are likely, or when treatment status is entangled with prior shocks or political allocation processes.

A minimal Stata specification may look simple:

* Outcome on treatment x post interaction
* (treated and post are 0/1 indicators; ## adds both main effects plus their interaction)
reg outcome treated##post, vce(cluster cluster_id)

* With covariates
reg outcome treated##post age education baseline_outcome, vce(cluster cluster_id)

But the command is the easy part. The harder work is defending the comparison group, checking timing, and being honest about whether the design assumptions are plausible.
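
One concrete way to probe that trend assumption, assuming several pre-treatment periods are observed, is to test whether the groups were already diverging before treatment. The sketch below uses illustrative variable names (period, post) and can only support, never prove, parallel trends.

* Pre-trend check: did treated and comparison groups trend differently
* before treatment? Uses pre-treatment periods only
reg outcome i.treated##i.period if post == 0, vce(cluster cluster_id)

* Visual check of group means over time
preserve
collapse (mean) outcome, by(treated period)
twoway (line outcome period if treated == 1, sort) (line outcome period if treated == 0, sort)
restore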

Regression Discontinuity: Strong Near a Cutoff, Narrow by Design

Regression discontinuity design (RDD) is powerful when treatment is allocated using a cutoff that units cannot easily manipulate. The logic is local: units just above and below the threshold are assumed to be similar enough that discontinuity in outcomes at the cutoff can be attributed to treatment.

RDD is often compelling when the rule is administratively clear, such as a score, income threshold, or eligibility rank. It is less convincing when the rule is loosely applied, poorly measured, or strategically manipulated.

The most common misuse of RDD is interpretive overreach. A valid local treatment effect near the threshold is not automatically the effect for the full eligible population. That does not make RDD weak. It simply means the external validity claim should match the design.

Researchers should therefore be explicit about:

  • whether the cutoff was actually enforced
  • whether units could influence the running variable
  • how sensitive the result is to bandwidth choices
  • what the effect means substantively given its local nature
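
In Stata, local RD estimates are commonly obtained with the community-contributed rdrobust package, with rddensity for manipulation tests. The sketch below assumes those packages are installed and uses an illustrative running variable (score) and cutoff (50).

* Install once if needed: ssc install rdrobust, then ssc install rddensity
* Local polynomial estimate of the discontinuity at the cutoff
rdrobust outcome score, c(50)

* Sensitivity to bandwidth: re-estimate with manually chosen bandwidths
rdrobust outcome score, c(50) h(5)
rdrobust outcome score, c(50) h(10)

* Density test for manipulation of the running variable around the cutoff
rddensity score, c(50)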

Matching Methods: Sometimes Helpful, Never a Shortcut to Causality

Matching methods can improve balance on observed characteristics and provide a more disciplined comparison than raw treated-versus-untreated means. They can be useful when treatment assignment is not random but detailed baseline information exists and the objective is to improve comparability.

However, matching does not create causal identification by itself. Its core limitation is unchanged: if treated and untreated units differ on unobserved factors related to outcomes, bias may remain substantial.

That means matching is strongest when used modestly:

  • to improve descriptive comparability
  • to support transparency about observable overlap
  • as one part of a broader sensitivity-oriented analysis

The most common misuse is presenting matched estimates as though the matching step solved selection entirely. It rarely does.
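
As one sketch of that modest use, Stata's built-in teffects command can estimate a propensity-score-matched comparison, and the estimated propensity score can be inspected for overlap. Variable names below are illustrative, and nothing here addresses selection on unobserved factors.

* Propensity score matching on observed covariates, effect on the treated
teffects psmatch (outcome) (treated age education baseline_outcome), atet

* Inspect overlap in the estimated propensity score by treatment status
logit treated age education baseline_outcome
predict pscore, pr
twoway (kdensity pscore if treated == 1) (kdensity pscore if treated == 0)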

Threats That Cut Across Designs

Regardless of method, several issues repeatedly weaken impact evaluations.

Spillovers

Treated units may affect untreated units through information, prices, social interaction, or market competition. If ignored, spillovers can bias estimates in either direction.

Attrition

If follow-up loss differs systematically by treatment status or outcome risk, the sample being analyzed may no longer represent the sample that was assigned or initially observed.
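
A first diagnostic, sketched below with illustrative variable names, is to test whether attrition itself is predicted by treatment status; a null result is reassuring but does not rule out selective attrition on unobserved outcome risk.

* Flag units missing at follow-up and test for differential attrition
gen attrited = missing(outcome_endline)
reg attrited i.treated, vce(cluster cluster_id)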

Clustering

Programs are often assigned or delivered at group levels such as village, school, branch, or office. Standard errors should reflect the level at which errors are correlated, not just the level at which rows are stored.
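
For example, if the data rows are households but treatment is assigned by village, standard errors should be clustered at the village level; village_id below is an illustrative identifier.

* Rows are households, but assignment happened at the village level
reg outcome i.treated, vce(cluster village_id)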

Timing mismatch

Some studies measure outcomes before effects could plausibly emerge, while others wait so long that many additional shocks have entered the picture. Timing should match the theory of change.

Overloaded outcome sets

When too many loosely prioritized outcomes are tested, interpretation becomes weaker and post hoc emphasis becomes tempting.

These problems are methodological, but they are also design and management problems. Strong evaluation requires operational discipline as well as econometric technique.

What Responsible Interpretation Sounds Like

A credible impact evaluation does not just present a point estimate. It explains why the estimate should be believed, what population it applies to, and where the design’s limits begin.

Responsible interpretation usually includes:

  • a plain-language statement of the identifying assumption
  • an explanation of why treatment assignment works as argued
  • uncertainty measures appropriate to the design
  • a clear distinction between internal validity and external validity
  • discussion of alternative explanations that remain possible

This kind of writing matters because users of evaluation evidence often remember the headline finding more than the design details. If caveats are buried, the evidence is easy to misuse.

What a Good Evaluation Makes Clear

Good impact evaluation is not defined by whether the method name sounds rigorous. It is defined by alignment between question, assignment process, data structure, and interpretation.

A strong evaluation should leave a careful reader able to answer four questions:

  1. What is the estimated effect, exactly?
  2. Why is the comparison credible?
  3. What are the main threats to that credibility?
  4. How far can the result reasonably be generalized?

That is the practical test of an evaluation: not whether the method has prestige, but whether a careful reader can see what is being estimated, why the comparison is believable, and where the claim should stop.

If those boundaries are explicit, the evaluation can inform decisions without pretending to settle more than it actually does.