When a new programme or intervention is implemented, we all want to know: “Did it work?” A seemingly straightforward way to find out is to compare outcomes before and after. If GPA scores go up, health improves, or incomes rise, many take this as clear proof of impact. But is it?
In my last blog, we discussed why correlation doesn’t imply causation. This time, we will focus on a popular but often misleading approach: the pre and post-test design. Specifically, we will explore how, without a counterfactual (the crucial “what if” scenario), the pre and post-test design can lead to wrong conclusions. By the end of this post, you will understand why improvements seen in such studies don’t necessarily mean the programme caused them, and what better alternatives exist for confidently measuring a programme’s impact.
Pre and Post-Test Design
A pre and post-test design is often used to assess impact. It involves comparing the same group before and after an intervention, and the average difference between these two measurements is taken as evidence of the intervention’s impact. This simplicity makes it appealing, especially in resource-constrained settings, as it is easy to implement and provides a clear metric of improvement for stakeholders—all without requiring a control group.
To illustrate how pre and post-test analysis works in practice, I simulate a dataset of 500 individuals measured before and after an intervention. The Python code that generates the data is at the end of this post. In practice, many researchers and practitioners rely on a paired t-test (equivalent to a one-sample t-test on the pre-post differences) to measure the effect of an intervention. The logic is simple: if the mean outcome after the intervention is significantly higher than before, the intervention is assumed to have had a positive impact. Here is what the paired t-test on our simulated dataset shows for the continuous outcome variable (y); a code sketch for this test follows the table.
| | After | | Before | | Difference | |
|---|---|---|---|---|---|---|
| Outcome variable (y) | 86.485 | [8.149] | 81.484 | [6.772] | 5.001*** | (0.369) |
| Observations | 500 | | 500 | | 500 | |
Standard deviations in square brackets and standard errors in round brackets. ***p < 0.01, **p < 0.05, *p < 0.1.
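For readers who want to reproduce this step, here is a minimal sketch of the paired t-test on the simulated data (it assumes the DataFrame data built by the code at the end of this post):

```python
from scipy import stats

# Reshape to one row per individual, with the pre (0) and post (1) outcomes side by side
wide = data.pivot(index='id', columns='pre_post', values='y')

t_stat, p_value = stats.ttest_rel(wide[1], wide[0])
print(f"Mean difference: {(wide[1] - wide[0]).mean():.3f}")
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.4g}")
```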
At first glance, the result looks impressive—there is a statistically significant increase of 5 points in y after the intervention. But does this mean the intervention truly caused the change? In the next section, we will uncover why pre and post-test designs fail to provide valid causal estimates and what critical piece is missing in this approach.
The Missing Piece of the Puzzle
In evaluating the effectiveness of an intervention, the fundamental question is: What would have happened if the intervention had never taken place? This is the essence of a counterfactual—the outcome we would have observed in the absence of the intervention. According to the potential outcomes framework (also known as Rubin’s Causal Model), the impact of an intervention is the difference between the observed outcome (what happened) and the counterfactual outcome (what would have happened without the intervention). However, we face a fundamental challenge: we can never observe both outcomes for the same individual. So, we must estimate the counterfactual by finding a comparable group—individuals who are as similar as possible to those who received the intervention, differing only in whether they received it.
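In the standard potential-outcomes notation, writing $Y_i(1)$ for individual $i$’s outcome with the intervention and $Y_i(0)$ for the outcome without it, the individual effect and the average treatment effect (ATE) can be written as:

$$\tau_i = Y_i(1) - Y_i(0), \qquad \text{ATE} = \mathbb{E}\big[Y_i(1) - Y_i(0)\big].$$

For any given individual, only one of $Y_i(1)$ and $Y_i(0)$ is ever observed, which is exactly why a comparison group is needed to stand in for the missing term.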
A pre and post-test design uses the pre-intervention outcome as a substitute for the counterfactual. This is problematic because we don’t know what would have happened to the same individuals if the intervention hadn’t occurred. For instance, if we measured student test scores before and after introducing a new teaching method, improvements might be due to the method, or they could simply reflect natural progress as students gain knowledge and skills over the school year. Without a proper counterfactual, we cannot confidently attribute observed changes to the intervention itself.
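One way to see the problem is to decompose what the before-and-after comparison actually measures. Adding time subscripts to the same potential-outcomes notation, the pre and post difference for the treated group can be written as (a sketch, not a formal derivation):

$$\mathbb{E}\big[Y_{\text{post}}(1) - Y_{\text{pre}}(0)\big] = \underbrace{\mathbb{E}\big[Y_{\text{post}}(1) - Y_{\text{post}}(0)\big]}_{\text{true effect}} + \underbrace{\mathbb{E}\big[Y_{\text{post}}(0) - Y_{\text{pre}}(0)\big]}_{\text{change that would have happened anyway}}.$$

The second term (maturation, seasonality, or any other time trend) is bundled into the estimate, and nothing in the design allows us to separate it from the true effect.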
When we analyse our simulated data, a paired t-test shows a statistically significant increase in the outcome variable (y). But without a counterfactual, can we be sure this improvement was caused by the intervention? Could it have happened anyway due to external factors? This fundamental limitation—the absence of a counterfactual—makes pre and post-test designs vulnerable to misleading conclusions.
Filling in the Missing Piece
So, if the pre and post-test design falls short, how can we estimate the counterfactual? The most rigorous way to do this is through a Randomised Controlled Trial (RCT). In an RCT, individuals are randomly assigned to either a treatment group (which receives the intervention) or a control group (which does not). Because assignment is random, the two groups are, on average, similar in both observed and unobserved confounders—factors that could influence the outcome—and differ only in whether they receive the intervention. This makes the control group a valid approximation of the counterfactual. Therefore, we can compare the treatment group to the control group, and the difference in average outcomes between the two groups estimates the effect of the intervention.
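In the same notation, random assignment makes treatment status $T$ independent of the potential outcomes, so the simple difference in group means identifies the ATE:

$$\mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0] = \mathbb{E}[Y(1) \mid T = 1] - \mathbb{E}[Y(0) \mid T = 0] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \text{ATE}.$$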
In our dataset, we simulate a two-period setup where individuals are randomly assigned to either the treatment or control group in the post-period. Achieving balance between the treatment and control groups is crucial because it ensures that any differences in outcomes are likely due to the intervention, not pre-existing differences between the groups. To assess whether the two groups are comparable, we conduct a balance t-test on baseline covariates—variables that might affect the outcome (e.g., x1, x2, x3, and x4). While t-tests are commonly used, metrics such as standardised mean differences provide additional ways to assess balance (a code sketch follows the table). The small differences in observed covariates below are consistent with successful randomisation, which by design also balances unobserved confounders in expectation. This provides a robust counterfactual, something the pre and post-test design inherently lacks.
| Baseline Covariates | Treatment | | Control | | Difference | |
|---|---|---|---|---|---|---|
| x1 | 4.89 | [1.98] | 5.14 | [1.94] | -0.25 | (0.18) |
| x2 | 100.72 | [14.93] | 100.24 | [14.43] | 0.48 | (1.31) |
| x3 | 0.50 | [0.50] | 0.52 | [0.50] | -0.02 | (0.04) |
| x4 = A | 0.26 | [0.44] | 0.23 | [0.42] | 0.02 | (0.04) |
| x4 = B | 0.23 | [0.42] | 0.24 | [0.43] | -0.01 | (0.04) |
| x4 = C | 0.29 | [0.45] | 0.24 | [0.43] | 0.04 | (0.04) |
| x4 = D | 0.22 | [0.42] | 0.29 | [0.45] | -0.06 | (0.04) |
| Observations | 250 | | 250 | | 500 | |
Standard deviations in square brackets and standard errors in round brackets. ***p < 0.01, **p < 0.05, *p < 0.1.
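For illustration, here is a minimal sketch of such a balance check on the simulated data (continuous and binary covariates only; the categorical x4 would first need to be converted to dummies, for example with pd.get_dummies). The standardised mean difference printed alongside the t-test p-value is one common additional balance metric:

```python
import numpy as np
from scipy import stats

# Treatment status is recorded in the post period; covariates are time-invariant
post = data[data['pre_post'] == 1]
treated = post[post['treat'] == 1]
control = post[post['treat'] == 0]

for cov in ['x1', 'x2', 'x3']:
    diff = treated[cov].mean() - control[cov].mean()
    t_stat, p_value = stats.ttest_ind(treated[cov], control[cov])
    pooled_sd = np.sqrt((treated[cov].var() + control[cov].var()) / 2)  # simple pooled SD
    smd = diff / pooled_sd  # standardised mean difference
    print(f"{cov}: diff = {diff:.2f}, p = {p_value:.3f}, SMD = {smd:.2f}")
```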
Having established that the groups are balanced, we can now examine the treatment effect. The table below shows the estimated effect of the intervention. The first column presents the raw difference in outcomes between the treatment and control groups; the second column adjusts for baseline covariates, and the results remain consistent (a code sketch for both specifications follows the table). While randomisation aims to create balanced groups, some researchers argue that adjusting for baseline covariates can further reduce bias and increase precision. How might this consideration apply to your research or evaluations?
| | (1) | | (2) | |
|---|---|---|---|---|
| Treatment Effect | 9.906*** | (0.579) | 9.849*** | (0.432) |
| R-squared | 0.370 | | 0.662 | |
| Observations | 500 | | 500 | |
Robust standard errors in parentheses. ***p < 0.01, **p < 0.05, *p < 0.1. Column (2) is adjusted for covariates (x1, x2, x3, and x4).
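Here is a minimal sketch of how the two columns could be estimated with statsmodels, using only the post-period rows (where treatment status applies). HC1 robust standard errors are shown as one common choice; the exact robust variance estimator behind the table is an assumption:

```python
import statsmodels.formula.api as smf

post = data[data['pre_post'] == 1]

# Column (1): raw difference in means between treatment and control
raw = smf.ols('y ~ treat', data=post).fit(cov_type='HC1')

# Column (2): adjusted for baseline covariates; C(x4) expands the categorical variable into dummies
adj = smf.ols('y ~ treat + x1 + x2 + x3 + C(x4)', data=post).fit(cov_type='HC1')

for name, model in [('Raw', raw), ('Adjusted', adj)]:
    print(f"{name}: effect = {model.params['treat']:.3f}, "
          f"robust SE = {model.bse['treat']:.3f}, R-squared = {model.rsquared:.3f}")
```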
Concluding Remarks
While pre and post-test designs are widely used, their inability to provide a proper counterfactual makes it nearly impossible to distinguish true treatment effects from external influences. Relying solely on pre and post-test designs can lead to misleading conclusions, potentially resulting in wasted resources, ineffective programmes, or missed opportunities to create real impact. That said, such designs can still provide useful descriptive information or generate hypotheses for future research, particularly at exploratory stages or when other methods are not feasible.
RCTs address the missing counterfactual by ensuring comparability between treatment and control groups through randomisation, allowing for stronger causal claims. However, they are not always feasible due to ethical, logistical, or practical constraints. In such cases, quasi-experimental methods offer alternative strategies for dealing with the missing counterfactual, and we will explore them in future blog posts.
Regardless of the method used, it is essential to critically evaluate the potential for confounding factors and to interpret results with caution. I encourage you to reflect on the methods used in your research and evaluations. Are you relying on pre and post-test designs? If so, consider their limitations and explore alternative approaches that can provide more robust evidence of impact.
Data Generation Code
import pandas as pd
import numpy as np
# Set random seed for reproducibility
np.random.seed(42)
# Number of individuals
n = 500
# Generate unique IDs
ids = np.arange(1, n + 1)
# Period indicator: 0 = pre, 1 = post (alternating for each individual)
pre_post = np.tile([0, 1], n)
# Repeat IDs for pre and post periods
id_repeated = np.repeat(ids, 2)
# Generate covariates (same for pre and post periods)
x1 = np.random.normal(loc=5, scale=2, size=n)
x2 = np.random.normal(loc=100, scale=15, size=n)
x3 = np.random.binomial(1, 0.5, size=n)
x4 = np.random.choice(['A', 'B', 'C', 'D'], size=n, p=[0.25, 0.25, 0.25, 0.25])
# Repeat covariates for both periods
x1_repeated = np.repeat(x1, 2)
x2_repeated = np.repeat(x2, 2)
x3_repeated = np.repeat(x3, 2)
x4_repeated = np.repeat(x4, 2)
# Assign treatment randomly to 250 individuals in the post period
treatment_ids = np.random.choice(ids, size=250, replace=False)
treat = np.where((pre_post == 1) & np.isin(id_repeated, treatment_ids), 1, 0)
# Simulate outcome: covariate effects, a true treatment effect of 10, and random noise
y = (50 + 0.5 * x1_repeated + 0.3 * x2_repeated - 2 * x3_repeated +
np.where(treat == 1, 10, 0) + np.random.normal(0, 5, n * 2))
# Create DataFrame
data = pd.DataFrame({'id': id_repeated, 'pre_post': pre_post, 'treat': treat, 'y': y,
'x1': x1_repeated, 'x2': x2_repeated, 'x3': x3_repeated, 'x4': x4_repeated})