Understanding statistical methods is essential for rigorous development research. This post covers foundational concepts with mathematical notation.

The Linear Regression Model

The basic linear regression model estimates the relationship between a dependent variable \(Y\) and independent variables \(X\):

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$

Where:

  • \(Y_i\) is the outcome for observation \(i\)
  • \(\beta_0\) is the intercept
  • \(\beta_1, \beta_2\) are coefficients
  • \(\varepsilon_i\) is the error term
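
As a quick check, the sketch below simulates data from this model with known (hypothetical) coefficient values and confirms that regress recovers them; it is a minimal illustration, not part of the original example:

* Simulate the two-predictor model with known coefficients
clear
set obs 1000
set seed 12345
generate x1 = rnormal()
generate x2 = rnormal()
generate y = 1 + 0.5*x1 - 0.3*x2 + rnormal()

* Estimated coefficients should be close to 1, 0.5, and -0.3
regress y x1 x2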

Ordinary Least Squares (OLS)

The OLS estimator minimizes the sum of squared residuals. In matrix notation, with \(X\) the design matrix (including a column of ones for the intercept):

$$\hat{\beta} = (X'X)^{-1}X'Y$$

For a simple regression with one predictor, the slope coefficient is:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
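
The numerator is the sample covariance between \(X\) and \(Y\) (up to a factor that cancels) and the denominator is the sample variance of \(X\). A minimal sketch applying the formula term by term, reusing the simulated x1 and y from the sketch above:

* Build the deviations-from-means terms in the slope formula
egen xbar = mean(x1)
egen ybar = mean(y)
generate num_i = (x1 - xbar)*(y - ybar)
generate den_i = (x1 - xbar)^2

* Sum them and form the ratio; it should match the slope from regress
quietly summarize num_i
scalar num = r(sum)
quietly summarize den_i
scalar den = r(sum)
display "beta1_hat = " num/den
regress y x1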

Hypothesis Testing

To test whether income affects expenditure, we set up:

  • Null hypothesis: \(H_0: \beta_1 = 0\) (no effect)
  • Alternative: \(H_1: \beta_1 \neq 0\) (some effect)

The t-statistic is:

$$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$$

We reject \(H_0\) if \(|t| > t_{\text{crit}}\), the critical value at significance level \(\alpha = 0.05\).

Example in Stata

* Run OLS regression
regress expenditure income education age

* Get coefficient and standard error
display _b[income]
display _se[income]

* Test hypothesis: income coefficient = 0
test income = 0
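
The same test can be built by hand from the pieces above (a minimal sketch; invttail() returns the upper-tail Student-t critical value, and e(df_r) holds the residual degrees of freedom after regress):

* t-statistic computed manually from coefficient and standard error
display _b[income]/_se[income]

* Two-sided 5% critical value for comparison
display invttail(e(df_r), 0.025)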

R-Squared and Model Fit

The coefficient of determination measures explained variance:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}$$

An \(R^2\) of 0.65 means 65% of the variation in the dependent variable is explained by the model.
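
To make the formula concrete, the sketch below recomputes \(R^2\) from the residuals of the regression in the Stata example (expenditure, income, education, and age are the variables assumed there):

* Recompute R-squared by hand and compare with e(r2) from regress
quietly regress expenditure income education age
predict yhat, xb
generate res2 = (expenditure - yhat)^2
quietly summarize res2
scalar ss_res = r(sum)
quietly summarize expenditure
scalar ss_tot = r(Var)*(r(N) - 1)
display "R2 = " 1 - ss_res/ss_tot
display "e(r2) = " e(r2)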

Difference-in-Differences

For policy evaluation, the DiD estimator compares the before-after change in outcomes for the treated group with the same change for the control group:

$$\hat{\tau}_{DiD} = (\bar{Y}^{T}_{post} - \bar{Y}^{T}_{pre}) - (\bar{Y}^{C}_{post} - \bar{Y}^{C}_{pre})$$

This removes time-invariant unobserved heterogeneity under the parallel trends assumption.
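
In regression form, the same estimator is the coefficient on the treatment-post interaction. A minimal sketch, assuming a dataset with an outcome y, binary treat and post indicators, and a cluster identifier id (all names hypothetical):

* tau_DiD is the coefficient on 1.treat#1.post
regress y i.treat##i.post, vce(cluster id)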

Key Takeaways

  1. Always check regression assumptions: linearity, homoscedasticity, normality of errors
  2. Report standard errors—preferably clustered at the appropriate level
  3. Consider causality carefully—correlation is not causation
  4. Use robust standard errors when heteroscedasticity is present (see the sketch after this list)
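
For takeaways 2 and 4, both adjustments are one-line options in Stata (the regression reuses the earlier example; village is a hypothetical cluster variable):

* Heteroscedasticity-robust (Huber-White) standard errors
regress expenditure income education age, vce(robust)

* Standard errors clustered at the village level
regress expenditure income education age, vce(cluster village)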

Statistical methods are powerful tools, but their validity depends on careful attention to assumptions and honest reporting of limitations.