Understanding statistical methods is essential for rigorous development research. This post covers foundational concepts with mathematical notation.
The Linear Regression Model
The basic linear regression model estimates the relationship between a dependent variable \(Y\) and independent variables \(X\):
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$
Where:
- \(Y_i\) is the outcome for observation \(i\)
- \(\beta_0\) is the intercept
- \(\beta_1, \beta_2\) are coefficients
- \(\varepsilon_i\) is the error term
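The model above is easy to simulate. A minimal numpy sketch (the parameter values, sample size, and error distribution are illustrative choices, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
beta0, beta1, beta2 = 2.0, 1.5, -0.7  # true parameters (chosen for illustration)

X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
eps = rng.normal(scale=1.0, size=n)   # error term epsilon_i

# Generate outcomes from the model Y = b0 + b1*X1 + b2*X2 + eps
Y = beta0 + beta1 * X1 + beta2 * X2 + eps
```

Simulated data like this is a useful sanity check: an estimator applied to it should recover the parameters you put in.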
Ordinary Least Squares (OLS)
The OLS estimator minimizes the sum of squared residuals:
$$\hat{\beta} = (X'X)^{-1}X'Y$$
For a simple regression with one predictor, the slope coefficient is:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
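The covariance-ratio formula and the matrix formula \((X'X)^{-1}X'Y\) give the same slope. A numpy sketch on simulated data (the true slope of 2.0 is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=n)
Y = 3.0 + 2.0 * X + rng.normal(size=n)  # true intercept 3.0, slope 2.0

# Slope from the covariance formula
beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()

# Same answer from the matrix formula (X'X)^{-1} X'Y
Xmat = np.column_stack([np.ones(n), X])   # add a column of ones for the intercept
beta_hat = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ Y)
```

Solving the normal equations with `np.linalg.solve` is numerically preferable to forming the explicit inverse.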
Hypothesis Testing
To test whether income affects expenditure, we set up:
- Null hypothesis: \(H_0: \beta_1 = 0\) (no effect)
- Alternative: \(H_1: \beta_1 \neq 0\) (some effect)
The t-statistic is:
$$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$$
We reject \(H_0\) if \(|t| > t_{\text{crit}}\) at significance level \(\alpha = 0.05\).
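The same test can be sketched in Python (the income/expenditure data are simulated, the 0.8 effect is illustrative, and a normal approximation stands in for the exact t critical value, which is close for this many degrees of freedom):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
n = 100
income = rng.normal(50, 10, size=n)
# true income effect of 0.8 (illustrative)
expenditure = 5.0 + 0.8 * income + rng.normal(scale=5.0, size=n)

X = np.column_stack([np.ones(n), income])
beta_hat = np.linalg.solve(X.T @ X, X.T @ expenditure)
resid = expenditure - X @ beta_hat
sigma2 = resid @ resid / (n - 2)             # residual variance
cov = sigma2 * np.linalg.inv(X.T @ X)        # var-cov matrix of beta_hat
se_beta1 = np.sqrt(cov[1, 1])

t_stat = (beta_hat[1] - 0) / se_beta1        # t = (b1 - 0) / SE(b1)
# normal approximation to the t critical value (~1.96 for alpha = 0.05)
t_crit = NormalDist().inv_cdf(1 - 0.05 / 2)
reject_h0 = abs(t_stat) > t_crit
```

With a genuine effect in the simulated data, the test should reject \(H_0\) comfortably.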
Example in Stata
* Run OLS regression
regress expenditure income education age
* Get coefficient and standard error
display _b[income]
display _se[income]
* Test hypothesis: income coefficient = 0
test income = 0
R-Squared and Model Fit
The coefficient of determination measures explained variance:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}$$
An \(R^2\) of 0.65 means 65% of the variation in the dependent variable is explained by the model.
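Computing \(R^2\) from the residual and total sums of squares is a few lines of numpy (simulated data; for a simple regression with an intercept it also equals the squared correlation between \(X\) and \(Y\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(scale=1.5, size=n)  # illustrative parameters

Xmat = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ Y)
Y_hat = Xmat @ beta_hat                      # fitted values

ss_res = np.sum((Y - Y_hat) ** 2)            # residual sum of squares
ss_tot = np.sum((Y - Y.mean()) ** 2)         # total sum of squares
r2 = 1 - ss_res / ss_tot
```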
Difference-in-Differences
For policy evaluation, the DiD estimator compares the change in outcomes for the treated group to the change for the control group, before and after an intervention:
$$\hat{\tau}_{DiD} = (\bar{Y}^{T}_{post} - \bar{Y}^{T}_{pre}) - (\bar{Y}^{C}_{post} - \bar{Y}^{C}_{pre})$$
This removes time-invariant unobserved heterogeneity under the parallel trends assumption.
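The four group means plug straight into the formula. A simulated sketch, where the treatment effect (2.0), the group gap, and the common time trend are all illustrative values; because both groups share the same trend, parallel trends holds by construction:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000            # observations per group-period cell
true_effect = 2.0   # illustrative treatment effect

# Outcomes = group level + common time trend (+ treatment effect), plus noise
y_T_pre  = 10.0 + rng.normal(size=n)                      # treated, before
y_T_post = 10.0 + 3.0 + true_effect + rng.normal(size=n)  # treated, after
y_C_pre  = 8.0 + rng.normal(size=n)                       # control, before
y_C_post = 8.0 + 3.0 + rng.normal(size=n)                 # control, after

# DiD: (treated change) minus (control change)
did = (y_T_post.mean() - y_T_pre.mean()) - (y_C_post.mean() - y_C_pre.mean())
```

The control group's change (the shared 3.0 trend) nets out, leaving an estimate close to the true 2.0 effect.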
Key Takeaways
- Always check regression assumptions: linearity, homoscedasticity, normality of errors
- Report standard errors—preferably clustered at the appropriate level
- Consider causality carefully—correlation is not causation
- Use robust standard errors when heteroscedasticity is present
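The last takeaway can be illustrated concretely: a numpy sketch comparing classical standard errors with the HC1 (White) sandwich estimator on simulated heteroscedastic data (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X1 = rng.normal(size=n)
# Heteroscedastic errors: variance grows with |X1|
eps = rng.normal(size=n) * (0.5 + np.abs(X1))
Y = 1.0 + 2.0 * X1 + eps

X = np.column_stack([np.ones(n), X1])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
u = Y - X @ beta_hat                         # residuals

XtX_inv = np.linalg.inv(X.T @ X)
# Classical SEs assume a constant error variance
se_classic = np.sqrt(np.diag((u @ u / (n - 2)) * XtX_inv))
# HC1 sandwich: (X'X)^{-1} [sum x_i x_i' u_i^2] (X'X)^{-1}, small-sample corrected
meat = X.T @ (X * (u ** 2)[:, None])
cov_hc1 = (n / (n - 2)) * XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(cov_hc1))
```

Here the robust slope SE comes out noticeably larger than the classical one; in Stata the same correction is a `vce(robust)` option on `regress`.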
Statistical methods are powerful tools, but their validity depends on careful attention to assumptions and honest reporting of limitations.