Large datasets are manageable in Stata when workflows are designed for efficiency. Most performance problems stem from avoidable habits: carrying unnecessary variables, sorting repeatedly, and running expensive operations before filtering.

Start by Reducing Dataset Size

Load only what you need and drop early.

use household_id region year income expenditure weight using "data/raw/survey.dta", clear
keep if year >= 2020
compress

This simple pattern often reduces memory pressure and runtime substantially.
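The row filter can also be applied inside `use` itself, so the excluded rows never occupy memory at all. A sketch of the same pattern in one step, using the same file and variable names as above:

```stata
* Load only the needed variables AND rows in a single step;
* observations failing the if condition are discarded on load
use household_id region year income expenditure weight if year >= 2020 ///
    using "data/raw/survey.dta", clear
compress
```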

Use Efficient Types and Encodings

  • Convert high-cardinality strings only when needed
  • Store binary indicators as byte
  • Use labels for readability without duplicating data

gen byte female = (sex == 2)
encode district_name, gen(district_id)
compress
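The labels bullet can be sketched as follows, assuming `sex` is coded 1/2; the label name `sexlbl` is illustrative:

```stata
* Attach value labels instead of storing descriptive strings on every row
label define sexlbl 1 "Male" 2 "Female"
label values sex sexlbl
label variable female "Respondent is female (sex == 2)"
```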

Avoid Recomputing Expensive Steps

If a step takes time, save an intermediate file.

* After heavy cleaning
save "data/clean/survey_core.dta", replace

This keeps iterative analysis fast and reproducible.
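For checkpoints needed only within a single run, a `tempfile` avoids cluttering the project directory; a minimal sketch (the macro name `core` is arbitrary):

```stata
* Session-scoped checkpoint: the file is removed when the do-file ends
tempfile core
save `core'

* ... destructive steps such as collapse ...

use `core', clear    // restore the pre-collapse data
```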

Prefer Grouped Operations Over Loops

Vectorized/group operations are usually clearer and faster than row-wise loops.

bysort district: egen mean_income = mean(income)
collapse (mean) income expenditure [pw=weight], by(region year)
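Within sorted groups, the built-in `_n` and `_N` counters handle many tasks that might otherwise tempt a loop, for example flagging each household's first observation (variable names follow the examples above):

```stata
* Sort by household, with years ordered within each household
bysort household_id (year): gen byte first_obs = (_n == 1)
* Number of observed waves per household
bysort household_id: gen n_waves = _N
```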

Use Faster Alternatives When Appropriate

Packages like gtools can speed up grouped operations on large files.

* Optional package
* ssc install gtools

gcollapse (mean) income expenditure, by(region year)

Treat speed tools as complements, not replacements for clear logic.
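For instance, the `egen` and `collapse` calls shown earlier have near drop-in gtools equivalents; a sketch assuming gtools is installed:

```stata
* gtools replacements for the earlier grouped operations
gegen mean_income = mean(income), by(district)
gcollapse (mean) income expenditure [pw=weight], by(region year)
```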

Merge Carefully on Large Files

Before merging:

  • Confirm key uniqueness
  • Keep only required variables
  • Check merge outcomes explicitly

isid household_id year
merge 1:1 household_id year using "data/clean/other_module.dta"
tab _merge

Silent merge errors are a common source of invalid analysis.
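When every observation is expected to match, stating that expectation explicitly turns a silent mismatch into a hard stop. A minimal sketch building on the merge above:

```stata
isid household_id year
merge 1:1 household_id year using "data/clean/other_module.dta"
assert _merge == 3    // fail loudly if any observation did not match
drop _merge
```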

Practical Performance Checklist

  1. Drop variables/rows early
  2. compress after major transformations
  3. Save checkpoints
  4. Avoid repeated full-dataset sorts
  5. Profile slow blocks with timers
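Item 5 can use Stata's built-in timers; a minimal sketch wrapping a suspected slow block:

```stata
timer clear
timer on 1
collapse (mean) income expenditure, by(region year)   // block being profiled
timer off 1
timer list 1    // prints elapsed seconds for timer 1
```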

Maintainability Rule

A fast script that no one can audit is not a good script. Prefer clear, modular code first, then optimize bottlenecks with measured changes.

Efficient Stata work is mostly about disciplined workflow design, not obscure tricks.