Bangladesh’s national surveys (HIES, the Labour Force Survey, census samples) often contain hundreds of thousands of observations, and standard Stata commands can become painfully slow at that scale. Here’s how to work with them efficiently.
Start with Memory Management
* Check current memory usage and settings
memory
query memory
* Raise limits if needed (maxvar requires Stata/SE or MP;
* set matsize is obsolete in Stata 16 and later)
set maxvar 32767
set matsize 11000
* The most important optimization: keep only the variables you need
keep household_id weight region urban income expenditure
* Compress each variable to the smallest storage type that holds it
compress
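You can also subset while loading rather than after: use accepts a variable list and an if qualifier before using, so the full file never has to fit in memory. A minimal sketch, assuming a hypothetical file name hies.dta and the variables from the keep list above:
* Read only the needed variables and rows straight from disk
* (the filename and the urban filter are illustrative)
use household_id weight region urban income expenditure ///
    if urban == 1 using "hies.dta", clear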
Use gtools for Speed
The community-contributed gtools package reimplements common Stata commands as C plugins and dramatically speeds them up:
* Install gtools (one time)
ssc install gtools
* Compare speed: standard collapse vs gcollapse
use "large_data.dta", clear
timer clear
timer on 1
collapse (mean) income, by(region year)
timer off 1
* Reload: collapse replaced the data in memory
use "large_data.dta", clear
timer on 2
gcollapse (mean) income, by(region year)
timer off 2
timer list
* gtools is typically 10-50x faster
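In do-files that others will run, it helps to install gtools only when it is actually missing, so reruns don’t hit the network. A small guard using capture and which:
* Install gtools only if gcollapse is not already on the adopath
capture which gcollapse
if _rc ssc install gtools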
Efficient Data Processing
* Fast egen alternatives
gegen mean_inc = mean(income), by(district)
gegen group = group(division district)
* Faster duplicates check; with a varlist, drop requires force
* because it discards observations that may differ on other variables
gduplicates report household_id
gduplicates drop household_id, force
* Fast sorting via hashing
hashsort district household_id
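One caveat for survey data: the timing example above ignores the sampling weights. gcollapse accepts the same weight syntax as collapse, so weighted estimates cost no extra effort; a sketch assuming weight holds the household sampling weight, as in the keep list earlier:
* Weighted means using the survey's sampling weights
gcollapse (mean) income expenditure [pw=weight], by(region urban)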
Best Practices
- Drop early: Remove unneeded variables and observations before any heavy processing
- Compress often: Run compress after creating new variables
- Use efficient types: Store binary indicators as byte, not float
- Process in chunks: For very large files, split the work by region or division (see the sketch after this list)
- Save intermediate files: Save cleaned datasets to disk so expensive steps never have to be re-run
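To make the last two points concrete, here is one hedged pattern for chunked processing: loop over regions, clean each slice on its own, save it, and append the pieces at the end. The filenames are illustrative, and region is assumed to be numeric:
* Get the list of regions without loading the whole file
use region using "hies.dta", clear
levelsof region, local(regions)
* Process one region at a time and save each chunk
foreach r of local regions {
    use if region == `r' using "hies.dta", clear
    * ... expensive cleaning steps for this slice ...
    compress
    save "chunk_`r'.dta", replace
}
* Recombine the processed chunks
clear
foreach r of local regions {
    append using "chunk_`r'.dta"
}
save "hies_clean.dta", replace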
These techniques can turn hour-long jobs into minutes, making large-scale survey analysis practical even on modest computers.