Bangladesh’s national surveys (HIES, the Labor Force Survey, Census samples) often contain hundreds of thousands of observations, and standard Stata commands can become painfully slow at that scale. Here’s how to work efficiently.

Start with Memory Management

* Check current memory usage and settings
memory
query memory

* Raise the variable limit if needed (Stata/SE or MP only)
set maxvar 32767
* Needed only in older Stata; Stata 16+ manages matrix memory automatically
set matsize 11000

* The most important optimization: keep only the variables you need
keep household_id weight region urban income expenditure

* Compress variables to the smallest storage types that hold their values
compress
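
Better yet, you can avoid loading unneeded columns in the first place: Stata’s use command accepts a varlist and an if condition, so only the subset is read from disk. A minimal sketch, assuming the file is named hies.dta (the file name is illustrative):

* Load only the needed variables directly from disk
use household_id weight region urban income expenditure ///
    using "hies.dta", clear

* Variables and observations can be restricted at load time together
use household_id weight income if urban == 1 using "hies.dta", clear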

Use gtools for Speed

The gtools package dramatically speeds up common operations:

* Install gtools (one time)
ssc install gtools

* Compare speed: standard vs gtools
use "large_data.dta", clear
timer clear

timer on 1
collapse (mean) income, by(region year)
timer off 1

* Reload, since collapse replaced the data in memory
use "large_data.dta", clear

timer on 2
gcollapse (mean) income, by(region year)
timer off 2

timer list
* gtools is typically 10-50x faster

Efficient Data Processing

* Fast egen alternatives
gegen mean_inc = mean(income), by(district)
gegen group = group(division district)

* Faster duplicates check
gduplicates report household_id

* Drop duplicates on household_id; force keeps the first occurrence
* of each group, so review the report before dropping
gduplicates drop household_id, force

* Fast sorting
hashsort district household_id
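
gtools also speeds up quantile work, which comes up constantly with income and expenditure data. A minimal sketch using gquantiles (variable names illustrative):

* Fast alternative to xtile: assign each household an income quintile
gquantiles inc_quintile = income, xtile nquantiles(5)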

Best Practices

  1. Drop early: Remove unneeded variables and observations before processing
  2. Compress often: After creating variables, compress the dataset
  3. Use efficient types: Store binary variables as byte, not float
  4. Process in chunks: For very large files, consider splitting by region (see the sketch after this list)
  5. Save intermediate files: Don’t re-run expensive operations unnecessarily
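
To make practices 3 and 4 concrete, here is a minimal sketch; the file name hies_full.dta and the assumption that region is coded 1 through 8 are both illustrative:

* Practice 3: create indicators as byte instead of the default float
generate byte is_urban = (urban == 1)

* Practice 4: process one region at a time and save intermediate files
forvalues r = 1/8 {
    use if region == `r' using "hies_full.dta", clear
    * ... expensive processing for this region goes here ...
    compress
    save "processed_r`r'.dta", replace
}

* Recombine the processed chunks
clear
forvalues r = 1/8 {
    append using "processed_r`r'.dta"
}
save "processed_all.dta", replace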

These techniques turn hour-long jobs into minutes, making large-scale analysis practical even on modest computers.