This is about how I reproduced Yeh et al.'s research, "Using publicly available satellite imagery and deep learning to understand economic well-being in Africa", for my Masters.
I am currently doing my Masters by research in Electrical and Information Engineering at the University of the Witwatersrand. I am fortunate to be supervised by Professor Ken Nixon and Dr Martin Be....
GitHub
The repository for Yeh et al.'s research is publicly available on GitHub here.
Environment
- Conda
- Cluster specs
- Gcloud
- https://cloud.google.com/sdk/docs/install-sdk?hl=en-GB#linux
- Switching to Mamba
- mamba create -n africa_poverty_clean python=3.7
- Mamba with upgraded packages worked
- Creating symlink for data1 to data
Connecting Jupyter Notebook
- Run
jupyter-notebook --no-browser --port=8889 --ip=0.0.0.0
on the correct node
- On local PC, run
ssh -N -f -L 8889:hornet01:8889 emily@jaguar1.eie.wits.ac.za
- Go to the relevant URL provided in the Jupyter notebook output
- I reverted to the original env.yml
- I had to install packages one by one
- Also still using Mamba because it seems to work faster
- Installing just plain TensorFlow for now (not with GPU support)
02/01/2023
Installing Mamba
- Download installer for Mambaforge https://github.com/conda-forge/miniforge#mambaforge
- Run
zsh <Mambaforge shell script>
- Restart shell
Init env
mamba env create -f environment.yml
Installing Gcloud
- https://cloud.google.com/sdk/docs/install-sdk?hl=en-GB#deb
Download images to relevant directory
- Run
gsutil -m cp -r \
  "gs://yeh-et-al/dhs_tfrecords_raw" \
  "gs://yeh-et-al/dhsnl_tfrecords_raw" \
  "gs://yeh-et-al/lsms_tfrecords_raw" \
  .
05/01/2023
- Redownloading all images to google drive
- Re-do DHSNL for:
  - ~~Burkina Faso~~
  - ~~Ethiopia~~
  - Kenya
  - ~~Mali~~
  - Nigeria
  - Togo
  - Zimbabwe
06/01/2023
- Modifying chunk size only for these countries
- Modifying the dhsnl code to check if the country is in the re-do list, then dispatch tasks (sketch below)
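A minimal sketch of that check-and-dispatch logic (the CSV path and the export_chunk helper are assumptions, not the repo's actual names):

```python
# Sketch: only re-dispatch Earth Engine export tasks for the failed surveys.
import pandas as pd

REDO = {
    ('burkina_faso', 2016), ('ethiopia', 2016), ('kenya', 2016),
    ('mali', 2016), ('nigeria', 2016), ('togo', 2016),
    ('mali', 2015), ('zimbabwe', 2016),
}
CHUNK_SIZE = 25  # retried later with 10, 5, 3, and finally 2

df = pd.read_csv('data/dhsnl_clusters.csv')  # assumed path
for (country, year), group in df.groupby(['country', 'year']):
    if (country, year) not in REDO:
        continue  # skip surveys that already exported cleanly
    for start in range(0, len(group), CHUNK_SIZE):
        chunk = group.iloc[start:start + CHUNK_SIZE]
        export_chunk(chunk, country, year)  # hypothetical: starts one EE task
```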
07/01/2023
- Re-do with chunk size 25
- Burkina Faso 2016
- Ethiopia 2016
- Kenya 2016
- Mali 2016
- Nigeria 2016
- Togo 2016
- Mali 2015
- Zimbabwe 2016
- Re-do with chunk size 10
- Burkina Faso 2016
- Ethiopia 2016
- Kenya 2016
- Mali 2016
- Nigeria 2016
- Togo 2016
- Mali 2015
- Zimbabwe 2016
08/01/2023
Success Yesterday
- ~~Mali 2015~~
Failed
- Burkina Faso 2016
- Ethiopia 2016
- Kenya 2016
- Mali 2016
- Nigeria 2016
- Togo 2016
- Zimbabwe 2016
Next Attempt Chunking 5
- ~~Burkina Faso 2016~~
- Ethiopia 2016
- 604, 605, 606
- Kenya 2016
- 504, 506
- Mali 2016
- 335-338
- Nigeria 2016
- 4354
- ~~Togo 2016~~
- ~~Zimbabwe 2016~~
Chunking 3
- Ethiopia 2016
- 1006, 1007, 1009, 1010
- ~~Kenya 2016~~
- Mali 2016
- 560 - 564
- Nigeria 2016
- 7256
09/01/2023
- Going to try chunk size 2 for Ethiopia, Mali, Nigeria
- ~~Ethiopia 2016~~
- ~~Mali 2016~~
- ~~Nigeria 2016~~
Now trying to download all the data xD
DHSNL
- Had to download from Google Drive in chunks
- Then unzipped everything
14/01/2023
- Symlink between the hard drive and the code base for each tfrecords_raw and tfrecords folder:
ln -s source_folder/<DHS/DHSNL/LSMS>_tfrecords... <REPO_DIR>/data/my_folder
- The link name must not already exist in the repo's data folder
- Had to install ipywidgets
- Then running preprocessing_1
- Processing dataset for verification
- Currently on DHS
- Issue here is that if one of the countries fails, everything has to be redone
- This could be an easy fix (see the processed-flag sketch under 15/01/2023)
- Ethiopia 2016 is an issue
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_68707/967162135.py in <module>
      2     csv_path='data/dhs_clusters.csv',
      3     input_dir=DHS_EXPORT_FOLDER,
----> 4     processed_dir=DHS_PROCESSED_FOLDER)

/tmp/ipykernel_68707/4000888777.py in process_dataset(csv_path, input_dir, processed_dir)
     18     subset_df = df[(df['country'] == country) & (df['year'] == year)].reset_index(drop=True)
     19     validate_and_split_tfrecords(
---> 20         tfrecord_paths=tfrecord_paths, out_dir=out_dir, df=subset_df)
     21
     22

/tmp/ipykernel_68707/4000888777.py in validate_and_split_tfrecords(tfrecord_paths, out_dir, df)
     68     ft_type = feature_map[col].WhichOneof('kind')
     69     ex_val = feature_map[col].__getattribute__(ft_type).value[0]
---> 70     assert val == ex_val, f'Expected {col}={val}, but found {ex_val} instead'
     71
     72     # serialize to string and write to file

AssertionError: Expected lat=9.505470275878906, but found 9.312466621398926 instead
15/01/2023
- Resolved the above by re-downloading Ethiopia 2016
- Discovered that there are a few files that have not been exported from Earth Engine
- Hoping it's minimal xD Will keep a list below
- Tried to add a processed flag to each CSV so that if something fails, not all the countries have to be re-run (sketch below)
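A rough sketch of the processed-flag idea (the validate_and_split_for wrapper is hypothetical):

```python
# Persist a per-(country, year) 'processed' flag in the survey CSV so one
# failing country doesn't force re-running all of them.
import pandas as pd

csv_path = 'data/dhs_clusters.csv'
df = pd.read_csv(csv_path)
if 'processed' not in df.columns:
    df['processed'] = False

for (country, year), subset in df.groupby(['country', 'year']):
    if df.loc[subset.index, 'processed'].all():
        continue  # validated on a previous run
    try:
        validate_and_split_for(country, year)  # hypothetical wrapper
        df.loc[subset.index, 'processed'] = True
        df.to_csv(csv_path, index=False)  # save progress after each success
    except AssertionError as err:
        print(f'{country} {year} failed: {err}')
```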
- Re-export from Earth Engine
- DHS
- ethiopia_2016
- LSMS
- ethiopia_2015
- DHSNL
- ~~angola_2015~~
- ~~ghana_2014~~
- ~~tanzania_2016~~
- ~~uganda_2009~~
- ~~uganda_2010~~
- ~~uganda_2013~~
- zambia_2016
- 🐞 But noticed an issue: files aren't ordered before being processed!
- Installing natsort to help with natural sorting of file names in the tfrecords_raw folders (example below)
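Roughly how natsort fixes this, assuming the raw TFRecord files are numbered:

```python
# glob returns files in arbitrary order, and plain sorted() is lexicographic
# ('10.tfrecord.gz' sorts before '2.tfrecord.gz'); natsorted fixes both.
from glob import glob
from natsort import natsorted

paths = natsorted(glob('data/dhs_tfrecords_raw/ethiopia_2016/*.tfrecord.gz'))
```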
Also
- Added nbdev pre-commit hook
- Needed a settings.ini file in the root, which can be generated by
nbdev_create_config
- And this doesn't work in GitKraken because nbdev is not found (although nbdev is found elsewhere)
16/01/2023
- All files have now been downloaded successfully. I have copied the Zimbabwe files to the cluster
- Going to pull down the repo on the cluster and do a reinstall of packages
- Next step
- Run mamba env create in the repo again
- Started doing an scp across for the folders, but it's quite slow
- Going to rather upload all the data to cloud storage
- Then it can be downloaded on the cluster
05/02/2023
- All images are now on the cluster
Preprocessing Step 2
"Validate and Split Exported TFRecords"
- This means that (a minimal sketch follows this list):
- For each survey
- For each country and year in the survey
- Get array of all TFRecord paths for the country and year
- Get all survey records for that country and year
- For each TFRecord for the country and year
- For each record inside the tfrecord:
- parse into an actual Example message - a type of tensorflow data structure
- verify required bands exist
- compare feature map values against CSV values
- Serialize record
- Write record to standalone tfrecord file
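A minimal sketch of that loop, assuming TF 1.x APIs (the band list and lat check are illustrative; the notebook checks more feature columns):

```python
import tensorflow as tf

REQUIRED_BANDS = ['BLUE', 'GREEN', 'RED', 'NIR', 'SWIR1', 'SWIR2']  # illustrative subset

def validate_and_split(tfrecord_paths, df, out_dir):
    i = 0  # row index into the survey CSV, which is why file ordering matters
    options = tf.io.TFRecordOptions(tf.io.TFRecordCompressionType.GZIP)
    for path in tfrecord_paths:
        for record in tf.io.tf_record_iterator(path, options=options):
            ex = tf.train.Example.FromString(record)  # parse the Example message
            feature_map = ex.features.feature
            for band in REQUIRED_BANDS:  # verify required bands exist
                assert band in feature_map, f'{path}: missing {band}'
            lat = feature_map['lat'].float_list.value[0]
            assert abs(lat - df.loc[i, 'lat']) < 1e-6  # compare against the CSV
            # serialize and write each record to its own standalone file
            with tf.io.TFRecordWriter(f'{out_dir}/{i:05d}.tfrecord.gz',
                                      options=options) as writer:
                writer.write(ex.SerializeToString())
            i += 1
```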
Note on Data
- DHS clusters
- country
- year
- lat
- lon
- GID_1
- GID_2
- wealthpooled
- households
- urban_rural
- DHSNL clusters
- country
- year
- lat
- lon
- LSMS clusters
- country
- year
- lat
- lon
- geolev1
- geolev2
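A quick way to confirm these columns (data/dhs_clusters.csv appears in the traceback above; the other two filenames are my guess):

```python
import pandas as pd

# print the column names of each survey CSV
for name in ('dhs_clusters', 'dhsnl_clusters', 'lsms_clusters'):
    print(name, list(pd.read_csv(f'data/{name}.csv').columns))
```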
Preprocessing Step 3
"The point of this notebook is to create the in-country splits."
06/02/2023
- All preprocessing run
- Cannot run the model analysis notebooks by default because the trained-model outputs do not exist locally; trying to download them from the original repo
- This does have issues because files are named differently, etc.
- Trying to train as well
Traceback (most recent call last):
  File "train_direct.py", line 40, in <module>
    from utils.trainer import RegressionTrainer
  File "/home/esteyn/projects/africa_poverty_clean/utils/trainer.py", line 22, in <module>
    tuple[tf.Tensor, tf.Tensor, tf.Tensor, Optional[tf.Tensor]]]
TypeError: 'type' object is not subscriptable
- Managed to get past the error above (see the snippet below):
- Replaced tuple with Tuple from typing
- Used Callable and Mapping from typing (instead of ABCMeta)
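The gist of the fix, as illustrative aliases (the environment is pinned to Python 3.7, where built-in generics cannot be subscripted):

```python
from typing import Callable, Mapping, Optional, Tuple
import tensorflow as tf

# On Python 3.7, tuple[...] raises "TypeError: 'type' object is not
# subscriptable"; the typing-module aliases work on all supported versions.
BatchTriple = Tuple[tf.Tensor, tf.Tensor, tf.Tensor, Optional[tf.Tensor]]
LossFn = Callable[[tf.Tensor, tf.Tensor], tf.Tensor]
MetricMap = Mapping[str, tf.Tensor]
```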
- 🤓 Switched to tmux
- Direct train completed
Actions
- Max activations notebook
- Model analysis notebooks
#💡 When applying Yeh et al.: apply with their model first, then train on South Africa and apply that