This is about how I reproduced Yeh et al.'s research, "Using publicly available satellite imagery and deep learning to understand economic well-being in Africa", for my Masters.
I am currently doing my Masters by research in Electrical and Information Engineering at the University of the Witwatersrand. I am fortunate to be supervised by Professor Ken Nixon and Dr Martin Be....
GitHub
The repository for Yeh et al.'s research is publicly available on GitHub here.
Environment
- Conda
- Cluster specs
- Gcloud
- https://cloud.google.com/sdk/docs/install-sdk?hl=en-GB#linux
- Switching to Mamba
- mamba create -n africa_poverty_clean python=3.7
- Mamba with upgraded packages worked
- Creating symlink for data1 to data
Connecting Jupyter Notebook
- Run
jupyter-notebook --no-browser --port=8889 --ip=0.0.0.0
on the correct node
- On local PC, run
ssh -N -f -L 8889:hornet01:8889 emily@jaguar1.eie.wits.ac.za
- Go to the relevant URL provided in the Jupyter notebook output
- I reverted to the original env.yml
- I had to install packages one by one
- Also still using Mamba because it seems to work faster
- Installing just plain TensorFlow for now (not with GPU support)
02/01/2023
Installing Mamba
- Download installer for Mambaforge https://github.com/conda-forge/miniforge#mambaforge
- Run
zsh <Mambaforge shell script>
- Restart shell
Init env
mamba env create -f environment.yml
Installing Gcloud
- https://cloud.google.com/sdk/docs/install-sdk?hl=en-GB#deb
Download images to relevant directory
- Run
gsutil -m cp -r \
  "gs://yeh-et-al/dhs_tfrecords_raw" \
  "gs://yeh-et-al/dhsnl_tfrecords_raw" \
  "gs://yeh-et-al/lsms_tfrecords_raw" \
  .
05/01/2023
- Redownloading all images to google drive
- Re-do DHSNL for:
  - ~~Burkina Faso~~
  - ~~Ethiopia~~
  - Kenya
  - ~~Mali~~
  - Nigeria
  - Togo
  - Zimbabwe
06/01/2023
- Modifying chunk size only for these countries
- Modifying the dhsnl code to check if the country is in the re-do list, then dispatch tasks (sketch below)
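A minimal sketch of that check-and-dispatch logic (the CSV path and the export_chunk helper are assumptions, not the repo's actual names):

```python
# Sketch: only re-dispatch Earth Engine export tasks for the failed surveys.
import pandas as pd

REDO = {
    ('burkina_faso', 2016), ('ethiopia', 2016), ('kenya', 2016),
    ('mali', 2016), ('nigeria', 2016), ('togo', 2016),
    ('mali', 2015), ('zimbabwe', 2016),
}
CHUNK_SIZE = 25  # retried later with 10, 5, 3, and finally 2

df = pd.read_csv('data/dhsnl_clusters.csv')  # assumed path
for (country, year), group in df.groupby(['country', 'year']):
    if (country, year) not in REDO:
        continue  # skip surveys that already exported cleanly
    for start in range(0, len(group), CHUNK_SIZE):
        chunk = group.iloc[start:start + CHUNK_SIZE]
        export_chunk(chunk, country, year)  # hypothetical: starts one EE task
```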
07/01/2023
- Re-do with chunk size 25
- Burkina Faso 2016
- Ethiopia 2016
- Kenya 2016
- Mali 2016
- Nigeria 2016
- Togo 2016
- Mali 2015
- Zimbabwe 2016
- Re-do with chunk size 10
- Burkina Faso 2016
- Ethiopia 2016
- Kenya 2016
- Mali 2016
- Nigeria 2016
- Togo 2016
- Mali 2015
- Zimbabwe 2016
08/01/2023
Success Yesterday
- ~~Mali 2015~~
Failed
- Burkina Faso 2016
- Ethiopia 2016
- Kenya 2016
- Mali 2016
- Nigeria 2016
- Togo 2016
- Zimbabwe 2016
Next Attempt Chunking 5
- ~~Burkina Faso 2016~~
- Ethiopia 2016
- 604, 605, 606
- Kenya 2016
- 504, 506
- Mali 2016
- 335-338
- Nigeria 2016
- 4354
- ~~Togo 2016~~
- ~~Zimbabwe 2016~~
Chunking 3
- Ethiopia 2016
- 1006, 1007, 1009, 1010
- ~~Kenya 2016~~
- Mali 2016
- 560 - 564
- Nigeria 2016
- 7256
09/01/2023
- Going to try chunk size 2 for Ethiopia, Mali, Nigeria
- ~~Ethiopia 2016~~
- ~~Mali 2016~~
- ~~Nigeria 2016~~
Now trying to download all the data xD
DHSNL
- Had to download from Google Drive in chunks
- Then unzipped everything
14/01/2023
- Symlink between the hard drive and the code base for each tfrecords_raw and tfrecords folder:
ln -s source_folder/<DHS/DHSNL/LSMS>_tfrecords... <REPO_DIR>/data/my_folder
- The link name must not already exist in the repo's data folder
- Had to install ipywidgets
- Then running preprocessing_1
- Processing dataset for verification
- Currently on DHS
- Issue here is that if one of the countries fails, everything has to be redone
- This could be an easy fix (see the processed-flag sketch under 15/01/2023)
- Ethiopia 2016 is an issue
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_68707/967162135.py in <module>
      2     csv_path='data/dhs_clusters.csv',
      3     input_dir=DHS_EXPORT_FOLDER,
----> 4     processed_dir=DHS_PROCESSED_FOLDER)

/tmp/ipykernel_68707/4000888777.py in process_dataset(csv_path, input_dir, processed_dir)
     18     subset_df = df[(df['country'] == country) & (df['year'] == year)].reset_index(drop=True)
     19     validate_and_split_tfrecords(
---> 20         tfrecord_paths=tfrecord_paths, out_dir=out_dir, df=subset_df)
     21
     22

/tmp/ipykernel_68707/4000888777.py in validate_and_split_tfrecords(tfrecord_paths, out_dir, df)
     68     ft_type = feature_map[col].WhichOneof('kind')
     69     ex_val = feature_map[col].__getattribute__(ft_type).value[0]
---> 70     assert val == ex_val, f'Expected {col}={val}, but found {ex_val} instead'
     71
     72     # serialize to string and write to file

AssertionError: Expected lat=9.505470275878906, but found 9.312466621398926 instead
15/01/2023
- Resolved the above by re-downloading Ethiopia 2016
- Discovered that there are a few files that have not been exported from Earth Engine
- Hoping it's minimal xD Will keep a list below
- Tried to add a processed flag to each CSV so that if something fails, not all the countries have to be re-run (sketch below)
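A rough sketch of the processed-flag idea (the validate_and_split_for wrapper is hypothetical):

```python
# Persist a per-(country, year) 'processed' flag in the survey CSV so one
# failing country doesn't force re-running all of them.
import pandas as pd

csv_path = 'data/dhs_clusters.csv'
df = pd.read_csv(csv_path)
if 'processed' not in df.columns:
    df['processed'] = False

for (country, year), subset in df.groupby(['country', 'year']):
    if df.loc[subset.index, 'processed'].all():
        continue  # validated on a previous run
    try:
        validate_and_split_for(country, year)  # hypothetical wrapper
        df.loc[subset.index, 'processed'] = True
        df.to_csv(csv_path, index=False)  # save progress after each success
    except AssertionError as err:
        print(f'{country} {year} failed: {err}')
```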
- Re-export from Earth Engine
- DHS
- ethiopia_2016
- LSMS
- ethiopia_2015
- DHSNL
- ~~angola_2015~~
- ~~ghana_2014~~
- ~~tanzania_2016~~
- ~~uganda_2009~~
- ~~uganda_2010~~
- ~~uganda_2013~~
- zambia_2016
- 🐞 But noticed an issue: files aren't ordered before being processed!
- Installing natsort to help with natural sorting of file names in the tfrecords_raw folders (example below)
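Roughly how natsort fixes this, assuming the raw TFRecord files are numbered:

```python
# glob returns files in arbitrary order, and plain sorted() is lexicographic
# ('10.tfrecord.gz' sorts before '2.tfrecord.gz'); natsorted fixes both.
from glob import glob
from natsort import natsorted

paths = natsorted(glob('data/dhs_tfrecords_raw/ethiopia_2016/*.tfrecord.gz'))
```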
Also
- Added nbdev pre-commit hook
- Needed a settings.ini file in the root, which can be generated by
nbdev_create_config
- And this doesn't work in GitKraken because nbdev is not found (although nbdev is found elsewhere)
16/01/2023
- All files have now been downloaded successfully. I have copied the Zimbabwe files to the cluster
- Going to pull down the repo on the cluster and do a reinstall of packages
- Next step
- Run mamba env create in the repo again
- Started doing an scp across for the folders, but it's quite slow
- Going to rather upload all the data to cloud storage
- Then it can be downloaded on the cluster
05/02/2023
- All images are now on the cluster
Preprocessing Step 2
"Validate and Split Exported TFRecords"
- This means that (a minimal sketch follows this list):
- For each survey
- For each country and year in the survey
- Get array of all TFRecord paths for the country and year
- Get all survey records for that country and year
- For each TFRecord for the country and year
- For each record inside the tfrecord:
- parse into an actual Example message - a type of tensorflow data structure
- verify required bands exist
- compare feature map values against CSV values
- Serialize record
- Write record to standalone tfrecord file
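A minimal sketch of that loop, assuming TF 1.x APIs (the band list and lat check are illustrative; the notebook checks more feature columns):

```python
import tensorflow as tf

REQUIRED_BANDS = ['BLUE', 'GREEN', 'RED', 'NIR', 'SWIR1', 'SWIR2']  # illustrative subset

def validate_and_split(tfrecord_paths, df, out_dir):
    i = 0  # row index into the survey CSV, which is why file ordering matters
    options = tf.io.TFRecordOptions(tf.io.TFRecordCompressionType.GZIP)
    for path in tfrecord_paths:
        for record in tf.io.tf_record_iterator(path, options=options):
            ex = tf.train.Example.FromString(record)  # parse the Example message
            feature_map = ex.features.feature
            for band in REQUIRED_BANDS:  # verify required bands exist
                assert band in feature_map, f'{path}: missing {band}'
            lat = feature_map['lat'].float_list.value[0]
            assert abs(lat - df.loc[i, 'lat']) < 1e-6  # compare against the CSV
            # serialize and write each record to its own standalone file
            with tf.io.TFRecordWriter(f'{out_dir}/{i:05d}.tfrecord.gz',
                                      options=options) as writer:
                writer.write(ex.SerializeToString())
            i += 1
```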
Note on Data
- DHS clusters
- country
- year
- lat
- lon
- GID_1
- GID_2
- wealthpooled
- households
- urban_rural
- DHSNL clusters
- country
- year
- lat
- lon
- LSMS clusters
- country
- year
- lat
- lon
- geolev1
- geolev2
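A quick way to confirm these columns (data/dhs_clusters.csv appears in the traceback above; the other two filenames are my guess):

```python
import pandas as pd

# print the column names of each survey CSV
for name in ('dhs_clusters', 'dhsnl_clusters', 'lsms_clusters'):
    print(name, list(pd.read_csv(f'data/{name}.csv').columns))
```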
Preprocessing Step 3
"The point of this notebook is to create the in-country splits."
06/02/2023
- All preprocessing run
- Cannot run the model analysis notebooks by default because the trained-model outputs do not exist locally; trying to download them from the original repo
- This does have issues because files are named differently, etc.
- Trying to train as well
Traceback (most recent call last):
  File "train_direct.py", line 40, in <module>
    from utils.trainer import RegressionTrainer
  File "/home/esteyn/projects/africa_poverty_clean/utils/trainer.py", line 22, in <module>
    tuple[tf.Tensor, tf.Tensor, tf.Tensor, Optional[tf.Tensor]]]
TypeError: 'type' object is not subscriptable
- Managed to get past the error above (see the snippet below):
- Replaced tuple with Tuple from typing
- Used Callable and Mapping from typing (instead of ABCMeta)
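The gist of the fix, as illustrative aliases (the environment is pinned to Python 3.7, where built-in generics cannot be subscripted):

```python
from typing import Callable, Mapping, Optional, Tuple
import tensorflow as tf

# On Python 3.7, tuple[...] raises "TypeError: 'type' object is not
# subscriptable"; the typing-module aliases work on all supported versions.
BatchTriple = Tuple[tf.Tensor, tf.Tensor, tf.Tensor, Optional[tf.Tensor]]
LossFn = Callable[[tf.Tensor, tf.Tensor], tf.Tensor]
MetricMap = Mapping[str, tf.Tensor]
```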
- 🤓 Switched to tmux
- Direct train completed
Actions
- Max activations notebook
- Model analysis notebooks
#💡 When applying Yeh et al.: apply with their model first, then train on South Africa and apply that