Reproducing Yeh et al.

29 November 2022

This is about how I reproduced Yeh et al.'s research, "Using publicly available satellite imagery and deep learning to understand economic well-being in Africa", for my master's by research in electrical and information engineering at the University of the Witwatersrand.

GitHub

The repository for Yeh et al.'s research, africa_poverty_clean, is publicly available on GitHub (https://github.com/sustainlab-group/africa_poverty_clean).

Environment

  • Conda
  • Cluster specs
  • Gcloud
    • https://cloud.google.com/sdk/docs/install-sdk?hl=en-GB#linux
  • Switching to Mamba
    • mamba create -n africa_poverty_clean python=3.7
  • Mamba with upgraded packages worked

  • Creating a symlink from data1 to data

Connecting Jupyter Notebook

  • Run jupyter-notebook --no-browser --port=8889 --ip=0.0.0.0 on the correct node
  • On the local PC, run ssh -N -f -L 8889:hornet01:8889 emily@jaguar1.eie.wits.ac.za (this forwards local port 8889 to the notebook on hornet01 via jaguar1)
  • Go to the relevant URL printed in the Jupyter notebook output

  • I reverted to the original env.yml
  • I had to install packages one by one
  • Also still using mamba because it seems to work faster

  • Installing just plain tensorflow for now (not with GPU support)

02/01/2023

Installing Mamba

  • Download installer for Mambaforge https://github.com/conda-forge/miniforge#mambaforge
  • Run zsh <Mambaforge shell script>
  • Restart shell

Init env

  • Run mamba env create -f environment.yml

Installing Gcloud

  • https://cloud.google.com/sdk/docs/install-sdk?hl=en-GB#deb

Download images to the relevant directory

  • Run
    gsutil -m cp -r \
    "gs://yeh-et-al/dhs_tfrecords_raw" \
    "gs://yeh-et-al/dhsnl_tfrecords_raw" \
    "gs://yeh-et-al/lsms_tfrecords_raw" \
    .
    

05/01/2023

  • Redownloading all images to Google Drive
    • Re-do DHSNL for:
      • ~~Burkina Faso
      • ~~Ethiopia
      • Kenya
      • ~~Mali
      • Nigeria
      • Togo
      • Zimbabwe

06/01/2023

  • Modifying the chunk size only for these countries
  • Modifying the DHSNL code to check whether a country is in the re-do list before dispatching tasks (see the sketch below)
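A minimal sketch of that check, assuming the export script groups the DHSNL clusters by country and year; REDO_COUNTRIES, the CSV path, and dispatch_export_tasks are hypothetical names, not the repo's actual code:

    import pandas as pd

    # Only re-dispatch Earth Engine export tasks for countries whose
    # previous exports failed. Names here are hypothetical.
    REDO_COUNTRIES = {'burkina_faso', 'ethiopia', 'kenya', 'mali',
                      'nigeria', 'togo', 'zimbabwe'}

    df = pd.read_csv('data/dhsnl_clusters.csv')  # columns: country, year, lat, lon
    for (country, year), group in df.groupby(['country', 'year']):
        if country not in REDO_COUNTRIES:
            continue  # this country exported fine the first time
        dispatch_export_tasks(group, country, year)  # assumed export helper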

07/01/2023

  • Re-do with chunk size 25
    • Burkina Faso 2016
    • Ethiopia 2016
    • Kenya 2016
    • Mali 2016
    • Nigeria 2016
    • Togo 2016
    • Mali 2015
    • Zimbabwe 2016
  • Re-do with chunk size 10
    • Burkina Faso 2016
    • Ethiopia 2016
    • Kenya 2016
    • Mali 2016
    • Nigeria 2016
    • Togo 2016
    • Mali 2015
    • Zimbabwe 2016

08/01/2023

Success Yesterday

  • ~~Mali 2015

Failed

  • Burkina Faso 2016
  • Ethiopia 2016
  • Kenya 2016
  • Mali 2016
  • Nigeria 2016
  • Togo 2016
  • Zimbabwe 2016

Next Attempt: Chunk Size 5

  • ~~Burkina Faso 2016
  • Ethiopia 2016
    • 604, 605, 606
  • Kenya 2016
    • 504, 506
  • Mali 2016
    • 335-338
  • Nigeria 2016
    • 4354
  • ~~Togo 2016
  • ~~Zimbabwe 2016

Chunk Size 3

  • Ethiopia 2016
    • 1006, 1007, 1009, 1010
  • ~~Kenya 2016
  • Mali 2016
    • 560-564
  • Nigeria 2016
    • 7256

09/01/2023

  • Going to try chunk size 2 for Ethiopia, Mali, Nigeria

  • ~~Ethiopia 2016
  • ~~Mali 2016
  • ~~Nigeria 2016

Now trying to download all the data xD

DHSNL

  • Had to download from Google Drive in chunks
  • Then unzipped everything

14/01/2023

  • Symlink between the hard drive and the code base for each tfrecords_raw and tfrecords folder
    • ln -s source_folder/<DHS/DHSNL/LSMS>_tfrecords... <REPO_DIR>/data/my_folder
    • The link name cannot already exist in the repo's data folder, otherwise ln creates the link inside that existing directory
  • Had to install ipywidgets
  • Then running preprocessing_1

  • Processing dataset for verification
    • Currently on DHS
    • Issue here is that if one of the countries fails, everything has to be redone
      • This could be an easy fix
    • Ethiopia 2016 is an issue
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_68707/967162135.py in <module>
      2     csv_path='data/dhs_clusters.csv',
      3     input_dir=DHS_EXPORT_FOLDER,
----> 4     processed_dir=DHS_PROCESSED_FOLDER)
/tmp/ipykernel_68707/4000888777.py in process_dataset(csv_path, input_dir, processed_dir)
     18         subset_df = df[(df['country'] == country) & (df['year'] == year)].reset_index(drop=True)
     19         validate_and_split_tfrecords(
---> 20             tfrecord_paths=tfrecord_paths, out_dir=out_dir, df=subset_df)
     21
     22

/tmp/ipykernel_68707/4000888777.py in validate_and_split_tfrecords(tfrecord_paths, out_dir, df)
     68                 ft_type = feature_map[col].WhichOneof('kind')
     69                 ex_val = feature_map[col].__getattribute__(ft_type).value[0]
---> 70                 assert val == ex_val, f'Expected {col}={val}, but found {ex_val} instead'
     71 
     72             # serialize to string and write to file

AssertionError: Expected lat=9.505470275878906, but found 9.312466621398926 instead

15/01/2023

  • Resolved the above by re-downloading Ethiopia 2016
    • Discovered that there are a few files that have not been exported from Earth Engine
      • Hoping it's minimal xD Will keep a list below
  • Tried to add a processed flag to each CSV so that if something fails, not all the countries have to be re-run (see the sketch below).
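A minimal sketch of the processed-flag idea; the 'processed' column name and process_country_year are hypothetical, not the repo's code:

    import pandas as pd

    # Persist a 'processed' flag per (country, year) so a failed run can
    # resume where it stopped instead of redoing every country.
    csv_path = 'data/dhs_clusters.csv'
    df = pd.read_csv(csv_path)
    if 'processed' not in df.columns:
        df['processed'] = False

    for (country, year), group in df.groupby(['country', 'year']):
        if group['processed'].all():
            continue  # this pair already succeeded in a previous run
        process_country_year(group, country, year)  # assumed helper
        df.loc[group.index, 'processed'] = True
        df.to_csv(csv_path, index=False)  # checkpoint after each pair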

  • Re-export from Earth Engine
    • DHS
      • ethiopia_2016
    • LSMS
      • ethiopia_2015
    • DHSNL
      • ~~angola_2015
      • ~~ghana_2014
      • ~~tanzania_2016
      • ~~uganda_2009
      • ~~uganda_2010
      • ~~uganda_2013
      • zambia_2016
  • 🐞 But noticed an issue: files aren't sorted before being processed!
    • Installing natsort to help with natural sorting of file names in the tfrecords_raw folders (see the sketch below)
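The fix, roughly (the folder path is illustrative):

    from glob import glob
    from natsort import natsorted

    # Lexicographic order puts e.g. 'ethiopia_2016_10' before
    # 'ethiopia_2016_2', so records get validated against the wrong CSV
    # rows. Natural sorting restores the intended numeric order.
    paths = natsorted(glob('data/dhs_tfrecords_raw/ethiopia_2016*'))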

Also

  • Added nbdev pre-commit hook
    • Needed a settings.ini file in root which can be generated by nbdev_create_config
    • And this doesn't work in GitKraken because nbdev is not found (although nbdev is found elsewhere)

16/01/2023

  • All files have now been downloaded successfully. I have copied the Zimbabwe files to the cluster
  • Going to pull down the repo on the cluster and reinstall the packages
  • Next step
    • Run mamba env create in the repo again
  • Started an scp across for the folders, but it's quite slow
  • Going to rather upload all the data to cloud storage
  • Then it can be downloaded on the cluster

05/02/2023

  • So all images are on the cluster

Preprocessing Step 2

"Validate and Split Exported TFRecords"

  • This means that (a sketch follows this list):
    • For each survey
    • For each country and year in the survey
      • Get an array of all TFRecord paths for that country and year
      • Get all survey records (CSV rows) for that country and year
      • For each TFRecord for the country and year
        • For each record inside the TFRecord:
          • Parse it into an Example message (a TensorFlow protobuf data structure)
          • Verify that the required bands exist
          • Compare feature map values against the CSV values
          • Serialize the record
          • Write the record to a standalone TFRecord file
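A minimal sketch of the per-file loop, assuming the TensorFlow 1.x API of the Python 3.7 environment; the band list, exact-equality check, and output naming are illustrative, though the structure mirrors the traceback further up:

    import tensorflow as tf

    REQUIRED_BANDS = ['BLUE', 'GREEN', 'RED', 'NIR', 'SWIR1', 'SWIR2',
                      'TEMP1', 'NIGHTLIGHTS']

    def validate_and_split(tfrecord_path, df, out_dir):
        # assumes df rows are ordered to match the records in the file
        for i, record in enumerate(tf.python_io.tf_record_iterator(tfrecord_path)):
            ex = tf.train.Example.FromString(record)  # parse into an Example proto
            feature_map = ex.features.feature
            for band in REQUIRED_BANDS:
                assert band in feature_map, f'missing band: {band}'
            # compare scalar features (e.g. lat/lon) against the CSV row
            for col in ['lat', 'lon']:
                kind = feature_map[col].WhichOneof('kind')
                ex_val = getattr(feature_map[col], kind).value[0]
                assert df.loc[i, col] == ex_val, f'{col} mismatch at record {i}'
            # serialize and write the record to its own standalone file
            with tf.python_io.TFRecordWriter(f'{out_dir}/{i:05d}.tfrecord') as writer:
                writer.write(ex.SerializeToString())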

Note on Data

  • DHS clusters
    • country
    • year
    • lat
    • lon
    • GID_1
    • GID_2
    • wealthpooled
    • households
    • urban_rural
  • DHSNL Clusters
    • country
    • year
    • lat
    • lon
  • LSMS Clusters
    • country
    • year
    • lat
    • lon
    • geolev1
    • geolev2
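For quick reference, the cluster tables above can be loaded and inspected like this (data/dhs_clusters.csv is the path seen in the traceback earlier; the DHSNL and LSMS filenames are assumed to follow the same pattern):

    import pandas as pd

    # All three share country, year, lat, lon; the extra columns per
    # survey are listed above.
    dhs = pd.read_csv('data/dhs_clusters.csv')
    dhsnl = pd.read_csv('data/dhsnl_clusters.csv')
    lsms = pd.read_csv('data/lsms_clusters.csv')
    print(dhs.columns.tolist())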

Preprocessing Step 3

"The point of this notebook is to create the in-country splits."

06/02/2023

  • All preprocessing has been run
  • Cannot run the model analysis by default because the expected model outputs do not exist; trying to download them from the original repo
    • This has issues because files are named differently, etc.
  • Trying to train as well
    Traceback (most recent call last):
    File "train_direct.py", line 40, in <module>
      from utils.trainer import RegressionTrainer
    File "/home/esteyn/projects/africa_poverty_clean/utils/trainer.py", line 22, in <module>
      tuple[tf.Tensor, tf.Tensor, tf.Tensor, Optional[tf.Tensor]]]
    TypeError: 'type' object is not subscriptable
    
  • Managed to get past the error above (see the sketch after this list)
    • Replaced tuple with Tuple from typing
    • Used Callable and Mapping from typing (instead of the ABCMeta-based versions)
  • 🤓 Switched to tmux
  • Direct train completed
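The essence of that typing fix, as an illustrative alias rather than the repo's exact signature:

    from typing import Callable, Mapping, Optional, Tuple

    import tensorflow as tf

    # On Python 3.7, tuple[...] raises "TypeError: 'type' object is not
    # subscriptable" at import time; the typing generics work. The same
    # applies to Callable and Mapping. The alias name is hypothetical.
    BatchFn = Callable[
        [Mapping[str, tf.Tensor]],
        Tuple[tf.Tensor, tf.Tensor, tf.Tensor, Optional[tf.Tensor]]]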

Actions

  • Max activations notebook
  • Model analysis notebooks

#💡 - When applying Yeh et al.: first apply their model, then train on South Africa and apply that.