ABSTRACT
This tutorial provides guidance on creating a machine learning model to predict maize yields from satellite imagery and other earth observation data. We train on variety trial data collected by CIMMYT used in Lobell et al. (2011) Nature Climate Change.

CONTENTS
Overview

  • Yield Prediction Overview
  • Geospatial Data and Tools
  • Variety Trial Data
  • Earth Observation Data Sets
  • Landsat Multispectral Satellite Imagery
  • Additional EO Data
  • Shuttle Radar Topography Mission (SRTM) Data
  • Weather Data
  • Data Structures
  • GIS in Python and Google Earth Engine for Earth Observation Data Introduction to Data

Developing Crop Type Mapping Model

  • Data Ingestion and Pre-Processing
  • Generate Composites
  • Train and Tune Model
  • Train and Test Subsets
  • Hyperparameter Tuning
  • Excluding Plot Observations
  • Sample Size
  • Feature Pre-Selection
  • Survey Data
  • EO Data
  • Visualizing Geographic Areas
  • Calculating Plot Areas
  • Sentinel-2 Level-2A Imagery
  • Shuttle Radar Topography Mission Data
  • Climate Data
  • Index Time Series and Crop Phenology
  • Obtain Harmonic Coefficients
  • Obtaining Observed Weather

Generate Predicted Cultivation Maps

  • Apply Trained and Tuned Model to Full Area of Interest
  • Generate Predicted Yield Maps

OVERVIEW
Yield Prediction Overview
Smallholder farms are increasingly in the spotlight of development goals at a national and international level. As changing climates and growing populations increase food scarcity, creating sustainable food systems has become a major focus (Economist Impact Food Sustainability Index 2021). However, monitoring the progress of these farms has been a challenge, as researchers and policy makers have relied on surveys or manual labeling of imagery to identify the type and scope of farming operations (Azzari et al. 2021). There is an acute need for new approaches to monitoring smallholder agriculture systems, particularly in developing countries; these are the primary focus of development goals, yet they often lack the resources to facilitate consistent, widespread monitoring. \

The 50x2030 Initiative to Close the Agricultural Data Gap aims to empower and support 50 low and lower-middle income countries (L/LMICs) by 2030 to build strong national data systems that produce and use high-quality and timely agricultural data through survey programs. The activities supported by the 50x2030 Initiative are carried out under three Implementation Components, namely (1) Data Production, led by the Food and Agriculture Organization of the United Nations (FAO), in collaboration with the World Bank Development Data Group, the Poverty and Equity Global Practice and the Agriculture Global Practice; (2) Methods and Tools Development, led by the World Bank Development Data Group, (3) Data Use, led by the International Fund for Agricultural Development. \

Under the Methods and Tools Development Component (MTD Component), the 50x2030 Initiative supports research to produce new, better, and more cost-effective tools and methodologies for data collection and analysis in the context of agricultural and rural surveys. MTD Component has three major work areas: (1) integration of surveys, (2) integration of improved methods into survey data collection, and (3) integration of surveys with other data sources. Under this third work area, 50x2030-supported research activities are geared towards the development of methods for the integration of survey data with different data sources, including but not limited to, geospatial, administrative and census data, with an eye on enhancing the value of survey data in policy-relevant analysis and research. A strong emphasis will be placed on enabling surveys to feed into remote sensing applications that aim to produce actionable, high-resolution of key indicators at-scale - anchored in the demands voiced by national governments and international development partners for advancing remote sensing applications that can both inform agricultural decision-making and help monitor and understand progress towards the Sustainable Development Goal (SDG) 2, with a focus on SDG Target 2.3 and 2.4. \

Since georeferenced ground data creates the backbone required for accurate satellite-based outcome monitoring, the 50x2030 initiative is uniquely positioned to catalyze the implementation, and eventual use, of integrated satellite-survey applications that can rapidly generate maps of agricultural outcomes across expansive geographies in smallholder systems -- reaching far beyond the coverage of field surveys. Over the last decade, the research has focused on developing and testing algorithms to derive satellite-based yields in large-scale systems (Clevers 1997); (Lobell et al. 2005) and in smallholder systems (Burke and Lobell, 2017; Gourlay et al., 2019; Jain et al., 2016; Jin et al., 2017; Lambert et al., 2018; Lobell et al., 2019, 2020). \

However, progress has been slower-than-desired in implementing integrated satellite-survey applications that can monitor agricultural outcomes at-scale (i.e. for entire countries and across continents) in the smallholder production systems that characterize much of the agriculture sector in low- and low-middle-income countries. One the main hurdles has been the lack of knowledge regarding the required volume, methods, and content of georeferenced microdata that should be collected as part of surveys in order to inform remote sensing applications capable of fulfilling spatially-disaggregated estimation and reporting needs. \

Against this background, the MTD Component of the 50x2030 Initiative is supporting research to generate guidelines for designing and implementing large-scale surveys in a way that can generate the required data for training remote sensing models for high-resolution crop area and crop yield mapping in low- and lower-middle income countries (50x2030 n.d.). \

In line with the goals outlined above, this workshop builds on research conducted by Lobell et al. (2011) studying particular non-linear yield responses of maize grown in Africa to temperature. While prior research had demonstrated such a non-linear effect in maize yields in the United States (Schlenker and Roberts, 2009), evidence from Africa was limited by the availability of a suitably large collection of maize yield measurements. The study emphasized the importance of variety trial data as a source for expanding our understanding of agriculture in regions typically thought to be data-poor.

The approach to predicting yields taken in this workshop roughly follows that of Deines et al. (2021). In that study, point-level maize yield ground truth collected by combine harvesters was used to validate predictions based on crop phenology and weather. While their approach to training the model was slightly different, we will "use similar features to form our predictions

Geospatial Data and Tools
Data Structures
In geospatial data analysis, data can be classified into two categories: raster and vector data. A graphic comparison between raster and vector data can be found in the World Bank Nighttime Lights Tutorial module 2, section 1.

  • Raster data: Data stored in a raster format is arranged in a regular grid of cells, without storing the coordinates of each point (namely, a cell, or a pixel). The coordinates of the corner points and the spacing of the grid can be used to calculate (rather than to store) the coordinates of each location in the grid. Any given pixel in the grid stores one or more values (in one or more bands).
  • Vector data: Data in a vector format is stored such that the X and Y coordinates are stored for each point. Data can be represented, for example, as points, lines and polygons. A point has only one coordinate (X and Y), a line has two or more coordinates, and a polygon is essentially a line that closes on itself to enclose a region. Polygons are usually used to represent the area and perimeter of continuous geographic features. Vector data stores features in their original resolution, without aggregation. \

In this tutorial, we will use vector and raster data. Geospatial data in vector format are often stored in a shapefile, a popular format for storing vector data developed by ESRI. The shapefile format is actually composed of multiple individual files which make up the entire data. At a minimum, there will be 3 file types included with this geographic data (.shp, .shx, .dbf), but there are often other files included which store additional information. In order to be read and used as a whole, all file types must have the same name and be in the same folder. Because the structure of points, lines, and polygons are different, each shapefile can only contain one vector type (all points, all lines, or all polygons). You will not find a mixture of point, line, and polygon objects in a single shapefile, so in order to work with these different types in the same analysis, multiple shapefiles will need to be used and layered. For more details on shapefiles and file types, see this documentation. \

Raster data, on the other hand, is stored in Tagged Image File Format (TIFF or TIF). A GeoTIFF is a TIFF file that follows a specific standard for structuring meta-data. The meta-data stored in a TIFF is called a tif tag and GeoTIFFs often contain tags including spatial extent, coordinate reference system, resolution, and number of layers. \

More information and examples can be found in sections 2 & 3 of the Earth Analytics Python Course. \

GIS in Python and Google Earth Engine for Earth Observation Data
We'll be sourcing the EO data used in this process from Google Earth Engine and coding in this Colab notebook. Before proceeding you will need to have a personal Google Earth Engine account linked to a Gmail account. It may take a day or longer for your Google Earth Engine account to be granted access. \

Two of the primary packages we'll be using, Pandas and GeoPandas, must be installed according to their installation instructions: Pandas Installation and GeoPandas Installation. If you're on Windows, GeoPandas installation can occasionally be temperamental - using an environment, as in the World Bank Nighttime Lights Tutorial, can often circumvent any issues, but if you're still having problems, there are a number of guides online, such as this Practial Data Science guide or this Medium post by Nayane Maia, which provide installation help that may allow you to be more successful. Using Windows Subsystem for Linux (WSL) can also make use of tricky packages like GeoPandas easier.

INTRODUCTION TO DATA
Variety Trial Data
While the data used in measuring crop yields will differ by location and time period of interest for your use case, for the purposes of this module we will leverage the variety trial data collected by CIMMYT utilized by Lobell et al. in their study Nonlinear heat effects on African maize as evidenced by historical yield trials. These data were originally collected to test the yield responses of novel varieties to different managements and shocks, but thanks to their GPS coordinates, could be co-located with weather data for the purposes of the study. Similarly, we make use of these geo-referenced data by matching yield observations with Earth Observation data. \

The data can be downloaded from CIMMYT's Dataverse, a repository of data used in CIMMYT studies. CIMMYT, like many of the other CGIAR organizations, now has an open data policy and publishes data they collect whenever ethical and feasible. \

Earth Observation (EO) Data Sets
Landsat Multispectral Imagery
The method we will use to predict maize yields is based on Landsat multispectral imagery. The Landsat satellites have been generating Earth Observation data since the 1970s(!), though the instruments used for Earth Observation have changed over time. We will use data collected by the Thematic Mapper aboard Landsat 5 and the Enhanced Thematic Mapper + aboard Landsat 7. These instruments all collect multispectral image data: raster data containing information from many segments of the electromagnetic spectrum. These are used to enable vegetation, land cover, and environmental monitoring of all of Earth's land surfaces and coastal waters. \

We will be using Tier 1 Landsat imagery to make predictions of maize production. The Tier 1 product provides imagery which has been orthorectified to remove distortion caused by differing terrain and sensor angles when the images were collected. Orthorectification allows the images to be used to accurately measure and quantify land features and distances. For more information on orthorectification, see this USGS brief explanation or this more detailed description from the Satellite Imaging Corporation.

The Tier 1 product provides Surface Reflectance (SR) images, which are derived from Top Of Atmosphere (TOA) reflectances. TOA reflectance values represent all lights which are reflected back to the satellite from the earth or the atmosphere. For analyses like ours, which are interested in the land cover, the presence of atmospheric reflections can distort the interpretation of the image, as they are blended with the reflectances from earth. To remove this potential source of distortion, atmospheric corrections can be applied to reduce or remove the effect of reflectances from the atmosphere and only retain the light reflected from the ground. The SR product is the result of processing the TOA product to remove atmospheric reflectance, and thus is the preferred source for our investigation of crop type. \

Landsat imagery records values for multiple spectral bands, but in practice, rather than being used individually, these bands are often combined into indices. Spectral indices can combine multiple bands with complementary information into a single value which can provide more information on ground cover than the individual bands. Our analysis will use the Normalized Difference Vegetation Index (NDVI), which is a particular transformation of the Red and Near Infrared bands. \

Additional Earth Observation Data Sources
While the core components of this model are the satellite imagery and the variety trial data, Deines et al. also incorporate additional Earth Observation (EO) data sources into their predictive model. They selected features of the landscape and climate which are correlated with the selection of crop types for given plots of land. \

Topography Features: Elevation, Slope, and Aspect
In order to assess cropland suitability, the team utilized elevation, slope, and aspect (the direction of the slope) as proxies. These selections were made under the assumption that erosion and soil degradation are exacerbated by high slope and elevation, thus making these areas less likely to be strong agricultural sites. Elevation, slope, and aspect can be obtained from the Shuttle Radar Topography Mission (SRTM) datasource, which is available at a resolution of 30m. The Shuttle Radar Topography Mission, flown on the space shuttle Endeavour in February, 2000, was used by the National Aeronautics and Space Administration (NASA) and the National Geospatial-Intelligence Agency (NGA) to create near-global land elevation data. This was accomplished through the use of two antennas on the Endeavour, with the surface elevation calculated as the difference between these signals. \

Multiple versions of this SRTM elevation data have been created from the original data collection, with subsequent versions performing tasks such as identifying and correcting for water bodies and filling voids. These versions are made available in 1 and 3 arc-second resolutions, where 1 arc-second corresponds to roughly 30m and 3 arc-second corresponds to apprximately 90m. For this model, we'll leverage Version 3.0 (also known as SRTM Plus), which was developed by NASA's Jet Propulsion Laboratory (JPL) to fill the voids in the elevation data with non-commercial data sources ASTER GDEM2 (a joint project of NASA and components of the Japanese government), GMTED2010 (developed by NGA and the U.S. Geological Survey (USGS)), and NED (developed by USGS). For more information on the SRTM versions and methodology, see the SRTM User Guide. \

Weather Features: Temperature and Precipitation
Weather considerations are also an important component of cropland suitability, so these can also be incorporated into this crop classification model. Deines et al. included a number of hand-selected weather variables, such as average temperature, total precipitation, and growing degree days (GDD), with a growing degree day defined as a day where the mean temperature is greater than a base value that must be exceeded for a crop to grow. However, we will use a slightly different approach based on harmonic regression. \

Because weather stations are relatively sparse globally, it is often helpful or necessary to use an interpolated weather product of some kind. Here, we will use the ERA5 Land Reanalysis. This product combines information from weather stations around the world with a climate model to produce hourly estimates of a number of weather variables at a 0.1 arc-degree resolution. We will use a daily version of this product for our analysis.

DEVELOPING YIELD PREDICTION MODEL
To predict yields, we will train a random forest model on our Earth Observation features. A random forest model consists of an ensemble of decision trees (ie, a set of multiple, individual decision trees), where each decision tree utilizes a series of features to determine the optimal classification for given observations. For an approachable overview of Random Forest Models, see this Towards Data Science post by Tony Yiu, or for a more thorough overview see Chapter 8 of An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Rob Tibshirani, which can be downloaded for free here. Further information on Random Forest models can also be found in The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Rob Tibshirani, and Jerome Friedman, which can be downloaded free here, and an introduction to Random Forests including Python example code can be found in Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurelien Geron. \

To begin, in order for a random forest model to successfully predict yields, we need to provide it information (features) about each location we hope to predict at. We will need to provide these features for both the data set we use for training and for any additional locations we hope to predict at using our trained model. The data sources we mentioned above - Landsat Imagery, Survey Data, and the additional Earth Observation data - will provide the features for our model. \

Note: Most of the code below was written by Natalie Ayers for the purpose of developing a learning module for the Geo4Dev Initiative. In some places, I have inserted or changed code to fit or particular task and data.

Data Ingestion and Preprocessing
Variety Trial Data
As mentioned above, the yield data we'll be using to model this process was collected by CIMMYT and can be obtained from their Dataverse. While the format and contents of each data source will vary, to be used in this model a dataset must contain information including: \

  • Georeferenced locations, providing an approximate location or boundary of where the yield measurement took place
  • The yield actually measured
  • The planting date \

Since each variety trial typically contains yield measurements pertaining to different varieties and managments, we'll need to be able to identify each distinct trial as we collect the Earth Observation data. We'll use the provided GPS coordinates to collect the associated SRTM, Landsat, and ERA5 Land information for each trial, but we should also create a unique identifier for each trial that we can use to link the trial and Earth Observation data together. While this will be different depending on the survey data you use, in our example, if we will not be receiving any additional data, we could randomly assign unique identifiers to every record we have. We could also distinctly identify each trial by the sitecodexPlantingDate combination. \

To work with our data tables, we will be using Pandas, a popular and versitile tool for data analysis in Python, and GeoPandas, Geopandas is a popular Python package for working with geographic data. If your data comes in as plot corner points or center points, without being in a geographic type, you can use GeoPandas to convert to a geographic type. \

To start, we will import our variety trial data which contains location GPS coordinates. These coordinates will use the Coordinate Reference System (CRS) World Geodetic System 1984 (WGS84), which has the code EPSG:4326 and is one of the most common Coordinate Reference Systems. This system defines locations on a three-dimensional surface using degrees, which is what we know as latitude and longitude. For more information on Coordinate Reference Systems in Python, see this Towards Data Science tutorial by Abdishakur. \

Before we begine, we will need to set up our Python environment. While the environment Colab is deployed with contains most of the libraries we'll need, such as pandas, geopandas, and earthengine, we do need to install geemap, which will allow us to interact with data and imagery on Google Earth Engine. When setting up an environment on your personal computer, a server, or a virtual machine such as and AWS EC2 instance, it's usually best to use some kind of package management software such as Conda to avoid dependency issues. Here, we'll use pip to install the missing library.

In [10]:
!pip install geemap
Requirement already satisfied: geemap in /usr/local/lib/python3.11/dist-packages (0.35.1)
Requirement already satisfied: bqplot in /usr/local/lib/python3.11/dist-packages (from geemap) (0.12.44)
Requirement already satisfied: colour in /usr/local/lib/python3.11/dist-packages (from geemap) (0.1.5)
Requirement already satisfied: earthengine-api>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from geemap) (1.4.6)
Requirement already satisfied: eerepr>=0.0.4 in /usr/local/lib/python3.11/dist-packages (from geemap) (0.1.0)
Requirement already satisfied: folium>=0.17.0 in /usr/local/lib/python3.11/dist-packages (from geemap) (0.19.4)
Requirement already satisfied: geocoder in /usr/local/lib/python3.11/dist-packages (from geemap) (1.38.1)
Requirement already satisfied: ipyevents in /usr/local/lib/python3.11/dist-packages (from geemap) (2.0.2)
Requirement already satisfied: ipyfilechooser>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from geemap) (0.6.0)
Requirement already satisfied: ipyleaflet>=0.19.2 in /usr/local/lib/python3.11/dist-packages (from geemap) (0.19.2)
Requirement already satisfied: ipytree in /usr/local/lib/python3.11/dist-packages (from geemap) (0.2.2)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.11/dist-packages (from geemap) (3.10.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from geemap) (1.26.4)
Requirement already satisfied: pandas in /usr/local/lib/python3.11/dist-packages (from geemap) (2.2.2)
Requirement already satisfied: plotly in /usr/local/lib/python3.11/dist-packages (from geemap) (5.24.1)
Requirement already satisfied: pyperclip in /usr/local/lib/python3.11/dist-packages (from geemap) (1.9.0)
Requirement already satisfied: pyshp>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from geemap) (2.3.1)
Requirement already satisfied: python-box in /usr/local/lib/python3.11/dist-packages (from geemap) (7.3.2)
Requirement already satisfied: scooby in /usr/local/lib/python3.11/dist-packages (from geemap) (0.10.0)
Requirement already satisfied: google-cloud-storage in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (2.19.0)
Requirement already satisfied: google-api-python-client>=1.12.1 in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (2.155.0)
Requirement already satisfied: google-auth>=1.4.1 in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (2.27.0)
Requirement already satisfied: google-auth-httplib2>=0.0.3 in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (0.2.0)
Requirement already satisfied: httplib2<1dev,>=0.9.2 in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (0.22.0)
Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (2.32.3)
Requirement already satisfied: branca>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from folium>=0.17.0->geemap) (0.8.1)
Requirement already satisfied: jinja2>=2.9 in /usr/local/lib/python3.11/dist-packages (from folium>=0.17.0->geemap) (3.1.5)
Requirement already satisfied: xyzservices in /usr/local/lib/python3.11/dist-packages (from folium>=0.17.0->geemap) (2025.1.0)
Requirement already satisfied: ipywidgets in /usr/local/lib/python3.11/dist-packages (from ipyfilechooser>=0.6.0->geemap) (7.7.1)
Requirement already satisfied: jupyter-leaflet<0.20,>=0.19 in /usr/local/lib/python3.11/dist-packages (from ipyleaflet>=0.19.2->geemap) (0.19.2)
Requirement already satisfied: traittypes<3,>=0.2.1 in /usr/local/lib/python3.11/dist-packages (from ipyleaflet>=0.19.2->geemap) (0.2.1)
Requirement already satisfied: traitlets>=4.3.0 in /usr/local/lib/python3.11/dist-packages (from bqplot->geemap) (5.7.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas->geemap) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas->geemap) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas->geemap) (2025.1)
Requirement already satisfied: click in /usr/local/lib/python3.11/dist-packages (from geocoder->geemap) (8.1.8)
Requirement already satisfied: future in /usr/local/lib/python3.11/dist-packages (from geocoder->geemap) (1.0.0)
Requirement already satisfied: ratelim in /usr/local/lib/python3.11/dist-packages (from geocoder->geemap) (0.1.6)
Requirement already satisfied: six in /usr/local/lib/python3.11/dist-packages (from geocoder->geemap) (1.17.0)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (4.55.7)
Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (1.4.8)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (24.2)
Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (11.1.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (3.2.1)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.11/dist-packages (from plotly->geemap) (9.0.0)
Requirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5 in /usr/local/lib/python3.11/dist-packages (from google-api-python-client>=1.12.1->earthengine-api>=1.0.0->geemap) (2.19.2)
Requirement already satisfied: uritemplate<5,>=3.0.1 in /usr/local/lib/python3.11/dist-packages (from google-api-python-client>=1.12.1->earthengine-api>=1.0.0->geemap) (4.1.1)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.11/dist-packages (from google-auth>=1.4.1->earthengine-api>=1.0.0->geemap) (5.5.1)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.11/dist-packages (from google-auth>=1.4.1->earthengine-api>=1.0.0->geemap) (0.4.1)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.11/dist-packages (from google-auth>=1.4.1->earthengine-api>=1.0.0->geemap) (4.9)
Requirement already satisfied: ipykernel>=4.5.1 in /usr/local/lib/python3.11/dist-packages (from ipywidgets->ipyfilechooser>=0.6.0->geemap) (5.5.6)
Requirement already satisfied: ipython-genutils~=0.2.0 in /usr/local/lib/python3.11/dist-packages (from ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.2.0)
Requirement already satisfied: widgetsnbextension~=3.6.0 in /usr/local/lib/python3.11/dist-packages (from ipywidgets->ipyfilechooser>=0.6.0->geemap) (3.6.10)
Requirement already satisfied: ipython>=4.0.0 in /usr/local/lib/python3.11/dist-packages (from ipywidgets->ipyfilechooser>=0.6.0->geemap) (7.34.0)
Requirement already satisfied: jupyterlab-widgets>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from ipywidgets->ipyfilechooser>=0.6.0->geemap) (3.0.13)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2>=2.9->folium>=0.17.0->geemap) (3.0.2)
Requirement already satisfied: google-cloud-core<3.0dev,>=2.3.0 in /usr/local/lib/python3.11/dist-packages (from google-cloud-storage->earthengine-api>=1.0.0->geemap) (2.4.1)
Requirement already satisfied: google-resumable-media>=2.7.2 in /usr/local/lib/python3.11/dist-packages (from google-cloud-storage->earthengine-api>=1.0.0->geemap) (2.7.2)
Requirement already satisfied: google-crc32c<2.0dev,>=1.0 in /usr/local/lib/python3.11/dist-packages (from google-cloud-storage->earthengine-api>=1.0.0->geemap) (1.6.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests->earthengine-api>=1.0.0->geemap) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests->earthengine-api>=1.0.0->geemap) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->earthengine-api>=1.0.0->geemap) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests->earthengine-api>=1.0.0->geemap) (2024.12.14)
Requirement already satisfied: decorator in /usr/local/lib/python3.11/dist-packages (from ratelim->geocoder->geemap) (4.4.2)
Requirement already satisfied: googleapis-common-protos<2.0.dev0,>=1.56.2 in /usr/local/lib/python3.11/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5->google-api-python-client>=1.12.1->earthengine-api>=1.0.0->geemap) (1.66.0)
Requirement already satisfied: protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0.dev0,>=3.19.5 in /usr/local/lib/python3.11/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5->google-api-python-client>=1.12.1->earthengine-api>=1.0.0->geemap) (4.25.6)
Requirement already satisfied: proto-plus<2.0.0dev,>=1.22.3 in /usr/local/lib/python3.11/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5->google-api-python-client>=1.12.1->earthengine-api>=1.0.0->geemap) (1.26.0)
Requirement already satisfied: jupyter-client in /usr/local/lib/python3.11/dist-packages (from ipykernel>=4.5.1->ipywidgets->ipyfilechooser>=0.6.0->geemap) (6.1.12)
Requirement already satisfied: tornado>=4.2 in /usr/local/lib/python3.11/dist-packages (from ipykernel>=4.5.1->ipywidgets->ipyfilechooser>=0.6.0->geemap) (6.4.2)
Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (75.1.0)
Requirement already satisfied: jedi>=0.16 in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.19.2)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.7.5)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (3.0.50)
Requirement already satisfied: pygments in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (2.18.0)
Requirement already satisfied: backcall in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.2.0)
Requirement already satisfied: matplotlib-inline in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.1.7)
Requirement already satisfied: pexpect>4.3 in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (4.9.0)
Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /usr/local/lib/python3.11/dist-packages (from pyasn1-modules>=0.2.1->google-auth>=1.4.1->earthengine-api>=1.0.0->geemap) (0.6.1)
Requirement already satisfied: notebook>=4.4.1 in /usr/local/lib/python3.11/dist-packages (from widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (6.5.5)
Requirement already satisfied: parso<0.9.0,>=0.8.4 in /usr/local/lib/python3.11/dist-packages (from jedi>=0.16->ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.8.4)
Requirement already satisfied: pyzmq<25,>=17 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (24.0.1)
Requirement already satisfied: argon2-cffi in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (23.1.0)
Requirement already satisfied: jupyter-core>=4.6.1 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (5.7.2)
Requirement already satisfied: nbformat in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (5.10.4)
Requirement already satisfied: nbconvert>=5 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (7.16.6)
Requirement already satisfied: nest-asyncio>=1.5 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.6.0)
Requirement already satisfied: Send2Trash>=1.8.0 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.8.3)
Requirement already satisfied: terminado>=0.8.3 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.18.1)
Requirement already satisfied: prometheus-client in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.21.1)
Requirement already satisfied: nbclassic>=0.4.7 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.2.0)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.11/dist-packages (from pexpect>4.3->ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.7.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.11/dist-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.2.13)
Requirement already satisfied: platformdirs>=2.5 in /usr/local/lib/python3.11/dist-packages (from jupyter-core>=4.6.1->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (4.3.6)
Requirement already satisfied: notebook-shim>=0.2.3 in /usr/local/lib/python3.11/dist-packages (from nbclassic>=0.4.7->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.2.4)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (4.12.3)
Requirement already satisfied: bleach!=5.0.0 in /usr/local/lib/python3.11/dist-packages (from bleach[css]!=5.0.0->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (6.2.0)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.7.1)
Requirement already satisfied: jupyterlab-pygments in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.3.0)
Requirement already satisfied: mistune<4,>=2.0.3 in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (3.1.1)
Requirement already satisfied: nbclient>=0.5.0 in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.10.2)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.5.1)
Requirement already satisfied: fastjsonschema>=2.15 in /usr/local/lib/python3.11/dist-packages (from nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (2.21.1)
Requirement already satisfied: jsonschema>=2.6 in /usr/local/lib/python3.11/dist-packages (from nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (4.23.0)
Requirement already satisfied: argon2-cffi-bindings in /usr/local/lib/python3.11/dist-packages (from argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (21.2.0)
Requirement already satisfied: webencodings in /usr/local/lib/python3.11/dist-packages (from bleach!=5.0.0->bleach[css]!=5.0.0->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.5.1)
Requirement already satisfied: tinycss2<1.5,>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from bleach[css]!=5.0.0->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.4.0)
Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.11/dist-packages (from jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (25.1.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.11/dist-packages (from jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (2024.10.1)
Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.11/dist-packages (from jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.36.2)
Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.11/dist-packages (from jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.22.3)
Requirement already satisfied: jupyter-server<3,>=1.8 in /usr/local/lib/python3.11/dist-packages (from notebook-shim>=0.2.3->nbclassic>=0.4.7->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.24.0)
Requirement already satisfied: cffi>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.17.1)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (2.6)
Requirement already satisfied: pycparser in /usr/local/lib/python3.11/dist-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (2.22)
Requirement already satisfied: anyio<4,>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from jupyter-server<3,>=1.8->notebook-shim>=0.2.3->nbclassic>=0.4.7->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (3.7.1)
Requirement already satisfied: websocket-client in /usr/local/lib/python3.11/dist-packages (from jupyter-server<3,>=1.8->notebook-shim>=0.2.3->nbclassic>=0.4.7->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.8.0)
Requirement already satisfied: typing-extensions>=4.4.0 in /usr/local/lib/python3.11/dist-packages (from referencing>=0.28.4->jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (4.12.2)
Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.11/dist-packages (from anyio<4,>=3.1.0->jupyter-server<3,>=1.8->notebook-shim>=0.2.3->nbclassic>=0.4.7->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.3.1)

Now we're ready to begin working with our variety trial data!

In [11]:
from google.colab import drive #mounting Gdrive to be able to access data files
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [12]:
import pandas as pd
import geopandas as gpd

# loading yields and georeferences
file_id = '1UDz_P75VYZzxiy5mbJWC2tpUBUYvB6Kd' # Variety trial data from Gdrive link
link = f'https://drive.google.com/uc?id={file_id}'
yields = pd.read_csv(link)

yields
Out[12]:
logYield Management PlantingDate AnthesisDate ASI DaysToSilk vargroup yrcode sitecode gs150.gdd ... gs150.tmin tmin2 presilk.tmax2 silk.tmax2 postsilk.tmax2 presilk.gdd830 silk.gdd830 postsilk.gdd830 presilk.gdd30 silk.gdd30
0 2.116713 Optimal 20031205 64.7 2.0 66.7 EPOP 2004 1041 2057.781964 ... 15.750548 248.079765 803.539520 762.833373 762.645584 742.054326 281.274711 1023.444977 1.125855 0.020164
1 1.859403 Optimal 20031205 65.2 2.0 67.2 ILPO 2004 1041 2057.781964 ... 15.750548 248.079765 799.009864 767.489960 765.981249 755.186385 282.395275 1009.192354 1.125855 0.020164
2 1.662258 Optimal 20031205 69.9 2.5 72.4 EIHY 2004 1041 2057.781964 ... 15.750548 248.079765 793.657903 764.914204 779.044136 821.639768 291.092979 934.041267 1.125855 0.020164
3 2.094564 Optimal 20031205 67.8 2.3 70.1 ILHY 2004 1041 2057.781964 ... 15.750548 248.079765 794.168510 772.895256 773.155186 792.787401 289.850278 964.136335 1.125855 0.020164
4 1.950514 Optimal 20031205 68.0 3.4 71.4 ILHY 2004 1041 2057.781964 ... 15.750548 248.079765 793.751548 770.235062 775.340270 807.085180 291.329302 948.359532 1.125855 0.020164
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
26137 -1.044977 Optimal 20070118 NaN NaN NaN EPOP 2007 731 2753.377492 ... 22.276206 496.229341 NaN NaN NaN NaN NaN NaN NaN NaN
26138 -0.868930 Optimal 20070118 NaN NaN NaN EPOP 2007 731 2753.377492 ... 22.276206 496.229341 NaN NaN NaN NaN NaN NaN NaN NaN
26139 -0.621571 Optimal 20070118 NaN NaN NaN EPOP 2007 731 2753.377492 ... 22.276206 496.229341 NaN NaN NaN NaN NaN NaN NaN NaN
26140 -0.696954 Optimal 20070118 NaN NaN NaN EPOP 2007 731 2753.377492 ... 22.276206 496.229341 NaN NaN NaN NaN NaN NaN NaN NaN
26141 -1.679324 Optimal 20070118 NaN NaN NaN EPOP 2007 731 2753.377492 ... 22.276206 496.229341 NaN NaN NaN NaN NaN NaN NaN NaN

26142 rows × 37 columns

In [12]:
 
In [13]:
file_id = '1fzOZbo6DMraeNVsri8FmI1R2No2oMWR3' # locations from Gdrive link
link = f'https://drive.google.com/uc?id={file_id}'
latlon = pd.read_csv(link)

latlon
Out[13]:
LocationID Country Location Region Latitude Longitude ElevM
0 5 ANGOLA ANGOLA Eastern and Southern Africa 0.00 0.00 0
1 6 ANGOLA CABINDA Eastern and Southern Africa -5.60 12.20 22
2 7 ANGOLA CACUSO Eastern and Southern Africa -9.42 15.75 1067
3 8 ANGOLA CELA Eastern and Southern Africa -11.42 15.12 1305
4 9 ANGOLA CHIANGA Eastern and Southern Africa -12.73 15.83 1693
... ... ... ... ... ... ... ...
180 1065 ZIMBABWE RATTRAY FARM Eastern and Southern Africa -17.70 31.20 1403
181 1066 ZIMBABWE RUWA Eastern and Southern Africa -17.89 31.24 1554
182 1067 ZIMBABWE SAVE VALLEY Eastern and Southern Africa -20.35 32.33 443
183 1068 ZIMBABWE SHAMVA ZIM Eastern and Southern Africa -17.32 31.57 1106
184 1069 ZIMBABWE ZIMBABWE Eastern and Southern Africa 0.00 0.00 0

185 rows × 7 columns

In [14]:
# Filter out locations with missing latlon
latlon = latlon.loc[(latlon['Latitude'])!=0|(latlon['Longitude']!=0)].reset_index()
latlon
Out[14]:
index LocationID Country Location Region Latitude Longitude ElevM
0 1 6 ANGOLA CABINDA Eastern and Southern Africa -5.60 12.20 22
1 2 7 ANGOLA CACUSO Eastern and Southern Africa -9.42 15.75 1067
2 3 8 ANGOLA CELA Eastern and Southern Africa -11.42 15.12 1305
3 4 9 ANGOLA CHIANGA Eastern and Southern Africa -12.73 15.83 1693
4 5 10 ANGOLA HUMPATA Eastern and Southern Africa -15.03 13.43 1890
... ... ... ... ... ... ... ... ...
145 179 1064 ZIMBABWE RATTRAY ARNOLD Eastern and Southern Africa -17.67 31.17 1458
146 180 1065 ZIMBABWE RATTRAY FARM Eastern and Southern Africa -17.70 31.20 1403
147 181 1066 ZIMBABWE RUWA Eastern and Southern Africa -17.89 31.24 1554
148 182 1067 ZIMBABWE SAVE VALLEY Eastern and Southern Africa -20.35 32.33 443
149 183 1068 ZIMBABWE SHAMVA ZIM Eastern and Southern Africa -17.32 31.57 1106

150 rows × 8 columns

In [15]:
import shapely.geometry as shg

print(f"First latitude: {latlon['Latitude'].iloc[0]}")
print(f"First longitude: {latlon['Longitude'].iloc[0]}")

lat = latlon['Latitude'].iloc[0]
lon = latlon['Longitude'].iloc[0]

shg.Point(lon,lat)
First latitude: -5.6
First longitude: 12.2
Out[15]:
No description has been provided for this image
In [16]:
# Make geometry and turn it into a GeoDataFrame
import shapely.geometry as shg

latlon['geometry'] = latlon.apply(lambda x: shg.Point(x['Longitude'],x['Latitude']),axis=1)
latlon = gpd.GeoDataFrame(latlon,geometry='geometry',crs="EPSG:4326")
latlon.plot('Country')
Out[16]:
<Axes: >
No description has been provided for this image
In [17]:
latlon
Out[17]:
index LocationID Country Location Region Latitude Longitude ElevM geometry
0 1 6 ANGOLA CABINDA Eastern and Southern Africa -5.60 12.20 22 POINT (12.2 -5.6)
1 2 7 ANGOLA CACUSO Eastern and Southern Africa -9.42 15.75 1067 POINT (15.75 -9.42)
2 3 8 ANGOLA CELA Eastern and Southern Africa -11.42 15.12 1305 POINT (15.12 -11.42)
3 4 9 ANGOLA CHIANGA Eastern and Southern Africa -12.73 15.83 1693 POINT (15.83 -12.73)
4 5 10 ANGOLA HUMPATA Eastern and Southern Africa -15.03 13.43 1890 POINT (13.43 -15.03)
... ... ... ... ... ... ... ... ... ...
145 179 1064 ZIMBABWE RATTRAY ARNOLD Eastern and Southern Africa -17.67 31.17 1458 POINT (31.17 -17.67)
146 180 1065 ZIMBABWE RATTRAY FARM Eastern and Southern Africa -17.70 31.20 1403 POINT (31.2 -17.7)
147 181 1066 ZIMBABWE RUWA Eastern and Southern Africa -17.89 31.24 1554 POINT (31.24 -17.89)
148 182 1067 ZIMBABWE SAVE VALLEY Eastern and Southern Africa -20.35 32.33 443 POINT (32.33 -20.35)
149 183 1068 ZIMBABWE SHAMVA ZIM Eastern and Southern Africa -17.32 31.57 1106 POINT (31.57 -17.32)

150 rows × 9 columns

Visualizing Geographic Areas
We can visualize the locations using Google Earth Engine's mapping capabilities. If you're using Google Colab or other setups where you're running into errors with standard geemap (for example, Jupyter notebooks in VS Code using Windows Subsystem for Linux), you'll need to use geemap.foliumap.Map(). Otherwise, you can use geemap.Map(). \

Our first step will be to Authenticate the connection to Google Earth Engine. The first time you run this code, you'll need to Authenticate using the authentication code you're provided when you are granted GEE access. For more detailed instruction, see this Introduction to Google Earth Engine in the World Bank's Nighttime Lights tutorial. \

Finally, we'll convert our GeoPandas polygons to geographic types that geemap will recognize, which we can do easily using geemap.geopandas_to_ee(). (For more information on converting between geemap and geopandas, see this documentation.) Once we've converted our plots, we can add them to a basemap from geemap to visualize the areas we'll be analyzing. \

Note: As of spring 2022, there are problems using the Earth Engine package with Python 3.9. If you run into an error, for example ModuleNotFoundError: No module named 'StringIO', it may mean that you need to use a different version of Python. I've found 3.7 to work with all the packages required for this module. If you are following the World Bank OpenNightLights tutorial to create a conda environment, this will require creating the environment with conda create --name onl python=3.7 rather than python=3.9. \

In [18]:
import geemap.foliumap, ee

try:
        ee.Initialize()
except Exception as e:
        ee.Authenticate()
        ee.Initialize(project='ee-nltryan')
WARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
In [19]:
site_a = latlon[['LocationID','geometry']].iloc[0:1]
site_b = latlon[['LocationID','geometry']].iloc[50:51]
site_c = latlon[['LocationID','geometry']].iloc[100:101]
In [20]:
ee_site_a = geemap.geopandas_to_ee(site_a).geometry()
center_lat = latlon['Latitude'].iloc[0]
center_lon = latlon['Longitude'].iloc[0]

map_site_a = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=17)
map_site_a.add_basemap('SATELLITE')
map_site_a.addLayer(ee_site_a,vis_params={'color':'a4c639'},opacity=0.8)
map_site_a
Out[20]:
In [21]:
ee_site_b = geemap.geopandas_to_ee(site_b).geometry()
center_lat = latlon['Latitude'].iloc[50]
center_lon = latlon['Longitude'].iloc[50]

map_site_b = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=17)
map_site_b.add_basemap('SATELLITE')
map_site_b.addLayer(ee_site_b,vis_params={'color':'a4c639'},opacity=0.8)
map_site_b
Out[21]:
In [22]:
ee_site_c = geemap.geopandas_to_ee(site_c).geometry()
center_lat = latlon['Latitude'].iloc[100]
center_lon = latlon['Longitude'].iloc[100]

map_site_c = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=17)
map_site_c.add_basemap('SATELLITE')
map_site_c.addLayer(ee_site_c,vis_params={'color':'a4c639'},opacity=0.8)
map_site_c
Out[22]:
In [23]:
# Merge locations onto yields
yields = pd.merge(yields,latlon,
                  left_on='sitecode',right_on='LocationID')
yields = yields.drop('index',axis=1).reset_index()
yields
Out[23]:
index logYield Management PlantingDate AnthesisDate ASI DaysToSilk vargroup yrcode sitecode ... presilk.gdd30 silk.gdd30 LocationID Country Location Region Latitude Longitude ElevM geometry
0 0 2.116713 Optimal 20031205 64.7 2.0 66.7 EPOP 2004 1041 ... 1.125855 0.020164 1041 ZAMBIA ZAMSEED, FARM Eastern and Southern Africa -14.20 28.40 1178 POINT (28.4 -14.2)
1 1 1.859403 Optimal 20031205 65.2 2.0 67.2 ILPO 2004 1041 ... 1.125855 0.020164 1041 ZAMBIA ZAMSEED, FARM Eastern and Southern Africa -14.20 28.40 1178 POINT (28.4 -14.2)
2 2 1.662258 Optimal 20031205 69.9 2.5 72.4 EIHY 2004 1041 ... 1.125855 0.020164 1041 ZAMBIA ZAMSEED, FARM Eastern and Southern Africa -14.20 28.40 1178 POINT (28.4 -14.2)
3 3 2.094564 Optimal 20031205 67.8 2.3 70.1 ILHY 2004 1041 ... 1.125855 0.020164 1041 ZAMBIA ZAMSEED, FARM Eastern and Southern Africa -14.20 28.40 1178 POINT (28.4 -14.2)
4 4 1.950514 Optimal 20031205 68.0 3.4 71.4 ILHY 2004 1041 ... 1.125855 0.020164 1041 ZAMBIA ZAMSEED, FARM Eastern and Southern Africa -14.20 28.40 1178 POINT (28.4 -14.2)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
26137 26137 -1.044977 Optimal 20070118 NaN NaN NaN EPOP 2007 731 ... NaN NaN 731 MOZAMBIQUE MOCUBA Eastern and Southern Africa -16.84 38.26 66 POINT (38.26 -16.84)
26138 26138 -0.868930 Optimal 20070118 NaN NaN NaN EPOP 2007 731 ... NaN NaN 731 MOZAMBIQUE MOCUBA Eastern and Southern Africa -16.84 38.26 66 POINT (38.26 -16.84)
26139 26139 -0.621571 Optimal 20070118 NaN NaN NaN EPOP 2007 731 ... NaN NaN 731 MOZAMBIQUE MOCUBA Eastern and Southern Africa -16.84 38.26 66 POINT (38.26 -16.84)
26140 26140 -0.696954 Optimal 20070118 NaN NaN NaN EPOP 2007 731 ... NaN NaN 731 MOZAMBIQUE MOCUBA Eastern and Southern Africa -16.84 38.26 66 POINT (38.26 -16.84)
26141 26141 -1.679324 Optimal 20070118 NaN NaN NaN EPOP 2007 731 ... NaN NaN 731 MOZAMBIQUE MOCUBA Eastern and Southern Africa -16.84 38.26 66 POINT (38.26 -16.84)

26142 rows × 46 columns

This dataframe can be easily joined by PlantingDate and sitecode to the other sources of data we'll need for our model.

EO Data
Landsat Imagery
To obtain our Landsat imagery for each of the trials, we will use Google Earth Engine's API to obtain the Landsat 5 TM and Landsat 7 ETHM+ Tier 1 ImageCollections. The Landsat imagery needs to be pre-processed to mask pixels with clouds and shadows so these do not influence our predictions of crop type. Masking pixels in an image makes the pixels transparent, so they will not be used in any of the operations or analysis performed. Masks are Images themselves that act as a guide for which pixels should be kept and which pixels should be made transparent. These mask Images can then be applied to other Images which cover the same area. Once applied, any pixels which have mask values which below a threshold (eg, mask values of 0) are made transparent on the image being masked. For more information and an example of masking, see Google Earth Engine's Masking tutorial. We will make use of Landsat's Quality band, which can be used to construct measures of cloud and shadow likelihood. This enables us to identify and mask (make transparent) the pixels which don't represent land. \

First, though, we need to figure our where and when we'll look for imagery. To do so, we'll need to upload the geographic and temporal data related to the trials to Google Earth Engine. We'll look at a 90 day window after the planting date, as this is about how long it takes for maize to mature, and a 5km buffer around the GPS position. This information will be uploaded to GEE as a FeatureCollection, Earth Engine's analog to a row of a GeoPandas GeoDataFrame.

In [24]:
import datetime as dt

yields['PlantingDate'] = yields['PlantingDate'].apply(lambda x: dt.date(int(str(x)[0:4]),
                                                                        int(str(x)[4:6]),
                                                                        int(str(x)[6:])))
yields['EndDate'] = yields['PlantingDate'].apply(lambda x: x+dt.timedelta(days=90))

yields['PlantingDate'] = yields['PlantingDate'].apply(lambda x: x.isoformat())
yields['EndDate'] = yields['EndDate'].apply(lambda x: x.isoformat())

# Only need to get data for unique trials
yield_site_dates = yields[['sitecode','PlantingDate','EndDate','geometry']].drop_duplicates()
yield_site_dates
Out[24]:
sitecode PlantingDate EndDate geometry
0 1041 2003-12-05 2004-03-04 POINT (28.4 -14.2)
121 1041 2004-12-27 2005-03-27 POINT (28.4 -14.2)
211 1041 2002-12-16 2003-03-16 POINT (28.4 -14.2)
297 1041 2006-01-04 2006-04-04 POINT (28.4 -14.2)
332 1041 2006-12-12 2007-03-12 POINT (28.4 -14.2)
... ... ... ... ...
25937 729 2007-02-01 2007-05-02 POINT (38.87 -13.24)
26025 520 2006-12-10 2007-03-10 POINT (27.78 -29.55)
26027 520 2006-12-14 2007-03-14 POINT (27.78 -29.55)
26067 737 2006-12-19 2007-03-19 POINT (34.75 -24.49)
26117 731 2007-01-18 2007-04-18 POINT (38.26 -16.84)

505 rows × 4 columns

In [25]:
yield_site_dates['eeGeom'] = yield_site_dates['geometry'].apply(lambda x: ee.Geometry.Point(x.x,x.y)) # EE GPS data
yield_site_dates['eeFeature'] = yield_site_dates.apply(lambda x: ee.Feature(x['eeGeom'].buffer(5_000), # Buffer out 5km
 {'sitecode':x['sitecode'],'PlantingDate':x['PlantingDate'],'EndDate':x['EndDate']}), # Include sitecode and PlantingDate for merging later
                                   axis=1)
ee_site_dates = ee.FeatureCollection(yield_site_dates['eeFeature'].tolist())
# Visualize the sites
center_lat = -12.042079
center_lon = 25.756629

map_sites = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=4)
map_sites.add_basemap('SATELLITE')
map_sites.addLayer(ee_site_dates,vis_params={'color':'a4c639'},opacity=0.8)
map_sites
Out[25]:

With these areas to use as our geometries, we can now begin to pull in the Landsat data, mask the clouds, and calculate the indices. I wrote the following functions to work with Landsat data and have used them in multiple projects. \

In [26]:
def cloud_mask(img):
    '''
    Function to mask cloudy pixels in a ls image using QA_PIXEL bits. Note: the relevant bits of QA_PIXEL are the same across satellites.

    Inputs:
        - img (ee.Image): the ls image to mask

    Outputs:
        - mask_img (ee.Image): img with cloudy pixels masked
    '''

    qa = img.select(['QA_PIXEL'])

    mask = ee.Image.constant(1).subtract((qa.bitwiseAnd(1<<3).And(qa.bitwiseAnd(1<<9))).Or(qa.bitwiseAnd(1<<4).And(qa.bitwiseAnd(1<<10))))

    return(img.updateMask(mask))

def calc_ndvi(img,red,nir):
    '''
    Function to calculate NDVI in an image

    Inputs:
        - img (ee.Image): The ls image to calculate on
        - red (str): red band name
        - nir (str): nir band name
    Outputs:
        - ndvi (ee.Image): image with one ndvi band named 'ndvi'
    '''
    # NDVI = (NIR-Red) / (NIR+Red)
    ndvi = ee.Image(img.normalizedDifference([nir,red]))
    ndvi = ndvi.rename('ndvi')

    return(ndvi.copyProperties(img))

def scale_facts(img):
    opticalBands = img.select('SR_B.').multiply(0.0000275).add(-0.2)
    thermalBands = img.select('ST_B.*').multiply(0.00341802).add(149.0)
    return(img.addBands(opticalBands,overwrite=True).addBands(thermalBands,overwrite=True))

def calc_mean(feature,img,scale):
    img = ee.Image(img)
    feature_mean = img.reduceRegion(
        ee.Reducer.mean(),
        feature.geometry(),
        scale,
        maxPixels=15_000_000,
        bestEffort=True
    )

    updated = feature.set(feature_mean)
    return(updated)


def append_imgs(current,prev):
    next_img = ee.Image(prev).addBands(current)
    return(next_img)

Our workflow will proceed as follows. For each trial, we will restrict attention to the set of Landsat 4, 5, and 7 images that intersect the 5km radius circle centered on the provided GPS coordinates during the 90 days after the planting date. For each of these images, we will then mask clouds and shadows, mask non-agricultural land using a landuse product developed by the European Space Agency, calculate NDVI, and record the value as a new column for the trial along with the date the image was taken. \

Since trials may have different numbers of images, this will end up being a ragged array, with some locations having many more images than others.

We'll first get the land use image that we'll use to mask non-agricultural pixels.

In [27]:
agland = ee.ImageCollection("ESA/WorldCover/v200").first().eq(40).rename('ag')

Next, we'll need to add the date of capture information to images, apply relevant scaling factors, mask clouds and non-agricultural land, and calculate NDVI. We'll use a function to do this and apply it to each image. \

In [28]:
def process_ls_img(img,red,nir,sat):
    sat_band = ee.Image.constant(sat).rename("sat")

    date = img.get('DATE_ACQUIRED')
    date = ee.Date(date).millis()
    date = ee.Image.constant(date).rename('d')

    img = scale_facts(img)
    img = cloud_mask(img)
    img = img.updateMask(agland)
    img_ndvi = calc_ndvi(img,'SR_B3','SR_B4')
    img_ndvi = ee.Image(img_ndvi)
    img = img_ndvi.addBands(date).addBands(sat_band)
    return(img)

We'll apply this function within a function that will use the trial location and timing information to restrict to the relevant set of images.

In [29]:
def site_ndvis(site_date):
    planting_date = ee.String(site_date.get('PlantingDate'))
    end_date = ee.String(site_date.get('EndDate'))

    ls5 = ee.ImageCollection("LANDSAT/LT05/C02/T1_L2").filterBounds(site_date.geometry()).filterDate(planting_date,end_date)
    ls7 = ee.ImageCollection("LANDSAT/LE07/C02/T1_L2").filterBounds(site_date.geometry()).filterDate(planting_date,end_date)


    ls5 = ee.Algorithms.If(ls5.size().gt(0),
                           ls5.map(lambda x: process_ls_img(x,'SR_B3','SR_B4',5)),
                           ee.ImageCollection([]))

    ls7 = ee.Algorithms.If(ls7.size().gt(0),
                           ls7.map(lambda x: process_ls_img(x,'SR_B3','SR_B4',7)),
                           ee.ImageCollection([]))



    all_ls = ee.ImageCollection(ls5).merge(ee.ImageCollection(ls7))
    all_ls_size = all_ls.size()
    all_ls = ee.Algorithms.If(all_ls_size.gt(0),
                              all_ls.iterate(append_imgs,ee.Image()),
                              all_ls)

    site_data = ee.Algorithms.If(all_ls_size.gt(0),
                                 calc_mean(site_date,all_ls,30),
                                 site_date)

    return(site_data)

We'll now apply this function to every image in our S2 ImageCollection using the function imageCollection.map(), which cycles through every image in an ImageCollection, applies the provided function to the image, and returns each processed image as a new ImageCollection.

In [30]:
ee_sites_ls = ee_site_dates.map(lambda x: site_ndvis(x))

Now, we can look into the FeatureCollection we've created to ensure we have what we need.

In [35]:
ls_cols = ['sitecode','PlantingDate','ndvi','d','sat']
ls_cols = ls_cols + [f'ndvi_{i}' for i in range(1,18)]
ls_cols = ls_cols + [f'd_{i}' for i in range(1,18)]
ls_cols = ls_cols + [f'sat_{i}' for i in range(1,18)]
#sites_ls = geemap.ee_to_pandas(ee_sites_ls,col_names=ls_cols)
sites_ls = geemap.ee_to_df(ee_sites_ls,columns=ls_cols)
sites_ls
Out[35]:
sitecode PlantingDate ndvi d sat ndvi_1 ndvi_2 ndvi_3 ndvi_4 ndvi_5 ... sat_8 sat_9 sat_10 sat_11 sat_12 sat_13 sat_14 sat_15 sat_16 sat_17
0 1041 2003-12-05 0.639729 1.075853e+12 5.0 0.512278 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1041 2004-12-27 0.593166 1.107648e+12 5.0 0.596977 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1041 2002-12-16 0.542198 1.043366e+12 7.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1041 2006-01-04 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1041 2006-12-12 NaN 1.169856e+12 5.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
500 729 2007-02-01 NaN 1.172534e+12 5.0 0.703562 0.671129 0.673045 0.751673 0.691734 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
501 520 2006-12-10 0.347322 1.165882e+12 5.0 0.446907 0.514929 0.348674 0.440482 0.510405 ... 7.0 7.0 7.0 7.0 7.0 7.0 NaN NaN NaN NaN
502 520 2006-12-14 0.446907 1.167264e+12 5.0 0.514929 0.440482 0.510405 0.431683 0.381406 ... 7.0 7.0 7.0 7.0 7.0 7.0 NaN NaN NaN NaN
503 737 2006-12-19 NaN 1.171757e+12 5.0 0.718362 0.678643 0.757944 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
504 731 2007-01-18 0.548135 1.169770e+12 5.0 0.604690 0.626434 0.646203 0.598807 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

505 rows × 56 columns

In [36]:
# Clean up landsat data
from math import pi
import re
import datetime as dt

import numpy as np

sites_ls = sites_ls.rename(columns={'d':'d_0','sat':'sat_0','ndvi':'ndvi_0'})
ls_stack = sites_ls.set_index(['sitecode','PlantingDate']).stack().reset_index(-1).rename(columns={'level_2':'var'})
ls_stack['t'] = ls_stack['var'].apply(lambda x: re.search("(?<=_)[0-9]+",x)[0])
ls_stack['var'] = ls_stack['var'].apply(lambda x: re.search(".*(?=_)",x)[0])
ls_stack = ls_stack.reset_index().set_index(['sitecode','PlantingDate','t','var']).unstack()


def fix_cols(tup):
    if tup[0]!=0:
        return(tup[0])
    else:
        return(tup[1])

ls_stack = ls_stack.reset_index()
ls_stack.columns = [fix_cols(c) for c in ls_stack.columns]

ls_stack = ls_stack.dropna(subset='ndvi')

ls_stack['d'] = ls_stack['d'].apply(lambda x: dt.datetime.fromtimestamp(x / 1000.0, tz=dt.timezone.utc))
ls_stack['d'] = ls_stack['d'].apply(lambda x: x.date())
ls_stack['PlantingDate'] = ls_stack['PlantingDate'].apply(dt.date.fromisoformat)
ls_stack['t'] = ls_stack['d']-ls_stack['PlantingDate']
ls_stack['t'] = ls_stack['t'].apply(lambda x: x.days/90)

ls_stack
Out[36]:
sitecode PlantingDate t d ndvi sat
0 6 2002-12-13 0.433333 2003-01-21 0.422388 7.0
1 6 2002-12-13 0.611111 2003-02-06 0.469667 7.0
3 6 2002-12-16 0.400000 2003-01-21 0.422388 7.0
4 6 2002-12-16 0.577778 2003-02-06 0.469667 7.0
6 6 2002-12-17 0.388889 2003-01-21 0.422388 7.0
... ... ... ... ... ... ...
2352 1067 2006-05-20 0.788889 2006-07-30 0.289856 5.0
2353 1067 2006-05-20 0.088889 2006-05-28 0.331522 7.0
2354 1068 2004-12-24 0.611111 2005-02-17 0.679722 5.0
2355 1068 2004-12-24 0.166667 2005-01-08 0.576108 7.0
2356 1068 2004-12-24 0.344444 2005-01-24 0.457912 7.0

1911 rows × 6 columns

While these indices were created for us in the code, let's quickly go over calculation with image bands in case you need different indices or calculations. Complex expressions can be performed using the Image.expression() method. To use this, you provide the equation and the bands to use the equation and apply this equation to every image using .map(). For example, to calculate NDVI, we use the formula (NIR - RED) / (NIR + RED), where NIR is the 'NIR' band and RED is the 'RED' band.

Now that we've seen how everything works in code, let's head over to the Earth Engine code editor to get a sense of what's happening behind the scenes and how you can start to develop new projects using Earth Engine. \

This script has one piece to fill in at line 10. Look back at the code in this notebook to help figure out how to finish that line. Once you've completed the script, you'll be able to explore qualities of an image collection interactively and visualize a Landsat image. \

This next script has you obtain statistics over an area. You'll also get to see how Earth Engine combines data from many images into a single image. \

This final script has you practice doing "band math" and masking images. You'll see how you can perform calculations on images and combine information from multiple images.

Index Time Series and Crop Phenology
Since we've obtained Landsat imagery for our entire growing season duration, we can also visualize our data over time to view how NDVI changes. The change in index values over time can provide important information, including a crop's growth cycle and pattern, known as crop phenology. As a crop grows from seed to full plant, the values of the indices for that area will also change, allowing researchers to obtain detailed information about a crop's growth. \

The FeatureCollection we obtained is comprised of multiple satellite Images taken at different times and areas, and each image contains multiple bands - including those used to calculate NDVI, GCVI, and the other vegetation indices. \

We'll visualize the data two ways here: \

  • We'll look at all NDVI values relative to their time in the 90 day window \
  • We'll look at one particular NDVI series
In [37]:
 ls_stack.plot.scatter(x='t',y='ndvi')
Out[37]:
<Axes: xlabel='t', ylabel='ndvi'>
No description has been provided for this image
In [38]:
series_ex = ls_stack.loc[(ls_stack['sitecode']==1067)&(ls_stack['PlantingDate'].apply(str)=='2006-05-20')]
series_ex[['d','ndvi']].set_index('d').plot()
Out[38]:
<Axes: xlabel='d'>
No description has been provided for this image

Here, it looks like NDVI generally increases, though there is quite a bit of noise. This probably has to do with spatial error in the reported location of the variety trial.

We've plotted the mean NDVI values for our sample trial over our growing season. The values are fairly choppy, especially with the number of images masked due to cloud cover, which can make it more difficult to determine the pattern over time. This is why smoothing is often applied to indices in order to use them in determining crop phenology. \

Deines et al. fit a harmonic regression model to the time series of NDVI, which means they include the time values within sine and cosine terms in their regression equation. For a brief introduction to using the sum of sine and cosine terms, known as a Fourier series, in functions representing periodic motion and to harmonic analysis, see this Britannica explainer. While other methods have been explored for accurately capturing the changes in crops over seasons, the use of harmonic regression has been found to be successful in representing crop behavior (Jin et al. 2019). \

For our use case with one harmonic pair, we'll be running the regression \

$NDVI_{it}=\alpha_i+\delta_i t+\beta^{cos}_icos(2\pi t)+\beta^{sin}_isin(2\pi t)+ɛ_{it}$

where $t$ is the fraction of the 90 day window that has passed at the time of observation. \

To actually implement this, we'll need to add the cosine and sine terms, then run a trial-specific linear regression. Because we have four terms to fit, we need at least four observations within the 90 day period to actually estimate all of the parameters of the model. \

In [39]:
import warnings

from sklearn.linear_model import LinearRegression

ls_stack['cos_t'] = (2*pi*ls_stack['t']).apply(np.cos)
ls_stack['sin_t'] = (2*pi*ls_stack['t']).apply(np.sin)

yield_site_dates['ls_int'] = np.nan
yield_site_dates['ls_t_coef'] = np.nan
yield_site_dates['ls_cos_coef'] = np.nan
yield_site_dates['ls_sin_coef'] = np.nan

for i in range(0,len(yield_site_dates.index)):
    site_date = ls_stack.loc[(ls_stack['sitecode']==yield_site_dates['sitecode'].iloc[i])&(ls_stack['PlantingDate'].apply(str)==yield_site_dates['PlantingDate'].iloc[i])]
    #print(len(site_date.index))
    if len(site_date.index)>=4:
        y=site_date['ndvi'].to_numpy()
        x=site_date[['t','cos_t','sin_t']].to_numpy()

        ols = LinearRegression()
        ols.fit(x,y)
        # Lots of value setting warnings from pandas that we'll ignore
        with warnings.catch_warnings():
          warnings.simplefilter("ignore")
          yield_site_dates['ls_int'].iloc[i] = ols.intercept_
          yield_site_dates['ls_t_coef'].iloc[i] = ols.coef_[0]
          yield_site_dates['ls_cos_coef'].iloc[i] = ols.coef_[1]
          yield_site_dates['ls_sin_coef'].iloc[i] = ols.coef_[2]

yield_site_dates
Out[39]:
sitecode PlantingDate EndDate geometry eeGeom eeFeature ls_int ls_t_coef ls_cos_coef ls_sin_coef
0 1041 2003-12-05 2004-03-04 POINT (28.4 -14.2) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN
121 1041 2004-12-27 2005-03-27 POINT (28.4 -14.2) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN
211 1041 2002-12-16 2003-03-16 POINT (28.4 -14.2) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN
297 1041 2006-01-04 2006-04-04 POINT (28.4 -14.2) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN
332 1041 2006-12-12 2007-03-12 POINT (28.4 -14.2) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ...
25937 729 2007-02-01 2007-05-02 POINT (38.87 -13.24) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... -0.161036 1.316836 -0.227792 0.159079
26025 520 2006-12-10 2007-03-10 POINT (27.78 -29.55) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... 0.366722 0.230899 -0.000952 -0.008080
26027 520 2006-12-14 2007-03-14 POINT (27.78 -29.55) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... 0.523215 -0.050080 0.009552 -0.091911
26067 737 2006-12-19 2007-03-19 POINT (34.75 -24.49) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN
26117 731 2007-01-18 2007-04-18 POINT (38.26 -16.84) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... 0.578816 0.036213 -0.016432 -0.035037

505 rows × 10 columns

The coefficients that we've added as columns now summarize the trajectory of NDVI over the 90 day period using four values. We'll use these four values as inputs into our predictions later.

OPTIONAL: Plot Smoothed Index Time Series
As an optional step, let's now try to recreate our time series plot above using the smoothed values for NDVI. \

To do so, we'll make a dataframe of predicted values at many values of $t$, using the coefficients we just estimated. We'll then merge on the observed values of NDVI and compare the predictions to the truth.

In [40]:
ndvi_pred = pd.DataFrame({'t':np.linspace(0,1,100)})
ndvi_pred['cos_t'] = (2*pi*ls_stack['t']).apply(np.cos)
ndvi_pred['sin_t'] = (2*pi*ls_stack['t']).apply(np.sin)

int_coef = yield_site_dates.loc[(yield_site_dates['sitecode']==1067)&(yield_site_dates['PlantingDate'].apply(str)=='2006-05-20'),'ls_int'].iloc[0]
t_coef = yield_site_dates.loc[(yield_site_dates['sitecode']==1067)&(yield_site_dates['PlantingDate'].apply(str)=='2006-05-20'),'ls_t_coef'].iloc[0]
cos_coef = yield_site_dates.loc[(yield_site_dates['sitecode']==1067)&(yield_site_dates['PlantingDate'].apply(str)=='2006-05-20'),'ls_cos_coef'].iloc[0]
sin_coef = yield_site_dates.loc[(yield_site_dates['sitecode']==1067)&(yield_site_dates['PlantingDate'].apply(str)=='2006-05-20'),'ls_sin_coef'].iloc[0]

ndvi_pred['ndvi_hat'] = int_coef+t_coef*ndvi_pred['t']+cos_coef*ndvi_pred['cos_t']+sin_coef*ndvi_pred['sin_t']
ndvi_pred = pd.merge(ndvi_pred,
                     ls_stack.loc[(ls_stack['sitecode']==1067)&(ls_stack['PlantingDate'].apply(str)=='2006-05-20'),['t','ndvi']],
                     on='t',
                     how='outer')

ndvi_pred
Out[40]:
t cos_t sin_t ndvi_hat ndvi
0 0.000000 -0.913545 0.406737 0.308813 NaN
1 0.010101 -0.766044 -0.642788 0.316720 NaN
2 0.020202 NaN NaN NaN NaN
3 0.030303 -0.809017 0.587785 0.310176 NaN
4 0.040404 -0.882948 -0.469472 0.316090 NaN
... ... ... ... ... ...
113 0.969697 -0.848048 0.529919 0.355034 NaN
114 0.977778 NaN NaN NaN 0.420683
115 0.979798 -0.809017 -0.587785 0.362338 NaN
116 0.989899 0.173648 -0.984808 0.374155 NaN
117 1.000000 NaN NaN NaN NaN

118 rows × 5 columns

In [41]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1,1)

ndvi_pred.loc[~ndvi_pred['ndvi_hat'].isna(),['t','ndvi_hat']].set_index('t').plot(ax=ax)
ndvi_pred.loc[~ndvi_pred['ndvi'].isna(),['t','ndvi']].plot.scatter(x='t',y='ndvi',ax=ax)
Out[41]:
<Axes: xlabel='t', ylabel='ndvi'>
No description has been provided for this image

Shuttle Radar Topography Mission Data
To obtain the elevation, slope, and aspect information for our plots, we will use the Shuttle Radar Topography Mission Version 3.0, which we can find in Google Earth Engine as the Image ee.Image("USGS/SRTMGL1_003").

First, we'll set the variables to use in downloading the elevation data:

In [42]:
# define GEE id of SRTM V3 Image
srtm = ee.Image('USGS/SRTMGL1_003').updateMask(agland)

We can obtain the elevation itself using the srtm_id and selecting the band elevation. Then, we can use this elevation Image along with the ee.Terrain package to obtain the slope and aspect information.

In [43]:
elevation = srtm.select('elevation')
slope = ee.Terrain.slope(elevation).rename('slope')
aspect = ee.Terrain.aspect(elevation).rename('aspect')

We want to calculate the elevation, slope, and aspect for each plot of land and store the values for use in our model. This will again require the geographic boundaries of the plots to define the areas over which we'll calculate average elevation, slope, and aspect. We can use the same geemap sample plots created above to calculate average elevation, slope, and aspect with the reduceRegions() method. To begin, though, let's first demonstrate with a single Image and plot: we'll calculate the mean slope in plot c using reduceRegion() and the slope Image we just created.

In [44]:
# Calculate mean slope for area
slope_c = slope.reduceRegion(
    reducer = ee.Reducer.mean(),
    geometry = ee_site_c.buffer(1000),
    scale = 30
)
mean_slope_c = slope_c.get('slope')
print('Mean slope for site c:',mean_slope_c.getInfo())
Mean slope for site c: 2.747879962126414

We can also visualize these topography characteristics on a map to get a better sense of what we're calculating. For example, if we'd like to view areas of higher slope, we can plot slope on top of a map of the areas we're interested in.

In [45]:
center_lat = yield_site_dates['geometry'].iloc[100].y
center_lon = yield_site_dates['geometry'].iloc[100].x

elev_map = geemap.foliumap.Map(center=[center_lat,center_lon], zoom=9)
elev_map.addLayer(slope, {min:0, max: 90}, 'slope')
elev_map
Out[45]:

In this example, we can visualize areas of higher and lower slope as shades of grey. While this provides the information we need, often it can be helpful to display satellige imagery using a more extensive color palette to more easily distinguish the gradient of values. Here's an example of the elevation we obtained from SRTM using a white to red color palette.

In [46]:
center_lat = yield_site_dates['geometry'].iloc[100].y
center_lon = yield_site_dates['geometry'].iloc[100].x


map_plot_c = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=6)

terrain_params = {'min': 0, 'max': 1000,
                  'palette': [ '#ffffff','#ff0000']}
map_plot_c.addLayer(elevation, terrain_params, 'elevation')
map_plot_c
Out[46]:

In order to obtain the elevation, slope, and aspect for all plots of interest for use in our model, we can use ee.Image.reduceRegion() to perform the same calculations for the mean and .map() to apply it to multiple areas. We'll use our ee_site_dates feature collection for this purpose.

In [48]:
srtm_img = elevation.addBands(slope)
srtm_img = srtm_img.addBands(aspect)

ee_site_srtm = ee_site_dates.map(lambda x: calc_mean(x,srtm_img,30))
# srtm_df = geemap.ee_to_pandas(ee_site_srtm)
srtm_df = geemap.ee_to_df(ee_site_srtm)

srtm_df
Out[48]:
EndDate PlantingDate aspect elevation sitecode slope
0 2004-03-04 2003-12-05 141.450640 1181.855266 1041 1.977909
1 2005-03-27 2004-12-27 141.450640 1181.855266 1041 1.977909
2 2003-03-16 2002-12-16 141.450640 1181.855266 1041 1.977909
3 2006-04-04 2006-01-04 141.450640 1181.855266 1041 1.977909
4 2007-03-12 2006-12-12 141.450640 1181.855266 1041 1.977909
... ... ... ... ... ... ...
500 2007-05-02 2007-02-01 176.217421 498.452234 729 2.160150
501 2007-03-10 2006-12-10 174.417739 1994.903339 520 5.144013
502 2007-03-14 2006-12-14 174.417739 1994.903339 520 5.144013
503 2007-03-19 2006-12-19 156.100872 89.777128 737 2.426769
504 2007-04-18 2007-01-18 168.197365 73.112299 731 2.247579

505 rows × 6 columns

We can see that this DataFrame has values for mean elevation, slope, and aspect for each of our trials. These values can be merged on to our Landsat data to provide extra information useful for predicting yield.

Climate Data
For this demonstration, we will utilize ERA 5 Reanalysis data on daily precipitation and maximum temperature. These are available for direct download or through Google Earth Engine; here, we'll use Google Earth Engine to obtain mean values for the 5km radius circles centered on the provided GPS locations.

First, let's visualize precipitation for one trial.

In [49]:
# this is not a real plot from any survey data; it was randomly selected for demonstration
center_lat = yield_site_dates['geometry'].iloc[0].y
center_lon = yield_site_dates['geometry'].iloc[0].x

precip_point_g = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=7)
precip_point_g.add_basemap('SATELLITE')

# select precipitation for the range of dates in the growing season
precip = ee.ImageCollection("ECMWF/ERA5/DAILY").\
  filter(ee.Filter.date(yield_site_dates['PlantingDate'].iloc[0], yield_site_dates['EndDate'].iloc[0])).select('total_precipitation')

precip_params = {'min': 0.0, 'max': 0.10,\
  'palette': ['001137', '0aab1e', 'e7eb05', 'ff4a2d', 'e90000'],}
precip_point_g.addLayer(precip, precip_params, 'total_precipitation');
precip_point_g
Out[49]:

Now, we'll get the total precipitation and temperature over the course of the growing season for each of our plots:

In [51]:
def process_era_img(img):
    img = img.select(['maximum_2m_air_temperature','total_precipitation'])
    img = img.rename(['t','p'])
    img = img.updateMask(agland)
    return(img)

def site_weather(site_date):
    planting_date = ee.String(site_date.get('PlantingDate'))
    end_date = ee.String(site_date.get('EndDate'))

    era5 = ee.ImageCollection("ECMWF/ERA5/DAILY").filterBounds(site_date.geometry()).filterDate(planting_date,end_date)

    era5 = ee.Algorithms.If(era5.size().gt(0),
                           ee.ImageCollection(era5.map(process_era_img)),
                           ee.ImageCollection([]))
    era5 = ee.ImageCollection(era5)
    era5_size = era5.size()
    era5 = ee.Algorithms.If(era5_size.gt(0),
                            ee.Image(era5.iterate(append_imgs,ee.Image())),
                            era5)

    site_data = ee.Algorithms.If(era5_size.gt(0),
                                 calc_mean(site_date,era5,30),
                                 site_date)

    return(site_data)

ee_site_era5 = ee_site_dates.map(lambda x: site_weather(x))

site_era5 = geemap.ee_to_df(ee_site_era5)
site_era5
Out[51]:
EndDate PlantingDate p p_1 p_10 p_11 p_12 p_13 p_14 p_15 ... t_81 t_82 t_83 t_84 t_85 t_86 t_87 t_88 t_89 t_9
0 2004-03-04 2003-12-05 0.015568 0.004844 0.004068 0.013641 0.002943 0.000320 0.002526 0.009870 ... 297.579468 297.346741 297.674896 297.569794 299.302246 301.228638 298.878235 299.837128 300.256134 298.602142
1 2005-03-27 2004-12-27 0.003703 0.026521 0.000781 0.000152 0.000142 0.003902 0.008237 0.011529 ... 302.013550 301.217163 299.966095 301.336060 300.738037 301.758209 301.622437 299.063843 299.259888 300.890137
2 2003-03-16 2002-12-16 0.004360 0.014276 0.020998 0.020011 0.011626 0.029256 0.021670 0.012067 ... 301.148163 299.238129 296.953491 300.799744 297.679199 296.279633 299.398712 300.254852 300.157043 296.176147
3 2006-04-04 2006-01-04 0.008878 0.005565 0.016839 0.021182 0.014174 0.009605 0.005294 0.004315 ... 299.669495 299.427521 301.432129 297.776428 299.015594 298.784180 298.875183 299.045349 297.762024 300.792847
4 2007-03-12 2006-12-12 0.014733 0.006304 0.007241 0.006208 0.000174 0.000274 0.004535 0.010779 ... 299.051453 298.938599 302.099487 301.654236 299.694794 297.407623 296.389435 299.265564 298.629211 299.326202
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
500 2007-05-02 2007-02-01 0.000886 0.014860 0.012758 0.011453 0.004728 0.002353 0.000262 0.001151 ... 302.756393 302.214558 301.248064 302.707794 303.163194 303.672526 305.190473 301.593207 302.151482 303.419322
501 2007-03-10 2006-12-10 0.002900 0.000003 0.036347 0.002470 0.001266 0.014713 0.008835 0.000695 ... 301.010653 301.718612 302.090371 296.883295 292.387048 296.363779 296.472263 295.078919 296.334451 301.033064
502 2007-03-14 2006-12-14 0.000833 0.000897 0.008835 0.000695 0.003145 0.008284 0.011418 0.004879 ... 292.387048 296.363779 296.472263 295.078919 296.334451 297.331167 297.696334 298.753351 301.303561 298.308230
503 2007-03-19 2006-12-19 0.000522 0.000130 0.000028 0.000627 0.027878 0.008867 0.000165 0.006064 ... 301.915191 304.183616 305.861503 306.734443 305.456852 304.723269 306.804111 307.411262 300.775432 306.256657
504 2007-04-18 2007-01-18 0.011238 0.042136 0.001714 0.012431 0.005236 0.002775 0.001222 0.000129 ... 301.759778 302.310618 304.353557 305.352067 304.986025 303.722007 303.911944 304.914958 303.700500 305.237236

505 rows × 183 columns

Similar to what we did with the NDVI data, we'll run another set of harmonic regressions to parsimoniously characterize weather over the growing season. Rather than using one set of harmonic pairs, however, we'll use two. The specification we'll estimate is given by

$NDVI_{it}=\alpha_i+\delta_i t+\sum\limits_{w=1}^2[\beta^{cos_w}_icos(2w\pi t)+\beta^{sin_w}_isin(2w\pi t)]+ɛ_{it}$

In [52]:
site_era5 = site_era5.drop('EndDate',axis=1)
site_era5 = site_era5.rename(columns={'t':'t_0','p':'p_0'})
site_era5 = site_era5.set_index(['sitecode','PlantingDate']).stack().reset_index()
site_era5['time'] = site_era5['level_2'].apply(lambda x: re.search("(?<=_)[0-9]+",x)[0])
site_era5['var'] = site_era5['level_2'].apply(lambda x: re.search(".*(?=_)",x)[0])
site_era5 = site_era5.drop('level_2',axis=1)
site_era5 = site_era5.set_index(['sitecode','PlantingDate','time','var']).unstack()
site_era5 = site_era5.reset_index()
site_era5.columns = [fix_cols(c) for c in site_era5.columns]
site_era5['time'] = site_era5['time'].apply(int)/90
site_era5['cos_t'] = (2*pi*site_era5['time']).apply(np.cos)
site_era5['cos_2t'] = (4*pi*site_era5['time']).apply(np.cos)
site_era5['sin_t'] = (2*pi*site_era5['time']).apply(np.sin)
site_era5['sin_2t'] = (4*pi*site_era5['time']).apply(np.sin)

yield_site_dates['temp_int'] = np.nan
yield_site_dates['temp_t_coef'] = np.nan
yield_site_dates['temp_cos1_coef'] = np.nan
yield_site_dates['temp_sin1_coef'] = np.nan
yield_site_dates['temp_cos2_coef'] = np.nan
yield_site_dates['temp_sin2_coef'] = np.nan

yield_site_dates['prec_int'] = np.nan
yield_site_dates['prec_t_coef'] = np.nan
yield_site_dates['prec_cos1_coef'] = np.nan
yield_site_dates['prec_sin1_coef'] = np.nan
yield_site_dates['prec_cos2_coef'] = np.nan
yield_site_dates['prec_sin2_coef'] = np.nan

for i in range(0,len(yield_site_dates.index)):
    site_date = site_era5.loc[(site_era5['sitecode']==yield_site_dates['sitecode'].iloc[i])&(site_era5['PlantingDate']==yield_site_dates['PlantingDate'].iloc[i])]
    if len(site_date.index)>=4:
        y=site_date['t'].to_numpy()
        x=site_date[['t','cos_t','sin_t','cos_2t','sin_2t']].to_numpy()

        ols = LinearRegression()
        ols.fit(x,y)

        with warnings.catch_warnings():
          warnings.simplefilter("ignore")
          yield_site_dates['temp_int'].iloc[i] = ols.intercept_
          yield_site_dates['temp_t_coef'].iloc[i] = ols.coef_[0]
          yield_site_dates['temp_cos1_coef'].iloc[i] = ols.coef_[1]
          yield_site_dates['temp_sin1_coef'].iloc[i] = ols.coef_[2]
          yield_site_dates['temp_cos2_coef'].iloc[i] = ols.coef_[3]
          yield_site_dates['temp_sin2_coef'].iloc[i] = ols.coef_[4]

        y=site_date['p'].to_numpy()

        ols = LinearRegression()
        ols.fit(x,y)
        with warnings.catch_warnings():
          warnings.simplefilter("ignore")
          yield_site_dates['prec_int'].iloc[i] = ols.intercept_
          yield_site_dates['prec_t_coef'].iloc[i] = ols.coef_[0]
          yield_site_dates['prec_cos1_coef'].iloc[i] = ols.coef_[1]
          yield_site_dates['prec_sin1_coef'].iloc[i] = ols.coef_[2]
          yield_site_dates['prec_cos2_coef'].iloc[i] = ols.coef_[3]
          yield_site_dates['prec_sin2_coef'].iloc[i] = ols.coef_[4]

yield_site_dates
Out[52]:
sitecode PlantingDate EndDate geometry eeGeom eeFeature ls_int ls_t_coef ls_cos_coef ls_sin_coef ... temp_cos1_coef temp_sin1_coef temp_cos2_coef temp_sin2_coef prec_int prec_t_coef prec_cos1_coef prec_sin1_coef prec_cos2_coef prec_sin2_coef
0 1041 2003-12-05 2004-03-04 POINT (28.4 -14.2) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN ... -1.736619e-17 3.205925e-17 4.018757e-18 1.174892e-17 0.540462 -0.001775 0.000133 0.000033 0.001089 0.000317
121 1041 2004-12-27 2005-03-27 POINT (28.4 -14.2) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN ... 5.916547e-17 1.179240e-17 5.920669e-17 -1.556963e-17 0.509888 -0.001681 0.000162 0.000994 -0.001183 0.001088
211 1041 2002-12-16 2003-03-16 POINT (28.4 -14.2) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN ... 9.197232e-17 1.661072e-17 1.027760e-16 1.126899e-16 0.577168 -0.001894 0.001244 -0.000235 0.001430 0.000089
297 1041 2006-01-04 2006-04-04 POINT (28.4 -14.2) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN ... -3.508322e-17 1.072939e-17 -4.996104e-17 -1.178115e-17 0.576704 -0.001901 -0.000920 0.001219 -0.000539 0.003002
332 1041 2006-12-12 2007-03-12 POINT (28.4 -14.2) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN ... 4.547606e-17 1.546745e-18 3.371382e-17 1.789307e-17 0.810949 -0.002677 -0.004516 -0.001545 0.002296 0.003972
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25937 729 2007-02-01 2007-05-02 POINT (38.87 -13.24) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... -0.161036 1.316836 -0.227792 0.159079 ... -3.152655e-17 -4.061944e-17 -3.205936e-17 -3.706137e-17 0.154494 -0.000498 0.001377 0.002959 0.000369 0.001259
26025 520 2006-12-10 2007-03-10 POINT (27.78 -29.55) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... 0.366722 0.230899 -0.000952 -0.008080 ... -1.323362e-16 1.279006e-18 -1.153060e-16 2.235635e-17 0.064505 -0.000205 0.000514 0.000491 0.000680 0.001376
26027 520 2006-12-14 2007-03-14 POINT (27.78 -29.55) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... 0.523215 -0.050080 0.009552 -0.091911 ... -9.379284e-17 8.731942e-18 -7.848041e-17 8.685789e-18 0.088653 -0.000285 0.000721 0.000257 0.001376 0.000666
26067 737 2006-12-19 2007-03-19 POINT (34.75 -24.49) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... NaN NaN NaN NaN ... -1.574434e-16 3.032031e-17 -1.472751e-17 3.793571e-17 0.113799 -0.000366 0.000331 0.000091 -0.000853 0.000985
26117 731 2007-01-18 2007-04-18 POINT (38.26 -16.84) ee.Geometry({\n "functionInvocationValue": {\... ee.Feature({\n "functionInvocationValue": {\n... 0.578816 0.036213 -0.016432 -0.035037 ... 1.020221e-16 -1.126183e-18 7.934377e-17 4.851794e-18 0.431510 -0.001400 -0.000721 0.002653 0.000046 -0.000007

505 rows × 22 columns

We now have the precipitation and temperature for each trial. Along with the trial data, Landsat NDVI, and SRTM terrain measures, we now have all the features we'll use to build out our model for yield prediction.

Generate Composites
In order to utilize all our information in a random forest model, we'll need to combine all the information corresponding to each plot in the same table. Python makes it easy to use Pandas DataFrames in machine learning models, so we will combine each individual DataFrame or GeoDataFrame we created above (the selected variety trial information, the Landsat NDVI data, the Shuttle Radar Topography Mission data, and the weather data) into the same table and join by trial. Pandas provides a few options for joining tables, but for this purpose we can use pd.merge(), which performs a database-like join using according to specified fields. For more information on merge, see the Pandas documentation, and for information on the ways to combine tables in Pandas, see this user guide. \

In [53]:
yields
Out[53]:
index logYield Management PlantingDate AnthesisDate ASI DaysToSilk vargroup yrcode sitecode ... silk.gdd30 LocationID Country Location Region Latitude Longitude ElevM geometry EndDate
0 0 2.116713 Optimal 2003-12-05 64.7 2.0 66.7 EPOP 2004 1041 ... 0.020164 1041 ZAMBIA ZAMSEED, FARM Eastern and Southern Africa -14.20 28.40 1178 POINT (28.4 -14.2) 2004-03-04
1 1 1.859403 Optimal 2003-12-05 65.2 2.0 67.2 ILPO 2004 1041 ... 0.020164 1041 ZAMBIA ZAMSEED, FARM Eastern and Southern Africa -14.20 28.40 1178 POINT (28.4 -14.2) 2004-03-04
2 2 1.662258 Optimal 2003-12-05 69.9 2.5 72.4 EIHY 2004 1041 ... 0.020164 1041 ZAMBIA ZAMSEED, FARM Eastern and Southern Africa -14.20 28.40 1178 POINT (28.4 -14.2) 2004-03-04
3 3 2.094564 Optimal 2003-12-05 67.8 2.3 70.1 ILHY 2004 1041 ... 0.020164 1041 ZAMBIA ZAMSEED, FARM Eastern and Southern Africa -14.20 28.40 1178 POINT (28.4 -14.2) 2004-03-04
4 4 1.950514 Optimal 2003-12-05 68.0 3.4 71.4 ILHY 2004 1041 ... 0.020164 1041 ZAMBIA ZAMSEED, FARM Eastern and Southern Africa -14.20 28.40 1178 POINT (28.4 -14.2) 2004-03-04
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
26137 26137 -1.044977 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... NaN 731 MOZAMBIQUE MOCUBA Eastern and Southern Africa -16.84 38.26 66 POINT (38.26 -16.84) 2007-04-18
26138 26138 -0.868930 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... NaN 731 MOZAMBIQUE MOCUBA Eastern and Southern Africa -16.84 38.26 66 POINT (38.26 -16.84) 2007-04-18
26139 26139 -0.621571 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... NaN 731 MOZAMBIQUE MOCUBA Eastern and Southern Africa -16.84 38.26 66 POINT (38.26 -16.84) 2007-04-18
26140 26140 -0.696954 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... NaN 731 MOZAMBIQUE MOCUBA Eastern and Southern Africa -16.84 38.26 66 POINT (38.26 -16.84) 2007-04-18
26141 26141 -1.679324 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... NaN 731 MOZAMBIQUE MOCUBA Eastern and Southern Africa -16.84 38.26 66 POINT (38.26 -16.84) 2007-04-18

26142 rows × 47 columns

In [54]:
yield_site_dates = yield_site_dates.dropna(subset='ls_int')
yields_w_features = pd.merge(yields,yield_site_dates,
                             on=['sitecode','PlantingDate'])
yields_w_features = pd.merge(yields_w_features,srtm_df,
                             on=['sitecode','PlantingDate'])

yields_w_features
Out[54]:
index logYield Management PlantingDate AnthesisDate ASI DaysToSilk vargroup yrcode sitecode ... prec_int prec_t_coef prec_cos1_coef prec_sin1_coef prec_cos2_coef prec_sin2_coef EndDate aspect elevation slope
0 406 -0.749718 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... 0.273267 -0.000894 -0.000462 0.000345 0.000767 -0.001314 2004-02-18 152.694000 1541.680776 2.851092
1 407 0.011830 Optimal 2003-11-20 NaN NaN NaN ILPO 2004 517 ... 0.273267 -0.000894 -0.000462 0.000345 0.000767 -0.001314 2004-02-18 152.694000 1541.680776 2.851092
2 408 -0.898451 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... 0.273267 -0.000894 -0.000462 0.000345 0.000767 -0.001314 2004-02-18 152.694000 1541.680776 2.851092
3 409 -0.613412 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... 0.273267 -0.000894 -0.000462 0.000345 0.000767 -0.001314 2004-02-18 152.694000 1541.680776 2.851092
4 410 -0.178768 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... 0.273267 -0.000894 -0.000462 0.000345 0.000767 -0.001314 2004-02-18 152.694000 1541.680776 2.851092
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12674 26137 -1.044977 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... 0.431510 -0.001400 -0.000721 0.002653 0.000046 -0.000007 2007-04-18 168.197365 73.112299 2.247579
12675 26138 -0.868930 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... 0.431510 -0.001400 -0.000721 0.002653 0.000046 -0.000007 2007-04-18 168.197365 73.112299 2.247579
12676 26139 -0.621571 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... 0.431510 -0.001400 -0.000721 0.002653 0.000046 -0.000007 2007-04-18 168.197365 73.112299 2.247579
12677 26140 -0.696954 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... 0.431510 -0.001400 -0.000721 0.002653 0.000046 -0.000007 2007-04-18 168.197365 73.112299 2.247579
12678 26141 -1.679324 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... 0.431510 -0.001400 -0.000721 0.002653 0.000046 -0.000007 2007-04-18 168.197365 73.112299 2.247579

12679 rows × 71 columns

Features
We've generated three sets of features we can use to predict yields: harmonic coefficients characterizing the trajectory of NDVI over the growing season, SRTM feature related to topography, and harmonic coefficients describing temperature and precipitation. In addition to these features, the variety trial data contain a number of variables describing the conditions of the particular trial which should be useful for predicting variation in yields within a given trial, including management practices (e.g. optimal, low nitrogen) and the "family" the variety is derived from. We'll make a list to keep track of all these features we'll be using for prediction.

In [55]:
yields_w_features = pd.merge(yields_w_features,pd.get_dummies(yields_w_features['Management']),
                             left_index=True,right_index=True)
yields_w_features = pd.merge(yields_w_features,pd.get_dummies(yields_w_features['vargroup']),
                             left_index=True,right_index=True)
yields_w_features
Out[55]:
index logYield Management PlantingDate AnthesisDate ASI DaysToSilk vargroup yrcode sitecode ... slope Drought Low N Low pH Maize Streak Virus Optimal EIHY EPOP ILHY ILPO
0 406 -0.749718 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... 2.851092 False False False False True False True False False
1 407 0.011830 Optimal 2003-11-20 NaN NaN NaN ILPO 2004 517 ... 2.851092 False False False False True False False False True
2 408 -0.898451 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... 2.851092 False False False False True False True False False
3 409 -0.613412 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... 2.851092 False False False False True False True False False
4 410 -0.178768 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... 2.851092 False False False False True False True False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12674 26137 -1.044977 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... 2.247579 False False False False True False True False False
12675 26138 -0.868930 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... 2.247579 False False False False True False True False False
12676 26139 -0.621571 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... 2.247579 False False False False True False True False False
12677 26140 -0.696954 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... 2.247579 False False False False True False True False False
12678 26141 -1.679324 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... 2.247579 False False False False True False True False False

12679 rows × 80 columns

In [56]:
features = ['elevation','slope','aspect', # SRTM features
            'ls_int','ls_t_coef','ls_cos_coef','ls_sin_coef', # NDVI features
            'temp_int','temp_t_coef','temp_cos1_coef','temp_sin1_coef','temp_cos2_coef','temp_sin2_coef', # temperature features
            'prec_int','prec_t_coef','prec_cos1_coef','prec_sin1_coef','prec_cos2_coef','prec_sin2_coef', # precipitation features
            'Drought','Low N','Low pH','Maize Streak Virus','Optimal','EIHY','EPOP','ILHY','ILPO', # Management and variety features
            #'Latitude','Longitude','yrcode' # context
            ]

Train and Tune Model
Train, Validation, and Test Subsets
Once we have the complete feature set for use in our model, we will prepare the data for use in model training by splitting into five folds. These folds will be used for training, validation, and testing. To be used for valid inference, errors in the model need to be uncorrelated across folds, though the particularities of fold construction depend on the use case for the predictions.

In the context of spatial data, ensuring errors are uncorrelated across folds will typically require some kind of spatial splitting procedure. Here, we'll assign trials to folds, rather than individual observations. This ensures that we can't over fit by learning some function of the features that can identify trials and use the within-trial correlation in yields to obtain high-predictive performance. The issue with learning such a function is that it will generalize poorly outside of the particular sites and years we're training on. Ensuring all observations within a trial are only used for training, validation, or testing helps ameliorate this concern.

In [57]:
np.random.seed(8105) # Random.org Min: 1, Max: 10000 2023-09-20 18:36:37 UTC
yield_sites = yields_w_features[['sitecode','yrcode']].drop_duplicates()
yield_sites['fold'] = np.random.choice(range(0,5),size=len(yield_sites.index))
yield_sites
Out[57]:
sitecode yrcode fold
0 517 2004 1
52 517 2005 3
54 517 2001 1
187 513 2004 1
239 513 2001 4
... ... ... ...
12359 1043 2007 3
12403 1057 2007 1
12520 729 2007 4
12612 520 2007 2
12654 731 2007 3

155 rows × 3 columns

In [58]:
yields_w_features = pd.merge(yields_w_features,yield_sites,
                             on=['sitecode','yrcode'])
yields_w_features
Out[58]:
index logYield Management PlantingDate AnthesisDate ASI DaysToSilk vargroup yrcode sitecode ... Drought Low N Low pH Maize Streak Virus Optimal EIHY EPOP ILHY ILPO fold
0 406 -0.749718 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... False False False False True False True False False 1
1 407 0.011830 Optimal 2003-11-20 NaN NaN NaN ILPO 2004 517 ... False False False False True False False False True 1
2 408 -0.898451 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... False False False False True False True False False 1
3 409 -0.613412 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... False False False False True False True False False 1
4 410 -0.178768 Optimal 2003-11-20 NaN NaN NaN EPOP 2004 517 ... False False False False True False True False False 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12674 26137 -1.044977 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... False False False False True False True False False 3
12675 26138 -0.868930 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... False False False False True False True False False 3
12676 26139 -0.621571 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... False False False False True False True False False 3
12677 26140 -0.696954 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... False False False False True False True False False 3
12678 26141 -1.679324 Optimal 2007-01-18 NaN NaN NaN EPOP 2007 731 ... False False False False True False True False False 3

12679 rows × 81 columns

Hyperparameter Tuning
We now can utilize the training dataset to tune our hyperparameters with a grid search and cross-validation. Hyperparameters are parameters of a machine learning model which do not come from the data but instead are set prior to training the final model. For example, a random forest model can use multiple different criteria for determining how 'close' each field is to the others according to the training data provided, and the criterion used is a hyperparameter which is set prior to model training. \

A grid search is a process which is performed prior to the final model training in order to determine which hyperparameters provide the best model performance. In a grid search, multiple combinations of hyperparameters are tested to determine the best combination for performance. \

We'll also use cross-validation to make this testing more comprehensive. Cross-validation essentially trains and tests multiple models on different subsets, or folds, of the data to arrive at a more robust estimate of model performance. For example, if our training dataset was an (unrealistically small) 100 records and we set our number of 'folds' to be k = 5, we would have 5 folds of 20 records. In the first cross-validation step, the first fold of 20 will be left out while three folds are used to train a model and the final fold is used to choose the best combination of hyperparameters. Then, the model which is trained will be used on the first fold and the performance assessed. Next, the second fold will be left out and the other four folds (the first, third, fourth, and fifth) will be used to train the model and choose hyperparameters again, with the performance assessed on the held-out second fold. This process wil continue until all five folds have been used to evaluate a model, and the combined performance of all five models will be used as our performance estimate. \

In [59]:
from sklearn.ensemble import RandomForestRegressor

pred_dfs = []
for fold in range(0,5):
  train_val = yields_w_features.loc[yields_w_features['fold']!=fold]
  val_fold = (fold+1)%5
  val = yields_w_features.loc[yields_w_features['fold']==val_fold]
  train = train_val.loc[train_val['fold']!=val_fold]

  train_x = train[features].to_numpy()
  train_y = train['logYield'].to_numpy()

  val_x = val[features].to_numpy()
  val_y = val['logYield'].to_numpy()

  trainval_x = train_val[features].to_numpy()
  trainval_y = train_val['logYield'].to_numpy()

  test_df = yields_w_features.loc[yields_w_features['fold']==fold]
  test_x = test_df[features].to_numpy()

  best_r2 = -1
  best_depth = -1

  for d in range(1,11):
    rf = RandomForestRegressor(max_depth=d)
    rf.fit(train_x,train_y)

    r2 = rf.score(val_x,val_y)
    print(r2)
    if r2>best_r2:
      best_r2 = r2
      best_depth = d

  rf = RandomForestRegressor(best_depth)
  rf.fit(trainval_x,trainval_y)
  preds = rf.predict(test_x)
  test_df['logYield_hat'] = preds
  pred_dfs.append(test_df)

yields_w_features = pd.concat(pred_dfs)
yields_w_features
-0.21670445974577524
-0.20868851382220166
-0.27323653217950494
-0.16932273773026774
-0.1581475251104807
-0.1624630786586283
-0.1723100627995504
-0.2211647525979481
-0.2545038334207088
-0.16777324725148368
<ipython-input-59-137561aa4091>:38: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['logYield_hat'] = preds
0.04399194345913193
0.11540378363361647
0.16470541476121825
0.13855971211419893
0.2516018729748746
0.3217334135712866
0.2806350919036348
0.3523249988674868
0.3910143539379939
0.3632055721266426
<ipython-input-59-137561aa4091>:38: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['logYield_hat'] = preds
-0.13536442589038034
-0.13205149981689757
-0.078517327768584
-0.13945995682836232
-0.1614833831956628
-0.17953220032491557
-0.2478425538748279
-0.21674798821739483
-0.22752016392971397
-0.20751149526080392
<ipython-input-59-137561aa4091>:38: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['logYield_hat'] = preds
0.1034746510774972
0.19800642110865874
0.03487763034138236
0.0011729896206109647
0.20708913540142238
0.19464295600074177
0.13592546342225298
0.16135349673663624
0.14723650188138582
0.1426360007570635
<ipython-input-59-137561aa4091>:38: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['logYield_hat'] = preds
-0.20439082696241662
-0.10735924697463384
0.12677745685625463
-0.05718765886008037
-0.1410356847509262
-0.190792448335396
-0.2050362691701979
-0.17318619275435276
-0.2085622703504546
-0.16457193251032276
<ipython-input-59-137561aa4091>:38: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['logYield_hat'] = preds
Out[59]:
index logYield Management PlantingDate AnthesisDate ASI DaysToSilk vargroup yrcode sitecode ... Low N Low pH Maize Streak Virus Optimal EIHY EPOP ILHY ILPO fold logYield_hat
263 697 1.087450 Optimal 2003-11-21 66.0 7.0 73.0 EPOP 2004 896 ... False False False True False True False False 0 1.330768
264 698 1.128495 Optimal 2003-11-18 72.4 3.9 76.3 ILPO 2004 896 ... False False False True False False False True 0 1.185352
265 699 1.323541 Optimal 2003-11-21 72.0 4.0 76.0 EPOP 2004 896 ... False False False True False True False False 0 1.330768
266 700 1.249902 Optimal 2003-11-21 68.0 5.3 73.3 EPOP 2004 896 ... False False False True False True False False 0 1.330768
267 701 1.496650 Optimal 2003-11-21 72.4 1.7 74.1 EPOP 2004 896 ... False False False True False True False False 0 1.330768
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12607 26020 1.348938 Optimal 2007-01-11 54.0 8.7 62.7 EIHY 2007 729 ... False False False True True False False False 4 0.305599
12608 26021 1.527577 Optimal 2007-01-10 58.3 5.2 63.5 ILHY 2007 729 ... False False False True False False True False 4 1.362675
12609 26022 1.662596 Optimal 2007-01-10 59.0 5.4 64.4 ILHY 2007 729 ... False False False True False False True False 4 1.362675
12610 26023 1.214847 Optimal 2007-01-10 60.3 4.8 65.1 ILHY 2007 729 ... False False False True False False True False 4 1.362675
12611 26024 1.476491 Optimal 2007-01-10 60.0 5.2 65.2 ILHY 2007 729 ... False False False True False False True False 4 1.362675

12679 rows × 82 columns

In [60]:
yields_w_features[['logYield','logYield_hat']].corr()**2
Out[60]:
logYield logYield_hat
logYield 1.000000 0.124822
logYield_hat 0.124822 1.000000

HOMEWORK: GENERATE PREDICTED YIELD MAPS \

Apply Trained Model in Earth Engine
Now that we've trained our random forest we're ready to predict yields where we don't observe ground truth. There are at least a couple reasonable ways to go about this. The first is to construct the features, just as we have done for this workshop, for a larger collection of locations. For example, you may make a grid of 5 km x 5 km cells. Once you've constructed the features for these new locations, you can use the random forest's .predict() method to generate predicted yields. \

Apply Trained Model in Earth Engine
The second approach is notably more complicated and involves forming the predictions within Earth Engine. Rather than making a dataframe with all of the necessary features, it is possible to make an eeImage with bands corresponding to each of the features. To do so, you would need to run the harmonic regressions within Earth Engine using ee.Reducer.linearRegression(). To implement the random forest on earth engine, you can create a string representation of the model using geemap.ml.rf_to_strings() and upload it to Earth Engine with ee.Classifier.decisionTreeEnsemble().

References \

“About | 50 by 2030.” 2019. https://www.50x2030.org/about.

Azzari, George, Shruti Jain, Graham Jeffries, Talip Kilic, and Siobhan Murray. 2021. “Understanding the Requirements for Surveys to Support Satellite-Based Crop Type Mapping: Evidence from Sub-Saharan Africa.” Remote Sensing 13, no. 23: 4749. https://doi.org/10.3390/rs13234749.

Deines, Jillian M., Rinkal Patel, Sang-Zi Liang, Walter Dado, and David B. Lobell. 2021. “A Million Kernels of Truth: Insights into Scalable Satellite Maize Yield Mapping and Yield Gap Analysis from an Extensive Ground Dataset in the US Corn Belt.” Remote Sensing of Environment 253: 112174. https://doi.org/10.1016/j.rse.2020.112174.

Earth Resources Observation And Science (EROS) Center. 2017. “Sentinel.” Tiff,jpg. U.S. Geological Survey. https://doi.org/10.5066/F76W992G. “Food Sustainability Index.” 2021. Economist Impact and Barilla Centre for Food and Nutrition Foundation. https://impact.economist.com/projects/foodsustainability/.

Funk, C.C., Peterson, P.J., Landsfeld, M.F., Pedreros, D.H., Verdin, J.P., Rowland, J.D., Romero, B.E., Husak, G.J., Michaelsen, J.C., and Verdin, A.P.. 2014. A quasi-global precipitation time series for drought monitoring: U.S. Geological Survey Data Series 832, 4 p. http://pubs.usgs.gov/ds/832/.

Funk, Chris, Pete Peterson, Seth Peterson, Shraddhanand Shukla, Frank Davenport, Joel Michaelsen, Kenneth R. Knapp, Martin Landsfeld, Gregory Husak, Laura Harrison, James Rowland, Michael Budde, Alex Meiburg, Tufa Dinku, Diego Pedreros, and Nicholas Mata. 2019. A high-resolution 1983–2016 Tmax climate data record based on infrared temperatures and stations by the Climate Hazard Center. Journal of Climate 32, 5639-5658. https://doi.org/10.1175/JCLI-D-18-0698.1.

Jin, Zhenong, George Azzari, Calum You, Stefania Di Tommaso, Stephen Aston, Marshall Burke, and David B. Lobell. 2019. “Smallholder Maize Area and Yield Mapping at National Scales with Google Earth Engine.” Remote Sensing of Environment 228: 115–28. https://doi.org/10.1016/j.rse.2019.04.016.

“Missions - Sentinel Online - Sentinel Online.” n.d. Accessed July 5, 2021. https://sentinel.esa.int/web/sentinel/missions.

“Sentinel-2 - Satellite Description - Sentinel Online - Sentinel Online.” n.d. Accessed September 2, 2021. https://sentinel.esa.int/web/sentinel/missions/sentinel-2/satellite-description.

“User Guides - Sentinel-2 MSI - Resolutions - Sentinel Online - Sentinel Online.” n.d. Accessed September 2, 2021. https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi/resolutions.

“Sentinel-2 - Missions - Resolution and Swath - Sentinel Handbook - Sentinel Online.” n.d. Accessed September 2, 2021. https://sentinel.esa.int/web/sentinel/missions/sentinel-2/instrument-payload/resolution-and-swath.