ABSTRACT
This tutorial provides guidance on creating a machine learning model to predict maize yields from satellite imagery and other earth observation data. We train on variety trial data collected by CIMMYT used in Lobell et al. (2011) Nature Climate Change.
CONTENTS
Overview
- Yield Prediction Overview
- Geospatial Data and Tools
- Variety Trial Data
- Earth Observation Data Sets
- Landsat Multispectral Satellite Imagery
- Additional EO Data
- Shuttle Radar Topography Mission (SRTM) Data
- Weather Data
- Data Structures
- GIS in Python and Google Earth Engine for Earth Observation Data Introduction to Data
Developing Crop Type Mapping Model
- Data Ingestion and Pre-Processing
- Generate Composites
- Train and Tune Model
- Train and Test Subsets
- Hyperparameter Tuning
- Excluding Plot Observations
- Sample Size
- Feature Pre-Selection
- Survey Data
- EO Data
- Visualizing Geographic Areas
- Calculating Plot Areas
- Sentinel-2 Level-2A Imagery
- Shuttle Radar Topography Mission Data
- Climate Data
- Index Time Series and Crop Phenology
- Obtain Harmonic Coefficients
- Obtaining Observed Weather
Generate Predicted Cultivation Maps
- Apply Trained and Tuned Model to Full Area of Interest
- Generate Predicted Yield Maps
OVERVIEW
Yield Prediction Overview
Smallholder farms are increasingly in the spotlight of development goals at a national and international level. As changing climates and growing populations increase food scarcity, creating sustainable food systems has become a major focus (Economist Impact Food Sustainability Index 2021). However, monitoring the progress of these farms has been a challenge, as researchers and policy makers have relied on surveys or manual labeling of imagery to identify the type and scope of farming operations (Azzari et al. 2021). There is an acute need for new approaches to monitoring smallholder agriculture systems, particularly in developing countries; these are the primary focus of development goals, yet they often lack the resources to facilitate consistent, widespread monitoring. \
The 50x2030 Initiative to Close the Agricultural Data Gap aims to empower and support 50 low and lower-middle income countries (L/LMICs) by 2030 to build strong national data systems that produce and use high-quality and timely agricultural data through survey programs. The activities supported by the 50x2030 Initiative are carried out under three Implementation Components, namely (1) Data Production, led by the Food and Agriculture Organization of the United Nations (FAO), in collaboration with the World Bank Development Data Group, the Poverty and Equity Global Practice and the Agriculture Global Practice; (2) Methods and Tools Development, led by the World Bank Development Data Group, (3) Data Use, led by the International Fund for Agricultural Development. \
Under the Methods and Tools Development Component (MTD Component), the 50x2030 Initiative supports research to produce new, better, and more cost-effective tools and methodologies for data collection and analysis in the context of agricultural and rural surveys. MTD Component has three major work areas: (1) integration of surveys, (2) integration of improved methods into survey data collection, and (3) integration of surveys with other data sources. Under this third work area, 50x2030-supported research activities are geared towards the development of methods for the integration of survey data with different data sources, including but not limited to, geospatial, administrative and census data, with an eye on enhancing the value of survey data in policy-relevant analysis and research. A strong emphasis will be placed on enabling surveys to feed into remote sensing applications that aim to produce actionable, high-resolution of key indicators at-scale - anchored in the demands voiced by national governments and international development partners for advancing remote sensing applications that can both inform agricultural decision-making and help monitor and understand progress towards the Sustainable Development Goal (SDG) 2, with a focus on SDG Target 2.3 and 2.4. \
Since georeferenced ground data creates the backbone required for accurate satellite-based outcome monitoring, the 50x2030 initiative is uniquely positioned to catalyze the implementation, and eventual use, of integrated satellite-survey applications that can rapidly generate maps of agricultural outcomes across expansive geographies in smallholder systems -- reaching far beyond the coverage of field surveys. Over the last decade, the research has focused on developing and testing algorithms to derive satellite-based yields in large-scale systems (Clevers 1997); (Lobell et al. 2005) and in smallholder systems (Burke and Lobell, 2017; Gourlay et al., 2019; Jain et al., 2016; Jin et al., 2017; Lambert et al., 2018; Lobell et al., 2019, 2020). \
However, progress has been slower-than-desired in implementing integrated satellite-survey applications that can monitor agricultural outcomes at-scale (i.e. for entire countries and across continents) in the smallholder production systems that characterize much of the agriculture sector in low- and low-middle-income countries. One the main hurdles has been the lack of knowledge regarding the required volume, methods, and content of georeferenced microdata that should be collected as part of surveys in order to inform remote sensing applications capable of fulfilling spatially-disaggregated estimation and reporting needs. \
Against this background, the MTD Component of the 50x2030 Initiative is supporting research to generate guidelines for designing and implementing large-scale surveys in a way that can generate the required data for training remote sensing models for high-resolution crop area and crop yield mapping in low- and lower-middle income countries (50x2030 n.d.). \
In line with the goals outlined above, this workshop builds on research conducted by Lobell et al. (2011) studying particular non-linear yield responses of maize grown in Africa to temperature. While prior research had demonstrated such a non-linear effect in maize yields in the United States (Schlenker and Roberts, 2009), evidence from Africa was limited by the availability of a suitably large collection of maize yield measurements. The study emphasized the importance of variety trial data as a source for expanding our understanding of agriculture in regions typically thought to be data-poor.
The approach to predicting yields taken in this workshop roughly follows that of Deines et al. (2021). In that study, point-level maize yield ground truth collected by combine harvesters was used to validate predictions based on crop phenology and weather. While their approach to training the model was slightly different, we will "use similar features to form our predictions
Geospatial Data and Tools
Data Structures
In geospatial data analysis, data can be classified into two categories: raster and vector data. A graphic comparison between raster and vector data can be found in the World Bank Nighttime Lights Tutorial module 2, section 1.
- Raster data: Data stored in a raster format is arranged in a regular grid of cells, without storing the coordinates of each point (namely, a cell, or a pixel). The coordinates of the corner points and the spacing of the grid can be used to calculate (rather than to store) the coordinates of each location in the grid. Any given pixel in the grid stores one or more values (in one or more bands).
- Vector data: Data in a vector format is stored such that the X and Y coordinates are stored for each point. Data can be represented, for example, as points, lines and polygons. A point has only one coordinate (X and Y), a line has two or more coordinates, and a polygon is essentially a line that closes on itself to enclose a region. Polygons are usually used to represent the area and perimeter of continuous geographic features. Vector data stores features in their original resolution, without aggregation. \
In this tutorial, we will use vector and raster data. Geospatial data in vector format are often stored in a shapefile, a popular format for storing vector data developed by ESRI. The shapefile format is actually composed of multiple individual files which make up the entire data. At a minimum, there will be 3 file types included with this geographic data (.shp, .shx, .dbf), but there are often other files included which store additional information. In order to be read and used as a whole, all file types must have the same name and be in the same folder. Because the structure of points, lines, and polygons are different, each shapefile can only contain one vector type (all points, all lines, or all polygons). You will not find a mixture of point, line, and polygon objects in a single shapefile, so in order to work with these different types in the same analysis, multiple shapefiles will need to be used and layered. For more details on shapefiles and file types, see this documentation. \
Raster data, on the other hand, is stored in Tagged Image File Format (TIFF or TIF). A GeoTIFF is a TIFF file that follows a specific standard for structuring meta-data. The meta-data stored in a TIFF is called a tif tag and GeoTIFFs often contain tags including spatial extent, coordinate reference system, resolution, and number of layers. \
More information and examples can be found in sections 2 & 3 of the Earth Analytics Python Course. \
GIS in Python and Google Earth Engine for Earth Observation Data
We'll be sourcing the EO data used in this process from Google Earth Engine and coding in this Colab notebook. Before proceeding you will need to have a personal Google Earth Engine account linked to a Gmail account. It may take a day or longer for your Google Earth Engine account to be granted access. \
Two of the primary packages we'll be using, Pandas and GeoPandas, must be installed according to their installation instructions: Pandas Installation and GeoPandas Installation. If you're on Windows, GeoPandas installation can occasionally be temperamental - using an environment, as in the World Bank Nighttime Lights Tutorial, can often circumvent any issues, but if you're still having problems, there are a number of guides online, such as this Practial Data Science guide or this Medium post by Nayane Maia, which provide installation help that may allow you to be more successful. Using Windows Subsystem for Linux (WSL) can also make use of tricky packages like GeoPandas easier.
INTRODUCTION TO DATA
Variety Trial Data
While the data used in measuring crop yields will differ by location and time period of interest for your use case, for the purposes of this module we will leverage the variety trial data collected by CIMMYT utilized by Lobell et al. in their study Nonlinear heat effects on African maize as evidenced by historical yield trials. These data were originally collected to test the yield responses of novel varieties to different managements and shocks, but thanks to their GPS coordinates, could be co-located with weather data for the purposes of the study. Similarly, we make use of these geo-referenced data by matching yield observations with Earth Observation data. \
The data can be downloaded from CIMMYT's Dataverse, a repository of data used in CIMMYT studies. CIMMYT, like many of the other CGIAR organizations, now has an open data policy and publishes data they collect whenever ethical and feasible. \
Earth Observation (EO) Data Sets
Landsat Multispectral Imagery
The method we will use to predict maize yields is based on Landsat multispectral imagery. The Landsat satellites have been generating Earth Observation data since the 1970s(!), though the instruments used for Earth Observation have changed over time. We will use data collected by the Thematic Mapper aboard Landsat 5 and the Enhanced Thematic Mapper + aboard Landsat 7. These instruments all collect multispectral image data: raster data containing information from many segments of the electromagnetic spectrum. These are used to enable vegetation, land cover, and environmental monitoring of all of Earth's land surfaces and coastal waters. \
We will be using Tier 1 Landsat imagery to make predictions of maize production. The Tier 1 product provides imagery which has been orthorectified to remove distortion caused by differing terrain and sensor angles when the images were collected. Orthorectification allows the images to be used to accurately measure and quantify land features and distances. For more information on orthorectification, see this USGS brief explanation or this more detailed description from the Satellite Imaging Corporation.
The Tier 1 product provides Surface Reflectance (SR) images, which are derived from Top Of Atmosphere (TOA) reflectances. TOA reflectance values represent all lights which are reflected back to the satellite from the earth or the atmosphere. For analyses like ours, which are interested in the land cover, the presence of atmospheric reflections can distort the interpretation of the image, as they are blended with the reflectances from earth. To remove this potential source of distortion, atmospheric corrections can be applied to reduce or remove the effect of reflectances from the atmosphere and only retain the light reflected from the ground. The SR product is the result of processing the TOA product to remove atmospheric reflectance, and thus is the preferred source for our investigation of crop type. \
Landsat imagery records values for multiple spectral bands, but in practice, rather than being used individually, these bands are often combined into indices. Spectral indices can combine multiple bands with complementary information into a single value which can provide more information on ground cover than the individual bands. Our analysis will use the Normalized Difference Vegetation Index (NDVI), which is a particular transformation of the Red and Near Infrared bands. \
Additional Earth Observation Data Sources
While the core components of this model are the satellite imagery and the variety trial data, Deines et al. also incorporate additional Earth Observation (EO) data sources into their predictive model. They selected features of the landscape and climate which are correlated with the selection of crop types for given plots of land. \
Topography Features: Elevation, Slope, and Aspect
In order to assess cropland suitability, the team utilized elevation, slope, and aspect (the direction of the slope) as proxies. These selections were made under the assumption that erosion and soil degradation are exacerbated by high slope and elevation, thus making these areas less likely to be strong agricultural sites. Elevation, slope, and aspect can be obtained from the Shuttle Radar Topography Mission (SRTM) datasource, which is available at a resolution of 30m. The Shuttle Radar Topography Mission, flown on the space shuttle Endeavour in February, 2000, was used by the National Aeronautics and Space Administration (NASA) and the National Geospatial-Intelligence Agency (NGA) to create near-global land elevation data. This was accomplished through the use of two antennas on the Endeavour, with the surface elevation calculated as the difference between these signals. \
Multiple versions of this SRTM elevation data have been created from the original data collection, with subsequent versions performing tasks such as identifying and correcting for water bodies and filling voids. These versions are made available in 1 and 3 arc-second resolutions, where 1 arc-second corresponds to roughly 30m and 3 arc-second corresponds to apprximately 90m. For this model, we'll leverage Version 3.0 (also known as SRTM Plus), which was developed by NASA's Jet Propulsion Laboratory (JPL) to fill the voids in the elevation data with non-commercial data sources ASTER GDEM2 (a joint project of NASA and components of the Japanese government), GMTED2010 (developed by NGA and the U.S. Geological Survey (USGS)), and NED (developed by USGS). For more information on the SRTM versions and methodology, see the SRTM User Guide. \
Weather Features: Temperature and Precipitation
Weather considerations are also an important component of cropland suitability, so these can also be incorporated into this crop classification model. Deines et al. included a number of hand-selected weather variables, such as average temperature, total precipitation, and growing degree days (GDD), with a growing degree day defined as a day where the mean temperature is greater than a base value that must be exceeded for a crop to grow. However, we will use a slightly different approach based on harmonic regression. \
Because weather stations are relatively sparse globally, it is often helpful or necessary to use an interpolated weather product of some kind. Here, we will use the ERA5 Land Reanalysis. This product combines information from weather stations around the world with a climate model to produce hourly estimates of a number of weather variables at a 0.1 arc-degree resolution. We will use a daily version of this product for our analysis.
DEVELOPING YIELD PREDICTION MODEL
To predict yields, we will train a random forest model on our Earth Observation features. A random forest model consists of an ensemble of decision trees (ie, a set of multiple, individual decision trees), where each decision tree utilizes a series of features to determine the optimal classification for given observations. For an approachable overview of Random Forest Models, see this Towards Data Science post by Tony Yiu, or for a more thorough overview see Chapter 8 of An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Rob Tibshirani, which can be downloaded for free here. Further information on Random Forest models can also be found in The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Rob Tibshirani, and Jerome Friedman, which can be downloaded free here, and an introduction to Random Forests including Python example code can be found in Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurelien Geron. \
To begin, in order for a random forest model to successfully predict yields, we need to provide it information (features) about each location we hope to predict at. We will need to provide these features for both the data set we use for training and for any additional locations we hope to predict at using our trained model. The data sources we mentioned above - Landsat Imagery, Survey Data, and the additional Earth Observation data - will provide the features for our model. \
Note: Most of the code below was written by Natalie Ayers for the purpose of developing a learning module for the Geo4Dev Initiative. In some places, I have inserted or changed code to fit or particular task and data.
Data Ingestion and Preprocessing
Variety Trial Data
As mentioned above, the yield data we'll be using to model this process was collected by CIMMYT and can be obtained from their Dataverse. While the format and contents of each data source will vary, to be used in this model a dataset must contain information including: \
- Georeferenced locations, providing an approximate location or boundary of where the yield measurement took place
- The yield actually measured
- The planting date \
Since each variety trial typically contains yield measurements pertaining to different varieties and managments, we'll need to be able to identify each distinct trial as we collect the Earth Observation data. We'll use the provided GPS coordinates to collect the associated SRTM, Landsat, and ERA5 Land information for each trial, but we should also create a unique identifier for each trial that we can use to link the trial and Earth Observation data together. While this will be different depending on the survey data you use, in our example, if we will not be receiving any additional data, we could randomly assign unique identifiers to every record we have. We could also distinctly identify each trial by the sitecode
xPlantingDate
combination. \
To work with our data tables, we will be using Pandas, a popular and versitile tool for data analysis in Python, and GeoPandas, Geopandas is a popular Python package for working with geographic data. If your data comes in as plot corner points or center points, without being in a geographic type, you can use GeoPandas to convert to a geographic type. \
To start, we will import our variety trial data which contains location GPS coordinates. These coordinates will use the Coordinate Reference System (CRS) World Geodetic System 1984 (WGS84), which has the code EPSG:4326 and is one of the most common Coordinate Reference Systems. This system defines locations on a three-dimensional surface using degrees, which is what we know as latitude and longitude. For more information on Coordinate Reference Systems in Python, see this Towards Data Science tutorial by Abdishakur. \
Before we begine, we will need to set up our Python environment. While the environment Colab is deployed with contains most of the libraries we'll need, such as pandas
, geopandas
, and earthengine
, we do need to install geemap
, which will allow us to interact with data and imagery on Google Earth Engine. When setting up an environment on your personal computer, a server, or a virtual machine such as and AWS EC2 instance, it's usually best to use some kind of package management software such as Conda to avoid dependency issues. Here, we'll use pip to install the missing library.
!pip install geemap
Requirement already satisfied: geemap in /usr/local/lib/python3.11/dist-packages (0.35.1) Requirement already satisfied: bqplot in /usr/local/lib/python3.11/dist-packages (from geemap) (0.12.44) Requirement already satisfied: colour in /usr/local/lib/python3.11/dist-packages (from geemap) (0.1.5) Requirement already satisfied: earthengine-api>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from geemap) (1.4.6) Requirement already satisfied: eerepr>=0.0.4 in /usr/local/lib/python3.11/dist-packages (from geemap) (0.1.0) Requirement already satisfied: folium>=0.17.0 in /usr/local/lib/python3.11/dist-packages (from geemap) (0.19.4) Requirement already satisfied: geocoder in /usr/local/lib/python3.11/dist-packages (from geemap) (1.38.1) Requirement already satisfied: ipyevents in /usr/local/lib/python3.11/dist-packages (from geemap) (2.0.2) Requirement already satisfied: ipyfilechooser>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from geemap) (0.6.0) Requirement already satisfied: ipyleaflet>=0.19.2 in /usr/local/lib/python3.11/dist-packages (from geemap) (0.19.2) Requirement already satisfied: ipytree in /usr/local/lib/python3.11/dist-packages (from geemap) (0.2.2) Requirement already satisfied: matplotlib in /usr/local/lib/python3.11/dist-packages (from geemap) (3.10.0) Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from geemap) (1.26.4) Requirement already satisfied: pandas in /usr/local/lib/python3.11/dist-packages (from geemap) (2.2.2) Requirement already satisfied: plotly in /usr/local/lib/python3.11/dist-packages (from geemap) (5.24.1) Requirement already satisfied: pyperclip in /usr/local/lib/python3.11/dist-packages (from geemap) (1.9.0) Requirement already satisfied: pyshp>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from geemap) (2.3.1) Requirement already satisfied: python-box in /usr/local/lib/python3.11/dist-packages (from geemap) (7.3.2) Requirement already satisfied: scooby in /usr/local/lib/python3.11/dist-packages (from geemap) (0.10.0) Requirement already satisfied: google-cloud-storage in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (2.19.0) Requirement already satisfied: google-api-python-client>=1.12.1 in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (2.155.0) Requirement already satisfied: google-auth>=1.4.1 in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (2.27.0) Requirement already satisfied: google-auth-httplib2>=0.0.3 in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (0.2.0) Requirement already satisfied: httplib2<1dev,>=0.9.2 in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (0.22.0) Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from earthengine-api>=1.0.0->geemap) (2.32.3) Requirement already satisfied: branca>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from folium>=0.17.0->geemap) (0.8.1) Requirement already satisfied: jinja2>=2.9 in /usr/local/lib/python3.11/dist-packages (from folium>=0.17.0->geemap) (3.1.5) Requirement already satisfied: xyzservices in /usr/local/lib/python3.11/dist-packages (from folium>=0.17.0->geemap) (2025.1.0) Requirement already satisfied: ipywidgets in /usr/local/lib/python3.11/dist-packages (from ipyfilechooser>=0.6.0->geemap) (7.7.1) Requirement already satisfied: jupyter-leaflet<0.20,>=0.19 in /usr/local/lib/python3.11/dist-packages (from ipyleaflet>=0.19.2->geemap) (0.19.2) Requirement already satisfied: traittypes<3,>=0.2.1 in /usr/local/lib/python3.11/dist-packages (from ipyleaflet>=0.19.2->geemap) (0.2.1) Requirement already satisfied: traitlets>=4.3.0 in /usr/local/lib/python3.11/dist-packages (from bqplot->geemap) (5.7.1) Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas->geemap) (2.8.2) Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas->geemap) (2024.2) Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas->geemap) (2025.1) Requirement already satisfied: click in /usr/local/lib/python3.11/dist-packages (from geocoder->geemap) (8.1.8) Requirement already satisfied: future in /usr/local/lib/python3.11/dist-packages (from geocoder->geemap) (1.0.0) Requirement already satisfied: ratelim in /usr/local/lib/python3.11/dist-packages (from geocoder->geemap) (0.1.6) Requirement already satisfied: six in /usr/local/lib/python3.11/dist-packages (from geocoder->geemap) (1.17.0) Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (1.3.1) Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (4.55.7) Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (1.4.8) Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (24.2) Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (11.1.0) Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib->geemap) (3.2.1) Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.11/dist-packages (from plotly->geemap) (9.0.0) Requirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5 in /usr/local/lib/python3.11/dist-packages (from google-api-python-client>=1.12.1->earthengine-api>=1.0.0->geemap) (2.19.2) Requirement already satisfied: uritemplate<5,>=3.0.1 in /usr/local/lib/python3.11/dist-packages (from google-api-python-client>=1.12.1->earthengine-api>=1.0.0->geemap) (4.1.1) Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.11/dist-packages (from google-auth>=1.4.1->earthengine-api>=1.0.0->geemap) (5.5.1) Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.11/dist-packages (from google-auth>=1.4.1->earthengine-api>=1.0.0->geemap) (0.4.1) Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.11/dist-packages (from google-auth>=1.4.1->earthengine-api>=1.0.0->geemap) (4.9) Requirement already satisfied: ipykernel>=4.5.1 in /usr/local/lib/python3.11/dist-packages (from ipywidgets->ipyfilechooser>=0.6.0->geemap) (5.5.6) Requirement already satisfied: ipython-genutils~=0.2.0 in /usr/local/lib/python3.11/dist-packages (from ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.2.0) Requirement already satisfied: widgetsnbextension~=3.6.0 in /usr/local/lib/python3.11/dist-packages (from ipywidgets->ipyfilechooser>=0.6.0->geemap) (3.6.10) Requirement already satisfied: ipython>=4.0.0 in /usr/local/lib/python3.11/dist-packages (from ipywidgets->ipyfilechooser>=0.6.0->geemap) (7.34.0) Requirement already satisfied: jupyterlab-widgets>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from ipywidgets->ipyfilechooser>=0.6.0->geemap) (3.0.13) Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2>=2.9->folium>=0.17.0->geemap) (3.0.2) Requirement already satisfied: google-cloud-core<3.0dev,>=2.3.0 in /usr/local/lib/python3.11/dist-packages (from google-cloud-storage->earthengine-api>=1.0.0->geemap) (2.4.1) Requirement already satisfied: google-resumable-media>=2.7.2 in /usr/local/lib/python3.11/dist-packages (from google-cloud-storage->earthengine-api>=1.0.0->geemap) (2.7.2) Requirement already satisfied: google-crc32c<2.0dev,>=1.0 in /usr/local/lib/python3.11/dist-packages (from google-cloud-storage->earthengine-api>=1.0.0->geemap) (1.6.0) Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests->earthengine-api>=1.0.0->geemap) (3.4.1) Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests->earthengine-api>=1.0.0->geemap) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->earthengine-api>=1.0.0->geemap) (2.3.0) Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests->earthengine-api>=1.0.0->geemap) (2024.12.14) Requirement already satisfied: decorator in /usr/local/lib/python3.11/dist-packages (from ratelim->geocoder->geemap) (4.4.2) Requirement already satisfied: googleapis-common-protos<2.0.dev0,>=1.56.2 in /usr/local/lib/python3.11/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5->google-api-python-client>=1.12.1->earthengine-api>=1.0.0->geemap) (1.66.0) Requirement already satisfied: protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0.dev0,>=3.19.5 in /usr/local/lib/python3.11/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5->google-api-python-client>=1.12.1->earthengine-api>=1.0.0->geemap) (4.25.6) Requirement already satisfied: proto-plus<2.0.0dev,>=1.22.3 in /usr/local/lib/python3.11/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5->google-api-python-client>=1.12.1->earthengine-api>=1.0.0->geemap) (1.26.0) Requirement already satisfied: jupyter-client in /usr/local/lib/python3.11/dist-packages (from ipykernel>=4.5.1->ipywidgets->ipyfilechooser>=0.6.0->geemap) (6.1.12) Requirement already satisfied: tornado>=4.2 in /usr/local/lib/python3.11/dist-packages (from ipykernel>=4.5.1->ipywidgets->ipyfilechooser>=0.6.0->geemap) (6.4.2) Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (75.1.0) Requirement already satisfied: jedi>=0.16 in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.19.2) Requirement already satisfied: pickleshare in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.7.5) Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (3.0.50) Requirement already satisfied: pygments in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (2.18.0) Requirement already satisfied: backcall in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.2.0) Requirement already satisfied: matplotlib-inline in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.1.7) Requirement already satisfied: pexpect>4.3 in /usr/local/lib/python3.11/dist-packages (from ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (4.9.0) Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /usr/local/lib/python3.11/dist-packages (from pyasn1-modules>=0.2.1->google-auth>=1.4.1->earthengine-api>=1.0.0->geemap) (0.6.1) Requirement already satisfied: notebook>=4.4.1 in /usr/local/lib/python3.11/dist-packages (from widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (6.5.5) Requirement already satisfied: parso<0.9.0,>=0.8.4 in /usr/local/lib/python3.11/dist-packages (from jedi>=0.16->ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.8.4) Requirement already satisfied: pyzmq<25,>=17 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (24.0.1) Requirement already satisfied: argon2-cffi in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (23.1.0) Requirement already satisfied: jupyter-core>=4.6.1 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (5.7.2) Requirement already satisfied: nbformat in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (5.10.4) Requirement already satisfied: nbconvert>=5 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (7.16.6) Requirement already satisfied: nest-asyncio>=1.5 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.6.0) Requirement already satisfied: Send2Trash>=1.8.0 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.8.3) Requirement already satisfied: terminado>=0.8.3 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.18.1) Requirement already satisfied: prometheus-client in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.21.1) Requirement already satisfied: nbclassic>=0.4.7 in /usr/local/lib/python3.11/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.2.0) Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.11/dist-packages (from pexpect>4.3->ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.7.0) Requirement already satisfied: wcwidth in /usr/local/lib/python3.11/dist-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=4.0.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.2.13) Requirement already satisfied: platformdirs>=2.5 in /usr/local/lib/python3.11/dist-packages (from jupyter-core>=4.6.1->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (4.3.6) Requirement already satisfied: notebook-shim>=0.2.3 in /usr/local/lib/python3.11/dist-packages (from nbclassic>=0.4.7->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.2.4) Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (4.12.3) Requirement already satisfied: bleach!=5.0.0 in /usr/local/lib/python3.11/dist-packages (from bleach[css]!=5.0.0->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (6.2.0) Requirement already satisfied: defusedxml in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.7.1) Requirement already satisfied: jupyterlab-pygments in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.3.0) Requirement already satisfied: mistune<4,>=2.0.3 in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (3.1.1) Requirement already satisfied: nbclient>=0.5.0 in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.10.2) Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.11/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.5.1) Requirement already satisfied: fastjsonschema>=2.15 in /usr/local/lib/python3.11/dist-packages (from nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (2.21.1) Requirement already satisfied: jsonschema>=2.6 in /usr/local/lib/python3.11/dist-packages (from nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (4.23.0) Requirement already satisfied: argon2-cffi-bindings in /usr/local/lib/python3.11/dist-packages (from argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (21.2.0) Requirement already satisfied: webencodings in /usr/local/lib/python3.11/dist-packages (from bleach!=5.0.0->bleach[css]!=5.0.0->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.5.1) Requirement already satisfied: tinycss2<1.5,>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from bleach[css]!=5.0.0->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.4.0) Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.11/dist-packages (from jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (25.1.0) Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.11/dist-packages (from jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (2024.10.1) Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.11/dist-packages (from jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.36.2) Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.11/dist-packages (from jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (0.22.3) Requirement already satisfied: jupyter-server<3,>=1.8 in /usr/local/lib/python3.11/dist-packages (from notebook-shim>=0.2.3->nbclassic>=0.4.7->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.24.0) Requirement already satisfied: cffi>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.17.1) Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (2.6) Requirement already satisfied: pycparser in /usr/local/lib/python3.11/dist-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (2.22) Requirement already satisfied: anyio<4,>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from jupyter-server<3,>=1.8->notebook-shim>=0.2.3->nbclassic>=0.4.7->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (3.7.1) Requirement already satisfied: websocket-client in /usr/local/lib/python3.11/dist-packages (from jupyter-server<3,>=1.8->notebook-shim>=0.2.3->nbclassic>=0.4.7->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.8.0) Requirement already satisfied: typing-extensions>=4.4.0 in /usr/local/lib/python3.11/dist-packages (from referencing>=0.28.4->jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (4.12.2) Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.11/dist-packages (from anyio<4,>=3.1.0->jupyter-server<3,>=1.8->notebook-shim>=0.2.3->nbclassic>=0.4.7->notebook>=4.4.1->widgetsnbextension~=3.6.0->ipywidgets->ipyfilechooser>=0.6.0->geemap) (1.3.1)
Now we're ready to begin working with our variety trial data!
from google.colab import drive #mounting Gdrive to be able to access data files
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import pandas as pd
import geopandas as gpd
# loading yields and georeferences
file_id = '1UDz_P75VYZzxiy5mbJWC2tpUBUYvB6Kd' # Variety trial data from Gdrive link
link = f'https://drive.google.com/uc?id={file_id}'
yields = pd.read_csv(link)
yields
logYield | Management | PlantingDate | AnthesisDate | ASI | DaysToSilk | vargroup | yrcode | sitecode | gs150.gdd | ... | gs150.tmin | tmin2 | presilk.tmax2 | silk.tmax2 | postsilk.tmax2 | presilk.gdd830 | silk.gdd830 | postsilk.gdd830 | presilk.gdd30 | silk.gdd30 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.116713 | Optimal | 20031205 | 64.7 | 2.0 | 66.7 | EPOP | 2004 | 1041 | 2057.781964 | ... | 15.750548 | 248.079765 | 803.539520 | 762.833373 | 762.645584 | 742.054326 | 281.274711 | 1023.444977 | 1.125855 | 0.020164 |
1 | 1.859403 | Optimal | 20031205 | 65.2 | 2.0 | 67.2 | ILPO | 2004 | 1041 | 2057.781964 | ... | 15.750548 | 248.079765 | 799.009864 | 767.489960 | 765.981249 | 755.186385 | 282.395275 | 1009.192354 | 1.125855 | 0.020164 |
2 | 1.662258 | Optimal | 20031205 | 69.9 | 2.5 | 72.4 | EIHY | 2004 | 1041 | 2057.781964 | ... | 15.750548 | 248.079765 | 793.657903 | 764.914204 | 779.044136 | 821.639768 | 291.092979 | 934.041267 | 1.125855 | 0.020164 |
3 | 2.094564 | Optimal | 20031205 | 67.8 | 2.3 | 70.1 | ILHY | 2004 | 1041 | 2057.781964 | ... | 15.750548 | 248.079765 | 794.168510 | 772.895256 | 773.155186 | 792.787401 | 289.850278 | 964.136335 | 1.125855 | 0.020164 |
4 | 1.950514 | Optimal | 20031205 | 68.0 | 3.4 | 71.4 | ILHY | 2004 | 1041 | 2057.781964 | ... | 15.750548 | 248.079765 | 793.751548 | 770.235062 | 775.340270 | 807.085180 | 291.329302 | 948.359532 | 1.125855 | 0.020164 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
26137 | -1.044977 | Optimal | 20070118 | NaN | NaN | NaN | EPOP | 2007 | 731 | 2753.377492 | ... | 22.276206 | 496.229341 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26138 | -0.868930 | Optimal | 20070118 | NaN | NaN | NaN | EPOP | 2007 | 731 | 2753.377492 | ... | 22.276206 | 496.229341 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26139 | -0.621571 | Optimal | 20070118 | NaN | NaN | NaN | EPOP | 2007 | 731 | 2753.377492 | ... | 22.276206 | 496.229341 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26140 | -0.696954 | Optimal | 20070118 | NaN | NaN | NaN | EPOP | 2007 | 731 | 2753.377492 | ... | 22.276206 | 496.229341 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26141 | -1.679324 | Optimal | 20070118 | NaN | NaN | NaN | EPOP | 2007 | 731 | 2753.377492 | ... | 22.276206 | 496.229341 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26142 rows × 37 columns
file_id = '1fzOZbo6DMraeNVsri8FmI1R2No2oMWR3' # locations from Gdrive link
link = f'https://drive.google.com/uc?id={file_id}'
latlon = pd.read_csv(link)
latlon
LocationID | Country | Location | Region | Latitude | Longitude | ElevM | |
---|---|---|---|---|---|---|---|
0 | 5 | ANGOLA | ANGOLA | Eastern and Southern Africa | 0.00 | 0.00 | 0 |
1 | 6 | ANGOLA | CABINDA | Eastern and Southern Africa | -5.60 | 12.20 | 22 |
2 | 7 | ANGOLA | CACUSO | Eastern and Southern Africa | -9.42 | 15.75 | 1067 |
3 | 8 | ANGOLA | CELA | Eastern and Southern Africa | -11.42 | 15.12 | 1305 |
4 | 9 | ANGOLA | CHIANGA | Eastern and Southern Africa | -12.73 | 15.83 | 1693 |
... | ... | ... | ... | ... | ... | ... | ... |
180 | 1065 | ZIMBABWE | RATTRAY FARM | Eastern and Southern Africa | -17.70 | 31.20 | 1403 |
181 | 1066 | ZIMBABWE | RUWA | Eastern and Southern Africa | -17.89 | 31.24 | 1554 |
182 | 1067 | ZIMBABWE | SAVE VALLEY | Eastern and Southern Africa | -20.35 | 32.33 | 443 |
183 | 1068 | ZIMBABWE | SHAMVA ZIM | Eastern and Southern Africa | -17.32 | 31.57 | 1106 |
184 | 1069 | ZIMBABWE | ZIMBABWE | Eastern and Southern Africa | 0.00 | 0.00 | 0 |
185 rows × 7 columns
# Filter out locations with missing latlon
latlon = latlon.loc[(latlon['Latitude'])!=0|(latlon['Longitude']!=0)].reset_index()
latlon
index | LocationID | Country | Location | Region | Latitude | Longitude | ElevM | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 6 | ANGOLA | CABINDA | Eastern and Southern Africa | -5.60 | 12.20 | 22 |
1 | 2 | 7 | ANGOLA | CACUSO | Eastern and Southern Africa | -9.42 | 15.75 | 1067 |
2 | 3 | 8 | ANGOLA | CELA | Eastern and Southern Africa | -11.42 | 15.12 | 1305 |
3 | 4 | 9 | ANGOLA | CHIANGA | Eastern and Southern Africa | -12.73 | 15.83 | 1693 |
4 | 5 | 10 | ANGOLA | HUMPATA | Eastern and Southern Africa | -15.03 | 13.43 | 1890 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | 179 | 1064 | ZIMBABWE | RATTRAY ARNOLD | Eastern and Southern Africa | -17.67 | 31.17 | 1458 |
146 | 180 | 1065 | ZIMBABWE | RATTRAY FARM | Eastern and Southern Africa | -17.70 | 31.20 | 1403 |
147 | 181 | 1066 | ZIMBABWE | RUWA | Eastern and Southern Africa | -17.89 | 31.24 | 1554 |
148 | 182 | 1067 | ZIMBABWE | SAVE VALLEY | Eastern and Southern Africa | -20.35 | 32.33 | 443 |
149 | 183 | 1068 | ZIMBABWE | SHAMVA ZIM | Eastern and Southern Africa | -17.32 | 31.57 | 1106 |
150 rows × 8 columns
import shapely.geometry as shg
print(f"First latitude: {latlon['Latitude'].iloc[0]}")
print(f"First longitude: {latlon['Longitude'].iloc[0]}")
lat = latlon['Latitude'].iloc[0]
lon = latlon['Longitude'].iloc[0]
shg.Point(lon,lat)
First latitude: -5.6 First longitude: 12.2
# Make geometry and turn it into a GeoDataFrame
import shapely.geometry as shg
latlon['geometry'] = latlon.apply(lambda x: shg.Point(x['Longitude'],x['Latitude']),axis=1)
latlon = gpd.GeoDataFrame(latlon,geometry='geometry',crs="EPSG:4326")
latlon.plot('Country')
<Axes: >
latlon
index | LocationID | Country | Location | Region | Latitude | Longitude | ElevM | geometry | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 6 | ANGOLA | CABINDA | Eastern and Southern Africa | -5.60 | 12.20 | 22 | POINT (12.2 -5.6) |
1 | 2 | 7 | ANGOLA | CACUSO | Eastern and Southern Africa | -9.42 | 15.75 | 1067 | POINT (15.75 -9.42) |
2 | 3 | 8 | ANGOLA | CELA | Eastern and Southern Africa | -11.42 | 15.12 | 1305 | POINT (15.12 -11.42) |
3 | 4 | 9 | ANGOLA | CHIANGA | Eastern and Southern Africa | -12.73 | 15.83 | 1693 | POINT (15.83 -12.73) |
4 | 5 | 10 | ANGOLA | HUMPATA | Eastern and Southern Africa | -15.03 | 13.43 | 1890 | POINT (13.43 -15.03) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | 179 | 1064 | ZIMBABWE | RATTRAY ARNOLD | Eastern and Southern Africa | -17.67 | 31.17 | 1458 | POINT (31.17 -17.67) |
146 | 180 | 1065 | ZIMBABWE | RATTRAY FARM | Eastern and Southern Africa | -17.70 | 31.20 | 1403 | POINT (31.2 -17.7) |
147 | 181 | 1066 | ZIMBABWE | RUWA | Eastern and Southern Africa | -17.89 | 31.24 | 1554 | POINT (31.24 -17.89) |
148 | 182 | 1067 | ZIMBABWE | SAVE VALLEY | Eastern and Southern Africa | -20.35 | 32.33 | 443 | POINT (32.33 -20.35) |
149 | 183 | 1068 | ZIMBABWE | SHAMVA ZIM | Eastern and Southern Africa | -17.32 | 31.57 | 1106 | POINT (31.57 -17.32) |
150 rows × 9 columns
Visualizing Geographic Areas
We can visualize the locations using Google Earth Engine's mapping capabilities. If you're using Google Colab or other setups where you're running into errors with standard geemap (for example, Jupyter notebooks in VS Code using Windows Subsystem for Linux), you'll need to use geemap.foliumap.Map(). Otherwise, you can use geemap.Map(). \
Our first step will be to Authenticate the connection to Google Earth Engine. The first time you run this code, you'll need to Authenticate using the authentication code you're provided when you are granted GEE access. For more detailed instruction, see this Introduction to Google Earth Engine in the World Bank's Nighttime Lights tutorial. \
Finally, we'll convert our GeoPandas polygons to geographic types that geemap will recognize, which we can do easily using geemap.geopandas_to_ee(). (For more information on converting between geemap and geopandas, see this documentation.) Once we've converted our plots, we can add them to a basemap from geemap to visualize the areas we'll be analyzing. \
Note: As of spring 2022, there are problems using the Earth Engine package with Python 3.9. If you run into an error, for example ModuleNotFoundError: No module named 'StringIO', it may mean that you need to use a different version of Python. I've found 3.7 to work with all the packages required for this module. If you are following the World Bank OpenNightLights tutorial to create a conda environment, this will require creating the environment with conda create --name onl python=3.7 rather than python=3.9. \
import geemap.foliumap, ee
try:
ee.Initialize()
except Exception as e:
ee.Authenticate()
ee.Initialize(project='ee-nltryan')
WARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
site_a = latlon[['LocationID','geometry']].iloc[0:1]
site_b = latlon[['LocationID','geometry']].iloc[50:51]
site_c = latlon[['LocationID','geometry']].iloc[100:101]
ee_site_a = geemap.geopandas_to_ee(site_a).geometry()
center_lat = latlon['Latitude'].iloc[0]
center_lon = latlon['Longitude'].iloc[0]
map_site_a = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=17)
map_site_a.add_basemap('SATELLITE')
map_site_a.addLayer(ee_site_a,vis_params={'color':'a4c639'},opacity=0.8)
map_site_a
ee_site_b = geemap.geopandas_to_ee(site_b).geometry()
center_lat = latlon['Latitude'].iloc[50]
center_lon = latlon['Longitude'].iloc[50]
map_site_b = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=17)
map_site_b.add_basemap('SATELLITE')
map_site_b.addLayer(ee_site_b,vis_params={'color':'a4c639'},opacity=0.8)
map_site_b
ee_site_c = geemap.geopandas_to_ee(site_c).geometry()
center_lat = latlon['Latitude'].iloc[100]
center_lon = latlon['Longitude'].iloc[100]
map_site_c = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=17)
map_site_c.add_basemap('SATELLITE')
map_site_c.addLayer(ee_site_c,vis_params={'color':'a4c639'},opacity=0.8)
map_site_c
# Merge locations onto yields
yields = pd.merge(yields,latlon,
left_on='sitecode',right_on='LocationID')
yields = yields.drop('index',axis=1).reset_index()
yields
index | logYield | Management | PlantingDate | AnthesisDate | ASI | DaysToSilk | vargroup | yrcode | sitecode | ... | presilk.gdd30 | silk.gdd30 | LocationID | Country | Location | Region | Latitude | Longitude | ElevM | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2.116713 | Optimal | 20031205 | 64.7 | 2.0 | 66.7 | EPOP | 2004 | 1041 | ... | 1.125855 | 0.020164 | 1041 | ZAMBIA | ZAMSEED, FARM | Eastern and Southern Africa | -14.20 | 28.40 | 1178 | POINT (28.4 -14.2) |
1 | 1 | 1.859403 | Optimal | 20031205 | 65.2 | 2.0 | 67.2 | ILPO | 2004 | 1041 | ... | 1.125855 | 0.020164 | 1041 | ZAMBIA | ZAMSEED, FARM | Eastern and Southern Africa | -14.20 | 28.40 | 1178 | POINT (28.4 -14.2) |
2 | 2 | 1.662258 | Optimal | 20031205 | 69.9 | 2.5 | 72.4 | EIHY | 2004 | 1041 | ... | 1.125855 | 0.020164 | 1041 | ZAMBIA | ZAMSEED, FARM | Eastern and Southern Africa | -14.20 | 28.40 | 1178 | POINT (28.4 -14.2) |
3 | 3 | 2.094564 | Optimal | 20031205 | 67.8 | 2.3 | 70.1 | ILHY | 2004 | 1041 | ... | 1.125855 | 0.020164 | 1041 | ZAMBIA | ZAMSEED, FARM | Eastern and Southern Africa | -14.20 | 28.40 | 1178 | POINT (28.4 -14.2) |
4 | 4 | 1.950514 | Optimal | 20031205 | 68.0 | 3.4 | 71.4 | ILHY | 2004 | 1041 | ... | 1.125855 | 0.020164 | 1041 | ZAMBIA | ZAMSEED, FARM | Eastern and Southern Africa | -14.20 | 28.40 | 1178 | POINT (28.4 -14.2) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
26137 | 26137 | -1.044977 | Optimal | 20070118 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | NaN | NaN | 731 | MOZAMBIQUE | MOCUBA | Eastern and Southern Africa | -16.84 | 38.26 | 66 | POINT (38.26 -16.84) |
26138 | 26138 | -0.868930 | Optimal | 20070118 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | NaN | NaN | 731 | MOZAMBIQUE | MOCUBA | Eastern and Southern Africa | -16.84 | 38.26 | 66 | POINT (38.26 -16.84) |
26139 | 26139 | -0.621571 | Optimal | 20070118 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | NaN | NaN | 731 | MOZAMBIQUE | MOCUBA | Eastern and Southern Africa | -16.84 | 38.26 | 66 | POINT (38.26 -16.84) |
26140 | 26140 | -0.696954 | Optimal | 20070118 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | NaN | NaN | 731 | MOZAMBIQUE | MOCUBA | Eastern and Southern Africa | -16.84 | 38.26 | 66 | POINT (38.26 -16.84) |
26141 | 26141 | -1.679324 | Optimal | 20070118 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | NaN | NaN | 731 | MOZAMBIQUE | MOCUBA | Eastern and Southern Africa | -16.84 | 38.26 | 66 | POINT (38.26 -16.84) |
26142 rows × 46 columns
This dataframe can be easily joined by PlantingDate
and sitecode
to the other sources of data we'll need for our model.
EO Data
Landsat Imagery
To obtain our Landsat imagery for each of the trials, we will use Google Earth Engine's API to obtain the Landsat 5 TM and Landsat 7 ETHM+ Tier 1 ImageCollections. The Landsat imagery needs to be pre-processed to mask pixels with clouds and shadows so these do not influence our predictions of crop type. Masking pixels in an image makes the pixels transparent, so they will not be used in any of the operations or analysis performed. Masks are Images themselves that act as a guide for which pixels should be kept and which pixels should be made transparent. These mask Images can then be applied to other Images which cover the same area. Once applied, any pixels which have mask values which below a threshold (eg, mask values of 0) are made transparent on the image being masked. For more information and an example of masking, see Google Earth Engine's Masking tutorial. We will make use of Landsat's Quality band, which can be used to construct measures of cloud and shadow likelihood. This enables us to identify and mask (make transparent) the pixels which don't represent land. \
First, though, we need to figure our where and when we'll look for imagery. To do so, we'll need to upload the geographic and temporal data related to the trials to Google Earth Engine. We'll look at a 90 day window after the planting date, as this is about how long it takes for maize to mature, and a 5km buffer around the GPS position. This information will be uploaded to GEE as a FeatureCollection
, Earth Engine's analog to a row of a GeoPandas GeoDataFrame
.
import datetime as dt
yields['PlantingDate'] = yields['PlantingDate'].apply(lambda x: dt.date(int(str(x)[0:4]),
int(str(x)[4:6]),
int(str(x)[6:])))
yields['EndDate'] = yields['PlantingDate'].apply(lambda x: x+dt.timedelta(days=90))
yields['PlantingDate'] = yields['PlantingDate'].apply(lambda x: x.isoformat())
yields['EndDate'] = yields['EndDate'].apply(lambda x: x.isoformat())
# Only need to get data for unique trials
yield_site_dates = yields[['sitecode','PlantingDate','EndDate','geometry']].drop_duplicates()
yield_site_dates
sitecode | PlantingDate | EndDate | geometry | |
---|---|---|---|---|
0 | 1041 | 2003-12-05 | 2004-03-04 | POINT (28.4 -14.2) |
121 | 1041 | 2004-12-27 | 2005-03-27 | POINT (28.4 -14.2) |
211 | 1041 | 2002-12-16 | 2003-03-16 | POINT (28.4 -14.2) |
297 | 1041 | 2006-01-04 | 2006-04-04 | POINT (28.4 -14.2) |
332 | 1041 | 2006-12-12 | 2007-03-12 | POINT (28.4 -14.2) |
... | ... | ... | ... | ... |
25937 | 729 | 2007-02-01 | 2007-05-02 | POINT (38.87 -13.24) |
26025 | 520 | 2006-12-10 | 2007-03-10 | POINT (27.78 -29.55) |
26027 | 520 | 2006-12-14 | 2007-03-14 | POINT (27.78 -29.55) |
26067 | 737 | 2006-12-19 | 2007-03-19 | POINT (34.75 -24.49) |
26117 | 731 | 2007-01-18 | 2007-04-18 | POINT (38.26 -16.84) |
505 rows × 4 columns
yield_site_dates['eeGeom'] = yield_site_dates['geometry'].apply(lambda x: ee.Geometry.Point(x.x,x.y)) # EE GPS data
yield_site_dates['eeFeature'] = yield_site_dates.apply(lambda x: ee.Feature(x['eeGeom'].buffer(5_000), # Buffer out 5km
{'sitecode':x['sitecode'],'PlantingDate':x['PlantingDate'],'EndDate':x['EndDate']}), # Include sitecode and PlantingDate for merging later
axis=1)
ee_site_dates = ee.FeatureCollection(yield_site_dates['eeFeature'].tolist())
# Visualize the sites
center_lat = -12.042079
center_lon = 25.756629
map_sites = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=4)
map_sites.add_basemap('SATELLITE')
map_sites.addLayer(ee_site_dates,vis_params={'color':'a4c639'},opacity=0.8)
map_sites
With these areas to use as our geometries, we can now begin to pull in the Landsat data, mask the clouds, and calculate the indices. I wrote the following functions to work with Landsat data and have used them in multiple projects. \
def cloud_mask(img):
'''
Function to mask cloudy pixels in a ls image using QA_PIXEL bits. Note: the relevant bits of QA_PIXEL are the same across satellites.
Inputs:
- img (ee.Image): the ls image to mask
Outputs:
- mask_img (ee.Image): img with cloudy pixels masked
'''
qa = img.select(['QA_PIXEL'])
mask = ee.Image.constant(1).subtract((qa.bitwiseAnd(1<<3).And(qa.bitwiseAnd(1<<9))).Or(qa.bitwiseAnd(1<<4).And(qa.bitwiseAnd(1<<10))))
return(img.updateMask(mask))
def calc_ndvi(img,red,nir):
'''
Function to calculate NDVI in an image
Inputs:
- img (ee.Image): The ls image to calculate on
- red (str): red band name
- nir (str): nir band name
Outputs:
- ndvi (ee.Image): image with one ndvi band named 'ndvi'
'''
# NDVI = (NIR-Red) / (NIR+Red)
ndvi = ee.Image(img.normalizedDifference([nir,red]))
ndvi = ndvi.rename('ndvi')
return(ndvi.copyProperties(img))
def scale_facts(img):
opticalBands = img.select('SR_B.').multiply(0.0000275).add(-0.2)
thermalBands = img.select('ST_B.*').multiply(0.00341802).add(149.0)
return(img.addBands(opticalBands,overwrite=True).addBands(thermalBands,overwrite=True))
def calc_mean(feature,img,scale):
img = ee.Image(img)
feature_mean = img.reduceRegion(
ee.Reducer.mean(),
feature.geometry(),
scale,
maxPixels=15_000_000,
bestEffort=True
)
updated = feature.set(feature_mean)
return(updated)
def append_imgs(current,prev):
next_img = ee.Image(prev).addBands(current)
return(next_img)
Our workflow will proceed as follows. For each trial, we will restrict attention to the set of Landsat 4, 5, and 7 images that intersect the 5km radius circle centered on the provided GPS coordinates during the 90 days after the planting date. For each of these images, we will then mask clouds and shadows, mask non-agricultural land using a landuse product developed by the European Space Agency, calculate NDVI, and record the value as a new column for the trial along with the date the image was taken. \
Since trials may have different numbers of images, this will end up being a ragged array, with some locations having many more images than others.
We'll first get the land use image that we'll use to mask non-agricultural pixels.
agland = ee.ImageCollection("ESA/WorldCover/v200").first().eq(40).rename('ag')
Next, we'll need to add the date of capture information to images, apply relevant scaling factors, mask clouds and non-agricultural land, and calculate NDVI. We'll use a function to do this and apply it to each image. \
def process_ls_img(img,red,nir,sat):
sat_band = ee.Image.constant(sat).rename("sat")
date = img.get('DATE_ACQUIRED')
date = ee.Date(date).millis()
date = ee.Image.constant(date).rename('d')
img = scale_facts(img)
img = cloud_mask(img)
img = img.updateMask(agland)
img_ndvi = calc_ndvi(img,'SR_B3','SR_B4')
img_ndvi = ee.Image(img_ndvi)
img = img_ndvi.addBands(date).addBands(sat_band)
return(img)
We'll apply this function within a function that will use the trial location and timing information to restrict to the relevant set of images.
def site_ndvis(site_date):
planting_date = ee.String(site_date.get('PlantingDate'))
end_date = ee.String(site_date.get('EndDate'))
ls5 = ee.ImageCollection("LANDSAT/LT05/C02/T1_L2").filterBounds(site_date.geometry()).filterDate(planting_date,end_date)
ls7 = ee.ImageCollection("LANDSAT/LE07/C02/T1_L2").filterBounds(site_date.geometry()).filterDate(planting_date,end_date)
ls5 = ee.Algorithms.If(ls5.size().gt(0),
ls5.map(lambda x: process_ls_img(x,'SR_B3','SR_B4',5)),
ee.ImageCollection([]))
ls7 = ee.Algorithms.If(ls7.size().gt(0),
ls7.map(lambda x: process_ls_img(x,'SR_B3','SR_B4',7)),
ee.ImageCollection([]))
all_ls = ee.ImageCollection(ls5).merge(ee.ImageCollection(ls7))
all_ls_size = all_ls.size()
all_ls = ee.Algorithms.If(all_ls_size.gt(0),
all_ls.iterate(append_imgs,ee.Image()),
all_ls)
site_data = ee.Algorithms.If(all_ls_size.gt(0),
calc_mean(site_date,all_ls,30),
site_date)
return(site_data)
We'll now apply this function to every image in our S2 ImageCollection using the function imageCollection.map(), which cycles through every image in an ImageCollection, applies the provided function to the image, and returns each processed image as a new ImageCollection.
ee_sites_ls = ee_site_dates.map(lambda x: site_ndvis(x))
Now, we can look into the FeatureCollection
we've created to ensure we have what we need.
ls_cols = ['sitecode','PlantingDate','ndvi','d','sat']
ls_cols = ls_cols + [f'ndvi_{i}' for i in range(1,18)]
ls_cols = ls_cols + [f'd_{i}' for i in range(1,18)]
ls_cols = ls_cols + [f'sat_{i}' for i in range(1,18)]
#sites_ls = geemap.ee_to_pandas(ee_sites_ls,col_names=ls_cols)
sites_ls = geemap.ee_to_df(ee_sites_ls,columns=ls_cols)
sites_ls
sitecode | PlantingDate | ndvi | d | sat | ndvi_1 | ndvi_2 | ndvi_3 | ndvi_4 | ndvi_5 | ... | sat_8 | sat_9 | sat_10 | sat_11 | sat_12 | sat_13 | sat_14 | sat_15 | sat_16 | sat_17 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1041 | 2003-12-05 | 0.639729 | 1.075853e+12 | 5.0 | 0.512278 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1041 | 2004-12-27 | 0.593166 | 1.107648e+12 | 5.0 | 0.596977 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1041 | 2002-12-16 | 0.542198 | 1.043366e+12 | 7.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1041 | 2006-01-04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1041 | 2006-12-12 | NaN | 1.169856e+12 | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
500 | 729 | 2007-02-01 | NaN | 1.172534e+12 | 5.0 | 0.703562 | 0.671129 | 0.673045 | 0.751673 | 0.691734 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
501 | 520 | 2006-12-10 | 0.347322 | 1.165882e+12 | 5.0 | 0.446907 | 0.514929 | 0.348674 | 0.440482 | 0.510405 | ... | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN |
502 | 520 | 2006-12-14 | 0.446907 | 1.167264e+12 | 5.0 | 0.514929 | 0.440482 | 0.510405 | 0.431683 | 0.381406 | ... | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN |
503 | 737 | 2006-12-19 | NaN | 1.171757e+12 | 5.0 | 0.718362 | 0.678643 | 0.757944 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
504 | 731 | 2007-01-18 | 0.548135 | 1.169770e+12 | 5.0 | 0.604690 | 0.626434 | 0.646203 | 0.598807 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
505 rows × 56 columns
# Clean up landsat data
from math import pi
import re
import datetime as dt
import numpy as np
sites_ls = sites_ls.rename(columns={'d':'d_0','sat':'sat_0','ndvi':'ndvi_0'})
ls_stack = sites_ls.set_index(['sitecode','PlantingDate']).stack().reset_index(-1).rename(columns={'level_2':'var'})
ls_stack['t'] = ls_stack['var'].apply(lambda x: re.search("(?<=_)[0-9]+",x)[0])
ls_stack['var'] = ls_stack['var'].apply(lambda x: re.search(".*(?=_)",x)[0])
ls_stack = ls_stack.reset_index().set_index(['sitecode','PlantingDate','t','var']).unstack()
def fix_cols(tup):
if tup[0]!=0:
return(tup[0])
else:
return(tup[1])
ls_stack = ls_stack.reset_index()
ls_stack.columns = [fix_cols(c) for c in ls_stack.columns]
ls_stack = ls_stack.dropna(subset='ndvi')
ls_stack['d'] = ls_stack['d'].apply(lambda x: dt.datetime.fromtimestamp(x / 1000.0, tz=dt.timezone.utc))
ls_stack['d'] = ls_stack['d'].apply(lambda x: x.date())
ls_stack['PlantingDate'] = ls_stack['PlantingDate'].apply(dt.date.fromisoformat)
ls_stack['t'] = ls_stack['d']-ls_stack['PlantingDate']
ls_stack['t'] = ls_stack['t'].apply(lambda x: x.days/90)
ls_stack
sitecode | PlantingDate | t | d | ndvi | sat | |
---|---|---|---|---|---|---|
0 | 6 | 2002-12-13 | 0.433333 | 2003-01-21 | 0.422388 | 7.0 |
1 | 6 | 2002-12-13 | 0.611111 | 2003-02-06 | 0.469667 | 7.0 |
3 | 6 | 2002-12-16 | 0.400000 | 2003-01-21 | 0.422388 | 7.0 |
4 | 6 | 2002-12-16 | 0.577778 | 2003-02-06 | 0.469667 | 7.0 |
6 | 6 | 2002-12-17 | 0.388889 | 2003-01-21 | 0.422388 | 7.0 |
... | ... | ... | ... | ... | ... | ... |
2352 | 1067 | 2006-05-20 | 0.788889 | 2006-07-30 | 0.289856 | 5.0 |
2353 | 1067 | 2006-05-20 | 0.088889 | 2006-05-28 | 0.331522 | 7.0 |
2354 | 1068 | 2004-12-24 | 0.611111 | 2005-02-17 | 0.679722 | 5.0 |
2355 | 1068 | 2004-12-24 | 0.166667 | 2005-01-08 | 0.576108 | 7.0 |
2356 | 1068 | 2004-12-24 | 0.344444 | 2005-01-24 | 0.457912 | 7.0 |
1911 rows × 6 columns
While these indices were created for us in the code, let's quickly go over calculation with image bands in case you need different indices or calculations. Complex expressions can be performed using the Image.expression()
method. To use this, you provide the equation and the bands to use the equation and apply this equation to every image using .map()
. For example, to calculate NDVI, we use the formula (NIR - RED) / (NIR + RED), where NIR is the 'NIR' band and RED is the 'RED' band.
Now that we've seen how everything works in code, let's head over to the Earth Engine code editor to get a sense of what's happening behind the scenes and how you can start to develop new projects using Earth Engine. \
This script has one piece to fill in at line 10. Look back at the code in this notebook to help figure out how to finish that line. Once you've completed the script, you'll be able to explore qualities of an image collection interactively and visualize a Landsat image. \
This next script has you obtain statistics over an area. You'll also get to see how Earth Engine combines data from many images into a single image. \
This final script has you practice doing "band math" and masking images. You'll see how you can perform calculations on images and combine information from multiple images.
Index Time Series and Crop Phenology
Since we've obtained Landsat imagery for our entire growing season duration, we can also visualize our data over time to view how NDVI changes. The change in index values over time can provide important information, including a crop's growth cycle and pattern, known as crop phenology. As a crop grows from seed to full plant, the values of the indices for that area will also change, allowing researchers to obtain detailed information about a crop's growth. \
The FeatureCollection we obtained is comprised of multiple satellite Images taken at different times and areas, and each image contains multiple bands - including those used to calculate NDVI, GCVI, and the other vegetation indices. \
We'll visualize the data two ways here: \
- We'll look at all NDVI values relative to their time in the 90 day window \
- We'll look at one particular NDVI series
ls_stack.plot.scatter(x='t',y='ndvi')
<Axes: xlabel='t', ylabel='ndvi'>
series_ex = ls_stack.loc[(ls_stack['sitecode']==1067)&(ls_stack['PlantingDate'].apply(str)=='2006-05-20')]
series_ex[['d','ndvi']].set_index('d').plot()
<Axes: xlabel='d'>
Here, it looks like NDVI generally increases, though there is quite a bit of noise. This probably has to do with spatial error in the reported location of the variety trial.
We've plotted the mean NDVI values for our sample trial over our growing season. The values are fairly choppy, especially with the number of images masked due to cloud cover, which can make it more difficult to determine the pattern over time. This is why smoothing is often applied to indices in order to use them in determining crop phenology. \
Deines et al. fit a harmonic regression model to the time series of NDVI, which means they include the time values within sine and cosine terms in their regression equation. For a brief introduction to using the sum of sine and cosine terms, known as a Fourier series, in functions representing periodic motion and to harmonic analysis, see this Britannica explainer. While other methods have been explored for accurately capturing the changes in crops over seasons, the use of harmonic regression has been found to be successful in representing crop behavior (Jin et al. 2019). \
For our use case with one harmonic pair, we'll be running the regression \
$NDVI_{it}=\alpha_i+\delta_i t+\beta^{cos}_icos(2\pi t)+\beta^{sin}_isin(2\pi t)+ɛ_{it}$
where $t$ is the fraction of the 90 day window that has passed at the time of observation. \
To actually implement this, we'll need to add the cosine and sine terms, then run a trial-specific linear regression. Because we have four terms to fit, we need at least four observations within the 90 day period to actually estimate all of the parameters of the model. \
import warnings
from sklearn.linear_model import LinearRegression
ls_stack['cos_t'] = (2*pi*ls_stack['t']).apply(np.cos)
ls_stack['sin_t'] = (2*pi*ls_stack['t']).apply(np.sin)
yield_site_dates['ls_int'] = np.nan
yield_site_dates['ls_t_coef'] = np.nan
yield_site_dates['ls_cos_coef'] = np.nan
yield_site_dates['ls_sin_coef'] = np.nan
for i in range(0,len(yield_site_dates.index)):
site_date = ls_stack.loc[(ls_stack['sitecode']==yield_site_dates['sitecode'].iloc[i])&(ls_stack['PlantingDate'].apply(str)==yield_site_dates['PlantingDate'].iloc[i])]
#print(len(site_date.index))
if len(site_date.index)>=4:
y=site_date['ndvi'].to_numpy()
x=site_date[['t','cos_t','sin_t']].to_numpy()
ols = LinearRegression()
ols.fit(x,y)
# Lots of value setting warnings from pandas that we'll ignore
with warnings.catch_warnings():
warnings.simplefilter("ignore")
yield_site_dates['ls_int'].iloc[i] = ols.intercept_
yield_site_dates['ls_t_coef'].iloc[i] = ols.coef_[0]
yield_site_dates['ls_cos_coef'].iloc[i] = ols.coef_[1]
yield_site_dates['ls_sin_coef'].iloc[i] = ols.coef_[2]
yield_site_dates
sitecode | PlantingDate | EndDate | geometry | eeGeom | eeFeature | ls_int | ls_t_coef | ls_cos_coef | ls_sin_coef | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1041 | 2003-12-05 | 2004-03-04 | POINT (28.4 -14.2) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN |
121 | 1041 | 2004-12-27 | 2005-03-27 | POINT (28.4 -14.2) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN |
211 | 1041 | 2002-12-16 | 2003-03-16 | POINT (28.4 -14.2) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN |
297 | 1041 | 2006-01-04 | 2006-04-04 | POINT (28.4 -14.2) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN |
332 | 1041 | 2006-12-12 | 2007-03-12 | POINT (28.4 -14.2) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
25937 | 729 | 2007-02-01 | 2007-05-02 | POINT (38.87 -13.24) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | -0.161036 | 1.316836 | -0.227792 | 0.159079 |
26025 | 520 | 2006-12-10 | 2007-03-10 | POINT (27.78 -29.55) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | 0.366722 | 0.230899 | -0.000952 | -0.008080 |
26027 | 520 | 2006-12-14 | 2007-03-14 | POINT (27.78 -29.55) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | 0.523215 | -0.050080 | 0.009552 | -0.091911 |
26067 | 737 | 2006-12-19 | 2007-03-19 | POINT (34.75 -24.49) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN |
26117 | 731 | 2007-01-18 | 2007-04-18 | POINT (38.26 -16.84) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | 0.578816 | 0.036213 | -0.016432 | -0.035037 |
505 rows × 10 columns
The coefficients that we've added as columns now summarize the trajectory of NDVI over the 90 day period using four values. We'll use these four values as inputs into our predictions later.
OPTIONAL: Plot Smoothed Index Time Series
As an optional step, let's now try to recreate our time series plot above using the smoothed values for NDVI. \
To do so, we'll make a dataframe of predicted values at many values of $t$, using the coefficients we just estimated. We'll then merge on the observed values of NDVI and compare the predictions to the truth.
ndvi_pred = pd.DataFrame({'t':np.linspace(0,1,100)})
ndvi_pred['cos_t'] = (2*pi*ls_stack['t']).apply(np.cos)
ndvi_pred['sin_t'] = (2*pi*ls_stack['t']).apply(np.sin)
int_coef = yield_site_dates.loc[(yield_site_dates['sitecode']==1067)&(yield_site_dates['PlantingDate'].apply(str)=='2006-05-20'),'ls_int'].iloc[0]
t_coef = yield_site_dates.loc[(yield_site_dates['sitecode']==1067)&(yield_site_dates['PlantingDate'].apply(str)=='2006-05-20'),'ls_t_coef'].iloc[0]
cos_coef = yield_site_dates.loc[(yield_site_dates['sitecode']==1067)&(yield_site_dates['PlantingDate'].apply(str)=='2006-05-20'),'ls_cos_coef'].iloc[0]
sin_coef = yield_site_dates.loc[(yield_site_dates['sitecode']==1067)&(yield_site_dates['PlantingDate'].apply(str)=='2006-05-20'),'ls_sin_coef'].iloc[0]
ndvi_pred['ndvi_hat'] = int_coef+t_coef*ndvi_pred['t']+cos_coef*ndvi_pred['cos_t']+sin_coef*ndvi_pred['sin_t']
ndvi_pred = pd.merge(ndvi_pred,
ls_stack.loc[(ls_stack['sitecode']==1067)&(ls_stack['PlantingDate'].apply(str)=='2006-05-20'),['t','ndvi']],
on='t',
how='outer')
ndvi_pred
t | cos_t | sin_t | ndvi_hat | ndvi | |
---|---|---|---|---|---|
0 | 0.000000 | -0.913545 | 0.406737 | 0.308813 | NaN |
1 | 0.010101 | -0.766044 | -0.642788 | 0.316720 | NaN |
2 | 0.020202 | NaN | NaN | NaN | NaN |
3 | 0.030303 | -0.809017 | 0.587785 | 0.310176 | NaN |
4 | 0.040404 | -0.882948 | -0.469472 | 0.316090 | NaN |
... | ... | ... | ... | ... | ... |
113 | 0.969697 | -0.848048 | 0.529919 | 0.355034 | NaN |
114 | 0.977778 | NaN | NaN | NaN | 0.420683 |
115 | 0.979798 | -0.809017 | -0.587785 | 0.362338 | NaN |
116 | 0.989899 | 0.173648 | -0.984808 | 0.374155 | NaN |
117 | 1.000000 | NaN | NaN | NaN | NaN |
118 rows × 5 columns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
ndvi_pred.loc[~ndvi_pred['ndvi_hat'].isna(),['t','ndvi_hat']].set_index('t').plot(ax=ax)
ndvi_pred.loc[~ndvi_pred['ndvi'].isna(),['t','ndvi']].plot.scatter(x='t',y='ndvi',ax=ax)
<Axes: xlabel='t', ylabel='ndvi'>
Shuttle Radar Topography Mission Data
To obtain the elevation, slope, and aspect information for our plots, we will use the Shuttle Radar Topography Mission Version 3.0, which we can find in Google Earth Engine as the Image ee.Image("USGS/SRTMGL1_003").
First, we'll set the variables to use in downloading the elevation data:
# define GEE id of SRTM V3 Image
srtm = ee.Image('USGS/SRTMGL1_003').updateMask(agland)
We can obtain the elevation itself using the srtm_id and selecting the band elevation. Then, we can use this elevation Image along with the ee.Terrain package to obtain the slope and aspect information.
elevation = srtm.select('elevation')
slope = ee.Terrain.slope(elevation).rename('slope')
aspect = ee.Terrain.aspect(elevation).rename('aspect')
We want to calculate the elevation, slope, and aspect for each plot of land and store the values for use in our model. This will again require the geographic boundaries of the plots to define the areas over which we'll calculate average elevation, slope, and aspect. We can use the same geemap sample plots created above to calculate average elevation, slope, and aspect with the reduceRegions() method. To begin, though, let's first demonstrate with a single Image and plot: we'll calculate the mean slope in plot c using reduceRegion() and the slope Image we just created.
# Calculate mean slope for area
slope_c = slope.reduceRegion(
reducer = ee.Reducer.mean(),
geometry = ee_site_c.buffer(1000),
scale = 30
)
mean_slope_c = slope_c.get('slope')
print('Mean slope for site c:',mean_slope_c.getInfo())
Mean slope for site c: 2.747879962126414
We can also visualize these topography characteristics on a map to get a better sense of what we're calculating. For example, if we'd like to view areas of higher slope, we can plot slope on top of a map of the areas we're interested in.
center_lat = yield_site_dates['geometry'].iloc[100].y
center_lon = yield_site_dates['geometry'].iloc[100].x
elev_map = geemap.foliumap.Map(center=[center_lat,center_lon], zoom=9)
elev_map.addLayer(slope, {min:0, max: 90}, 'slope')
elev_map
In this example, we can visualize areas of higher and lower slope as shades of grey. While this provides the information we need, often it can be helpful to display satellige imagery using a more extensive color palette to more easily distinguish the gradient of values. Here's an example of the elevation we obtained from SRTM using a white to red color palette.
center_lat = yield_site_dates['geometry'].iloc[100].y
center_lon = yield_site_dates['geometry'].iloc[100].x
map_plot_c = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=6)
terrain_params = {'min': 0, 'max': 1000,
'palette': [ '#ffffff','#ff0000']}
map_plot_c.addLayer(elevation, terrain_params, 'elevation')
map_plot_c
In order to obtain the elevation, slope, and aspect for all plots of interest for use in our model, we can use ee.Image.reduceRegion()
to perform the same calculations for the mean and .map()
to apply it to multiple areas. We'll use our ee_site_dates
feature collection for this purpose.
srtm_img = elevation.addBands(slope)
srtm_img = srtm_img.addBands(aspect)
ee_site_srtm = ee_site_dates.map(lambda x: calc_mean(x,srtm_img,30))
# srtm_df = geemap.ee_to_pandas(ee_site_srtm)
srtm_df = geemap.ee_to_df(ee_site_srtm)
srtm_df
EndDate | PlantingDate | aspect | elevation | sitecode | slope | |
---|---|---|---|---|---|---|
0 | 2004-03-04 | 2003-12-05 | 141.450640 | 1181.855266 | 1041 | 1.977909 |
1 | 2005-03-27 | 2004-12-27 | 141.450640 | 1181.855266 | 1041 | 1.977909 |
2 | 2003-03-16 | 2002-12-16 | 141.450640 | 1181.855266 | 1041 | 1.977909 |
3 | 2006-04-04 | 2006-01-04 | 141.450640 | 1181.855266 | 1041 | 1.977909 |
4 | 2007-03-12 | 2006-12-12 | 141.450640 | 1181.855266 | 1041 | 1.977909 |
... | ... | ... | ... | ... | ... | ... |
500 | 2007-05-02 | 2007-02-01 | 176.217421 | 498.452234 | 729 | 2.160150 |
501 | 2007-03-10 | 2006-12-10 | 174.417739 | 1994.903339 | 520 | 5.144013 |
502 | 2007-03-14 | 2006-12-14 | 174.417739 | 1994.903339 | 520 | 5.144013 |
503 | 2007-03-19 | 2006-12-19 | 156.100872 | 89.777128 | 737 | 2.426769 |
504 | 2007-04-18 | 2007-01-18 | 168.197365 | 73.112299 | 731 | 2.247579 |
505 rows × 6 columns
We can see that this DataFrame has values for mean elevation, slope, and aspect for each of our trials. These values can be merged on to our Landsat data to provide extra information useful for predicting yield.
Climate Data
For this demonstration, we will utilize ERA 5 Reanalysis data on daily precipitation and maximum temperature. These are available for direct download or through Google Earth Engine; here, we'll use Google Earth Engine to obtain mean values for the 5km radius circles centered on the provided GPS locations.
First, let's visualize precipitation for one trial.
# this is not a real plot from any survey data; it was randomly selected for demonstration
center_lat = yield_site_dates['geometry'].iloc[0].y
center_lon = yield_site_dates['geometry'].iloc[0].x
precip_point_g = geemap.foliumap.Map(center=[center_lat,center_lon],zoom=7)
precip_point_g.add_basemap('SATELLITE')
# select precipitation for the range of dates in the growing season
precip = ee.ImageCollection("ECMWF/ERA5/DAILY").\
filter(ee.Filter.date(yield_site_dates['PlantingDate'].iloc[0], yield_site_dates['EndDate'].iloc[0])).select('total_precipitation')
precip_params = {'min': 0.0, 'max': 0.10,\
'palette': ['001137', '0aab1e', 'e7eb05', 'ff4a2d', 'e90000'],}
precip_point_g.addLayer(precip, precip_params, 'total_precipitation');
precip_point_g
Now, we'll get the total precipitation and temperature over the course of the growing season for each of our plots:
def process_era_img(img):
img = img.select(['maximum_2m_air_temperature','total_precipitation'])
img = img.rename(['t','p'])
img = img.updateMask(agland)
return(img)
def site_weather(site_date):
planting_date = ee.String(site_date.get('PlantingDate'))
end_date = ee.String(site_date.get('EndDate'))
era5 = ee.ImageCollection("ECMWF/ERA5/DAILY").filterBounds(site_date.geometry()).filterDate(planting_date,end_date)
era5 = ee.Algorithms.If(era5.size().gt(0),
ee.ImageCollection(era5.map(process_era_img)),
ee.ImageCollection([]))
era5 = ee.ImageCollection(era5)
era5_size = era5.size()
era5 = ee.Algorithms.If(era5_size.gt(0),
ee.Image(era5.iterate(append_imgs,ee.Image())),
era5)
site_data = ee.Algorithms.If(era5_size.gt(0),
calc_mean(site_date,era5,30),
site_date)
return(site_data)
ee_site_era5 = ee_site_dates.map(lambda x: site_weather(x))
site_era5 = geemap.ee_to_df(ee_site_era5)
site_era5
EndDate | PlantingDate | p | p_1 | p_10 | p_11 | p_12 | p_13 | p_14 | p_15 | ... | t_81 | t_82 | t_83 | t_84 | t_85 | t_86 | t_87 | t_88 | t_89 | t_9 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2004-03-04 | 2003-12-05 | 0.015568 | 0.004844 | 0.004068 | 0.013641 | 0.002943 | 0.000320 | 0.002526 | 0.009870 | ... | 297.579468 | 297.346741 | 297.674896 | 297.569794 | 299.302246 | 301.228638 | 298.878235 | 299.837128 | 300.256134 | 298.602142 |
1 | 2005-03-27 | 2004-12-27 | 0.003703 | 0.026521 | 0.000781 | 0.000152 | 0.000142 | 0.003902 | 0.008237 | 0.011529 | ... | 302.013550 | 301.217163 | 299.966095 | 301.336060 | 300.738037 | 301.758209 | 301.622437 | 299.063843 | 299.259888 | 300.890137 |
2 | 2003-03-16 | 2002-12-16 | 0.004360 | 0.014276 | 0.020998 | 0.020011 | 0.011626 | 0.029256 | 0.021670 | 0.012067 | ... | 301.148163 | 299.238129 | 296.953491 | 300.799744 | 297.679199 | 296.279633 | 299.398712 | 300.254852 | 300.157043 | 296.176147 |
3 | 2006-04-04 | 2006-01-04 | 0.008878 | 0.005565 | 0.016839 | 0.021182 | 0.014174 | 0.009605 | 0.005294 | 0.004315 | ... | 299.669495 | 299.427521 | 301.432129 | 297.776428 | 299.015594 | 298.784180 | 298.875183 | 299.045349 | 297.762024 | 300.792847 |
4 | 2007-03-12 | 2006-12-12 | 0.014733 | 0.006304 | 0.007241 | 0.006208 | 0.000174 | 0.000274 | 0.004535 | 0.010779 | ... | 299.051453 | 298.938599 | 302.099487 | 301.654236 | 299.694794 | 297.407623 | 296.389435 | 299.265564 | 298.629211 | 299.326202 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
500 | 2007-05-02 | 2007-02-01 | 0.000886 | 0.014860 | 0.012758 | 0.011453 | 0.004728 | 0.002353 | 0.000262 | 0.001151 | ... | 302.756393 | 302.214558 | 301.248064 | 302.707794 | 303.163194 | 303.672526 | 305.190473 | 301.593207 | 302.151482 | 303.419322 |
501 | 2007-03-10 | 2006-12-10 | 0.002900 | 0.000003 | 0.036347 | 0.002470 | 0.001266 | 0.014713 | 0.008835 | 0.000695 | ... | 301.010653 | 301.718612 | 302.090371 | 296.883295 | 292.387048 | 296.363779 | 296.472263 | 295.078919 | 296.334451 | 301.033064 |
502 | 2007-03-14 | 2006-12-14 | 0.000833 | 0.000897 | 0.008835 | 0.000695 | 0.003145 | 0.008284 | 0.011418 | 0.004879 | ... | 292.387048 | 296.363779 | 296.472263 | 295.078919 | 296.334451 | 297.331167 | 297.696334 | 298.753351 | 301.303561 | 298.308230 |
503 | 2007-03-19 | 2006-12-19 | 0.000522 | 0.000130 | 0.000028 | 0.000627 | 0.027878 | 0.008867 | 0.000165 | 0.006064 | ... | 301.915191 | 304.183616 | 305.861503 | 306.734443 | 305.456852 | 304.723269 | 306.804111 | 307.411262 | 300.775432 | 306.256657 |
504 | 2007-04-18 | 2007-01-18 | 0.011238 | 0.042136 | 0.001714 | 0.012431 | 0.005236 | 0.002775 | 0.001222 | 0.000129 | ... | 301.759778 | 302.310618 | 304.353557 | 305.352067 | 304.986025 | 303.722007 | 303.911944 | 304.914958 | 303.700500 | 305.237236 |
505 rows × 183 columns
Similar to what we did with the NDVI data, we'll run another set of harmonic regressions to parsimoniously characterize weather over the growing season. Rather than using one set of harmonic pairs, however, we'll use two. The specification we'll estimate is given by
$NDVI_{it}=\alpha_i+\delta_i t+\sum\limits_{w=1}^2[\beta^{cos_w}_icos(2w\pi t)+\beta^{sin_w}_isin(2w\pi t)]+ɛ_{it}$
site_era5 = site_era5.drop('EndDate',axis=1)
site_era5 = site_era5.rename(columns={'t':'t_0','p':'p_0'})
site_era5 = site_era5.set_index(['sitecode','PlantingDate']).stack().reset_index()
site_era5['time'] = site_era5['level_2'].apply(lambda x: re.search("(?<=_)[0-9]+",x)[0])
site_era5['var'] = site_era5['level_2'].apply(lambda x: re.search(".*(?=_)",x)[0])
site_era5 = site_era5.drop('level_2',axis=1)
site_era5 = site_era5.set_index(['sitecode','PlantingDate','time','var']).unstack()
site_era5 = site_era5.reset_index()
site_era5.columns = [fix_cols(c) for c in site_era5.columns]
site_era5['time'] = site_era5['time'].apply(int)/90
site_era5['cos_t'] = (2*pi*site_era5['time']).apply(np.cos)
site_era5['cos_2t'] = (4*pi*site_era5['time']).apply(np.cos)
site_era5['sin_t'] = (2*pi*site_era5['time']).apply(np.sin)
site_era5['sin_2t'] = (4*pi*site_era5['time']).apply(np.sin)
yield_site_dates['temp_int'] = np.nan
yield_site_dates['temp_t_coef'] = np.nan
yield_site_dates['temp_cos1_coef'] = np.nan
yield_site_dates['temp_sin1_coef'] = np.nan
yield_site_dates['temp_cos2_coef'] = np.nan
yield_site_dates['temp_sin2_coef'] = np.nan
yield_site_dates['prec_int'] = np.nan
yield_site_dates['prec_t_coef'] = np.nan
yield_site_dates['prec_cos1_coef'] = np.nan
yield_site_dates['prec_sin1_coef'] = np.nan
yield_site_dates['prec_cos2_coef'] = np.nan
yield_site_dates['prec_sin2_coef'] = np.nan
for i in range(0,len(yield_site_dates.index)):
site_date = site_era5.loc[(site_era5['sitecode']==yield_site_dates['sitecode'].iloc[i])&(site_era5['PlantingDate']==yield_site_dates['PlantingDate'].iloc[i])]
if len(site_date.index)>=4:
y=site_date['t'].to_numpy()
x=site_date[['t','cos_t','sin_t','cos_2t','sin_2t']].to_numpy()
ols = LinearRegression()
ols.fit(x,y)
with warnings.catch_warnings():
warnings.simplefilter("ignore")
yield_site_dates['temp_int'].iloc[i] = ols.intercept_
yield_site_dates['temp_t_coef'].iloc[i] = ols.coef_[0]
yield_site_dates['temp_cos1_coef'].iloc[i] = ols.coef_[1]
yield_site_dates['temp_sin1_coef'].iloc[i] = ols.coef_[2]
yield_site_dates['temp_cos2_coef'].iloc[i] = ols.coef_[3]
yield_site_dates['temp_sin2_coef'].iloc[i] = ols.coef_[4]
y=site_date['p'].to_numpy()
ols = LinearRegression()
ols.fit(x,y)
with warnings.catch_warnings():
warnings.simplefilter("ignore")
yield_site_dates['prec_int'].iloc[i] = ols.intercept_
yield_site_dates['prec_t_coef'].iloc[i] = ols.coef_[0]
yield_site_dates['prec_cos1_coef'].iloc[i] = ols.coef_[1]
yield_site_dates['prec_sin1_coef'].iloc[i] = ols.coef_[2]
yield_site_dates['prec_cos2_coef'].iloc[i] = ols.coef_[3]
yield_site_dates['prec_sin2_coef'].iloc[i] = ols.coef_[4]
yield_site_dates
sitecode | PlantingDate | EndDate | geometry | eeGeom | eeFeature | ls_int | ls_t_coef | ls_cos_coef | ls_sin_coef | ... | temp_cos1_coef | temp_sin1_coef | temp_cos2_coef | temp_sin2_coef | prec_int | prec_t_coef | prec_cos1_coef | prec_sin1_coef | prec_cos2_coef | prec_sin2_coef | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1041 | 2003-12-05 | 2004-03-04 | POINT (28.4 -14.2) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN | ... | -1.736619e-17 | 3.205925e-17 | 4.018757e-18 | 1.174892e-17 | 0.540462 | -0.001775 | 0.000133 | 0.000033 | 0.001089 | 0.000317 |
121 | 1041 | 2004-12-27 | 2005-03-27 | POINT (28.4 -14.2) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN | ... | 5.916547e-17 | 1.179240e-17 | 5.920669e-17 | -1.556963e-17 | 0.509888 | -0.001681 | 0.000162 | 0.000994 | -0.001183 | 0.001088 |
211 | 1041 | 2002-12-16 | 2003-03-16 | POINT (28.4 -14.2) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN | ... | 9.197232e-17 | 1.661072e-17 | 1.027760e-16 | 1.126899e-16 | 0.577168 | -0.001894 | 0.001244 | -0.000235 | 0.001430 | 0.000089 |
297 | 1041 | 2006-01-04 | 2006-04-04 | POINT (28.4 -14.2) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN | ... | -3.508322e-17 | 1.072939e-17 | -4.996104e-17 | -1.178115e-17 | 0.576704 | -0.001901 | -0.000920 | 0.001219 | -0.000539 | 0.003002 |
332 | 1041 | 2006-12-12 | 2007-03-12 | POINT (28.4 -14.2) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN | ... | 4.547606e-17 | 1.546745e-18 | 3.371382e-17 | 1.789307e-17 | 0.810949 | -0.002677 | -0.004516 | -0.001545 | 0.002296 | 0.003972 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
25937 | 729 | 2007-02-01 | 2007-05-02 | POINT (38.87 -13.24) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | -0.161036 | 1.316836 | -0.227792 | 0.159079 | ... | -3.152655e-17 | -4.061944e-17 | -3.205936e-17 | -3.706137e-17 | 0.154494 | -0.000498 | 0.001377 | 0.002959 | 0.000369 | 0.001259 |
26025 | 520 | 2006-12-10 | 2007-03-10 | POINT (27.78 -29.55) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | 0.366722 | 0.230899 | -0.000952 | -0.008080 | ... | -1.323362e-16 | 1.279006e-18 | -1.153060e-16 | 2.235635e-17 | 0.064505 | -0.000205 | 0.000514 | 0.000491 | 0.000680 | 0.001376 |
26027 | 520 | 2006-12-14 | 2007-03-14 | POINT (27.78 -29.55) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | 0.523215 | -0.050080 | 0.009552 | -0.091911 | ... | -9.379284e-17 | 8.731942e-18 | -7.848041e-17 | 8.685789e-18 | 0.088653 | -0.000285 | 0.000721 | 0.000257 | 0.001376 | 0.000666 |
26067 | 737 | 2006-12-19 | 2007-03-19 | POINT (34.75 -24.49) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | NaN | NaN | NaN | NaN | ... | -1.574434e-16 | 3.032031e-17 | -1.472751e-17 | 3.793571e-17 | 0.113799 | -0.000366 | 0.000331 | 0.000091 | -0.000853 | 0.000985 |
26117 | 731 | 2007-01-18 | 2007-04-18 | POINT (38.26 -16.84) | ee.Geometry({\n "functionInvocationValue": {\... | ee.Feature({\n "functionInvocationValue": {\n... | 0.578816 | 0.036213 | -0.016432 | -0.035037 | ... | 1.020221e-16 | -1.126183e-18 | 7.934377e-17 | 4.851794e-18 | 0.431510 | -0.001400 | -0.000721 | 0.002653 | 0.000046 | -0.000007 |
505 rows × 22 columns
We now have the precipitation and temperature for each trial. Along with the trial data, Landsat NDVI, and SRTM terrain measures, we now have all the features we'll use to build out our model for yield prediction.
Generate Composites
In order to utilize all our information in a random forest model, we'll need to combine all the information corresponding to each plot in the same table. Python makes it easy to use Pandas DataFrames in machine learning models, so we will combine each individual DataFrame or GeoDataFrame we created above (the selected variety trial information, the Landsat NDVI data, the Shuttle Radar Topography Mission data, and the weather data) into the same table and join by trial. Pandas provides a few options for joining tables, but for this purpose we can use pd.merge()
, which performs a database-like join using according to specified fields. For more information on merge, see the Pandas documentation, and for information on the ways to combine tables in Pandas, see this user guide. \
yields
index | logYield | Management | PlantingDate | AnthesisDate | ASI | DaysToSilk | vargroup | yrcode | sitecode | ... | silk.gdd30 | LocationID | Country | Location | Region | Latitude | Longitude | ElevM | geometry | EndDate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2.116713 | Optimal | 2003-12-05 | 64.7 | 2.0 | 66.7 | EPOP | 2004 | 1041 | ... | 0.020164 | 1041 | ZAMBIA | ZAMSEED, FARM | Eastern and Southern Africa | -14.20 | 28.40 | 1178 | POINT (28.4 -14.2) | 2004-03-04 |
1 | 1 | 1.859403 | Optimal | 2003-12-05 | 65.2 | 2.0 | 67.2 | ILPO | 2004 | 1041 | ... | 0.020164 | 1041 | ZAMBIA | ZAMSEED, FARM | Eastern and Southern Africa | -14.20 | 28.40 | 1178 | POINT (28.4 -14.2) | 2004-03-04 |
2 | 2 | 1.662258 | Optimal | 2003-12-05 | 69.9 | 2.5 | 72.4 | EIHY | 2004 | 1041 | ... | 0.020164 | 1041 | ZAMBIA | ZAMSEED, FARM | Eastern and Southern Africa | -14.20 | 28.40 | 1178 | POINT (28.4 -14.2) | 2004-03-04 |
3 | 3 | 2.094564 | Optimal | 2003-12-05 | 67.8 | 2.3 | 70.1 | ILHY | 2004 | 1041 | ... | 0.020164 | 1041 | ZAMBIA | ZAMSEED, FARM | Eastern and Southern Africa | -14.20 | 28.40 | 1178 | POINT (28.4 -14.2) | 2004-03-04 |
4 | 4 | 1.950514 | Optimal | 2003-12-05 | 68.0 | 3.4 | 71.4 | ILHY | 2004 | 1041 | ... | 0.020164 | 1041 | ZAMBIA | ZAMSEED, FARM | Eastern and Southern Africa | -14.20 | 28.40 | 1178 | POINT (28.4 -14.2) | 2004-03-04 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
26137 | 26137 | -1.044977 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | NaN | 731 | MOZAMBIQUE | MOCUBA | Eastern and Southern Africa | -16.84 | 38.26 | 66 | POINT (38.26 -16.84) | 2007-04-18 |
26138 | 26138 | -0.868930 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | NaN | 731 | MOZAMBIQUE | MOCUBA | Eastern and Southern Africa | -16.84 | 38.26 | 66 | POINT (38.26 -16.84) | 2007-04-18 |
26139 | 26139 | -0.621571 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | NaN | 731 | MOZAMBIQUE | MOCUBA | Eastern and Southern Africa | -16.84 | 38.26 | 66 | POINT (38.26 -16.84) | 2007-04-18 |
26140 | 26140 | -0.696954 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | NaN | 731 | MOZAMBIQUE | MOCUBA | Eastern and Southern Africa | -16.84 | 38.26 | 66 | POINT (38.26 -16.84) | 2007-04-18 |
26141 | 26141 | -1.679324 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | NaN | 731 | MOZAMBIQUE | MOCUBA | Eastern and Southern Africa | -16.84 | 38.26 | 66 | POINT (38.26 -16.84) | 2007-04-18 |
26142 rows × 47 columns
yield_site_dates = yield_site_dates.dropna(subset='ls_int')
yields_w_features = pd.merge(yields,yield_site_dates,
on=['sitecode','PlantingDate'])
yields_w_features = pd.merge(yields_w_features,srtm_df,
on=['sitecode','PlantingDate'])
yields_w_features
index | logYield | Management | PlantingDate | AnthesisDate | ASI | DaysToSilk | vargroup | yrcode | sitecode | ... | prec_int | prec_t_coef | prec_cos1_coef | prec_sin1_coef | prec_cos2_coef | prec_sin2_coef | EndDate | aspect | elevation | slope | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 406 | -0.749718 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | 0.273267 | -0.000894 | -0.000462 | 0.000345 | 0.000767 | -0.001314 | 2004-02-18 | 152.694000 | 1541.680776 | 2.851092 |
1 | 407 | 0.011830 | Optimal | 2003-11-20 | NaN | NaN | NaN | ILPO | 2004 | 517 | ... | 0.273267 | -0.000894 | -0.000462 | 0.000345 | 0.000767 | -0.001314 | 2004-02-18 | 152.694000 | 1541.680776 | 2.851092 |
2 | 408 | -0.898451 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | 0.273267 | -0.000894 | -0.000462 | 0.000345 | 0.000767 | -0.001314 | 2004-02-18 | 152.694000 | 1541.680776 | 2.851092 |
3 | 409 | -0.613412 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | 0.273267 | -0.000894 | -0.000462 | 0.000345 | 0.000767 | -0.001314 | 2004-02-18 | 152.694000 | 1541.680776 | 2.851092 |
4 | 410 | -0.178768 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | 0.273267 | -0.000894 | -0.000462 | 0.000345 | 0.000767 | -0.001314 | 2004-02-18 | 152.694000 | 1541.680776 | 2.851092 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12674 | 26137 | -1.044977 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | 0.431510 | -0.001400 | -0.000721 | 0.002653 | 0.000046 | -0.000007 | 2007-04-18 | 168.197365 | 73.112299 | 2.247579 |
12675 | 26138 | -0.868930 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | 0.431510 | -0.001400 | -0.000721 | 0.002653 | 0.000046 | -0.000007 | 2007-04-18 | 168.197365 | 73.112299 | 2.247579 |
12676 | 26139 | -0.621571 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | 0.431510 | -0.001400 | -0.000721 | 0.002653 | 0.000046 | -0.000007 | 2007-04-18 | 168.197365 | 73.112299 | 2.247579 |
12677 | 26140 | -0.696954 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | 0.431510 | -0.001400 | -0.000721 | 0.002653 | 0.000046 | -0.000007 | 2007-04-18 | 168.197365 | 73.112299 | 2.247579 |
12678 | 26141 | -1.679324 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | 0.431510 | -0.001400 | -0.000721 | 0.002653 | 0.000046 | -0.000007 | 2007-04-18 | 168.197365 | 73.112299 | 2.247579 |
12679 rows × 71 columns
Features
We've generated three sets of features we can use to predict yields: harmonic coefficients characterizing the trajectory of NDVI over the growing season, SRTM feature related to topography, and harmonic coefficients describing temperature and precipitation. In addition to these features, the variety trial data contain a number of variables describing the conditions of the particular trial which should be useful for predicting variation in yields within a given trial, including management practices (e.g. optimal, low nitrogen) and the "family" the variety is derived from. We'll make a list to keep track of all these features we'll be using for prediction.
yields_w_features = pd.merge(yields_w_features,pd.get_dummies(yields_w_features['Management']),
left_index=True,right_index=True)
yields_w_features = pd.merge(yields_w_features,pd.get_dummies(yields_w_features['vargroup']),
left_index=True,right_index=True)
yields_w_features
index | logYield | Management | PlantingDate | AnthesisDate | ASI | DaysToSilk | vargroup | yrcode | sitecode | ... | slope | Drought | Low N | Low pH | Maize Streak Virus | Optimal | EIHY | EPOP | ILHY | ILPO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 406 | -0.749718 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | 2.851092 | False | False | False | False | True | False | True | False | False |
1 | 407 | 0.011830 | Optimal | 2003-11-20 | NaN | NaN | NaN | ILPO | 2004 | 517 | ... | 2.851092 | False | False | False | False | True | False | False | False | True |
2 | 408 | -0.898451 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | 2.851092 | False | False | False | False | True | False | True | False | False |
3 | 409 | -0.613412 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | 2.851092 | False | False | False | False | True | False | True | False | False |
4 | 410 | -0.178768 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | 2.851092 | False | False | False | False | True | False | True | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12674 | 26137 | -1.044977 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | 2.247579 | False | False | False | False | True | False | True | False | False |
12675 | 26138 | -0.868930 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | 2.247579 | False | False | False | False | True | False | True | False | False |
12676 | 26139 | -0.621571 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | 2.247579 | False | False | False | False | True | False | True | False | False |
12677 | 26140 | -0.696954 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | 2.247579 | False | False | False | False | True | False | True | False | False |
12678 | 26141 | -1.679324 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | 2.247579 | False | False | False | False | True | False | True | False | False |
12679 rows × 80 columns
features = ['elevation','slope','aspect', # SRTM features
'ls_int','ls_t_coef','ls_cos_coef','ls_sin_coef', # NDVI features
'temp_int','temp_t_coef','temp_cos1_coef','temp_sin1_coef','temp_cos2_coef','temp_sin2_coef', # temperature features
'prec_int','prec_t_coef','prec_cos1_coef','prec_sin1_coef','prec_cos2_coef','prec_sin2_coef', # precipitation features
'Drought','Low N','Low pH','Maize Streak Virus','Optimal','EIHY','EPOP','ILHY','ILPO', # Management and variety features
#'Latitude','Longitude','yrcode' # context
]
Train and Tune Model
Train, Validation, and Test Subsets
Once we have the complete feature set for use in our model, we will prepare the data for use in model training by splitting into five folds. These folds will be used for training, validation, and testing. To be used for valid inference, errors in the model need to be uncorrelated across folds, though the particularities of fold construction depend on the use case for the predictions.
In the context of spatial data, ensuring errors are uncorrelated across folds will typically require some kind of spatial splitting procedure. Here, we'll assign trials to folds, rather than individual observations. This ensures that we can't over fit by learning some function of the features that can identify trials and use the within-trial correlation in yields to obtain high-predictive performance. The issue with learning such a function is that it will generalize poorly outside of the particular sites and years we're training on. Ensuring all observations within a trial are only used for training, validation, or testing helps ameliorate this concern.
np.random.seed(8105) # Random.org Min: 1, Max: 10000 2023-09-20 18:36:37 UTC
yield_sites = yields_w_features[['sitecode','yrcode']].drop_duplicates()
yield_sites['fold'] = np.random.choice(range(0,5),size=len(yield_sites.index))
yield_sites
sitecode | yrcode | fold | |
---|---|---|---|
0 | 517 | 2004 | 1 |
52 | 517 | 2005 | 3 |
54 | 517 | 2001 | 1 |
187 | 513 | 2004 | 1 |
239 | 513 | 2001 | 4 |
... | ... | ... | ... |
12359 | 1043 | 2007 | 3 |
12403 | 1057 | 2007 | 1 |
12520 | 729 | 2007 | 4 |
12612 | 520 | 2007 | 2 |
12654 | 731 | 2007 | 3 |
155 rows × 3 columns
yields_w_features = pd.merge(yields_w_features,yield_sites,
on=['sitecode','yrcode'])
yields_w_features
index | logYield | Management | PlantingDate | AnthesisDate | ASI | DaysToSilk | vargroup | yrcode | sitecode | ... | Drought | Low N | Low pH | Maize Streak Virus | Optimal | EIHY | EPOP | ILHY | ILPO | fold | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 406 | -0.749718 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | False | False | False | False | True | False | True | False | False | 1 |
1 | 407 | 0.011830 | Optimal | 2003-11-20 | NaN | NaN | NaN | ILPO | 2004 | 517 | ... | False | False | False | False | True | False | False | False | True | 1 |
2 | 408 | -0.898451 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | False | False | False | False | True | False | True | False | False | 1 |
3 | 409 | -0.613412 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | False | False | False | False | True | False | True | False | False | 1 |
4 | 410 | -0.178768 | Optimal | 2003-11-20 | NaN | NaN | NaN | EPOP | 2004 | 517 | ... | False | False | False | False | True | False | True | False | False | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12674 | 26137 | -1.044977 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | False | False | False | False | True | False | True | False | False | 3 |
12675 | 26138 | -0.868930 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | False | False | False | False | True | False | True | False | False | 3 |
12676 | 26139 | -0.621571 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | False | False | False | False | True | False | True | False | False | 3 |
12677 | 26140 | -0.696954 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | False | False | False | False | True | False | True | False | False | 3 |
12678 | 26141 | -1.679324 | Optimal | 2007-01-18 | NaN | NaN | NaN | EPOP | 2007 | 731 | ... | False | False | False | False | True | False | True | False | False | 3 |
12679 rows × 81 columns
Hyperparameter Tuning
We now can utilize the training dataset to tune our hyperparameters with a grid search and cross-validation. Hyperparameters are parameters of a machine learning model which do not come from the data but instead are set prior to training the final model. For example, a random forest model can use multiple different criteria for determining how 'close' each field is to the others according to the training data provided, and the criterion used is a hyperparameter which is set prior to model training. \
A grid search is a process which is performed prior to the final model training in order to determine which hyperparameters provide the best model performance. In a grid search, multiple combinations of hyperparameters are tested to determine the best combination for performance. \
We'll also use cross-validation to make this testing more comprehensive. Cross-validation essentially trains and tests multiple models on different subsets, or folds, of the data to arrive at a more robust estimate of model performance. For example, if our training dataset was an (unrealistically small) 100 records and we set our number of 'folds' to be k = 5, we would have 5 folds of 20 records. In the first cross-validation step, the first fold of 20 will be left out while three folds are used to train a model and the final fold is used to choose the best combination of hyperparameters. Then, the model which is trained will be used on the first fold and the performance assessed. Next, the second fold will be left out and the other four folds (the first, third, fourth, and fifth) will be used to train the model and choose hyperparameters again, with the performance assessed on the held-out second fold. This process wil continue until all five folds have been used to evaluate a model, and the combined performance of all five models will be used as our performance estimate. \
from sklearn.ensemble import RandomForestRegressor
pred_dfs = []
for fold in range(0,5):
train_val = yields_w_features.loc[yields_w_features['fold']!=fold]
val_fold = (fold+1)%5
val = yields_w_features.loc[yields_w_features['fold']==val_fold]
train = train_val.loc[train_val['fold']!=val_fold]
train_x = train[features].to_numpy()
train_y = train['logYield'].to_numpy()
val_x = val[features].to_numpy()
val_y = val['logYield'].to_numpy()
trainval_x = train_val[features].to_numpy()
trainval_y = train_val['logYield'].to_numpy()
test_df = yields_w_features.loc[yields_w_features['fold']==fold]
test_x = test_df[features].to_numpy()
best_r2 = -1
best_depth = -1
for d in range(1,11):
rf = RandomForestRegressor(max_depth=d)
rf.fit(train_x,train_y)
r2 = rf.score(val_x,val_y)
print(r2)
if r2>best_r2:
best_r2 = r2
best_depth = d
rf = RandomForestRegressor(best_depth)
rf.fit(trainval_x,trainval_y)
preds = rf.predict(test_x)
test_df['logYield_hat'] = preds
pred_dfs.append(test_df)
yields_w_features = pd.concat(pred_dfs)
yields_w_features
-0.21670445974577524 -0.20868851382220166 -0.27323653217950494 -0.16932273773026774 -0.1581475251104807 -0.1624630786586283 -0.1723100627995504 -0.2211647525979481 -0.2545038334207088 -0.16777324725148368
<ipython-input-59-137561aa4091>:38: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy test_df['logYield_hat'] = preds
0.04399194345913193 0.11540378363361647 0.16470541476121825 0.13855971211419893 0.2516018729748746 0.3217334135712866 0.2806350919036348 0.3523249988674868 0.3910143539379939 0.3632055721266426
<ipython-input-59-137561aa4091>:38: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy test_df['logYield_hat'] = preds
-0.13536442589038034 -0.13205149981689757 -0.078517327768584 -0.13945995682836232 -0.1614833831956628 -0.17953220032491557 -0.2478425538748279 -0.21674798821739483 -0.22752016392971397 -0.20751149526080392
<ipython-input-59-137561aa4091>:38: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy test_df['logYield_hat'] = preds
0.1034746510774972 0.19800642110865874 0.03487763034138236 0.0011729896206109647 0.20708913540142238 0.19464295600074177 0.13592546342225298 0.16135349673663624 0.14723650188138582 0.1426360007570635
<ipython-input-59-137561aa4091>:38: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy test_df['logYield_hat'] = preds
-0.20439082696241662 -0.10735924697463384 0.12677745685625463 -0.05718765886008037 -0.1410356847509262 -0.190792448335396 -0.2050362691701979 -0.17318619275435276 -0.2085622703504546 -0.16457193251032276
<ipython-input-59-137561aa4091>:38: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy test_df['logYield_hat'] = preds
index | logYield | Management | PlantingDate | AnthesisDate | ASI | DaysToSilk | vargroup | yrcode | sitecode | ... | Low N | Low pH | Maize Streak Virus | Optimal | EIHY | EPOP | ILHY | ILPO | fold | logYield_hat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
263 | 697 | 1.087450 | Optimal | 2003-11-21 | 66.0 | 7.0 | 73.0 | EPOP | 2004 | 896 | ... | False | False | False | True | False | True | False | False | 0 | 1.330768 |
264 | 698 | 1.128495 | Optimal | 2003-11-18 | 72.4 | 3.9 | 76.3 | ILPO | 2004 | 896 | ... | False | False | False | True | False | False | False | True | 0 | 1.185352 |
265 | 699 | 1.323541 | Optimal | 2003-11-21 | 72.0 | 4.0 | 76.0 | EPOP | 2004 | 896 | ... | False | False | False | True | False | True | False | False | 0 | 1.330768 |
266 | 700 | 1.249902 | Optimal | 2003-11-21 | 68.0 | 5.3 | 73.3 | EPOP | 2004 | 896 | ... | False | False | False | True | False | True | False | False | 0 | 1.330768 |
267 | 701 | 1.496650 | Optimal | 2003-11-21 | 72.4 | 1.7 | 74.1 | EPOP | 2004 | 896 | ... | False | False | False | True | False | True | False | False | 0 | 1.330768 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12607 | 26020 | 1.348938 | Optimal | 2007-01-11 | 54.0 | 8.7 | 62.7 | EIHY | 2007 | 729 | ... | False | False | False | True | True | False | False | False | 4 | 0.305599 |
12608 | 26021 | 1.527577 | Optimal | 2007-01-10 | 58.3 | 5.2 | 63.5 | ILHY | 2007 | 729 | ... | False | False | False | True | False | False | True | False | 4 | 1.362675 |
12609 | 26022 | 1.662596 | Optimal | 2007-01-10 | 59.0 | 5.4 | 64.4 | ILHY | 2007 | 729 | ... | False | False | False | True | False | False | True | False | 4 | 1.362675 |
12610 | 26023 | 1.214847 | Optimal | 2007-01-10 | 60.3 | 4.8 | 65.1 | ILHY | 2007 | 729 | ... | False | False | False | True | False | False | True | False | 4 | 1.362675 |
12611 | 26024 | 1.476491 | Optimal | 2007-01-10 | 60.0 | 5.2 | 65.2 | ILHY | 2007 | 729 | ... | False | False | False | True | False | False | True | False | 4 | 1.362675 |
12679 rows × 82 columns
yields_w_features[['logYield','logYield_hat']].corr()**2
logYield | logYield_hat | |
---|---|---|
logYield | 1.000000 | 0.124822 |
logYield_hat | 0.124822 | 1.000000 |
HOMEWORK: GENERATE PREDICTED YIELD MAPS \
Apply Trained Model in Earth Engine
Now that we've trained our random forest we're ready to predict yields where we don't observe ground truth. There are at least a couple reasonable ways to go about this. The first is to construct the features, just as we have done for this workshop, for a larger collection of locations. For example, you may make a grid of 5 km x 5 km cells. Once you've constructed the features for these new locations, you can use the random forest's .predict()
method to generate predicted yields. \
Apply Trained Model in Earth Engine
The second approach is notably more complicated and involves forming the predictions within Earth Engine. Rather than making a dataframe with all of the necessary features, it is possible to make an eeImage
with bands corresponding to each of the features. To do so, you would need to run the harmonic regressions within Earth Engine using ee.Reducer.linearRegression()
. To implement the random forest on earth engine, you can create a string representation of the model using geemap.ml.rf_to_strings()
and upload it to Earth Engine with ee.Classifier.decisionTreeEnsemble()
.
References \
“About | 50 by 2030.” 2019. https://www.50x2030.org/about.
Azzari, George, Shruti Jain, Graham Jeffries, Talip Kilic, and Siobhan Murray. 2021. “Understanding the Requirements for Surveys to Support Satellite-Based Crop Type Mapping: Evidence from Sub-Saharan Africa.” Remote Sensing 13, no. 23: 4749. https://doi.org/10.3390/rs13234749.
Deines, Jillian M., Rinkal Patel, Sang-Zi Liang, Walter Dado, and David B. Lobell. 2021. “A Million Kernels of Truth: Insights into Scalable Satellite Maize Yield Mapping and Yield Gap Analysis from an Extensive Ground Dataset in the US Corn Belt.” Remote Sensing of Environment 253: 112174. https://doi.org/10.1016/j.rse.2020.112174.
Earth Resources Observation And Science (EROS) Center. 2017. “Sentinel.” Tiff,jpg. U.S. Geological Survey. https://doi.org/10.5066/F76W992G. “Food Sustainability Index.” 2021. Economist Impact and Barilla Centre for Food and Nutrition Foundation. https://impact.economist.com/projects/foodsustainability/.
Funk, C.C., Peterson, P.J., Landsfeld, M.F., Pedreros, D.H., Verdin, J.P., Rowland, J.D., Romero, B.E., Husak, G.J., Michaelsen, J.C., and Verdin, A.P.. 2014. A quasi-global precipitation time series for drought monitoring: U.S. Geological Survey Data Series 832, 4 p. http://pubs.usgs.gov/ds/832/.
Funk, Chris, Pete Peterson, Seth Peterson, Shraddhanand Shukla, Frank Davenport, Joel Michaelsen, Kenneth R. Knapp, Martin Landsfeld, Gregory Husak, Laura Harrison, James Rowland, Michael Budde, Alex Meiburg, Tufa Dinku, Diego Pedreros, and Nicholas Mata. 2019. A high-resolution 1983–2016 Tmax climate data record based on infrared temperatures and stations by the Climate Hazard Center. Journal of Climate 32, 5639-5658. https://doi.org/10.1175/JCLI-D-18-0698.1.
Jin, Zhenong, George Azzari, Calum You, Stefania Di Tommaso, Stephen Aston, Marshall Burke, and David B. Lobell. 2019. “Smallholder Maize Area and Yield Mapping at National Scales with Google Earth Engine.” Remote Sensing of Environment 228: 115–28. https://doi.org/10.1016/j.rse.2019.04.016.
“Missions - Sentinel Online - Sentinel Online.” n.d. Accessed July 5, 2021. https://sentinel.esa.int/web/sentinel/missions.
“Sentinel-2 - Satellite Description - Sentinel Online - Sentinel Online.” n.d. Accessed September 2, 2021. https://sentinel.esa.int/web/sentinel/missions/sentinel-2/satellite-description.
“User Guides - Sentinel-2 MSI - Resolutions - Sentinel Online - Sentinel Online.” n.d. Accessed September 2, 2021. https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi/resolutions.
“Sentinel-2 - Missions - Resolution and Swath - Sentinel Handbook - Sentinel Online.” n.d. Accessed September 2, 2021. https://sentinel.esa.int/web/sentinel/missions/sentinel-2/instrument-payload/resolution-and-swath.