Predicting Soybean Yield with NDVI Using a Flexible Fourier Transform Model

Chang Xu; Ani L. Katchova

doi:10.1017/aae.2019.5

Published online by Cambridge University Press: 21 May 2019

Chang Xu: Department of Agricultural, Environmental, and Development Economics, The Ohio State University, Columbus, Ohio, USA
Ani L. Katchova*: Department of Agricultural, Environmental, and Development Economics, The Ohio State University, Columbus, Ohio, USA
*Corresponding author. Email: [email protected]

Abstract

We use models incorporating the normalized difference vegetation index (NDVI) derived from remote sensing satellites to improve soybean yield predictions in 10 major producing states in the United States. Unlike traditional methods that assume an ordinary least squares model applies to all observations, we allow for global flexibility in the relationship between NDVI and soybean yield by using the flexible Fourier transform (FFT) model. FFT results confirm that there is a nonlinear response of soybean yield to NDVI over the growing season. Out-of-sample predictions indicate that allowing for global flexibility with the FFT improves the predictions in time-series prediction and forecasting.

Keywords

Flexible Fourier transform model; forecasting; NDVI; remote sensing; soybean yield. JEL classifications: C14; C53; Q16

Type: Research Article
Information: Journal of Agricultural and Applied Economics, Volume 51, Issue 3, August 2019, pp. 402-416

DOI: https://doi.org/10.1017/aae.2019.5
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright: © The Author(s) 2019

1. Introduction

Many agencies, both public and private, devote significant effort to producing crop yield forecasts (Irwin, Sanders, and Good, 2014). Accurate and timely crop yield forecasts are valuable in many ways for market participants. At the aggregate level, crop yield forecasts help the price discovery process and improve market efficiency; they also aid decision makers in formulating rapid decisions to accommodate humanitarian actions and provide disaster assistance. At the individual level, crop yield forecasts are used by insurance companies to set crop insurance premiums, and they provide critical information for producers to make adjustments to improve their farm profitability.

In recent years, there has been increasing interest in using remote sensing data to improve crop yield forecasting. Remote sensing collects, archives, processes, and distributes satellite-derived data (Senay, 2016). For example, the normalized difference vegetation index (NDVI), a product of remote sensing, contains information that can be used to predict crop yields. NDVI is a measure of biomass density on the surface of the earth, usually derived from a space-based platform. NDVI is defined as follows:

$${\rm{NDVI}} = \left( {{\rm{NIR}} - {\rm{RED}}} \right)/\left( {{\rm{NIR}} + {\rm{RED}}} \right),$$

where NIR stands for the reflectance in the near-infrared band and RED stands for the reflectance in the visible red band of the electromagnetic spectrum. According to electromagnetic theory, live vegetation absorbs the blue and red bands of sunlight and reflects most of the green band of sunlight. Dying vegetation, in contrast, absorbs mostly the green band of sunlight and reflects mostly the blue and red bands of sunlight. Barren soil moderately reflects both the visible and near-infrared bands of the electromagnetic spectrum. Generally, the higher the NDVI, the more NIR light and the less RED light is reflected, and therefore the more vegetation the target area contains.
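As a small illustration of the definition above, the following Python sketch computes NDVI for a single pixel. The reflectance values are hypothetical and chosen only to show that a dense canopy scores higher than bare soil; the function name `ndvi` is ours and not part of any satellite product.

```python
def ndvi(nir: float, red: float) -> float:
    """Return (NIR - RED) / (NIR + RED) for one pixel's reflectances."""
    total = nir + red
    if total == 0:
        raise ValueError("NIR + RED must be nonzero")
    return (nir - red) / total


# Hypothetical reflectances (assumed for illustration, not measured data).
dense_canopy = ndvi(nir=0.50, red=0.08)  # strong NIR reflection, little red
bare_soil = ndvi(nir=0.30, red=0.25)     # moderate reflection in both bands

print(round(dense_canopy, 3), round(bare_soil, 3))
```

Because the index is a normalized difference of two nonnegative reflectances, it always lies between -1 and 1.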

Because remote sensing provides information with a similar level of accuracy and accessibility regardless of the location and economic development of the country, using remote sensing data to predict crop yield has the potential to be applied in less developed countries in a cost-effective manner. In comparison, traditional, survey-based forecasts are relatively expensive and labor intensive.

Previous NDVI-based forecasting studies (Lv, 2014) utilized ordinary least squares (OLS) regression, which assumes that a global coefficient applies to each location invariantly. However, a global coefficient may hide variation across locations. Because of differences in local climate, soil conditions, and farm practices, the correlation between NDVI and crop yields may be highly localized. Using a global coefficient to forecast site-specific crop yield may produce biased forecasts and thus lead to less informed decisions by market participants.

We use a flexible Fourier transform (FFT) model to allow for global flexibility in crop yield forecasts based on NDVI. This is the first study to our knowledge to examine how the correlation between NDVI and soybean yield varies by location and to use this global flexibility of the FFT model to improve the forecast performance of soybean yields. We then compare FFT with OLS in terms of out-of-sample forecast performance. Two hypotheses are tested: (1) the relationship between NDVI and crop yield is nonlinear using the FFT model; and (2) the proposed FFT model outperforms OLS in terms of ex ante forecasting accuracy, because FFT introduces flexibility in modeling the soybean yield–NDVI relationship and allows the soybean yield–NDVI elasticity estimates to vary across observations.

This article is organized as follows: Section 2 presents some background information on current practices used for crop yield forecasting and remote sensing for crop yield forecasting; Section 3 introduces the data sources and the FFT model we use; Section 4 presents a descriptive analysis and regression and forecasting results, comparing the FFT method and the traditional OLS method; and Section 5 concludes the article.

2. Background and related literature

2.1. Overview of current crop yield forecast methods

There are two types of crop forecasts: survey-based forecasts and regression-based forecasts. Survey-based forecasts tend to be more accurate, especially as the harvest date approaches, and are usually available shortly before or around harvest time, but they are also more expensive and labor intensive. Regression-based forecasts are more cost effective and can be available well ahead of harvest; however, their accuracy may be compromised.

The survey-based forecasts used by the U.S. Department of Agriculture, National Agricultural Statistics Service (USDA-NASS) are based on two annual surveys, an agricultural yield survey (AYS) and an objective yield survey (OYS), the details of which can be found in USDA-NASS (2012). In the AYS, farmers are asked to self-report their anticipated yields, which may become the actual yields if harvest has begun. In the OYS, NASS sends technical personnel to the field to take objective measurements and counts of the plants. Both AYS and OYS are conducted monthly from May to November, but soybean yield data are collected and soybean yield forecasts are published from August to November. The final forecast is released in January of the next year. The typical cycle of soybean production in the major producing states in the United States is as follows: planting is in May and June, flowering is in July (the moisture- and temperature-sensitive stage), filling is in August, maturation is in September, and harvesting is between October and November.

The second type of crop forecast is the regression-based forecast. This type of forecast is used mostly by private agencies and occasionally as a supplementary forecast by public agencies. For example, the World Agricultural Outlook Board (WAOB) releases the World Agricultural Supply and Demand Estimates regression-based forecasts, which use trend analysis and crop weather regression models. Unlike the forecasts released by NASS at the end of the year, the WAOB releases early forecasts throughout the growing season, from May to August (Irwin, Sanders, and Good, 2014). The comparison between NASS and WAOB yield forecasts and an evaluation of WAOB forecast accuracy can be found in Irwin, Good, and Sanders (2015). The crop weather model (also known as the modified Thompson model) utilizes a year trend variable, monthly weather variables, and an indicator for whether the crop is planted late. The crop condition model utilizes a year trend variable, the proportion of the crop planted after a certain date (e.g., May 30 for soybeans; Irwin, Good, and Tannura, 2009), and the proportion of the crop rated as good or excellent by USDA (Crop Progress Report). The model we propose in this study is based on the crop weather model but also adds NDVI variables. According to the literature, the modified Thompson model produces a good fit but performs poorly when events that cannot be captured by a weather variable (such as insects and diseases) negatively affect crop yields. We hypothesize that NDVI can also capture the effects of insects and diseases because NDVI is a direct indicator of the greenness and health of the vegetation, with the additional benefit that NDVI data are immediately available at a low cost compared with the methods that rate crop conditions. 
Because regression-based forecasts typically rely on aggregate-level information, such as climatological variables at the county or regional level, a limitation of the regression-based forecasts is their inability to incorporate farm-level characteristics such as managerial skills or soil characteristics. However, regression-based forecasts can become useful when farm-level data are lacking, which is prevalent in many cases, especially in yield forecasting in developing countries.

2.2. Crop yield forecasting using remote sensing

There have been numerous studies documenting the correlation between NDVI and crop yield forecasts, at the national (Maselli and Rembold, 2002), regional, county (Bolton and Friedl, 2013), and field level (Ferencz et al., 2004). Tucker (1979) determined that a time-integrated NDVI is strongly correlated with crop yields when the vegetation is at the maximum level of greenness. Some studies focus on intra-annual variability, showing how the correlation between the vegetation index and crop yields varies by the crop cycle and planting date (Basnyat et al., 2004). These studies suggest choosing NDVI data over a specific period for each type of crop in order to produce better forecasts. The weekly availability of NDVI data makes this crop-specific specification achievable. Lv (2014) suggests using early May NDVI and the change in NDVI over the crop planting and harvesting season for the most accurate yield forecasting. Johnson (2014) finds that crop yields are highly correlated with NDVI and daytime land surface temperature. The author conducts a regression of crop yields on NDVI for every week of the growing season and finds that the correlation peaks at the beginning of August.

In addition to NDVI derived from the National Aeronautics and Space Administration’s (NASA) Earth Observing System (EOS) Moderate Resolution Imaging Spectroradiometer, called eMODIS, other indexes and images have been used. For example, Doraiswamy and Cook (1995) is one of the earliest studies that used Advanced Very High Resolution Radiometer (AVHRR) imagery. AVHRR data are coarser, whereas eMODIS data are finer; AVHRR data are available for an extended period, whereas eMODIS data are only available after 2000. Later, Ferencz et al. (2004) also used AVHRR and a vegetation index called the general yield unified reference index. Bolton and Friedl (2013) suggest incorporating crop phenology and using a combination of the EVI2 (two-band enhanced vegetation index), NDVI, and normalized difference water index (NDWI) for crop yield forecasting. They distinguish between semiarid and non-semiarid areas. They find that vegetation indexes are the best type of indexes for predicting in non-semiarid areas, whereas the NDWI is the best index for prediction in semiarid areas, because the water index is sensitive to irrigation in these semiarid areas.

Instead of using traditional statistical models, Bose et al. (2016) utilize spiking neural networks from machine learning to analyze a remote sensing spatiotemporal relationship. Their work focuses on finding the optimum number of variables (or “features” in machine learning) to be included in regression analysis using machine learning techniques. They find that this type of prediction can be made 6 weeks before harvest with an average accuracy of 95.64%. They find that the year 2002 had the largest forecast error because of the 2002 drought. Adrian (2012) applies the Bayesian hierarchical model. This model is suitable for modeling data with clusters. It produces unique estimates for each state while requiring the estimates from each state to also follow a prior distribution. Johnson et al. (2016) focus on comparing forecast performance using linear versus nonlinear machine learning techniques and find that nonlinear models are not necessarily advantageous compared with linear models. Li et al. (2007) find that neural network techniques improve corn predictions compared with multivariate analysis. Kaul, Hill, and Walthall (2005) find that a nonlinear model only outperforms the linear model for barley. Mkhabela et al. (2011) categorize the Census Agricultural Regions (CARs) into three distinct agroclimatic zones; however, even within CARs, there might be multiple soil types. Bolton and Friedl (2013) emphasize the importance of delineating the boundary between farmland and nonfarmland, such as grassland and forests, because nonfarmland may contaminate the NDVI–crop yield relationship. 
Delineation can be done by using a land cover map such as Landsat Thematic Mapper (TM) data (Bolton and Friedl, 2013). Another method of delineation is to identify single pixels as agricultural or nonagricultural vegetation using statistical correction analysis (Maselli and Rembold, 2002). Among those studies, there are soybean forecasts in the United States using remote sensing (Lobell and Asner, 2003; Prasad et al., 2006). Chang et al. (2007) focus on using NDVI to map corn and soybean farmland.

Fieuzal, Sicre, and Baup (2017) make corn yield forecasts using both a real-time approach and a diagnostic approach. The real-time approach updates the estimates dynamically after the newest image is acquired, whereas the diagnostic approach utilizes all the image data throughout the season. The authors find that the best estimates from the two approaches perform comparably. Burke and Lobell (2017) regress the agreement between satellite-based yields and field-reported yields on farm size and find that the vegetation index predicts crop yield most accurately when the field size is large.

All of the abovementioned studies employ a global model to produce the regression results that fit all observations, with the major difference among the studies being the specific model they use. To the best of our knowledge, this study is the first one to employ models that produce site-specific regression results, allowing heterogeneous responses of soybean yields across counties. This is also the first study to our knowledge that applies the FFT model to examine the yield-NDVI relationship.

3. Data and methods

3.1. Data

We use data for 797 counties from 10 major soybean-producing states in the United States from 2000 to 2016. According to NASS, the soybean production from these 10 states accounted for 78.5% (in 2016) and 79.8% (2000–2016 average) of the total soybean production in the United States (see Table 1 for soybean production and yield by state). Mkhabela et al. (2011) state that if a crop is not the dominant crop in the region, NDVI would give a poor prediction of crop yield because it cannot distinguish between different crops. The soybean yield data are obtained from the USDA-NASS QuickStats (https://quickstats.nass.usda.gov [accessed December 1, 2017]). This database provides official published aggregate statistics on U.S. soybean yields and the value of soybean production. Soybean yield is measured in bushels per acre. The NDVI data we use are from eMODIS onboard NASA’s EOS Terra satellite. Landsat TM and eMODIS are two mainstream imagery sources. Though Landsat TM has a better spatial resolution (30 m) than eMODIS (250 m), the latter provides a better temporal resolution (daily) than the former (16-day cycle). For monitoring purposes, we chose the eMODIS data. The eMODIS instrument onboard the Terra satellite achieves global coverage on a daily basis and provides 7-day composited data sets for its suite of products. Each data set provides NDVI information in GeoTIFF format that contains the reflective indices captured by the Terra satellite at a resolution of 250 m from 2000 onward. Ag-Analytics converts the 250-m-resolution raw images to county-level NDVI. Ag-Analytics is an open-source, open-access database that provides data on agricultural finance, environmental finance, insurance, and risks (Woodard, 2016). We calculate county-level monthly NDVI values by taking a monthly average of the weekly NDVI values provided by Ag-Analytics. 
Climatological data are obtained from PRISM (parameter-elevation regressions on independent slopes model) Climate Data from Oregon State University and Ag-Analytics. We include two weather variables: maximum temperature over a month and average monthly precipitation. County boundary shapefiles are obtained from the U.S. Census Bureau. We obtain a sample of 12,027 county-year observations for the FFT analysis.
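The monthly averaging of weekly county-level NDVI described above could be sketched as follows. This is a minimal illustration using the Python standard library; the record layout, the county identifier, and the rule that each weekly composite is assigned to the month of its date are assumptions, not details taken from Ag-Analytics.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_mean_ndvi(weekly):
    """Average weekly (county, date, ndvi) records into (county, year, month) means."""
    buckets = defaultdict(list)
    for county, day, value in weekly:
        # Assumption: a weekly composite belongs to the month of its date.
        buckets[(county, day.year, day.month)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical weekly records for one county (illustrative values only).
weekly = [
    ("39049", date(2016, 7, 3), 0.70),
    ("39049", date(2016, 7, 10), 0.74),
    ("39049", date(2016, 7, 17), 0.78),
    ("39049", date(2016, 8, 7), 0.80),
]
monthly = monthly_mean_ndvi(weekly)
print(monthly)
```

Each resulting (county, year, month) value would then enter the regression as one of the NDVI_m variables.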

Table 1. Soybean production and yield in 10 major producing states

^a Soybean production is measured in 1,000 bushels. Soybean yield is measured in bushels/acre.

3.2. Flexible Fourier transform model

When estimating crop yield response to input variables, traditional models use regional and temporal dummies to capture spatial and intertemporal heterogeneity. Adding dummy variables can only capture the difference in the value of the dependent variable across locations and time; it does not take into account how the relationship varies according to site-specific and time-specific characteristics. Another type of model uses a quadratic functional form to estimate the relationship between crop yield and weather variables, assuming that crop yield is nonlinearly related to the weather variable. However, these models may suffer from model misspecification, especially if there is a threshold effect driven by environmental risks such as drought and flooding (Cooper, Nam Tran, and Wallander, 2017).

Gallant (1984) first proposed the flexible Fourier functional transform to generate an unbiased production function approximation and proved its mathematical validity. Cooper, Nam Tran, and Wallander (2017) applied an FFT function to estimate the relationship between crop yield and temperature. We follow the approach and modeling in Cooper, Nam Tran, and Wallander (2017) for the flexible Fourier function, which can be presented as follows:

$${\rm{Soybean\,yield}} = {\beta _0} + \sum\limits_{m = April}^{August} \left( {\beta _{1m}}MaxTem{p_m} + {\beta _{2m}}MaxTempSquar{e_m} \right) + \sum\limits_{m = April}^{August} \left( {\beta _{3m}}Precipitatio{n_m} + {\beta _{4m}}PrecipitationSquar{e_m} \right) + \sum\limits_{m = April}^{September} {\beta _{5m}}NDV{I_m} + {\delta _0}TimeTrend + \sum\limits_{s = 1}^9 {\delta _s}StateDumm{y_s} + 2\sum\limits_{\alpha = 1}^A \sum\limits_{j = 1}^J \left\{ {v_{j\alpha }}\cos \left[ jk_\alpha ^{'}s\left( NDVI \right) \right] - {w_{j\alpha }}\sin \left[ jk_\alpha ^{'}s\left( NDVI \right) \right] \right\} + error$$ (1)

In this model, the dependent variable is soybean yield in a county for a given year. β_0 is the constant term. MaxTemp_m, Precipitation_m, and NDVI_m are the maximum temperature, the average precipitation, and the average NDVI in month m, respectively. We include the weather variables from April to August, following the standard specification in the literature (Cooper, Nam Tran, and Wallander, 2017). We include NDVI variables through September, following the remote sensing literature (Li et al., 2007). The advantage of the FFT function is that it not only allows for model flexibility but also incorporates multivariate estimation, which is difficult to achieve through other nonparametric models such as kernel regression.

MaxTempSquare_m and PrecipitationSquare_m are the squares of MaxTemp_m and Precipitation_m, respectively. TimeTrend equals the year minus 1999. StateDummy_s is the state dummy variable. NDVI is a vector with each element being NDVI_m. s(NDVI) is the scaled version of NDVI such that each element of s(NDVI) lies in the range [0, 2π]. In our case, only NDVI variables are transformed.
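The text does not state how s(NDVI) is constructed. One simple possibility, shown here purely as an assumption, is a min-max rescaling of each monthly NDVI series onto [0, 2π]; the function name `scale_to_0_2pi` and the sample values are hypothetical.

```python
import math


def scale_to_0_2pi(values, upper=2 * math.pi):
    """Min-max rescale a list of numbers onto [0, upper].

    Assumption: min-max scaling is one way to satisfy the range requirement;
    the article does not specify its scaling rule.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant series")
    return [upper * (v - lo) / (hi - lo) for v in values]


# Hypothetical July NDVI values for four county-years.
july_ndvi = [0.55, 0.62, 0.71, 0.80]
scaled = scale_to_0_2pi(july_ndvi)
print([round(v, 3) for v in scaled])
```

In practice the minimum and maximum would be taken from the estimation sample and reused unchanged when scaling out-of-sample observations, so that training and prediction data share the same transformation.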

The ${\beta _0} + \sum\nolimits_{m = April}^{August} {({\beta _{1m}}MaxTem{p_m} + {\beta _{2m}}MaxTempSquar{e_m})} + \sum\nolimits_{m = April}^{August} {({\beta _{3m}}Precipitatio{n_m} + {\beta _{4m}}PrecipitationSquar{e_m})} + \sum\nolimits_{m = April}^{September} {({\beta _{5m}}NDV{I_m})}$ terms represent the quadratic regression part. ${\beta _{1m}}$, ${\beta _{2m}}$, ${\beta _{3m}}$, ${\beta _{4m}}$, and ${\beta _{5m}}$ are parameters to be estimated. The $2\sum\nolimits_{\alpha = 1}^A {} \sum\nolimits_{j = 1}^J {\{ {{v_{j\alpha}}\cos [ {jk_\alpha ^{'}s(NDVI)} ] - {w_{j\alpha }}\sin [ {jk_\alpha ^{'}s(NDVI)} ]} \}} $ term models the functional flexibility using FFT. Similar to the Taylor expansion, which uses a series of polynomial terms to approximate the true function, the Fourier function uses a series of trigonometric terms to approximate the true function. The Fourier functional form is believed to be the only known functional form that satisfies the Sobolev condition, meaning that the difference between the approximated function and the true function approaches zero as the sample size becomes arbitrarily large. For a proof that the Fourier function satisfies the Sobolev condition, refer to Gallant (1994). In the model, k_α (α = 1, 2, …, A) is the elementary multi-index vector, whose dimension equals the dimension of x_FFT (the vector of Fourier-transformed variables, here s(NDVI)), whereas A is the total number of elementary multi-indexes. The vector k_α can be obtained in the following way: first, exhaust the list of k_α, such that k_α has only integer elements and the sum of the absolute value of each element in k_α is no greater than K, where K is predetermined; second, delete any k_α whose first nonzero element is negative; and third, delete any k_α whose elements have a common integer divisor greater than 1. Monahan (1981) introduced a Fortran code to produce the set of elementary multi-index vectors. Also in the model, J is the order of the Fourier transformation, whereas v_jα and w_jα are parameters to be estimated. 
We use the following parametrization: K = 2, J = 2, which are chosen such that the rule of thumb (the number of variables after transformation is roughly the square root of the number of observations; Fenton and Gallant, 1996) is satisfied. Because there are 12,027 observations in the data we use, we include a total of 120 variables after adding the transformed NDVI variables.
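The three-step enumeration of elementary multi-index vectors, and the trigonometric regressors built from them, can be sketched in Python as below. This is an illustrative reading of the steps in the text, not the Fortran routine of Monahan (1981) or the authors' Matlab code; the function names are ours, "common integer divisor" is interpreted as a divisor greater than 1, and the sketch does not attempt to reproduce the number of transformed variables reported in the article.

```python
import math
from functools import reduce
from itertools import product


def elementary_multi_indexes(dim, K):
    """List integer vectors k with 0 < sum(|k_i|) <= K, first nonzero element
    positive, and no common divisor greater than 1 among the elements."""
    kept = []
    for k in product(range(-K, K + 1), repeat=dim):
        norm = sum(abs(x) for x in k)
        if norm == 0 or norm > K:
            continue  # step 1: keep only nonzero vectors within the bound K
        if next(x for x in k if x != 0) < 0:
            continue  # step 2: drop vectors whose first nonzero element is negative
        if reduce(math.gcd, (abs(x) for x in k)) > 1:
            continue  # step 3: drop vectors with a common divisor above 1
        kept.append(k)
    return kept


def fourier_terms(s, multi_indexes, J):
    """Return cos(j * k's) and sin(j * k's) for every k and every j = 1..J."""
    terms = []
    for k in multi_indexes:
        inner = sum(ki * si for ki, si in zip(k, s))  # k's, the inner product
        for j in range(1, J + 1):
            terms.append(math.cos(j * inner))
            terms.append(math.sin(j * inner))
    return terms


# Small example with two scaled variables (hypothetical values), K = 2, J = 2.
ks = elementary_multi_indexes(dim=2, K=2)
terms = fourier_terms(s=[0.5, 1.0], multi_indexes=ks, J=2)
print(ks, len(terms))
```

The minus sign in front of the sine terms in equation (1) is absorbed into the estimated coefficient, so the regressors themselves are simply the cosine and sine values.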

The model reduces to the traditional OLS model when v_jα = 0 and w_jα = 0. In the following discussion, the OLS model refers to equation (1), with v_jα = 0 and w_jα = 0 imposed. By testing the joint statistical significance of the parameters v_jα and w_jα, we can decide whether the traditional quadratic model should be rejected in favor of the more flexible FFT model.

A review of the relevant literature shows that the FFT model has been applied in a range of fields and settings. Chang et al. (2016) used the FFT to model the nonlinear effect of temperature on electricity demand. Becker, Enders, and Lee (2006) proposed a unit root test with a Fourier functional transform. Enders and Li (2015) approximated structural breaks in U.S. GDP trends using Fourier forms. Jones and Enders (2014) provided a summary on using Fourier forms to model structural breaks.

3.3. Prediction and forecast

We compare the prediction performance of the FFT model versus the OLS model. We conduct out-of-sample predictions and evaluate prediction performance by comparing prediction errors measured by the root-mean-square error (RMSE) and the mean absolute error (MAE), between FFT and OLS, for three schemes: time-series prediction, cross-sectional prediction, and panel prediction. RMSE and MAE are defined as follows:

$${\rm{RMSE}} = \sqrt {{1 \over {\rm{N}}}\sum\limits_{i = 1}^{\rm{N}} {{{({y_i} - {{\hat y}_i})}^2}} } .$$ (2)

$${\rm{MAE}} = {1 \over {\rm{N}}}\sum\limits_{i = 1}^{\rm{N}} {|{y_i} - {{\hat y}_i}|} .$$ (3)

Both RMSE and MAE are commonly used measures to evaluate prediction performance. They measure the difference between true and fitted values for soybean yield. The unit for both RMSE and MAE is bushels per acre. In a time-series prediction, we first select a year for prediction, then we use observations from all other years to generate the model, and after that we predict the soybean yield for the selected year using the fitted model, weather data, and NDVI data from the selected year. In cross-sectional prediction, similarly, we select a state for prediction, then we use observations from all other states to generate the model, and after that we predict the soybean yield for the selected state using the fitted model, weather data, and NDVI data from the selected state; in panel prediction, similarly, we make the prediction for a selected year and state. Though commonly used, a shortcoming of using RMSE or MAE to measure prediction performance is that we do not know whether the predicted yield overestimates or underestimates the final actual yield.
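Equations (2) and (3) translate directly into code. The sketch below also includes a signed mean error, which is not used in the article and is added here only to illustrate the shortcoming just mentioned: RMSE and MAE discard the direction of the error. The yield values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error, equation (2)."""
    n = len(actual)
    return math.sqrt(sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error, equation (3)."""
    n = len(actual)
    return sum(abs(y - yhat) for y, yhat in zip(actual, predicted)) / n


def mean_error(actual, predicted):
    """Signed average of (predicted - actual); positive means overprediction.
    Not part of the article's evaluation; shown for illustration only."""
    n = len(actual)
    return sum(yhat - y for y, yhat in zip(actual, predicted)) / n


# Hypothetical yields in bushels per acre.
actual = [40.0, 45.0, 50.0]
predicted = [42.0, 44.0, 47.0]
print(rmse(actual, predicted), mae(actual, predicted), mean_error(actual, predicted))
```

RMSE penalizes large misses more heavily than MAE because errors are squared before averaging, which is why the two measures can rank models differently.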

We make predictions and forecasts using the regression results from the models. In this study, prediction refers to cases where data from later years may be used to predict the yield for a given year; forecast refers to cases where only data available up to a given year are used to predict the yield for that year.
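The difference between the two schemes comes down to which observations enter the training sample. The following sketch returns row indexes for each scheme; the function names are hypothetical, and the forecast rule (train strictly on years before the target year) is our reading of the definition above rather than a documented detail of the authors' procedure.

```python
def leave_one_year_out(years, target):
    """Prediction scheme: train on every year except the target year."""
    train = [i for i, y in enumerate(years) if y != target]
    test = [i for i, y in enumerate(years) if y == target]
    return train, test


def forecast_split(years, target):
    """Forecast scheme (assumed): train only on years before the target year."""
    train = [i for i, y in enumerate(years) if y < target]
    test = [i for i, y in enumerate(years) if y == target]
    return train, test


# Year of each county-year observation in a tiny hypothetical sample.
years = [2013, 2014, 2015, 2016, 2014, 2016]
print(leave_one_year_out(years, 2014))
print(forecast_split(years, 2015))
```

The cross-sectional scheme follows the same pattern with a state label in place of the year.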

4. Results

4.1. Descriptive analysis

The descriptive statistics for the main variables are reported in Table 2. The average soybean yield across all states and years is 43.11 bushels per acre. From April to July, the average maximum temperature and average NDVI increase steadily and reach their peak levels in August. The average precipitation is highest in the months of May and June. These variables are included as suggested by the modified Thompson model (Thompson, 1963) to account for weather effects.

Table 2. Descriptive statistics

^a Temperatures are measured in degrees Celsius; precipitation is measured in inches.

^b Negative normalized difference vegetation index (NDVI) denotes snow cover.

4.2. Flexible Fourier transform regression results

All FFT models were developed using Matlab R2017a (The MathWorks Inc.), following the methodology in Cooper, Nam Tran, and Wallander (2017). Figures showing FFT results were made using the ArcMap 10.3 software. The estimation results from the model incorporating FFT terms are reported in Table 3. Because of the substantial number of variables (including 84 transformed NDVI variables), we only report the results for the main variables, including the untransformed weather variables and NDVI variables. However, the rest of the transformed variables are also included in the model-fitting process. We calculate elasticities by applying the mean value theorem to get the numerical approximation of the derivatives and fixing the values of independent variables at the median value for each variable for each county. Thus, we obtain an elasticity estimate for each county. We present the minimum, median, and maximum of FFT elasticity estimates across counties in columns 2 through 4 in Table 3. For comparison purposes, we also use the OLS regression results to calculate elasticity estimates for each county and report the elasticity summary from the OLS regression in columns 5 through 7 in Table 3. The OLS model refers to equation (1) with v_jα = 0 and w_jα = 0 imposed. For the weather variables, except for the July maximum temperature and the April average precipitation, the median of elasticity estimates derived from OLS and the median of elasticity estimates from FFT have the same sign. On average, higher temperatures from April to June and higher precipitation levels from June to August lead to higher soybean yields. On the other hand, higher temperatures in August and higher precipitation levels in May are associated with lower soybean yields.
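A numerical elasticity of the kind described above can be approximated with a finite difference. The sketch below uses a central difference, which is an assumption on our part since the article does not give its exact formula, and a toy response function that stands in for a fitted model; the evaluation point plays the role of a county's median values.

```python
import math


def elasticity(f, x, index, rel_step=1e-4):
    """Approximate (df/dx_index) * x_index / f(x) with a central difference."""
    h = rel_step * max(abs(x[index]), 1.0)
    up = list(x)
    up[index] += h
    down = list(x)
    down[index] -= h
    slope = (f(up) - f(down)) / (2 * h)
    return slope * x[index] / f(x)


def toy_yield(x):
    """Hypothetical fitted response (not the article's model): linear in July
    NDVI with a trigonometric term in August NDVI."""
    ndvi_july, ndvi_aug = x
    return 20.0 + 25.0 * ndvi_july + 5.0 * math.sin(2.0 * ndvi_aug)


# Hypothetical median NDVI values for one county.
medians = [0.8, 0.7]
e_july = elasticity(toy_yield, medians, index=0)
e_aug = elasticity(toy_yield, medians, index=1)
print(round(e_july, 4), round(e_aug, 4))
```

Because the trigonometric terms make the slope depend on the evaluation point, repeating this calculation at each county's medians yields a different elasticity per county, which is the source of the minimum, median, and maximum columns in Table 3.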

Table 3. Elasticity estimates from flexible Fourier transform (FFT) and quadratic ordinary least squares (OLS) models

Notes: Because of the nonlinearity of the FFT regression, we report the elasticity estimates rather than the coefficient estimates of the main variables. Significance here indicated by asterisks corresponds to the significance of the untransformed variables. Asterisks (*, **, and ***) denote significance level of 10%, 5%, and 1%, respectively. In addition to these variables, an additional 84 Fourier transformed variables of normalized difference vegetation index (NDVI) are included in the analysis—their coefficient estimates are not reported here, but they are included in the elasticity calculations.

Although the median of elasticity estimates for weather variables across counties is very similar between the FFT and OLS results, the median elasticity estimates of NDVI variables differ significantly between the FFT and OLS results, in terms of both sign (September NDVI) and magnitude (April–August NDVI). NDVI elasticities estimated from the FFT model have a wider range than those generated by OLS, because of the inclusion of the transformed NDVI variables. The OLS results suggest that the August NDVI has a greater impact on soybean yields than the July NDVI, whereas the FFT results suggest the opposite. According to Table 3, when the July NDVI increases by 10%, the median soybean yield significantly increases by 4.5% or 1.94 bushels per acre. The median effect of August NDVI is also positive, though not significant.

By testing the significance of the coefficient estimates for the Fourier terms, we can test whether the FFT specification is overfitting the data. In Table 3, we present an F test of the FFT regression versus the OLS regression; we find that the coefficients on the transformed Fourier terms are jointly significantly different from zero, and thus the OLS is rejected in favor of the FFT regression.
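
A joint test of this kind is usually computed as a nested-model F statistic from the residual sums of squares of the restricted (OLS) and unrestricted (FFT) regressions. The sketch below shows that standard formula only. Every input value is made up for illustration, and treating the 84 transformed NDVI variables as exactly the restricted terms is our assumption.

```python
def nested_f_statistic(rss_restricted, rss_unrestricted, n_obs, k_unrestricted, n_restrictions):
    """F statistic for H0: the extra coefficients in the larger model are all zero."""
    numerator = (rss_restricted - rss_unrestricted) / n_restrictions
    denominator = rss_unrestricted / (n_obs - k_unrestricted)
    return numerator / denominator

# Hypothetical inputs; none of these values are reported in the paper.
f_stat = nested_f_statistic(
    rss_restricted=5200.0,    # residual sum of squares, OLS (made up)
    rss_unrestricted=4800.0,  # residual sum of squares, FFT (made up)
    n_obs=10000,              # number of observations (made up)
    k_unrestricted=120,       # number of coefficients in the FFT model (made up)
    n_restrictions=84,        # assumed: the 84 transformed NDVI terms set to zero under OLS
)
print(round(f_stat, 2))
```

Under the null hypothesis the statistic follows an F distribution with (n_restrictions, n_obs − k_unrestricted) degrees of freedom, so a large value leads to rejection of the restricted model.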

The geographic distribution of the elasticity estimates from FFT is presented in Figure 1. The panels show the geographic distribution across counties of the median NDVI elasticity estimates for April, May, June, July, August, and September, respectively. For some counties in the north of North Dakota, central Minnesota, central Indiana, western Arkansas, and southwestern Missouri, soybean yields are highly responsive to July NDVI, but less responsive to August NDVI. For most counties in Ohio and in eastern Arkansas, in contrast, the soybean yield is responsive to August NDVI, whereas it is less responsive to July NDVI. For some counties in the western parts of North Dakota and South Dakota, soybean yields are responsive to April NDVI, whereas they are less responsive to August NDVI. These geographic differences in soybean yield responsiveness to NDVI show that global flexibility needs to be considered when making yield predictions.

Figure 1. Geographic distribution by state of elasticity estimates from flexible Fourier transform, April–September. Note: NDVI, normalized difference vegetation index.

4.3. Prediction and forecast results

The results of the time-series prediction and cross-sectional prediction performance for FFT versus OLS are shown in Table 4. The bolded numbers show cases where the FFT error is lower than the OLS error. On average, FFT performs better than OLS in time-series predictions because both MAE and RMSE for FFT are lower than those for the OLS model. For cross-sectional predictions, FFT has a higher RMSE on average, but a lower MAE than OLS does.

Table 4. Out-of-sample prediction performance: time-series and cross-sectional prediction

Notes: Bolded numbers indicate that flexible Fourier transform (FFT) has lower prediction errors and therefore outperforms ordinary least squares (OLS). MAE, mean absolute error; RMSE, root-mean-square error.
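
The two error measures can rank models differently because RMSE penalizes large individual errors more heavily than MAE does. The following sketch uses made-up yields and predictions (labelled `pred_a` and `pred_b`, which do not correspond to any model in the paper) to show how one model can have a lower MAE and a higher RMSE at the same time.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [40.0, 40.0, 40.0, 40.0]  # made-up yields
pred_a = [42.0, 42.0, 42.0, 42.0]  # uniformly off by 2
pred_b = [40.0, 40.0, 40.0, 46.0]  # exact three times, off by 6 once

print(mae(actual, pred_a), rmse(actual, pred_a))  # 2.0 2.0
print(mae(actual, pred_b), rmse(actual, pred_b))  # 1.5 3.0
```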

Our results show that time-series predictions are, on average, more accurate than cross-sectional predictions: RMSE and MAE from time-series predictions are consistently lower than those from cross-sectional predictions.

We also conduct out-of-sample panel predictions. We randomly select 1,000 observations from all years and states and predict their soybean yields by OLS and FFT, using all remaining observations as the training sample. We then compare the predicted soybean yields with the actual yields and calculate the RMSE and MAE. We repeat this sampling process 200 times. Figure 2 shows histograms of the resulting RMSE and MAE distributions. Two findings are interesting. First, panel prediction has a much lower prediction error than both the time-series and cross-sectional predictions in Table 4. This suggests that when predicting soybean yield for a certain location, it is useful to include the already published yield data from other locations in the training sample. Second, FFT has a consistently lower prediction error than the OLS model. FFT can improve the prediction performance by a modest 0.3% according to MAE, or 0.4% according to RMSE. This percentage is obtained by dividing the prediction error by the mean of crop yield (average MAE is 0.138, average RMSE is 0.1684, and mean soybean yield is 43.11).

Figure 2. Histogram of root-mean-square error (RMSE) and mean absolute error (MAE) between ordinary least squares (OLS) and flexible Fourier transform (FFT).
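
The repeated random holdout procedure behind Figure 2 can be sketched as below. This is a simplified illustration under stated assumptions: the data are synthetic, the model is a one-variable OLS line rather than the paper's OLS or FFT specification, and the sample sizes are scaled down from the 1,000 held-out observations and 200 repetitions used in the paper. The helper names `fit_line` and `repeated_holdout` are ours.

```python
import math
import random

def fit_line(xs, ys):
    """Closed-form simple OLS: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def repeated_holdout(xs, ys, n_holdout, n_repeats, seed=0):
    """Return one (MAE, RMSE) pair per random train/test split."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    results = []
    for _ in range(n_repeats):
        test = set(rng.sample(indices, n_holdout))
        train = [i for i in indices if i not in test]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [ys[i] - (a + b * xs[i]) for i in test]
        mae = sum(abs(e) for e in errors) / n_holdout
        rmse = math.sqrt(sum(e * e for e in errors) / n_holdout)
        results.append((mae, rmse))
    return results

# Synthetic stand-in data (not the paper's): yield rises with NDVI plus noise.
data_rng = random.Random(1)
ndvi = [data_rng.uniform(0.3, 0.9) for _ in range(500)]
yields = [20 + 30 * x + data_rng.gauss(0, 2) for x in ndvi]
scores = repeated_holdout(ndvi, yields, n_holdout=50, n_repeats=20)
```

Each entry of `scores` corresponds to one draw, so plotting the first and second elements separately would give MAE and RMSE histograms of the kind the figure describes.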

The predictions so far may have used data from future periods to predict current soybean yields. Therefore, we now include forecasts where soybean yield predictions are based only on data from previous periods (Table 5). FFT outperforms OLS in 10 of 16 years according to RMSE and in 12 of 16 years according to MAE. In terms of average error, FFT has a smaller RMSE and MAE than OLS does. Although the forecasts are more realistic because they rely only on data from previous periods, their average errors are, unsurprisingly, higher than those of the predictions in Table 4, which use all data, including data from future periods.

Table 5. Out-of-sample forecast performance

Notes: Bolded numbers indicate that flexible Fourier transform (FFT) has lower forecast errors and therefore outperforms ordinary least squares (OLS). MAE, mean absolute error; RMSE, root-mean-square error.
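
The difference between these forecasts and the earlier predictions is the training sample: each year is predicted from strictly earlier years only. A minimal sketch of such an expanding-window scheme follows. The data are synthetic, the model is a one-variable OLS line standing in for either specification, and whether the paper uses an expanding or a fixed-length window is not stated here, so the expanding window is an assumption.

```python
def fit_line(xs, ys):
    """Closed-form simple OLS: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def expanding_window_forecasts(records):
    """records: (year, x, y) tuples. Each year is forecast from strictly earlier years."""
    years = sorted({year for year, _, _ in records})
    forecasts = {}
    for target in years[1:]:  # the first year has no earlier data to train on
        train = [(x, y) for year, x, y in records if year < target]
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        forecasts[target] = [(y, a + b * x) for year, x, y in records if year == target]
    return forecasts

# Synthetic stand-in panel (not the paper's data): 5 years, 3 observations per year,
# with a yield trend of 0.5 per year that earlier years cannot fully anticipate.
records = [
    (2000 + t, 0.4 + 0.1 * j + 0.01 * t, 20 + 30 * (0.4 + 0.1 * j + 0.01 * t) + 0.5 * t)
    for t in range(5)
    for j in range(3)
]
forecasts = expanding_window_forecasts(records)  # {year: [(actual, forecast), ...]}
```

Because the first year cannot be forecast, a scheme like this yields one fewer forecast year than there are years of data.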

We also conducted a panel model regression, which included county fixed effects, to exploit the within variation of the data. Compared with the models without county fixed effects (Table 4), the models with county fixed effects (Table 6) show lower prediction errors in time-series prediction and higher errors in cross-sectional prediction. With county fixed effects, OLS has smaller prediction errors than the FFT model when the prediction is cross-sectional. For time-series prediction with county fixed effects, however, FFT outperforms OLS in 10 (8) of 17 years in terms of a smaller mean MAE (RMSE). Overall, the use of county fixed effects exploited the within variation and improved prediction over time, but worsened cross-sectional prediction.

Table 6. Out-of-sample prediction performance with county fixed effects: time-series and cross-sectional prediction
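
County fixed effects are commonly estimated with the within transformation, in which every variable is demeaned by county before the regression is run. The sketch below illustrates that idea on a made-up two-county panel with a single regressor. It is not the authors' estimation code, and the function names are ours.

```python
from collections import defaultdict

def demean_by_group(values, groups):
    """Subtract each group's mean (the within transformation)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

def within_slope(x, y, groups):
    """Slope from regressing group-demeaned y on group-demeaned x."""
    xd, yd = demean_by_group(x, groups), demean_by_group(y, groups)
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)

# Made-up panel: two counties with different baseline yields but a common NDVI slope of 30.
county = ["A", "A", "A", "B", "B", "B"]
ndvi = [0.5, 0.6, 0.7, 0.5, 0.6, 0.7]
yields = [35.0, 38.0, 41.0, 45.0, 48.0, 51.0]
slope = within_slope(ndvi, yields, county)
print(round(slope, 6))
```

The demeaning removes the 10-bushel difference in baseline yield between the two counties, so the slope is identified only from variation within each county over time.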

According to the Crop Production report (USDA-NASS, 2016), the root-mean-square percentage error (RMSPE) of AYS/OYS forecasts for soybeans was 6.6% in 2016. In comparison, the RMSPE of our FFT model forecasts in 2016 using time-series prediction was 9.29%. Although the RMSPE from our FFT model is greater than that from the USDA survey forecasts, FFT model forecasts can save substantial labor and survey costs.
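
For reference, RMSPE expresses each error relative to the actual value before squaring and averaging. The sketch below assumes this common definition; the exact formula used in the USDA report or in the paper may differ in detail, and the numbers are made up.

```python
import math

def rmspe(actual, predicted):
    """Root-mean-square percentage error, in percent."""
    ratios = [((p - a) / a) ** 2 for a, p in zip(actual, predicted)]
    return 100 * math.sqrt(sum(ratios) / len(ratios))

actual = [50.0, 40.0]    # made-up yields
forecast = [55.0, 36.0]  # each forecast is off by 10% of the actual value
print(round(rmspe(actual, forecast), 6))
```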

5. Conclusions

In this study, we used FFT to account for global flexibility in the relationship between NDVI throughout the growing season and soybean yield. We produced county-specific coefficients and elasticities of soybean yield with respect to NDVI. We found that the response of soybean yield to NDVI differs across locations. For some counties located in the northern states, soybean yield is strongly and positively related to the July NDVI, whereas for other counties located in the south, the August NDVI is a better indicator of the soybean yield. Traditional OLS models seem to underestimate the response of soybean yield to July and August NDVI.

Furthermore, we conducted out-of-sample predictions and forecasts and compared the performance of the OLS and FFT models. On average, time-series predictions and forecasts from the FFT model have lower prediction errors than those from the OLS model.

A limitation of this work is that it does not distinguish pixels of soybean crops from those of other crops or vegetation types. Nevertheless, incorporating NDVI in the model still results in significant coefficients and an improved fit. Compared with OLS, globally flexible models may partly alleviate the contamination caused by other crops, because a flexible and nonlinear relationship between NDVI and yield can capture the heterogeneous ratios of soybean land to total land across counties. Future work that applies land cover filters to select pixels that are highly likely to be soybean crops may improve the results even further.

This study uses data from the 10 major soybean-producing states in the United States for which data are readily available. Our results show that using the FFT model helps improve prediction accuracy (that is, it lowers the prediction error), especially in panel predictions. The goal is to improve the forecast accuracy of soybean yield so that market participants can make more informed decisions about anticipated crop yields and the resulting prices. The FFT model also has the potential to forecast crop yields in less developed countries where ground fieldwork is too expensive to conduct or where the meteorological network is sparse, which makes it a feasible alternative for crop yield prediction.

Author ORCIDs

Ani L. Katchova 0000-0002-7307-4073

Acknowledgements

We thank Joseph Cooper for sharing code with us and Joshua Woodard for sharing NDVI data with us. We thank two anonymous referees and the editor. All errors are our own.

References

Adrian, D. “A Model-Based Approach to Forecasting Corn and Soybean Yields.” Paper presented at the Fourth International Conference on Establishment Surveys, Montréal, Québec, Canada, June 11–14, 2012.

Basnyat, P., McConkey, B., Meinert, B., Gatkze, C., and Noble, G. “Agriculture Field Characterization Using Aerial Photograph and Satellite Imagery.” IEEE Geoscience and Remote Sensing Letters 1, 1(2004):7–10.

Becker, R., Enders, W., and Lee, J. “A Stationarity Test in the Presence of an Unknown Number of Smooth Breaks.” Journal of Time Series Analysis 27, 3(2006):381–409.

Bolton, D.K., and Friedl, M.A. “Forecasting Crop Yield Using Remotely Sensed Vegetation Indices and Crop Phenology Metrics.” Agricultural and Forest Meteorology 173, (May 2013):74–84.

Bose, P., Kasabov, N.K., Bruzzone, L., and Hartono, R.N. “Spiking Neural Networks for Crop Yield Estimation Based on Spatiotemporal Analysis of Image Time Series.” IEEE Transactions on Geoscience and Remote Sensing 54, 11(2016):6563–73.

Burke, M., and Lobell, D.B. “Satellite-Based Assessment of Yield Variation and Its Determinants in Smallholder African Systems.” Proceedings of the National Academy of Sciences of the United States of America 114, 9(2017):2189–94.

Chang, J., Hansen, M.C., Pittman, K., Carroll, M., and DiMiceli, C. “Corn and Soybean Mapping in the United States Using MODIS Time-Series Data Sets.” Agronomy Journal 99, 6(2007):1654–64.

Chang, Y., Kim, C.S., Miller, J.I., Park, J.Y., and Park, S. “A New Approach to Modeling the Effects of Temperature Fluctuations on Monthly Electricity Demand.” Energy Economics 60, SC(2016):206–16.

Cooper, J., Nam Tran, A., and Wallander, S. “Testing for Specification Bias with a Flexible Fourier Transform Model for Crop Yields.” American Journal of Agricultural Economics 99, 3(2017):800–17.

Doraiswamy, P.C., and Cook, P.W. “Spring Wheat Yield Assessment Using NOAA AVHRR Data.” Canadian Journal of Remote Sensing 21, 1(1995):43–51.

Enders, W., and Li, J. “Trend-Cycle Decomposition Allowing for Multiple Smooth Structural Changes in the Trend of US Real GDP.” Journal of Macroeconomics 44, SC(2015):71–81.

Fenton, V.M., and Gallant, A.R. “Qualitative and Asymptotic Performance of SNP Density Estimators.” Journal of Econometrics 74, 1(1996):77–118.

Ferencz, C., Bognar, P., Lichtenberger, J., Hamar, D., Tarcsai, G., Timar, G., and Szekely, B. “Crop Yield Estimation by Satellite Remote Sensing.” International Journal of Remote Sensing 25, 20(2004):4113–49.

Fieuzal, R., Sicre, C.M., and Baup, F. “Estimation of Corn Yield Using Multi-Temporal Optical and Radar Satellite Data and Artificial Neural Networks.” International Journal of Applied Earth Observation and Geoinformation 57, (May 2017):14–23.

Gallant, A.R. “The Fourier Flexible Form.” American Journal of Agricultural Economics 66, 2(1984):204–8.

Gallant, A.R. “Identification and Consistency in Semi-Nonparametric Regression.” Advances in Econometrics 1(1994):145–69.

Irwin, S., Good, D., and Sanders, D. “Understanding and Evaluating WAOB/USDA Soybean Yield Forecasts.” Farmdoc Daily 5(May 7, 2015):84.

Irwin, S., Good, D., and Tannura, M. Early Prospects for 2009 Corn Yields in Illinois, Indiana, and Iowa. Urbana: Department of Agricultural and Consumer Economics, University of Illinois at Urbana-Champaign, Marketing and Outlook Brief 09-01, 2009.

Irwin, S.H., Sanders, D.R., and Good, D.L. Evaluation of Selected USDA WAOB and NASS Forecasts and Estimates in Corn and Soybeans. Urbana: Department of Agricultural and Consumer Economics, University of Illinois at Urbana-Champaign, Marketing and Outlook Research Report 2014-01, 2014.

Johnson, D.M. “An Assessment of Pre- and Within-Season Remotely Sensed Variables for Forecasting Corn and Soybean Yields in the United States.” Remote Sensing of Environment 141, (February 2014):116–28.

Johnson, M.D., Hsieh, W.W., Cannon, A.J., Davidson, A., and Bédard, F. “Crop Yield Forecasting on the Canadian Prairies by Remotely Sensed Vegetation Indices and Machine Learning Methods.” Agricultural and Forest Meteorology 218–219, (March 2016):74–84.

Jones, P.M., and Enders, W. “On the Use of the Flexible Fourier Form in Unit Root Tests, Endogenous Breaks, and Parameter Instability.” Recent Advances in Estimating Nonlinear Models. Ma, J., and Wohar, M., eds. New York: Springer, 2014, pp. 59–83.

Kaul, M., Hill, R.L., and Walthall, C. “Artificial Neural Networks for Corn and Soybean Yield Prediction.” Agricultural Systems 85, 1(2005):1–18.

Li, A., Liang, S., Wang, A., and Qin, J. “Estimating Crop Yield from Multi-Temporal Satellite Data Using Multivariate Regression and Neural Network Techniques.” Photogrammetric Engineering and Remote Sensing 73, 10(2007):1149–57.

Lobell, D.B., and Asner, G.P. “Climate and Management Contributions to Recent Trends in U.S. Agricultural Yields.” Science 299, 5609(2003):1032.

Lv, X. “Remote Sensing, Normalized Difference Vegetation Index (NDVI), and Crop Yield Forecasting.” Master’s thesis, University of Illinois at Urbana-Champaign, Urbana, 2014.

Maselli, F., and Rembold, F. “Integration of LAC and GAC NDVI Data to Improve Vegetation Monitoring in Semi-Arid Environments.” International Journal of Remote Sensing 23, 12(2002):2475–88.

Mkhabela, M., Bullock, P., Raj, S., Wang, S., and Yang, Y. “Crop Yield Forecasting on the Canadian Prairies Using MODIS NDVI Data.” Agricultural and Forest Meteorology 151, 3(2011):385–93.

Monahan, J.H. Enumeration of Elementary Multi-Indices for Multivariate Fourier Series. Raleigh: North Carolina State University, Institute of Statistics Mimeo Series No. 1338, 1981.

Prasad, A.K., Chai, L., Singh, R.P., and Kafatos, M. “Crop Yield Estimation Model for Iowa Using Remote Sensing and Surface Parameters.” International Journal of Applied Earth Observation and Geoinformation 8, 1(2006):26–33.

Senay, G. “The Power of Remote Sensing: Global Monitoring of Weather, Water, and Crops with Satellites and Data Integration.” Resource Magazine 23, 2(2016):6–9.

Thompson, L.M. Weather and Technology in the Production of Corn and Soybeans. Ames: Iowa State University, CARD Reports 17, 1963. Internet site: https://lib.dr.iastate.edu/card_reports/17/ (Accessed May 8, 2017).

Tucker, C.J. “Red and Photographic Infrared Linear Combinations for Monitoring Vegetation.” Remote Sensing of Environment 8, 2(1979):127–50.

U.S. Department of Agriculture, National Agricultural Statistics Service (USDA-NASS). The Yield Forecasting and Estimating Program of NASS. Washington, DC: USDA-NASS, Statistical Methods Branch, Staff Report No. SMB 12-01, 2012.

U.S. Department of Agriculture, National Agricultural Statistics Service (USDA-NASS). Crop Production. Washington, DC: USDA-NASS, 2016. Internet site: https://nass.usda.gov/Publications/Todays_Reports/reports/crop0818.pdf (Accessed May 8, 2017).

Woodard, J. “Big Data and Ag-Analytics: An Open Source, Open Data Platform for Agricultural and Environmental Finance, Insurance, and Risk.” Agricultural Finance Review 76, 1(2016):15–26.


