Impact Statement
This paper presents a state-of-the-art streamflow prediction workflow for poorly gaged areas that are plagued by missing data. We demonstrate the effectiveness of the workflow in forecasting daily streamflow at 10 gaging stations in the Benin Republic. The proposed workflow enhances decision-making based on model-based streamflow forecasts by reducing the bias introduced by missing in-situ data and producing high-fidelity forecasts. This work is particularly relevant in areas across Sub-Saharan Africa where data collection is too costly, under-resourced, or not possible.
1. Introduction
In hydrology, streamflow refers to the volume of water that moves through a natural channel, such as a river or a stream, over a given period. It is typically measured in cubic meters per second or cubic feet per second and plays a crucial role in managing water resources, predicting floods, and monitoring the environment (Rantz, 1982).
Streamflow predictions are critical for decision-making in many areas of climate change adaptation, including flood and drought prevention, agricultural planning, and hydroelectric power system operations (Kratzert et al., 2019). Predictions can be generated by physical and statistical models of the hydrological process within the river catchment (Adounkpe et al., 2021). Regardless of the approach, observational data are required to sufficiently calibrate models to the localized dynamics within catchments. Model-based predictions are even more essential in vulnerable and underdeveloped regions that are less resilient to extreme weather events such as drought and flooding. However, localized observations for hydrological models are not always available due to physical sensor failures (Hamzah et al., 2020). Physical sensor systems often endure harsh environmental conditions, destruction, and power outages (Tencaliec et al., 2015). These phenomena result in prolonged periods of missing or invalid observation readings. In developing regions such as the Benin Republic, the effect of such missing data is further compounded by the already sparse distribution of stream gaging stations (Harrigan et al., 2020). Fitting statistical models to incomplete observational data results in inaccurate models that do not capture the full dynamics of the underlying hydrological processes (Hamzah et al., 2020). Imputation of missing data therefore supports much-needed improvements in flood and drought predictions in a climate change context where these events are expected to increase in intensity and frequency (Kalantari et al., 2018). Thus, missing data imputation has become a necessary first step in data preprocessing (Pigott, 2001). Since missing data imputation commands a relatively vast statistical literature, a wide variety of techniques have been developed and applied, with differing data requirements and accuracy (Little and Rubin, 2019).
A simple approach to handling missing data is discarding records where any missingness arises. Gill et al. (2007) show that while deletion of missing data is common practice in hydrology, it is not necessarily optimal, as significant amounts of predictive information can be lost. Single imputation methods replace missing values one feature at a time with a pre-specified summary statistic (e.g., mean or median) of the complete data (Baraldi and Enders, 2010). Regression-based methods, such as imputation by random forest (RF), k-nearest neighbors (KNNs), and multilayer perceptron (MLP), improve upon single imputation by exploiting correlations between variables in the complete data (Pantanowitz and Marwala, 2009; Gao, 2017). Both regression and single imputation can bias the resulting analysis when the assumption that data are missing at random is violated (Gao, 2017). Multiple imputation methods, such as multivariate imputation by chained equations, reduce this bias by replacing missing data points with predictions from an ensemble of regressors obtained from multiple subsets of the complete data (Van Buuren, 2018). Each of the imputation techniques above is used extensively in hydrology (Adeloye and Rustum, 2012; Mwale et al., 2012; Hamzah et al., 2021).
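To make these families concrete, the sketch below applies single, KNN-based, and chained-equation (random forest) imputation to a synthetic daily discharge table using scikit-learn; the table layout and parameter choices are illustrative and not those used in this study.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Synthetic stand-in for a daily discharge table with one column per station.
rng = np.random.default_rng(0)
discharge = pd.DataFrame(rng.gamma(2.0, 50.0, size=(365, 10)),
                         columns=[f"station_{i}" for i in range(10)])
discharge[discharge < 20] = np.nan  # introduce some missing values

# Single imputation: replace each missing value with the column median.
median_filled = SimpleImputer(strategy="median").fit_transform(discharge)

# Regression-based imputation: k-nearest neighbours over the other stations.
knn_filled = KNNImputer(n_neighbors=5).fit_transform(discharge)

# Chained-equation (MICE-style) imputation with a random forest regressor.
mice_rf = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50),
                           max_iter=5, random_state=0)
rf_filled = mice_rf.fit_transform(discharge)
```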
Global streamflow prediction systems also present a potential solution to missing data and the sparse distribution of stream gage stations in developing countries. The Group on Earth Observations Global Water and Sustainability Initiative (GEOGloWS) produces one such system, the GEOGloWS ECMWF streamflow service (GESS), which builds on the Global Flood Awareness System (GloFAS) (Ashby et al., 2021). GESS extends GloFAS to local sites using an updated digital elevation model that allows for increased resolution in areas of the world with narrow watersheds (Ashby et al., 2021). As in GloFAS, runoff predictions from the HTESSEL land surface model are downscaled and routed through a high-resolution stream network using the Muskingum algorithm of the routing application for parallel-discharge computation (Gill, 1978). While GESS provides much-needed streamflow data, localized calibration remains a significant challenge. The global calibration of these systems is biased primarily toward developed regions with dense and complete observational data (Ashby et al., 2021). Thus, the utility of streamflow forecasts from these systems can be significantly enhanced by bias and variance correction that minimizes systematic discrepancies between model output and gaging station observations. Statistical learning methods are broadly popular for learning transfer functions between global model forecasts and local observations (Piani et al., 2010; Rischard et al., 2018; Wang et al., 2021). Statistical learning methods that have been put forward for bias correction tasks include penalized regression, quantile mapping (QM), artificial neural networks, and Gaussian processes (GPs) (Pastén-Zapata et al., 2020; Xu and Liang, 2021; Hunt et al., 2022).
A natural next step from the imputation of missing data is streamflow forecasting. This area has recently been dominated by recurrent neural network (RNN) architectures, given their successes in sequence-to-sequence modeling tasks such as natural language processing (Vaswani et al., 2017; Kratzert et al., 2019; Adounkpe et al., 2021; Castangia et al., 2023). This is mainly due to the ability of architectures such as the long short-term memory (LSTM) network to capture dependencies at longer time lags, and to attention mechanisms that can focus on the salient parts of the input sequence that are relevant to predictions for a particular time point (Alizadeh et al., 2021; Lim et al., 2021).
In this work, we address two gaps in streamflow prediction literature: (a) prior work on statistical learning approaches to streamflow prediction isolates the problems of missing data and forecasting and (b) most RNN approaches in this area are restricted to training and generating forecasts for a single catchment at a time due to the inability to incorporate encodings of categorical variables.
This paper closes these gaps by presenting and evaluating a novel workflow for streamflow forecasting that (a) performs missing data imputation through bias correction of GESS river discharge estimates and (b) employs a state-of-the-art forecasting model, the temporal fusion transformer (TFT), for day-ahead streamflow prediction at 10 catchments in the Benin Republic using a single interpretable model. The proposed workflow provides a principled method for dealing with missing observations while also minimizing training time for the forecasting module; competing architectures would require training 10 different models (one for each catchment).
2. Study Area
For the purposes of this study, 10 hydrological stations located at the outlets of Benin’s major river catchments were selected; they are depicted in Figure 1: Athiémè, Beterou, Bonou, Domè, Kaboua, Kompongou, Koubéri, Lanta, Porga, and Yakin. Benin’s catchments are not confined to Benin’s administrative boundaries but extend into its neighboring countries: Togo, Burkina Faso, and Nigeria. The Kouffo, Mono, Okpara, Ouémé, and Zou rivers originate from central Benin and Togo and generally flow into the Atlantic Ocean. The Alibori, Mékrou, and Sota rivers also originate from the center of Benin but have their outlets to the north in the Niger River. The Pendjari River (sometimes referred to as the Oti River) originates from northern Benin and flows into the Volta River in Togo and Ghana.
3. Data
In-situ hydrological data (daily river discharge records at 10 locations from 1980 to 2021) were acquired from Benin’s hydrological service, Direction Générale de l’Eau (DG-Eau). The overall missing data percentage across the selected hydrological stations is 27.9%, with the Athiémè station having the least missing data and the Kompongou station having the most. Supplementary Figure 6 details the periods of missing data for each station. The GESS forecasts, on the other hand, present no missing data and provide daily river discharge from 1979 to the present at targeted river sections (Ashby et al., 2021). Daily precipitation, maximum temperature, and minimum temperature were obtained from Benin’s meteorological agency, Agence Nationale de la Météorologie du Bénin. The static datasets of elevation, soil, land use, and geology were obtained, respectively, from the CGIAR Consortium for Spatial Information (Reuter et al., 2007), Office de la Recherche Scientifique et Technique Outre-Mer (Fauck, 1977), Centre National de Télédétection, and Office Béninois des Mines. Further details on the data used can be found in the Supplementary Material.
4. Proposed Workflow
We propose the workflow depicted in Figure 2, which allows for both imputation of missing in-situ observations and multi-horizon streamflow forecasting. We now provide a detailed exposition of our imputation and forecasting methods.
4.1. Imputation of missing observations
We investigate streamflow imputation by bias-correcting GESS forecasts. We propose an imputation scheme where we replace a missing in-situ observation at time $ t $ with
$ {\hat{\boldsymbol{x}}}_t=h\left({\boldsymbol{x}}_t^{\prime}\right), $
where $ {\boldsymbol{x}}_t^{\prime } $ is the GESS forecast of streamflow at time $ t $ and $ h $ is a regressor trained on periods where $ {\boldsymbol{x}}_t^{\prime } $ and $ {\boldsymbol{x}}_t $ are both available. We investigate settings where $ h $ takes various forms, including a simple lookup from GESS, QM, an elastic net, and a GP.
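As an illustration of this scheme, the sketch below fits a hypothetical regressor $ h $ (here a plain linear regression) on dates where both series are observed and uses it to fill the gaps; the series names and regressor choice are placeholders for the methods described next.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_with_gess(insitu: pd.Series, gess: pd.Series) -> pd.Series:
    """Replace missing in-situ values with h(GESS), h fit on jointly observed dates."""
    joint = insitu.notna() & gess.notna()
    h = LinearRegression().fit(gess[joint].to_numpy().reshape(-1, 1),
                               insitu[joint].to_numpy())
    filled = insitu.copy()
    gaps = insitu.isna() & gess.notna()
    filled[gaps] = h.predict(gess[gaps].to_numpy().reshape(-1, 1))
    return filled

# Synthetic example: a gappy in-situ series and a complete, positively biased GESS series.
dates = pd.date_range("2000-01-01", periods=200, freq="D")
gess = pd.Series(np.random.default_rng(1).gamma(2.0, 50.0, 200), index=dates)
insitu = 0.3 * gess + 5.0
insitu.iloc[50:80] = np.nan          # a gap to be imputed
print(impute_with_gess(insitu, gess).isna().sum())  # 0 remaining gaps
```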
4.1.1. Elastic net
The elastic net is a penalized linear regression model that employs weighted $ {l}_1 $ and $ {l}_2 $ norm regularization terms. The regularization terms serve as conservative priors (bias toward zero) on the coefficients that prevent overfitting to the training data. Similar to the GP, we utilize a multi-output formulation of the elastic net that allows information sharing between stations. In our imputation formulation,
$ {\boldsymbol{x}}_t=\boldsymbol{\beta} {\boldsymbol{x}}_t^{\prime }+\unicode{x025B}, $
where $ \boldsymbol{\beta} $ is a matrix of coefficients on the GESS predictions $ {\boldsymbol{x}}_t^{\prime } $ and $ \unicode{x025B} $ is an error term of independent identically distributed Gaussian noise. The matrix of coefficients is obtained by optimizing the posterior, equivalently minimizing the penalized loss
$ \hat{\boldsymbol{\beta}}=\arg \underset{\boldsymbol{\beta}}{\min }{\left\Vert {\boldsymbol{x}}_t-\boldsymbol{\beta} {\boldsymbol{x}}_t^{\prime}\right\Vert}_2^2+{\lambda}_1{\left\Vert \boldsymbol{\beta} \right\Vert}_1+{\lambda}_2{\left\Vert \boldsymbol{\beta} \right\Vert}_2^2, $
with the hyperparameters $ {\lambda}_1 $ and $ {\lambda}_2 $ obtained by cross-validation under the constraint $ {\lambda}_1+{\lambda}_2=1 $ . Given limited in-situ data, the elastic net has the advantage of limiting overfitting of the bias correction relative to an unregularized regression.
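One way to realize this multi-output formulation is scikit-learn's MultiTaskElasticNetCV, whose l1_ratio argument plays a role analogous to the $ {\lambda}_1 $ / $ {\lambda}_2 $ trade-off; the sketch below uses synthetic arrays and is not the exact configuration used in this study.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNetCV

# Synthetic stand-in for the jointly observed training period: GESS and in-situ
# discharge at the 10 stations (array names are hypothetical).
rng = np.random.default_rng(0)
X_gess = rng.gamma(2.0, 50.0, size=(1000, 10))
Y_insitu = 0.3 * X_gess + rng.normal(0.0, 5.0, size=X_gess.shape)

# l1_ratio controls the l1/l2 mix; the overall penalty strength is chosen by CV.
enet = MultiTaskElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X_gess, Y_insitu)
Y_corrected = enet.predict(X_gess)  # bias-corrected streamflow for all 10 stations
```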
4.1.2. Gaussian processes
GPs are nonparametric probabilistic models that are well known for regression problems (Cohen et al., 2020). A GP is a collection of random variables, any finite subset of which is jointly Gaussian (Rasmussen and Williams, 2006). GPs are fully defined by a mean function $ \mu \left({\boldsymbol{x}}^{\prime}\right) $ and a kernel function $ k\left({\boldsymbol{x}}_i^{\prime },{\boldsymbol{x}}_j^{\prime}\right) $ .
Given a training dataset $ {\left\{{\boldsymbol{x}}_t^{\prime },{\boldsymbol{x}}_t\right\}}_{t=1}^T $ of $ T $ noisy observations, $ {\boldsymbol{x}}_t=h\left({\boldsymbol{x}}_t^{\prime}\right)+\unicode{x025B} $ with $ \unicode{x025B} \sim \mathcal{N}\left(0,{\sigma}_X^2\right) $ , the GP prior on $ h $ is such that $ h\left({\boldsymbol{x}}^{\prime}\right)\sim \mathcal{N}\left({\boldsymbol{\mu}}_{x^{\prime }},{\boldsymbol{K}}_{x^{\prime }}+{\sigma}_X^2\boldsymbol{I}\right) $ . Assuming a zero prior mean for notational simplicity, the mean and variance of the Gaussian posterior predictive distribution of the function $ h\left({\boldsymbol{x}}_{\ast}^{\prime}\right) $ at a test point $ {\boldsymbol{x}}_{\ast}^{\prime } $ are given by
$ \overline{h}\left({\boldsymbol{x}}_{\ast}^{\prime}\right)={\boldsymbol{k}}_{\ast}^{\top }{\left({\boldsymbol{K}}_{x^{\prime }}+{\sigma}_X^2\boldsymbol{I}\right)}^{-1}\boldsymbol{x}\quad \mathrm{and}\quad \operatorname{var}\left[h\left({\boldsymbol{x}}_{\ast}^{\prime}\right)\right]={\boldsymbol{k}}_{\ast \ast }-{\boldsymbol{k}}_{\ast}^{\top }{\left({\boldsymbol{K}}_{x^{\prime }}+{\sigma}_X^2\boldsymbol{I}\right)}^{-1}{\boldsymbol{k}}_{\ast }, $
where $ \boldsymbol{x} $ denotes the vector of training observations, $ {\boldsymbol{k}}_{\ast \ast }=k\left({\boldsymbol{x}}_{\ast}^{\prime },{\boldsymbol{x}}_{\ast}^{\prime}\right) $ , and $ {\boldsymbol{k}}_{\ast }=k\left({\boldsymbol{x}}^{\prime },{\boldsymbol{x}}_{\ast}^{\prime}\right) $ . The kernel hyperparameters $ \boldsymbol{\theta} $ and the noise parameter $ {\sigma}_{\boldsymbol{X}} $ are obtained by maximizing the log-marginal likelihood
$ \log p\left(\boldsymbol{x}|{\boldsymbol{x}}^{\prime },\boldsymbol{\theta}, {\sigma}_X\right)=-\frac{1}{2}{\boldsymbol{x}}^{\top }{\left({\boldsymbol{K}}_{x^{\prime }}+{\sigma}_X^2\boldsymbol{I}\right)}^{-1}\boldsymbol{x}-\frac{1}{2}\log \left|{\boldsymbol{K}}_{x^{\prime }}+{\sigma}_X^2\boldsymbol{I}\right|-\frac{T}{2}\log 2\pi . $
In this work, we use a multi-output formulation of the GP where all 10 stations share a squared exponential kernel, thus allowing for implicit inference of connectivity relationships between stations.
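The posterior predictive equations above translate directly into a few lines of numpy; the sketch below implements a zero-mean, single-output GP with a squared exponential kernel and fixed hyperparameters, purely for illustration (the study's multi-output formulation and hyperparameter optimization are not reproduced here).

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel between two 1-D input arrays."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise=0.1, **kernel_args):
    """Posterior predictive mean and covariance of a zero-mean GP regressor."""
    K = sq_exp_kernel(x_train, x_train, **kernel_args) + noise**2 * np.eye(len(x_train))
    k_star = sq_exp_kernel(x_train, x_test, **kernel_args)
    k_ss = sq_exp_kernel(x_test, x_test, **kernel_args)
    mean = k_star.T @ np.linalg.solve(K, y_train)          # posterior predictive mean
    cov = k_ss - k_star.T @ np.linalg.solve(K, k_star)     # posterior predictive covariance
    return mean, cov

# Toy example: map synthetic GESS values toward synthetic observations.
x_train = np.linspace(0.0, 10.0, 50)
y_train = 0.3 * x_train + np.random.default_rng(0).normal(0.0, 0.1, 50)
mean, cov = gp_posterior(x_train, y_train, x_test=np.array([2.5, 7.5]), lengthscale=2.0)
```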
4.1.3. Quantile mapping
QM is a well-known method for the bias correction of physical model output (Maraun, 2013; Ringard et al., 2017). In QM, we seek to learn a mapping between the empirical cumulative distribution functions (CDFs) of the in-situ data and the GESS predictions. In this case, the bias correction transfer function is
$ h\left({\boldsymbol{x}}_t^{\prime}\right)={F}_X^{-1}\left({F}_{X^{\prime }}\left({\boldsymbol{x}}_t^{\prime}\right)\right), $
where $ {F}_X^{-1} $ is the inverse empirical CDF of the in-situ data and $ {F}_{X^{\prime }} $ is the empirical CDF of the GESS data. Both empirical CDFs are obtained over a specified training period.
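The transfer function can be implemented with empirical CDFs in a few lines; the sketch below uses synthetic data and hypothetical argument names.

```python
import numpy as np

def quantile_map(gess_new, gess_train, insitu_train):
    """Map GESS values through F_X^{-1}(F_{X'}(x')) using empirical CDFs."""
    # Empirical CDF of the GESS training data evaluated at the new values.
    q = np.searchsorted(np.sort(gess_train), gess_new, side="right") / len(gess_train)
    q = np.clip(q, 0.0, 1.0)
    # Inverse empirical CDF (quantile function) of the in-situ training data.
    return np.quantile(insitu_train, q)

rng = np.random.default_rng(0)
gess_train = rng.gamma(2.0, 100.0, 2000)   # positively biased model output (synthetic)
insitu_train = rng.gamma(2.0, 30.0, 2000)  # observed discharge (synthetic)
corrected = quantile_map(gess_train[:5], gess_train, insitu_train)
```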
4.2. Baseline imputation methods
We compare imputation by bias correction using the abovementioned methods to traditional regression imputation methods that rely only on complete in-situ data. We consider regression-based imputation by RF (Pantanowitz and Marwala, 2009), QM (Maraun, 2013; Ringard et al., 2017), GP regression (Rasmussen and Williams, 2006), and KNNs (Hamzah et al., 2021) as our baselines. A description of these methods can be found in the Supplementary Material.
4.3. Forecasting methods
We investigate two sequence models, the LSTM network and the TFT, as candidates for our forecasting module.
The LSTM (Hochreiter and Schmidhuber, 1997) is an RNN architecture that utilizes a memory cell state to represent long-term dependencies in sequential data. The LSTM memory cell at time $ t $ can be formulated through its gating equations as
$ {i}_t=\sigma \left({W}_i\left[{h}_{t-1},{x}_t\right]\right),\quad {f}_t=\sigma \left({W}_f\left[{h}_{t-1},{x}_t\right]\right),\quad {o}_t=\sigma \left({W}_o\left[{h}_{t-1},{x}_t\right]\right), $
$ {c}_t={f}_t\odot {c}_{t-1}+{i}_t\odot \tanh \left({W}_c\left[{h}_{t-1},{x}_t\right]\right),\quad {h}_t={o}_t\odot \tanh \left({c}_t\right), $
where $ x $ are the inputs, $ c $ are the cell states, and $ h $ are the hidden states indexed by time $ t $ , and $ {W}_k $ , $ k\in \left\{i,f,o,c\right\} $ , are the model weights for the input, forget, output, and cell gates. At the end of the input sequence, the cell and hidden states are connected to an output ( $ {y}_{out} $ ) via a dense neural network layer to make predictions:
$ {y}_{out}=\Phi \left({W}_D\left[{h}_T,{c}_T\right]\right), $
where $ \Phi $ is the dense layer and $ {W}_D $ its weights.
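A minimal PyTorch sketch of a single-catchment day-ahead LSTM forecaster of this form is shown below; the layer sizes, lookback window, and input features are illustrative rather than those used in this study.

```python
import torch
import torch.nn as nn

class StreamflowLSTM(nn.Module):
    """Single-catchment day-ahead forecaster: LSTM encoder plus a dense output layer."""
    def __init__(self, n_features=4, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        # Dense layer (Phi with weights W_D) applied to the final hidden and cell states.
        self.head = nn.Linear(2 * hidden_size, 1)

    def forward(self, x):                      # x: (batch, lookback_days, n_features)
        _, (h_n, c_n) = self.lstm(x)           # final hidden and cell states
        state = torch.cat([h_n[-1], c_n[-1]], dim=-1)
        return self.head(state)                # day-ahead streamflow prediction

model = StreamflowLSTM()
dummy = torch.randn(8, 90, 4)                  # 8 samples, 90-day lookback, 4 features
print(model(dummy).shape)                      # torch.Size([8, 1])
```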
The TFT (Lim et al., 2021) is a state-of-the-art multi-horizon forecasting model that takes a different approach to learning long-term dependencies in sequential data. The TFT uses a transformer encoder to create an abstract representation of the input sequence; this representation (encoding) is then transformed into predictions by a decoder. A vital element of the TFT, and of transformer architectures in general, is the self-attention mechanism, which adds sparsity to the encoding. The sparsity added by self-attention allows the decoder to focus on the parts of the input sequence that are influential for prediction. Attention can be viewed as a differentiable dictionary lookup that matches a query to a key linked to a value in the encoding (a representation of past inputs). The TFT augments the transformer architecture with variable selection networks and static covariate encoders, increasing its utility for multivariate time series forecasting with heterogeneous inputs. The static covariate encoders are particularly important in streamflow forecasting: they allow us to use the catchment name as a feature and train one model for all catchments rather than multiple models (one for each catchment), as required by traditional RNN models like the LSTM. A detailed depiction of the TFT is provided in Supplementary Figure 7. We consider two TFT models: a TFT-lite model, which considers only the temporal inputs (i.e., historical streamflow and climate variables), and a larger TFT-full model that adds static catchment characteristics to the inputs. This allows us to test the additional value of the static input encoder in the TFT.
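The dictionary-lookup view of attention corresponds to the standard scaled dot-product form; the sketch below illustrates that mechanism in isolation and omits the TFT's variable selection networks, gating, and static encoders.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention as a differentiable dictionary lookup."""
    # q, k, v: (batch, seq_len, d_model); the weights indicate how strongly each
    # query (time step) attends to each key, i.e. which past steps drive the prediction.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

x = torch.randn(2, 90, 32)                     # encoded 90-day input sequence (toy)
context, attn = scaled_dot_product_attention(x, x, x)
print(context.shape, attn.shape)               # (2, 90, 32) (2, 90, 90)
```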
We measure the quality of the imputation using the Kling–Gupta efficiency (KGE) (Gupta et al., 2009) and the Nash–Sutcliffe efficiency (NSE) (Nash and Sutcliffe, 1970). These are discussed in more detail in the Supplementary Material.
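Both metrics have simple closed forms; the sketch below computes the NSE and the 2009 formulation of the KGE with numpy on synthetic series.

```python
import numpy as np

def nse(sim, obs):
    """Nash–Sutcliffe efficiency."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(sim, obs):
    """Kling–Gupta efficiency (Gupta et al., 2009)."""
    r = np.corrcoef(sim, obs)[0, 1]        # linear correlation
    alpha = sim.std() / obs.std()          # variability ratio
    beta = sim.mean() / obs.mean()         # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 50.0, 365)
sim = obs + rng.normal(0.0, 10.0, 365)
print(round(nse(sim, obs), 3), round(kge(sim, obs), 3))
```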
5. Results and Discussion
We start by presenting the results of the missing data imputation and then proceed to discuss the forecasting results.
5.1. Missing data imputation
We partition the data such that 60% of the period (January 1980 to November 2004) is for training and 40% (November 2004 to June 2021) is for testing. We evaluate the performance of the imputation methods by randomly simulating missingness on the complete testing data at the rates of 5, 10, 20, 30, and 50%.
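The simulated missingness can be generated by masking a random fraction of the complete test series, as in the sketch below (a simplified stand-in for the evaluation protocol).

```python
import numpy as np

def simulate_missingness(series, rate, seed=0):
    """Randomly mask a fraction `rate` of a complete series with NaN."""
    rng = np.random.default_rng(seed)
    masked = np.asarray(series, dtype=float).copy()
    idx = rng.choice(masked.size, size=int(rate * masked.size), replace=False)
    masked[idx] = np.nan
    return masked, idx   # masked series and the indices held out for evaluation

for rate in (0.05, 0.10, 0.20, 0.30, 0.50):
    masked, held_out = simulate_missingness(np.arange(100.0), rate)
```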
Figure 3 shows the mean KGE and NSE of the imputation methods across gaging stations and levels of missingness. It can be seen that a simple lookup of the GESS predictions yields the worst imputation performance.
The hypothesis of significant bias in the GESS predictions is reinforced by the outperformance of the GP and elastic net bias correction imputation. Importantly, imputation by bias correction yields better performance than simply employing RF, KNN, and MLP regressors on the complete in-situ data. This can reasonably be expected, as there are only limited periods where the data are jointly complete for training the multiple-imputation regressors. Thus, the GESS data provide the necessary signals on discharge seasonality and hydrologic connectivity to significantly enhance imputation quality.
Table 1 shows the KGE values at each of the stations at a level of 20% missingness. GESS yields the best performance at the Yankin and Kompongou stations, which had the highest prevalence of missing data at 47.67 and 59.48%, respectively. A possible reason for this result is that, given the elevated levels of missingness at these stations, the complete data could not sufficiently calibrate the bias correction. The GP and elastic net imputation methods provide the best performance at 8 of the 10 stations, with the highest KGE values obtained at the stations with the most complete records, Athiémè and Bonou. QM as bias correction yields KGEs superior to multiple imputation by KNN and RF, suggesting that bias correction would remain the most efficient approach even when constrained by computational resources.
Note. The highest KGE for each station is in bold.
Abbreviations: GEOGloWS, Group on Earth Observations Global Water and Sustainability Initiative; GESS, GEOGloWS ECMWF streamflow service; KGE, Kling–Gupta efficiency.
Figure 4a depicts the embedded bias in the GESS forecasts at the Koubéri station. The positive bias in GESS manifests in GESS lookups producing new discharge extremes that are six times those recorded in the in-situ observations. In operational settings, such significant positive biases can lead to false flood alerts and suboptimal use of already limited resources. The remedy to this extreme bias is seen in Figure 4b, where bias correction removes the significant positive bias while retaining the seasonal periodicity in discharge.
5.2. Day ahead streamflow forecasting
In testing the performance of the forecasting models, we trained all models on 30 years of data (January 1983 to April 2013) and tested them on data from May 2013 to December 2019. The date range for training and testing of the forecasting models was adjusted due to the lack of in-situ observations of climate variables prior to 1983. Table 2 summarizes the testing performance of the forecasting models across each of the 10 gaging stations. The TFT models outperform the LSTM significantly across all gaging stations. The TFT-full model outperforms the much smaller TFT-lite model at 6 of the 10 stations, suggesting that the additional static inputs on specific catchment characteristics add predictive skill in forecasting streamflow. An essential consideration in explaining the relative performance of TFT-lite is that an encoding of the catchment name is added as an input to the model; thus, some catchment characteristics are implicitly represented in the input. The models underperform on the Lanta catchment, a result most likely explained by the catchment being very small relative to the others. Our results significantly improve on previous studies at overlapping catchments in the study area, which are mainly based on physical hydrological models. These include Biao (2017), which employs a hydrological model based on the least action principle at Beterou and Bonou, and Badou (2016), which uses the Soil and Water Assessment Tool, Hydrologiska Byråns Vattenavdelning (HBV light), and the Universal Hydrological Program—Hydrological Response Unit at Kouberi, Yakin, and Kompongou. Details on other overlapping studies are contained in the Supplementary Material.
Note. Best-performing models for each Station (and metric) are highlighted in bold.
Abbreviations: KGE, Kling–Gupta efficiency; LSTM, long short-term memory; NSE, Nash–Sutcliffe efficiency; TFT, temporal fusion transformer.
The attention mechanism of the TFT further allows us to endow our forecasts with some explainability. Figure 5 depicts the average temporal attention over the testing period for each TFT model. From Figure 5a, it can be seen that the TFT-lite model focuses mostly on inputs at lags of around 80–90 days, with very sparse attention at earlier lags. In Figure 5b, the TFT-full model spreads attention on a similar scale to the TFT-lite model, with a minor kink around 40 days and a similar peak around 80–90 days. This suggests that the TFT models can capture seasonal influences that typically have delays of up to 90 days. The variable importance plots for the static variables in Supplementary Figure 9 show agreement between the TFT models that the catchment encoding is an essential feature, further supporting the hypothesis around the advantages of using a TFT with a static encoder for this task. Other variables with high importance notably include the previous day’s streamflow, runoff ratios, and mean elevation.
6. Conclusion
We evaluated state-of-the-art imputation methods for streamflow prediction, demonstrating the need for and importance of bias-correcting GESS streamflow forecasts. Our findings show that the bias in GESS forecasts can be significant and can result in false flooding alerts. We minimize such bias using elastic net regressions trained on periods where in-situ observations are available. The resultant imputation under simulated missingness shows that GESS bias correction outperforms KNN and RF imputation trained on the available data alone. We also implement a TFT architecture for streamflow prediction for the first time in the literature and show that it significantly outperforms the LSTM and previous work on catchments in the Benin Republic.
Our findings enhance decision-making based on streamflow model-based forecasts. This enhancement is critical, particularly in areas where data collection is too costly, under-resourced, or not possible across Sub-Saharan Africa. Continued work aims to enhance and expand on the operational implementation of our bias-correction-based methodology. Specifically, we aim to interpolate and extrapolate the best-performing bias-correction model to ungaged areas. This is anticipated to vastly increase the utility of GESS to large areas across Sub-Saharan Africa and Central America that are currently ungaged.
Acknowledgments
We are grateful to Benin’s hydrological service, Direction Générale de l’Eau (DG-Eau) and Agence Nationale de la Météorologie du Bénin (METEO-BENIN) for the provision of hydrological and meteorological data for this work.
Author contribution
Conceptualization: R.M., J.Y.P.A., N.K.N.; Data curation: J.Y.P.A., M.C.M.H.; Investigation: R.M., J.Y.P.A.; Methodology: R.M., J.Y.P.A.; Writing—original draft: R.M., J.Y.P.A.; Writing—review and editing: T.M., N.K.N., W.T.M., Z.N. All authors approved the final submitted draft.
Competing interest
The authors of this manuscript have no competing interests.
Data availability statement
The data used in this study are available from Benin’s Hydrological service, Direction Générale de l’Eau (DG-Eau) for research purposes only.
Funding statement
R.M. is supported by a DeepMind Academic fellowship in Machine Learning.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/eds.2023.11.
Provenance
This article is part of the Climate Informatics 2023 proceedings and was accepted in Environmental Data Science on the basis of the Climate Informatics peer-review process.