Open access

Using machine learning techniques to reconstruct the signal observed by the GRACE mission based on AMSR-E microwave data


Introduction

Water, a crucial Earth resource (De Villiers 2001), necessitates continuous monitoring in order to understand planetary processes and predict extreme weather events. Hydrological models, remote sensing, and gravimetric sensors have become integral to climate-related research. Satellite gravimetry is a unique technique for monitoring mass transport and Earth’s processes on a global scale. One such satellite mission is the Gravity Recovery and Climate Experiment (GRACE) (Tapley et al. 2004; Wahr et al. 1998). Soil moisture (SM) is a pivotal hydrological variable and a fundamental component of terrestrial water storage changes (ΔTWS) (Robinson et al. 2008). Both active and passive microwave observations enable the analysis of SM at global and regional scales (Babaeian et al. 2019). One of the key remote sensors providing SM data was the AMSR-E mission. AMSR-E is the Advanced Microwave Scanning Radiometer for EOS, a component of the Earth Observing System (Njoku et al. 2005). Given the constraints associated with the spatiotemporal resolution of GRACE data, it is essential to uphold statistical significance when examining the potential synergy of GRACE data with sensors that offer higher measurement frequencies and spatial resolutions (Ioannidis 2005). A significant body of research involves processing, merging, and analyzing data from gravimetric sensors in conjunction with other data of varying spatiotemporal resolutions (Eicker et al. 2014). The year-long gap between the GRACE and GRACE-FO missions raises concerns about our understanding of ongoing climate change. With increasing computational capabilities, machine learning methods have gained significance in solving signal reconstruction problems. After the completion of the GRACE-FO mission, there will be a continued need to map various phenomena and their changes using existing data, along with further research in hydrological modeling (Hamshaw et al. 2018).

Numerous studies have already demonstrated the utility of machine learning models in various contexts. Early attempts to use Empirical Orthogonal Functions (EOF) were described by Becker et al. (2011). Sun et al. (2020) presented time series reconstructions of ΔTWS for sixty selected river basins, employing a comprehensive comparison of deep neural networks (DNN) and seasonal autoregressive integrated moving average with exogenous variables (SARIMAX) models. Sun et al. (2021) presented a ΔTWS reconstruction using NOAH and CLSM data for major North American river basins. Artificial neural networks (ANN) were utilized by Seyoum and Milewski (2017), while Irrgang et al. (2020) demonstrated the efficiency of convolutional neural networks (CNN). Babaeian et al. (2019) conducted studies focusing on African river basins, and Sun (2013) used multilayer perceptron (MLP) and ANN architectures. Spatiotemporal analysis using random forest, extreme gradient boosting (XGBoost), and logistic regression was employed by Jing et al. (2020) in the case of the Nile River basin, setting standards for highly accurate hydrological parameter reconstructions based on GLDAS-2 data. Seyoum et al. (2019) applied decision trees to enhance high-resolution groundwater level anomalies, improving GLDAS model data with field observations. Additionally, Sun et al. (2019) utilized CNN models with VGG16, U-Net, and SegNet architectures for the Indian subcontinent, proving the effectiveness of encoder-decoder network structures in ΔTWS reconstruction.

The main goal of this publication is to show the possibility of using satellite microwave data (AMSR-E) to recreate the signal observed by the gravimetric GRACE satellite mission on a global and local scale. To validate the experiment on a local scale, absolute gravimetric measurements were used. The underlying hypothesis posits that machine learning (ML) methods applied to AMSR-E remote sensing data, which serve as the independent variables in various regression approaches, can effectively bridge the gaps in the GRACE mission data. Previous research encompassed diverse models and techniques for reconstructing the temporal variations in ΔTWS, frequently relying on SM data from alternative sensors, which indicates the importance of this feature relative to other predictors. The present study seeks to augment the existing body of literature by comprehensively examining various regression methodologies to reconstruct the GRACE-derived signal.

Target data – GRACE

The processed data from the GRACE mission, representing ΔTWS, are available on the PODAAC website (Physical Oceanography Distributed Active Archive Center 2023) and are distributed by the Center for Space Research (CSR) in Texas. The spatial resolution of the GRACE data used in this study is approximately 300 km × 300 km. The data concerning changes in mass on the Earth’s surface and subsurface are based on the RL06 standard (Dahle et al. 2013) at the level of L2 data processing. During the GRACE data processing, the C20 coefficient, representing the Earth’s gravitational flattening (Swenson et al. 2008a), was replaced with observations using the Satellite Laser Ranging (SLR) technique (Cheng & Tapley 2004). The error associated with the N-S stripes, resulting from orbit inclination, was removed using a modified decorrelation filter (Chen et al. 2007; Swenson & Wahr 2006). Additionally, during the GRACE data processing, the static part of the gravitational field was corrected using the GGM05C model (Ries et al. 2016). While processing GRACE data, degree 1 coefficients (Geocenter) were estimated using the methods presented in the work of Swenson et al. (2008b). Correction due to the glacial isostatic adjustment (GIA) was considered, based on the ICE6G-D model, as presented in the study by Peltier et al. (2018).

Predictors data – AMSR-E

The AMSR-E dataset is available as daily measurement files on the NASA website (NASA’s Goddard Earth Sciences Data and Information Services Center, 2023). The AMSR-E/Aqua surface SM descending dataset, version 2, is a Level 3 dataset in grid format, with a daily temporal resolution and a spatial resolution of approximately 25 km × 25 km. AMSR-E uses the X-band and C-band for water cycle measurements and SM content retrievals, corresponding to depths of 2.5–3.75 cm and 3.75–7.5 cm, respectively. Land surface SM measurements are derived from passive microwave remote sensing data using LPRM (Land Parameter Retrieval Model). LPRM leverages a radiative transfer model to obtain near-surface SM and optical depth of signal penetration. AMSR-E on NASA’s EOS Aqua satellite discontinued data provision in October 2011 due to issues with its antenna rotation (van der Vliet et al. 2020). This study exclusively utilized descending orbits, primarily due to their superior stability for soil temperature, vegetation cover, and nighttime air conditions (Liu et al. 2012).

True validation data – absolute gravimetric measurements

The JOZE gravimetric station is situated beneath the facilities of the Astronomical-Geodetic Observatory (AGO) in Józefosław, Poland, precisely 5.7 meters below the surface. It is anchored to a concrete pillar measuring 2 × 2 meters. Absolute measurements were conducted at approximately monthly intervals, from May 2005 to November 2016, employing the FG-5 gravimeter, serial number 230. This dataset represents the longest and most uniformly collected time series of absolute gravimetric values in Poland. The total uncertainty in determining the gravity amounted to ±2 μGal. The results of absolute measurements are meticulously adjusted to account for Earth tides (following the Wenzel model), oceanic tides (based on the FES2004 model), atmospheric pressure fluctuations, and polar motion in accordance with ITGRS standards (Wziontek et al. 2021). In addition, gravitational values are further refined by incorporating the outcomes of ICAG and ECAG international comparison meetings to define common international gravity reference values and metrological factors stemming from variations in clock and laser frequencies.

Methods

Machine learning algorithms are applied to various types of problems. Signal reconstruction can be addressed effectively as a regression problem, which involves predicting a continuous response variable from a given set of predictors. Regression models define y as a mathematical function of the variables X. Linear regression is the simplest and most widely used technique for predicting a continuous variable and is defined by the formula $Y = X\beta_1 + \beta_0 + \varepsilon$, where $\beta_0$ is the intercept, $\beta_1$ is the regression coefficient associated with the predictor variable (feature or attribute) $X$, and $\varepsilon$ represents Gaussian noise.
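As an illustration of the linear regression formula, the following minimal sketch estimates the intercept and the regression coefficient by ordinary least squares with NumPy. The data are synthetic, and the variable names and true coefficient values are assumptions made for the example; they are not taken from the study.

```python
import numpy as np

# Synthetic data (hypothetical, not from the study): one predictor x and a
# response y generated from y = beta1 * x + beta0 + Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
beta0_true, beta1_true = 0.5, 2.0
y = beta1_true * x + beta0_true + rng.normal(0.0, 0.05, size=200)

# Design matrix with a column of ones so that the intercept is estimated too.
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares solution of A @ [beta1, beta0] ~ y.
(beta1_hat, beta0_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"estimated slope beta1     = {beta1_hat:.3f}")
print(f"estimated intercept beta0 = {beta0_hat:.3f}")
```

With low noise and 200 samples, the estimates should land close to the values used to generate the data.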

The selection of regression algorithms for this study was guided by their established efficacy in handling diverse datasets and features, and their suitability for modeling the phenomenon under investigation. Drawing from prior research (Bonaccorso 2018; Maulud & Abdulazeez 2020), which evaluated various machine learning algorithms for regression tasks, we identified several popular methods for their robust performance across different data characteristics. Leveraging diverse methods ensured a comprehensive exploration of the regression landscape and facilitated robust modeling of the target phenomenon. The selected algorithms encompass a range of approaches, from ensemble methods such as Random Forest Regressor and Extra Trees Regressor, known for their ability to capture complex relationships in large datasets, to gradient boosting algorithms such as Extreme Gradient Boosting (XGBoost) and Gradient Boosting Regressor, which excel in handling high-dimensional data and achieving superior predictive accuracy. Additionally, traditional linear models, such as linear regression and ridge regression, were included, which, despite their simplicity, often serve as reliable baselines for comparison. Bayesian Ridge regression was chosen for its ability to balance model complexity and goodness of fit through Bayesian analysis. At the same time, the Huber Regressor was selected for its robustness to outliers, a common challenge in real-world datasets. Furthermore, more specialized techniques, such as Orthogonal Matching Pursuit and Lasso Regression, offer sparse solutions by selecting only the most relevant features, thus aiding in model interpretability. Elastic Net, a hybrid regularization method combining L1 (Lasso) and L2 (Ridge) penalties, was included to address potential collinearity among predictor variables, enhancing the stability of parameter estimates. 
To provide a comprehensive evaluation, two further methods were considered: the AdaBoost Regressor, an ensemble method known for adaptively combining multiple weak learners to improve predictive performance, and the Passive Aggressive Regressor, a variant of the passive aggressive algorithm adapted specifically for regression tasks, which offers flexibility in adjusting model parameters in response to observed errors.
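A comparison of this kind can be sketched with scikit-learn as follows. This is only an illustrative example on synthetic data with a deliberately nonlinear target; the data, the reduced set of models, and the hyperparameters are assumptions and do not reproduce the configuration used in the study.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score

# Synthetic, nonlinear regression problem (hypothetical data).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(400, 2))
y = np.sin(4 * np.pi * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0.0, 0.05, size=400)

# Simple hold-out split: first 300 rows for training, last 100 for testing.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# A small subset of the model families discussed in the text.
models = {
    "Dummy (mean)": DummyRegressor(strategy="mean"),
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training part and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} R2 = {score:.3f}")
```

On a target like this one, the tree ensemble is expected to outperform the linear models, while the dummy regressor marks the level of a constant prediction.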

Fitted trigonometric (sine and cosine) functions with annual and semiannual periods served as the reference baseline. The calculations were performed with a custom Python script built on open-source libraries such as scikit-learn, PyCaret, and NumPy.
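A trigonometric reference fit of this type can be sketched as a linear least-squares problem. The function name `fit_harmonic`, the time convention (decimal years), and the synthetic series below are assumptions made for illustration; the original script is not available here.

```python
import numpy as np

def fit_harmonic(t_years, y, cycles_per_year=1.0):
    """Least-squares fit of y ~ a*sin(w*t) + b*cos(w*t) + c.

    t_years is time in decimal years; cycles_per_year=1.0 gives an annual
    term and cycles_per_year=2.0 a semiannual term.
    """
    w = 2.0 * np.pi * cycles_per_year
    A = np.column_stack([np.sin(w * t_years), np.cos(w * t_years), np.ones_like(t_years)])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs, A @ coeffs

# Hypothetical monthly series spanning ten years with a pure annual cycle.
t = 2002.0 + np.arange(120) / 12.0
y = 0.05 * np.sin(2 * np.pi * t) + 0.02 * np.cos(2 * np.pi * t) + 0.01

coeffs, fitted = fit_harmonic(t, y, cycles_per_year=1.0)
print("a, b, c =", np.round(coeffs, 4))
```

Because the model is linear in the coefficients a, b, and c, no iterative optimization is needed; the amplitude and phase of the cycle follow from a and b.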

The experiment focused on the period in which the GRACE and AMSR-E datasets overlap (2002–2012), chosen to maximize the amount of common data. Input data for model training were organized in a tabular format: for each measurement epoch, a matrix was built in which every row corresponds to one data point. The columns contain the variables derived from AMSR-E SM retrievals in the C-band and X-band, latitude, longitude, and monthly/semi-annual factors. Each matrix is paired with a vector of continuous ΔTWS values, which serve as the target variable. Subsequently, the matrices for all epochs were combined into a single table whose number of rows equals the number of epochs multiplied by the number of data points, with one column per variable.
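The table layout described above can be sketched as follows. The grid size, the random values, and in particular the encoding of the seasonal factors as sine and cosine terms of the month are assumptions made for this example; the source does not specify how the monthly/semi-annual factors were encoded.

```python
import numpy as np

# Tiny hypothetical grid: 3 monthly epochs on a 2 x 4 latitude/longitude grid.
n_epochs, n_lat, n_lon = 3, 2, 4
rng = np.random.default_rng(1)
sm_c = rng.uniform(0.0, 0.5, size=(n_epochs, n_lat, n_lon))  # SM from C-band
sm_x = rng.uniform(0.0, 0.5, size=(n_epochs, n_lat, n_lon))  # SM from X-band
dtws = rng.normal(0.0, 0.05, size=(n_epochs, n_lat, n_lon))  # target values
lats = np.linspace(-45.0, 45.0, n_lat)
lons = np.linspace(-135.0, 135.0, n_lon)
months = np.array([1, 2, 3])  # calendar month of each epoch

lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
n_points = n_lat * n_lon

blocks = []
for k in range(n_epochs):
    phase = 2.0 * np.pi * months[k] / 12.0
    # One block per epoch: one row per grid point, one column per variable.
    block = np.column_stack([
        sm_c[k].ravel(),
        sm_x[k].ravel(),
        lat_grid.ravel(),
        lon_grid.ravel(),
        np.full(n_points, np.sin(phase)),      # annual factor (assumed encoding)
        np.full(n_points, np.cos(phase)),
        np.full(n_points, np.sin(2 * phase)),  # semiannual factor (assumed encoding)
        np.full(n_points, np.cos(2 * phase)),
    ])
    blocks.append(block)

X = np.vstack(blocks)  # shape: (n_epochs * n_points, n_variables)
y = dtws.reshape(-1)   # same epoch-major row order as X

print(X.shape, y.shape)
```

Stacking the per-epoch blocks in the same order in which the target array is flattened keeps every feature row aligned with its ΔTWS value.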

Prior to the training phase, a distinct portion of randomly selected data should be set aside for the purpose of accuracy testing and model evaluation. It is crucial to ascertain the appropriate sample size for the test data to attain statistical significance, given the lack of substantial variance in the mean values between the two groups (Ioannidis 2005). To obtain statistical significance of the model results, the minimum number of samples included in the test set was determined. Data from 2002–2008 were included in the training set, and data from 2008–2012 were included in the test set.
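The temporal division into training and test sets can be sketched as below. The function name `temporal_split` and the choice to assign the year 2008 entirely to the test set are assumptions: the text lists 2008 in both ranges and does not state the exact boundary.

```python
import numpy as np

def temporal_split(X, y, years, first_test_year=2008):
    """Split rows by observation year: earlier years train, later years test.

    The boundary year is an assumption made for this sketch.
    """
    train = years < first_test_year
    return X[train], X[~train], y[train], y[~train]

# Hypothetical table: 10 years of monthly rows with two feature columns.
years = np.repeat(np.arange(2002, 2012), 12)
X = np.arange(years.size * 2, dtype=float).reshape(years.size, 2)
y = np.linspace(-0.1, 0.1, years.size)

X_train, X_test, y_train, y_test = temporal_split(X, y, years)
print(X_train.shape, X_test.shape)
```

Splitting by time rather than by random rows keeps later epochs out of the training set, so the test score reflects how well the model extends to periods it has not seen.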

Comparative analyses necessitate quality metrics to evaluate model performance, which is dependent on the analysis type and data characteristics. This work utilizes metrics such as NSE (Nash 1970), the coefficient of determination (R2) (Nagelkerke 1991), Root Mean Square Error (RMSE) (Chai & Draxler 2014), and Normalized Root Mean Square Error (NRMSE). NSE is a normalized statistic that quantifies the relative size of the residual variance to the variance of the measured data. NSE is calculated using the following formula:
$$NSE = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$$
where: $n$ represents the number of observations, $y_i$ is the actual value of observation $i$, $\hat{y}_i$ is the value predicted by the model for observation $i$, and $\bar{y}$ is the mean value of all observations. The coefficient RMSE is calculated using the following formula:
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$
RMSE is a measure of the deviation between actual values $y_i$ and model-predicted values $\hat{y}_i$. A lower RMSE value indicates a better fit of the model to the actual data. The coefficient NRMSE is calculated using the following formula:
$$NRMSE = \frac{RMSE}{\left| \max(y) - \min(y) \right|}$$
where: $\max(y)$ and $\min(y)$ are the maximum and minimum values in the set of actual data $y$. In the context of GRACE data analysis, we are dealing with the amplitude of a phenomenon in a specific area. NRMSE is a measure of the deviation between model-predicted values and actual data, normalized to the data value range.
The coefficient of determination is calculated using the following formula:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$$
In this formula: $y_i$ is the actual value of observation $i$, $\hat{y}_i$ is the value predicted by the model for observation $i$, and $\bar{y}$ is the mean value of all observations. R2 is a measure that assesses how well a regression model fits the data. With this definition, R2 cannot exceed 1, where 1 indicates a perfect fit of the model to the data; a value of 0 corresponds to a model that performs no better than the mean of the observations, and negative values indicate a fit worse than that mean. The Nash-Sutcliffe model efficiency coefficient closely resembles the coefficient of determination, differing from R2 in its application. R2 serves as an indicator of the quality of fit for a statistical model. In contrast, NSE is utilized to quantify a model’s capability to forecast the outcome variable.
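The metrics defined above can be written compactly with NumPy. This is a minimal sketch that follows the formulas as stated in the text; the function names are chosen for the example, and because the NSE and R2 formulas share the same expression, one function is reused for both.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error between observed y and predicted y_hat."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def nrmse(y, y_hat):
    """RMSE normalized by the range of the observed values."""
    y = np.asarray(y, dtype=float)
    return rmse(y, y_hat) / abs(float(y.max() - y.min()))

def nse(y, y_hat):
    """Nash-Sutcliffe efficiency: 1 - residual sum of squares / total sum of squares."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# As written in the text, R2 uses the same expression as NSE.
r2 = nse

# Small worked example with hypothetical values.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(rmse(y_obs, y_pred))   # 0.5
print(nrmse(y_obs, y_pred))  # 0.5 / 3
print(nse(y_obs, y_pred))    # 1 - 1/5 = 0.8
```

In the worked example, the single error of 1 gives a residual sum of squares of 1 against a total sum of squares of 5, hence NSE = 0.8.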

Global results and discussion

The results achieved on the test data sample are presented in Table 1. The best results were obtained with tree-based ensemble methods, namely Random Forest Regressor, Extra Trees Regressor, and Extreme Gradient Boosting, all of which achieved satisfactory R2 values greater than 0.7.

Table 1. Results achieved on the test data sample

| Model | RMSE [m] | R2 | Δ RMSE [%] | Δ R2 [%] | 1-R2 | Δ 1-R2 [%] |
|---|---|---|---|---|---|---|
| Random Forest Regressor | 0.035 | 0.761 | 51.3 | 380700.0 | 0.239 | 76.1 |
| Extra Trees Regressor | 0.035 | 0.757 | 50.9 | 378700.0 | 0.243 | 75.7 |
| Extreme Gradient Boosting | 0.037 | 0.739 | 48.9 | 369350.0 | 0.262 | 73.9 |
| K Neighbors Regressor | 0.038 | 0.725 | 47.7 | 362450.0 | 0.275 | 72.5 |
| Light Gradient Boosting Machine | 0.038 | 0.715 | 46.7 | 357750.0 | 0.285 | 71.5 |
| Decision Tree Regressor | 0.048 | 0.546 | 32.8 | 273000.0 | 0.454 | 54.6 |
| Gradient Boosting Regressor | 0.052 | 0.469 | 27.3 | 234600.0 | 0.531 | 46.9 |
| Linear Regression | 0.069 | 0.074 | 3.9 | 36950.0 | 0.926 | 7.4 |
| Least Angle Regression | 0.069 | 0.074 | 3.9 | 36950.0 | 0.926 | 7.4 |
| Bayesian Ridge | 0.069 | 0.074 | 3.9 | 36900.0 | 0.926 | 7.4 |
| Ridge Regression | 0.069 | 0.068 | 3.7 | 34150.0 | 0.932 | 6.8 |
| Huber Regressor | 0.070 | 0.062 | 3.2 | 31000.0 | 0.938 | 6.2 |
| Orthogonal Matching Pursuit | 0.072 | 0.000 | 0.2 | 50.0 | 1.000 | 0.0 |
| Lasso Regression | 0.072 | −0.001 | 0.2 | −150.0 | 1.001 | 0.0 |
| Elastic Net | 0.072 | −0.001 | 0.2 | −150.0 | 1.001 | 0.0 |
| Lasso Least Angle Regression | 0.072 | −0.001 | 0.2 | −150.0 | 1.001 | 0.0 |
| Dummy Regressor | 0.072 | −0.001 | 0.2 | −150.0 | 1.001 | 0.0 |
| AdaBoost Regressor | 0.073 | −0.021 | −0.9 | −10450.0 | 1.021 | −2.1 |
| Passive Aggressive Regressor | 0.086 | −0.485 | −20.0 | −242550.0 | 1.485 | −48.5 |
| sin+cos annual function (baseline) | 0.072 | 0.000 | - | - | 1.000 | - |
| sin+cos semiannual function | 0.095 | 0.000 | −32.7 | 0.0 | 1.000 | 0.0 |

Source: own elaboration

The results align with established benchmarks. For example, Sun et al. (2020) achieved an RMSE of 4.5–4.7 cm and an NSE of 0.7 in their temporal approach. Szabó (2023) reported RMSE values of 4.5–4.7 cm and 4.2–4.5 cm, depending on the temporal and spatiotemporal scales considered. In a spatial and temporal context, Sun et al. (2021) achieved strong Nash-Sutcliffe efficiency (ca. 0.85) and low mean Normalized Root Mean Square Error (ca. 0.09) over the US. In a Nile River basin case study using the spatiotemporal method (Jing et al. 2020), results revealed RMSE of 1.4–3.47 cm and NSE of 0.54–0.94. CNN models applied at a grid-based scale (Sun et al. 2019) showed promising results, with NSE of 0.87 in the Indian study area. However, RMSE values of 4.5–5.0 cm on their own provide limited insight into solution quality. For a more comprehensive assessment that accounts for the ratio of the error to the scale of the signal, NRMSE is a more suitable metric, and the spatiotemporal approach benefits from the increased variance in individual observations.

Local results and discussion

Measurements conducted with an absolute gravimeter are influenced by systematic geodynamic effects, which are accounted for during data processing, and by the local hydrological environment. Alongside gravity measurements at AGO JOZE, groundwater levels were monitored through a piezometer. By assessing the influence of nearby subsurface water bodies, a comparison between ground-based and satellite data was feasible. The methodology outlined in the work of Kuczynska-Siehien et al. (2019) and Szabó and Marjańska (2020) was employed to process absolute gravity data. The ΔTWS prediction results and the given gravity disturbance are presented in Figure 3. The signal is replicated with a high degree of precision, and the periodic changes are captured effectively. However, in the presence of anomalies such as floods, the disparities between the model and observed values intensify.

In a study by Szabó and Osińska-Skotak (2023), the investigation reveals that the size of the river basin does not exhibit a direct correlation with the disparities in signals obtained from GRACE and AMSR-E. European rivers such as the Danube and Vistula exhibit concurrent shifts in hydrological signals when observed using both gravimetric and microwave remote sensors. Observations from the X- and C-bands introduce a more pronounced signal variance, compared with GRACE observations. Consequently, the identified anomalies are marked by heightened noise levels within these frequency ranges. Similar to the cited study, the flood wave prediction, based on SM data from AMSR-E, was unsuccessful in this case. Metrics for predictions for this location show the following values: NSE = −0.19, RMSE = 0.04 [m], NRMSE = 0.23, R2 = 0.27, and confirm that the determination of anomalies in this area, visible in local absolute measurements in 2010–2011, is unsuccessful. However, clear correlations between absolute measurements and SM from AMSR-E are visible in given periods. The negative anomaly from December 2009 was visible in both time series. This demonstrates the sensitivity of the gravimetric signal to environmental changes in the top aquifers.

Figure 1.

Random Forest Regressor model: (a) residuals; (b) prediction identity

Source: own elaboration

Figure 2.

Random Forest Regressor model spatial distribution of metrics: (a) NSE; (b) RMSE; (c) NRMSE; (d) R2

Source: own elaboration

Figure 3.

(a) Predicted ΔTWS and true ΔTWS with SM predictors from AMSR-E; (b) Predicted ΔTWS and true ΔTWS with dg validation data

Source: own elaboration

Conclusions

This study employs AMSR-E remote sensing data to model ΔTWS values, based on observations from the GRACE mission, testing various machine learning algorithms. Naturally forested and agricultural open regions exhibit a strong concordance between GRACE and AMSR-E data, emphasizing the importance of well-oxygenated soil root zones (Szabó & Osińska-Skotak 2023). The presence of permafrost restricts the applicability of X- and C-band microwave observations. Despite limited correlation in permafrost areas, ΔTWS values are accurately modeled with an RMSE of 3.5 cm. The Amazon region displays a notable model error, associated with the substantial amplitude of ΔTWS. However, metrics such as NRMSE and NSE affirm the overall quality of the model. AMSR-E SM data effectively model ΔTWS, even in equatorial forests. Challenges arise in the Mississippi River basin, the Great Plains, and Patagonia, where agricultural intensification leads to significant residuals from true observations. Factors influencing this may include connections with the irrigation of agricultural areas, faster water permeability to deeper aquifers in open areas, different vegetation periods, or other phenological factors. In the context of the approximately one-year gap between GRACE and GRACE-FO data, using existing data to model and complement the time series of gravimetric observations is extremely important. Data from remote sensing missions can be successfully used to achieve this goal.

eISSN:
2084-6118
Language:
English
Publication frequency:
4 times per year
Journal subjects:
Geosciences, Geography, other