A trophic state relates to the productivity of the lake (i.e., the availability of nutrients) and acts as one of the key characteristics of aquatic ecosystems. Natural factors largely control the trophic state of aquatic systems; however, they are also prone to anthropogenic disturbances. The latter are responsible for considerable changes in lakes during the Anthropocene. Therefore, much attention is devoted to tracing the long-term trophic evolution of lakes and understanding the controls on the trophic state of lakes in the present-day human-affected environment. Even though the trophic state is a long-term condition (Rodhe 1969, Schindler 1977), measurable indicators that define this state may vary periodically. Factors determining trophy indicators are roughly divided into a stable for a long time and changeable over a shorter period. The former include physiographic properties of the lake and its catchment, and the latter connects with seasonal processes causing changes in the physical and biogeochemical properties of water. Ohle index, Schindler index, mean water depth, river network density, slope steepness, the share of endorheic areas, geological structure and land use in the catchment area are features regarded as controlling the trophy state (Bajkiewicz-Grabowska 2020). In-situ measurements of trophy indicators provide the best sources to build accurate models; however, such data are sparse, often incomplete, or access to it is restricted (Hollister et al. 2016). The information on the status of the trophy itself does not allow for a complex analysis of the responsible factors and thus, the reliable prediction of potential changes. With the rapid diffusion of geoscience and information technologies in the last decades (Chen et al. 2021), there is growing attention on numerical modelling in the many aspects of environmental changes. Thus, a possible solution for such a problem is predictive modelling that involves data obtained from public repositories like land cover/land use, lake basin geometry, and geology or directly calculated using GIS software.
Many papers describe the methods used to predict water trophies using watershed variables (Akbar et al. 2011, Benedini, Tsakiris 2013, Borics et al. 2013, Sun, Scanlon 2019, Gorgoglione et al. 2020). The efficiency of several multivariate analyses, including clustering, discriminant analysis, and principal component analysis, has been proven to reduce data complexity and detect intrinsic patterns in the underlying data. (Simeonov et al. 2010, Su et al. 2011, Li et al. 2017, 2018, Cui et al. 2019, Eliasz-Kowalska, Wojtal 2020). Such an approach has one considerable weakness: the detected patterns always show the internal differences of the explanatory variables, which does not always refer to the trophic state of the lakes. The linear multiple regression models (Jones et al. 2001, 2004, Beaulieu et al. 2014, Leach et al. 2018) provide insight into the relationship between explanatory variables and values of trophy indicators, but those methods are irrespective of the fact that most of the relationships are non-linear (Dormann et al. 2013, Huang et al. 2015).
New data acquisition techniques in geochemical surveys provide hundreds or thousands of observations described by tens or hundreds of features. When clarity of interpretation is more important than the model’s accuracy, simple models such as linear models or regression trees (Froeschke, Froeschke 2011) usually provide sufficient insight into relationships between factors and the outcome at the expense of the prediction quality. Complex models cannot be directly explained because they are not easy to understand. Some of the learners, including random forest (RF) (Breiman 2001), provide a measure allowing to estimate the relative importance of variables used, thus identifying such a subset that influences the variation of trophy factors.
The abundance of the data provides new opportunities, but however, it is challenging to investigate and interpret the role of many environmental features (Dafforn et al. 2015). It is even more difficult to understand relationships between elements of the system and their influence on the outcome because of complex relationships inside the multidimensional data. Regression models are a natural approach in searching relations between explanatory and dependent variables. Simple methods do not provide interpretable results when variables are related to each other or mutually convoluted. Complex models like assembles (Hollister et al. 2016, Li et al. 2016) or neural networks (Li et al. 2015, Rocha et al. 2017, Gebler et al. 2021) are then the only solutions; however, such models cannot directly be used for interpretation because they are not easy to understand. Moreover, many variables are unrelated to the studied phenomenon and are usually removed based on researchers’ experience or previous studies. Such removal, however, leads to the replication of the same, limited set of variables in subsequent studies (Goggin 1986, Harrell 2015).
There is no simple data analysis that would combine the advantages of supervised methods, like finding important variables and detecting relationships between explanatory variables and grouping – i.e. searching for new, possibly unknown patterns in the data. The first solution is optimisation: an iterative search of a combination of variables until the most optimal subset is found (Jasiewicz et al. 2021). An alternative solution is to create a non-linear regression model and then analyse it with interpretative machine learning (EML) (Molnar et al. 2020, Chen et al. 2021). Several tools have been recently proposed including Partial Dependency plots (Friedman 2001), local interpretable model-agnostic explanations (LIME) – Ribeiro et al. 2016), Learning Important Features Through propagating activation differences (DeepLIFT) – (Shrikumar et al. 2017), moDel Agnostic Language and eXplanation (DALEX) – (Biecek 2018) and SHapley Additive exPlanations (SHAP) (Lundberg, Lee 2017). These methods replace original values of explanatory variables with functional relationships between the explanatory and explained variables; in simple words the influence of a given variable on the result. The latter means a function that operates on an original variable value and replaces it with variable influence on the outcome. This paper introduces a new term: the
The research presented in this paper aims to develop a solution allowing for clustering so that the resulting clusters minimise the dissimilarities inside the explanatory features and reduce the variation of the dependent variable inside the received clusters. The method was developed on a sample of 60 lakes from North-Eastern Poland. The lakes selected for this study are small and moderate in terms of their area and have relatively simple morphology but are sufficiently diverse to represent that geographical zone. In addition, the lakes were selected to obtain the strongest trophic gradient possible, expressed in terms of total phosphorus (
The method was developed based on data collected from 60 lakes (Fig. 1) on the border of Suwalki and Masurian Lake District (SML). SML is an area of glacial and fluvioglacial origin, formed during the Pomeranian phase of the Weichselian glaciation between 24 k years and 19 k years BP (Marks 2012, Pochocka-Szwarc 2013). Dominant landforms include undulating morainic plateau with some hummocky and fluted till plains and washboard moraines (Weckwerth et al. 2019). Quaternary deposits are thick and contain typical components: tills, sands, silts, glaciofluvial gravels, and boulders. Lakes are the dominant and feature component of the SML landscape (Morawski 2005). All lakes are of glacial origin and are associated with the moraine plateau, inter-moraines and subglacial gutters. (Kondracki 2009, Pochocka-Szwarc 2013).
Location of lakes and extent of the watershed on the study area.
Epilimnion water samples were collected at the lake’s deepest point 1 m below the water surface with UWITEC sampler. Each lake was sampled once, and the samples were taken during three field campaigns in summers 2018 (east/central sector in Fig. 1), 2019 (central/west sector in Fig. 1), and 2020 (east-central-west sector in Fig. 1). The selection of 1 m depth as a representation of surface water followed the methodology of Tandyrak et al. (2020). The deepest point is routinely regarded as representative for the whole lake (Tylmann et al. 2012, Hernández-Almeida et al. 2017, Apolinarska et al. 2020). Chemical analysis of
The lake’s trophy is influenced by catchment factors responsible for the supply of matter, including nutrients and the morphometric parameters of the lake, which mainly determine the resistance of the lake to the influence of the catchment area. In our work, we considered all the indicators that could be obtained. We collected 25 explanatory variables, having a potential effect on the
The flow rate was omitted due to the lack of information on the amount of flow and the river network density because the area of the studied catchments is too small to develop such a network. According to the authors’ best knowledge, there are no measurement data on strictly hydrological processes, such as flow volume and water change time. Lange (1986) and Kalff (2001) suggest calculating those parameters using lake morphometry, but collected data already includes this information. We also omitted all factors that can be affected by the processes inside the lakes. Thus, the list of variables is limited only to stable variables over a long period; this eliminates factors that change seasonally (such as water temperature) or in short-term cycles (i.e., weather conditions).
The list of variables presented in Table 1 is divided into two groups. The first group includes morphometric parameters of lakes taken from The Atlas of Polish Lakes (Jańczak 1999) or ratios calculated directly from those features. The second group includes watershed geometry, morphometric parameters, lithology and land cover of the catchment area. In the first step, watersheds were delineated over the 30 m resolution Digital Elevation Model (DEM) DETD Level 2, (DEM in the rest of the paper) using GRASS GIS module r.stream.basins (Jasiewicz, Metz 2011). Finally, the geometry of the watershed was directly used to calculate the structure of their coverages and contribute to Schindler (1977) and Ohle (1956) ratios that define the relationship between lake and watershed geometry.
Explanatory variables used in the study.
Variable | Abbreviation | Unit | Source |
---|---|---|---|
Elevation | ELEV | m a.s.l. | Jańczak 1999 |
Lake area | LARE | ha | Jańczak 1999 |
Lake capacity | LCAP | km3 | Jańczak 1999 |
Lake max depth | LDMX | m | Jańczak 1999 |
Lake average depth | LDAV | m | Jańczak 1999 |
Lake max length | LLEN | m | Jańczak 1999 |
Lake max width | LWID | m | Jańczak 1999 |
Lake shoreline length | PRIM | m | Jańczak 1999 |
Lake elongation | LELN | Ratio | |
Lake capacity/length ratio | LVAR | Ratio | LCAP / LLEN |
Lake perim development | LPDV | Ratio | LLEN / sqrt (2 × π × LARE) |
Lake exposition | LEXP | Ratio | LARE / LDAV |
Watershed area | WARE | ha | Calculated |
Mean slope | WSLP | % | Calculated |
Height stddev | WHSD | m | Calculated |
Urbanised | WURB | % | Calculated |
Agriculture | WAGR | % | Calculated |
Forests | WFRS | % | Calculated |
Wetlands | WWET | % | Calculated |
Sands | WSND | % | Calculated |
Tills | WTLS | % | Calculated |
Clay | WCLS | % | Calculated |
Organic | WORG | % | Calculated |
Schindler ratio | SR | Ratio | |
Ohle ratio | OR | Ratio | WARE / LARE |
The upper part contains variables describing lakes morphometry, the lower part watersheds parameters. Source ‘Atlas’ denote variables read from Jańczak (1999), source ‘Calculated’, means variables values were calculated using GIS software and spatial data. See text for details. Column ‘Abbreviation’ contains symbols used later in the text. Variables starting with L refer to lakes morphometry, variables starting with W- to catchment properties.
Information about the land cover, geology, and basic morphometry of watersheds surfaces was obtained using Corine Land Cover (CLC) 2018 (EEA 2018) and Geological Map of Poland (GMP) 1:500,000 (Marks et al. 2006) was used to calculate coverage properties, including land-cover/land-use and surface geology. Because both maps contain units with complex characteristics, they were simplified. CLC was reduced to first-level CLC units (urbanised, agriculture, forest, and wetlands, excluding given lake), while GMP to basic lithological units (tills, sands, clays, and organic). The analysed group of lakes is located within one geomorphological division of the last glaciation thus, the stratigraphic distinction of the lithological unit was neglected. Coverage variables are expressed as a percentage of a given unit in the watershed area (WARE), separately for CLC and GMP. The DEM was used to calculate geomorphometric features such as watersheds height standard deviation (WHSD) and watersheds mean slope (WSLP) inclination of the terrain in the watersheds.
Our preliminary observations show that each of the single collected variables is weakly related to the trophy of the lakes (Fig. 2). Such a situation precludes the creation of simple relations, such as trophy index variable. Our method involves four steps presented in Figure 3 and described below in detail: (1) Building a regression model:
Relations between
The outline of methodology. See Section ‘Methods’ for details.
The first step is to build an explanatory regression model,
The second step of analysis includes the building of the explainer. Explainer
As a result, a vector of new values
The concept of mapping variable values into influence. The size of dots simulates the values of the dependent variable.
The influence plots (Fig. 4) show the relations between
The influence plots provide information on two levels. The first is the variable level, and it describes how changes of the variable impact the outcome in the given range of variable values. Moreover, the range of the
Clustering is a process of searching for similarities between natural objects and separating them into smaller yet consistent groups. Sometimes, preliminary clustering is used to improve the regression models (Kocev et al. 2020). Such clusters, however, only reflect the variability inside the set of explanatory variables; thus, the relationship between independent and dependent variables cannot be inputted into the unsupervised model. On the other hand, regression models by themselves cannot provide unknown information from the training data and cannot cluster the data by discovering their features independently. In that way, the zero-mean and relative to
Because RF does not require any assumptions about data distribution, that part of the data remained in its original form. The explained
The
Therefore, the first step is to assess the quality of the model. Figure 5 shows the relationship between the actual values of
Relation between actual values of dependent variables and the outcome of the model
Figure 6 presents the result of clustering in the form of a cluster map, where
Relation between clusters and variable influence. The red-white-blue gradient denotes the influence of the given variable. The colours marking the clusters are used in the same way in the other figures.
Variation of
Figure 8 allows one to analyse
The influence plots for each variable. The X-axis contains the original values of the variable, Y-axis contains the influence. Colours in legend denote clusters (see Fig. 7), size of dots value of
The
We decided to use hierarchical clustering (AHP) with euclidean dissimilarity and Ward linkage since such parameters gave the most interpretative clusters and allowed for easy exploration of possible sub-clusters. The first three steps, i.e.: the transformation of the explanatory model
The clustering process reveals the main novelty of the proposed method. A typical clustering process minimises differences within clusters and maximises differences between them. For this reason, each grouping process requires the prior selection of a limited number of variables, which are preferably not correlated with each other. In our method, the clusters do not mean lakes with similar characteristics, but rather a group of objects where the content of
The role of variables in explaining the
The SHAP (SHapley Additive exPlanations) plots (Lundberg, Lee 2017) for the five most representative lakes for each class. X-axis presents influence in units of
The high value of the
The value of the
This class also includes major mesotrophic lakes, but the
In the case of this class, the
The last class, including lakes with the lowest trophic index values, results from a strong negative impact of the six most significant variables: OR, WWET, SR, WURB, LARE, and LEXP. The positive impact of WTLS, LDM, and WSND is noted only in selected cases, and in the analysed case of Dmitrowo lakes, the effect of LDMX is negative.
Regardless of the dissimilarity of patterns related to specific values of the
The proposed method allows for a comprehensive assessment of the influence of selected variables on the share of
The RF importance describes only the importance of variables without the information about the direction: whether the variable generally increases or decreases the modelled value. Compared to RF, the LR model provides more information because it also includes the signs of coefficients. After removing least-informative variables, the sign shows whether the variable increases or decreases the outcome and the coefficient value is somehow related to the intensity of this factor.
Figure 10 shows the evaluation results of the importance of the variables for both models. Values of linear coefficient (slopes) of the LR model and Gini importance of RF cannot be compared by numbers, but both methods indicate more or less the same subset of variables. The value of the coefficient sign follows the orientation of the influence plots in Figure 8. Nevertheless, the variables indicated by LR show rather the strength of the general relationship between a given explanatory variable and the explained variable. Thus, linear models may overestimate the role of variables for which there is a linear relationship (even very weak) with the explained variable at the expense of the variables whose influence is significant but not linear.
The variable importance estimated using multiple linear regression (ElasticNet) and random forest. See text for details.
The goal of clustering is to identify stable groups in a dataset. The purpose of the comparison is to check whether grouping only the original explanatory variables will allow to identify groups of lakes with a similar trophy. As the clustering technique is not crucial for the entire method, multidimensional scaling (MDS) was used for comparative analysis. The MDS is an ordination technique, a form of non-linear dimensionality reduction that maps distances between objects in original multidimensional spaces into lower-dimensional space positions, preserving original dissimilarities as much as possible (Cox, Cox 2000). The
Visualisation of dissimilarities between lakes in a form of MDS (multidimensional scaling). The axes of the plot have no units. Lakes morphometry presents dissimilarity for a group of morphometric features (see Table 1); Land cover presents dissimilarity for four land cover variables (urbanised, agriculture, forests, wetlands); ‘All variables’ plate uses all 25 variables; Influence presents dissimilarity between lakes in a space of variables influence. For colours see Figure 7.
Three cases were analysed: an ordination using all variables and two ordinances for a subset of variables describing land cover and lakes’ morphometry. The ordination with all variables shows an amorphous structure and does not reveal
The paper proposes a new, complex method of data analysis, classified as EML. This method allows for a detailed visual analysis of complex nonlinear regressors and introduces a new concept: variable influence. The latter is a transformation of the set of explanatory variables into a form describing the influence of a given variable on the modelled feature. Such a transformation makes it possible to group data based on the functional relations between the explanatory variables and the explained variable instead of the variation in the explanatory variables only. This is also the limitation of the method because its quality depends on the performance of the model. The method was developed to explain the
Nonlinear relationships between the variables and the value of the impact are related to nonlinear natural processes – for example, the bimodal distribution of rainfall intensity. However, this problem requires a separate, dedicated research. The cluster analysis showed that the studied lakes could be divided into several clusters, where the Ptot value is shaped similarly. This means that there is no single pattern, on how the watershed influences the content of Ptot but rather a few repeating patterns representing the studied phenomenon of the trophy value. There are five classes, one for lakes with high and low trophies and three classes with medium trophies, where the