Interpretative Machine Learning as a Key in Recognizing the Variability of Lakes Trophy Patterns

A trophic state relates to the productivity of the lake (i.e., the availability of nutrients) and acts as one of the key characteristics of aquatic ecosystems. Natural factors largely control the trophic state of aquatic systems; however, they are also prone to anthropogenic disturbances. The latter are responsible for considerable changes in lakes during the Anthropocene. Therefore, much attention is devoted to tracing the long-term trophic evolution of lakes and understanding the controls on the trophic state of lakes in the present-day human-affected environment. Even though the trophic state is a long-term condition (Rodhe 1969, Schindler 1977), measurable indicators that define this state may vary periodically. Factors determining trophy indicators are roughly divided into a stable for a long time and changeable over a shorter period. The former include physiographic properties of the lake and its catchment, and the latter connects with seasonal processes causing changes in the physical and biogeochemical properties of water. Ohle index, Schindler index, mean water depth, river network density, slope steepness, the share of endorheic areas, geological structure and land use in the catchment area are features regarded as controlling the trophy state (Bajkiewicz-Grabowska 2020). In-situ measurements of trophy indicators provide the best sources to build accurate models; however, such data are sparse, often incomplete, or access to it is restricted (Hollister et al. 2016). The information on the status of the trophy itself does not allow for a complex analysis of the responsible factors and thus, the reliable prediction of potential changes. With the rapid diffusion of geoscience and information technologies in the last decades (Chen et al. 2021), there is growing attention on numerical modelling in the many aspects of environmental changes. Thus, a possible solution for such a problem is predictive modelling that involves data obtained from public repositories like land cover/land use, lake basin geometry, and geology or directly calculated using GIS software.

Many papers describe the methods used to predict water trophies using watershed variables (Akbar et al. 2011, Benedini, Tsakiris 2013, Borics et al. 2013, Sun, Scanlon 2019, Gorgoglione et al. 2020). The efficiency of several multivariate analyses, including clustering, discriminant analysis, and principal component analysis, has been proven to reduce data complexity and detect intrinsic patterns in the underlying data. (Simeonov et al. 2010, Su et al. 2011, Li et al. 2017, 2018, Cui et al. 2019, Eliasz-Kowalska, Wojtal 2020). Such an approach has one considerable weakness: the detected patterns always show the internal differences of the explanatory variables, which does not always refer to the trophic state of the lakes. The linear multiple regression models (Jones et al. 2001, 2004, Beaulieu et al. 2014, Leach et al. 2018) provide insight into the relationship between explanatory variables and values of trophy indicators, but those methods are irrespective of the fact that most of the relationships are non-linear (Dormann et al. 2013, Huang et al. 2015).

New data acquisition techniques in geochemical surveys provide hundreds or thousands of observations described by tens or hundreds of features. When clarity of interpretation is more important than the model’s accuracy, simple models such as linear models or regression trees (Froeschke, Froeschke 2011) usually provide sufficient insight into relationships between factors and the outcome at the expense of the prediction quality. Complex models cannot be directly explained because they are not easy to understand. Some of the learners, including random forest (RF) (Breiman 2001), provide a measure allowing to estimate the relative importance of variables used, thus identifying such a subset that influences the variation of trophy factors.

The abundance of the data provides new opportunities, but however, it is challenging to investigate and interpret the role of many environmental features (Dafforn et al. 2015). It is even more difficult to understand relationships between elements of the system and their influence on the outcome because of complex relationships inside the multidimensional data. Regression models are a natural approach in searching relations between explanatory and dependent variables. Simple methods do not provide interpretable results when variables are related to each other or mutually convoluted. Complex models like assembles (Hollister et al. 2016, Li et al. 2016) or neural networks (Li et al. 2015, Rocha et al. 2017, Gebler et al. 2021) are then the only solutions; however, such models cannot directly be used for interpretation because they are not easy to understand. Moreover, many variables are unrelated to the studied phenomenon and are usually removed based on researchers’ experience or previous studies. Such removal, however, leads to the replication of the same, limited set of variables in subsequent studies (Goggin 1986, Harrell 2015).

There is no simple data analysis that would combine the advantages of supervised methods, like finding important variables and detecting relationships between explanatory variables and grouping – i.e. searching for new, possibly unknown patterns in the data. The first solution is optimisation: an iterative search of a combination of variables until the most optimal subset is found (Jasiewicz et al. 2021). An alternative solution is to create a non-linear regression model and then analyse it with interpretative machine learning (EML) (Molnar et al. 2020, Chen et al. 2021). Several tools have been recently proposed including Partial Dependency plots (Friedman 2001), local interpretable model-agnostic explanations (LIME) – Ribeiro et al. 2016), Learning Important Features Through propagating activation differences (DeepLIFT) – (Shrikumar et al. 2017), moDel Agnostic Language and eXplanation (DALEX) – (Biecek 2018) and SHapley Additive exPlanations (SHAP) (Lundberg, Lee 2017). These methods replace original values of explanatory variables with functional relationships between the explanatory and explained variables; in simple words the influence of a given variable on the result. The latter means a function that operates on an original variable value and replaces it with variable influence on the outcome. This paper introduces a new term: the variable influence, providing a new insight into the relationship between the lake’s surroundings and the value of trophic indicators. Moreover, the influence allows clustering the data, not by its original values denoted, but on the functional dependence between the explanatory and explained variables, creating a bridge between supervised and unsupervised learning.

The research presented in this paper aims to develop a solution allowing for clustering so that the resulting clusters minimise the dissimilarities inside the explanatory features and reduce the variation of the dependent variable inside the received clusters. The method was developed on a sample of 60 lakes from North-Eastern Poland. The lakes selected for this study are small and moderate in terms of their area and have relatively simple morphology but are sufficiently diverse to represent that geographical zone. In addition, the lakes were selected to obtain the strongest trophic gradient possible, expressed in terms of total phosphorus (P_tot). Although the method was designed for analysing the complex relationships between environmental variables that we believe impact the P_tot index, it can be easily applied to other complex systems. Thus, the data collection will be used as a case study to discuss the possibilities of the proposed method in practice.

Study area

The method was developed based on data collected from 60 lakes (Fig. 1) on the border of Suwalki and Masurian Lake District (SML). SML is an area of glacial and fluvioglacial origin, formed during the Pomeranian phase of the Weichselian glaciation between 24 k years and 19 k years BP (Marks 2012, Pochocka-Szwarc 2013). Dominant landforms include undulating morainic plateau with some hummocky and fluted till plains and washboard moraines (Weckwerth et al. 2019). Quaternary deposits are thick and contain typical components: tills, sands, silts, glaciofluvial gravels, and boulders. Lakes are the dominant and feature component of the SML landscape (Morawski 2005). All lakes are of glacial origin and are associated with the moraine plateau, inter-moraines and subglacial gutters. (Kondracki 2009, Pochocka-Szwarc 2013).

Variables

Lake water sampling and determination of P_tot

Epilimnion water samples were collected at the lake’s deepest point 1 m below the water surface with UWITEC sampler. Each lake was sampled once, and the samples were taken during three field campaigns in summers 2018 (east/central sector in Fig. 1), 2019 (central/west sector in Fig. 1), and 2020 (east-central-west sector in Fig. 1). The selection of 1 m depth as a representation of surface water followed the methodology of Tandyrak et al. (2020). The deepest point is routinely regarded as representative for the whole lake (Tylmann et al. 2012, Hernández-Almeida et al. 2017, Apolinarska et al. 2020). Chemical analysis of P_tot, (µgL⁻¹) was done within a few days after collecting. P_tot was analysed spectrophotometrically using Nanocolor VIS; (Macherey-Nagel) with ammonium molybdate according to PN-EN ISO 6878:2006P after mineralisation with HNO₃ and H₂O₂ in UV Mineral 6.1. The repeatability of P determination expressed as a relative standard deviation (RSD) from duplicate measurements was between 0.3% and 5.5%. Analytical accuracy was estimated using certified reference materials (CRM 398–399: Major elements in seawater; ION 96: Hard river water from Grand River) and was between 87% and 93%. The content of phosphorus in the samples ranges from 0 µgL⁻¹ to 70 µgL⁻¹, but the dominant values are below 20 µgL⁻¹.

Explanatory variables

The lake’s trophy is influenced by catchment factors responsible for the supply of matter, including nutrients and the morphometric parameters of the lake, which mainly determine the resistance of the lake to the influence of the catchment area. In our work, we considered all the indicators that could be obtained. We collected 25 explanatory variables, having a potential effect on the P_tot concentration in the water. According to Bajkiewicz-Grabowska (2020), the first group includes the Ohle index, the type of lake water balance (flow rate), the density of the river network and average slope, the percentage of non-inflow areas in the direct catchment, and lithology land use. The second group encompasses average depth, the ratio of the lake volume to the shoreline length, the percentage of the hypolimnion in the lake volume, the Schindler index, the active bottom area, and the water exchange rate in the lake.

The flow rate was omitted due to the lack of information on the amount of flow and the river network density because the area of the studied catchments is too small to develop such a network. According to the authors’ best knowledge, there are no measurement data on strictly hydrological processes, such as flow volume and water change time. Lange (1986) and Kalff (2001) suggest calculating those parameters using lake morphometry, but collected data already includes this information. We also omitted all factors that can be affected by the processes inside the lakes. Thus, the list of variables is limited only to stable variables over a long period; this eliminates factors that change seasonally (such as water temperature) or in short-term cycles (i.e., weather conditions).

The list of variables presented in Table 1 is divided into two groups. The first group includes morphometric parameters of lakes taken from The Atlas of Polish Lakes (Jańczak 1999) or ratios calculated directly from those features. The second group includes watershed geometry, morphometric parameters, lithology and land cover of the catchment area. In the first step, watersheds were delineated over the 30 m resolution Digital Elevation Model (DEM) DETD Level 2, (DEM in the rest of the paper) using GRASS GIS module r.stream.basins (Jasiewicz, Metz 2011). Finally, the geometry of the watershed was directly used to calculate the structure of their coverages and contribute to Schindler (1977) and Ohle (1956) ratios that define the relationship between lake and watershed geometry.

Table 1

Explanatory variables used in the study.

Variable	Abbreviation	Unit	Source
Elevation	ELEV	m a.s.l.	Jańczak 1999
Lake area	LARE	ha	Jańczak 1999
Lake capacity	LCAP	km³	Jańczak 1999
Lake max depth	LDMX	m	Jańczak 1999
Lake average depth	LDAV	m	Jańczak 1999
Lake max length	LLEN	m	Jańczak 1999
Lake max width	LWID	m	Jańczak 1999
Lake shoreline length	PRIM	m	Jańczak 1999
Lake elongation	LELN	Ratio	LLEN / LW ID
Lake capacity/length ratio	LVAR	Ratio	LCAP / LLEN
Lake perim development	LPDV	Ratio	LLEN / sqrt (2 × π × LARE)
Lake exposition	LEXP	Ratio	LARE / LDAV
Watershed area	WARE	ha	Calculated
Mean slope	WSLP	%	Calculated
Height stddev	WHSD	m	Calculated
Urbanised	WURB	%	Calculated
Agriculture	WAGR	%	Calculated
Forests	WFRS	%	Calculated
Wetlands	WWET	%	Calculated
Sands	WSND	%	Calculated
Tills	WTLS	%	Calculated
Clay	WCLS	%	Calculated
Organic	WORG	%	Calculated
Schindler ratio	SR	Ratio	WARE / LCAP
Ohle ratio	OR	Ratio	WARE / LARE

The upper part contains variables describing lakes morphometry, the lower part watersheds parameters. Source ‘Atlas’ denote variables read from Jańczak (1999), source ‘Calculated’, means variables values were calculated using GIS software and spatial data. See text for details. Column ‘Abbreviation’ contains symbols used later in the text. Variables starting with L refer to lakes morphometry, variables starting with W- to catchment properties.

Information about the land cover, geology, and basic morphometry of watersheds surfaces was obtained using Corine Land Cover (CLC) 2018 (EEA 2018) and Geological Map of Poland (GMP) 1:500,000 (Marks et al. 2006) was used to calculate coverage properties, including land-cover/land-use and surface geology. Because both maps contain units with complex characteristics, they were simplified. CLC was reduced to first-level CLC units (urbanised, agriculture, forest, and wetlands, excluding given lake), while GMP to basic lithological units (tills, sands, clays, and organic). The analysed group of lakes is located within one geomorphological division of the last glaciation thus, the stratigraphic distinction of the lithological unit was neglected. Coverage variables are expressed as a percentage of a given unit in the watershed area (WARE), separately for CLC and GMP. The DEM was used to calculate geomorphometric features such as watersheds height standard deviation (WHSD) and watersheds mean slope (WSLP) inclination of the terrain in the watersheds.

Methods

Our preliminary observations show that each of the single collected variables is weakly related to the trophy of the lakes (Fig. 2). Such a situation precludes the creation of simple relations, such as trophy index variable. Our method involves four steps presented in Figure 3 and described below in detail: (1) Building a regression model: f: X → y; (2) Building a simplified model g: X′ → ŷ over the model, called the explainer; (3) finding a set of mapping functions: I: X → X′, i.e. variable influences; (4) dissimilarity analysis and clustering objects by X′.

Relations between P_tot and selected explanatory variables, see Table 1 for details.

The outline of methodology. See Section ‘Methods’ for details.

Building a regression model

The first step is to build an explanatory regression model, f: X → y, where X is a set of explanatory variables (see also Table 1); f describes how the given variable X_i influences y; where y is an explained variable, here P_tot. We used a RF regression model (Breiman 2001), commonly used in many ecological and natural studies (Hollister et al. 2016, Bourel, Segura 2018, Leach et al. 2018, Li et al. 2018). The RF is a machine learning algorithm that grows a subset of the so-called weak predictors as shallow regression trees by bootstrapping samples of the training set. Those trees are non-parametric models; thus, the entire RF does not require a prior assumption about the variable distribution. It means in practice that RF accepts X in original form without the preceding transformation. Each tree grows recursively until it meets its stop criterion. At each step of growth, y is clustered in two child nodes to maximise y homogeneity inside these clusters. Then the best split on one of the X variables is selected. The RF is a bagging algorithm, which means that each tree grows on the independent subset of cases and variables. The importance of each variable depends only on its potential for reducing mean squared error between actual values of y and the outcome ŷ. For that reason, namely the random selection of variables, RF is more suitable for explanation than other machine learning algorithms.

Building the explainer

The second step of analysis includes the building of the explainer. Explainer g: X′ → ŷ is a transformation of a previously trained complex model (here RF) into its interpretable approximation (Lundberg, Lee 2017), where X′ is a transformed X, and ŷ predicted values of y. All mentioned explainers assign an influence to each explanatory variable for a particular prediction – i.e., single case. The explanation process starts from the prediction when all values are set to their means. Next, for successive variables, their original values are restored. If variables are not independent, what happens almost always is that variables are restored is the order that matters. If two variables in a model are correlated, the first analysed variable will explain its more significant part of the model’s variability, while the role of the second variable will remain depreciated. If variables are restored in the opposite order, the influence of those variables will also be different. For that reason, we decided on the core part of SHAP – a game theory concept of Shapely values (Shapely 1953), because this solution is not sensitive to the order of the variables selection. A unique advantage of the SHAP explainer is that it averages the influence across all possible orderings for a given prediction. In this way, the influence of the variable represents the mean change in the model prediction when conditioning on the given variable. An additional standard deviation of the change informs how a given variable is robust against variable ordering.

Finding variable influence

As a result, a vector of new values X′ now describes the data and represents the influence of the given feature on the final prediction. The influence is a mapping function, I: X → X′, where I denotes non-parametric mapping function, called the influence. In that way, X′ is a set of mean Shapely numbers that thus describe how variables influence the outcome. X′ can be positive, negative or indifferent (Fig. 4). The influence of each variable can be presented in the form of an influence plot, where the x-axis contains original values and y the range of the influences. Each case is represented by a single dot at x and y. The I: X → X′ is in close correspondence with partial dependence plots (PDP) – (Friedman 2001), such that for each specific variable, influence values arrange along the PDP line. If a dependent variable y is standardised, i.e., mean is at 0 and values are represented in units of y standard deviation, both the PDP lines and the influence values acquire particular property: all X′ values are scaled in the same range, relative to the variable importance. The I: X → X′ transformation scales the new values to the same scope as y. Thus, 0 means that the given variable does not influence ŷ for the given case. Values other than zero, positive or negative, determine the increasing or decreasing ŷ, in units of P_tot standard deviation.

The concept of mapping variable values into influence. The size of dots simulates the values of the dependent variable.

The influence plots (Fig. 4) show the relations between X and X′ and detects sections where the influence is positive, negative, or indifferent (no influence). Such an approach extends the notion of variable importance (Jones et al. 2004, Håkanson 2005, Genuer et al. 2010, Leach et al. 2018) and provides new insights into the relations between the studied complex system and the factors that shape it. If the plot identifies the result of clustering, it also allows for identifying sets of cases for which the variable is significant (in the form of positive or negative influence) and cases for which the variable is not. The range of variability of individual variables determines the scale of the impact. The greater the difference in values, the more significant a given variable’s role in the explanatory model.

Dissimilarity analysis and clustering objects

The influence plots provide information on two levels. The first is the variable level, and it describes how changes of the variable impact the outcome in the given range of variable values. Moreover, the range of the y-axis (influence) is proportional to variable importance. The second is the case (individual object) level– the X′_i is calculated for each case (i.e., lake) separately and describes how each factor with a given value contributes to the value of the ŷ.

Clustering is a process of searching for similarities between natural objects and separating them into smaller yet consistent groups. Sometimes, preliminary clustering is used to improve the regression models (Kocev et al. 2020). Such clusters, however, only reflect the variability inside the set of explanatory variables; thus, the relationship between independent and dependent variables cannot be inputted into the unsupervised model. On the other hand, regression models by themselves cannot provide unknown information from the training data and cannot cluster the data by discovering their features independently. In that way, the zero-mean and relative to y values of X′ are essential for further clustering: the most influencing variables with the highest range of X′ contribute most to the dissimilarity between objects, and the impact of the minor influencing variables is weak or negligible. It means that preceding arbitrary variable selection is unwanted, and the clusters will include the distribution of the dependent variable.

Results

Model quality

Because RF does not require any assumptions about data distribution, that part of the data remained in its original form. The explained P_tot variable was transformed into normal-like distribution with Yeo and Johnson’s (2000) power transformation. Power transformation stabilises variance and transforms data into Z-score form. Such a transformation is necessary to correctly estimate the influence of individual variables in a uniform unit, i.e., in proportion to the explained variable’s standard deviation. Forasmuch the primary goal of the model is an analysis of existing data set, not a prediction on a new data; the model was trained and tested on the entire data set with specific hyper-parameters: the depth of the trees was reduced to 3, and the number of trees to 50 and the number of cases and variables selected to train each tree was reduced to 0.3. In the result, the root mean square error (RMSE) of the model was higher than for the best set of tuned parameters, but such structure of the learner guarantees that the role of less significant variables will not be omitted. RF is a stochastic algorithm, so we have trained 3000 candidates and selected the best fitted, with the lowest achieved RMSE. The stochastic nature of the RF model causes the results of each execution to differ slightly; nevertheless, the list of the most influential variables remains the same.

The X′ → ŷ is the basis for the reasoning, namely the influence describes ŷ not y, so the quality of conclusions is a derivative of the quality of the prediction. This is the main limitation of this method. The relation y~ŷ depends on the information about y carried by X. If X does not contain key variables for modelling y, the model has low performance and the error of y~ŷ is the main part of the uncertainty of conclusions. Moreover, such a model only reveals a statistical relationship between the variables and the outcome, which does not yet imply a physical dependency.

Therefore, the first step is to assess the quality of the model. Figure 5 shows the relationship between the actual values of P_tot to the values modelled by the RF model. The quality of the model is moderate. The correlation between actual and outcome is very high (R² = 0.92), the RMSE value is 0.37 of transformed variable standard deviation. It means that collected variables do not describe approximately one-third of the standard deviation. The remaining part of the P_tot variability is most likely the result of processes inside the ecosystems of individual lakes.

Relation between actual values of dependent variables and the outcome of the model P_tot.

Figure 6 presents the result of clustering in the form of a cluster map, where x − axis contains all explanatory variables ordered by their importance, e.g. contribution to explainer and y − axis is ordered by clustering results. The dendrogram was cut at 0.85 providing five classes of P_tot concentration: one with high values, three with moderate values, and one, the largest, containing lakes with the lowest values (Fig. 6). The cluster containing lakes with the highest values of P_tot was labelled as High; three Moderate clusters as Mod.-I, Mod.-II, Mod.-III; and the last as Low. Figure 7 also shows that the quartile range partially refers to the classification of OECD, respectively, such as high to eutrophic-mesotrophic, moderate to mesotrophic, and low to oligotrophic and ultra-oligotrophic, but some lakes do not fit fully to OECD scheme.

Relation between clusters and variable influence. The red-white-blue gradient denotes the influence of the given variable. The colours marking the clusters are used in the same way in the other figures.

Variation of P_tot in clusters. X-labels denote the name of the class: High trophy, Moderate trophy (Mod.-I, Mod.-II, Mod.-III), and Low trophy.

The role of individual variables

Figure 8 allows one to analyse I: X → X′ for each variable and each lake. A partial dependency plot is a tool that allows for tracking the generalised impact of variable values on the outcome, but Shapely numbers show the individual impact of each variable for each case (i.e., lake). I: X → X′ corresponds to PDP, but these relationships get weaker as the variables’ importance decreases, so with the least important variables, the relationship is barely noticeable. The correspondence between variable values and their influence is not always linear, and its dynamics presents new knowledge unavailable with other machine learning (ML) methods but often concordant with intuition and common sense.

The influence plots for each variable. The X-axis contains the original values of the variable, Y-axis contains the influence. Colours in legend denote clusters (see Fig. 7), size of dots value of P_tot. Partial Dependency Plot (PDP) is marked by a light grey line.

The I: X → X′ plots show that these relationships can be very different. The X′ of several variables: Ohle rat. (OR), Schindler rat. (SR), Urbanised (WURB) and Basin Area (WARE) or Lake Depth Max (LDMX) shows a bimodal distribution, with a clear threshold value. Other variables, like Wetlands (WWET), Tills (WTLS), or Agriculture (WAGR) show linear dependency or have an inflection point. The X′ of the last ten variables is minimal, within the range of 0.02–0.05 of P_tot standard deviation, and thus their internal structure has a weak interpretative value. Especially for the last four variables, there is no functional relationship between the value of its feature and the effect on the ŷ. While the linear I: X → X′ are easy to follow, non-linear relationships require more attention. Although a detailed analysis of this phenomenon is beyond the scope of this paper, it should be noted that the occurrence of threshold values or at least changes in the trend requires the searching for natural processes with similar characteristics. Surface runoff can be such a process, primarily when runoff is caused by short-duration intense storms (Kandel et al. 2004). Also, Guan et al. (2016) noticed the bimodal nature of minor and major rainfall events.

The structure of clusters

We decided to use hierarchical clustering (AHP) with euclidean dissimilarity and Ward linkage since such parameters gave the most interpretative clusters and allowed for easy exploration of possible sub-clusters. The first three steps, i.e.: the transformation of the explanatory model f into explainer g and then the explanatory variables X into the influence of X′ is automatic and does not require parameters. Since X′ is in the z-score-like form, the scope of each variable is proportional to its role. Thus, X′ can be used directly to calculate the dissimilarity between lakes without prior scaling, and dissimilarity matrix d subjects to the clustering process. The selection of a number of classes is an entirely arbitrary decision made after analysing the dendrogram structure.

The clustering process reveals the main novelty of the proposed method. A typical clustering process minimises differences within clusters and maximises differences between them. For this reason, each grouping process requires the prior selection of a limited number of variables, which are preferably not correlated with each other. In our method, the clusters do not mean lakes with similar characteristics, but rather a group of objects where the content of P_tot in the water is shaped in a similar way. The latter results follow directly from the fact that the clustering process uses functional dependencies between the lakes features and the content of P_tot in the water. As a result, the coherence of the value of the dependent variable is a natural feature of the generated clusters. The most interesting fact is the way each of these classes is explained. The structure clusters show that values in moderate clusters have a similar trophy level but this value results from different processes. It is also noteworthy that arranging the variables according to their importance reveals that the first nine variables have a real influence on P_tot, while the last ten has no real impact on this trophy index.

The role of variables in explaining the P_tot value can be traced both for classes and individually for each lake. To illustrate the five clusters, the most representative lakes were selected, where the selection criterion was the P_tot and the smallest difference between y and ŷ for a given cluster. The SHAP plot (Fig. 9) clearly shows the impact of each feature on modelled P_tot.

The SHAP (SHapley Additive exPlanations) plots (Lundberg, Lee 2017) for the five most representative lakes for each class. X-axis presents influence in units of P_tot standard deviation scale is the same for each subplot, but ranges are different. Length and direction of arrows denote the scale and influence orientation (negative or positive) brought by a given variable.

High trophy class

The high value of the P_tot index is the effect of the synergy of the five most significant variables: OR, WWET, SR, WURB, and WARE. There is a trace, positive role of LDMX, elevation (ELEV) and Hights Stddev (WHSD), lake exposition (LEXP) and WTLS’s negative role. The remaining variables do not affect the trophy index.

Moderate trophy I class

The value of the P_tot index in this class results from the synergy of the first three variables: OR, WWET and SR, and the secondary role of LEXP, WTLS, and LDMX. Mainly the WURB, ELEV, and WHSD values have a negative effect. The influence of the other variables is mutually exclusive.

Moderate trophy II class

This class also includes major mesotrophic lakes, but the P_tot value results from the positive impact of the U variable and the synergy of less significant variables: LEXP, WTLT, LDMX, ELEV, and Forests (WFRS). The values of OR and SR have a negative effect, but they cannot balance the positive impact of the other variables. In the analysed case of Lake Sunowo, the positive influence of the variables Lake Area (LARE) and Lake Depth Average (LDAV) is also visible, but this is not a feature of the whole class, but only this case.

Moderate trophy III class

In the case of this class, the P_tot value is the effect of the positive influence of variables with a lower influence: LEXP, WTLS, LDMX, Sands (WSND), and ELEV also have a significant influence on the P_tot value in the case of Lake Kiersuń, which is also not a feature of the whole class. The four most influential variables in this class have a negative impact, but the impact is not significant.

Low trophy class

The last class, including lakes with the lowest trophic index values, results from a strong negative impact of the six most significant variables: OR, WWET, SR, WURB, LARE, and LEXP. The positive impact of WTLS, LDM, and WSND is noted only in selected cases, and in the analysed case of Dmitrowo lakes, the effect of LDMX is negative.

Regardless of the dissimilarity of patterns related to specific values of the P_tot index, all cases in each trophy group share features. When the concentration of P_tot is high, it results from a positive synergy of the most influential variables; the remaining variables have only a trace effect. Lakes with a moderate concentration of P_tot are positively influenced by several variables – different in different groups, and few variables with a negative impact slightly reduce this positive influence. The low concentration of P_tot is an effect of the negative synergy of the most important variables, slightly balanced by the influence of less essential features.

Discussion

Variable importance

The proposed method allows for a comprehensive assessment of the influence of selected variables on the share of P_tot in water. However, it should be examined how this approach differs from other methods used in lake studies, namely multiple linear regression (LR) (Su et al. 2011, Staehr et al. 2012) and RF (Genuer et al. 2010). Those methods are commonly used to select significant variables (Li et al. 2016, 2017, Bourel, Segura 2018). From existing methods of LRs, we used ElasticNet, a LR model extended by regularisation, a technique that adds a penalty to the model parameters when the model complexity increases. In simple words, when the coefficients are either very high or very low, ElasticNet eliminates those features from the model and shows no relationship between a given explanatory variable and the response variable. The final model is described then, only by those variables that explain the main trends of the model, while variables introducing the noise are eliminated. The RF model is the core of the presented method, so the order of variable importance is identical to our method. The value of importance indicates so-called impurity (or Gini) importance, a normalised total reduction of the error brought by that variable. Thus, if the selected variables significantly reduce the error, but only for selected cases, the role of such a variable may be overestimated.

The RF importance describes only the importance of variables without the information about the direction: whether the variable generally increases or decreases the modelled value. Compared to RF, the LR model provides more information because it also includes the signs of coefficients. After removing least-informative variables, the sign shows whether the variable increases or decreases the outcome and the coefficient value is somehow related to the intensity of this factor.

Figure 10 shows the evaluation results of the importance of the variables for both models. Values of linear coefficient (slopes) of the LR model and Gini importance of RF cannot be compared by numbers, but both methods indicate more or less the same subset of variables. The value of the coefficient sign follows the orientation of the influence plots in Figure 8. Nevertheless, the variables indicated by LR show rather the strength of the general relationship between a given explanatory variable and the explained variable. Thus, linear models may overestimate the role of variables for which there is a linear relationship (even very weak) with the explained variable at the expense of the variables whose influence is significant but not linear.

The variable importance estimated using multiple linear regression (ElasticNet) and random forest. See text for details.

Analysis of dissimilarity inside original and transformed data

The goal of clustering is to identify stable groups in a dataset. The purpose of the comparison is to check whether grouping only the original explanatory variables will allow to identify groups of lakes with a similar trophy. As the clustering technique is not crucial for the entire method, multidimensional scaling (MDS) was used for comparative analysis. The MDS is an ordination technique, a form of non-linear dimensionality reduction that maps distances between objects in original multidimensional spaces into lower-dimensional space positions, preserving original dissimilarities as much as possible (Cox, Cox 2000). The d was calculated for z-scored X variables and in our method d’ using X′ directly. Due to a large number of variables (high dimensionality), and to avoid the curse of dimensionality, we used the L1 (‘Cityblock’) metric, which is preferable for high dimensional applications to the default L2 (‘Euclidean’) metric (Aggarwal et al. 2001). Figure 11 shows city block dissimilarity between lakes and the value of the P_tot index.

Visualisation of dissimilarities between lakes in a form of MDS (multidimensional scaling). The axes of the plot have no units. Lakes morphometry presents dissimilarity for a group of morphometric features (see Table 1); Land cover presents dissimilarity for four land cover variables (urbanised, agriculture, forests, wetlands); ‘All variables’ plate uses all 25 variables; Influence presents dissimilarity between lakes in a space of variables influence. For colours see Figure 7.

Three cases were analysed: an ordination using all variables and two ordinances for a subset of variables describing land cover and lakes’ morphometry. The ordination with all variables shows an amorphous structure and does not reveal P_tot variability gradient. The ordinance performed for both subsets also does not show separate clusters, but there were gradients of trophic changes along the horizontal dimension: negative for morphometry and positive for land cover. The appearance of these gradients is related to the role of LDMX and WARE variables in the subgroup of morphometric variables, and WURB and WWET in the land cover group. The ordinance based on d’ shows distinct clusters with similar values of P_tot indices. Only Low trophy class and Mod.-III class are not separated. This lack of separation is a consequence of the structure of both clusters. In the Mod-III class, the five most important variables have a negative impact on the P_tot index as in the Low-trophy class. However, in the latter class, this impact is definitely more substantial.

Conclusions

The paper proposes a new, complex method of data analysis, classified as EML. This method allows for a detailed visual analysis of complex nonlinear regressors and introduces a new concept: variable influence. The latter is a transformation of the set of explanatory variables into a form describing the influence of a given variable on the modelled feature. Such a transformation makes it possible to group data based on the functional relations between the explanatory variables and the explained variable instead of the variation in the explanatory variables only. This is also the limitation of the method because its quality depends on the performance of the model. The method was developed to explain the P_tot index for the group of glacial lakes in north-eastern Poland. As part of the analysis, complex nonlinear factors shaping the P_tot of individual lakes were detected. On this basis, lakes were grouped into five clusters showing similar values of trophy. In each of the classes, the trophy is the effect of synergy between different groups of factors. The method of LR and RF was compared, and it was shown that it combines the advantages of both – the proposed method allows to precisely determine the real impact of each variable and the relationship between the explanatory and explained variable.

Nonlinear relationships between the variables and the value of the impact are related to nonlinear natural processes – for example, the bimodal distribution of rainfall intensity. However, this problem requires a separate, dedicated research. The cluster analysis showed that the studied lakes could be divided into several clusters, where the P_tot value is shaped similarly. This means that there is no single pattern, on how the watershed influences the content of P_tot but rather a few repeating patterns representing the studied phenomenon of the trophy value. There are five classes, one for lakes with high and low trophies and three classes with medium trophies, where the P_tot index is shaped differently in each. Such a conclusion can have potential value for protecting and managing limnic environments. Although the method has been developed for the problems of lake ecology, its application seems to be more comprehensive and can be applied wherever complex; multivariate numerical models can be used.

eISSN:: 2081-6383
Sprache:: Englisch

Zeitrahmen der Veröffentlichung:: 4 Hefte pro Jahr
Fachgebiete der Zeitschrift:: Geowissenschaften, Geografie

Zeitschrift RSS Feed

Interpretative Machine Learning as a Key in Recognizing the Variability of Lakes Trophy Patterns

Online veröffentlicht: 01. Apr. 2022

Seitenbereich: 127 - 146

Eingereicht: 14. Feb. 2022

DOI: https://doi.org/10.2478/quageo-2022-0009

Schlüsselwörter
total phosphorus, interpretative machine learning, random forest, Masurian lakes

© 2022 Jarosław Jasiewicz et al., published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.

Fig. 1

Fig. 2

Fig. 3

Fig. 4

Fig. 5

Fig. 6

Fig. 7

Fig. 8

Fig. 9

Fig. 10

Fig. 11

Interpretative Machine Learning as a Key in Recognizing the Variability of Lakes Trophy Patterns

Online veröffentlicht: 01. Apr. 2022

Seitenbereich: 127 - 146

Eingereicht: 14. Feb. 2022

DOI: https://doi.org/10.2478/quageo-2022-0009

Schlüsselwörtertotal phosphorus, interpretative machine learning, random forest, Masurian lakes

© 2022 Jarosław Jasiewicz et al., published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.

Fig. 1

Fig. 2

Fig. 3

Fig. 4

Fig. 5

Fig. 6

Fig. 7

Fig. 8

Fig. 9

Fig. 10

Fig. 11

Schlüsselwörter
total phosphorus, interpretative machine learning, random forest, Masurian lakes