Open Access

Identification of nonlinear determinants of stock indices derived by Random Forest algorithm



Introduction

Stock investors value knowledge of the factors that drive share prices, and this knowledge may influence how they construct their portfolios. These factors have also been a subject of academic research, and the results are highly diverse, reflecting different methodologies, models, and markets. The most important feature of these academic studies is how they choose the factors to be considered in further analysis. In general, the inputs are derived through logical reasoning and then tested for significance. This study aims to show a nonlinear alternative for deriving the most important drivers of future price movements of the chosen stock indices from a large dataset.

Usually, the determinants of price movements are established through regression significance [e.g., Al-Shubiri, 2010; Al-Tamimi et al., 2011; Garefalakis et al., 2013]. In this work, the author tackles the problem differently, using a machine learning algorithm called Random Forest to analyze relationships between macro- and microeconomic variables and future price movements. Three stock indices were chosen for this research: the Polish WIG20, which comprises the twenty largest companies listed on the Warsaw Stock Exchange; the DAX, a total return index that contains the thirty largest German stocks listed on the Frankfurt Stock Exchange; and the Stoxx Europe 600, which includes the largest 600 companies from 17 European countries. The whole dataset includes day-to-day observations for the period January 2004 to January 2020.

The remainder of the article is divided into four sections. Section 2 gives an overview of how the determinants of prices can be derived and summarizes the literature. Section 3 presents some characteristics of the Random Forest algorithm and how it can be used to identify determinants. Section 4 describes the steps performed to prepare the data for the algorithm and the methodology of training the algorithm on the dataset. Section 5 summarizes the results of the research and presents the main conclusions.

Determinants of price movements

A large number of theoretical and empirical studies have examined the factors that affect stock prices and returns. This section presents the general idea behind the selection of methodologies. Because the studies differ in their input data, the time horizon of stock prices, and the data sources, only a general comparison is performed. Nevertheless, some broad conclusions can be drawn from this analysis.

Most research papers investigating the relationships between the stock market and micro- and macroeconomic factors first establish a logical rationale for a relationship and then test it statistically. Among the earliest examples are Heins and Allison [1966], who referred to Clendenin's [1951] earlier research on factors affecting stock prices and extended it by adding the following theoretically supported variables to the regression model: average price of the stock, earnings, price-to-earnings ratio, turnover of the stock, and a binary variable describing the stock exchange where the shares were quoted. Garefalakis et al. [2013] first discuss the theoretical rationale for the input variables selected to explain the Hang Seng Index and then regress the index on them to test their significance. They chose the S&P 500, crude oil, gold, and the dollar to yen exchange rate.

Some papers focus only on specific markets and sectors. Jeon [2020] regresses the prices of the tourism stock index from the Korean stock exchange on the oil price, the KRW/USD exchange rate, tourist expenditure, the consumer price index, and industrial production. Demir [2019] built a model for the Istanbul Stock Market Index with inputs such as the commercial loan interest rate, net portfolio investment inflows, the real effective interest rate, the oil price, net foreign direct investment inflows, and real gross domestic product. Chen [2009] investigated whether macroeconomic variables were useful in predicting recessions in the US stock market by forecasting bear markets in the S&P 500 price index. He used interest rate spreads, inflation rates, aggregate output, unemployment rates, debt, and nominal effective exchange rates.

The choice of variables is usually quite arbitrary, although supported by theoretical knowledge, and the inputs can usually be classified as internal (or fundamental) factors or as external (macroeconomic) ones. An example of a study of fundamental inputs is Hartono [2004], who examines the effect of positive and negative earnings or dividend information provided by a specific company on its stock price. The main conclusion is that such information affects the share price only in the short term and has no effect in the long term. Another example is the study by Lee [2006] on the DJIA index, in which internal characteristics such as dividends, book value, and earnings were chosen and their effect on price movements was compared against the influence of external variables.

The influence of external factors is presented, for example, by Durham [2002], who examined the relationships of GDP per capita growth, levels of country credit risk, and level of legal development to the stock market. Corwin [2003] identifies asymmetric information and uncertainty as factors that affect share prices. He also provides a literature review, in which he concludes that the factors studied depend on the specific company, industry, country, or economy, but that the direction of the relationship with a given variable (positive or negative) is the same regardless of the industry or country. For example, a declining dividend would always impact the share price negatively.

The most important conclusion from the above literature analysis is that researchers tend to determine the drivers first and then test the relationship between the selected inputs and a specific stock index. This approach is common because the choice of factors is usually easy to defend with existing studies. In this paper, an alternative approach is presented: first, a large dataset of inputs is collected, and then the algorithm is set to select the most important ones. Breiman [2001] described the difference between these two approaches, calling the first the Data Modeling approach and the second the Algorithmic Modeling approach. The first focuses on objective results supported by carefully chosen variables and a carefully chosen model. The second cares more about the results and minimizes the role of the inputs and the choice of the model, which can even be treated as a “black box.”

Random Forest algorithm

The Random Forest algorithm belongs to the family of ensemble machine learning methods and is mainly used for classification and regression problems. It is based on multiple decision trees constructed through a learning process. In classification, the final result is the mode of the classes predicted by the individual trees; in regression, it is the average of the individual trees' predictions. The concept of the Random Forest was developed by Ho [1998] and extended by Breiman [2001] into the specific form that is used in this paper.
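The two aggregation rules can be illustrated with a minimal NumPy sketch. The per-tree outputs are made-up numbers rather than results from this study, and the helper name `majority_vote` is hypothetical. Note also that scikit-learn's Random Forest classifier averages the trees' class probabilities instead of counting hard votes, so the sketch follows the textbook description and not that implementation.

```python
import numpy as np

def majority_vote(votes):
    """Return the most frequent class per sample (hypothetical helper).

    votes has shape (n_trees, n_samples); ties go to the smallest class label.
    """
    classes = np.unique(votes)
    # counts[i, j] = number of trees that predicted classes[i] for sample j
    counts = np.stack([(votes == c).sum(axis=0) for c in classes])
    return classes[counts.argmax(axis=0)]

# Illustrative predictions of three trees for three samples (classes -1, 0, 1)
tree_votes = np.array([
    [1,  0, -1],
    [1, -1, -1],
    [0,  0,  1],
])
class_prediction = majority_vote(tree_votes)

# Regression: the forest prediction is the mean of the trees' predictions
tree_values = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
])
regression_prediction = tree_values.mean(axis=0)
```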

The term “random” refers to the construction process, in which every tree is trained and optimized on a randomly chosen subset of input variables. As a result, every tree may be unique, which improves the model's performance by reducing its variance at the cost of a very small increase in bias. This randomness makes the algorithm well suited to large datasets with non-obvious and nonlinear relationships. This is why the Random Forest can be used as a “black box” and can serve as a tool for choosing the determinants of price movements.

The fundamental concept of the Random Forest, as its name indicates, is a large number of individual decision trees that operate as a group with a common purpose. This simple collection of decision trees uses the wisdom of the crowd to solve classification and regression problems, so low similarity between the individual predictors is the key. Either categorical or continuous inputs may be used in the model. Outliers, a common issue when working with empirical data, are easily handled by the algorithm: because of the element of randomness, there are always some trees that do not use the affected values, which highlights the individual character of each tree. The same feature helps with incomplete datasets and missing data points by diminishing their impact on the model. Furthermore, the algorithm is robust to minor changes in the input data. The Random Forest can be classified as a “black box” algorithm because determining the impact of a single variable may be very problematic. On the one hand, this is an advantage, because the form of the relationships between inputs and outputs need not be known in advance. On the other hand, these relationships cannot be reasonably justified.

From the perspective of this research, the most important characteristics of the Random Forest are the nonlinearity of decision trees and the possibility of measuring the importance of the input variables. Usually, the determinants of stock price movements are selected by logical reasoning and verified by their significance in a linear regression model. This limits the analysis to linear relationships, which is a rather naive approach; using more complex algorithms allows us to circumvent this issue.

The importance of input variables is measured mainly in order to choose the most significant inputs or to limit their number. The most widely known approach for Random Forest is the permutation method. Gregorutti et al. [2017] describe this method in 5 steps. The first step is fitting the model and calculating the out-of-bag error for each tree, which is characteristic of algorithms that use bagging. Then the values of one variable are permuted, the changed dataset is passed through the model, and the errors are calculated again. The importance of this variable is the average, over the trees, of the differences between the out-of-bag errors calculated with and without the permuted variable. These steps are repeated for all variables used. This process was used to evaluate the importance of the market variables for future price movements of stock indices and made it possible to derive the most important ones. The whole process is described in the remaining part of the paper.
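The permutation idea can be sketched with scikit-learn. This is an illustration under stated assumptions, not the paper's code: the data are synthetic, and `sklearn.inspection.permutation_importance` measures the drop in score on a held-out set rather than on the out-of-bag samples of each tree, so it is a close relative of the procedure described by Gregorutti et al. rather than the same computation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data (assumption): only column 0 drives the target, nonlinearly
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (np.abs(X[:, 0]) > 0.7).astype(int)

# Keep the time order: fit on the first part, evaluate on the rest
X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time and record the average drop in accuracy
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
importances = result.importances_mean
ranking = np.argsort(importances)[::-1]  # most important column first
```

In this toy setting the shuffled copy of column 0 destroys most of the model's accuracy, while shuffling the noise columns changes little, which is the signal the method relies on.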

Data preparation

Daily closing prices from January 1, 2000 to January 31, 2020 for WIG20, DAX, and Stoxx600 were collected from Thomson Reuters Eikon. The dataset consists of 5,241 observations for each index. The input variables comprise a variety of global and regional stock indices, commodity prices such as oil and gold, and macroeconomic data such as interest rates, inflation, unemployment rates, sovereign nominal bond yields for different maturities, and some exchange rates. The whole input dataset consists of 209 variables with the same number of observations, which were used as inputs for each index. All the variables used are listed in Tables 1 and 2 in the Appendix.

Because the variables have different scales, normalization is required to maintain a similar variance for each of them [Specht, 1991]. In addition, normalization speeds up the learning process and may improve the results as well [Jayalakshmi and Santhakumaran, 2011]. There are many normalization methods, but the most popular are the Min-Max method and the z-score method. The z-score method was chosen for this research.
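For reference, the two methods can be written in their standard textbook form. The notation is an assumption introduced here: $x_t$ is an observation of a series $x$, and $\mu$ and $\sigma$ are the mean and standard deviation of that series over the normalization sample.

```latex
x'_{t} = \frac{x_{t} - \min(x)}{\max(x) - \min(x)},
\qquad
z_{t} = \frac{x_{t} - \mu}{\sigma}
```

Min-Max maps the series into the interval $[0, 1]$, whereas the z-score centers it at zero with unit variance.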

The inputs were transformed into changes (percentage or absolute), depending on the characteristics of the variable. After this step, a 5-year rolling z-score was computed for each series, which yields an input dataset that can be used in the subsequent training procedure.
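A minimal pandas sketch of these two transformations follows. Several details are assumptions, since the paper does not specify them: a year is taken as 252 trading days, the rolling window includes the current observation, and the helper names `to_changes` and `rolling_zscore` are hypothetical. The price series is synthetic.

```python
import numpy as np
import pandas as pd

TRADING_DAYS_PER_YEAR = 252  # assumption; the paper does not state the day count

def to_changes(series, kind="pct"):
    """Percentage changes for price-like series, absolute changes otherwise."""
    if kind == "pct":
        return series.pct_change()
    return series.diff()

def rolling_zscore(series, window):
    """Standardize each value with the mean and std of the trailing window."""
    rolling = series.rolling(window)
    return (series - rolling.mean()) / rolling.std()

# Synthetic series standing in for one input variable
rng = np.random.default_rng(0)
dates = pd.bdate_range("2000-01-03", periods=3000)
prices = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(dates)))), index=dates
)

window = 5 * TRADING_DAYS_PER_YEAR
changes = to_changes(prices, kind="pct")
z = rolling_zscore(changes, window)
```

Under these assumptions the first five years of every series produce no z-score, because a full window of changes is not yet available.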

Training the algorithm

In this paper, the supervised learning method is used to fit the Random Forest algorithm. The future one-month price change was chosen to be predicted by the model, as it is effectively a short time horizon, but not so short that the changes can be treated as random. From the investors’ perspective, an investment decision does not require a specific price value but rather an expectation of whether the price will go up, go down, or remain stable. That is why the following classification was introduced: “1” stands for a one-month future rise in price of more than 2%, “0” stands for a price change between −2% and 2%, and “−1” stands for a change below −2%. The resulting series of labels serves as a “teacher” for the algorithm.
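The labelling rule can be sketched as follows. The function name `make_labels` is hypothetical, and the default horizon of 21 trading days is an assumed stand-in for "one month", which the paper does not define in days. The toy example uses a horizon of one step so that the result is easy to verify by hand.

```python
import numpy as np
import pandas as pd

def make_labels(prices, horizon=21, threshold=0.02):
    """Label each day by the price change over the next `horizon` observations.

    1 for a rise above `threshold`, -1 for a fall below -`threshold`, else 0.
    Days without a known future price are left as NaN.
    """
    future_change = prices.shift(-horizon) / prices - 1.0
    labels = pd.Series(
        np.select(
            [future_change > threshold, future_change < -threshold],
            [1.0, -1.0],
            default=0.0,
        ),
        index=prices.index,
    )
    return labels.where(future_change.notna())

# Toy prices: +3.0%, +0.97%, -3.85%, 0.0%, then no future observation
toy_prices = pd.Series([100.0, 103.0, 104.0, 100.0, 100.0])
toy_labels = make_labels(toy_prices, horizon=1)
```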

A popular method of dividing the dataset into in-sample and out-of-sample data is to split the whole dataset into a training set and a test set, usually in proportions of 60% to 40% or 67% to 33%, respectively. However, Hart [1994] suggested time series cross-validation as a substitute for the standard training-test division to benefit from the time ordering of the data. In time series cross-validation, a window rolls over the entire dataset. There are two ways to use such a window. The first fixes its starting point at the beginning of the data and expands it to cover the observations one by one. The second is based on a fixed-size window that shifts over the time series; it is more suitable for economic time series, because their variance changes and thus the relationships may change over time. This paper uses the second approach, with the span of the rolling window fixed at 5 years. This means that the algorithm is trained on 5 years of data and forecasts only the first day after the window. It implies that only the first 5 years can be treated as a purely in-sample period.
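The fixed-size rolling window with a one-day-ahead forecast can be sketched as an index generator. The function name `rolling_window_splits` is hypothetical, and the tiny sizes in the example replace the 5-year window only to keep the output readable.

```python
import numpy as np

def rolling_window_splits(n_obs, window):
    """Yield (train_indices, test_index) for a fixed-size rolling window.

    Each step trains on `window` consecutive observations and tests on the
    single observation that immediately follows them.
    """
    for start in range(n_obs - window):
        train_idx = np.arange(start, start + window)
        test_idx = start + window
        yield train_idx, test_idx

# Ten observations and a window of five give five one-step-ahead forecasts
splits = list(rolling_window_splits(n_obs=10, window=5))
```

An expanding window would differ only in keeping the first training index fixed at zero.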

Usually, before the training process starts, the algorithm's hyperparameters need to be selected. Hyperparameters are external specifications of the model that cannot be estimated from data. In the case of Random Forest, examples of hyperparameters are the number of trees used in the algorithm, their maximal depth, and the number of variables considered at each split of a node. There are several ways of choosing the right set of such parameters, such as manual search, grid search, or random search [Bergstra and Bengio, 2012], although there is no universally optimal way to find the best set. As the search for the best hyperparameters is not the subject of this paper, the author decided to use the default values provided by the Scikit-learn package in Python.
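The defaults in question can be inspected directly, as in the short sketch below. The paper does not state which Scikit-learn version was used, and some defaults have changed between releases, so the printed values describe the installed version and not necessarily the one behind the reported results.

```python
from sklearn.ensemble import RandomForestClassifier

# An estimator created without arguments carries the library defaults
model = RandomForestClassifier()
defaults = model.get_params()

# The three hyperparameters named in the text
summary = {
    name: defaults[name]
    for name in ("n_estimators", "max_depth", "max_features")
}
print(summary)
```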

The next step in the training process is the reduction of the number of variables used in the model. Genuer et al. [2010] describe approaches to variable selection and suggest the use of Recursive Feature Elimination. The key element of this process is the variable importance presented in the previous sections. The model is fitted and the importance of all the inputs is calculated. Then, the least important feature is eliminated and the process is repeated until the specified number of variables is left. Recursive Feature Elimination reduces the number of inputs, which decreases the chance of overfitting and may increase the effectiveness of the model. In this research, the number of variables used in the final models is limited to 5, and the Recursive Feature Elimination process is run every month. This means that the 5 most important variables are derived each month, and this set is used to fit the model for the rest of the days in that month.
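A minimal sketch of this elimination loop with scikit-learn's `RFE` class is given below. It is an illustration on synthetic data, not the paper's code. One difference should be stated plainly: with a Random Forest estimator, `RFE` ranks variables by the impurity-based `feature_importances_` attribute by default, not by permutation importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data (assumption): 12 inputs, only columns 0 and 1 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
y = (np.abs(X[:, 0]) + np.abs(X[:, 1]) > 1.5).astype(int)

# Refit the forest and drop the least important variable until 5 remain
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
    step=1,
)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)  # column indices that survived
```

In the setting described in the text, a selection of this kind would be repeated once a month on the current rolling window.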

The output of the fitted models is the importance of the variables chosen in the Recursive Feature Elimination process, calculated every time the rolling window moves forward. The importances were saved in a data frame and then analyzed.

The whole process was performed in Python in the Spyder environment. The main packages used were “pandas,” “numpy,” and “sklearn.”

Results and conclusions

The results are presented as the percentage of occurrences among the final 5 chosen variables in the out-of-sample period. As the rolling window accounts for 5 years of observations and the variable selection process happens every month, there were 176 opportunities to qualify for the best 5 variables. Figures 1–3 present the 15 input variables most often chosen for WIG20, DAX, and Stoxx Europe 600, respectively.
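The occurrence statistic can be sketched as a simple count over the monthly selections. The function name `occurrence_share` and the variable names are placeholders, and the toy input has 4 months with 3 variables each instead of the 176 months with 5 variables each described in the text.

```python
from collections import Counter

def occurrence_share(monthly_selections):
    """Percentage of months in which each variable was among those selected."""
    n_months = len(monthly_selections)
    # set() guards against counting a variable twice within one month
    counts = Counter(
        name for month in monthly_selections for name in set(month)
    )
    return {
        name: 100.0 * count / n_months
        for name, count in counts.most_common()
    }

# Placeholder selections for four months
toy_selections = [
    ["var_a", "var_b", "var_c"],
    ["var_a", "var_c", "var_d"],
    ["var_a", "var_b", "var_e"],
    ["var_a", "var_d", "var_e"],
]
shares = occurrence_share(toy_selections)
```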

Figure 1

Occurrence in top 5 selected variables for WIG20 price index.

Figure 2

Occurrence in top 5 selected variables for DAX price index.

Figure 3

Occurrence in top 5 selected variables for Stoxx 600 price index.

The results show that the most frequent determinant for all of the analyzed stock indices is the China Prime Lending Rate, transformed into changes and normalized by z-score. This variable refers to the weighted average interest rate quoted by three major Chinese banks on loans and directly influences the economy in China through the cost of capital. The second important finding is the frequent occurrence of The Conference Board Leading Economic Index, an American leading indicator that is expected to forecast future economic activity. Other notable variables are various bond yields and inflation rates. Variables of this kind tend to describe the overall performance of the economy of a country or even a whole region, and major economies such as the US and China have a significant impact on smaller markets such as the Polish and German ones. These dependencies can be seen in the results; however, using a “black box” model in the form of a Random Forest does not make it possible to quantify them or confirm their significance statistically.

From a practical perspective, this research gives investors insight into the factors that may drive European stock indices. The results indicate that yields and the American and Chinese economic indicators have a stronger nonlinear impact on future price changes of WIG20, DAX, and Stoxx Europe 600 than other equity indices, interest rates, or commodity prices.

The main purpose of this paper was to show an alternative way of selecting the determinants of three stock indices using the Random Forest algorithm. The research shows that this approach might be useful as a tool for finding nonlinearly related factors. Further research on this topic may include examining the effectiveness of the forecasts and checking whether the above relationships can be confirmed statistically.

eISSN:
2299-9701
Language:
English