This paper examines the use of a machine learning algorithm to derive the determinants of price movements of stock indices. The Random Forest algorithm was selected as a representative of nonlinear algorithms based on decision trees. Brokerage and investment firms, as well as individual investors, need comprehensive and insightful information, such as the drivers of stock price movements and the relationships between various stock market factors, so that they can invest more efficiently. Our work focuses on determining the factors that drive the future price movements of Stoxx Europe 600, DAX, and WIG20 by using the importance of input variables in the Random Forest classifier. The main determinants were derived from a large dataset containing macroeconomic and market data, which were collected daily from various sources.
- Random Forest
- stock index
- machine learning
Stock investors value knowledge of the factors that drive share prices, and this knowledge may influence how they construct their portfolios. These factors have also been a subject of academic research, and the results are highly diverse, reflecting different methodologies, models, and markets. An important feature of these academic studies is the approach to choosing the factors considered in further analysis. In general, the inputs are derived through logical reasoning and then tested for significance. This study aims to show a nonlinear alternative for deriving the most important drivers of future price movements of the chosen stock indices from a large dataset.
Usually, the determinants of price movements are established through regression significance [e.g., Al-Shubiri, 2010; Al-Tamimi et al., 2011; Garefalakis et al., 2013]. In this work, the author tackles the problem differently, using a machine learning algorithm called Random Forest to analyze relationships between macro- and microeconomic variables and future price movements. Three stock indices were chosen: the Polish WIG20, constituted by the twenty largest companies listed on the Warsaw Stock Exchange; DAX, a total return index containing the thirty largest German stocks listed on the Frankfurt Stock Exchange; and Stoxx Europe 600, which includes the largest 600 companies from 17 European countries. The whole dataset includes day-to-day observations for the period January 2004 to January 2020.
The remainder of the article is divided into four sections. Section 2 gives an overview of how the determinants of prices can be derived and summarizes the literature. Section 3 presents some characteristics of the Random Forest algorithm and how it can be used to identify determinants. Section 4 describes the data preparation steps and the methodology of training the algorithm on the dataset. Section 5 summarizes the results of the research and presents the main conclusions.
A large number of theoretical and empirical studies have examined the factors that affect stock prices and returns. This section presents the general idea behind the selection of methodologies. Because these studies differ in their input data, time horizons, and data sources, only a general comparison is performed. Nevertheless, some broad conclusions can be drawn from this analysis.
Most of the research papers investigating the relationships between the stock market and micro- and macroeconomic factors concentrate on establishing a logical rationale and then testing the relationship statistically. Among the earliest examples are Heins and Allison, who built on Clendenin's prior research on factors affecting stock prices and extended it by adding the following theoretically supported variables to the regression model: the average price of the stock, earnings, the price-to-earnings ratio, the turnover of the stock, and a binary variable describing the stock exchange on which the shares were quoted. Garefalakis et al. first discuss the theoretical aspects of the input variables selected to determine the factors affecting the Hang Seng Index and then regress them to test their significance. They chose the S&P 500, crude oil, gold, and the dollar-to-yen exchange rate.
Some papers focus only on specific markets and sectors. Jeon regresses the prices of the tourism stock index from the Korean stock exchange on the oil price, the KRW/USD exchange rate, tourist expenditure, the consumer price index, and industrial production. Demir built a model for the Istanbul Stock Market Index with inputs such as the commercial loan interest rate, net portfolio investment inflows, the real effective interest rate, the oil price, net foreign direct investment inflows, and real gross domestic product. Chen investigated whether macroeconomic variables were useful in predicting recessions on the US stock market by forecasting bear markets in the S&P 500 price index. He used interest rate spreads, inflation rates, aggregate output, unemployment rates, debt, and nominal effective exchange rates.
The choice of variables is usually quite arbitrary, though supported by theoretical knowledge, and the inputs can usually be classified as internal (or fundamental) factors or as external (macroeconomic) ones. Fundamental inputs are studied on specific examples by Hartono, who examines the effect of positive and negative earnings or dividend information provided by a company on its stock price. The main conclusion is that such information affects the share price only in the short term and has no effect in the long term. Another example is the study by Lee on the DJIA index, in which internal characteristics such as dividends, book value, and earnings were chosen and their effect on price movements was compared with the influence of external variables.
The influence of external factors is presented, for example, by Durham, who examined the relationships of GDP per capita growth, levels of country credit risk, and level of legal development with the stock market. Corwin recognizes asymmetric information and uncertainty as factors that affect share prices. He also provides a literature review, in which he concludes that the factors studied in other research depend on the specific company, industry, country, or economy, but that the direction of the relationship with a selected variable, positive or negative, is the same regardless of industry, country, and so on. For example, a declining dividend would always impact the share price negatively.
The most important conclusion from the above literature analysis is that researchers tend to determine the drivers first and then test the relationship between the selected inputs and a specific stock index. Such an approach is common because the choice of factors is usually easy to defend with existing studies. In this paper, an alternative approach is presented. First, a large dataset of inputs is collected, and then the algorithm is left to select the most important ones. Breiman has presented the difference between these two approaches, calling the first the Data Modeling approach and the second the Algorithmic Modeling approach. The first focuses on objective results supported by carefully chosen variables and the model. The second cares more about the results and minimizes the role of the inputs and the choice of the model, which can even be treated as a “black box.”
The Random Forest algorithm belongs to the family of ensemble machine learning methods and is mainly used for classification and regression problems. It is based on multiple decision trees constructed through a learning process. The final result is either a class, namely the mode of the classes output by all trees in classification, or the average of the individual trees' predictions in regression. The concept of the Random Forest was developed by Ho and was extended by Breiman into the specific form used in this paper.
The term “random” refers to the construction process, in which every tree is trained and optimized on a randomly chosen subset of input variables. As a result, every tree may be unique, which improves the model's efficiency by reducing its variance with only a very small increase in bias. This randomness makes the algorithm well suited to large datasets with non-obvious and nonlinear relationships. This is why the Random Forest can be used as a “black box” and can serve as a tool for choosing the determinants of price movements.
The Random Forest, as its name indicates, consists of a large number of individual decision trees that operate as a group with a common purpose. This collection of decision trees uses the wisdom of the crowd to solve classification and regression problems, so low similarity between the individual predictors is key. Both categorical and continuous inputs may be used. The common issue of outliers in empirical data is handled easily by the algorithm: because of the element of randomness, there are always some trees that do not consider the affected values, which highlights the individual character of each tree. The same feature helps with incomplete datasets and missing data points by diminishing their impact on the model. Furthermore, the algorithm is resistant to minor changes in the input data. The Random Forest can be classified as a “black box” algorithm because determining the impact of a single variable may be very problematic. On the one hand, this is an advantage, as the form of the relationships between inputs and outputs does not need to be specified. On the other hand, these relationships cannot be reasonably justified.
From the perspective of this research, the most important characteristic of Random Forest is the nonlinearity in decision trees and the possibility to measure the importance of the input variables. Usually, the determinants of stock price movements are selected by logical reasoning and verified by significance in the linear regression model. This limits the analysis to linear relationships, which is quite a naive approach, and using more complex algorithms allows us to circumvent this issue.
Variable importance is measured mainly to choose the most significant inputs or to limit their number. The most commonly known method for assessing variable importance in a Random Forest is the permutation method, which Gregorutti et al. describe in five steps. First, the model is fitted and the out-of-bag error is calculated for each tree, which is characteristic of algorithms using bagging. Then the values of one variable are permuted, the changed dataset is passed through the model, and the errors are calculated again. The importance of this variable is the average, over all trees, of the difference between the out-of-bag errors calculated with and without the permuted variable. These steps are repeated for every variable used. This process was used to evaluate the importance of the market variables for future price movements of stock indices and made it possible to derive the most important ones. The whole process is described in the remaining part of the paper.
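The permutation idea can be illustrated with a minimal sketch. Note that it is not the per-tree out-of-bag procedure described above: it relies on `permutation_importance` from scikit-learn, which permutes each column of a held-out set and measures the drop in the score of the whole fitted forest. The synthetic data, the 400/200 split, and all variable names are assumptions made only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data (assumption): four inputs, only column 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 0).astype(int)

# Chronological-style split: fit on the first 400 rows, evaluate on the rest.
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Permute each column in turn and record the average drop in accuracy.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)

# Column indices ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean.round(3))
```

A column whose permutation barely changes the score contributes little to the predictions, whereas a large drop marks a candidate determinant.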
The daily closing prices from January 1, 2000 to January 31, 2020 for WIG20, DAX, and Stoxx Europe 600 were collected from Thomson Reuters Eikon. The whole data set consisted of 5,241 observations for each index. As input variables, a variety of global and regional stock indices were collected, together with commodity prices such as oil or gold, macroeconomic data such as interest rates, inflation, and unemployment rates, sovereign nominal bond yields for different maturities, and some exchange rates. The whole input data set consists of 209 variables with the same number of observations, which were used against the indices. All the variables used are listed in Tables 1 and 2 in the Appendix.
Because of the different scales of the variables, normalization is required to maintain a similar variance for each of them [Specht, 1991]. Moreover, normalization shortens the learning process and may improve the results as well [Jayalakshmi and Santhakumaran, 2011]. There are many normalization methods, the most popular being the Min-Max and z-score methods. The z-score method was chosen for this research.
The inputs were transformed into changes, either percentage or absolute, depending on the characteristics of the variable. After this step, a 5-year rolling z-score was applied to all of the series, which prepared the input data set for the subsequent training procedure.
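The transformation into changes followed by a rolling z-score can be sketched with pandas as follows. The helper names `to_changes` and `rolling_zscore` are hypothetical, the window length of 5 × 261 observations is an assumption based on roughly 261 daily observations per year, and the price series is simulated.

```python
import numpy as np
import pandas as pd

def to_changes(series: pd.Series, kind: str = "pct") -> pd.Series:
    """Percentage change for price-like series, absolute change otherwise."""
    return series.pct_change() if kind == "pct" else series.diff()

def rolling_zscore(series: pd.Series, window: int) -> pd.Series:
    """Standardize each value with the mean and std of the trailing window."""
    roll = series.rolling(window=window, min_periods=window)
    return (series - roll.mean()) / roll.std()

# Assumption: about 261 daily observations per year, so 5 years = 1305 rows.
WINDOW = 5 * 261

# Simulated daily price series standing in for one input variable.
idx = pd.bdate_range("2000-01-03", periods=1500)
rng = np.random.default_rng(1)
prices = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(idx)))), index=idx
)

z = rolling_zscore(to_changes(prices, "pct"), WINDOW)
print(z.dropna().head())
```

Because the window is trailing, each standardized value uses only current and past observations. The first `WINDOW` values are missing, which is why part of the sample is consumed by the normalization step.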
In this paper, a supervised learning method is used to fit the Random Forest algorithm. The future one-month price change was chosen as the quantity to be predicted, as it is a short horizon but not short enough to be treated as random. To make an investment decision, an investor does not need a specific price value but rather an expectation of whether the price will go up, go down, or remain stable. That is why the following classification has been introduced: “1” stands for a one-month future rise in price of more than 2%, “0” stands for a price change between −2% and 2%, and “−1” stands for changes lower than −2%. The resulting supervised time series serves as a “teacher” for the algorithm.
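The three-class target can be constructed as in the sketch below. The function name `make_labels` is hypothetical, and the default horizon of 21 observations as a proxy for one month is an assumption, since the exact number of days is not stated.

```python
import numpy as np
import pandas as pd

def make_labels(prices: pd.Series, horizon: int = 21,
                threshold: float = 0.02) -> pd.Series:
    """Label each day by the price change over the next `horizon` rows:
    1 above +threshold, -1 below -threshold, 0 otherwise."""
    future_return = prices.shift(-horizon) / prices - 1.0
    values = np.select(
        [future_return > threshold, future_return < -threshold],
        [1.0, -1.0],
        default=0.0,
    )
    labels = pd.Series(values, index=prices.index)
    # The last `horizon` rows have no known future price, so no label.
    return labels.where(future_return.notna())

# Toy series with a 3-row horizon to keep the example readable.
toy = pd.Series([100.0, 100.0, 100.0, 103.0, 97.0, 100.5])
labels = make_labels(toy, horizon=3)
print(labels.tolist())
```

In the toy series the first three rows look ahead to +3%, −3%, and +0.5%, so they fall into the classes 1, −1, and 0, while the remaining rows stay unlabeled.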
A popular method for dividing the dataset into in-sample and out-of-sample data is to split it into a training set and a test set, usually in a proportion of 60% to 40% or 67% to 33%, respectively. However, Hart suggested time series cross-validation as a substitute for the standard training-test division to benefit from the time characteristic of the data. In time series cross-validation, a window rolls over the entire data set. There are two ways to use such a window. The first fixes its starting point at the beginning of the data and expands it to cover the observations one by one. The second, which is based on a fixed-size window that shifts over the time series, is more suitable for economic time series, as they are characterized by changing variability and thus the relationships may change over time. This paper uses the second approach. The span of the rolling window is fixed at 5 years, which means that the algorithm is trained on 5 years of data and forecasts only the first day ahead of the window. This implies that only the first 5 years can be treated as an in-sample period.
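The fixed-size rolling window can be expressed as a small generator. The name `rolling_window_splits` is hypothetical, and the sketch works on row positions only, so the toy sizes below stand in for the 5-year window.

```python
from typing import Iterator, Tuple

def rolling_window_splits(n_obs: int, window: int,
                          step: int = 1) -> Iterator[Tuple[range, int]]:
    """Yield (train_positions, test_position) pairs: a window of fixed
    length is used for training and the next row is forecast."""
    for start in range(0, n_obs - window, step):
        yield range(start, start + window), start + window

# Toy example: 10 observations and a window of 5 rows.
splits = list(rolling_window_splits(n_obs=10, window=5))
for train, test in splits:
    print(list(train), "->", test)
```

With 10 observations and a window of 5, rows 5 to 9 are each forecast once, always from the 5 rows immediately preceding them. An expanding window would instead keep the start fixed at row 0.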
Usually, before training starts, the algorithm's hyperparameters need to be selected. Hyperparameters are external specifications of the model that cannot be estimated from data. For Random Forest, examples are the number of trees, their maximal depth, and the number of variables considered at each node split. There are several ways of choosing such parameters, such as manual search, grid search, or random search [Bergstra and Bengio, 2012], although there is no universally optimal way to find the best set. As the search for the best hyperparameters is not the subject of this paper, the author decided to use the default values provided by the Scikit-learn package in Python.
The next step in the training process is to reduce the number of variables used in the model. Genuer et al. describe approaches to variable selection and suggest using Recursive Feature Elimination. The key element of this process is the variable importance presented in the previous sections. The model is fitted and the importance of all the inputs is calculated. Then the least important feature is eliminated, and the process is repeated until the specified number of variables is left. Recursive Feature Elimination reduces the number of inputs in order to decrease the chance of overfitting, and it may increase the effectiveness of the model. In this research, the number of variables in the final models is limited to 5, as derived by the Recursive Feature Elimination process every month. That is, the 5 most important variables are derived each month, and this set is used to fit the model for the remaining days of that month.
The output of the fitted models is the importance of the variables chosen in the Recursive Feature Elimination process, calculated every time the rolling window moves forward. The importances were saved in a data frame and then analyzed.
The whole process was performed in Python in the Spyder environment. The main packages used were “pandas,” “numpy,” and “sklearn.”
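A minimal sketch of the elimination loop, assuming scikit-learn's `RFE` wrapper around a Random Forest classifier, is given below. By default `RFE` ranks features by the forest's impurity-based `feature_importances_`, not by permutation importance, so the sketch shows the elimination mechanics, not necessarily the exact importance measure used here. The data are synthetic and the number of trees is reduced only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data (assumption): 12 inputs, only columns 0 and 1 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
signal = X[:, 0] + X[:, 1]
y = np.where(signal > 0.5, 1, np.where(signal < -0.5, -1, 0))

# Drop the least important feature one at a time until 5 remain.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
    step=1,
)
selector.fit(X, y)

# Positions of the surviving columns.
selected = np.flatnonzero(selector.support_)
print(selected)
```

In a monthly setup, such a fit would be repeated at the start of each month on the current training window, and the selected columns would be reused for the daily fits until the next selection.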
The results are presented as the percentage of occurrences among the final 5 chosen variables in the out-of-sample period. As the rolling window covers 5 years of observations and the variable selection happens every month, there were 176 opportunities to qualify for the best 5 variables. Figures 1–3 present the 15 input variables most often chosen for WIG20, DAX, and Stoxx Europe 600, respectively.
Results show that the most frequent determinant for all of the analyzed stock indices is the China Prime Lending Rate, transformed into changes and normalized by z-score. This variable refers to the weighted average interest rate quoted by three major Chinese banks on loans and directly influences the economy in China through the cost of capital. The second important finding is the occurrence of The Conference Board Leading Economic Index, an American leading indicator that is expected to forecast future economic activity. Other notable variables are various bond yields and inflation rates. Variables of this kind tend to describe the overall performance of the economy of a country or even a whole region, and major economies such as the US or China have a significant impact on smaller markets like the Polish and German ones. These dependencies can be seen in the results; however, using a “black box” model in the form of a Random Forest does not allow them to be quantified or their significance to be confirmed statistically.
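The percentage of occurrence can be computed from the monthly selections as in the following sketch. The function name `selection_frequency` and the variable names in the toy input are hypothetical placeholders, not results of this study.

```python
from collections import Counter
from typing import Dict, List

def selection_frequency(monthly_selections: List[List[str]]) -> Dict[str, float]:
    """Share of months (in %) in which each variable was among the selected."""
    counts = Counter(
        name for month in monthly_selections for name in set(month)
    )
    n_months = len(monthly_selections)
    return {name: 100.0 * c / n_months for name, c in counts.most_common()}

# Made-up selections for four months, two variables per month.
monthly = [
    ["var_a", "var_b"],
    ["var_a", "var_c"],
    ["var_a", "var_b"],
    ["var_a", "var_d"],
]
freq = selection_frequency(monthly)
print(freq)
```

With 176 monthly selections, a variable chosen in 88 of them would be reported as occurring 50% of the time.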
From a practical perspective, this research gives investors an insight into the factors that may drive European stock indices. The results indicate that yields and the American and Chinese economic indicators have a more nonlinear impact on future price changes of WIG20, DAX, and Stoxx Europe 600 than other equity indices, interest rates, or commodity prices.
The main purpose of this paper was to show an alternative way of selecting the determinants of three stock indices using the Random Forest algorithm. The research shows that this approach might be useful as a tool for finding nonlinearly related factors. Further research on this topic may include examining the effectiveness of the forecasts and checking whether the above relationships can be confirmed statistically.
Table 1. List of input variables: commodities, macroeconomics, and bond yields
|Oil||EM HICP—ALL ITEMS||US 30y bond yield|
|Copper||The Conference Board Leading Economic Index||German 2y bond yield|
|Gold||US money supply M2||Euro 10y bond yield|
|Palladium||CHINA CPI||Greece 10y bond yield|
|Aluminum||Eurozone Consumer Confidence Indicator||Portugal 10y bond yield|
|Platinum||US AHETPI||UK 3y bond yield|
|Nickel||UK CPI YoY||US 1y bond yield|
|Wheat||China Imports||US 5y bond yield|
|Lead||German CPI||German 5y bond yield|
|Silver||China Exports||Canadian 10y bond yield|
|Sugar||US New Private Housing Units Started||Austrian 10y bond yield|
|Soybeans||China Prime Lending Rate||Belgian 10y bond yield|
|Cotton||Eurozone Unemployment Rate||Finland 10y bond yield|
|Nickel 3m||US Nonfarm Payroll Employment||South African 10y bond yield|
|US Personal Consumption Expenditures||Sweden 10y bond yield|
|Baltic Exchange Dry Index||Eurozone Industrial Confidence Indicator||US 3y bond yield|
|IBOXX EURO CORPORATES||US Industrial Production - Manufacturing||French 3m bond yield|
|IBOXX EURO EUROZONE||UK Consumer Confidence Indicator||German 1y bond yield|
|IBOXX £ CORPORATES||Indian 10y bond yield|
|US CPI—ALL URBAN||US 10y bond yield||Irish 10y bond yield|
|US ISM PMI||German 10y bond yield||Japan 3m bond yield|
|US Unemployment Rate||UK 10y bond yield||Netherlands 10y bond yield|
|US Industrial Production||Japan 10y bond yield||Polish 10y bond yield|
|UK CPI INDEX All Items||Italy 10y bond yield||French 5y bond yield|
|US Treasury Bill Rate – 3m||US 3m bond yield||German 30y bond yield|
|US Personal Consumption Expenditure||US 2y bond yield||Italian 3m bond yield|
|China CPI YoY||French 10y bond yield||Swiss 10y bond yield|
|US PCE less food & energy||German 3y bond yield||US 6m bond yield|
Table 2. List of input variables: currencies and equities
|USD TWI||New Zealand $ to Euro||S&P 500||MDAX Frankfurt|
|EUR TWI||Peruvian Sol to Euro||MSCI World||OMX Copenhagen|
|GBP TWI||Chilean Peso to Euro||FTSE 100||S&P 500 Growth|
|US $ to EURO||Croatian Kuna to Euro||DAX 30 Price Index||SHENZHEN SE B Share|
|UK £ to EURO||Icelandic Krona to Euro||Stoxx 600||SBF 120|
|Japanese Yen to Euro||Qatari Rial to Euro||Euro stoxx 50||STOXX EUROPE 50|
|Swiss Franc to Euro||Saudi Riyal to Euro||Hang Seng||AEX All Share|
|Swedish Krona to Euro||Ukraine Hryvnia to Euro||Shanghai SE A Share||Euronext 100|
|Norwegian Krone to Euro||Albanian Lek to Euro||Topix||OSLO Exchange All Share|
|Chinese Yuan to Euro||Algerian Dinar to Euro||MSCI EM||BUDAPEST (BUX)|
|Canadian $ to Euro||GBP to USD||CAC 40||ROMANIA BET|
|Brazilian Real to Euro||Japanese Yen to USD||FTSE All Share||S&P/ASX 300|
|Danish Krone to Euro||Chinese Yuan to USD||Korea SE Kospi||Next 150|
|Polish Zloty to Euro||South Africa Rand to USD||MSCI AC World||MSCI EAFE|
|Australian $ to Euro||Canadian $ to USD||S&P/ASX 200||MSCI PACIFIC|
|Czech Koruna to Euro||Brazilian Real to USD||FTSE 250||NIKKEI 225|
|Hong Kong $ to Euro||South Korean Won to USD||FTSE MIB||DOW JONES IND.|
|Hungarian Forint to Euro||Indian Rupee to USD||S&P/TSX||NASDAQ 100|
|Russian Rouble to Euro||Mexican Peso to USD||Euro stoxx||BRAZIL BOVESPA|
|Singapore $ to Euro||Russian Rouble to USD||Swiss SMI||NASDAQ|
|South Africa Rand to Euro||Swiss Franc to USD||MSCI Europe||FTSE/JSE ALL SHARE|
|New Turkish Lira to Euro||Australian $ to USD||Ibex 35||RUSSELL 2000|
|South Korean Won to Euro||Indonesian Rupiah to USD||AEX index||FTSE ALL WORLD|
|Argentine Peso to Euro||Swedish Krona to USD||BEL 20||FTSE Bursa Malaysia|
|Indian Rupee to Euro||EUR to USD||OMX 30||MOEX Russia Index|
|Mexican Peso to Euro||Taiwan New $ to USD||S&P 500 Value||WIG|
|Thai Baht to Euro||Thai Baht to USD||Straits Times Index|
|Israeli Shekel to Euro||CHF to USD||Taiwan SE - TAIEX|
|Colombian Peso to Euro||Hong Kong $ to USD||SET index Bangkok|
|Malaysian Ringgit to Euro||New Turkish Lira to USD||Hang Seng China Enterprises|
|New Romanian Leu to Euro||Norwegian Krone to USD||IDX|
|Taiwan New $ to Euro||Philippine Peso to USD||OMX Helsinki|