Supervised Machine Learning in Financial Forecasting: Identifying Effective Models for the BIST100 Index

Published: 8 September 2025


INTRODUCTION

Accurate financial performance forecasting is integral to a company’s strategic planning and financial management, playing a critical role in shaping both short- and long-term business decisions. By predicting future financial performance, companies can make well-informed decisions about resource allocation, risk management, and investment opportunities (Roberts and Dowling, 2002; Koller et al., 2010). As the financial landscape becomes increasingly complex, with companies operating in volatile and unpredictable markets, the demand for more accurate forecasting methods has grown significantly. Recent advancements in computational techniques, coupled with the exponential growth of financial data, have prompted a surge in the application of machine learning (ML) models to financial performance prediction (Jordan et al., 2015; Nosratabadi et al., 2020; Iqbal et al., 2020). These models offer robust capabilities for capturing complex, non-linear relationships within financial data, addressing many of the limitations of traditional forecasting approaches.

Despite the potential of machine learning, gaps remain in our understanding of which models perform best in specific financial contexts, particularly in emerging markets such as Turkey. The unique economic structure and market dynamics of Turkey, as represented by the companies listed on the Istanbul Stock Exchange (BIST100), offer an ideal environment for testing and comparing machine learning models. The BIST100 index, composed of the 100 largest publicly traded companies in Turkey, reflects the diverse economic landscape of the country and plays a significant role in its financial stability and growth. Accurate predictions of financial performance within this index are not only essential for companies themselves but also for investors, analysts, and policymakers who rely on such forecasts for decision-making.

Traditional financial forecasting methods, while useful in many cases, often fall short in capturing the multi-dimensional and non-linear interactions inherent in financial data (Refenes et al., 1997; Clements et al., 2004; Gradojevic and Yang, 2006; Adegbite et al., 2019). Financial data is affected by a wide range of factors, including macroeconomic indicators, company-specific financial ratios, and market sentiment. These factors interact in complex ways that can be difficult to model using conventional statistical approaches. In contrast, machine learning models, which are designed to handle large and complex datasets, offer significant advantages for financial forecasting. Specifically, they can learn from historical data and identify intricate patterns and relationships that may not be apparent through traditional methods.

This study aims to address the research gap by conducting a comprehensive evaluation of several supervised machine learning models for predicting the financial performance of companies in the BIST100 index. The models selected for this analysis include Tree-Based Models such as Decision Trees, Random Forests, Bagging, and Boosting methods (XGBoost, LightGBM, CatBoost), Neural Network-based models like Artificial Neural Networks (ANN), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTM), and Instance-based Learning Models such as K-Nearest Neighbors (KNN) and Support Vector Machines (SVM). These models were chosen for their proven capabilities in handling high-dimensional, non-linear financial data (Jones, 2017; Chow, 2018; Jan, 2021; Reel et al., 2021).

The dataset for this study is drawn from the BIST100 index, covering a ten-year period from 2014 to 2024. It includes daily stock prices, trading volumes, and various technical indicators such as Moving Averages (MA 50, MA 200), Relative Strength Index (RSI), Moving Average Convergence Divergence (MACD), and Bollinger Bands (BB Mid, BB Upper, BB Lower). Financial ratios, derived from companies’ balance sheets and income statements – such as the debt-to-equity ratio, current ratio, return on capital employed, and net profit margin – serve as key inputs for the machine learning models (Höbarth, 2006; Chen, 2011; Xiuguo and Shengyong, 2022). By assessing model performance through error metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), relative Root Mean Squared Error (rRMSE), Mean Absolute Percentage Error (MAPE), and Mean Absolute Error (MAE), this study provides a comprehensive evaluation of which algorithms deliver the most accurate and reliable predictions in this financial context.

In addition to determining which models provide the best overall accuracy, this research seeks to explore the conditions under which certain models may outperform others. For instance, tree-based models have often demonstrated strong performance in financial forecasting due to their ability to handle non-linear relationships, while neural networks, especially deep learning methods like CNNs and LSTMs, may excel in capturing more subtle patterns and temporal dependencies in financial time series data (Jordan et al., 2015; Jan, 2021). Understanding these nuances is essential for improving the practical application of machine learning in financial forecasting.

The practical implications of this study are far-reaching. Investors, financial analysts, and corporate decision-makers can leverage the insights from this research to make more informed investment strategies, optimize financial performance, and mitigate risks. By comparing the strengths and weaknesses of various machine learning models, this research provides a valuable resource for selecting the most appropriate tools for financial forecasting in diverse economic contexts.

This paper is structured as follows: Section 2 presents a comprehensive literature review of machine learning applications in financial forecasting. Section 3 describes the dataset and methodology used in the study. Section 4 discusses the results of the model evaluations, while Section 5 elaborates on the practical and theoretical implications of the findings. Finally, Section 6 concludes with key insights and recommendations for future research in this rapidly evolving field.

LITERATURE REVIEW

The financial forecasting landscape has witnessed a significant transformation with the advent of machine learning (ML) techniques, particularly in the context of stock market predictions. The BIST100 index, comprising the top 100 companies listed on Borsa Istanbul, is a critical benchmark for Turkish equities and has increasingly attracted the interest of researchers aiming to leverage advanced ML models for financial prediction. A substantial body of research has explored the use of supervised machine learning in financial forecasting, especially for developed markets. Tree-based models such as Random Forests, Gradient Boosting Machines (GBM), and their derivatives have become popular due to their ability to model complex, non-linear relationships and effectively reduce overfitting through ensemble strategies (Hoque and Aljamaan, 2021; Ryll and Seidens, 2019; Chen and Guestrin, 2016). Neural network-based methods – including Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNNs) – have also gained traction for their capability to capture sequential dependencies and hidden patterns in financial time series data (Shen et al., 2021; Bai et al., 2023; Kim and Kim, 2019). Evaluation of these models often involves robust error metrics such as MSE, RMSE, MAE, and MAPE, as well as sensitivity analyses to assess their reliability and robustness under varying market conditions (Kambale et al., 2024; Yaprakdal and Bal, 2022).

Despite these advancements, a critical review of studies published between 2022 and 2024 reveals several important limitations. First, the majority of recent comparative studies remain largely focused on developed financial markets – such as the United States, Western Europe, or East Asia – while emerging markets like Turkey remain underrepresented (Alizadegan et al., 2024; Bhuiyan et al., 2023; Liu et al., 2023; Alshehri, 2023). Second, existing studies often restrict themselves to a small set of algorithms – most frequently Random Forest, XGBoost, or LSTM – without evaluating a broader methodological landscape that includes both instance-based (e.g., KNN, SVM) and hybrid models. Furthermore, many analyses rely on short time frames, limited variable sets, or lack statistical significance testing, reducing the robustness and generalizability of their findings (Margarat et al., 2023; Benchikh et al., 2024). Studies on the Turkish market are particularly scarce and typically focus on a single model or a limited set of predictors, which constrains the understanding of model transferability in volatile, emerging market conditions (Beyaz and Efe, 2019).

This study directly addresses these gaps by providing a comprehensive head-to-head comparison of a diverse set of supervised machine learning models – including ensemble, deep learning, and instance-based approaches – using a large-scale, up-to-date BIST100 dataset spanning a decade. By benchmarking model performance using multiple error metrics and conducting statistical tests, this research not only advances the methodological rigor of comparative ML forecasting but also provides actionable insights for both researchers and practitioners seeking effective and adaptable tools for financial forecasting in emerging markets. Thus, the present study extends the frontier of machine learning research by systematically evaluating which models offer the greatest accuracy and robustness for BIST100 companies, thereby supporting more data-driven decision-making in the Turkish stock market and beyond.

THEORETICAL BACKGROUND

Ensemble models, particularly Random Forest and XGBoost, have demonstrated consistently strong performance in financial forecasting due to their robust theoretical underpinnings. One primary reason for their effectiveness is their ability to reduce model variance and mitigate overfitting – a common issue in volatile and high-dimensional financial datasets (Breiman, 2001; Chen and Guestrin, 2016). Ensemble learning combines the predictions of multiple weak learners (typically decision trees) to form a more stable and accurate model, capturing complex, non-linear relationships within the data that single models may miss (Dietterich, 2000). In financial time series, where market dynamics can shift rapidly and involve intricate interdependencies, ensemble methods provide both flexibility and resilience. Random Forest achieves this by aggregating numerous uncorrelated decision trees, each built on randomly sampled subsets of features, which enhances generalizability and reduces sensitivity to noise (Liaw and Wiener, 2002; Davis and Nielsen, 2020). Similarly, XGBoost employs gradient boosting to sequentially optimize model performance, focusing learning on difficult-to-predict cases and delivering high accuracy even on challenging, noisy datasets (Chen and Guestrin, 2016; Alshehri, 2023).

Another crucial advantage of tree-based ensemble models is their robustness to multicollinearity and missing values, as well as their ability to provide feature importance measures – essential tools for financial analysts working with large, multifaceted datasets (Lundberg et al., 2019; Wang et al., 2022). These properties make ensemble methods especially suitable for financial applications where predictors may be numerous, highly correlated, or partially missing. In contrast, deep learning models such as Long Short-Term Memory networks (LSTM) and Convolutional Neural Networks (CNN) excel in capturing temporal dependencies and extracting meaningful features from sequential data (Hochreiter and Schmidhuber, 1997; Kim and Kim, 2019; Liu et al., 2022). LSTM networks are specifically designed to overcome the limitations of traditional recurrent neural networks by addressing the vanishing gradient problem, allowing them to model long-term dependencies in time series – an essential requirement for financial forecasting where past trends and shocks can impact future prices (Bai et al., 2023; Fischer and Krauss, 2018). This theoretical advantage is supported by recent studies showing LSTM’s superior ability to model non-stationary, sequential financial data, especially in volatile markets (Wang, 2024; Diqi et al., 2024).

CNNs, although originally developed for image processing, have been successfully adapted for financial time series analysis due to their ability to extract and hierarchically combine local patterns within the data (Cao and Wang, 2019; Qi et al., 2022). When used alone or in hybrid architectures (e.g., CNN-LSTM), these networks can learn both short-term local features and long-term dependencies, providing a comprehensive approach to modeling market behaviors (Widiputra et al., 2021; Kim and Kim, 2019). Moreover, deep learning models are data-driven and require minimal feature engineering, enabling them to automatically uncover complex, non-linear, and hidden relationships in large, high-frequency financial datasets (Jan, 2021; Liu et al., 2022). However, while ensemble and deep learning models offer significant advantages, their success also depends on careful hyperparameter tuning, appropriate data preprocessing, and understanding their limitations regarding interpretability and computational cost (Henrique et al., 2023; Beyaz and Efe, 2019).

PREDICTION MODELS

In this study, Tree-Based Models (Decision Trees, Bagging, Random Forests, AdaBoost, Gradient Boosting Machine (GBM), LightGBM, XGBoost, CatBoost), Neural Network-Based Models (Artificial Neural Networks (ANN), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTM)), and Instance-Based Learning Models (K-Nearest Neighbors (KNN) and Support Vector Machines (SVM)) were used. Fig. 1 shows the models used in this study.

The selection of machine learning models in this study is guided by both empirical evidence from recent literature and practical considerations relevant to financial time series forecasting. Ensemble methods – notably Random Forest, LightGBM, and XGBoost – were included due to their demonstrated ability to handle large, high-dimensional datasets with complex, non-linear relationships, and their robust performance in diverse financial prediction tasks (Henrique et al., 2023; Chen and Guestrin, 2016; Davis and Nielsen, 2020). These models effectively reduce overfitting and variance through aggregation and boosting techniques, making them highly suitable for volatile and multi-factor financial environments such as BIST100.

Neural network-based models, including LSTM, CNN, and ANN, were selected for their theoretical and empirical strengths in modeling temporal dependencies and uncovering hidden structures in sequential data (Hochreiter and Schmidhuber, 1997; Kim and Kim, 2019; Liu et al., 2022). LSTM networks, in particular, excel at learning long-term dependencies, a critical feature for financial time series characterized by delayed effects and memory. CNNs have also shown promise in extracting localized features and patterns, especially when integrated into hybrid architectures (Cao and Wang, 2019; Widiputra et al., 2021). Their flexibility and scalability make them strong candidates for financial forecasting with large and complex datasets.

Instance-based methods such as SVM and KNN are widely used in the literature for classification and regression in financial applications, providing valuable benchmarks and offering interpretability advantages (Cao and Tay, 2001; Islam et al., 2021; Bellotti et al., 2011). SVMs are particularly robust to outliers and noise, and are effective in high-dimensional spaces – a common characteristic of financial data.

Alternative statistical methods such as ARIMA, GARCH, or Facebook Prophet were intentionally not included in this study. While these approaches have been widely used in earlier financial forecasting research, they typically rely on strong linearity or stationarity assumptions and have limited capacity to model the multi-dimensional and non-linear dependencies prevalent in large-scale financial datasets (Gradojevic and Yang, 2006; Jan, 2021). Recent comparative studies also show that advanced machine learning models routinely outperform traditional time-series techniques, especially in contexts with complex interactions and numerous predictive features (Bhuiyan et al., 2023; Alizadegan et al., 2024).

Finally, model selection also considered computational efficiency and interpretability. While deep learning models can require significant computational resources, the models chosen represent a balance between predictive power, scalability, and practical implementation in real-world financial analysis.

Fig. 1:

Data Preprocessing for Machine Learning

Tree-Based Models
Decision Trees

Decision trees are widely used in financial forecasting for their robust decision-making capabilities. In stock price forecasting, they analyze historical market data to predict trends (Li et al., 2023). They are also valuable in risk management, integrated into multi-agent systems for financial distress prediction (Yan et al., 2015). In real option valuation, decision trees model uncertainties and complex payout structures flexibly (Hahn et al., 2007). Beyond finance, they are effective in load forecasting for energy (Faber and Finkenrath, 2021), credit behavior prediction (Anand et al., 2022), traffic flow prediction (Xia and Chen, 2017), financial statement fraud detection (Chen et al., 2014), and asset investment discharge prediction (Jannink and Bos, 2005). Their adaptability and ability to handle non-linear data make them invaluable in forecasting and financial analysis.

Bagging

Bagging, or bootstrap aggregation, is a powerful ensemble technique in financial forecasting, known for addressing parameter sensitivity and model uncertainty by averaging estimates across bootstrap samples to reduce parameter instability (Hao et al., 2023; Jordan et al., 2017). Widely applied in economic and financial forecasting, bagging is effective in reducing forecast mean square error, as seen in inflation forecasting (Lee et al., 2019). It also excels in stock price forecasting, outperforming other methods in financial crisis prediction and stock return forecasting (Tsai et al., 2011). Bagging improves forecast accuracy for crude oil prices (Aydoğmuş et al., 2015) and enhances financial risk prediction by creating diverse training datasets for classifiers (Sun, 2012). Additionally, in stock premium predictability, bagging uses moving block bootstrapping, demonstrating its versatility in financial forecasting (Gupta et al., 2016).

Random Forests

Random Forests play a critical role in enhancing predictive accuracy and addressing complex financial challenges via ensemble learning. Introduced by Breiman, Random Forests combine multiple random decision trees to deliver reliable predictions for classification and regression tasks (Davis and Nielsen, 2020). This approach improves accuracy by leveraging an ensemble of trees and injecting randomness during training (Rawnaq et al., 2024). Random Forests have been widely applied in financial forecasting, including stock price prediction, fraud detection, bankruptcy prediction, and option pricing (Qin et al., 2013; Li et al., 2015; Karminsky and Burekhin, 2019; Chen et al., 2021b). Studies confirm their strength in modeling nonlinear relationships, outperforming traditional linear models (Qin et al., 2013). They have also effectively predicted abnormal trading behavior linked to internet rumor spread, highlighting their adaptability to complex market dynamics (Cheng et al., 2023).
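To make the mechanics concrete, the following sketch fits a scikit-learn Random Forest regressor on lagged closing prices, echoing the lag-feature setup described later in this paper. The file name, lag choices, and hyperparameters are illustrative assumptions, not the study’s exact configuration.

```python
# Illustrative sketch only: a Random Forest on lagged closing prices.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("bist100.csv", parse_dates=["Date"])  # hypothetical file
for lag in (1, 2, 3, 5, 10):                           # lagged closes as inputs
    df[f"Close_lag{lag}"] = df["Close"].shift(lag)
df = df.dropna()

X = df[[c for c in df.columns if c.startswith("Close_lag")]]
y = df["Close"]
split = int(len(df) * 0.8)                             # chronological split

model = RandomForestRegressor(n_estimators=500,
                              max_features="sqrt",     # random feature subsets per split
                              random_state=42)
model.fit(X.iloc[:split], y.iloc[:split])
preds = model.predict(X.iloc[split:])
```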

AdaBoost

AdaBoost, or Adaptive Boosting, combines multiple weak learners to form a strong learner, effectively capturing patterns in financial data (Li, 2024a). It has been applied successfully in credit assessment, bankruptcy prediction, stock index forecasting, and economic time series prediction, often outperforming traditional models like logistic regression in complex financial scenarios (Bluwstein et al., 2023; Zhou and Lai, 2016; Park et al., 2019; Heo and Yang, 2014). AdaBoost’s accuracy improves further when integrated with models such as LSTM and SVM (Nabi et al., 2020; Wu and Gao, 2018). Its ability to handle nonstationary financial data highlights its value in trading and decision-making contexts (Chang et al., 2019). By focusing on misclassified instances, AdaBoost effectively reduces bias and strengthens predictive accuracy (Busari et al., 2021).

Gradient Boosting Machine (GBM)

Gradient Boosting Machine (GBM), pioneered by Friedman, has become essential for regression tasks across financial applications (Pedchenko et al., 2018). GBM is widely used for forecasting in domains such as natural gas spot prices (Li et al., 2023), house prices (Yan et al., 2015), startup performance (Hahn et al., 2007), agricultural prices (Faber and Finkenrath, 2021), rent prices (Anand et al., 2022), electricity prices, and GDP growth (Xia and Chen, 2017). It often outperforms traditional models in predicting financial crises, bankruptcy, stock indices, and economic time series, showcasing its strength in complex scenarios (Pedchenko et al., 2018; Hahn et al., 2007; Faber and Finkenrath, 2021; Anand et al., 2022; Xia and Chen, 2017). GBM is also effectively combined with other algorithms like XGBoost, LightGBM, and Random Forest to improve forecasting accuracy (Yan et al., 2015; Chen et al., 2014). These hybrid models have shown promise in capturing nuanced patterns in financial data (Yan et al., 2015; Chen et al., 2014). Additionally, GBM is well-suited for nonstationary financial data analysis, supporting robust trading and decision-making processes (Jannink and Bos, 2005). By iteratively refining challenging cases, GBM ensures high accuracy, making it a preferred choice in financial forecasting (Hao et al., 2023).

LightGBM

LightGBM has gained prominence in financial forecasting due to its efficient data handling, low memory usage, and fast processing, especially in large-scale applications (Liu et al., 2023). It has been widely applied to forecast stock returns and stock prices, as well as non-financial series such as water quality (dissolved oxygen levels) (Zhou et al., 2023; Cao and Sun, 2024; Liu et al., 2023). Its capability to handle high-dimensional datasets effectively makes it ideal for financial forecasting tasks (Chen et al., 2021a). LightGBM is also combined with algorithms like LSTM, XGBoost, and Random Forest to enhance forecasting accuracy in diverse applications, yielding strong predictive results (Cao and Sun, 2024; Alizadegan et al., 2024). It has further been used in predicting stock volatility, global stock prices during market stress, and Bitcoin prices, underscoring its adaptability in financial forecasting (Gong et al., 2022; Bhuiyan et al., 2023; Liu et al., 2022). LightGBM’s efficiency with large datasets and rapid processing capabilities make it a valuable tool for financial measurement and market analysis (Zhang, 2022).

XGBoost

XGBoost has become popular across fields like finance, energy, traffic, and medicine due to its optimization strengths (Zhu et al., 2023). In finance, it is frequently used for predicting price movements (Alshehri, 2023; Choi et al., 2022) and assessing credit debt default risks through variable importance analysis (Wang et al., 2022). Recognized for its powerful optimization, XGBoost is widely adopted for complex research challenges (Bačanin et al., 2022) and has been applied in trading to predict apple futures prices (Deng et al., 2021). Outside of finance, XGBoost also sees applications in healthcare and natural language processing (Du et al., 2024).
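A minimal sketch of the corresponding boosting setup is shown below; it assumes X_train/X_test and y_train/y_test come from the chronological split described in the experimental setting, and the hyperparameters are common defaults rather than the tuned values from this study.

```python
# Sketch of gradient-boosted trees with XGBoost (hyperparameters illustrative).
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,    # shrinkage: each new tree corrects residual errors
    max_depth=6,
    subsample=0.8,         # row subsampling adds robustness to noise
    colsample_bytree=0.8,  # column subsampling, as in Random Forests
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
y_pred = model.predict(X_test)
```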

CatBoost

CatBoost is highly regarded for its ability to manage categorical features effectively across various applications, including financial forecasting (Prokhorenkova et al., 2017; Dorogush et al., 2018). It has proven successful in predicting credit debt defaults, corporate failure, hotel cancellations, and liquidated damages in construction projects, showcasing strong predictive performance in complex scenarios (Yao and Yang, 2024; Wang et al., 2022; Alshboul et al., 2022). CatBoost also excels in fault diagnosis, anomaly detection, quality prediction, and stock forecasting, illustrating its versatility in financial and industrial domains (Fadzil et al., 2024; Alfarhood et al., 2024). Known for efficiently handling categorical data, it is a preferred tool for precise decision-making. Additionally, CatBoost has been combined with models like XGBoost, LightGBM, and neural networks to boost accuracy and address unique challenges in financial forecasting, with promising improvements in capturing complex patterns (Alshboul et al., 2022; Sun and Tian, 2023).

Neural Network-Based Models
Artificial Neural Networks (ANN)

ANNs are extensively applied in financial forecasting due to their capacity to model complex financial data relationships. Researchers have used ANN for tasks such as predicting stock movements, credit risk, bankruptcy risk, and stock index forecasting. Casas (2012) emphasized ANN’s importance in series prediction, particularly for financial instruments, by enabling parallelized training. Alamsyah et al. (2021) developed a financial distress early warning model using ANN with backpropagation on financial indicators. Johari et al. (2018) demonstrated ANN’s efficacy in financial time series prediction, while Verma et al. (2017) compared ANN algorithms for time series forecasting, underscoring their popularity in finance. Selmi et al. (2015) applied ANN for stock return forecasting, and Cao and He (2019) used it to predict financial crises by linking commodity and stock markets, showcasing ANN’s adaptability.

Convolutional Neural Networks (CNNs)

CNNs have gained traction in financial forecasting due to their ability to identify complex patterns in financial time series. They have been applied to various tasks, including stock price forecasting, credit risk assessment, and market trend analysis. For instance, Cao and Wang (2019) developed a CNN-based stock price model that effectively predicted stock indices, while Lu et al. (2023) further validated CNNs’ capabilities in stock index forecasting. Mode and Hoque (2020) examined CNN-based model robustness by generating adversarial examples for deep learning regression. Widiputra et al. (2021) introduced a Multivariate CNN-LSTM model for predicting parallel financial time series, highlighting CNN-LSTM’s versatility across indices. Qi et al. (2022) proposed a CNN-GRU-attention model for stock prediction in Chinese markets, emphasizing CNNs’ feature extraction efficacy and forecasting accuracy in financial data.
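To show how such local pattern extraction translates into an architecture, here is a small Keras sketch of a 1-D CNN over fixed-length price windows; the window length and filter counts are illustrative assumptions.

```python
# Hedged sketch: a 1-D CNN over 30-day price windows (sizes are assumptions).
from tensorflow import keras
from tensorflow.keras import layers

window, n_features = 30, 1
model = keras.Sequential([
    keras.Input(shape=(window, n_features)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),  # local pattern filters
    layers.MaxPooling1D(2),
    layers.Conv1D(64, kernel_size=3, activation="relu"),  # hierarchical features
    layers.GlobalAveragePooling1D(),
    layers.Dense(1),                                      # next-day closing price
])
model.compile(optimizer="adam", loss="mse")
```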

Recurrent Neural Networks (RNNs)

RNNs have become central to financial forecasting, valued for their ability to capture temporal dynamics and model nonlinear relationships in stock market data. Studies demonstrate RNNs’ use in predicting stock prices, volatility, trading patterns, and portfolio strategies (An, 2023; Dey et al., 2021; Alotaibi et al., 2018; Singh et al., 2019; Yadav and Vasuja, 2019). Integrating RNNs with advanced techniques like LSTM further improves forecast accuracy by capturing market complexity (Raza et al., 2023; Upadhyay et al., 2023). RNNs have also been applied in sentiment analysis to integrate sentimental information with market data, aiding in volatility prediction (Liu et al., 2017). Techniques such as Denoising Autoencoders and Transformers enhance RNN-based stock trend forecasting (Chen, 2024). Combining RNNs with time series algorithms has shown reliable results in historical data forecasting (Yadav and Vasuja, 2019). Recent findings highlight LSTM’s superiority over traditional RNNs in precision for stock price prediction (Upadhyay et al., 2023), with effectiveness demonstrated across various companies, showcasing RNN architectures’ adaptability to different market scenarios (Gohil and Shah, 2022; Firouzjaee and Khaliliyan, 2024).

Long Short-Term Memory networks (LSTM)

LSTM networks are pivotal in financial forecasting, especially for stock prices and market trends, due to their ability to capture long-term dependencies in financial time series data (Zahrah et al., 2021; Li, 2024b; Zhang, 2022; Reddy et al., 2020; Gajamannage and Park, 2022; Deshpande, 2023; Bhalke et al., 2022; Raut and Shrivas, 2024). Studies consistently demonstrate LSTM’s precision in managing both short and long-distance sequences in stock price prediction (Bai et al., 2023; Qiu et al., 2020; Firouzjaee and Khalilian, 2024; Diqi et al., 2023; Luo, 2018). Techniques like wavelet denoising, whale optimization, and stacked architectures enhance LSTM’s predictive accuracy (Wang and Wu, 2023; Kim and Kim, 2019; Fischer and Krauss, 2018). Combining LSTM with CNNs further improves forecasting by leveraging varied data representations (Kim and Kim, 2019; Li et al., 2020). LSTM surpasses models like ARIMA and GRU in stock market prediction (Das, 2024; Fatima and Rahimi, 2024) and is also effective in volatility forecasting, bankruptcy prediction, and financial time series analysis (Ho et al., 2021; Vochozka et al., 2020).
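The sketch below illustrates the standard windowing-plus-LSTM recipe implied above; it assumes close_scaled is a Min-Max scaled closing-price array, and the window length and layer sizes are illustrative.

```python
# Sketch: supervised windows from a scaled price series, fed to a small LSTM.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def make_windows(series, window=30):
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., np.newaxis], y        # shape: (samples, timesteps, 1)

X, y = make_windows(close_scaled)       # close_scaled: assumed Min-Max scaled array
model = keras.Sequential([
    keras.Input(shape=(X.shape[1], 1)),
    layers.LSTM(64),                    # gated memory keeps long-range dependencies
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=50, batch_size=32, verbose=0)
```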

Instance-based Learning Models
K-Nearest Neighbors (KNN)

The K-Nearest Neighbors (KNN) algorithm has been applied in financial forecasting, particularly in predicting stock prices. Tanuwijaya and Hansun (2019) demonstrated KNN’s potential in predicting the LQ45 stock index. Chen (2024) also employed KNN alongside other machine learning methods for stock price forecasting, highlighting its relevance in financial analysis. Additionally, Islam et al. (2021) emphasized the popularity of KNN and support vector regression in predicting stock prices on the Dhaka Stock Exchange. These studies collectively indicate the practical application and effectiveness of KNN in financial prediction tasks.

Support Vector Machines (SVM)

Support Vector Machines (SVM) have been widely utilized in financial forecasting due to their effectiveness in handling classification and regression tasks. Cao and Tay (2001) introduced the application of SVMs in financial forecasting, highlighting their significance in computational science and engineering. Hossain et al. (2020) successfully used SVMs to predict stock prices, specifically forecasting daily closing prices on the Dhaka Stock Exchange. Bellotti et al. (2011) compared SVMs with ordered choice models in predicting international bank ratings, demonstrating SVMs’ capability in time series forecasting. Zhang et al. (2016) enhanced financial time series forecasting accuracy by integrating SVMs with autoregressive integrated moving average (ARIMA). The robustness of SVMs in handling noisy data and generalization challenges has been emphasized by Lin (2012). Overall, SVMs have proven to be a valuable tool in financial forecasting, offering accurate predictions and efficient handling of complex financial data.
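Because both instance-based models are sensitive to feature scale, a reasonable baseline wraps each in a scaling pipeline, as in this sketch (X_train, y_train, X_test, and y_test are assumed from a chronological split; hyperparameters are illustrative).

```python
# Sketch: scaled KNN and SVR baselines (hyperparameters are assumptions).
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

models = {
    "KNN": make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=5)),
    "SVM": make_pipeline(MinMaxScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01)),
}
for name, pipe in models.items():
    pipe.fit(X_train, y_train)                 # chronological training split
    print(name, pipe.score(X_test, y_test))    # R^2 on the held-out tail
```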

EXPERIMENTAL SETTING
Data Description

The dataset utilized in this study encompasses financial data from companies listed on the BIST100 index, which includes the 100 largest companies traded on Borsa Istanbul. The data spans from 2014 to 2024 and includes various financial indicators and technical metrics. Key variables in the dataset include the date of observation, opening price of the stock on the given day, highest price of the stock on the given day, lowest price of the stock on the given day, and closing price of the stock on the given day. Additional variables include the Simple Moving Average (SMA), which is the average of closing prices over a specified period, and the Weighted Moving Average (WMA), which is the average of closing prices over a specified period, weighted by recent prices. Momentum (MOM) is the rate of change of the closing price over a specified period. Stochastic %K (STCK) represents the current closing price relative to the range over a recent period, while Stochastic %D (STCD) is the moving average of the Stochastic %K. The Relative Strength Index (RSI) measures the magnitude of recent price changes to evaluate overbought or oversold conditions. Moving Average Convergence Divergence (MACD) is the difference between the 12-day and 26-day exponential moving averages (EMAs). Larry William’s %R (LWR) is a momentum indicator measuring overbought and oversold levels. The Accumulation/Distribution Oscillator (ADO) combines price and volume to show how much of a stock is being accumulated or distributed. Finally, the Commodity Channel Index (CCI) measures the deviation of the price from its average over a specified period, indicating overbought or oversold conditions.

A comprehensive set of financial indicators has been used to estimate the performance of companies listed in the BIST100 index. These indicators have been selected after an in-depth review of the existing literature. The financial indicators have been selected based on their relevance and effectiveness in financial analysis and stock market forecasting and form the basis of the machine learning models. Tab. 1 includes detailed explanations of the variables.

Tab. 1: Characteristics of the dataset

Date: Date of observation
Open: Opening price on the given day
High: Highest price on the given day
Low: Lowest price on the given day
Close: Closing price on the given day
Simple Moving Average (SMA): Average of closing prices over a specified period
Weighted Moving Average (WMA): Average of closing prices over a specified period, weighted by recent prices
Momentum (MOM): Rate of change of the closing price over a specified period
Stochastic %K (STCK): Current closing price relative to the range over a recent period
Stochastic %D (STCD): Moving average of the Stochastic %K
Relative Strength Index (RSI): Measures the magnitude of recent price changes to evaluate overbought or oversold conditions
Moving Average Convergence Divergence (MACD): Difference between the 12-day and 26-day exponential moving averages (EMAs)
Larry William’s %R (LWR): Momentum indicator measuring overbought and oversold levels
Accumulation/Distribution Oscillator (ADO): Indicator combining price and volume to show how much of a stock is being accumulated or distributed
Commodity Channel Index (CCI): Measures the deviation of the price from its average over a specified period, indicating overbought or oversold conditions
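For reference, most of the indicators in Tab. 1 can be derived from the raw OHLC columns along the following lines. This is a sketch using common default periods (e.g., 14 days), which are not necessarily the exact settings used in the study; the ADO is omitted because its formula varies across sources.

```python
# Sketch: deriving the Tab. 1 indicators with pandas (periods are assumptions).
import numpy as np
import pandas as pd

def add_indicators(df, n=14):
    c = df["Close"]
    df["SMA"] = c.rolling(n).mean()
    w = np.arange(1, n + 1)                                  # linear recency weights
    df["WMA"] = c.rolling(n).apply(lambda x: np.dot(x, w) / w.sum(), raw=True)
    df["MOM"] = c.diff(n)                                    # n-day price change
    lo, hi = df["Low"].rolling(n).min(), df["High"].rolling(n).max()
    df["STCK"] = 100 * (c - lo) / (hi - lo)                  # close within recent range
    df["STCD"] = df["STCK"].rolling(3).mean()
    delta = c.diff()
    gain = delta.clip(lower=0).rolling(n).mean()
    loss = (-delta.clip(upper=0)).rolling(n).mean()
    df["RSI"] = 100 - 100 / (1 + gain / loss)
    df["MACD"] = c.ewm(span=12, adjust=False).mean() - c.ewm(span=26, adjust=False).mean()
    df["LWR"] = -100 * (hi - c) / (hi - lo)                  # Williams %R
    tp = (df["High"] + df["Low"] + c) / 3                    # typical price for CCI
    mad = tp.rolling(n).apply(lambda x: np.abs(x - x.mean()).mean(), raw=True)
    df["CCI"] = (tp - tp.rolling(n).mean()) / (0.015 * mad)
    return df
```
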
Data Preprocessing

Data preprocessing is a crucial step in preparing the dataset for machine learning models. The following steps outline the preprocessing techniques applied to the BIST100 dataset. First, the dataset was thoroughly cleaned to address missing values, outliers, and inconsistencies. Missing values were handled using interpolation methods, while outliers were identified and treated using statistical techniques such as z-score and IQR methods. Normalization was then applied to ensure that all features contribute equally to the analysis. Numerical variables, including stock prices and technical indicators, were normalized using Min-Max scaling, rescaling the values to a range between 0 and 1.

Feature engineering was conducted to create additional features that capture more information from the raw data. New indicators such as moving averages (MA 50, MA 200), Bollinger Bands (BB Mid, BB Upper, BB Lower), and other technical indicators were calculated. Since the dataset consists of time series data, it was necessary to structure the data into a format suitable for time series analysis. Lag features were created to include past values of stock prices and technical indicators as inputs for the models. The dataset was divided into training and testing sets using a time-based split to ensure that the training data precedes the testing data, maintaining the chronological order of financial data.

Dimensionality reduction techniques, such as Principal Component Analysis (PCA), were applied where necessary to reduce computational complexity and enhance model performance. In cases where the target variable exhibited class imbalance, techniques such as oversampling, undersampling, or synthetic data generation (e.g., SMOTE) were employed to balance the dataset. By implementing these preprocessing steps, the dataset was prepared to maximize the performance and accuracy of the machine learning models in predicting the financial performance of companies in the BIST100 index. The data preprocessing workflow – comprising data cleaning, normalization, feature engineering, time series structuring, dimensionality reduction, and data splitting – is summarized in Fig. 1. These steps are critical to ensuring that the raw BIST100 dataset is transformed into an optimized format suitable for machine learning modeling.
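A condensed sketch of the scaling and splitting logic described above follows; feature_cols is an assumed list of predictor columns. Fitting the scaler on the training window only avoids look-ahead leakage into the test period.

```python
# Sketch: Min-Max scaling with a chronological train/test split.
from sklearn.preprocessing import MinMaxScaler

split = int(len(df) * 0.8)                   # time-based split keeps order intact
train, test = df.iloc[:split], df.iloc[split:]

scaler = MinMaxScaler()                      # rescales features to [0, 1]
X_train = scaler.fit_transform(train[feature_cols])   # fit on training data only
X_test = scaler.transform(test[feature_cols])         # reuse training statistics
y_train, y_test = train["Close"].values, test["Close"].values
```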

Experimental Design

The experimental design of this study aims to evaluate the effectiveness of various supervised machine learning models in predicting the financial performance of companies listed on the BIST100 index. The process involves several stages, including model selection, training, validation, and evaluation. For model selection, the study includes Tree-Based Models such as Decision Trees, Bagging, Random Forests, AdaBoost, Gradient Boosting Machine (GBM), LightGBM, XGBoost, and CatBoost. Neural Network-Based Models, including Artificial Neural Networks (ANN), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory networks (LSTM), are also considered. Additionally, Instance-Based Learning Models such as K-Nearest Neighbors (KNN) and Support Vector Machines (SVM) are included.

Data preparation involves preprocessing the dataset, which comprises financial indicators and technical metrics. This includes data cleaning, normalization, feature engineering, categorical encoding, and creating lag features for time series analysis. The dataset is then split into training and testing sets using a time-based split to maintain the chronological order of the financial data.

During model training, each selected model is trained on the training set. Hyperparameter tuning is performed using grid search or random search techniques to optimize the performance of each model. Cross-validation is employed to ensure the robustness of the models, with the training set further divided into k folds (e.g., 5-fold or 10-fold cross-validation) to validate the models during the training phase.

Model evaluation involves assessing the trained models on the testing set using various performance metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and relative Root Mean Squared Error (rRMSE). Comparative analysis is conducted to assess the performance of different models, and statistical significance tests (e.g., paired t-tests) are used to determine whether differences in performance are significant. The overall experimental workflow, from model selection and data preparation to model evaluation and comparative analysis, is illustrated in Fig. 2.
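As an illustration of the tuning procedure, the sketch below combines a grid search with order-preserving cross-validation; the GBM estimator and parameter grid are placeholders, not the study’s full search space.

```python
# Sketch: grid search with time-ordered folds (grid values are placeholders).
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {"n_estimators": [200, 500],
              "learning_rate": [0.05, 0.1],
              "max_depth": [3, 5]}
cv = TimeSeriesSplit(n_splits=5)             # folds never train on future data
search = GridSearchCV(GradientBoostingRegressor(), param_grid,
                      cv=cv, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)   # best grid point and its MSE
```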

Fig. 2:

Experimental design of study

Evaluation Metrics

The evaluation of the machine learning models in this study is based on several key performance metrics that quantify the accuracy and robustness of the predictions. The metrics used are as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|$$

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right|$$

$$\mathrm{rRMSE} = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}}{\bar{Y}}$$

where $Y_i$ denotes the actual value, $\hat{Y}_i$ the predicted value, $\bar{Y}$ the mean of the actual values, and $n$ the number of observations.
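These formulas translate directly into code; the following NumPy sketch computes all five metrics for a pair of actual and predicted arrays (it assumes no zero-valued targets, since MAPE divides by the actuals).

```python
# Direct NumPy translation of the five error metrics defined above.
import numpy as np

def error_metrics(y_true, y_pred):
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    mape = 100 * np.mean(np.abs(err / y_true))  # assumes no zero targets
    rrmse = rmse / np.mean(y_true)              # RMSE relative to the mean
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape, "rRMSE": rrmse}
```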

RESULTS

The results of this study highlight the effectiveness of various supervised machine learning models in predicting the financial performance of companies listed on the BIST100 index. The evaluation of these models was conducted using several key performance metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and relative Root Mean Squared Error (rRMSE). Tab. 2 presents the comparative error metrics – MSE, RMSE, MAE, MAPE, and rRMSE – for each evaluated model, with the best-performing values highlighted in bold. Figs. 3 to 5 illustrate the model fitting results for the Tree-Based Models, Neural Network-Based Models, and Instance-Based Learning Models, respectively, providing a visual comparison of their predictive accuracy on the BIST100 dataset.

Fig. 3:

Tree-based models fitting results. Each subplot shows the predicted vs. actual closing prices for a specific Tree-Based Model (Decision Trees, Bagging, Random Forests, AdaBoost, GBM, LightGBM, XGBoost, CatBoost) applied to the BIST100 dataset. The blue line represents the actual closing prices, while the orange line represents the model’s predicted values. Closer alignment between the lines indicates higher predictive accuracy.

Tab. 2: MSE, RMSE, MAE, MAPE and rRMSE values

Model            MSE        RMSE    MAE     MAPE     rRMSE
Decision Trees   3.258e-05  0.0057  0.0032  2.4591   0.0283
Bagging          1.671e-05  0.0041  0.0026  2.1876   0.0202
Random Forests   1.635e-05  0.0041  0.0024  1.9541   0.0201
AdaBoost         2.272e-05  0.0047  0.0025  2.0166   0.0236
GBM              2.035e-05  0.0045  0.0023  1.6585   0.0223
LightGBM         2.739e-05  0.0052  0.0026  1.7510   0.0259
XGBoost          2.662e-05  0.0051  0.0025  1.7688   0.0256
CatBoost         3.129e-05  0.0055  0.0037  3.0035   0.0277
ANN              0.00042    0.0205  0.0151  13.6640  0.1019
CNNs             2.886e-05  0.0053  0.0039  4.1893   0.0266
RNNs             6.565e-05  0.0081  0.0060  5.6011   0.0402
LSTMs            0.0001     0.0106  0.0078  7.3147   0.0528
KNN              0.0001     0.0101  0.0071  7.0558   0.0501
SVM              1.592e-05  0.0039  0.0031  3.5148   0.0198

Note: MSE = Mean Squared Error, RMSE = Root Mean Squared Error, MAE = Mean Absolute Error, MAPE = Mean Absolute Percentage Error, rRMSE = relative Root Mean Squared Error.

Performance of Tree-Based Models

Tree-based models, including Decision Trees, Bagging, Random Forests, AdaBoost, Gradient Boosting Machine (GBM), LightGBM, XGBoost, and CatBoost, demonstrated strong performance in predicting financial outcomes. Among these, XGBoost and Random Forests exhibited particularly robust and consistent performance across all evaluation metrics. XGBoost, known for its scalability and flexibility, provided high accuracy and low error rates, making it one of the best performers in this category. Random Forests, with its ensemble approach, effectively handled nonlinear relationships and reduced overfitting, resulting in reliable predictions.

Performance of Neural Network-Based Models

Neural Network-based models, including Artificial Neural Networks (ANN), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory networks (LSTM), also showed significant promise. LSTM networks, designed to handle long-term dependencies in time series data, achieved high accuracy in predicting stock prices and market trends. The integration of LSTM with techniques such as wavelet denoising and optimization algorithms further enhanced its predictive capabilities. CNNs were effective in capturing intricate patterns in financial data, and their performance was further improved when combined with RNNs in hybrid models. ANNs and RNNs also provided competitive results, demonstrating their versatility in financial forecasting tasks.

Fig. 4:

Neural Network-Based Models fitting results. This figure illustrates the predicted vs. actual financial performance for Neural Network-Based Models (ANN, CNN, RNN, LSTM) using the BIST100 dataset. Lower deviation from the diagonal line indicates better predictive accuracy.

Fig. 5:

Instance-based Learning Models fitting results. This figure compares the predicted vs. actual financial performance for Instance-Based Learning Models (KNN, SVM) on the BIST100 dataset. Performance differences are reflected in the proximity of predictions to the actual values.

Performance of Instance-Based Learning Models

Instance-based learning models, specifically K-Nearest Neighbors (KNN) and Support Vector Machines (SVM), were evaluated for their predictive accuracy. SVMs showed strong performance in handling classification and regression tasks, effectively predicting daily closing prices and financial risk assessments. The robustness of SVMs in managing noisy data and complex financial patterns was evident from the results. KNN demonstrated its potential in stock index prediction, although its performance was slightly lower compared to SVMs and other advanced models.

Comparative Analysis

The comparative analysis revealed that no single model consistently outperformed others across all companies in the BIST100 index. However, certain models, such as XGBoost and Random Forests, stood out due to their strong and consistent performance. Deep learning algorithms, including CNNs, RNNs, and LSTMs, also showed significant accuracy in predicting the financial outcomes of specific companies. The integration of various techniques and hybrid models often resulted in improved prediction performance, highlighting the importance of model selection and customization based on the specific financial forecasting task. Paired t-tests were conducted to compare the performance of different models. The results indicated that the differences in performance between the top-performing models were statistically significant, reinforcing the robustness of models like XGBoost, Random Forests, and LSTM networks.
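A sketch of this comparison follows, assuming pred_rf and pred_xgb are two models’ predictions on the same test observations; scipy’s ttest_rel implements the paired t-test on the per-observation squared errors.

```python
# Sketch: paired t-test on two models' per-observation squared errors.
from scipy.stats import ttest_rel

err_rf = (y_test - pred_rf) ** 2       # assumed predictions from two models
err_xgb = (y_test - pred_xgb) ** 2
stat, p = ttest_rel(err_rf, err_xgb)
print(f"t = {stat:.3f}, p = {p:.4f}")  # small p: performance gap unlikely by chance
```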

DISCUSSION

This study offers valuable insights into the predictive capabilities of supervised machine learning models applied to the financial forecasting of companies listed on the BIST100 index. Our findings reveal significant differences in model performance, with ensemble methods such as Random Forest and XGBoost emerging as the most effective for this context. These models consistently produced lower error metrics across MSE, RMSE, MAE, and MAPE, highlighting their robustness and adaptability to the complexities inherent in financial data.

The superior performance of Random Forest and XGBoost can be attributed to their ensemble learning structures, which mitigate overfitting and better capture non-linear relationships within financial datasets. These characteristics make them highly suitable for environments with high variability, such as the Turkish stock market, where economic and market dynamics are influenced by a range of macroeconomic and company-specific factors. This aligns with Viswanathan and Stephen’s (2021) findings, where ensemble approaches, including cross-validation within Random Forest, significantly enhanced the accuracy of stock market index predictions, confirming the effectiveness of ensemble methods in improving error metrics. Similarly, Yang (2023) observed that Random Forest produced high accuracy and precision in predicting stock price directional movements, further supporting the robustness and adaptability of ensemble models in complex financial environments. In line with our results, Shende et al. (2022) highlighted the practical utility of ensemble methods for handling non-linear financial dynamics, demonstrating the suitability of Random Forest and similar models for financial forecasting tasks based on financial ratios.

Additionally, models like LightGBM and Gradient Boosting Machine (GBM) offered a balanced trade-off between accuracy and computational efficiency, positioning them as practical alternatives for similar financial applications where efficiency is prioritized alongside accuracy. Henrique et al. (2023) provide a balanced view on the limitations of these models, noting that while ensemble models like Random Forest are highly effective for general financial forecasting, they may face challenges in certain short-term prediction tasks. This finding aligns with our observations and reinforces the importance of context-specific adjustments when applying ensemble methods in financial forecasting. Together, these references and our findings underscore the adaptability and accuracy of ensemble models like Random Forest and XGBoost, particularly for long-term forecasting in emerging markets. This study thus provides a solid foundation for leveraging ensemble models in financial markets and supports further research into optimizing these methods for even more nuanced applications.

In contrast, neural network models, including LSTMs and CNNs, while effective in identifying complex temporal patterns in financial data, demonstrated higher error rates in our tests. This suggests that while deep learning techniques hold promise for sequential and time-series analysis, they may require additional tuning and integration with feature engineering techniques to enhance their performance in financial forecasting for the BIST100 index. Beyaz and Efe (2019) similarly observed that neural network ensembles could optimize feature selection and improve forecast accuracy, although single networks displayed greater sensitivity to parameter tuning. This finding reinforces our results, suggesting that an ensemble approach or hybrid models might further optimize neural networks’ forecasting accuracy. Furthermore, Cao and Wang (2019) highlighted the utility of CNNs in identifying temporal dependencies, yet found that SVMs demonstrated resilience in managing diverse data patterns, underscoring the balance between deep learning and traditional models. The strengths of neural networks in time-series analysis are further supported by Liu et al. (2022), who applied a bidirectional LSTM model with an attention mechanism, achieving superior results in capturing intricate temporal dependencies compared to SVM and KNN in stock price prediction.

Interestingly, our evaluation of Support Vector Machines (SVM) showed its robustness in handling complex financial patterns, particularly in classification and regression tasks. This aligns with existing literature, such as that by Patel et al. (2023), which demonstrates SVM’s effectiveness in high-dimensional, noisy financial data environments. However, K-Nearest Neighbors (KNN) exhibited limited performance, likely due to its sensitivity to the local structure of data points, which may not generalize effectively within the BIST100’s diverse economic landscape. Reddy et al. (2024) similarly found that while LSTMs excel in stock market prediction, SVM offers competitive performance in classification tasks involving high-dimensional data, underscoring the versatility of SVM in complex financial contexts. These findings suggest that future studies could explore hybrid models that combine the interpretability and stability of tree-based methods with the temporal insights provided by neural networks, leveraging the strengths of each model type. Such hybrid approaches might bridge the gap between deep learning’s advanced pattern recognition and traditional models’ robustness, offering improved predictive capabilities in financial forecasting.

From a practical standpoint, our findings offer actionable insights for investors, financial analysts, leaders in financial institutions, market authorities, and supervisory professionals operating in emerging market contexts. The consistent superior performance of ensemble models such as XGBoost and Random Forest indicates that practitioners seeking robust, scalable stock prediction solutions should prioritize these algorithms in their forecasting workflows (Henrique et al., 2023; Du et al., 2022). Their adaptability to non-linearities, ability to handle large and complex datasets, and resistance to overfitting make them especially well-suited to volatile markets like Turkey.

For institutional investors and portfolio managers, integrating advanced ensemble models into their decision-making processes can enhance return forecasting, risk assessment, and portfolio optimization. For example, using XGBoost-based predictive signals alongside traditional financial analysis can improve the identification of under- or overvalued assets and help to time market entries more effectively (Alshehri, 2023; Moon and Kim, 2019). Financial analysts can benefit by combining model-driven insights with fundamental or technical analysis. Ensemble models’ feature importance scores can highlight which macroeconomic indicators or company ratios most influence stock prices, enabling more targeted research and investment recommendations (Lundberg et al., 2019; Wang et al., 2022).

Leaders in financial institutions, market authorities, and supervisory professionals are encouraged to leverage the predictive capabilities of these models for developing AI-driven, real-time decision-support tools in market supervision, systemic risk assessment, and compliance monitoring. As machine learning models can rapidly adapt to shifting market patterns, their integration into industry technology solutions may help in identifying early warning signals of instability or market manipulation, thereby supporting financial system resilience (Du et al., 2022; Henrique et al., 2023).

CONCLUSION

This study provides a comprehensive assessment of supervised machine learning models for forecasting the financial performance of companies listed on Turkey’s BIST100 index. By conducting a rigorous comparative analysis of tree-based models (Random Forest, XGBoost, LightGBM), neural network models (LSTM, CNN), and instance-based models (SVM, KNN), we have identified ensemble approaches – particularly XGBoost and Random Forest – as consistently delivering superior predictive accuracy and error reduction across multiple performance metrics.

Our results indicate that ensemble models are especially adept at capturing the complex, non-linear dynamics characteristic of emerging financial markets, while also maintaining robustness to noise and high dimensionality. This highlights their practical value for investors, financial analysts, and decision-makers aiming to deploy data-driven strategies in rapidly changing environments. Although neural network models like LSTM and CNN demonstrated strengths in sequential data analysis, their higher error rates point to the need for further refinement – potentially through hybrid architectures or enhanced feature engineering. SVM also exhibited strong predictive capabilities, especially in classification tasks, while KNN’s performance was less favorable, likely due to its sensitivity to the local structure of financial data.

In terms of actionable recommendations, our findings suggest that financial institutions, analysts, and market authorities in emerging economies should prioritize ensemble models such as XGBoost and Random Forest for reliable, interpretable, and scalable financial forecasting. Moreover, the integration of explainable AI tools can further enhance the transparency and acceptance of these models in professional practice.

For future research, we recommend several promising directions:

- Exploring hybrid models that combine the interpretability and robustness of ensemble methods with the advanced pattern recognition capabilities of deep learning.
- Incorporating alternative data sources, such as sentiment analysis from social media, macroeconomic indicators, or ESG metrics, to enrich model inputs and capture a broader spectrum of market signals.
- Applying advanced feature selection and engineering techniques to further improve model accuracy and generalizability.
- Benchmarking model performance across different emerging markets to evaluate the transferability and contextual robustness of leading algorithms.
