Machine learning based churn analysis for sellers on the e-commerce marketplace

The goal of this study is to develop churn models for sellers on an e-commerce marketplace by using machine learning methods. Three approaches are applied to develop these models. The dataset used in this study includes ten features: maturity type, maturity interval, city of the seller, total revenue of the seller, total transactions of the seller, sector type of the seller, business type of the seller, sales channel, installment option, and discount type. Random Forest (RF) and Logistic Regression (LR) are used for churn analysis in all of the approaches. In the first approach, models are developed without applying preprocessing operations to the dataset. In the second and third approaches, undersampling and oversampling methods, respectively, are used to balance the dataset. F-Scores of the churn models are obtained by using stratified cross-validation on the dataset. The F-Scores were 0.76, 0.71, and 0.92 for the three approaches developed with RF, and 0.84, 0.68, and 0.69 for the three approaches developed with LR, respectively.


Introduction
Interest in e-commerce, one of the fastest-growing sectors in the world, is increasing day by day. The main reason users choose e-commerce is that they can shop effortlessly, around the clock, from any location, with a wide range of options, simply by using an internet connection. The term e-commerce marketplace refers to online platforms set up by third parties on the Internet to bring sellers and buyers together.
Sellers can thus open stores and upload their products, while consumers rate and buy the corresponding products.
E-commerce companies constantly try to attract new customers and increase revenue to ensure their continued existence. However, research has shown that acquiring a new customer is more expensive than retaining an existing one. For this reason, companies spend most of their efforts on customer retention. However, retention efforts based on the wrong strategy lead to negative consequences for companies. Assessing the risk of churn is therefore an important task in customer retention.
In the last few years, numerous methods have been used for churn analysis. [1] presented a comprehensive evaluation of the capacity of Recurrent Neural Networks (RNN) for customer churn prediction (CCP) using time-varying behavioral characteristics such as recency, frequency, and monetary value. Hybrid approaches combining DNN outputs with traditional CCP models were also evaluated in this context. The obtained results demonstrated that DNNs, particularly RNNs, were appropriate for CCP using time-varying recency, frequency, and monetary measurements. [2] proposed a Naive Bayes (NB) classifier with a Genetic Algorithm (GA) based feature weighting approach for churn estimation. They also compared the proposed approach to the classifier on public datasets, obtaining mean precisions of 0.97, 0.97, and 0.98 and sensitivity rates of 0.84, 0.94, and 0.97. The proposed method yielded F-Scores of 0.89, 0.96, and 0.97, Matthews Correlation Coefficients of 0.89, 0.96, and 0.97, and accuracies of 0.95, 0.97, and 0.98. [3] presented a hybrid recommendation strategy based on Support Vector Machine (SVM) with targeted retention attempts for e-commerce churn forecasting. When the integrated forecasting model was used, the coverage rate, hit rate, removal rate, precision rate, and other metrics increased significantly. [4] described how an organization could create a customer churn model based on mathematical and statistical approaches using customer acquisition, usage, interaction, and call-center data, with the goal of minimizing customer churn. This research has aided many organizations and data scientists in developing customer churn models. [5] proposed a theoretical foundation for customer churn and customer segmentation, as well as the use of supervised machine learning techniques to predict churn. Customers were segmented using K-Means, K-Nearest Neighbors (KNN), Logistic Regression (LR), Decision Tree (DT), and RF, and SVM was employed to forecast customer churn. As a result, the dataset achieved an accuracy of approximately 97% with the RF model, and the average accuracy of each model performed well after customer segmentation, with LR having the lowest accuracy (87.27%). [6] performed data preprocessing and feature analysis in the first two stages of churn analysis. In the prediction process, the most popular predictive models, namely SVM, DT, Naive Bayes, LR, and RF, as well as boosting and ensemble techniques, were applied. According to the results, the AdaBoost and XGBoost classifiers gave the highest accuracy. In order to classify the data more accurately, [7] suggested a new hybrid method called the Logit Leaf Model (LLM), which consists of segmentation and estimation stages. Different models created by LLM on data segments, instead of on the whole dataset, provided better prediction performance; LLM produced more accurate models than the independent classification techniques LR and DT. To help telecom providers estimate which customers they are most likely to lose, [8] generated a prediction model, adopting the Area Under the Curve measure to evaluate its performance. Additionally, social network analysis elements were incorporated into the prediction model in order to account for customer social networks. The best results were obtained when the eXtreme Gradient Boosting (XGBoost) algorithm was applied. [9] suggested a hypothesis to predict customer churn using deep learning-based neural networks. To explain the interactions, the article considered mathematical models and also suggested modeling methods. For training, Python was used to process user activity datasets. [10] examined six different methods for estimating customer churn in the banking industry; the best results were obtained with Stochastic Boosting. [11] calculated consumer churn using deep learning on a telecom dataset, building a nonlinear classification model with a multilayer neural network. The churn model, customer features, support features, usage features, and contextual features were studied. [12] aimed to provide a framework for churn analysis, using Artificial Neural Network (ANN) and DT as methods. According to the results, five variables, namely number of products, acceptance of returned products, discount, delivery time, and reward, were selected as the best variables. ANN had the highest accuracy with 97.92%, while DT had the lowest.
This paper is organized as follows: Section 2 presents the material and method. Results and discussion are presented in Section 3. Section 4 concludes the paper.
2 Material and method

Random forest
RF is a widely used machine learning algorithm that combines the output of multiple decision trees to produce a single outcome. Its widespread use is motivated by its adaptability and usability, since it can solve both classification and regression problems. The working logic is to select the number of trees to be created, with each created tree predicting the y value for each data point [13]. The hyperparameters used for the RF method are listed in Table 1.
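As a minimal sketch of this working logic, an RF churn classifier can be trained with scikit-learn as follows. The synthetic ten-feature data and the hyperparameter values shown here are illustrative placeholders, not the study's dataset or the values of Table 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ten-feature seller dataset (placeholder data)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators sets the number of trees to create; each tree votes on the
# label of every data point. Values here are illustrative, not Table 1's.
rf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42)
rf.fit(X_train, y_train)
preds = rf.predict(X_test)
```

Each of the 100 trees predicts a label for every test point, and the forest outputs the majority vote.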

Logistic regression
LR is the proper regression strategy to use when the dependent variable is binary. Like all regression analyses, LR is a predictive analysis. Using LR, data can be described and the relationship between a dependent binary variable and one or more independent nominal, ordinal, interval, or ratio-level variables can be explained [14]. The hyperparameters used for the LR method are listed in Table 2. To increase model performance, the best values of the RF and LR hyperparameters were found by grid search.
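The grid search step can be sketched as follows, again on synthetic data and with a hypothetical search space (the actual grids and the chosen values are those of Tables 1 and 2, not the ones below):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the seller dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical hyperparameter grid for LR; each combination is scored
# by cross-validated F-Score and the best one is kept
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)
best_params = search.best_params_
```

The same `GridSearchCV` pattern applies to RF by swapping in the estimator and its grid (e.g. number of trees, maximum depth).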
Class imbalance, also known as the unbalanced data problem, must be considered in data-driven studies. Most classification algorithms assume that the training datasets are balanced. However, this assumed balanced distribution is often not found in real datasets: one class may be represented by very few instances while the other is represented by a large number of instances, which can cause classification problems. For classes with little labeling information, the model is likely to make incorrect predictions because it is not sufficiently trained on them. The first method that can be used when working with an imbalanced dataset is to adjust the class distributions by resampling the data, using undersampling or oversampling.

Undersampling
Undersampling balances the dataset by removing samples that belong to the majority class. It is one of several methods data scientists can employ to extract more accurate information from initially unbalanced datasets. Although it has drawbacks, such as the loss of potentially crucial information, it is nonetheless frequently used in practice [15].
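The idea can be sketched in a few lines of NumPy on a synthetic 9:1 imbalanced dataset (the class sizes below are illustrative, not those of the seller dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic imbalanced dataset: 900 non-churn (0) vs. 100 churn (1) samples
X = rng.normal(size=(1000, 10))
y = np.array([0] * 900 + [1] * 100)

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Randomly drop majority samples until both classes have the same size
maj_down = rng.choice(maj_idx, size=len(min_idx), replace=False)
keep = np.concatenate([maj_down, min_idx])
X_bal, y_bal = X[keep], y[keep]
```

The balanced set keeps all 100 minority samples but discards 800 majority samples, which is exactly the information-loss drawback noted above.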

Oversampling
Oversampling balances the dataset by duplicating samples belonging to the minority class. Although it has drawbacks, such as the risk of overfitting to the duplicated samples, it has become a frequent and crucial technique for data scientists [16].
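The mirror-image operation, again on an illustrative 9:1 synthetic dataset, duplicates minority samples (sampling with replacement) up to the majority count:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic imbalanced dataset: 900 non-churn (0) vs. 100 churn (1) samples
X = rng.normal(size=(1000, 10))
y = np.array([0] * 900 + [1] * 100)

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Duplicate minority samples (with replacement) up to the majority size
min_up = rng.choice(min_idx, size=len(maj_idx), replace=True)
keep = np.concatenate([maj_idx, min_up])
X_bal, y_bal = X[keep], y[keep]
```

No information is discarded, but each minority sample now appears about nine times on average, which is the source of the overfitting risk.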

Results and discussion
Churn models have been developed for sellers on the e-commerce marketplace using machine learning methods. The features in the dataset used in the study are maturity type, maturity range, seller's city, seller's total income, seller's total transactions, seller's sector type, seller's business type, sales channel, installment option, and discount type. Three different approaches were applied to develop the models. RF and LR were used for churn analysis in all of the approaches. In the first approach, models were developed without applying preprocessing operations to the dataset. In the second and third approaches, undersampling and oversampling methods, respectively, were used to balance the dataset. The reason for using three different approaches is to analyze the impact of preprocessing on the performance of the models. Cross-validation is a technique used to predict how a model trained on training data will perform on real data. In this technique, the model is trained on the training data while its performance is evaluated on the remaining data (validation data). By using stratified cross-validation on the dataset, F-Scores of the churn models were obtained. Table 3 shows the F-Scores of the models developed with the three different approaches using RF and LR.
− When all prediction models developed with RF are compared, the highest F-Score was obtained with the third approach, which uses the oversampling technique.
− When the F-Scores obtained with the RF method were examined, the undersampling technique did not make much difference compared to the models created on the unbalanced dataset.
− When all prediction models developed with LR are compared, the highest F-Score was obtained with the first approach, which was created without preprocessing.
− When the results obtained with the LR method were examined, the oversampling and undersampling techniques significantly reduced the F-Scores of the models compared to the model created on the unbalanced dataset.
− When the results of both methods are examined, the models developed using the undersampling technique had the lowest F-Scores.
− The mean F-Score obtained with the RF method (0.80) was 8.11% higher than that obtained with the LR method (0.74).
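The stratified cross-validation procedure used to obtain these F-Scores can be sketched as follows; the synthetic imbalanced data and the fold count are illustrative assumptions, not the study's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the seller dataset (placeholder data)
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.8, 0.2], random_state=1)

# Stratified folds preserve the class ratio in every train/validation split,
# which matters when the churn class is the minority
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(RandomForestClassifier(random_state=1),
                         X, y, scoring="f1", cv=cv)
mean_f1 = scores.mean()
```

Averaging the per-fold F-Scores yields one summary value per model/approach pair, which is how a table like Table 3 is populated.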

Conclusion
In this study, churn models are developed for sellers on the e-commerce marketplace by using the machine learning based RF and LR methods. Three approaches were used to develop the models. In the first approach, models are developed without preprocessing the dataset. In the second and third approaches, undersampling and oversampling methods, respectively, were used to balance the dataset. The F-Score was used to assess how effectively the churn models performed. The results showed that the model developed with RF using the oversampling technique was the most successful. It can be concluded that the developed churn model is effective and can be used in practice.

Table 1
Hyperparameter values of RF.

Table 2
Hyperparameter values of LR.

Table 3
F-Score values obtained with three different approaches using RF and LR.