AHEAD OF PRINT
Journal information
License
Format: Journal
eISSN: 2444-8656
First published: 01 Jan 2016
Publication frequency: 2 times per year
Languages: English
Access type: Open Access

Data mining of Chain convenience stores location

Published: 31 May 2022
Volume & Issue: AHEAD OF PRINT
Page range: -
Received: 18 Jan 2022
Accepted: 27 Mar 2022
Abstract

Based on the knowledge of convenience stores, this paper comprehensively examines the factors that influence their location, and constructs and analyses a location model for them. An initial evaluation system is put forward, consisting of 5 primary indexes and 14 secondary indexes. The primary indexes are population, traffic, competition, self-factors and other factors. After determining three different location evaluation systems, the location model of chain convenience stores is constructed and verified on data from Nanjing. Verification shows that the accuracy of the convenience store location model is 90.3%, indicating that the location model proposed in this paper has high accuracy and practicability.

Keywords

Introduction

A convenience store is a small retail store located near a residential area. It mainly deals in instant goods, takes the emergency and convenience needs of customers as its primary purpose and adopts a self-service shopping format [1]. It originated in the United States as a retail format that differentiated from supermarkets after the latter had developed to a relatively mature stage. A typical convenience store has a small business area, on average only about 100 m2, and operates at the entrances and exits of new urban residential developments and along the main roads of residential areas; its commodities are mainly food, daily necessities, cosmetics, hygiene products and so on.

With the rapid development of China's economy, consumers' purchasing power and shopping demands have also greatly increased [2]. Convenience stores have made great progress in China in meeting consumers' small- and medium-sized shopping needs. In 2020, convenience store sales in China reached 296.1 billion yuan across 193,000 stores; despite the pandemic, the sector still achieved a growth rate of about 6%. In recent years, chain convenience stores have gradually become established in lower-tier cities, where they quickly occupy the market by opening new stores to obtain greater sales. At the start of an enterprise's expansion, location plays a vital role: whether a location is well chosen directly affects the efficiency and strategic layout of the enterprise, as well as its cost control and service level [3]. Retail location is characterised by a one-off, large investment that cannot easily be changed; a number of stores have gone bankrupt quickly because of an unfavourable location, a cost loss that enterprises are keen to avoid. It is therefore of great practical significance to study the location of convenience stores. Based on data mining technology, this paper tries to establish a quantitative, comprehensive evaluation system for location by analysing and summarising the influencing factors, providing a reference for quantitative evaluation and scientific decision-making in the location of chain convenience stores.

Concepts of data mining technology
Concept of data mining

Data mining is a step in Knowledge Discovery in Databases (KDD) [4]. It generally refers to the process of searching for hidden information in a large amount of data through algorithms. Data mining draws on computer science, statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past empirical rules), pattern recognition and many other methods. It is used in a wide range of fields, such as medicine, finance, communication and logistics, where valuable, targeted information can be extracted from massive data [5, 6].

Methods of data mining

There are five main methods of data mining: cluster analysis, classification analysis, sequence analysis, association analysis and outlier analysis. Their specific meanings are as follows.

Cluster analysis [7,8,9]: Clustering gathers similar data into clusters; the data within each cluster are as similar as possible, while the similarity between clusters is extremely low. Through clustering, a large amount of data can be grouped quickly, and data of the relevant categories can be provided according to the needs of different users, making it a highly efficient method of data analysis. The K-means clustering algorithm is one of its main representative methods.
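The K-means idea named above can be sketched in a few lines. The following is a minimal numpy version (Python is used here purely for illustration; the paper itself works in R), with made-up two-dimensional data standing in for candidate-site features:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, until assignments settle."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# two well-separated synthetic groups of "candidate sites"
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

With well-separated groups, the two clusters recovered by the algorithm coincide with the two groups the data were generated from.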

Classification analysis [10,11,12]: Classification is the counterpart of clustering, and its core lies in distinguishing data. It mainly means predicting unknown objects by analysing known data with functions or models, and assigning new data to classes according to the analysis results. Representative methods are the Bayesian network and the decision tree.
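To make "predicting unknown objects by analysing known data" concrete, here is a deliberately simple classifier (a nearest-centroid rule, not one of the representative methods named above) in numpy, on made-up training data; it learns one prototype per class and labels new points by the closest prototype:

```python
import numpy as np

class NearestCentroid:
    """Classify a new point by the class whose training-data mean is closest."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (40, 2))])
y_train = np.array([0] * 40 + [1] * 40)

clf = NearestCentroid().fit(X_train, y_train)
pred = clf.predict(np.array([[0.2, -0.1], [5.8, 6.3]]))
print(pred)  # → [0 1]
```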

Sequence analysis [13,14,15]: Sequence mainly refers to time series, and this analysis is based on the temporal characteristics of data. Effective information hidden in the data can be obtained by analysing the order in which data occur and their temporal characteristics, combined with the development process of things. A data model incorporating the time characteristics is then established to calculate the changing rules and development direction of things. This method of analysis is used for predicting consumers' purchasing tendencies, weather forecasting and so on.

Association analysis [16, 17]: Association mainly refers to analysing the relationships between different items in the same data set, finding the association rules, and analysing the changing rules of things through these rules. This method of analysis is quite common; that is, potential relevance can be found by analysing the association rules among the behaviours of a large number of users, such as the classic association between purchases of diapers and beer.
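The diapers-and-beer example rests on two standard association-rule measures, support and confidence, which can be computed directly. A small pure-Python sketch with invented transactions (Python for illustration only):

```python
# five made-up shopping baskets
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
    {"milk", "beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))       # 3 of 5 baskets → 0.6
print(confidence({"diapers"}, {"beer"}))  # every diapers basket has beer → 1.0
```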

Outlier analysis [18]: When analysing the characteristics of data, some data will always differ markedly from the rest. When it is difficult to find any similarity or correlation between them and the other data, they are called outliers, which are difficult to handle with the methods above. This requires data cleaning in data mining to analyse the isolated points separately, and then find whatever regularity they may contain.

The research in this paper involves algorithms related to data mining, such as the minimum circle covering algorithm, the Kruskal-Wallis rank sum test, the K-means clustering algorithm, the Gaussian mixture clustering algorithm, feature selection algorithms, logistic regression, neural networks and support vector machines.

The Locomotive collector and the R language are used. The Locomotive collector is software for capturing, processing, analysing and mining data; it can flexibly capture scattered data from web pages and accurately mine the required data through a series of analyses and processing steps. The R language is used for the operating environment, statistical analysis and drawing. When using R, software packages provided by other scholars can be used, which are in effect plug-ins for R. Different R packages meet different needs, such as data analysis, data mining and artificial intelligence.

The process of data mining

Data mining is a complicated process which requires a series of calculations on a large amount of data; it is necessary to find the potential, hidden internal relations in the data and extract valuable information. The specific process is as follows:

Selection and preparation of the data set: According to the actual demand, select the initial data set and collect as much relevant data as possible. The more relevant the data, the higher the accuracy, although the amount of calculation also increases [19].

Data preprocessing: The collected data may contain isolated and discrete points, so it is necessary to preprocess the data to make it usable and to improve data quality. Data preprocessing includes data cleaning, transformation, integration and reduction [20].

Data mining: According to the characteristics of the preprocessed data and the actual needs of users, appropriate methods and models for data mining are selected; the intrinsic relationships between data are found and hidden valuable information is extracted.

Result analysis and application: Analyse the obtained results in connection with the actual situation; if the results turn out not to apply to the actual situation, the above steps need to be repeated.

Fig. 1

Process of data mining

Feature selection of location evaluation system
Initial evaluation system of location

There are four principles to be met when determining the model indicators, namely comprehensiveness, measurability, independence and pertinence. This paper puts forward five factors that affect the location of chain convenience stores: population factors, traffic factors, competition factors, self-factors and other factors, which together can comprehensively evaluate the location model of a chain convenience store. After discussing and subdividing each factor, each can be measured through the characteristics of its indexes. Moreover, these five factors are relatively independent, with little correlation between them, while clearly reflecting what affects the location model. Therefore, this paper divides the primary indexes of chain convenience store location into population, traffic, competition, self-factors and others.

For the population factor, what should be considered is whether the coverage of the geographical location and the population scale within this range can attract consumers. The population scale reflects the flow of people: the larger the flow, the more consumers there are, and the higher a shop's turnover may be. In addition, the size of the floating population reflects the market value of a region. Usually, an area with large population movements is also an area where large shopping centres are located.

Traffic factors reflect how conveniently a region connects with other regions. Depending on their needs, consumers are willing to spend different time costs within a certain range. In the department store industry, most consumers are reluctant to travel to places with inconvenient transportation. Therefore, imperfect traffic facilities and similar factors inhibit users' consumption behaviour to a certain extent, which reduces profit.

Competition factors are reflected in the number of competing shops in the location, meaning shops that provide similar services or sell similar goods. However, such competition between shops does not always have negative effects. A commonly noticed phenomenon is that where a McDonald's exists in a certain area, there is a high probability that a KFC exists as well. This kind of clustering does not mean that the stores merely split a fixed pool of resources in vicious competition. The reason is that these two restaurants are mainly concentrated in areas with a large, fast flow of people; such a layout can solve the problem of under-utilised customer flow and meet the different needs of different consumers to the greatest extent. Although seemingly contradictory, the relationship is actually one of mutual influence, mutual transformation and mutual promotion.

Self-factors are reflected in the nature and strength of the enterprise, and they have a direct impact on store location. When the external influencing factors of a site are identical, it is the enterprise's own factors that play the decisive role. At the same location, a company with strong strength and an excellent reputation may be profitable, while a company with weak strength and an ordinary reputation may run at a deficit.

Other factors reflected in the market economy include the cultural atmosphere and the public security environment. The atmosphere of the business environment is related to the number of large shopping centres and supermarkets in the region, while the types of shopping centres determine the types of consumers. The business environment is also reflected in the vitality and influence of the business circle; the operating cost of shops in areas with a better business environment is also higher. Since shop rents in remote areas are relatively low, rent also affects the overall operating cost of shops.

The primary indicators are subdivided into several secondary indicators, and the initial evaluation index system is summarised as shown in Table 1. The initial evaluation system consists of 5 primary indicators and 14 secondary indicators. Population factors are subdivided into population flow and per capita convenience store consumption level. Traffic factors are subdivided into the number of parking lots, subway stations and bus stations. Competition factors are subdivided into saturation of the convenience store market, proportion of convenience stores in Category 1, proportion of convenience stores in Category 2 and the proportion of convenience stores in Category 3 (the competitiveness of convenience stores in Category 1 is very small, in Category 2 it is strong and in Category 3 it is the strongest). Self-factors are subdivided into commodity categories, environment and services. Other factors are subdivided into store rental and the number of large shopping centres.

Table 1. Initial evaluation system

First-level indicator    Secondary indicators

Population factors       Population flow
                         Convenience store consumption level per capita
Traffic factors          Number of parking lots
                         Number of subway stations
                         Number of bus stops
Competition factors      Convenience store market saturation
                         Proportion of Category 1 convenience stores
                         Proportion of Category 2 convenience stores
                         Proportion of Category 3 convenience stores
Self-factors             Commodity categories
                         Environment
                         Service
Other factors            Store rental price
                         Number of large shopping centres
Feature selection based on caret

First, this paper selects features from the initial evaluation system using the caret package of the R language [21]. The caret package performs feature selection with a recursive feature elimination algorithm, a method of backward selection. The algorithm starts from the model containing all features, and each predictor is ranked by its importance. Let S be a sequence of candidate subset sizes for the predictors (S1 > S2 > …); in each round of feature selection, the Si top-ranked predictors are retained, the model is refitted and its performance is re-evaluated [22]. To determine the best value of Si, the final model is fitted with the top-ranked Si predictors.

Since feature selection is part of the model building process, resampling methods (such as cross-validation and bootstrapping) should account for the variation caused by feature selection when estimating performance. In order to obtain performance estimates under different feature subsets, a more complete feature selection algorithm wraps an outer layer of resampling around the selection [23]. The algorithm is based on backward selection: first it computes the performance of the model containing all features, together with the importance of each feature; then it eliminates the features with low importance, keeps the most important subset, recomputes the model's performance, and repeats these steps until the best subset size is found. The core idea is to predict with a random forest, choosing the features that make the average cross-validated prediction accuracy as high as possible. A disadvantage of this algorithm is the possibility of over-fitting, which can be mitigated by repeated sample splitting. This paper therefore uses the rfe() function in the caret package to select features.
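The backward-selection loop described above can be sketched compactly. The following is a deliberately simplified numpy illustration (Python for illustration; the paper uses caret's rfe() in R, which scores subsets with a resampled random forest rather than the plain least-squares R² used here). The data are synthetic, with only features 0 and 3 actually driving the response:

```python
import numpy as np

def r2_score_fit(X, y):
    """Training R^2 of an ordinary least-squares fit (with intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

def backward_elimination(X, y, keep):
    """Repeatedly drop the feature whose removal hurts the fit the least."""
    active = list(range(X.shape[1]))
    while len(active) > keep:
        scores = {j: r2_score_fit(X[:, [i for i in active if i != j]], y)
                  for j in active}
        # removing the least important feature leaves the highest score
        least_important = max(scores, key=scores.get)
        active.remove(least_important)
    return active

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 3] + rng.normal(scale=0.1, size=200)
print(sorted(backward_elimination(X, y, keep=2)))  # → [0, 3]
```

The three irrelevant features are eliminated first because removing them barely changes the fit, leaving exactly the two informative ones.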

The rfe() function is used to select features from the initial location evaluation system, and the results are shown in Figure 2 and Table 2. The best point in the graph is the solid point, corresponding to 10 features, where the accuracy reaches its maximum. The features are convenience store consumption level per capita, service, environment, number of parking lots, convenience store market saturation, the proportions of Category 1 and Category 3 stores, number of subway stations and store rental. Ranked by importance, the top five features are taste, service, environment, number of parking lots and the proportion of Category 3 convenience stores.

Fig. 2

Result of feature selection in caret package

Table 2. Evaluation system of the feature selection in caret package

First-level indicator    Secondary indicators

Population factors       Convenience store consumption level per capita
Traffic factors          Number of parking lots
                         Number of subway stations
                         Number of bus stops
Competition factors      Convenience store market saturation
                         Proportion of Category 1 convenience stores
                         Proportion of Category 2 convenience stores
                         Proportion of Category 3 convenience stores
Self-factors             Environment
                         Service
Other factors            Store rental price
Feature selection based on boruta

The idea of feature selection in the boruta package differs from traditional feature selection. First, it increases the randomness of the data through shadow features [24, 25]. Then it uses a random forest on the expanded data set to evaluate the importance of each feature. During each cycle, the algorithm checks the importance of each feature against the shadow features (i.e., it compares the feature's score with the largest shadow feature score) and eliminates unimportant features. Finally, every feature receives a verdict of 'confirmed' or 'rejected'. When the algorithm reaches its iteration limit, it stops the loop and outputs the result.

All features related to the outcome variables can be obtained with the boruta feature selection algorithm, because it takes an all-relevant rather than a minimal-subset approach [26]. The advantage of boruta is that it does not aim only at a minimal-error subset, which is what most traditional feature selection algorithms produce when selecting for classification. Moreover, the best feature subset obtained from boruta can be used in experiments under random forest; features of low importance are eliminated during the cycles, which minimises the error of the random forest model. This method can find all relevant features, regardless of whether their correlation with the decision variable is strong or weak, which makes the boruta package suitable for data analysis and data mining.
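The shadow-feature mechanism can be demonstrated with a stripped-down numpy sketch (Python for illustration; boruta proper uses random forest importances, whereas this toy version uses absolute correlation with the target as the importance score, and the data are synthetic with only feature 1 relevant):

```python
import numpy as np

def shadow_feature_screen(X, y, n_rounds=50, seed=0):
    """Boruta-style screen (simplified): a feature's importance -- here just its
    absolute correlation with y -- must beat the best score that permuted
    'shadow' copies of the features achieve by chance."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    imp = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_feat)])
    hits = np.zeros(n_feat)
    for _ in range(n_rounds):
        shadow = rng.permuted(X, axis=0)  # shuffling breaks any real relation to y
        shadow_imp = np.abs([np.corrcoef(shadow[:, j], y)[0, 1]
                             for j in range(n_feat)])
        hits += imp > shadow_imp.max()
    return hits / n_rounds  # fraction of rounds each feature beat the best shadow

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = 2 * X[:, 1] + rng.normal(scale=0.5, size=500)  # only feature 1 is relevant
rates = shadow_feature_screen(X, y)
```

A feature that beats the best shadow in (almost) every round would be 'confirmed', one that rarely beats it 'rejected', and borderline cases are the tentative features that the paper resolves with TentativeRoughFix().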

In this paper, the boruta() function in the Boruta package of the R language is used to select features. The result is shown in Figure 3. The blue boxes in the figure correspond to the minimum, average and maximum Z scores of the shadow features, respectively. Red boxes represent rejected features, yellow boxes represent tentative features and green boxes represent confirmed features. The confirmed (green) features are the number of parking lots, service, environment, the number of subway stations, commodity categories and the proportion of Category 2 stores. The tentative (yellow) features are the proportion of Category 1 convenience stores, convenience store market saturation, the proportion of Category 3 convenience stores, the number of large shopping centres and the convenience store consumption level per capita. The rejected (red) features are the number of bus stops, store rental price and population flow.

Fig. 3

Result of feature selection in Boruta package

The features corresponding to the yellow boxes are checked with TentativeRoughFix(), and the results are shown in Table 3. These five features are the proportion of Category 1 convenience stores, convenience store market saturation, the proportion of Category 3 convenience stores, the number of large shopping centres and the convenience store consumption level per capita.

Table 3. Results of tentative features

Feature                                          MeanImp  MedianImp  NormHits  Result

Convenience store consumption level per capita   2.34     2.57       0.52      accept
Number of large shopping centres                 2.38     2.61       0.55      accept
Proportion of Category 3 convenience stores      2.79     2.94       0.56      accept
Convenience store market saturation              3.11     3.88       0.67      accept
Proportion of Category 1 convenience stores      3.24     3.81       0.61      accept

The result of feature selection based on the boruta package is that 11 features are retained: service, environment, commodity categories, number of parking lots, number of subway stations, the proportions of Category 1, Category 2 and Category 3 convenience stores, convenience store market saturation, consumption level per capita and the number of large shopping centres.

The evaluation system for chain convenience store location is reconstructed from these 11 characteristic indicators, and the results are shown in Table 4. The evaluation system based on boruta feature selection thus consists of 5 primary indexes and 11 secondary indexes. The five primary indicators are population, traffic, competition, self-factors and other factors.

Table 4. Evaluation system of the feature selection in Boruta package

First-level indicator    Secondary indicators

Population factors       Convenience store consumption level per capita
Traffic factors          Number of parking lots
                         Number of subway stations
Competition factors      Convenience store market saturation
                         Proportion of Category 1 convenience stores
                         Proportion of Category 2 convenience stores
                         Proportion of Category 3 convenience stores
Self-factors             Commodity categories
                         Environment
                         Service
Other factors            Number of large shopping centres
Construction of the location model
Selection of hidden layer number, hidden layer node number and activation function of neural network

The location model in this paper is based on the logistic regression, neural network and support vector machine algorithms. Next, the selection of the number of hidden layers, the number of hidden layer nodes and the activation function of the neural network is explained.

Determination of the number of hidden layers and hidden layer nodes of neural network

At present, no good general method exists for determining the number of hidden layers of a neural network; most scholars judge the number of hidden layers and hidden layer nodes through trial and experience [27]. First, the number of hidden layers is determined. It has been found that a single hidden layer can solve many practical problems, and theoretically, a feedforward neural network with two hidden layers can represent any nonlinear decision boundary. Based on the literature and experience, this paper considers neural networks with 1 and 2 hidden layers.

After selecting the number of hidden layers, it is necessary to determine the number of hidden layer nodes. The number of hidden layer nodes is usually taken as sqrt(N*M), where N is the number of nodes in the input layer and M is the number of nodes in the output layer; by this calculation, and considering the actual research situation, the number of nodes in the hidden layer is set to 2 and 3.
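As a worked example of the rule of thumb (the choice of N = 11 inputs, matching the boruta evaluation system, and M = 1 output node is illustrative, not stated in the paper):

```python
import math

N, M = 11, 1                 # illustrative: 11 input features, 1 output node
nodes = math.sqrt(N * M)     # the sqrt(N*M) rule of thumb
print(round(nodes, 2))       # → 3.32, so trying 2 or 3 hidden nodes is reasonable
```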

Figures 4 and 5 are partial structures of neural networks. Figure 4 is a neural network structure with 1 hidden layer and 3 hidden layer nodes. Figure 5 is a structure with 2 hidden layers and 3 and 2 hidden layer nodes. It is helpful to understand the learning and training process of the later model by mastering the structure of a neural network.

Fig. 4

Neural network structure with 1 hidden layer

Fig. 5

Neural network structure with 2 hidden layers

Determination of activation function

Different activation functions have different advantages and disadvantages [28]. For example, the disadvantage of the Sigmoid activation function is that it saturates easily, which causes the gradient to vanish: the weights near the output layer update normally, but the weights near the input layer update very slowly, so effectively only the last few hidden layers are learning. Moreover, because of the form of its expression, the output of the Sigmoid function is always greater than 0, which slows the convergence of model training. If the sample data are very large, the computational cost of learning is also very high.

To address the shortcomings of the Sigmoid activation function, the tanh activation function was introduced. It is centred on the origin and produces both positive and negative output values, which solves the problem that the Sigmoid output is always greater than 0 and improves the convergence speed of model training. However, the tanh activation function still suffers from vanishing gradients and a large computational cost.

The ReLU activation function solves the vanishing-gradient problem on the positive half-axis and computes markedly faster, but it is inactive on the negative half-axis during training and is not centred on the origin.

In this paper, Sigmoid activation function and tanh activation function are selected as the activation functions of neural network.
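The properties discussed above (sigmoid saturation, tanh's origin-centred output, ReLU's dead negative half-axis) can be checked numerically with a short numpy sketch (Python for illustration only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, 0.0, 10.0])
s = sigmoid(x)
grad = s * (1 - s)  # derivative of the sigmoid
# near the saturated ends the gradient all but vanishes; it peaks at 0.25 at x = 0,
# which is the vanishing-gradient behaviour described above
print(grad.round(6))

# tanh is centred on the origin, so its outputs take both signs
print(np.tanh(np.array([-2.0, 2.0])))

# ReLU passes the positive half-axis through and is inactive on the negative one
relu = np.maximum(0, np.array([-3.0, 3.0]))
print(relu)
```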

Selection of kernel function of support vector machine

At present, no general solution to kernel function selection has been put forward; it remains an open problem. Lanckriet (2004) proposed multiple kernel learning for the selection of kernel functions, which learns several kernel functions to obtain their optimal convex combination [29]. Support vector machines have many kernel functions, such as the polynomial kernel and the radial basis kernel, and new kernels can be generated by combining multiple kernel functions. In selecting kernel functions, it is necessary to know the advantages and disadvantages of each and to choose according to the characteristics of the training data. The kernel functions selected in this paper are the polynomial kernel and the radial basis kernel.

The properties of polynomial kernel function

The polynomial kernel function maps the original input space to a high-dimensional feature space [30]. This function is suitable for orthogonally normalised data, and distant data points also exert a certain influence on it, so the polynomial kernel is a global kernel function. The larger the parameter d, the higher the mapping dimension and the larger the computation. The mathematical formula of the polynomial kernel is shown in Formula (1):

k(x_i, x_j) = (x_i^T x_j)^d,  d ≥ 1   (1)
Properties of radial basis kernel function

The radial basis kernel function is a kernel with strong locality, which maps a sample into a higher-dimensional feature space. Its mathematical expression is shown in Formula (2):

k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2δ²)),  δ > 0   (2)
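Formulas (1) and (2) translate directly into code. A small numpy sketch with made-up vectors (Python for illustration; in practice these kernels would be supplied to the SVM implementation, e.g. R's e1071):

```python
import numpy as np

def poly_kernel(xi, xj, d=2):
    """Polynomial kernel k(xi, xj) = (xi^T xj)^d, d >= 1 (Formula 1)."""
    return (xi @ xj) ** d

def rbf_kernel(xi, xj, delta=1.0):
    """Radial basis kernel k(xi, xj) = exp(-||xi - xj||^2 / (2 delta^2)) (Formula 2)."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * delta ** 2))

a, b = np.array([1.0, 2.0]), np.array([2.0, 1.0])
print(poly_kernel(a, b))  # (1*2 + 2*1)^2 = 16.0
print(rbf_kernel(a, a))   # identical points → 1.0
print(rbf_kernel(a, b))   # exp(-2/2) = exp(-1) ≈ 0.368, decaying with distance
```

The last two lines show the locality the text describes: the RBF kernel is maximal for identical points and falls off quickly as the distance ‖x_i − x_j‖ grows, whereas the polynomial kernel responds to the inner product and is therefore influenced by distant points as well.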

Data preprocessing

Before the experiments, important data operations and preprocessing are needed. In this paper, data preprocessing is divided into two parts: division of the data set and standardisation of the data. There are many methods for dividing data sets, such as the hold-out method, cross-validation and the bootstrap method.

The hold-out method is adopted, in which about 2/3 to 4/5 of the samples are used for training and the remainder for testing; in this paper, 70% of the samples are used for training and the remaining 30% for testing. After the data set is divided, it must be standardised uniformly because of the differences in units between variables. Removing the units turns the data into dimensionless pure values, and standardisation scales the data proportionally so that the values fall within a specific range. There are many ways to standardise data, the most common being min-max standardisation and z-score standardisation; this paper uses the scale() function in the R language.
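The two preprocessing steps can be sketched together in numpy (Python for illustration; the paper uses R's scale(), which centres and scales by column, and the data below are a made-up 20-sample matrix). Note the common practice of computing the mean and standard deviation on the training split only and applying them to the test split:

```python
import numpy as np

def holdout_split(X, y, train_frac=0.7, seed=0):
    """Hold-out method: a random 70/30 split, as in the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], X[te], y[tr], y[te]

def zscore(train, test):
    """z-score standardisation (what R's scale() does column-wise),
    using the training set's statistics for both splits."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sd, (test - mu) / sd

X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20)
X_tr, X_te, y_tr, y_te = holdout_split(X, y)
X_tr_s, X_te_s = zscore(X_tr, X_te)
print(len(X_tr), len(X_te))  # → 14 6
```

After scaling, each training column has mean 0 and standard deviation 1, i.e. the dimensionless pure values the text calls for.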

Accuracy of convenience store location model
Introduction of evaluation system
Accuracy and recall rate

Accuracy and recall rate are two metrics widely used in information retrieval and statistical classification to evaluate experimental results. Accuracy (commonly called precision) = TP/(TP + FP) and recall = TP/(TP + FN). TP is the number of samples that are actually positive and predicted positive, FP the number actually negative but predicted positive, FN the number actually positive but predicted negative, and TN the number actually negative and predicted negative; TP + FP + TN + FN equals the total number of samples. Accuracy is the proportion of true positives among the samples predicted positive, and recall is the proportion of actual positives that the model predicts correctly. Both indicators lie between 0 and 1, and larger is better. The F1 value is a comprehensive index: F1 = (2 × accuracy × recall)/(accuracy + recall).
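The three formulas can be bundled into one helper; the confusion counts below are hypothetical, chosen only to illustrate the arithmetic (they are not the paper's results):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the paper's metrics: accuracy (precision) = TP/(TP+FP),
    recall = TP/(TP+FN), F1 = 2*P*R/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# hypothetical confusion counts, for illustration only
p, r, f1 = precision_recall_f1(tp=62, fp=9, fn=24)
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.873 0.721 0.79
```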

For the study of the chain convenience store location model, accuracy is weighted more heavily than recall when judging the results given by the model.

ROC curve and AUC value

ROC stands for 'receiver operating characteristic'; the curve was originally used to detect radar signals from enemy planes and was later introduced into machine learning. In the ROC curve, the vertical axis is the TPR (true positive rate) and the horizontal axis the FPR (false positive rate), where TPR = TP/(TP + FN) and FPR = FP/(TN + FP). By selecting different cut-off points, corresponding values of TPR and FPR are obtained, giving points on the coordinate axes; connecting the points produced by the different cut-off points yields the ROC curve. When comparing the performance of two models, if the ROC curve of one completely covers that of the other, the former's performance is definitely better. If the two curves cross, the AUC value is needed for further judgement. The AUC value is the total area under the ROC curve, and the larger the AUC, the better the classifier.
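The cut-off-sweeping construction described above can be written out directly. A minimal numpy sketch on made-up labels and scores (Python for illustration; it assumes no tied scores, which a production implementation would have to handle):

```python
import numpy as np

def roc_auc(y_true, scores):
    """Sweep the cut-off through the scores from high to low, collect the
    (FPR, TPR) points, and sum the trapezoidal areas under the ROC curve."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(y_true)[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (len(y) - y.sum())])
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))

# made-up labels and model scores for six candidate sites
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(y_true, scores))  # 8 of the 9 positive/negative pairs are ranked correctly → 8/9
```

The result matches the pairwise-ranking interpretation of AUC: it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one.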

Modelling results of logistic regression

The initial evaluation system consists of 14 features; caret retains 10 features and boruta retains 11 after feature selection of the initial evaluation system. Different evaluation systems correspond to different data sets, so different logistic regression models and optimal threshold points are obtained with the above methods. The accuracy, recall, F1 value and AUC value of each model can be obtained once the corresponding threshold is determined.

The accuracy and recall rates of the three site-selection evaluation systems are summarised in Figure 6. Among the logistic regression models, the one based on the data set of 11 features selected by the boruta package performs best, with an accuracy rate of 90.7%, a recall rate of 72.1% and an F1 value of 0.80. The model based on the initial evaluation system achieves an accuracy rate of 80.4%, a recall rate of 68.5% and an F1 value of 0.73. The model based on the 10 features selected by the caret package ranks last of the three, with an accuracy rate of 75.8%, a recall rate of 65.3% and an F1 value of 0.70.

Fig. 6

Modelling results of logistic regression based on three evaluation systems

Therefore, the chain convenience store location model built by combining the feature selection of the boruta package with the logistic regression algorithm performs best, with an accuracy rate of 90.7% and an AUC value of 0.815. This shows that the boruta package improves model performance here, while the caret package does not.

Modelling results of neural network

The initial evaluation system consists of 14 features; the caret package retains 10 of them after feature selection and boruta retains 11. Each evaluation system corresponds to a different data set, so the modelling method above yields neural network models with different structures. Each model is trained on the training-set data and then applied to the test-set data to obtain its accuracy, recall and F1 value.

Because accuracy carries a higher weight in evaluating the chain convenience store location model, only accuracy is plotted when presenting the modelling results of the three location evaluation systems. Figure 7 shows the accuracy of four neural networks with different structures before and after feature selection: the first group of bars corresponds to the first neural network, the second group to the second, the third group to the third and the fourth group to the fourth. The results show that feature selection with either the caret package or the boruta package improves the accuracy of every model, that the models based on the initial evaluation system always rank last, and that of the two feature selection methods the boruta package gives the better results.

Fig. 7

Accuracy of neural network model

Therefore, the location model of chain convenience stores built from the 11 features retained by boruta combined with the third neural network structure performs best, with an accuracy rate of 87.8%, a recall rate of 69.7% and an F1 value of 0.77. The structure of this network, built in R, is shown in Figure 8: it has one hidden layer with three hidden nodes. The network learns internally by the back propagation algorithm, continually updating the threshold and connection-weight values until the stopping condition of the iteration is reached; the final threshold and weight values are shown in the figure. The data set corresponding to the 11 features is fed in, the model is built through the neural network, and the learning result of the model is the final output.

Fig. 8

Structure diagram of neural network with the best performance
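The learning procedure described above — one hidden layer, repeated weight and threshold updates by back propagation until a stopping condition is met — can be sketched in miniature. This Python toy uses two inputs, three sigmoid hidden nodes and an AND-style target in place of the real 11-feature site data; it is an illustration of the mechanics, not the paper's R model:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

n_in, n_hid = 2, 3
# Random initial connection weights and thresholds (biases).
w1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
b1 = [random.uniform(-1, 1) for _ in range(n_hid)]
w2 = [random.uniform(-1, 1) for _ in range(n_hid)]
b2 = random.uniform(-1, 1)

def forward(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(w1, b1)]
    o = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, o

data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]  # toy AND target

def total_error():
    return sum((forward(x)[1] - y) ** 2 for x, y in data)

err_before = total_error()
lr = 0.5
for _ in range(10000):                    # fixed iteration budget as stopping rule
    for x, y in data:
        h, o = forward(x)
        d_o = (o - y) * o * (1 - o)       # output delta (squared error, sigmoid)
        d_h = [d_o * w2[j] * h[j] * (1 - h[j]) for j in range(n_hid)]
        for j in range(n_hid):            # update weights and thresholds layer by layer
            w2[j] -= lr * d_o * h[j]
            for i in range(n_in):
                w1[j][i] -= lr * d_h[j] * x[i]
            b1[j] -= lr * d_h[j]
        b2 -= lr * d_o

err_after = total_error()
```

After training, the squared error has dropped sharply and the rounded outputs reproduce the toy target.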

Modelling results of support vector machine

The kernel functions have several parameters that must be selected. This paper uses the tune.svm() function to search for the best parameters of the svm() function, setting search ranges for gamma and cost: gamma from 10^-6 to 10^-1 and cost from 10 to 100. The kernel function is changed through the kernel parameter, and the results are displayed with the summary() function. The svm() function in the e1071 package is used to build the support vector machine model, and different kernel functions are selected by setting the kernel parameter value, where the polynomial kernel and the radial basis kernel correspond to 'polynomial' and 'radial', respectively. Taking the polynomial kernel function as an example, the optimal parameters obtained by the search are cost = 100 and gamma = 0.001. The models before and after feature selection are then built by pointing the data argument at the corresponding data set.
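The parameter search amounts to a plain grid search over (gamma, cost). The Python sketch below mimics what tune.svm() does in R; `evaluate` is a hypothetical stand-in for fitting an SVM at each grid point and returning its validation accuracy, with a dummy score surface that peaks at gamma = 1e-3, cost = 100 purely for demonstration:

```python
import math
from itertools import product

def grid_search(evaluate, gammas, costs):
    """Try every (gamma, cost) pair and keep the best-scoring one."""
    best = None
    for gamma, cost in product(gammas, costs):
        score = evaluate(gamma, cost)
        if best is None or score > best[0]:
            best = (score, gamma, cost)
    return best

gammas = [10.0 ** -k for k in range(6, 0, -1)]   # 1e-6 ... 1e-1
costs = [10, 25, 50, 75, 100]

# Dummy score surface (illustration only): a real run would train and
# validate an SVM at each grid point instead.
def evaluate(gamma, cost):
    return -abs(math.log10(gamma) + 3) + cost / 100

best_score, best_gamma, best_cost = grid_search(evaluate, gammas, costs)
```

On this dummy surface the search recovers gamma = 0.001 and cost = 100, matching the optimum quoted for the polynomial kernel.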

The support vector machine models with different structures are obtained by the modelling method above; each model is trained on the training-set data and then applied to the test-set data to obtain its accuracy, recall and F1 value.

The modelling results on the data sets before and after feature selection are shown in Figure 9; only the accuracy of each model is plotted. After feature selection with the caret package, the accuracy of the SVM with a polynomial kernel does not improve, while the accuracy of the SVM with a radial basis kernel rises from 80.5% to 81.2%. After feature selection with the boruta package, the accuracy improves for both kernels; in particular, the accuracy of the polynomial-kernel SVM rises from 85.7% to 92.5%, with a recall rate of 75.8% and an F1 value of 0.84.

Fig. 9

Accuracy of support vector machine model

Comparative analysis of model performance

The modelling results based on boruta feature selection are summarised in Table 5, from which it can be seen that the models built by all three algorithms perform well: the logistic regression model reaches an accuracy of 90.7%, the neural network model 87.8% and the support vector machine model 92.5%. Because this study concerns location, the accuracy rate is given a higher weight, and the F1 value combines the model's accuracy and recall rate. Of the three algorithms, the model constructed by the support vector machine performs best, with both the highest accuracy and the highest recall.

Summary of modelling results with the best performance among the three algorithms

Algorithm Accuracy Recall rate F1 value

Logistic regression 90.7% 72.1% 0.80
NNs 87.8% 69.7% 0.77
SVM 92.5% 75.8% 0.84

After summarising and comparing the results, the best modelling algorithm for chain convenience store location is the support vector machine with a polynomial kernel function. The final evaluation system consists of 5 primary indexes and 11 secondary indexes. The primary indexes are population, traffic, competition, self-factors and other factors. The population factor is the per capita convenience store consumption level; the traffic factors are the number of parking lots and the number of subway stations; the competitive factors are the convenience store market saturation and the proportions of Type 1, Type 2 and Type 3 convenience stores; the self-factors are product category, service and environment; and the other factor is the number of shopping malls.

Analysis of application results

After constructing the location model of a chain convenience store, a validation area must be chosen; Nanjing is selected for this purpose. To verify the model, the one identified above as the best choice — the support vector machine with a polynomial kernel — is compared against the alternatives: the Nanjing data are used to test the models built by logistic regression, neural networks and support vector machines, and the results are shown in Figure 10. The model constructed with the polynomial kernel function performs best, exceeding the other algorithms in both accuracy and recall rate.

Fig. 10

Modelling results of multiple classification algorithms (based on the data of Nanjing)

The results show that the location model of chain convenience stores presented in this paper performs well and has high accuracy; it can be applied widely and has practical application value.

Conclusion

With the rise in Chinese residents' income levels, consumers no longer shop on price alone but increasingly prefer convenience and speed. The convenience store market is ever more promising, and chain convenience store brands have begun expanding everywhere to seize the market and enter a stage of large-scale development. The operation of a convenience store, however, depends largely on its location, and at present location decisions rely mainly on subjective experience and lack scientific rigour. Based on data mining, this paper studies a location selection model for convenience stores that overcomes subjective assumptions and evaluates each candidate location scientifically, which has reference value and practical significance for the location of chain convenience stores.

Fig. 1

Process of data mining

Fig. 2

Result of feature selection in caret package

Fig. 3

Result of feature selection in Boruta package

Fig. 4

Neural network structure with 1 hidden layer

Fig. 5

Neural network structure with 2 hidden layers

Fig. 6

Modelling results of logistic regression based on three evaluation systems

Fig. 7

Accuracy of neural network model

Fig. 8

Structure diagram of neural network with the best performance

Fig. 9

Accuracy of support vector machine model

Fig. 10

Modelling results of multiple classification algorithms (based on the data of Nanjing)

Initial evaluation system

First-level indicator: Secondary indicators

Demographic factors: Population flow; Convenience store consumption level per capita
Traffic factors: Number of parking lots; Number of subway stations; Number of bus stops
Competitive factors: Convenience store market saturation; Proportion of Type 1 convenience stores; Proportion of Type 2 convenience stores; Proportion of Type 3 convenience stores
Self-factors: Product category; Environment; Service
Other factors: Shop rental price; Number of shopping malls

Evaluation system after feature selection with the caret package

First-level indicator: Secondary indicators

Demographic factors: Convenience store consumption level per capita
Traffic factors: Number of parking lots; Number of subway stations; Number of bus stops
Competitive factors: Convenience store market saturation; Proportion of Type 1 convenience stores; Proportion of Type 2 convenience stores; Proportion of Type 3 convenience stores
Self-factors: Environment; Service
Other factors: Shop rental price

Results of provisional features

Feature | Mean importance | Median importance | normHits | Result

Convenience store consumption level per capita 2.34 2.57 0.52 accept
Number of shopping malls 2.38 2.61 0.55 accept
Proportion of Type 3 Convenience Stores 2.79 2.94 0.56 accept
Convenience store market saturation 3.11 3.88 0.67 accept
Proportion of Type 1 Convenience Stores 3.24 3.81 0.61 accept


Evaluation system after feature selection with the Boruta package

First-level indicator: Secondary indicators

Demographic factors: Convenience store consumption level per capita
Traffic factors: Number of parking lots; Number of subway stations
Competitive factors: Convenience store market saturation; Proportion of Type 1 convenience stores; Proportion of Type 2 convenience stores; Proportion of Type 3 convenience stores
Self-factors: Product category; Environment; Service
Other factors: Number of shopping malls

Tian Xin, Cao Shasha, Song Yan. The impact of weather on consumer behavior and retail performance: Evidence from a convenience store chain in China. Journal of Retailing and Consumer Services, 2021, 62. DOI: 10.1016/j.jretconser.2021.102583

Lyu Fang, Lim Hyun A, Choi Jaewon. Customer Acceptance of Self-service Technologies in Retail: A Case of Convenience Stores in China. Asia Pacific Journal of Information Systems, 2019, 29(3). DOI: 10.14329/apjis.2019.29.3.428

Huang Ting, Liu Chunxiong. The Distribution Strategies of Convenience Stores Chain in China from Japan 7–11. MATEC Web of Conferences, 2017, 100. DOI: 10.1051/matecconf/201710005045

China Convenience Store Outlook 2020 – Increased Demand for Convenience & Time Saving Options. M2 Presswire, 2016.

Tallón Ballesteros Antonio J. Fuzzy Systems and Data Mining VII. IOS Press, 2021. DOI: 10.3233/FAIA340

Cai Zhenhua, Zuo Chuanshuai, Zhu Jianying, Qin Peng, Duan Baojiang, Song Wenrong, Wang Jiaqi. Classification and Application of Tight Gas Wells Based on Cluster Analysis. Proceedings of the 2021 5th International Conference on Electrical, Automation and Mechanical Engineering (EAME2021), 2021: 723–729. DOI: 10.26914/c.cnkihy.2021.044452

Zhou Guoqing. Data Mining for Co-location Patterns: Principles and Applications. CRC Press, 2021. DOI: 10.1201/9781003139416

Bhargava Neeraj, Bhargava Ritu, Rathore Pramod Singh, Agrawal Rashmi. Artificial Intelligence and Data Mining Approaches in Security Frameworks. John Wiley & Sons, 2021. DOI: 10.1002/9781119760429

Chong Xiaoyu, Shang Shun-Li, Krajewski Adam M, Shimanek John D, Du Weihang, Wang Yi, Feng Jing, Shin Dongwon, Beese Allison M, Liu Zi-Kui. Correlation analysis of materials properties by machine learning: illustrated with stacking fault energy from first-principles calculations in dilute FCC-based alloys. Journal of Physics: Condensed Matter, 2021, 33(29). DOI: 10.1088/1361-648X/ac0195

Takama Yasufumi, Tanaka Yuna, Mori Yoshiyuki, Shibata Hiroki. Treemap-Based Cluster Visualization and its Application to Text Data Analysis. Journal of Advanced Computational Intelligence and Intelligent Informatics, 2021, 25(4). DOI: 10.20965/jaciii.2021.p0498

Erasmus Daniel J. DNA barcoding: A different perspective to introducing undergraduate students to DNA sequence analysis. Biochemistry and Molecular Biology Education, 2021, 49(3). DOI: 10.1002/bmb.21492

Baietto Armando. Palazzo Gualino 1928–2018: A Masterpiece of Italian Rationalism by Giuseppe Pagano Pogatschnig and Gino Levi-Montalcini. Quodlibet, 2020. DOI: 10.2307/j.ctv1cdxfx2

Tari Zahir, Fahad Adil, Almalawi Abdulmohsen, Yi Xun. Network Classification for Traffic Management: Anomaly detection, feature selection, clustering and classification. IET Digital Library, 2020. DOI: 10.1049/PBPC032E

Felder Darryl, Windsor Amanda. Introduced Hemigrapsus oregonensis (Dana, 1851) formerly colonized an inland Texas salt spring, as now underpinned by COI barcode sequence analysis. BioInvasions Records, 2020, 9(2). DOI: 10.3391/bir.2020.9.2.12

Fu Wenjiang. A Practical Guide to Age-Period Cohort Analysis Using R. CRC Press, 2015.

Jajuga Krzysztof, Sokołowski Andrzej, Bock Hans Hermann. Classification, Clustering, and Data Analysis. Springer, Berlin, Heidelberg. DOI: 10.1007/978-3-642-56181-8

Abonyi János, Feil Balázs. Cluster Analysis for Data Mining and System Identification. Birkhäuser, Basel.

Fernandez P, Pelegrin B, Lancinskas A. New heuristic algorithms for discrete competitive location problems with binary and partially binary customer behavior. Computers and Operations Research, 2017, 79: 12–18. DOI: 10.1016/j.cor.2016.10.002

Garcia J, Alfandari L. Robust location of new housing developments using a choice model. Annals of Operations Research, 2018, 271: 527–550. DOI: 10.1007/s10479-017-2750-6

Lin B, Liu S. The location-allocation model for multi-classification-yard location problem. Transportation Research Part E, 2019, 122: 283–308. DOI: 10.1016/j.tre.2018.12.013

Muhammet D, Ibrahim Z, Selahattin Y. A GIS-based interval type-2 fuzzy set for public bread factory site selection. Journal of Enterprise Information Management, 2018, 31(2): 101–123.

Sennaroglu B, Celebi G. A military airport location selection by AHP integrated PROMETHEE and VIKOR methods. Transportation Research Part D: Transport and Environment, 2018, 59(3): 160–173. DOI: 10.1016/j.trd.2017.12.022

Waqas K, Zaza N, Lee H. Using K-means clustering in international location decision. Journal of Global Operations and Strategic Sourcing, 2018, 21(3): 23–46.

Hemalatha M, Sridevi P, Sivakumar V. A decision-support system application in retail store location model: a case study of hypermarket in emerging markets. International Journal of Business & Emerging Markets, 2017, 3(19): 158–176. DOI: 10.1504/IJBEM.2011.039406

He Y, Li G, Liao Y, et al. Gesture recognition based on an improved local sparse representation classification algorithm. Cluster Computing, 2018, 12(1): 1–12. DOI: 10.1007/s10586-017-1237-1

Xu Z, Che Y, Min H, et al. Initial classification algorithm for pavement distress images using features fusion. International Conference on Intelligent Interactive Multimedia Systems and Services, Springer, Cham, 2018, 10(22): 418–427.

Caigny A D, Coussement K, Bock K. A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees. European Journal of Operational Research, 2018, 110(32): 35–45. DOI: 10.1016/j.ejor.2018.02.009

Wang Y. Analysis of user reading behavior based on Bayes classification algorithm. Journal of Shaoguan University, 2018, 27(6): 287–291.

Lang X, Li P, Hu Z, et al. Leak detection and location of pipelines based on LMD and least squares twin support vector machine. IEEE Access, 2017, 3(22): 1–1. DOI: 10.1109/ACCESS.2017.2703122

Mathas, Schuster, Puga. A heuristic algorithm for solving large location inventory problems with demand uncertainty. European Journal of Operational Research, 2017, 259(2): 413–423. DOI: 10.1016/j.ejor.2016.10.037
