Open Access

Research on Driving Conditions Based on Principal Component and K-means Clustering Optimization

June 16, 2025


Introduction

Vehicle driving conditions, also known as vehicle operating cycles or driving cycles, refer to the mathematical representation of the speed-time curve that characterizes a vehicle's operating state in a specific traffic environment [1]. They provide core data support for fuel-efficiency evaluation, emission-control research and development, and intelligent traffic control, and directly affect the design of new energy vehicles and the accuracy of urban traffic carbon accounting [2]. Zhao Xuan's team [3] proposed a driving-mode analysis method based on fuzzy C-means clustering. By integrating the time-distribution characteristics of kinematic segments with multi-dimensional parameter correlation analysis, they achieved intelligent identification and optimized reconstruction of typical driving conditions of urban electric vehicles; the condition curves they constructed show significant improvements in typicality indicators over traditional methods. K-means clustering is currently widely used in driving-cycle synthesis. However, K-means often suffers from strong dependence on the initial cluster centers, susceptibility to isolated points, and sensitivity to noisy data. Ma Fumin et al. [4] constructed a local-density dynamic-adaptation measurement model to accurately characterize the spatial distribution of data objects within a cluster, and on this basis designed a rough K-means clustering algorithm that integrates a local-density adaptation mechanism. Yuan Yiming et al. [5] developed an optimized K-means text-clustering algorithm based on density peaks. By accurately selecting density-peak points as the initial clustering centers, this algorithm effectively overcomes the convergence instability caused by random center initialization in traditional K-means and significantly improves the reliability of clustering results.
Although the above two methods optimize the initial cluster centers to a certain extent, they do not mention the impact of edge data and isolated points in the data set. Bao Zhiqiang et al. [6] only used an outlier removal algorithm to eliminate isolated points in the data set, but still used traditional K-means clustering to cluster the data set.

Based on the above research conclusions, this paper constructs an improved density-driven K-means clustering algorithm. The core innovation of this algorithm is to introduce density measurement indicators to screen the initial clustering centers, effectively suppress the interference of noise data on the selection of initial centers, and integrate enhanced principal component analysis technology to build a two-stage collaborative optimization framework to achieve intelligent synthesis of driving conditions.

Data Preprocessing

The measured data obtained in this study are from a light-vehicle road operation scenario in a certain city, with a sampling rate of 1 Hz. The data dimensions include multiple source parameters such as timestamp, global positioning system (GPS) speed measurement, geographic longitude and latitude coordinates, and instantaneous fuel consumption rate. During actual data collection, due to the combined influence of factors such as complex driving environments, electromagnetic interference, and urban building occlusion, the original sensor data generally suffer from significant noise pollution, manifesting as data distortion, abnormal value fluctuations, and signal interference [7]. Therefore, the first step in data processing is to preprocess the original data using wavelet decomposition and reconstruction [8]. The basic idea is to discard the wavelet coefficients corresponding to noise in each frequency band while retaining those of the original signal, and then perform wavelet reconstruction on the processed coefficients to obtain a clean signal.

Assume that a noisy signal can be described as: $$S(x) = f(x) \cdot n_2(x) + n_1(x)$$

where S(x) is the degraded signal, f(x) is the original signal, n1(x) is the additive noise, and n2(x) is the multiplicative noise.

The denoising process based on wavelet decomposition and reconstruction is described as follows:

Decompose the noisy signal S(x) into approximation components cj,k and detail components dj,k by wavelet decomposition.

According to the threshold δj, process the detail components dj,k of layer j using equation (2): $$d_{j,k} = \begin{cases} d_{j,k}, & |d_{j,k}| > \delta_j \\ 0, & |d_{j,k}| \le \delta_j \end{cases}$$

Use the reconstruction algorithm to reconstruct the approximation components cj,k and the thresholded detail components dj,k to obtain the filtered signal.
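The three-step denoising procedure above can be sketched with the PyWavelets library; the wavelet basis ('db4'), the decomposition level, and the universal-threshold rule used for δj are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_denoise(signal, wavelet="db4", level=4):
    """Hard-threshold wavelet denoising: decompose -> threshold details -> reconstruct."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)  # [cA_n, cD_n, ..., cD_1]
    denoised = [coeffs[0]]  # keep the approximation components c_{j,k}
    for d in coeffs[1:]:
        # assumed threshold rule: universal threshold sigma*sqrt(2 ln N),
        # with sigma estimated from the median absolute deviation of this level
        sigma = np.median(np.abs(d)) / 0.6745
        delta = sigma * np.sqrt(2 * np.log(len(signal)))
        denoised.append(pywt.threshold(d, delta, mode="hard"))  # equation (2)
    return pywt.waverec(denoised, wavelet)[: len(signal)]

# usage: a noisy 1 Hz speed trace (synthetic example)
rng = np.random.default_rng(0)
t = np.arange(600)
clean = 20 + 10 * np.sin(2 * np.pi * t / 300)
noisy = clean + rng.normal(0, 1.5, t.size)
smooth = wavelet_denoise(noisy)
```

Hard thresholding keeps detail coefficients above the threshold unchanged, matching equation (2); soft thresholding (`mode="soft"`) would shrink them toward zero instead.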

The original data and the data preprocessed by wavelet decomposition and reconstruction are shown in Fig. 1.

Figure 1.

Comparison of noise reduction data results

The comparison between the original data and the preprocessed data shows that the wavelet decomposition and reconstruction method has a good denoising effect and effectively improves the signal-to-noise ratio.

Analysis of Driving Condition Data
A. Feature Parameter Extraction and Kinematic Segment Division

Based on the analysis of relevant data and related research, 12 characteristic parameters were defined to describe the kinematic segments [8]: running time T/s, driving distance S/km, average speed Va/(km·h−1), average driving speed Vd/(km·h−1), idling time ratio Ti/%, acceleration time ratio Ta/%, deceleration time ratio Td/%, cruising time ratio Tc/%, speed standard deviation Vstd/(km·h−1), average acceleration aa/(m·s−2), acceleration standard deviation astd/(m·s−2), and average deceleration ad/(m·s−2).

The interval from the start of one idling phase to the start of the next idling phase is called a kinematic segment [9]. Using Python, 1,655 kinematic segments were extracted from the 195,815 preprocessed data points.
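Segment extraction can be sketched as follows; the 1 km·h−1 idling cutoff and the dropping of the trailing incomplete segment are assumptions for illustration.

```python
import numpy as np

IDLE_V = 1.0  # km/h below which the vehicle is treated as idling (assumed cutoff)

def split_segments(speed):
    """Split a 1 Hz speed trace into kinematic segments (idle start -> next idle start)."""
    idle = speed < IDLE_V
    # indices where an idling phase begins (moving -> idle transition)
    starts = [i for i in range(1, len(speed)) if idle[i] and not idle[i - 1]]
    if idle[0]:
        starts.insert(0, 0)
    # consecutive idle starts bound one segment; the trailing partial segment is dropped
    return [speed[a:b] for a, b in zip(starts, starts[1:])]

# usage: a toy trace with two complete segments
speed = np.array([0, 0, 5, 12, 8, 0, 0, 7, 15, 3, 0, 0, 9])
segs = split_segments(speed)
```

Each returned segment begins with its idling phase, so the idling time ratio Ti and the other 11 parameters can be computed per segment directly.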

B. Improved Principal Component Analysis

Although the classical principal component analysis can effectively eliminate the differences in dimensions and magnitudes between the original variables when standardizing data, this process may also cause the characteristic differences of different indicators to be over-smoothed, resulting in potential information loss [10]. In view of the above situation, the improved principal component analysis method is as follows:

Improve the traditional dimensionless treatment of principal component analysis by using the indicator mean method and the indicator homogenization method [11]. Assume there are n objects and p indicators in the overall evaluation, so the initial indicators form a matrix X = (xij)n×p. The mean method divides each original indicator by its column mean: $$Y_{ij} = x_{ij}/\bar{x}_j,\quad (i = 1,2,\cdots,n;\ j = 1,2,\cdots,p)$$ where $$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij},\quad (j = 1,2,\cdots,p)$$

Homogenization then makes all indicators act on the overall evaluation in the same direction. Let yij be a reverse indicator and $\min_{1 \le i \le n}\{y_{ij}\}$ its smallest value; the homogenized indicator is: $$y'_{ij} = y_{ij} - \min_{1 \le i \le n}\{y_{ij}\},\quad (i = 1,2,\cdots,n;\ j = 1,2,\cdots,p)$$

where y′ij is the homogenized sequence of yij; this transformation does not change the distribution of the original indicators. The improved principal components can therefore carry more characteristic-parameter information while achieving dimensionality reduction of the driving-condition data.
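A minimal sketch of the improved dimensionless step followed by eigen-decomposition, assuming a generic data matrix; which columns are reverse indicators is left to the caller.

```python
import numpy as np

def improved_pca(X, reverse_cols=()):
    """Mean-scale each indicator, homogenize reverse indicators, then PCA.

    X: (n_objects, p_indicators) matrix; reverse_cols: indices of reverse indicators.
    """
    Y = X / X.mean(axis=0)                  # indicator mean method: Y_ij = x_ij / x̄_j
    for j in reverse_cols:                  # homogenization: y'_ij = y_ij - min_i y_ij
        Y[:, j] = Y[:, j] - Y[:, j].min()
    Z = Y - Y.mean(axis=0)                  # center before eigen-decomposition
    cov = np.cov(Z, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]        # sort components by explained variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    contrib = eigval / eigval.sum()         # contribution rate of each component
    return Z @ eigvec, contrib

# usage: pick the number of components reaching 85% cumulative contribution
rng = np.random.default_rng(0)
X = rng.random((100, 12)) + 1.0             # stand-in for the 12 characteristic parameters
scores, contrib = improved_pca(X)
k = int(np.searchsorted(np.cumsum(contrib), 0.85)) + 1
```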

Fig. 2 shows obvious inflection points in the variation curve of the principal-component eigenvalues; based on this observation, the first three principal components are selected.

Figure 2.

Scree plot

As shown in Fig. 3, the cumulative contribution rate of the first three principal components reaches 85%, which essentially represents the information of the 12 characteristic parameters of the segments and can be used for cluster analysis. The first principal component alone contains 45% of the information, meeting the requirement that fewer principal components represent more information.

Figure 3.

Contribution rate and cumulative contribution rate

The larger the absolute value of a parameter's principal-component loading, the higher the correlation between that parameter and the principal component, and the larger its contribution. According to Table 1, the parameters loading most heavily on the first principal component are driving distance, running time, cruising time ratio, and average driving speed; on the second principal component, average speed; and on the third, idling time ratio. From the first three principal components, the 12-parameter matrix of the sample is thus reduced to 6 characteristic parameters that represent most of the sample information.

TABLE 1. PRINCIPAL COMPONENT LOAD MATRIX

Characteristic parameter                        M1       M2       M3
Deceleration time ratio Td/%                    0.323    0.351    -0.223
Driving distance S/km                           0.893    0.234    0.065
Running time T/s                                0.782    0.251    -0.342
Acceleration time ratio Ta/%                    0.396    -0.186   0.061
Cruising time ratio Tc/%                        0.641    0.335    -0.075
Average speed Va/(km·h−1)                       0.499    0.763    0.125
Average driving speed Vd/(km·h−1)               0.778    0.415    0.132
Speed standard deviation Vstd/(km·h−1)          0.498    0.333    0.054
Acceleration standard deviation astd/(m·s−2)    0.125    0.267    -0.077
Average acceleration aa/(m·s−2)                 0.024    0.523    0.053
Average deceleration ad/(m·s−2)                 0.266    -0.433   -0.059
Idling time ratio Ti/%                          0.165    -0.351   0.853
C. Improved K-means Cluster Analysis

The K-means algorithm is sensitive to the selection of initial cluster centers. Since the process uses a random mechanism, the initial centroids it selects may fall in sparse regions of the data or coincide with outliers. Such a non-ideal initial state easily causes the algorithm to fall into a local optimum, thereby reducing clustering quality [12]. In common optimization methods, to obtain better initial centers than the random selection of traditional algorithms, the k data objects that are farthest apart or have the largest density are generally chosen as initial cluster centers. However, if the data set contains noise, the "distance optimization method" is likely to pick noise points as initial centers. The "density method" selects the k densest data objects as initial centers; it can remove isolated points but is not suitable for non-convex data sets. This paper combines the "distance optimization method" and the "density method" to determine optimal initial cluster centers, and constructs a density measurement method for the data set.

1) Relevant definitions

The Euclidean distance between two points in space is: $$d(x_i,x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{im}-x_{jm})^2}$$ where xi and xj are two m-dimensional data points.

The average distance between data objects is: $$MeanDist = \frac{1}{C_n^2}\sum d(x_i,x_j)$$

where n is the number of data points in the data set and $C_n^2$ is the number of point pairs formed by the n data points.

Given a data set D = {x1, x2, …, xn}, the density measurement function of data point xi is: $$Dens(x_i) = \sum_{j=1}^{n} u\left(MeanDist - d(x_i,x_j)\right)$$

where the function u(z) is defined as: $$u(z) = \begin{cases} 1, & z \ge 0 \\ 0, & z < 0 \end{cases}$$

The density parameter of data point xi is thus the number of data objects inside a circle with center xi and radius MeanDist.

The average density measurement function of the data set is defined as: $$MeanDens(D) = \frac{1}{n}\sum_{i=1}^{n} Dens(x_i)$$ where n is the number of data objects in data set D.

For data point xi in data set D, if $Dens(x_i) < \alpha \times MeanDens(D)$, then xi is called an isolated point, where 0 < α ≤ 1.

The distance between data object xi and data set C is the minimum distance from xi to all data points in C: $$d(x_i,C) = \min\left\{d(x_i,x_j) : x_j \in C\right\}$$
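The definitions above can be sketched as follows; the synthetic data set and the value α = 0.5 are illustrative assumptions.

```python
import numpy as np

def density_stats(X):
    """Compute the distance matrix, MeanDist, and per-point density Dens(x_i)."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # Euclidean distances
    pair_sum = D.sum() / 2                     # sum over the C(n,2) unordered pairs
    mean_dist = pair_sum / (n * (n - 1) / 2)   # MeanDist
    dens = (D <= mean_dist).sum(axis=1)        # points within radius MeanDist (incl. self)
    return D, mean_dist, dens

alpha = 0.5  # isolation threshold factor (0 < alpha <= 1), value assumed here
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (40, 2)), [[5.0, 5.0]]])  # tight cluster + one outlier
_, mean_dist, dens = density_stats(X)
isolated = dens < alpha * dens.mean()          # isolated-point criterion
```

The far-away point sees only itself inside its MeanDist neighborhood, so its density collapses and it is flagged as isolated.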

2) Algorithm Idea

The improved K-means clustering process is as follows: first evaluate the density measurement function of all samples, identify and remove isolated points, and construct a high-density data subset. Then select the sample with the largest density value as the first initial cluster center, and repeatedly select the sample point in the remaining high-density data that is farthest from the previously chosen centers as the next cluster center, until k initial centroids are established. Finally, the standard K-means clustering process is executed from the optimized centroid configuration.

The algorithm is described as follows:

Input: sample data set D = {d1, d2, …, dn} containing n data objects.

Output: optimal k value and clustering results.

Step 1: Use d(xi, xj) and MeanDist to calculate the distance and average distance between any two data objects in data set D.

Step 2: Use Dens(xi) and MeanDens(D) to calculate the density measurement function of every data object in D and the average density measurement function of D.

Step 3: According to the isolated-point criterion, determine the isolated data objects and delete them from D to obtain set A of high-density objects.

Step 4: Select the data object with the largest density from set A as the first initial cluster center, add it to set B, and remove it from A.

Step 5: From set A, select the data object farthest from set B as the next initial cluster center, add it to set B, and remove it from A.

Step 6: Repeat Step 5 until set B contains k data objects.

Step 7: Run traditional K-means clustering with the k cluster centers in B.
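A sketch of the initial-center selection (Steps 1-6), assuming two-dimensional data and α = 0.5; the standard K-means pass (Step 7) would then start from the returned centers instead of random ones.

```python
import numpy as np

def improved_kmeans_init(X, k, alpha=0.5):
    """Pick k initial centers: drop isolated points, seed with the densest point,
    then repeatedly take the candidate farthest from the chosen set."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    mean_dist = D.sum() / (n * (n - 1))            # = (sum over pairs) / C(n,2)
    dens = (D <= mean_dist).sum(axis=1)
    keep = np.flatnonzero(dens >= alpha * dens.mean())   # set A: non-isolated points
    centers = [keep[np.argmax(dens[keep])]]              # densest point -> first center
    candidates = set(keep) - set(centers)
    while len(centers) < k:
        # farthest-from-set rule: maximize the min distance to the chosen centers
        nxt = max(candidates, key=lambda i: min(D[i, c] for c in centers))
        centers.append(nxt)
        candidates.discard(nxt)
    return X[centers]

# usage: three well-separated synthetic clusters
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in ([0, 0], [3, 0], [0, 3])])
init = improved_kmeans_init(X, k=3)
```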

3) Results Analysis

The Calinski-Harabasz (CH) index is used as an evaluation criterion to determine the optimal k value before clustering. It is based on the between-class and within-class dispersion matrices of all samples: the larger the CH value, the tighter the clusters and the more separated the classes, and the better the clustering result [13]. The index is defined as: $$CH(k) = \frac{trB(k)/(k-1)}{trW(k)/(n-k)}$$

where n is the number of samples, k is the number of clusters, trB(k) is the trace of the between-class dispersion matrix, and trW(k) is the trace of the within-class dispersion matrix.
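Using scikit-learn, CH-based selection of k can be sketched as follows; the synthetic three-cluster data set is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in ([0, 0], [4, 0], [0, 4])])

# evaluate CH(k) for candidate cluster counts and keep the maximum
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)
best_k = max(scores, key=scores.get)
```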

To determine the appropriate number of clusters, this paper first clusters the data into 2, 3, and 4 categories; the results are shown in Fig. 5. Since the k value cannot be determined clearly from these plots alone, CH is used as the evaluation index. The CH values for the different cluster counts are shown in Fig. 4: the CH value is largest when the data are clustered into 3 categories.

Figure 4.

Relationship between cluster number and CH index

Figure 5.

K-means clustering results

Reference [14] used two typical driving-condition characteristic parameters, average vehicle speed and idling time ratio, for clustering research. This study instead focuses on the cruising time ratio and the average driving speed, which have higher contributions, as the core analysis dimensions. The original data points in Fig. 6 form a scattered distribution, in which the red marked area circles the isolated samples and outliers that deviate significantly from the main distribution.

Figure 6.

Scatter plot of segments in two-dimensional feature space

This paper clusters the kinematic segments into three categories. As shown in Fig. 7, the cluster centers of the first, second, and third categories are (14, 38), (52, 45), and (86, 54), respectively. Considering general urban traffic conditions: the first category corresponds to relatively congested urban areas, with a relatively low average speed and cruising time ratio and frequent starts and stops; the second category corresponds to relatively unobstructed suburban areas, with a relatively high average speed and cruising time ratio and fewer starts and stops; the third category corresponds to unobstructed high-speed segments, with a high average speed and cruising time ratio and few starts and stops.

Figure 7.

Clustering results of fragments in two-dimensional feature space

Driving Condition Construction and Analysis
A. Working condition construction

According to the proportion of each category's total duration in the full data set, the time occupied by each category in the final constructed cycle can be calculated. As shown in Fig. 8, the cycle is constructed to the 1,200 s duration typical of driving cycles.
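The time-budget calculation can be sketched as follows; the per-class durations used here are illustrative placeholders, not the paper's measured values.

```python
# allocate the 1,200 s synthetic cycle among the three segment classes in
# proportion to each class's share of total time in the data set
TOTAL_CYCLE = 1200  # s, typical driving-cycle length
class_time = {"congested urban": 70000, "suburban": 80000, "highway": 45000}  # s, assumed

total = sum(class_time.values())
budget = {c: round(TOTAL_CYCLE * t / total) for c, t in class_time.items()}
```

Representative segments from each cluster are then spliced together until each class fills its time budget.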

Figure 8.

Synthetic driving conditions

As can be seen from Fig. 9, most of the operating points are concentrated in the medium and low speed range, and the distribution of acceleration is relatively reasonable, which can show the actual acceleration and deceleration of the car.

Figure 9.

Time and acceleration diagram

As can be seen from Fig. 10, the acceleration is mainly distributed in the speed range of 0-40 km·h−1 and around 80 km·h−1. During low speed and high acceleration, the instantaneous fuel consumption has a significant bulge, which may be caused by the driver's improper operation.

Figure 10.

Scatter plot of instantaneous fuel consumption of speed and acceleration

B. Working condition verification and fuel consumption analysis

The smaller the distribution difference value SAFDdiff is, the higher the commonality between the constructed working condition and the actual data [15]: $$SAFD_{diff} = \frac{\sum_i \left(SAFD_{cycle}(i) - SAFD_{data}(i)\right)^2}{\sum_i SAFD_{data}(i)^2}$$

where SAFDcycle is the speed-acceleration frequency distribution (SAFD) of the constructed cycle, and SAFDdata is the SAFD of all data.
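A sketch of the SAFD-difference computation; the bin widths (5 km·h−1 and 0.5 m·s−2) and the synthetic data are assumptions for illustration.

```python
import numpy as np

def safd(v, a, v_bins, a_bins):
    """Speed-acceleration frequency distribution: share of samples per (v, a) cell."""
    H, _, _ = np.histogram2d(v, a, bins=[v_bins, a_bins])
    return H / H.sum()

def safd_diff(v_cycle, a_cycle, v_data, a_data):
    v_bins = np.arange(0, 121, 5)        # km/h, assumed 5 km/h intervals
    a_bins = np.arange(-3, 3.01, 0.5)    # m/s^2, assumed 0.5 m/s^2 intervals
    s_cyc = safd(v_cycle, a_cycle, v_bins, a_bins)
    s_dat = safd(v_data, a_data, v_bins, a_bins)
    return ((s_cyc - s_dat) ** 2).sum() / (s_dat ** 2).sum()

# usage: compare a 1,200-sample "cycle" drawn from synthetic full data
rng = np.random.default_rng(4)
v_data, a_data = rng.uniform(0, 120, 5000), rng.normal(0, 1, 5000)
idx = rng.choice(5000, 1200, replace=False)
diff = safd_diff(v_data[idx], a_data[idx], v_data, a_data)
```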

COMPARISON OF THE METHOD IN THIS PAPER AND TRADITIONAL K-MEANS RESULTS

Method                           Eigenvalue mean relative error/%   Cluster average accuracy/%   Average time/s   SAFDdiff/%
Traditional K-means clustering   8.1                                93                           215              2.12
Method in this paper             4.6                                98                           127              1.17

The speed-acceleration joint distribution divides the speed and acceleration values into equal intervals and calculates the proportion of condition data falling in each interval [16]. As shown in Fig. 11, the joint speed-acceleration difference between the original data and the constructed driving condition lies within ±1.2%, and the calculated distribution difference value SAFDdiff is 1.17%. Therefore, the driving condition constructed in this paper matches the driving characteristics of light vehicles and has strong applicability.

Figure 11.

SAFD difference between experimental data and synthetic conditions

Conclusions

To address the inherent information distortion in the dimensionless processing of traditional principal component analysis, this study proposes a dual optimization strategy of "mean preservation-trend synchronization", achieving improved standardization while maintaining the original difference characteristics of the variables. To address the sensitivity of the K-means clustering algorithm to the initial centers, an intelligent optimization mechanism for initial cluster centers based on local density distribution is constructed: statistically representative initial centroids are selected by the density measurement values of the data objects, which effectively eliminates noise interference and improves clustering stability. Combining the improved principal component analysis with the improved K-means algorithm, this paper builds an optimization method for synthesizing automobile driving conditions. The verification results show that the difference rate between the synthetic conditions generated by the proposed method and the original data in the joint speed-acceleration distribution is only 1.17%, and the reconstruction accuracy reaches 98.83%. The driving conditions synthesized by the proposed method are significantly better than those of traditional methods and closer to actual traffic conditions.

Language:
English
Publication timeframe:
4 issues per year
Journal subjects:
Computer Science, other