Analysis of Clothing Image Classification Models: A Comparison Study between Traditional Machine Learning and Deep Learning Models


Introduction

In the era of information technology, image information has become essential for transmitting and obtaining information in clothing e-commerce sales [1]. As online clothing transactions have become increasingly popular these days, the number of clothing images is also increasing, requiring them to be classified. On the one hand, the accurate classification of clothing images facilitates the storage of large amounts of data [2], but on the other it has potential in automatic clothing recognition, clothing retrieval [3,4] and clothing recommendation [5]. Therefore, many scholars have been trying to devise an efficient and accurate classification model for the massive clothing dataset in e-commerce.

Clothing images can be divided into pure clothing and dressed clothing images [6]. Pure clothing images show the garment alone, without a human model and against a large solid background, as in flat display images. Classifying pure clothing images is relatively easy, but discriminability across different textures, colours, and fabric features remains low [7]. In contrast, dressed clothing images often contain a human model and a large area of complex background, for example, “seller's show” and “buyer's show” images. In these images, the cluttered background, deformation, and occlusion by the human body make classification difficult [8]. As Internet shopping develops, user needs tend to diversify. Because consumers are more likely to rely on dressed clothing images when making purchases, these images have gradually become crucial for selling clothing online [9]. Manually labeling attributes for such a large number of complex clothing images is labour-intensive and time-consuming, and is easily influenced by subjective judgment [10]. Applying image classification technology to clothing images makes it possible to identify their deeper features. However, two questions remain open: 1) whether traditional machine learning algorithms or deep learning models are more suitable for the classification of pure clothing images, and 2) which deep learning model is more suitable for the classification of dressed clothing images.

This paper aims to determine proper classification models for pure clothing images and for dressed clothing images, thereby enabling the classification of clothing images in the e-commerce industry. First, the author summarises and analyses the existing classification models for pure clothing and dressed clothing images. Then, several models are selected, shown in Figure 1(a), and described in detail. To determine proper models, the author conducted two experiments (Figure 1(b)), one for pure clothing images and one for dressed clothing images. A different dataset was used for each experiment, as shown in Figure 1(c). In the first experiment, a typical traditional machine learning algorithm, HOG+SVM, was compared with two deep learning models. For the HOG+SVM algorithm, the author tested four cases. The rbf in parentheses means that the SVM uses rbf (Gaussian kernel) as the kernel function, and Linear in parentheses means that a linear SVM is used. 4x4 and 8x8 in parentheses indicate the size of each cell for HOG features. In the second experiment, the HOG+SVM algorithm was compared with the small VGG network, the large VGG network, and the GhostNet network. With retail end-users in mind, the author also ran the experiments on datasets of different sizes to test the effect of sample size. Furthermore, the author summarises the factors that affect the accuracy of the models.

Fig. 1

Research object and methodology

The remainder of this paper is organised as follows. Section 2 provides an overview of related work, and Section 3 describes the theoretical model used in the experiments. The experimental design is given in Section 4. The results and discussion are provided in Section 5. Finally, Section 6 summarises the conclusions of this paper and outlines potential future research directions.

Literature Review

Due to its increasing significance in the e-commerce industry, considerable research has been carried out on clothing image classification. The existing methods for clothing image classification can be divided into two categories: a) traditional machine learning algorithms and b) deep learning models. Nevertheless, existing clothing classification methods still face three fundamental challenges in their practical application. First, the increased variation of clothes in style, texture, and details leads to a complex image information base. Second, clothes are often distorted and creased when worn on the human body. Third, clothing images appear quite different in different scenes. This section discusses studies to overcome these challenges from the two aspects of pure clothing image classification and dressed clothing image classification.

Pure clothing image classification

At the early stage of e-commerce development, pure clothing images occupied a relatively large proportion. Traditional machine learning algorithms were commonly used in the early studies for clothing image classification, which manually extracted image features for classification [11]. However, the traditional machine learning algorithms showed their shortcomings in handling a large volume of image data. To overcome the limitation, deep learning models have been actively studied to classify pure clothing images [12,13].

The traditional machine learning algorithms extract artificially predefined features (edges, corner points, colours, etc.). Typical feature extractors include the Scale-Invariant Feature Transform (SIFT) [14], Speeded-Up Robust Features (SURF) [15], and the Histogram of Oriented Gradient (HOG) [16]. Thewsuwan et al. [17] established a preprocessing technique based on the Local Binary Pattern (LBP) and Gabor filter for a clothing classification system. Sha et al [18] adopted Uniform Local Binary Pattern (ULBP) features for pattern attributes and the Pyramid Histogram of Oriented Gradients (PHOG), Fourier, and GIST features for the collar and sleeve attributes. Surakarin et al [19] proposed the Bag of Features (BoF) model based on SURF and the Local Directional Pattern (LDP) to identify clothing types. Li et al. [20] proposed a Dragonfly Algorithm (DA)-optimised Online Sequential Extreme Learning Machine (OSELM)-based clothing image classification technique for pure clothing image classification.

The neural network (NN) model mimics the characteristics of biological neural networks. Modelled on the architecture and function of the human nervous system, a complex NN consists of a large number of neurons that stimulate and inhibit one another to complete complex operations. Recently, deep neural networks have advanced and surpassed traditional multilayer perceptron neural networks.

Deep neural networks automatically learn discriminative features from data without requiring a hand-crafted feature construction process. In particular, the Convolutional Neural Network (CNN) has been widely used in clothing image classification owing to its powerful feature extraction capabilities. Typical CNNs include LeNet [21], AlexNet [22], GoogLeNet [23], VGGNet [24], and GhostNet [25]. Lao et al. [26] proposed a clothing classification model based on the AlexNet model and a clothing target detection model based on R-CNN. Dong et al. [27] adopted VGG-Net and Spatial Pyramid Pooling (SPP) for multi-scale clothing image classification. Xiang et al. [28] constructed a dataset containing approximately 100,000 shirt images and proposed an R-CNN-based classification framework. Di et al. [29] used four neural network (NN) models, namely a fully connected neural network, a CNN, MobileNetV1, and MobileNetV2, to classify clothing images based on the Fashion-MNIST dataset.

According to the literature, a traditional machine learning algorithm, such as HOG, requires only a small number of samples to obtain high accuracy and a certain generalisation ability. However, it remains strongly dependent on the dataset or scene, and it can only classify pure clothing images with solid colour backgrounds. Although the NN model, with its automatic feature learning, achieves better performance in classifying pure clothing images, it requires a large amount of training data, which makes it unsuitable for small-scale pure clothing image datasets.

Dressed clothing image classification

Dressed clothing images have come to account for a larger share of clothing image data as the e-commerce industry develops and flat models emerge. Accordingly, many studies have been conducted on dressed clothing image classification.

Due to the complexity of dressed clothing images, only a few studies have been based on traditional machine learning algorithms to achieve human silhouette and clothing recognition in clothing images. In the study by Bossard et al. [30] , HOG and LBP features were fused and fed into the SVM classifier to distinguish the background and human body of the dressed clothing images in natural scenes. Huo et al. [31] proposed a method based on the Deformable Part based Model (DPM) and key point regression method for locating the head and shoulders and the human torso in dressed clothing images. Liu et al. [32] proposed Colour-Fashion, which combines the human pose estimation module and colour information based on SIFT and HOG.

Most existing methods are based on deep learning models in order to handle effectively the complex background, noise, and relatively small clothing area of dressed clothing images. Liu et al. [33] constructed a large-scale, semantically labeled, comprehensive clothing dataset, DeepFashion. They also proposed FashionNet, which is based on global convolutional features and local keypoint features and incorporates four types of supervised information for clothing feature learning: broad categories, attributes, clothing IDs, and keypoints. Wang et al. [34] combined CNN and Recurrent Neural Networks (RNN) to build a CNN-RNN framework for multi-label image classification. Nawaz et al. [35] used an Inception module-based deep network to automatically classify traditional Bangladeshi ethnic costume images. Cychnerski et al. [36] proposed a joint system for the accurate detection and classification of clothing images, providing results for five fine-grained attributes that outperformed SqueezeNet [37] and ResNet on the DeepFashion dataset. Zhang et al. [38] proposed an effective deep learning network to classify a dataset containing 9,339 dressed clothing images. Rohrmanstorfer et al. [39] constructed a dressed clothing image dataset containing 2,567 images and employed CNN as a classification model. Hodecker et al. [40] evaluated the accuracy of CNN for clothing classification as compared to nonconvolutional models.

As discussed, most existing dressed clothing image classification methods are based on deep learning models. The performance of traditional machine learning algorithms is limited: they can only distinguish the background from the body in dressed clothing images according to the body outline. CNNs can automatically extract image features and recognise highly distorted dressed clothing images with translation, scaling, and distortion invariance. However, a deep learning model still requires a large amount of data for training.

In summary, for pure clothing image classification, more studies use traditional machine learning algorithms than deep learning models. For dressed clothing image classification, only a few studies use traditional machine learning algorithms to segment human silhouettes and clothing in dressed clothing images, and most of the studies are based on deep learning models. Both traditional machine learning and deep learning methods have achieved good performance for clothing image classification. However, the general trend is toward classifying more complex dressed clothing images in greater quantities, and many difficulties and challenges remain to be solved. For example, despite good performance in pure clothing image classification, the accuracy rates differ for datasets of different sizes. In dressed clothing image classification, deep learning models provide better performance than traditional machine learning algorithms. However, the performance of different deep learning models varies depending on the dataset and target task. Therefore, in this paper, both types of classification method were explored in the subsequent experiments to test the classification of pure clothing images and dressed clothing images, respectively.

Theoretical model

The essence of clothing image classification is to use image features to determine the category of clothing through a classification model. The flow chart of clothing image classification is shown in Figure 2. First, the clothing image is preprocessed, and then the image features are extracted using traditional machine learning algorithms or deep learning models. The extracted features are fed into the classifier, obtaining the output category. This paper analyses the representative traditional machine learning algorithm, HOG+SVM, and the following effective deep learning models: NN, CNN, Small VGG, VGG-16, and GhostNet. In the subsequent sections, they are described in detail.

Fig. 2

Flow chart of clothing image classification

Clothing image classification based on traditional machine learning

Since traditional machine learning algorithms are based on artificially predefined features (edges, corner points, colours, etc.), appropriate features should be selected for different target tasks. SIFT, SURF, and HOG are commonly used as feature descriptors, and the Support Vector Machine (SVM), the Extreme Learning Machine (ELM), and Random Forest (RF) are commonly used as classifiers. In particular, HOG is one of the most successful feature descriptors in object detection; it constructs features based on the histogram of gradient directions in local regions of an image. SVM is a generalised linear classifier for supervised data classification. The combination of HOG and SVM has been widely used in many types of classification tasks, mainly due to its robustness and accuracy. In this framework, the input image is first divided into many small connected areas called cell units. Then, the gradient directions of the pixels in each cell unit are accumulated to construct a histogram of directional gradients. The histograms are then normalised over larger blocks of neighbouring cells for robustness to illumination changes. The normalised block descriptor is defined as the HOG descriptor. The HOG descriptors of all blocks in the detection window are combined to form the final feature vector, and then the SVM classifier is used for image detection and classification. A schematic diagram of this process is depicted in Figure 3 [16].

Fig. 3

Schematic diagram of HOG+SVM algorithm
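
The HOG stage described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the 9 orientation bins, the 2x2-cell blocks, the hard (non-interpolated) binning, and the function name hog_features are all assumptions made for illustration, and the SVM stage is omitted.

```python
import math

def hog_features(image, cell=4, bins=9, block=2):
    """Simplified HOG descriptor for a 2-D grayscale image given as a list of rows."""
    h, w = len(image), len(image[0])
    n_cy, n_cx = h // cell, w // cell
    # One orientation histogram per cell unit.
    hist = [[[0.0] * bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(n_cy * cell):
        for x in range(n_cx * cell):
            # Centred differences, clamped at the image border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned gradient direction in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle / (180.0 / bins)), bins - 1)
            hist[y // cell][x // cell][b] += magnitude
    # Group cells into overlapping blocks and L2-normalise each block.
    features = []
    for by in range(n_cy - block + 1):
        for bx in range(n_cx - block + 1):
            vec = []
            for dy in range(block):
                for dx in range(block):
                    vec.extend(hist[by + dy][bx + dx])
            norm = math.sqrt(sum(v * v for v in vec) + 1e-12)
            features.extend(v / norm for v in vec)
    return features

# Synthetic 28x28 image, used only to exercise the function.
image = [[float((x * y) % 17) for x in range(28)] for y in range(28)]
features_4x4 = hog_features(image, cell=4)
features_8x8 = hog_features(image, cell=8)
print(len(features_4x4), len(features_8x8))
```

Under these assumed settings, a 28x28 image yields 1,296 feature values with 4x4 cells and 144 with 8x8 cells, so the smaller cell size gives the classifier a much longer feature vector to process.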

Clothing image classification based on deep learning models

A basic NN model consists of neurons with weights and biases, simulating biological neurons for information processing. During the training process, the weights and biases of neurons of the model are adjusted to process the input information into output in accordance with training data. With the rapid development of deep learning in recent years, CNN has achieved great success in image processing and computer vision tasks, with a deeper and feed-forward architecture suitable for image data [41].

Model 1 — simple NN

The architecture of the simple NN is shown in Figure 4. The model consists of a flatten layer, a fully connected (FC) layer with 128 neurons followed by ReLU activation, and an FC layer with 10 neurons followed by softmax activation. The flatten layer converts the multidimensional input into a one-dimensional input.

Fig. 4

Architecture of a simple NN
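
To make the data flow of Model 1 concrete, the following Python sketch runs one forward pass through a flatten layer, a 128-neuron FC layer with ReLU, and a 10-neuron FC layer with softmax. The 28x28 input size follows the Fashion-MNIST images; the random initialisation, the helper names, and the absence of training are assumptions for illustration only.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an illustrative initialisation)."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(image, layer1, layer2):
    flat = [pixel for row in image for pixel in row]  # flatten 28x28 -> 784
    hidden = relu(dense(flat, *layer1))               # FC with 128 neurons + ReLU
    return softmax(dense(hidden, *layer2))            # FC with 10 neurons + softmax

rng = random.Random(0)
layer1 = init_layer(28 * 28, 128, rng)
layer2 = init_layer(128, 10, rng)
image = [[rng.random() for _ in range(28)] for _ in range(28)]
probabilities = forward(image, layer1, layer2)

# Weights plus biases of the two FC layers.
n_params = (28 * 28 * 128 + 128) + (128 * 10 + 10)
```

The output is a vector of ten class probabilities that sum to one, and for a 28x28 input the two FC layers together hold 101,770 trainable parameters.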

Model 2 — CNN model

As shown in Figure 5, the CNN model consists of an input layer, a convolutional layer with 32 3x3 kernels, a flatten layer, and an FC layer. The convolutional layer is followed by ReLU activation, a BatchNormalisation layer, a max-pooling layer, and a dropout layer. The FC layer is followed by the softmax activation function. BatchNormalisation normalises the data, which can speed up convergence and help prevent overfitting. The dropout layer mitigates overfitting by randomly deactivating a certain number of neurons per iteration.

Fig. 5

Architecture of the CNN
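
As a rough check on the size of Model 2, the sketch below traces tensor shapes and parameter counts through the convolutional, max-pooling, and FC layers. The text does not state the padding, stride, or pooling size, so the sketch assumes "valid" padding, stride 1, 2x2 pooling, a 28x28 grayscale input, and 10 output classes; BatchNormalisation parameters are left out.

```python
def conv2d_shape(h, w, c_in, filters, kernel=3):
    """Output shape and parameter count of a stride-1 convolution without padding."""
    out_h, out_w = h - kernel + 1, w - kernel + 1
    params = kernel * kernel * c_in * filters + filters  # weights + biases
    return (out_h, out_w, filters), params

def max_pool_shape(h, w, c, pool=2):
    """Output shape of non-overlapping max pooling."""
    return (h // pool, w // pool, c)

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

conv_shape, conv_params = conv2d_shape(28, 28, 1, filters=32)
pooled_shape = max_pool_shape(*conv_shape)
flat = pooled_shape[0] * pooled_shape[1] * pooled_shape[2]
fc_params = dense_params(flat, 10)
print(conv_shape, pooled_shape, flat, conv_params, fc_params)
```

Under these assumptions, almost all of the parameters sit in the final FC layer rather than in the single convolutional layer.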

Model 3 — Small VGG network model

The small VGG network model was designed by Adrian Rosebrock [42]. As shown in Figure 6, the small VGG network is a lightweight version of the full VGGNet, which is faster at the cost of some accuracy. The small VGG network consists of repeated ConvBlocks and MiddleBlocks, where the ConvBlock consists of a convolutional layer followed by the ReLU activation function and a BatchNormalization layer. The MiddleBlock consists of two ConvBlocks, a MaxPooling layer, and a dropout layer. The class_len in the penultimate layer indicates the number of labels in the dataset, which is 3 in this case.

Fig. 6

Architecture of small VGG

Model 4 — VGG-16 network model

The VGG network model was proposed by the Visual Geometry Group at Oxford [43]. The VGG network has two structure variants: VGG-16 and VGG-19, which have a similar structure but different network depths. In this paper, the VGG-16 network is used. The VGG-16 network consists of 13 convolutional layers and 3 FC layers.
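
To give a sense of why VGG-16 counts as a large network, the following sketch tallies the weights and biases of the standard VGG-16 layout (3x3 convolutions with "same" padding, 2x2 max pooling, and two 4096-unit FC layers before the output layer). The 224x224 RGB input and the 1,000 output classes are the usual ImageNet defaults and are assumptions here; the input size and classifier head used in this paper's experiments may differ.

```python
# Standard VGG-16 layout: numbers are output channels, "M" is 2x2 max pooling.
VGG16_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]

def vgg16_parameter_count(input_size=224, in_channels=3, fc_units=4096, num_classes=1000):
    """Return (number of conv layers, total weights and biases)."""
    params, channels, size, n_conv = 0, in_channels, input_size, 0
    for item in VGG16_LAYOUT:
        if item == "M":
            size //= 2                      # pooling halves height and width
        else:
            params += 3 * 3 * channels * item + item
            channels = item
            n_conv += 1
    n_in = size * size * channels           # flattened feature map
    for n_out in (fc_units, fc_units, num_classes):
        params += n_in * n_out + n_out
        n_in = n_out
    return n_conv, params

n_conv_layers, imagenet_params = vgg16_parameter_count()
_, three_class_params = vgg16_parameter_count(num_classes=3)
print(n_conv_layers, imagenet_params, three_class_params)
```

With the default settings the count is about 138 million parameters across 13 convolutional and 3 FC layers; replacing the output layer with a 3-class layer changes the total only slightly, because most parameters sit in the first FC layer.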

Model 5 — GhostNet model

GhostNet is a lightweight but effective model built by Huawei Noah's Ark Laboratory. It has the advantages of fast training and lower hardware requirements, at some cost in accuracy. The author chose GhostNet to represent lightweight networks in this paper.

In summary, the author selected a representative traditional machine learning algorithm: HOG+SVM, and representative deep learning models: a simple NN, a simple CNN, a small VGG, a large VGG-16, and a light GhostNet. The above algorithms and models perform differently in different image domains and sample sizes. The purpose of our experiments is to find suitable models for clothing image classification tasks.

Experimental Design

Following the literature review and theoretical model above, the author designed a two-part experiment covering pure clothing images and dressed clothing images. For the classification of pure clothing images, experiments were mainly conducted using traditional machine learning algorithms with different parameters, which were compared with simple deep learning models. For the classification of dressed clothing images, different kinds of deep learning models were mainly used and compared with traditional machine learning algorithms. The second part was carried out in two steps. First, the traditional HOG+SVM algorithm was compared with the Small VGG network. Then, the Small VGG network was compared with the large VGG-16 network and the light GhostNet. Figure 7 depicts a flow chart of the experiments.

Fig. 7

Experimental flow chart

Pure clothing image classification
Dataset

The Fashion-MNIST dataset is used to evaluate the models for pure clothing images. The dataset was constructed by the research division of Zalando [44] and consists of 70,000 grayscale images of 28x28 pixels. The categories and some example images are shown in Figure 8. The 70,000 images are labeled in ten categories: t-shirt/top, trousers, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot, which are numbered 0 to 9 (Table 1). In the experiments, the author used 60,000 images as the training set and 10,000 images as the validation set.

Fig. 8

Examples and categories of the Fashion-MNIST dataset

Table 1

Fashion-MNIST dataset label description

Label    Description
0        T-shirt/top
1        Trouser
2        Pullover
3        Dress
4        Coat
5        Sandal
6        Shirt
7        Sneaker
8        Bag
9        Ankle boot

Experimental result

The traditional machine learning algorithm HOG+SVM was compared with the simple NN and the simple CNN. To analyse the accuracy of the three methods for different numbers of training samples, the data size was gradually increased from 500 to 30,100. The training and testing accuracies are depicted in Figure 9, where ML denotes the HOG+SVM algorithm, NN denotes the simple NN model, and CNN denotes the CNN model. For the HOG+SVM algorithm, the author tested four cases, identified by the settings in parentheses: rbf means that the SVM uses the rbf (Gaussian) kernel function, Linear means that a linear SVM is used, and 4x4 and 8x8 indicate the cell size for the HOG features. Because the accuracy of the HOG+SVM algorithm with the Gaussian kernel converges, the author widened the interval between sample sizes once ML (rbf_4x4) reached saturation (at 7,700 samples), testing 13,300, 18,900, 24,500, and 30,100 samples and continuing to plot the line segments.

Fig. 9

Accuracy comparison of HOG+SVM, NN, and CNN.

The results of the comparison show that ML (rbf_4x4) provides the best accuracy and ML (Linear_8x8) the worst, while the NN and CNN are moderately accurate but fluctuate somewhat due to dataset effects. The rbf kernel outperformed the linear kernel, indicating that the Gaussian kernel brings higher accuracy than a linear SVM. Also, the small cell size (4x4) provided higher accuracy than the large cell size (8x8), indicating that 4x4 is suitable for this dataset.
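
For reference, the two kernel functions compared above can be written as the short Python sketch below. The value of gamma is a placeholder assumption, since the kernel parameters used in the experiments are not reported here.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the plain inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (rbf) kernel: similarity decays with squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical vectors
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])    # distant vectors
dot = linear_kernel([1.0, 2.0], [3.0, 4.0])
```

The rbf kernel equals 1 for identical vectors and approaches 0 as they move apart, which lets the SVM draw non-linear decision boundaries in the HOG feature space, whereas the linear kernel restricts it to a hyperplane.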

Table 2 compares the highest accuracy of the different models, showing that ML (rbf_4x4) outperforms the other methods, including the NN and CNN, even when the number of samples is very small. This suggests that better results can be obtained by adjusting the parameters of traditional machine learning algorithms. It should also be noted, however, that using the rbf kernel and reducing the cell size of the HOG features increase the computational effort.

Table 2

Highest accuracy comparison of different models

Name               Highest accuracy
ML (rbf_4x4)       91.3%
CNN                89.7%
ML (Linear_4x4)    88.9%
NN                 87.7%
ML (rbf_8x8)       86.7%
ML (Linear_8x8)    83.1%

Dressed clothing image classification
Dataset

In this experiment, the Fashion144k (stylenet_v1) dataset [45] was used, which is a sub-dataset of Fashion144k. It contains more than 89,000 colour clothing images, 123 clothing feature labels, and 3,179 colour feature labels. The images are 256x384 pixels in size and contain complex backgrounds. Unlike the Fashion-MNIST images, the images in this dataset show the full-body clothing of a person and carry more tags. Figure 10 shows example images of the dataset. By observing the images, the author found that most of them contain more than one top, which would greatly interfere with recognition; therefore, only bottom images were selected for the experiment. To train the models step by step, the selected images were preprocessed and then divided into several sub-datasets. Each sub-dataset was split into a training set and a validation set for the subsequent experiments.

Fig. 10

Example images of the Fashion144k dataset

Dataset#1 — SmallV1. Redundant labels were eliminated from the Fashion144k (stylenet_v1) dataset, retaining only jeans, pants, and leggings. MRCNN [46] was then used to extract the clothing locations, and 400 images were collected for each category. Mislocated and misclassified images were manually eliminated, and the remaining 1,065 images were used as the dataset. This dataset, referred to as SmallV1, was split into a training set and a validation set at a ratio of 8:2. Example images of the SmallV1 dataset are presented in Figure 11.

Fig. 11

Example images of the SmallV1 dataset

Dataset#2 — SmallV2. The SmallV2 sub-dataset was further extended using the MRCNN, increasing each category of jeans, pants, and leggings to approximately 1,000 images. The 3,000 images of the SmallV2 dataset were also split into training and validation sets at a ratio of 8:2.

Dataset#3 — SmallV3. The lower two-thirds of each image in the Fashion144k (stylenet_v1) dataset was extracted to build this dataset. In this way, approximately 2,000 images per category were obtained (2,000, 2,000, and 1,932 images, respectively). The roughly 6,000 images of the SmallV3 dataset were also split into training and validation sets at a ratio of 8:2.
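
The two preprocessing steps shared by these sub-datasets, taking the lower two-thirds of an image and splitting at a ratio of 8:2, can be sketched as follows. The function names, the fixed random seed, the use of shuffling before the split, and the reading of 256x384 as width x height are assumptions for illustration, not details taken from the experiments.

```python
import random

def lower_two_thirds_box(width, height):
    """Crop box (left, top, right, bottom) covering the lower 2/3 of an image."""
    top = height // 3
    return (0, top, width, height)

def split_train_val(items, train_parts=8, val_parts=2, seed=0):
    """Shuffle a copy of the items and split it at the given ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + val_parts)
    return shuffled[:cut], shuffled[cut:]

box = lower_two_thirds_box(256, 384)       # image size reported for Fashion144k
train, val = split_train_val(range(1065))  # 1,065 is the SmallV1 size
print(box, len(train), len(val))
```

For SmallV1, an 8:2 split of 1,065 images leaves 852 images for training and 213 for validation, which helps explain why the validation accuracies reported for this sub-dataset vary noticeably between runs.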

Experimental result

In the first experiment, the HOG+SVM algorithm was compared with the Small VGG network model on the SmallV1 dataset. Figure 12 compares the recognition accuracy of the two methods, revealing that neither achieves high recognition accuracy on the validation set. The HOG+SVM algorithm performs poorly on the validation set despite good accuracy on the training set. In contrast, the recognition accuracy of the Small VGG network model is low for both the training set and the validation set. Even though the Small VGG network obtains slightly higher accuracy than the HOG+SVM algorithm on the validation set, its recognition accuracy is very limited, at only 41.67%. One possible reason is the small size of the SmallV1 dataset, which has only 1,065 images. The results of the first experiment suggest that the advantages of the CNN only appear when the amount of data is larger. Therefore, the SmallV2 dataset, with a medium number of images, was skipped in the next experiment, and the SmallV3 dataset, which has the largest number of images, was used directly.

Fig. 12

Recognition accuracy of HOG+SVM and Small VGG for the SmallV1 dataset

As shown in Figure 13, the recognition accuracy of the Small VGG network model was improved to 69.78% for the SmallV3 dataset. The result indicates that the recognition accuracy of the CNN can be improved when the number of data increases.

Fig. 13

Recognition accuracy of Small VGG network models in different datasets

Further experimental result

Since the Small VGG network did not provide high enough accuracy in the previous section, further experiments were conducted to find more suitable models. Two additional deep learning models were adopted: VGG-16 and GhostNet, which are a large network and light network, respectively.

VGG-16 (large network). Since VGG-16 is a large network model, it was trained and tested on the SmallV2 dataset, which has a medium sample size. First, the author trained the VGG-16 model on the SmallV2 dataset from scratch using Adam as the optimiser, but the model did not appear to learn properly. The author then retrained the VGG-16 model using the SGD optimiser. The model achieved an accuracy of 33.12% after 10 epochs and 45.74% after 40 epochs. Despite its large size, VGG-16 did not provide high accuracy in this experiment, which suggests that it is not well suited to clothing classification in e-commerce at this data scale.

GhostNet (lightweight network). Since the GhostNet model is a lightweight network, it was trained and tested on the SmallV1 dataset. GhostNet obtained a recognition accuracy of 92.90% for the training set but only 38.86% for the validation set. A further experiment was conducted with the network adjusted as follows: the dropout rate was set to 0.35, a fully connected layer was added before the output layer, and the ReLU activation function and L2 regularisation were used. As a result, the model obtained a recognition accuracy of 58.46% for the training set and 45.02% for the validation set. In another experiment, the dropout rate was set to 0.45 and the number of epochs was increased, which gave a recognition accuracy of 64.02% for the training set and 37.44% for the validation set. These three experiments show that the classification accuracy of the GhostNet model on the SmallV1 dataset is not stable enough.

In contrast, the GhostNet trained for the SmallV2 dataset obtained a recognition accuracy of 86.49% for the training set and 52.19% for the validation set, as shown in Figure 14. The results show that such a lightweight network as GhostNet can also be used for dressed clothing image classification, but the stability is slightly less compared with the Small VGG network models.

Fig. 14

Recognition accuracy of the GhostNet model for different datasets
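
One simple way to read the GhostNet figures reported above is through the gap between training and validation accuracy, which is a common rough indicator of overfitting. The short sketch below computes this gap from the accuracies given in the text; the run labels and the use of the gap as a summary are choices made for illustration.

```python
# (training accuracy %, validation accuracy %) as reported in the text for GhostNet.
ghostnet_runs = {
    "SmallV1, initial settings": (92.90, 38.86),
    "SmallV1, dropout 0.35 + extra FC + L2": (58.46, 45.02),
    "SmallV1, dropout 0.45 + more epochs": (64.02, 37.44),
    "SmallV2": (86.49, 52.19),
}

def generalisation_gap(train_acc, val_acc):
    """Difference between training and validation accuracy, in percentage points."""
    return round(train_acc - val_acc, 2)

gaps = {name: generalisation_gap(*acc) for name, acc in ghostnet_runs.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points")
```

The gap is largest for the initial SmallV1 run (about 54 points) and smallest after dropout and L2 regularisation were added (about 13 points), while the best validation accuracy is obtained on the larger SmallV2 dataset.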

Discussion
Pure clothing image classification

The comparative analysis of the HOG+SVM algorithm, the simple NN, and the simple CNN on the Fashion-MNIST dataset shows that traditional machine learning algorithms can achieve better accuracy than deep learning models on small datasets. However, as the number of samples increases, the accuracy of the CNN gradually improves and can reach 89.73%. Therefore, the author concludes that the traditional HOG+SVM algorithm, which captures local shape information well and has good invariance to geometric and optical changes, is more suitable for pure clothing image classification. However, in the era of big data, as large clothing datasets appear, the CNN model is also a good choice.

Dressed clothing image classification

The first experimental comparison shows that the Small VGG network model obtains higher recognition accuracy than the HOG+SVM algorithm. Also, the recognition accuracy increased for the SmallV3 dataset with the largest number of images. In further experiments, two other types of deep learning models were analysed: a large network - VGG-16 and a lightweight network - GhostNet. The results show that the VGG-16 network is not suitable for dressed clothing image classification, while the GhostNet obtains a comparable recognition accuracy but is unstable compared to the Small VGG network. Therefore, the Small VGG has the best results in dressed clothing image classification experiments and is expected to further improve its accuracy as the number of samples increases.

Further analysis of other factors affecting accuracy

Based on the experimental results, the author believes that the accuracy of the Small VGG network model can be further improved. Many factors affect the recognition accuracy of the model, including clothing style, image orientation, image colour, and dataset size. However, since the image orientation is consistent in this paper and the purpose of the experiment is to classify clothing styles, this section mainly summarises the influence of image colour and dataset size on the recognition accuracy of deep learning models.

First, the influence of colour on accuracy was analysed. The results obtained with colour and grayscale images of the SmallV2 dataset are compared. Note that the original colour images were converted into grayscale in the comparison, and the network was trained with 20 epochs. The results show no major difference, indicating no significant contribution of colour information to clothing image classification.
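
The colour-to-grayscale conversion used in such a comparison can be sketched as below. The text does not say which conversion was applied, so the luminance weights 0.299, 0.587, and 0.114 (the ITU-R BT.601 weights used by many image libraries) are an assumption.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) pixels to rows of luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

# A tiny 1x3 image: white, black, and pure red pixels.
gray = to_grayscale([[(255, 255, 255), (0, 0, 0), (255, 0, 0)]])
```

Each pixel is reduced from three channels to one, so the network sees the same shapes and textures but no hue information.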

Second, the influence of the sample size of the training dataset on accuracy was analysed. To this end, the Small VGG network and the GhostNet network were compared for the SmallV1 and SmallV3 datasets. As shown in Figure 15, the accuracy of clothing image classification increases as the number of training images increases. Accordingly, the quality and quantity of the training dataset significantly affect the accuracy of the network model.

Fig. 15

Accuracy of Small VGG and GhostNet for different numbers of datasets

Conclusion

In the age of data, accurate automated classification of images is considered crucial for improving the operational efficiency of the e-commerce apparel industry. This paper, targeting small and medium-sized clothing companies and merchants, compares traditional machine learning algorithms and deep learning models to determine suitable models for pure and dressed clothing images. The experimental results demonstrate that the HOG+SVM algorithm with the Gaussian kernel function obtains the highest accuracy, 91.32%, in the classification of pure clothing images. In contrast, the CNN model obtains higher accuracy than the HOG+SVM algorithm in dressed clothing image classification. Accordingly, for end-users with only ordinary computing processors, it is recommended to apply the traditional machine learning algorithm HOG+SVM to classify a limited number of pure clothing images. Dressed clothing images with complex image information can be classified using a lightweight model that is more efficient, trains faster, and is less computationally intensive, such as the Small VGG network. In addition, small and medium-sized clothing companies and merchants need to manage incoming and outgoing goods on a daily basis, add tags to clothing images, and upload them to the cloud to save data.

At the same time, they need a classification model that is low in cost and fast to compute. In this case, the models examined in this paper can be adapted to clothing retail on Amazon and in similar settings.

Future work will include constructing a large clothing image dataset with more style labels and more images to test different models, which will provide a reference for large mobile clients. The experiments will also include tests of the models' operational efficiency, stability, and other aspects of performance. In addition, the author will study clothing video classification, given the current prevalence of videos in the live streaming industry.