
Deep Learning-based 3D Reconstruction and Simulation Technology for Cheongsam and Hanfu


Introduction

As the most representative traditional dress of the Chinese nation, Hanfu has become an important symbol of China's cultural self-confidence and of the dissemination of its excellent traditional culture to the world [1-3]. Through Hanfu cultural exhibitions, film and television works related to Hanfu culture, and online and offline activities such as "Hanfu shows", Hanfu culture has spread rapidly both in online virtual spaces and in real life, reaching beyond Hanfu enthusiasts to the general public around the world [4-6]. If Hanfu can arouse aesthetic pleasure, its value attributes establish a connection with the aesthetic subject, which can trigger emotional resonance and, in turn, identification with traditional Chinese culture [7-8].

The current rise of the Hanfu fever is a materialized expression of the sublimation of contemporary Chinese culture from self-awareness to self-confidence. Hanfu is the collective name for the costumes of the Han nationality and one of the expressions of the 5,000-year culture of the Chinese nation [9-10]. In the past decade, the Hanfu movement has evolved from a niche group activity into a popular cultural movement and has gradually penetrated from the aesthetics of clothing into the traditional cultural kernel, receiving more and more attention [11-12]. At the same time, the application of 3D virtual fitting technology, as an emerging technology, in the design and display of traditional clothing has boosted the dissemination of Hanfu culture and its industrialization. For example, Chen Xi built a VR museum of Hanfu through three-dimensional modeling, exploring a new direction for the display of traditional costumes [13-15].

A three-dimensional virtual fitting system relies on computer image processing and graphics theory and, through three-dimensional human body modeling, virtual sewing, and fabric simulation technologies, constructs a virtual human body model of the user and performs virtual clothing sewing and display [16-18].

On the basis of SMPL, an adaptive template is proposed to train the cheongsam and Hanfu categories. To address folding or flipping errors, a multi-stage stepwise deformation algorithm is proposed, and a GCN is applied to learn the deformation of the cheongsam and Hanfu and reconstruct them in 3D through three stages: pose estimation, feature line fitting and surface refinement. The reconstruction results are evaluated using the intersection over union (IoU) of the mesh volume, chamfer distance, normal consistency and F-score. To make the 3D reconstruction of the cheongsam and Hanfu more realistic, a high-precision fabric animation simulation method based on geometric images is proposed. A super-resolution network model is designed that includes a pyramid-structured progressive fusion module for spatio-temporal features, which gradually enriches the information captured by large receptive fields with features from small receptive fields, addressing the problems of computational cost and global information acquisition.

Three-dimensional reconstruction model of a single garment image

With the rapid development of Internet fashion consumption, people's demands on the consumption experience are getting higher and higher, and interactive, personalized and realistic virtual try-on technology has received widespread attention. At the same time, the cheongsam and Hanfu, as traditional Chinese costumes, are a typical type of high fashion with national cultural characteristics, diverse styles and rich personalization. To enhance the interactivity, personalization and realism of the cheongsam and Hanfu virtual try-on experience, their 3D reconstruction and simulation play an important role. For this reason, this paper studies the reconstruction and simulation of the cheongsam and Hanfu based on deep learning.

Generation of template meshes

Although the styles of the cheongsam and Hanfu vary greatly, there are usually only a few underlying topologies. This makes a template-based approach possible. A straightforward approach is to train different models for different cheongsam categories using different predefined templates. However, this can lead to overfitting, as less data is used to train each model. To address this problem, this paper proposes an adaptive template, a new representation that scales to different garment topologies and uses one network to generate all types of garments in the dataset. The template is built on the SMPL model, with the head, hand and foot parts removed [19].

During the training phase, the entire adaptive mesh is input to the network, but different semantic regions are activated based on the estimated garment topology. In particular, the template mesh is denoted as Mt = (V, E, B), where V = {vi} and E = {ei} are the sets of vertices and edges, respectively, and B = {bi} denotes a binary activation mask over the vertices. Only when bi = 1 is vi activated and supervised during training; otherwise (bi = 0) vi is excluded from network optimization and removed at the end of training. The activation mask is determined by the pre-predicted clothing category, where vertex regions are labeled as a whole. A classification network based on a pre-trained VGGNet is constructed: given an input image I, it outputs the estimated clothing category. The classification network is trained using both real and synthetic images; the synthetic images provide augmented lighting conditions for training. In particular, each garment model is rendered from a randomized view with different global illumination. Approximately 12,000 synthetic images were generated, 70% of which were used for training and the rest for testing. Based on the classification result, the corresponding activation mask B is obtained.
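
As a minimal illustration of how a binary activation mask can gate which template vertices participate in training, the following sketch selects the active sub-mesh for a predicted garment category. The names (template_vertices, category_masks, active_submesh) and the toy data are hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical adaptive template: vertices shared by all garment categories.
template_vertices = np.random.rand(7000, 3).astype(np.float32)   # stand-in for the SMPL-derived template
template_edges = np.array([[0, 1], [1, 2], [2, 0]])               # toy edge list

# One binary mask b_i per garment category, marking which vertices are active.
category_masks = {
    "long_sleeve_shirt": np.ones(len(template_vertices), dtype=bool),
    "sleeveless_dress":  np.random.rand(len(template_vertices)) > 0.3,  # toy mask
}

def active_submesh(category: str):
    """Return the activated vertices and the edges whose endpoints are both active.

    Vertices with b_i = 0 are excluded from supervision and would be removed
    from the final mesh after training, as described in the text."""
    mask = category_masks[category]
    active_idx = np.flatnonzero(mask)
    # Remap old vertex indices to positions in the compacted vertex array.
    remap = -np.ones(len(template_vertices), dtype=int)
    remap[active_idx] = np.arange(len(active_idx))
    keep_edge = mask[template_edges].all(axis=1)
    return template_vertices[active_idx], remap[template_edges[keep_edge]]

verts, edges = active_submesh("sleeveless_dress")
print(verts.shape, edges.shape)
```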

Deep learning based surface reconstruction

A straightforward approach to shape estimation from the adaptive template Mt is to project the features of I onto the template mesh and apply a GCN to learn the deformations [20]. However, this method is prone to producing folded or flipped results, because it is difficult to strike a balance between the smoothness of the mesh and the accuracy of the reconstruction when the ground truth deviates greatly from Mt. To solve this problem, this paper proposes a multi-stage stepwise deformation process that adapts to the target shape. The process consists of three steps: pose estimation, feature line fitting and surface refinement. The reconstruction algorithm is shown in Fig. 1.

Figure 1. Flowchart of the garment reconstruction algorithm

Network training

The three stages of the network are trained separately. All cascade networks used in the framework share the first part of VGGNet and differ only in the last layer.

Training data generation

The 3D pose of the garment model is obtained by fitting the SMPL model to the reconstructed dense point cloud. The data processing procedure is as follows: 1) for each labeled feature line, compute its center point as the corresponding skeleton node (see the sketch below); 2) align all point clouds using the joints in the torso region to ensure consistent orientation and scale; 3) fit the SMPL model to the skeleton nodes and point cloud to obtain the pose parameters of each model.
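
A rough sketch of step 1, computing each skeleton node as the centroid of its labeled feature line. The data layout (a dictionary of labeled point arrays) is an assumption for illustration, not the paper's actual format.

```python
import numpy as np

def skeleton_nodes(feature_lines: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """For each labeled feature line (an (N, 3) array of points sampled along it),
    return its center point, which serves as the corresponding skeleton node."""
    return {label: pts.mean(axis=0) for label, pts in feature_lines.items()}

# Toy example: two hypothetical feature lines of a garment point cloud.
lines = {
    "neckline": np.array([[0.0, 1.5, 0.0], [0.1, 1.5, 0.05], [-0.1, 1.5, 0.05]]),
    "left_cuff": np.array([[0.6, 1.0, 0.0], [0.62, 0.98, 0.01]]),
}
print(skeleton_nodes(lines))
```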

Although there is already a corresponding multi-view real image for each garment, the variation in viewing angle and illumination is still limited. To ensure that the models in this paper can be generalized to different lighting conditions and views, the input of synthetic images is added. In particular, for each model, rendered images are generated by randomly sampling 3 viewpoints and 3 different lighting environments, resulting in a total of 9 images.

Loss function

A new feature line loss function Lline and an edge length regularization loss Lfed are proposed, and together they form the feature line fitting loss, defined as
$$L_{fitting} = L_{line} + \lambda_{fed} L_{fed}$$
where λfed is set to 1 in the first cascade stage and 0.5 in the second. Lline minimizes the average distance between points on the deformed template feature lines and points on the ground-truth point cloud, with the Chamfer distance used as the distance metric. The edge length regularization loss is used to mitigate the jagged effect on the deformed feature lines.

In the surface refinement stage, in addition to the feature line loss Lline and the feature line edge length regularization loss Lfed, this paper also employs a Chamfer loss Lchm, a normal loss Lnor, a mesh edge loss Lmed, and a Laplacian loss Llap on the mesh vertices:
$$L_{refine} = L_{chm} + \lambda_{nor} L_{nor} + \lambda_{lap} L_{lap} + \lambda_{med} L_{med} + \lambda_{line} L_{line} + \lambda_{fed} L_{fed}$$
where λnor, λlap, λmed, λline and λfed are set to 1.6 × 10−4, 1, 0.5, 1, and 0.5, respectively.
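
To make the weighting concrete, the following PyTorch sketch combines the loss terms with the coefficients reported above. Only a brute-force Chamfer distance is implemented here; the remaining terms are passed in as precomputed values, since the paper does not spell out their implementations.

```python
import torch

def chamfer_l1(pred_pts: torch.Tensor, gt_pts: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between two point sets of shape (N, 3) and (M, 3)."""
    d = torch.cdist(pred_pts, gt_pts)            # pairwise distances, (N, M)
    return 0.5 * d.min(dim=1).values.mean() + 0.5 * d.min(dim=0).values.mean()

def refine_loss(l_chm, l_nor, l_lap, l_med, l_line, l_fed,
                w_nor=1.6e-4, w_lap=1.0, w_med=0.5, w_line=1.0, w_fed=0.5):
    """Weighted sum used in the surface refinement stage, with the coefficients
    reported in the text (lambda_nor, lambda_lap, lambda_med, lambda_line, lambda_fed)."""
    return l_chm + w_nor * l_nor + w_lap * l_lap + w_med * l_med + w_line * l_line + w_fed * l_fed

# Toy usage with random point clouds standing in for sampled mesh surfaces.
pred = torch.rand(512, 3)
gt = torch.rand(1024, 3)
l_chm = chamfer_l1(pred, gt)
loss = refine_loss(l_chm, l_nor=torch.tensor(0.1), l_lap=torch.tensor(0.05),
                   l_med=torch.tensor(0.02), l_line=torch.tensor(0.03), l_fed=torch.tensor(0.01))
print(float(loss))
```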

Analysis of experimental results
Evaluation indicators

The evaluation metrics for the 3D reconstruction task mainly concern geometric error, modeling accuracy and data completeness. Geometric error measures the error between the reconstruction result and the actual measurement data, modeling accuracy measures the difference between the reconstruction result and the real model, and data completeness measures whether a complete 3D model can be recovered from the given data. The evaluation metrics of this experiment are as follows.

Intersection over union of mesh volume

The intersection over union of mesh volume, IoU, is the quotient of the volume of the intersection of the predicted mesh and the real mesh and the volume of their union, and measures the degree of overlap between the two meshes. It is computed as
$$\mathrm{IoU}(M_{pred}, M_{GT}) = \frac{|M_{pred} \cap M_{GT}|}{|M_{pred} \cup M_{GT}|}$$
where Mpred and MGT denote the sets of all sampled points inside or on the surface of the predicted and real meshes, respectively, ∩ denotes the intersection operation, and ∪ denotes the union operation.
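
In practice, this volumetric IoU can be estimated by sampling points uniformly in a common bounding box and testing occupancy against each mesh. The sketch below assumes the occupancy tests have already been performed (here replaced by two analytic spheres for the toy example).

```python
import numpy as np

def volumetric_iou(inside_pred: np.ndarray, inside_gt: np.ndarray) -> float:
    """Estimate IoU(M_pred, M_GT) from occupancy labels of points sampled
    uniformly in a common bounding box.

    inside_pred[i] / inside_gt[i] are True if sample i lies inside (or on)
    the predicted / real mesh, respectively."""
    intersection = np.logical_and(inside_pred, inside_gt).sum()
    union = np.logical_or(inside_pred, inside_gt).sum()
    return float(intersection) / max(float(union), 1e-8)

# Toy usage: occupancy of 100k sampled points against two spheres of radius 1.0 and 0.9.
samples = np.random.uniform(-1.2, 1.2, size=(100_000, 3))
r = np.linalg.norm(samples, axis=1)
print(volumetric_iou(r <= 1.0, r <= 0.9))   # approx. 0.9^3 / 1.0^3 = 0.729
```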

Chamfer distance

The chamfer distance Chamfer-L1 measures the distance between the predicted mesh and the real mesh and evaluates the geometric error of the prediction. First, the accuracy score of the predicted mesh is defined as the average shortest distance between the sampled points on the surface of the predicted mesh and the surface of the real mesh:
$$\mathrm{Accuracy}(M_{pred} \mid M_{GT}) = \frac{1}{|\partial M_{pred}|} \int_{\partial M_{pred}} \min_{q \in \partial M_{GT}} \lVert p - q \rVert \, dp$$
where ∂Mpred and ∂MGT denote the surfaces of the predicted mesh and the real mesh, respectively. The completeness score of the predicted mesh is then defined as the average shortest distance between the sampled points on the surface of the real mesh and the surface of the predicted mesh:
$$\mathrm{Completeness}(M_{pred} \mid M_{GT}) = \frac{1}{|\partial M_{GT}|} \int_{\partial M_{GT}} \min_{p \in \partial M_{pred}} \lVert p - q \rVert \, dq$$

Finally, the chamfer distance between the predicted mesh and the real mesh is defined as the average of the accuracy and completeness scores:
$$\mathrm{Chamfer\text{-}L_1}(M_{pred}, M_{GT}) = \tfrac{1}{2}\,\mathrm{Accuracy}(M_{pred} \mid M_{GT}) + \tfrac{1}{2}\,\mathrm{Completeness}(M_{pred} \mid M_{GT})$$
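
A brute-force sketch of the accuracy, completeness and Chamfer-L1 terms over sampled surface points; a real evaluation pipeline would typically use a KD-tree for the nearest-neighbour queries rather than a dense distance matrix.

```python
import numpy as np

def chamfer_l1(pred_surface: np.ndarray, gt_surface: np.ndarray) -> float:
    """pred_surface, gt_surface: (N, 3) and (M, 3) points sampled on each mesh surface."""
    d = np.linalg.norm(pred_surface[:, None, :] - gt_surface[None, :, :], axis=-1)
    accuracy = d.min(axis=1).mean()       # predicted point -> nearest real-surface point
    completeness = d.min(axis=0).mean()   # real-surface point -> nearest predicted point
    return 0.5 * accuracy + 0.5 * completeness

# Toy usage with random surface samples.
pred = np.random.rand(500, 3)
gt = np.random.rand(800, 3)
print(chamfer_l1(pred, gt))
```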

Normal consistency

To measure the degree of alignment between the surface of the predicted mesh and the surface of the real mesh, the normal consistency adopts a definition similar to the chamfer distance. For each point p on the surface of the predicted mesh, its unit normal vector is computed, the point is projected onto the surface of the real mesh, and the absolute value of the cosine of the angle between the two normal vectors (i.e., their inner product) is taken; this value is averaged over all sampled points. The same procedure is applied to each point q on the surface of the real mesh:
$$\mathrm{NormalConsistency}(M_{pred}, M_{GT}) = \frac{1}{2|\partial M_{pred}|} \int_{\partial M_{pred}} \left|\langle n(p), n(\mathrm{proj}_2(p))\rangle\right| dp + \frac{1}{2|\partial M_{GT}|} \int_{\partial M_{GT}} \left|\langle n(\mathrm{proj}_1(q)), n(q)\rangle\right| dq$$
where 〈·,·〉 denotes the inner product of vectors, n(p) and n(q) denote the unit normal vectors at point p on the surface of the predicted mesh and at point q on the surface of the real mesh, respectively, proj2(p) denotes the projection of p onto the surface of the real mesh, and proj1(q) denotes the projection of q onto the surface of the predicted mesh. A larger normal consistency indicates a higher degree of alignment between the predicted and real mesh surfaces and thus a better 3D reconstruction.
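
Under the same sampling view, normal consistency can be approximated by using the nearest neighbour on the other surface as the projection point; the sketch below assumes unit normals are already available for the sampled points.

```python
import numpy as np

def normal_consistency(p_pts, p_nrm, g_pts, g_nrm) -> float:
    """p_pts/g_pts: (N, 3)/(M, 3) surface samples; p_nrm/g_nrm: their unit normals.
    The nearest neighbour stands in for the projection onto the other surface."""
    d = np.linalg.norm(p_pts[:, None, :] - g_pts[None, :, :], axis=-1)
    pred_to_gt = np.abs((p_nrm * g_nrm[d.argmin(axis=1)]).sum(axis=1)).mean()
    gt_to_pred = np.abs((g_nrm * p_nrm[d.argmin(axis=0)]).sum(axis=1)).mean()
    return 0.5 * pred_to_gt + 0.5 * gt_to_pred

# Toy usage with random points and random unit normals.
pts_p, pts_g = np.random.rand(300, 3), np.random.rand(400, 3)
nrm_p = np.random.randn(300, 3); nrm_p /= np.linalg.norm(nrm_p, axis=1, keepdims=True)
nrm_g = np.random.randn(400, 3); nrm_g /= np.linalg.norm(nrm_g, axis=1, keepdims=True)
print(normal_consistency(pts_p, nrm_p, pts_g, nrm_g))
```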

F-score

Recall is defined as the percentage of points on the real mesh that lie within a specified distance of the predicted mesh, and precision as the percentage of points on the predicted mesh that lie within a specified distance of the real mesh. The F-score is the harmonic mean of precision and recall:
$$F\text{-}score = \frac{2 \times precision \times recall}{precision + recall}$$
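
A matching sketch of the F-score at a distance threshold tau; the threshold value is not specified in the text, so it is left as a parameter here.

```python
import numpy as np

def f_score(pred_pts: np.ndarray, gt_pts: np.ndarray, tau: float) -> float:
    """F-score at threshold tau: harmonic mean of precision and recall."""
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()   # predicted points close to the real surface
    recall = (d.min(axis=0) < tau).mean()      # real-surface points close to the prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy usage with random surface samples and an illustrative threshold.
print(f_score(np.random.rand(500, 3), np.random.rand(500, 3), tau=0.05))
```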

Comparison with existing methods

To quantitatively compare the improved model with existing methods, this experiment evaluates the reconstruction results with the intersection over union, chamfer distance, normal consistency and F-score metrics on the DeepFashion3D dataset.

The metric values of this paper's model and the existing methods are shown in Fig. 2 to Fig. 5. To better analyze the reconstruction results for different types of garments in the DeepFashion3D dataset, the figures show the metrics computed for nine garment categories, numbered 1 to 9: long-sleeved shirts, short-sleeved shirts, sleeveless shirts, long pants, short pants, long-sleeved dresses, short-sleeved dresses, sleeveless dresses and skirts, together with their averages.

Figure 2. The results of the IoU calculation

Figure 3. The results of the Chamfer-L1 calculation

Figure 4. The results of the normal consistency calculation

Figure 5. The results of the F-score calculation

From the evaluation results in the figures, compared with OccNet, this paper's model improves the average IoU, chamfer distance, normal consistency and F-score by 55.56%, 62.55%, 5.38% and 91.12%, respectively. Compared with DyConvONet, it improves them by 5.54%, 35.13%, 2.65% and 19.55%, respectively. Compared with ConvONet, it improves the average IoU, chamfer distance and F-score by 17.19%, 15.16% and 3.44%, respectively, with only a slight decrease of 0.55% in normal consistency. Among the clothing categories, it is worth noting that the IoU of the reconstruction results for long pants and short pants is lower than that of the other clothing types. Because dresses have a more complex structure, their reconstruction is more difficult; as a result, the long-sleeved and short-sleeved dresses have larger chamfer distances and lower F-scores than the other garments. Among all garment categories, short-sleeved shirts and sleeveless shirts are reconstructed best, which is related to their simple structure as well as the large number of these two garment types in the DeepFashion3D dataset.

High-precision fabric animation simulation method based on geometric images
Network architecture

After realizing the 3D reconstruction of the cheongsam and Hanfu, this paper also investigates a fabric simulation method to animate the garments. The structure of the image super-resolution network model designed in this paper is shown in Fig. 6; it predicts the central high-resolution geometric image given 2N + 1 consecutive low-resolution geometric images. It is an end-to-end network consisting of a spatio-temporal feature progressive fusion (SFPF) module, multi-scale feature extraction (MSFE) modules and a reconstruction module. To prevent shallow features from disappearing during propagation, the outputs of all MSFE modules are sent to the end of the network for reconstruction.

Figure 6. Network structure

Progressive integration of spatio-temporal features

Unlike traditional image super-resolution datasets, in which there is little correlation between images, fabric motion is continuous, so consecutive geometric images are related to each other and can provide more information to the network. This paper therefore uses multiple consecutive geometric images as input. The spatio-temporal features of the multi-frame input could be extracted with 3D convolutions, but this would incur a huge computational cost. To extract spatio-temporal features at different levels while reducing the number of parameters, this paper takes inspiration from the feature pyramid and proposes a pyramid-structured progressive fusion module for spatio-temporal features [21]. The features at level L0 are first extracted with a 3 × 3 convolution and then downsampled twice with 3 × 3 convolutions of stride 2. For the three consecutive features Ft−1, Ft, Ft+1, progressive fusion starts from the high-level features and gradually moves to the lower levels. Specifically, the level L2 features are concatenated along the channel dimension, the number of channels is reduced with a 1 × 1 convolution to merge their temporal information, and the result is upsampled to the level L1 feature size and fused with the level L1 features; this is repeated until level L0. The feature maps of the first and last images are copied and appended to the head and tail of the feature sequence, respectively, so that after this module the size and number of channels of the feature maps are unchanged, but each feature map contains information from its neighboring frames.
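
The following PyTorch sketch is one possible reading of this pyramid-structured progressive fusion for three consecutive frames; the module name, channel counts and the bilinear upsampling are assumptions for illustration rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidTemporalFusion(nn.Module):
    """Sketch of pyramid-style progressive spatio-temporal fusion: per-frame feature
    pyramids are built with strided 3x3 convolutions, the three frames are merged at
    the coarsest level with a 1x1 convolution, and the merged features are upsampled
    and fused level by level back to full resolution."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.extract = nn.Conv2d(channels, channels, 3, padding=1)            # level L0
        self.down1 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)    # L0 -> L1
        self.down2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)    # L1 -> L2
        self.fuse2 = nn.Conv2d(3 * channels, channels, 1)                     # merge 3 frames at L2
        self.fuse1 = nn.Conv2d(4 * channels, channels, 1)                     # 3 frames at L1 + merged L2
        self.fuse0 = nn.Conv2d(4 * channels, channels, 1)                     # 3 frames at L0 + merged L1

    def forward(self, f_prev, f_cur, f_next):
        frames = [f_prev, f_cur, f_next]
        l0 = [self.extract(f) for f in frames]
        l1 = [self.down1(f) for f in l0]
        l2 = [self.down2(f) for f in l1]

        m2 = self.fuse2(torch.cat(l2, dim=1))                                 # coarsest-level temporal merge
        m2_up = F.interpolate(m2, size=l1[0].shape[-2:], mode="bilinear", align_corners=False)
        m1 = self.fuse1(torch.cat(l1 + [m2_up], dim=1))                       # fuse upsampled info with L1
        m1_up = F.interpolate(m1, size=l0[0].shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse0(torch.cat(l0 + [m1_up], dim=1))                     # back to the L0 resolution

# Toy usage: three consecutive 64-channel feature maps of a 64x64 geometric image.
x = [torch.rand(1, 64, 64, 64) for _ in range(3)]
print(PyramidTemporalFusion()(*x).shape)   # torch.Size([1, 64, 64, 64])
```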

Multi-scale feature extraction

In the field of image super-resolution, recent network structures use large receptive fields and multi-scale learning for feature extraction in order to obtain both global and local information, because feature representations with rich contextual information are very important for super-resolution reconstruction; the same applies to geometric images. For the geometric image constructed in this paper, the folds of the fabric mesh are well represented because vertex displacements are used for interpolation: the more intense the folds, the larger the pixel values in the corresponding region of the geometric image. Since different regions of the fabric have folds of different intensity and density, large receptive fields and multi-scale learning are also needed for feature extraction on the geometric image. ASPP extracts multi-scale features by cascading multiple dilated convolution layers with dilation rates of 1, 4 and 8 in a residual manner; however, an overly large dilation rate and the stacking of multiple dilated convolution kernels break the continuity of the information, which is detrimental to pixel-level prediction. This paper therefore improves on it by using features from small receptive fields to gradually enrich the information captured by large receptive fields. The module consists of dilated convolutions, local residual connections and channel attention, with dilation rates of 2, 3 and 4. In the first branch, the features produced by a 3 × 3 dilated convolution are divided into two groups y1 and y2: y1 is concatenated to the final output along the channel dimension, while y2 is passed to the next branch, where it is fused with the output of that branch's dilated convolution using a 1 × 1 convolution; this operation is repeated for each branch. Finally, a 1 × 1 convolution fuses the feature information of the different scales, and the result is fed to the channel attention module after a residual connection.
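
Since the branch description above is dense, the sketch below gives one possible reading of it in PyTorch: three dilated 3 × 3 convolutions (dilation 2, 3, 4), each splitting its output into a half kept for the final fusion and a half carried to the next branch, a 1 × 1 fusion of all scales, and a local residual connection. The exact wiring and channel counts are assumptions, not the paper's verified design.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureExtraction(nn.Module):
    """Sketch of the multi-scale extraction idea: small-receptive-field information
    gradually enriches larger-receptive-field branches. Channel attention (next
    sketch) would be applied to the output of this module."""

    def __init__(self, channels: int = 64):
        super().__init__()
        half = channels // 2
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (2, 3, 4)
        ])
        # 1x1 convolutions that merge the carried-over half with the new branch output.
        self.merge = nn.ModuleList([nn.Conv2d(channels + half, channels, 1) for _ in range(2)])
        self.fuse = nn.Conv2d(2 * channels, channels, 1)   # three kept halves + final carry

    def forward(self, x):
        kept, carry = [], None
        for i, conv in enumerate(self.branches):
            feat = conv(x)                                   # dilated conv on the shared input
            if carry is not None:
                feat = self.merge[i - 1](torch.cat([feat, carry], dim=1))
            y1, y2 = feat.chunk(2, dim=1)                    # half kept, half passed to the next branch
            kept.append(y1)
            carry = y2
        out = self.fuse(torch.cat(kept + [carry], dim=1))    # fuse all scales with a 1x1 conv
        return out + x                                       # local residual connection

print(MultiScaleFeatureExtraction()(torch.rand(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```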

The attention mechanism measures the importance of different feature information by assigning weights, so that the network can ignore irrelevant information and focus on key information. Since the multi-scale feature extraction has already treated temporal information as channels, the channel attention module of CBAM is introduced in this paper. For a given input feature map F, it is computed as
$$F' = M_c(F) \otimes F$$
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^C_{avg})) + W_1(W_0(F^C_{max}))\big)$$
where ⊗ denotes element-wise multiplication, σ denotes the sigmoid function, F^C_avg and F^C_max denote the average-pooled and max-pooled channel descriptors of F, respectively, W0 ∈ R^(C/r×C) and W1 ∈ R^(C×C/r) are the shared MLP weights, r is the reduction ratio, and W0 is followed by a ReLU activation.
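
A compact sketch of the CBAM-style channel attention described by the formulas above; the reduction ratio r and channel count are illustrative choices.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention as in CBAM: average- and max-pooled descriptors pass through
    a shared two-layer MLP (reduction ratio r) and are summed before a sigmoid,
    giving one weight per channel."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),   # W0, followed by ReLU
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),   # W1
        )

    def forward(self, x):                          # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))         # MLP(AvgPool(F))
        mx = self.mlp(x.amax(dim=(2, 3)))          # MLP(MaxPool(F))
        w = torch.sigmoid(avg + mx)                # M_c(F)
        return x * w[:, :, None, None]             # F' = M_c(F) ⊗ F

print(ChannelAttention(64)(torch.rand(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])
```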

Geometric Image Reconstruction

After the high-resolution geometric image is obtained through network prediction, its pixel values are first added to the pixel values of the geometric image storing the initial-frame vertices of the high-precision mesh, yielding the geometric image storing the current-frame vertices of the high-precision mesh. The neighboring pixels of the image are then connected in the horizontal and vertical directions as well as along the positive slope direction to obtain the 3D triangular mesh model.
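
A sketch of this reconstruction step under the stated connectivity, splitting each pixel grid cell into two triangles along one diagonal so that neighbouring pixels are connected horizontally, vertically, and in one slope direction.

```python
import numpy as np

def geometry_image_to_mesh(geo_img: np.ndarray):
    """geo_img: (H, W, 3) array whose pixels store 3D vertex positions
    (initial-frame positions plus the predicted per-vertex displacements).
    Returns (vertices, faces): each grid cell is split into two triangles
    sharing one diagonal."""
    h, w, _ = geo_img.shape
    vertices = geo_img.reshape(-1, 3)
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1], idx[:-1, 1:]    # top-left / top-right corners of each cell
    bl, br = idx[1:, :-1], idx[1:, 1:]      # bottom-left / bottom-right corners
    faces = np.concatenate([
        np.stack([tl, bl, tr], axis=-1).reshape(-1, 3),   # first triangle of each cell
        np.stack([tr, bl, br], axis=-1).reshape(-1, 3),   # second triangle, sharing the diagonal
    ])
    return vertices, faces

verts, faces = geometry_image_to_mesh(np.random.rand(64, 64, 3))
print(verts.shape, faces.shape)    # (4096, 3) (7938, 3)
```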

The reconstructed mesh recovers a large number of real folds; however, the prediction inevitably introduces some "noise", manifested as perturbations of vertices near dense fold regions and mesh boundaries. This is because the loss function in this paper only considers vertex displacements. Adding loss terms such as a normal loss could alleviate the problem to some extent, but would add considerable cost to the construction of the dataset and to building and training the whole network. We found that simple post-processing of the reconstructed mesh already achieves good results, so this paper uses weighted Laplacian smoothing to denoise it [22].
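
A minimal sketch of Laplacian smoothing with uniform one-ring weights and a blend factor lam; the paper's exact weighting scheme is not given, so this is only an illustration of the denoising step.

```python
import numpy as np

def laplacian_smooth(vertices: np.ndarray, faces: np.ndarray,
                     lam: float = 0.5, iterations: int = 3) -> np.ndarray:
    """Move each vertex towards the average of its one-ring neighbours, blended by lam,
    for a few iterations. This damps the vertex 'noise' near dense fold regions and
    mesh boundaries mentioned above."""
    n = len(vertices)
    # Build a uniform adjacency (neighbour sets) from the triangle faces.
    neighbors = [set() for _ in range(n)]
    for a, b, c in faces:
        neighbors[a].update((b, c)); neighbors[b].update((a, c)); neighbors[c].update((a, b))
    v = vertices.astype(np.float64).copy()
    for _ in range(iterations):
        avg = np.array([v[list(nb)].mean(axis=0) if nb else v[i] for i, nb in enumerate(neighbors)])
        v = (1.0 - lam) * v + lam * avg
    return v

# Toy usage on a small noisy mesh (two triangles sharing an edge).
verts = np.array([[0, 0, 0], [1, 0, 0.2], [0, 1, -0.1], [1, 1, 0.05]], dtype=float)
faces = np.array([[0, 1, 2], [1, 3, 2]])
print(laplacian_smooth(verts, faces))
```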

Analysis of experimental results

Table 1 shows the numerical errors of this paper's method and PCA on the training and test sets. A quantitative approach is taken to evaluate the difference in simulation accuracy and time consumption between the proposed fabric animation simulation method and the PCA method, with both methods using the same output dimension. In the animation sequence column, a name prefixed with "T" indicates that the animation comes from the training set and "E" that it comes from the test set. The total error is the sum of the vertex error and the δ-coordinate error, where the vertex error is the MSE of the vertices and the δ-coordinate error refers to the Laplacian term error. Principal component analysis is a classical data compression method based on orthogonal transformations and is widely used for exploratory data analysis and building predictive models. It reduces dimensionality by linearly transforming potentially correlated observations and projecting each data point onto only the first few principal components; these uncorrelated variables are called principal components. The first principal component can be defined as the direction that maximizes the variance of the projected data, and each principal component can be viewed as a linear equation whose coefficients indicate the projection direction. According to the experimental results, the mean position errors of the two methods are relatively close, while the δ-coordinate error of PCA is nearly ten times larger, which is the main advantage of this paper's method. The error of this paper's method increases somewhat on the test set, but it still outperforms the PCA method.

Table 1. Loss of our method and PCA on the training set and test set

                      PCA                                      Ours
Animation sequence    Total error   Vertex error   δ error     Total error   Vertex error   δ error
T_Walking             0.547         0.015          0.532       0.0583        0.0155         0.0428
T_Dancing             0.305         0.013          0.292       0.0292        0.0049         0.0243
T_Capoeira            0.170         0.017          0.153       0.0416        0.0100         0.0316
T_Fighting            0.362         0.021          0.341       0.0481        0.0162         0.0319
T_Dancing2             0.330         0.001          0.329       0.0144        0.0053         0.0091
T_Hip Hop             0.203         0.005          0.198       0.0392        0.0158         0.0234
E_Running2            0.153         0.002          0.152       0.0880        0.0138         0.0742
E_Salsa               0.207         0.002          0.204       0.0859        0.0069         0.0790
Mean                  0.29370       0.01230        0.28141     0.05059       0.01105        0.03954

The second experiment further compares the time performance of the present method with position-based dynamics (PBD), a relatively high-performance simulation method. Table 2 shows the time consumption of this paper's method and the PBD method. Three animations are selected, and the time consumption of this method and of PBD is tested on each; for this method, two garment feature dimensions, a higher value S = 256 and a lower value S = 80, are used. Overall, the time consumption of the present method is about 2% of that of the PBD method. In a 60-frame video with only 15.65 milliseconds available per frame, the fabric animation simulated by the PBD method cannot meet the requirement: to represent vertex animation the garments need a large number of vertices (thousands or tens of thousands), and solving the animation as well as handling collisions leads to a huge time consumption for traditional methods such as PBD. The present method simplifies the computation by learning from offline generated data, exchanging space for time.

Table 2. Time consumption of our method and PBD

Animation name    Ours, S=256 (ms/frame)    Ours, S=80 (ms/frame)    PBD (ms/frame)
Walking           6.55                      4.23                     318
Capoeira          6.34                      4.85                     348.3
Rumba Dancing     6.18                      4.68                     326.7

Conclusion

The 3D reconstruction results of the cheongsam and Hanfu are evaluated with the intersection over union, chamfer distance, normal consistency and F-score metrics. Except for the 0.55% decrease in normal consistency compared with ConvONet, the reconstruction accuracy of this paper's model is improved over all other models. The improvement is most obvious over the OccNet model: 55.56%, 62.55%, 5.38% and 91.12% in the average IoU, chamfer distance, normal consistency and F-score, respectively. A fabric animation simulation of the cheongsam is constructed, and the quantitative results show that the δ-coordinate error of the proposed simulation technique is 0.03954, only about 14% of that of the PCA method, while its time consumption is only about 2% of that of the PBD method. This indicates that the deep learning-based 3D reconstruction and simulation technology for the cheongsam and Hanfu proposed in this paper achieves relatively excellent results.
