Deep Learning-based 3D Reconstruction and Simulation Technology for Cheongsam and Hanbok
Published Online: Sep 22, 2025
Received: Jan 22, 2025
Accepted: May 07, 2025
DOI: https://doi.org/10.2478/amns-2025-0965
© 2025 Li’na Zhao et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
As the most representative traditional dress of the Chinese nation, Hanfu has become an important symbol of China's national cultural self-confidence and of the dissemination of its excellent traditional culture to the world [1–3]. Through Hanfu cultural exhibitions, film and television works related to Hanfu culture, and online and offline activities such as "Hanfu shows", Hanfu culture has spread rapidly both in online virtual space and in real life, and it has been promoted from Hanfu enthusiasts to the general public all over the world [4–6]. If Hanfu can arouse people's aesthetic pleasure, then its value attributes establish a connection with the aesthetic subject, which can trigger emotional resonance and in turn lead people to identify with traditional Chinese culture [7–8].
The current rise of the "Hanfu fever" is a materialized expression of the sublimation of contemporary Chinese culture from self-awareness to self-confidence. Hanfu is the collective name for the costumes of the Han nationality and is one of the expressions of the 5000-year culture of the Chinese nation [9–10]. In the past decade, the Hanfu movement has evolved from a niche group activity into a popular cultural movement and has gradually penetrated from clothing aesthetics into the traditional cultural kernel, receiving more and more attention [11–12]. At the same time, the application of 3D virtual fitting technology, an emerging technology, in traditional clothing design and display has boosted the dissemination of Hanfu culture and its industrialization. For example, Chen Xi built a VR museum of Hanfu through three-dimensional modeling, exploring a new direction for the display of traditional costumes [13–15].
The three-dimensional virtual fitting system relies on computer image processing technology and graphics theory and, through three-dimensional human body modeling, virtual sewing and fabric simulation technology, constructs a virtual human body model of the user and achieves the virtual sewing and display of clothing [16–18].
An adaptive template is proposed on the basis of SMPL to train the cheongsam category. To address folding and flipping errors, a multi-stage stepwise deformation algorithm is proposed, and a graph convolutional network (GCN) is applied to learn the deformation of cheongsam and Hanfu garments and reconstruct them in 3D from three aspects: pose estimation, feature line fitting and surface refinement. The reconstruction results are evaluated using metrics such as the intersection over union of mesh volume, chamfer distance, normal consistency and F-score. To make the 3D reconstruction of the cheongsam and Hanfu more realistic, a high-precision fabric animation simulation method based on geometry images is proposed. A super-resolution network model is designed, including a pyramid-structured progressive fusion module for spatio-temporal features, which gradually enriches the information captured by large receptive fields using features from small receptive fields, addressing the problems of computational cost and global information acquisition.
With the rapid development of Internet fashion consumption, people's demands on the consumption experience are getting higher and higher, and interactive, personalized and realistic virtual try-on technology has received widespread attention. At the same time, the cheongsam and Hanfu, as traditional Chinese costumes, are a typical type of high fashion with national cultural characteristics, diverse styles and rich personalization. To enhance the interactivity, personalization and realism of the cheongsam and Hanfu virtual try-on experience, their 3D reconstruction and simulation play an important role. For this reason, this paper studies reconstruction and simulation technology for the cheongsam and Hanfu based on deep learning.
Although the styles of the cheongsam and Hanfu vary greatly, there are usually only a few underlying topologies. This makes a template-based approach possible. A straightforward approach is to train different models for different cheongsam categories using different predefined templates. However, this can lead to overfitting, since less data is used to train each model. To address this problem, this paper proposes an adaptive template, a new representation that scales to different garment topologies and uses a single network to generate all types of garments in the dataset. The template is built on the SMPL model, with the head, hand and foot parts removed [19].
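As a minimal sketch of how such a template could be assembled (the per-vertex body-part labels assumed here are hypothetical and would have to be prepared separately; SMPL itself does not ship a vertex-level segmentation in this form):

```python
import numpy as np

def build_adaptive_template(vertices, faces, part_labels,
                            excluded=("head", "leftHand", "rightHand", "leftFoot", "rightFoot")):
    """Remove head/hand/foot regions from an SMPL-like body mesh.

    vertices:    (V, 3) array of template vertex positions
    faces:       (F, 3) array of vertex indices
    part_labels: length-V sequence of body-part names per vertex (hypothetical labelling)
    """
    keep = np.array([lbl not in excluded for lbl in part_labels])
    # New index for every kept vertex, -1 for removed ones
    remap = -np.ones(len(vertices), dtype=np.int64)
    remap[keep] = np.arange(keep.sum())
    # Keep only faces whose three vertices all survive, then re-index them
    face_mask = keep[faces].all(axis=1)
    new_faces = remap[faces[face_mask]]
    return vertices[keep], new_faces
```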
During the training phase, the entire adaptive mesh is input to the network. However, different semantic regions are activated based on the estimated garment topology. In particular, in this paper, the template mesh is denoted as
A parsimonious approach to shape estimation from an adaptive template

Figure 1. Flowchart of the garment reconstruction algorithm
The three stages of the network are trained separately. All cascade networks used in the framework share the first part of VGGNet and differ only in the last layer.
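As a hedged PyTorch sketch of this weight-sharing arrangement (the output dimensions and head design are illustrative assumptions, not the paper's actual configuration): the convolutional part of VGG is instantiated once and reused by every stage, while each stage keeps its own final layers.

```python
import torch.nn as nn
from torchvision.models import vgg16

class CascadeStage(nn.Module):
    """One stage of the cascade: shared VGG features plus a stage-specific head."""
    def __init__(self, shared_backbone, out_dim):
        super().__init__()
        self.backbone = shared_backbone          # shared across all stages
        self.head = nn.Sequential(               # only this part differs per stage
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(512, out_dim),
        )

    def forward(self, x):
        return self.head(self.backbone(x))

# One backbone instance shared by the pose, feature-line and refinement stages
shared = vgg16().features
pose_net = CascadeStage(shared, out_dim=72)      # e.g. SMPL pose parameters
line_net = CascadeStage(shared, out_dim=256)     # feature-line code (illustrative size)
refine_net = CascadeStage(shared, out_dim=1024)  # surface-refinement code (illustrative size)
```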
The 3D pose of the garment model is obtained by fitting the SMPL model to the reconstructed dense point cloud. The data processing procedure is as follows: 1) for each labeled feature line, compute its center point as the corresponding skeleton node; 2) align all point clouds using the joints in the torso region to ensure consistent orientation and scale; 3) fit SMPL to the skeleton nodes and point clouds to obtain the pose parameters of each model.
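A minimal numpy sketch of steps 1) and 2) under stated assumptions (the feature-line labels are illustrative, and the alignment uses a standard Kabsch/Umeyama-style closed form rather than the paper's exact procedure); step 3), the actual SMPL fitting, is not reproduced here.

```python
import numpy as np

def skeleton_nodes_from_feature_lines(feature_lines):
    """Step 1: one skeleton node per labelled feature line = centroid of its points.

    feature_lines: dict mapping a label (e.g. 'neckline', 'left_cuff') to an (N_i, 3) array.
    """
    return {name: pts.mean(axis=0) for name, pts in feature_lines.items()}

def align_to_torso(point_cloud, torso_joints, reference_joints):
    """Step 2: similarity-align a point cloud so its torso joints match a reference."""
    src = torso_joints - torso_joints.mean(axis=0)
    dst = reference_joints - reference_joints.mean(axis=0)
    scale = np.linalg.norm(dst) / np.linalg.norm(src)   # simple scale estimate
    U, _, Vt = np.linalg.svd(dst.T @ src)               # Kabsch rotation
    R = U @ Vt
    if np.linalg.det(R) < 0:                             # avoid reflections
        U[:, -1] *= -1
        R = U @ Vt
    t = reference_joints.mean(axis=0) - scale * (R @ torso_joints.mean(axis=0))
    return scale * (point_cloud @ R.T) + t
```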
Although each garment already has corresponding multi-view real images, the variation in viewing angle and illumination is still limited. To ensure that the models in this paper generalize to different lighting conditions and views, synthetic images are added as input. Specifically, for each model, rendered images are generated by randomly sampling 3 viewpoints and 3 different lighting environments, resulting in a total of 9 images.
A new feature line loss function
In the surface refinement stage, in addition to feature line loss
The evaluation metrics for the 3D reconstruction task mainly consider geometric error, modeling accuracy and data completeness. Geometric error measures the error between the reconstruction result and the actual measurement data, modeling accuracy measures the difference between the reconstruction result and the real model, and data completeness measures whether a complete 3D model can be recovered from the given data. The evaluation metrics used in this experiment are as follows.
Intersection over union (IoU) of mesh volume
The intersection over union (IoU) of mesh volume is the quotient of the volume of the intersection of the predicted mesh and the real mesh and the volume of their union, and is used to measure the degree of overlap between the predicted mesh and the real mesh. It is computed as:

$$\mathrm{IoU}(\mathcal{M}_{pred},\mathcal{M}_{gt}) = \frac{|V(\mathcal{M}_{pred}) \cap V(\mathcal{M}_{gt})|}{|V(\mathcal{M}_{pred}) \cup V(\mathcal{M}_{gt})|}$$

where $V(\cdot)$ denotes the volume enclosed by a mesh.
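A minimal sketch of this metric, assuming both meshes have already been voxelized into boolean occupancy grids of the same resolution:

```python
import numpy as np

def mesh_volume_iou(pred_occupancy, gt_occupancy):
    """Volumetric IoU between two boolean occupancy grids of equal shape."""
    intersection = np.logical_and(pred_occupancy, gt_occupancy).sum()
    union = np.logical_or(pred_occupancy, gt_occupancy).sum()
    return intersection / union if union > 0 else 0.0
```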
Chamfer distance
The chamfer distance measures the distance between the surface of the predicted mesh and the surface of the real mesh. The accuracy score is the average distance from points sampled on the predicted mesh to their nearest points on the real mesh, and the completeness score is the average distance from points sampled on the real mesh to their nearest points on the predicted mesh.
Finally, the chamfer distance between the predicted mesh and the real mesh is defined as the average of the accuracy score and the completeness score, which is given by:

$$\mathrm{CD}(\mathcal{P},\mathcal{G}) = \frac{1}{2}\left(\frac{1}{|\mathcal{P}|}\sum_{p\in\mathcal{P}}\min_{g\in\mathcal{G}}\lVert p-g\rVert + \frac{1}{|\mathcal{G}|}\sum_{g\in\mathcal{G}}\min_{p\in\mathcal{P}}\lVert g-p\rVert\right)$$

where $\mathcal{P}$ and $\mathcal{G}$ denote the points sampled on the predicted mesh and the real mesh, respectively.
Normal consistency
In order to measure the degree of alignment between the surface of the predicted mesh and the surface of the real mesh, normal consistency adopts a definition similar to the chamfer distance: for each point sampled on one mesh, the nearest point on the other mesh is found and the absolute value of the dot product of their unit normals is computed; the normal consistency is the average of these values taken over both directions.
F-score
Recall is defined as the percentage of points on the real mesh that lie within a specific distance of the predicted mesh, whereas precision is defined as the percentage of points on the predicted mesh that lie within a specific distance of the real mesh. The F-score is defined as the harmonic mean of precision and recall:

$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
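A compact numpy sketch of the three point-based metrics above (chamfer distance, normal consistency and F-score), assuming points and unit normals have already been sampled from the predicted and real meshes; the distance threshold tau of the F-score is a free parameter:

```python
import numpy as np

def _nearest(src, dst):
    """Index and distance of the nearest neighbour in dst for each point in src (brute force)."""
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)  # (|src|, |dst|)
    idx = d.argmin(axis=1)
    return idx, d[np.arange(len(src)), idx]

def chamfer_distance(pred_pts, gt_pts):
    _, acc = _nearest(pred_pts, gt_pts)        # accuracy: predicted -> real
    _, comp = _nearest(gt_pts, pred_pts)       # completeness: real -> predicted
    return 0.5 * (acc.mean() + comp.mean())

def normal_consistency(pred_pts, pred_nrm, gt_pts, gt_nrm):
    i, _ = _nearest(pred_pts, gt_pts)
    j, _ = _nearest(gt_pts, pred_pts)
    fwd = np.abs((pred_nrm * gt_nrm[i]).sum(axis=1)).mean()
    bwd = np.abs((gt_nrm * pred_nrm[j]).sum(axis=1)).mean()
    return 0.5 * (fwd + bwd)

def f_score(pred_pts, gt_pts, tau):
    _, d_pred = _nearest(pred_pts, gt_pts)
    _, d_gt = _nearest(gt_pts, pred_pts)
    precision = (d_pred < tau).mean()          # predicted points close to the real mesh
    recall = (d_gt < tau).mean()               # real points close to the predicted mesh
    denom = precision + recall
    return 2 * precision * recall / denom if denom > 0 else 0.0
```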
In order to quantitatively compare the improved model with existing methods, this experiment uses the intersection over union, chamfer distance, normal consistency and F-score metrics to evaluate the reconstruction results on the DeepFashion3D dataset.
The metric values of this paper's model and of existing methods are shown in Fig. 2 to Fig. 5. To better analyze the reconstruction results for different types of garments in the DeepFashion3D dataset, the figures report the metrics computed for the nine garment categories, numbered 1 to 9: long-sleeved shirts, short-sleeved shirts, sleeveless shirts, long pants, short pants, long-sleeved dresses, short-sleeved dresses, sleeveless dresses and skirts, together with their averages.

Figure 2. The results of the IoU calculation
Figure 3. The results of the chamfer distance calculation
Figure 4. The results of the normal consistency calculation
Figure 5. The results of the F-score calculation
From the evaluation results in the figures, compared with OccNet, this paper's model improves the average intersection over union, chamfer distance, normal consistency and F-score by 55.56%, 62.55%, 5.38% and 91.12%, respectively. Compared with DyConvONet, it improves them by 5.54%, 35.13%, 2.65% and 19.55%, respectively. Compared with ConvONet, the model shows improvements of 17.19%, 15.16% and 3.44% in the average intersection over union, chamfer distance and F-score, respectively, with only a slight decrease of 0.55% in normal consistency. Among the garment categories, the intersection over union of the reconstruction results for long pants and short pants is lower than that of other garment types. Because the structure of dresses is more complex, their reconstruction is more difficult; as a result, long-sleeved and short-sleeved dresses have larger chamfer distances than other garments and do not perform as well in terms of F-score. Among all garment categories, short-sleeved shirts and sleeveless shirts achieve the best reconstruction results, which is related to their simple structures and to the large number of samples of these two garment types in the DeepFashion3D dataset.
After realizing the 3D reconstruction of the cheongsam and Hanfu, this paper further investigates a fabric simulation method to animate the garment. The structure of the image super-resolution network model designed in this paper is shown in Fig. 6; given consecutive low-resolution geometric images as input, it predicts the high-resolution geometric image of the center frame.

Figure 6. Network structure
Unlike traditional image super-resolution datasets, in which there is little correlation between images, fabric motion is continuous, so the resulting geometric images are also related to each other and can provide more information to the network. This paper therefore uses multiple consecutive geometric images as input. Spatio-temporal features of the multi-frame input could be extracted with 3D convolution, but this brings a huge computational cost. In order to extract spatio-temporal features at different levels while reducing the number of parameters, this paper draws inspiration from the feature pyramid and proposes a pyramid-structured progressive fusion module for spatio-temporal features [21]. The features at level
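The exact layer configuration is not reproduced here; the following PyTorch sketch only illustrates the general idea under stated assumptions: consecutive geometric-image frames are stacked along the channel dimension to avoid 3D convolution, features are extracted at several pyramid levels, and coarser features (large receptive field) are progressively upsampled and fused with finer ones (small receptive field).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    """Progressive spatio-temporal fusion over a feature pyramid (illustrative sketch).

    Input: (B, T, C, H, W) consecutive geometric-image frames. Frames are folded into
    channels so that plain 2D convolutions mix temporal and spatial information.
    """
    def __init__(self, frames=3, channels=3, feat=64, levels=3):
        super().__init__()
        self.embed = nn.Conv2d(frames * channels, feat, 3, padding=1)
        self.level_convs = nn.ModuleList(
            nn.Conv2d(feat, feat, 3, padding=1) for _ in range(levels)
        )
        self.fuse_convs = nn.ModuleList(
            nn.Conv2d(2 * feat, feat, 3, padding=1) for _ in range(levels - 1)
        )

    def forward(self, x):
        b, t, c, h, w = x.shape
        cur = self.embed(x.reshape(b, t * c, h, w))
        # Build the pyramid: level 0 is full resolution, deeper levels are downsampled
        pyramid = []
        for i, conv in enumerate(self.level_convs):
            cur = conv(cur)
            pyramid.append(cur)
            if i < len(self.level_convs) - 1:
                cur = F.avg_pool2d(cur, 2)
        # Progressive fusion: coarse features are upsampled and merged into finer levels
        out = pyramid[-1]
        for level in range(len(pyramid) - 2, -1, -1):
            fine = pyramid[level]
            out = F.interpolate(out, size=fine.shape[-2:], mode="bilinear", align_corners=False)
            out = self.fuse_convs[level](torch.cat([fine, out], dim=1))
        return out

# e.g. PyramidFusion()(torch.randn(1, 3, 3, 64, 64)) -> (1, 64, 64, 64)
```

Folding the frames into the channel dimension keeps the cost of a 2D convolution while still letting every filter see all input frames, which is the parameter-saving trade-off the paragraph above alludes to.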
In the field of image super-resolution, recent network structures use large receptive fields and multi-scale learning for feature extraction in order to obtain both global and local information, because feature representations with rich contextual information are very important for super-resolution reconstruction; this also applies to geometric images. For the geometric image constructed in this paper, the folds in the fabric mesh are well represented on the geometric image because vertex displacements are used for interpolation: the more intense the folds of the fabric, the larger the pixel values of the corresponding region on the geometric image. Since different regions of the fabric have folds of different intensity and density, large receptive fields and multi-scale learning are also needed for feature extraction on the geometric image. ASPP extracts multi-scale features by cascading multiple dilated convolution layers with dilation rates of 1, 4 and 8 in a residual manner; however, too large a dilation rate and the stacking of multiple dilated convolution kernels lose the continuity of the information, which is detrimental to pixel-level prediction. Therefore, this paper improves it by using features from small receptive fields to gradually enrich the information captured by large receptive fields. The module consists of dilated convolutions, local residual connections and channel attention, where the dilation rates of the dilated convolutions are 2, 3 and 4, respectively. The first branch is convolved with a 3 × 3 convolution and the features are then divided into two groups
The attention mechanism measures the importance of different feature information by assigning weights, so that the network can ignore irrelevant information and focus on key information. Since the multi-scale feature extraction already treats temporal information as channels, the channel attention module from CBAM is introduced in this paper. For a given input feature map $F$, the channel attention is computed as

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), \qquad F' = M_c(F) \otimes F$$

where $\sigma$ is the sigmoid function, the MLP weights are shared between the two pooled descriptors, and $\otimes$ denotes channel-wise multiplication.
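A short PyTorch sketch of this CBAM-style channel attention (the reduction ratio is a free hyperparameter):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: shared MLP over average- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))     # MLP(AvgPool(F))
        mx = self.mlp(x.amax(dim=(2, 3)))      # MLP(MaxPool(F))
        weights = torch.sigmoid(avg + mx)      # M_c(F)
        return x * weights[:, :, None, None]   # reweight channels
```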
After obtaining the high-resolution geometric image through network prediction, its pixel values are first added to the pixel values of the geometric image storing the initial-frame vertices of the high-precision mesh, giving the geometric image that stores the current-frame vertices of the high-precision mesh. Neighboring pixels of the image are then connected in the transverse and longitudinal directions as well as in the positive-slope direction to obtain the 3D triangular mesh model.
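A numpy sketch of this meshing step, assuming the geometry image is an H × W × 3 grid whose pixels store vertex positions, and splitting each grid cell into two triangles along one diagonal:

```python
import numpy as np

def geometry_image_to_mesh(geo_image):
    """Convert an (H, W, 3) geometry image of vertex positions into a triangle mesh."""
    h, w, _ = geo_image.shape
    vertices = geo_image.reshape(-1, 3)
    idx = np.arange(h * w).reshape(h, w)

    # Corners of every grid cell
    tl, tr = idx[:-1, :-1], idx[:-1, 1:]
    bl, br = idx[1:, :-1], idx[1:, 1:]

    # Two triangles per cell, split along the same diagonal everywhere
    faces = np.concatenate([
        np.stack([tl, bl, tr], axis=-1).reshape(-1, 3),
        np.stack([tr, bl, br], axis=-1).reshape(-1, 3),
    ])
    return vertices, faces
```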
The reconstructed mesh can recover a large number of real folds; however, the prediction operation inevitably generates some "noise", which manifests as perturbations of some vertices near densely folded areas and the mesh boundaries. This is because the loss function in this paper only considers vertex displacements; adding additional loss terms such as normals could alleviate the problem to a certain extent, but it would add considerable cost to the construction of the dataset and to building and training the whole network. We found that the reconstructed mesh can be simply post-processed to achieve good results, so this paper uses weighted Laplacian smoothing to denoise it [22].
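As a simple illustration of this denoising step, a uniform-weight Laplacian smoothing sketch is given below (the paper's actual weighting scheme is not reproduced; the smoothing factor and iteration count are illustrative):

```python
import numpy as np

def laplacian_smooth(vertices, faces, lam=0.5, iterations=3):
    """Laplacian smoothing: move each vertex towards the average of its neighbours."""
    v = vertices.copy()
    n = len(v)

    # Build vertex adjacency from the triangle faces
    neighbors = [set() for _ in range(n)]
    for a, b, c in faces:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))
    neighbors = [np.fromiter(s, dtype=np.int64) for s in neighbors]

    for _ in range(iterations):
        new_v = v.copy()
        for i, nbrs in enumerate(neighbors):
            if len(nbrs) == 0:
                continue
            centroid = v[nbrs].mean(axis=0)
            new_v[i] = v[i] + lam * (centroid - v[i])   # pull vertex towards neighbour centroid
        v = new_v
    return v
```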
Table 1 shows the numerical errors of this paper's method and PCA on the training and test sets. A quantitative approach is taken to evaluate the difference in simulation accuracy and time consumption between the proposed fabric animation simulation method and the PCA method, with both methods using the same output dimension. In the animation sequence column, "T" indicates that the animation comes from the training set and "E" that it comes from the test set; the total error is the sum of the vertex error and a second error term (the third column reported for each method).
Table 1. Loss of our method and PCA on the training set and evaluation set
| Animation sequence | PCA Total error | PCA Vertex error | PCA Other error | Ours Total error | Ours Vertex error | Ours Other error |
|---|---|---|---|---|---|---|
| T_Walking | 0.547 | 0.015 | 0.532 | 0.0583 | 0.0155 | 0.0428 |
| T_Dancing | 0.305 | 0.013 | 0.292 | 0.0292 | 0.0049 | 0.0243 |
| T_Capoeira | 0.170 | 0.017 | 0.153 | 0.0416 | 0.0100 | 0.0316 |
| T_Fighting | 0.362 | 0.021 | 0.341 | 0.0481 | 0.0162 | 0.0319 |
| T_Dancing2 | 0.330 | 0.001 | 0.329 | 0.0144 | 0.0053 | 0.0091 |
| T_Hip Hop | 0.203 | 0.005 | 0.198 | 0.0392 | 0.0158 | 0.0234 |
| E_Running2 | 0.153 | 0.002 | 0.152 | 0.0880 | 0.0138 | 0.0742 |
| E_Salsa | 0.207 | 0.002 | 0.204 | 0.0859 | 0.0069 | 0.0790 |
| Mean | 0.29370 | 0.01230 | 0.28141 | 0.05059 | 0.01105 | 0.03954 |
The second experiment further compares the running time of the present method with position-based dynamics (PBD), a high-performance simulation method. Table 2 shows the time consumption of this paper's method and the PBD method. As shown in Table 2, three animations are selected to test the time consumption of the two methods, and two garment feature dimensions (a higher and a lower value, s = 256 and s = 80) are used when testing this method. Overall, the time consumption of the present method is about 2% of that of the PBD method. For a 60-frame-per-second video, with only 15.65 milliseconds available per frame, fabric animation simulated by the PBD method cannot meet the requirement; this is because, in order to represent vertex animation, the garments need a large number of vertices (thousands or tens of thousands), and both the animation and the collisions have to be solved, which makes traditional methods such as PBD extremely time-consuming. The present method simplifies the computation by learning from offline-generated data, trading space for time.
Table 2. Time consumption of our method and PBD
| Animation name | Feature dimension | Ours (ms/frame) | PBD (ms/frame) |
|---|---|---|---|
| Walking | s = 256 | 6.55 | 318 |
| Walking | s = 80 | 4.23 | 318 |
| Capoeira | s = 256 | 6.34 | 348.3 |
| Capoeira | s = 80 | 4.85 | 348.3 |
| Rumba Dancing | s = 256 | 6.18 | 326.7 |
| Rumba Dancing | s = 80 | 4.68 | 326.7 |
The 3D reconstruction results of the cheongsam and Hanfu are evaluated using the intersection over union, chamfer distance, normal consistency and F-score metrics. Except for a 0.55% decrease in normal consistency compared with ConvONet, the reconstruction accuracy of this paper's model is improved over all other compared models. The most obvious improvement is over the OccNet model, with gains of 55.56%, 62.55%, 5.38% and 91.12% in the average intersection over union, chamfer distance, normal consistency and F-score, respectively. A simulation of the cheongsam fabric animation is also constructed, and the quantitative results show that the proposed method achieves lower errors than PCA on both the training and test sets while consuming only about 2% of the time required by the PBD method.