Practical Research on Virtual Reality and Augmented Reality Technology in Min Cultural Heritage Digital IP Character Design Innovation

05 Feb 2025
Introduction

Since ancient times, Fujian has been a land of abundance and talent that has produced many rich and colorful ancient cultures. Fujian's local culture is an inseparable part of the traditional culture of the Chinese nation, yet, shaped by the province's natural geography, its development shows distinctive regional characteristics, since the formation and evolution of any culture is always closely tied to a particular historical environment [1,2]. China plans to build a socialist cultural power by 2035, and strengthening the protection and inheritance of cultural heritage is an important part of this construction. Fuzhou City has actively explored and practiced the protection and inheritance of the "roots" and "soul" of Fujian's "Eight Min" culture and achieved good results [3,4]. However, in the actual conservation and utilization of cultural relics and the protection and inheritance of cultural heritage, problems remain: regulatory mechanisms lack effectiveness, overall planning and later-stage implementation are weak, society is unaware of the importance of cultural heritage protection and inheritance, and development is unbalanced across the region. These problems seriously affect the effectiveness of cultural heritage protection and inheritance in Fuzhou and hinder the presentation of Fuzhou's historical and cultural identity and cultural self-confidence [5-7]. From the perspective of a cultural power, it is therefore of great significance to study paths for enhancing cultural heritage protection and inheritance and to explore new ones, in order to promote Fuzhou's historical and cultural identity, display cultural self-confidence, and help realize the strategy of a cultural power [8,9].

Min cultural heritage digital IP character design is an application design that relies on the development of computer technology; it was born from film's continuous pursuit of human visual art and the rapid progress of science and technology [10]. After decades of development, digital character design has become a comprehensive design art spanning multiple fields, such as film characters, animation characters, game characters, and corporate images, and is widely used in various visual media [11,12]. Digital character design challenges the limits of human creativity: it blends into life yet stands above it. Traditional art is the soil from which it draws nutrients, and science and technology are the driving forces of its development. It conforms to the modern digital lifestyle and is representative of the new design forms and concepts of the digital era [13,14]. At present, the popularization of multidimensional image imaging technology in education, scientific research, engineering construction, and other industries creates favorable conditions for the main directions of image imaging analysis in its next stage of development [15,16]. Among these technologies, virtual reality and augmented reality are the mainstay of multidimensional image imaging: they have matured the application of imaging technology, moved imaging content steadily toward realism and lifelikeness, and provide effective technical support for modern image spatial structure analysis and image processing [17,18]. Therefore, the digital IP character design of Min cultural heritage can be innovated across multiple basic spatial dimensions, so that the application of virtual reality and augmented reality technology better meets the basic needs of the development of Min cultural heritage [19,20].

Based on VR and AR technology, this paper proposes a design process for digital IP characters of Min cultural heritage and explores the character production and motion capture modules. On the one hand, to generate images efficiently from text vectors, the LAFITE model is used to generate digital IP images of Min cultural heritage. CelebA and Anime Face are selected as experimental datasets, and the image generation quality and image diversity of the LAFITE model are analysed by comparing the FID and IS metrics of this paper's model against other image generation models on the two datasets. Relevant data on Min culture are then collected as input to examine the quality of the output digital IP images. On the other hand, human structure vectors are constructed from acquired human joint point data; vector angles and vector modulus ratios between the structure vectors are combined to form human pose description vectors, and an extended (multi-class) support vector machine is used for action recognition. A customized dataset containing seven actions was collected, and the model was trained and tested to investigate its effectiveness in motion capture recognition for digital IP characters.

VR and AR based digital IP character design for Min culture

Min culture refers to the cultural resources of Fujian in philosophy, history, literature, art, craftsmanship, folklore, and so on, combining the unity of inland civilisation with the openness of ocean civilisation and showing strong regional characteristics. At present, however, Min culture is disseminated in a single form, and the cultural products derived from it lack intuitiveness and interest, making them hard for contemporary young people to accept and ineffective at promoting and disseminating Min culture. Against the background of rapidly developing and continuously updated new media technology, how to develop Min cultural resources optimally and choose the best medium to disseminate and promote the essence of Min culture is a problem that urgently needs to be solved.

Virtual reality and augmented reality technology can use computers to construct virtual three-dimensional scenes and virtual characters, and set up interactive animation plots and multi-sensory experiences covering vision, hearing, touch, and more, so that through virtual interactive devices the audience can interact with the virtual world and experience various sensory information. Compared with traditional cultural communication media, virtual reality animation created with VR and AR technology is immersive, interactive, and experiential; it can convey the connotation of Min culture from multiple perspectives and let the audience subconsciously feel the charm of Min culture while watching and experiencing the animation.

Traditional Min culture, with its strong regional characteristics, appears dated, niche, and unpopular because of its traditional modes of communication. The younger generation, however, prefers popular culture and fashionable media, so traditional Min culture increasingly faces marginalisation, with few willing to learn it, creating a break in its inheritance and dissemination. The charm of Min culture can be enhanced if the traditional culture is given appropriate animated expression and spread through new media platforms. Therefore, this paper combines VR and AR technologies to innovatively design digital IP characters for Min culture.

The comprehensive development and application of digital information technology has effectively raised the design level of digital IP characters through virtual reality and augmented reality technology, and many advanced technologies and software tools are available to improve design quality. After comprehensively analysing the characteristics of the two technologies, this paper focuses on VR and AR technology and the 3D design of digital IP characters.

The design process of the Min culture digital IP character is shown in Figure 1. Three-dimensional model design, the motion capture system, and the production of digital IP characters are the main structural parts of current digital IP character design, and virtual reality technology and digital IP three-dimensional character design mutually promote and influence each other. In the design, animation technology, VR technology, AR technology, and deep learning technology should be fully integrated and applied, so that the whole form of the technology can be mastered and displayed on the virtual reality platform; through interaction with and experience of the Min culture digital IP characters, a distinctly different viewing experience can be obtained. Analysis of the actual application effect shows that the design of the Min culture digital IP character achieves interactivity and a strong sense of viewing, enhances cultural connotation and artistic appeal, triggers artistic resonance in the audience, and improves the efficiency of Min culture dissemination. The following sections study, respectively, the image generation and the action recognition capture of Min cultural heritage digital IP design.

Figure 1.

The design process of the Min culture digital IP character

Digital IP image generation of Min culture based on LAFITE
LAFITE model

In this paper, virtual character image design is based on deep learning, and the LAFITE model is selected to generate the digital IP images of Min culture. The LAFITE model exploits the aligned multimodal semantic space of the powerful pre-trained CLIP model to alleviate the need for textual description data during training by generating text features from image features; it is also the first model to transfer text features into StyleSpace, which directly controls image features, so that corresponding virtual digital IP images can be generated more efficiently from text vectors.

Network architecture

StyleSpace in StyleGAN2 is a latent space with high linear separability and a high degree of disentanglement, i.e., weak mutual interference between individual features. The LAFITE model takes advantage of this property to inject conditional information directly into StyleSpace, in the following steps: ① The random noise vector z ∈ Z is transformed into the latent space W by the mapping network. ② Building on the W space, each generator layer applies a different affine transformation to map the intermediate vector w of the W space into channel-level style parameters s. ③ A pseudo-text feature vector h′ is output by the multimodal space trained by CLIP. ④ h′ is transformed into a conditional vector c through a 2-layer FC network learned separately by each generator layer. ⑤ The style parameter s is concatenated with the conditional vector c to obtain [s, c], which is further transformed into the channel-level conditional style vector u by a different affine transformation for each generator layer; u is then fed into the style blocks of the StyleGAN2 synthesis network. The space in which the conditional style vectors u live is called the conditional style space U. The generator G synthesizes an image as:
$$x' = G(h', z)$$

Where x′ is the fake image generated by the generator, h′ is the pseudo-text feature vector output by the CLIP model, z is the random noise vector, and G(·) is the overall generator.
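To make steps ①-⑤ concrete, here is a minimal PyTorch sketch of the conditional injection into StyleSpace. All dimensions and module names are toy assumptions for illustration; the real StyleGAN2 mapping and synthesis networks, and LAFITE's per-layer modules, are far larger.

```python
# Minimal sketch of LAFITE-style conditional injection into StyleSpace.
# Dimensions and module names are illustrative assumptions, not LAFITE's API.
import torch
import torch.nn as nn

Z_DIM, W_DIM, CLIP_DIM, C_DIM, S_DIM = 128, 128, 512, 64, 256

mapping = nn.Sequential(nn.Linear(Z_DIM, W_DIM), nn.LeakyReLU(0.2),
                        nn.Linear(W_DIM, W_DIM))           # step 1: z -> w
affine_s = nn.Linear(W_DIM, S_DIM)                         # step 2: w -> s (per layer)
cond_net = nn.Sequential(nn.Linear(CLIP_DIM, C_DIM), nn.ReLU(),
                         nn.Linear(C_DIM, C_DIM))          # step 4: h' -> c (2-layer FC)
affine_u = nn.Linear(S_DIM + C_DIM, S_DIM)                 # step 5: [s, c] -> u

z = torch.randn(4, Z_DIM)           # random noise vector z
h_prime = torch.randn(4, CLIP_DIM)  # step 3: pseudo-text feature from CLIP space

w = mapping(z)                      # intermediate latent in W space
s = affine_s(w)                     # channel-level style parameters
c = cond_net(h_prime)               # conditional vector
u = affine_u(torch.cat([s, c], 1))  # conditional style vector fed to a style block
print(u.shape)                      # torch.Size([4, 256])
```

In the full model each synthesis layer owns its own affine and FC modules; a single shared pair is used here only to keep the sketch short.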

Discriminator: In a text-to-image model, the discriminator must ensure both image quality and image-text correspondence. The LAFITE model uses a shared discriminator backbone to encode the input image x and performs two tasks through two fully connected heads: ① f_d(x) discriminates the input image x, confirming whether it is a real image from the training set or a fake image generated by the generator, and outputs the result as a vector. ② The input image x is projected into a multimodal semantic space similar to CLIP's, outputting the text feature vector f_s(x) corresponding to image x. In summary, the output of the discriminator based on pseudo-text features is:
$$D(x, h') = f_d(x) + \langle h', f_s(x) \rangle, \qquad \langle h', f_s(x) \rangle = f_s(x)^{T} h'$$

Where f_d(x) denotes the probability that the input image x is real or fake, and ⟨h′, f_s(x)⟩ denotes the inner product of the pseudo-text feature vector h′ and the image's corresponding text feature vector f_s(x), indicating the degree of match between the two. When D(x, h′) is large, the discriminated image is judged real and h′ matches f_s(x) closely. Analogously, the discriminator based on real text features outputs:
$$D(x, h) = f_d(x) + \langle h, f_s(x) \rangle, \qquad \langle h, f_s(x) \rangle = f_s(x)^{T} h$$

Where f_d(x) denotes the probability that the input image x is real or fake, and ⟨h, f_s(x)⟩ denotes the inner product of the real text feature vector h and the image's corresponding text feature vector f_s(x), indicating the degree of match between the two. When D(x, h) is large, the discriminated image is judged real and h matches f_s(x) closely.
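To illustrate the two-headed design, the sketch below wires a shared backbone to a real/fake head f_d and a semantic head f_s, returning D(x, h) = f_d(x) + f_s(x)ᵀh. The backbone architecture and all sizes are assumptions made for the sketch, not the paper's network.

```python
# Sketch of the two-headed discriminator D(x, h) = f_d(x) + <h, f_s(x)>,
# with a stand-in CNN backbone; layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TwoHeadDiscriminator(nn.Module):
    def __init__(self, feat_dim=512, txt_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(            # shared image encoder
            nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim))
        self.fd = nn.Linear(feat_dim, 1)          # real/fake head f_d(x)
        self.fs = nn.Linear(feat_dim, txt_dim)    # semantic head f_s(x)

    def forward(self, x, h):
        feat = self.backbone(x)
        fd = self.fd(feat).squeeze(1)             # scalar realness logit
        fs = self.fs(feat)                        # text-aligned feature
        return fd + (fs * h).sum(dim=1), fs       # f_d(x) + f_s(x)^T h

D = TwoHeadDiscriminator()
x = torch.randn(4, 3, 64, 64)
h = torch.randn(4, 512)
logit, fs = D(x, h)
print(logit.shape, fs.shape)   # torch.Size([4]) torch.Size([4, 512])
```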

Loss function

The LAFITE model keeps the StyleSpace of StyleGAN2 consistent with the pre-trained CLIP by means of contrastive learning, and therefore has contrastive loss functions in addition to the regular conditional GAN losses. The conventional conditional loss functions for the generator and discriminator based on pseudo-text feature vectors are:
$$L_G = -\sum_{i=1}^{n} \log \sigma\big(D(x'_i, h'_i)\big)$$
$$L_D = -\sum_{i=1}^{n} \log \sigma\big(D(x_i, h'_i)\big) - \sum_{i=1}^{n} \log\big(1 - \sigma(D(x'_i, h'_i))\big)$$

Where the image x′_i comes from a mini-batch of generated images {x′_i}_{i=1}^{n}, h′_i is the pseudo-text feature generated for the i-th image, and σ(·) denotes the sigmoid function. The regular loss functions based on real text are:
$$L_G = -\sum_{i=1}^{n} \log \sigma\big(D(x'_i, h_i)\big)$$
$$L_D = -\sum_{i=1}^{n} \log \sigma\big(D(x_i, h_i)\big) - \sum_{i=1}^{n} \log\big(1 - \sigma(D(x'_i, h_i))\big)$$
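Both pairs of losses are the standard non-saturating conditional GAN losses and translate directly into code. A minimal sketch, assuming any discriminator that returns the scalar logit D(x, h):

```python
# Sketch of the conditional GAN losses above, written with logsigmoid for
# numerical stability; the logits come from any D(x, h) such as the
# two-headed sketch earlier.
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits):
    # L_G = -sum_i log sigma(D(x'_i, h_i))
    return -F.logsigmoid(d_fake_logits).sum()

def discriminator_loss(d_real_logits, d_fake_logits):
    # L_D = -sum_i log sigma(D(x_i, h_i)) - sum_i log(1 - sigma(D(x'_i, h_i)))
    real_term = -F.logsigmoid(d_real_logits).sum()
    fake_term = -F.logsigmoid(-d_fake_logits).sum()  # log(1 - sigma(t)) = logsigmoid(-t)
    return real_term + fake_term

print(discriminator_loss(torch.randn(8), torch.randn(8)))
```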

To improve the similarity between the text feature vector f_s(x_i) that the discriminator extracts from the input image (real or generated) and the pseudo-text feature vector h′_i fed to the generator, the contrastive loss function of the discriminator is designed as:
$$L_{ConD} = -\sum_{i=1}^{n} \log \frac{\exp\big(\mathrm{Sim}(f_s(x_i), h'_i)/\tau\big)}{\sum_{j=1}^{n} \exp\big(\mathrm{Sim}(f_s(x_j), h'_i)/\tau\big)}, \qquad \mathrm{Sim}(p, q) = \frac{\langle p, q \rangle}{\|p\|_2 \, \|q\|_2}$$

Where Sim(p, q) denotes the cosine similarity of the two vectors p and q, ‖·‖₂ denotes the L2 norm, and τ is a non-negative hyperparameter. For the discriminator using real text, the contrastive loss function is:
$$L_{ConD} = -\sum_{i=1}^{n} \log \frac{\exp\big(\mathrm{Sim}(f_s(x_i), h_i)/\tau\big)}{\sum_{j=1}^{n} \exp\big(\mathrm{Sim}(f_s(x_j), h_i)/\tau\big)}$$

The LAFITE model uses the CLIP model to improve the semantic consistency between the generated image x′_i and its corresponding pseudo-text features h′_i, using the same hyperparameter τ for the generator's contrastive loss:
$$L_{ConG} = -\sum_{i=1}^{n} \log \frac{\exp\big(\mathrm{Sim}(f_{img}(x'_i), h'_i)/\tau\big)}{\sum_{j=1}^{n} \exp\big(\mathrm{Sim}(f_{img}(x'_j), h'_i)/\tau\big)}$$

Where f_img(x′_i) is the image feature vector corresponding to the i-th generated image. For the generator using real text, the contrastive loss function is:
$$L_{ConG} = -\sum_{i=1}^{n} \log \frac{\exp\big(\mathrm{Sim}(f_{img}(x'_i), h_i)/\tau\big)}{\sum_{j=1}^{n} \exp\big(\mathrm{Sim}(f_{img}(x'_j), h_i)/\tau\big)}$$
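All four contrastive losses share the same InfoNCE-style form: a softmax over cosine similarities between each text vector and the in-batch features, evaluated at the matched pair. A minimal sketch, assuming equally sized batches of feature and text vectors:

```python
# Sketch of the contrastive loss L_Con: for each text vector h_i, a softmax
# over Sim(f(x_j), h_i)/tau across the batch, evaluated at the matched j = i.
import torch
import torch.nn.functional as F

def contrastive_loss(feats, texts, tau=0.5):
    feats = F.normalize(feats, dim=1)       # unit vectors -> dot product = cosine Sim
    texts = F.normalize(texts, dim=1)
    sims = feats @ texts.t() / tau          # sims[j, i] = Sim(f(x_j), h_i) / tau
    targets = torch.arange(feats.size(0))   # matched pair lies on the diagonal
    # rows of sims.t() index texts i, columns index features j; target is j = i
    return F.cross_entropy(sims.t(), targets, reduction="sum")

print(contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```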

With the above contrastive loss functions, the final training losses for the discriminator and generator are defined as:
$$L'_D = L_D + \gamma L_{ConD}, \qquad L'_G = L_G + \gamma L_{ConD} + \lambda L_{ConG}$$

With pseudo-text feature input, the model is set up with τ = 0.5 and λ = γ = 20. With real text feature input, it is set up with τ = 0.5, λ = 20, and γ = 10.
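The final losses are then simple weighted sums. The fragment below only shows the weighting under the pseudo-text setting (τ = 0.5, λ = γ = 20); the component losses are the functions sketched above.

```python
# Sketch of the total losses L'_D and L'_G under the pseudo-text setting.
GAMMA, LAMBDA = 20.0, 20.0   # gamma and lambda for pseudo-text feature input

def total_discriminator_loss(l_d, l_con_d):
    return l_d + GAMMA * l_con_d                     # L'_D = L_D + gamma*L_ConD

def total_generator_loss(l_g, l_con_d, l_con_g):
    return l_g + GAMMA * l_con_d + LAMBDA * l_con_g  # L'_G = L_G + gamma*L_ConD + lambda*L_ConG
```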

Experiments and analyses

This chapter proposes a digital IP image generation model for Min culture based on the LAFITE model. To verify its effectiveness, this section conducts experiments that measure the model's virtual digital IP character image generation in terms of image quality and image diversity, and compares it with current mainstream image generation models.

Experimental preparation

The LAFITE model was trained and validated on the public datasets CelebA and Anime Face. CelebA is a large-scale public face-attribute dataset provided by the Chinese University of Hong Kong. It contains more than 200,000 face images of about 10,000 celebrities, with an image size of 178 × 218 pixels, and each image is annotated with 40 attributes such as age, gender, hairstyle, expression, and glasses. CelebA also annotates each image with five facial key points commonly used in face recognition and analysis. The Anime Face dataset contains 120 high-definition headshots of anime characters, each with a resolution between 90 × 90 and 120 × 120.

Model comparison and analysis

In image generation tasks, quantitative methods are required to assess the similarity between generated images and real data, as well as the diversity and quality of the generated images. This experiment uses the quantitative evaluation metrics FID and IS to evaluate each scheme. A larger IS value indicates higher quality of the generated images; a lower FID value indicates that the distribution of generated images is closer to the real image distribution, and hence better image quality and diversity.
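Concretely, FID is computed from the mean and covariance of Inception features of the real and generated image sets as FID = ‖μ₁ − μ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). A minimal sketch, with random features standing in for real Inception activations:

```python
# Sketch of the FID computation from feature statistics; extracting features
# with a real Inception network is omitted, random features stand in here.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
print(fid(rng.normal(size=(256, 64)), rng.normal(size=(256, 64))))
```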

This experiment runs the same tests with ProGAN, StyleGAN, WGAN, and TransGAN-XL on the CelebA and Anime Face datasets, compares them with the LAFITE-based Min culture digital IP image generation model proposed in this paper, and uses the average FID and IS values of all generated images as the performance evaluation metrics of the models.

The metric results of the different models on the two datasets are shown in Table 1. The FID and IS of the proposed LAFITE model are 3.26 and 4.01 on CelebA and 6.32 and 8.05 on Anime Face, the best values among the five experimental models, indicating that LAFITE outperforms the traditional convolutional generative adversarial network models in both quality and diversity of generated images. The combined image quality and diversity scores obtained from the experiments show that the LAFITE model performs excellently on both counts, which supports its use for generating digital IP character images of Min cultural heritage.

Table 1. The results of different models on the two datasets

| Methods | CelebA FID↓ | CelebA IS↑ | Anime Face FID↓ | Anime Face IS↑ |
|---|---|---|---|---|
| ProGAN | 7.04 | 3.64 | 9.8 | 5.61 |
| StyleGAN | 4.24 | 3.77 | 12.07 | 6.73 |
| WGAN | 11.39 | – | 10.52 | 4.22 |
| TransGAN-XL | 13.57 | 2.87 | 8.71 | 7.19 |
| LAFITE | 3.26 | 4.01 | 6.32 | 8.05 |
Digital IP image generation

Relevant data on Min culture were obtained through a web crawler, and attributes and categories of the virtual IP image were selected and combined to form text feature data, which was input into the image generation model to output design results for the virtual Min cultural heritage digital IP character image; the generated images have a good sense of realism and fidelity. To quantitatively test the effect of the generated graphics, Figure 2 shows the output peak signal-to-noise ratio of different methods for generating digital IP character images of Min culture. The peak signal-to-noise ratio of the Min culture digital IP character graphics generated by this paper's method is higher, reaching more than 70 dB, indicating better imaging quality, while the output peak signal-to-noise ratios of the images of the other models are all below 60 dB.

Figure 2.

The output peak signal-to-noise ratio of the digital IP images of different methods
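For reference, the peak signal-to-noise ratio in Fig. 2 is PSNR = 10·log₁₀(MAX²/MSE). A minimal sketch for 8-bit images (MAX = 255); the test images here are synthetic stand-ins:

```python
# Sketch of the peak signal-to-noise ratio used in Fig. 2.
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")               # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
noisy = np.clip(a + np.random.normal(0, 2, a.shape), 0, 255).astype(np.uint8)
print(psnr(a, noisy))   # small noise -> high PSNR
```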

MSVM-based digital IP character action recognition

Virtual reality technology is both interactive and immersive, and action recognition is a supporting technology and an important component of real-time interactive virtual human systems. This paper adopts the support vector machine as the research method and, based on the Kinect platform, represents the poses of digital IP characters and realises motion capture recognition for digital IP characters of Min cultural heritage.

Digital IP character pose representation

Extracting human pose representation features is the key to action recognition research, and the choice of pose representation features for the digital IP character directly affects the recognition rate of the action recognition algorithm. The process of extracting pose representation features is shown in Figure 3. In this paper, joint point data are used to construct human body structure vectors; the angles between the structure vectors serve as the main information, and the modulus ratios between some of the vectors are selected as auxiliary information to complete the regularisation of the joint points.

Figure 3.

The pose representation feature extraction process

Vector construction of the human body structure

The information of each joint point in the human structure model is labeled, and 22 sets of human structure vectors are constructed. The structure vectors are named uniformly after the two joints that form them, with the starting point first and the end point second; e.g., the structure vector from the right shoulder to the right elbow is called right shoulder-right elbow.

Selection of angle information between vectors

A total of 20 sets of angle information between vectors are selected. The naming of the vector angles follows the naming of the human structure vectors; e.g., θ1 is the angle between the vector neck-left shoulder and the vector left shoulder-left elbow and is named neck-left shoulder-left elbow, and θ2 is the angle between the vector torso centre-left hip and the vector left hip-left knee and is named torso centre-left hip-left knee.
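Each such angle can be computed from three joint positions as the arccosine of the normalised dot product of the two structure vectors. A minimal sketch, with made-up coordinates standing in for Kinect joint data:

```python
# Sketch of one vector-angle feature, e.g. theta_1 = angle(neck-left shoulder,
# left shoulder-left elbow), from 3D joint coordinates.
import numpy as np

def joint_angle(j_a, j_b, j_c):
    """Angle at joint j_b between vectors (j_a -> j_b) and (j_b -> j_c), in degrees."""
    v1 = np.asarray(j_b) - np.asarray(j_a)
    v2 = np.asarray(j_c) - np.asarray(j_b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

neck, l_shoulder, l_elbow = [0, 1.5, 0], [-0.2, 1.45, 0], [-0.25, 1.2, 0.05]
print(joint_angle(neck, l_shoulder, l_elbow))   # neck-left shoulder-left elbow
```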

Selection of ratio information for vector modes

In some cases, the vector angle information alone cannot fully describe the details of a pose. For example, in a hand-waving pose, the relative position of the upper limb and the head must be known, and angle values do not provide such information. To solve this problem, four sets of modulus ratio information are combined with the angle information of the skeletal joints to complete the regularisation of the joint data.

Let a be the vector pointing from the torso centre to the head, b and c the vectors pointing from the head to the left and right hands, and d and e the vectors pointing from the torso centre to the left and right hands. The modulus ratios of the four vectors are calculated as:
$$X_1 = \frac{|b|}{|a|}, \quad X_2 = \frac{|c|}{|a|}, \quad X_3 = \frac{|d|}{|a|}, \quad X_4 = \frac{|e|}{|a|}$$

Using the four sets of vector modulus ratio information in Eq. (16), the relative positions of the left and right hands to the head or the torso centre can be determined during movement. The modulus ratio information is also named uniformly: taking X1 as an example, by the vector naming rule b is head-left hand, and the denominator of every ratio is vector a, i.e., torso centre-head, so X1 can be named modulus ratio: head-left hand; the remaining ratios are named similarly.
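A minimal sketch of the four modal-ratio features, following the vector definitions above (a: torso centre to head; b, c: head to left/right hand; d, e: torso centre to left/right hand); the joint positions are illustrative:

```python
# Sketch of the modulus-ratio features X1..X4 from four joint positions.
import numpy as np

def modal_ratios(torso, head, l_hand, r_hand):
    a = np.linalg.norm(head - torso)      # torso centre -> head
    b = np.linalg.norm(l_hand - head)     # head -> left hand
    c = np.linalg.norm(r_hand - head)     # head -> right hand
    d = np.linalg.norm(l_hand - torso)    # torso centre -> left hand
    e = np.linalg.norm(r_hand - torso)    # torso centre -> right hand
    return b / a, c / a, d / a, e / a     # X1, X2, X3, X4

torso, head = np.array([0, 1.0, 0]), np.array([0, 1.6, 0])
l_hand, r_hand = np.array([-0.5, 1.7, 0.2]), np.array([0.4, 0.9, 0.1])
print(modal_ratios(torso, head, l_hand, r_hand))
```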

Human Pose Representation Vector

After the regularisation in the previous section, the skeletal data points are scale- and translation-invariant, so a human pose P can be represented directly by the combination of the vector angles and modulus ratios computed at that moment:
$$P = (m_1, m_2, \ldots, m_{24})$$

In Equation (17), each m_i is one of the angle values or vector modulus ratio values.

MSVM-based action recognition
Support vector machine algorithms

Support Vector Machine (SVM) is a classification method based on statistical learning theory. Human pose recognition is ultimately a nonlinear classification problem, so the support vector machine has clear advantages over other classification methods for digital IP character action recognition. The principle of the support vector machine algorithm is as follows:

Let there be a dataset {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)} with sample feature vectors x ∈ ℝᴰ, i.e., x is a vector in D-dimensional real space. The class labels y ∈ {−1, +1}, i.e., there are only two classes of samples: if xₙ belongs to the first class, then yₙ = 1; if xₙ belongs to the second class, then yₙ = −1.

If the data samples are linearly separable, the first goal is to find an optimal separating hyperplane to classify the samples:
$$w \cdot x + b = 0$$

In Equation (18), w is the normal vector and b is the intercept. The optimal hyperplane can be obtained by solving the following quadratic optimisation problem:
$$\min \Phi(w) = \frac{1}{2}\|w\|^2$$

The constraints to be satisfied by Eq. (19) are:
$$y_i(w \cdot x_i + b) \geq 1, \quad i = 1, 2, 3, \ldots, n$$

When the number of features is large, (19) can be transformed into its Lagrangian dual problem:
$$\max W(a) = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i,j=1}^{n} a_i a_j y_i y_j (x_i \cdot x_j), \qquad w^* = \sum_{i=1}^{n} a_i y_i x_i, \qquad b^* = y_i - w^* \cdot x_i$$

Eq. (21) satisfies the constraints
$$\sum_{i=1}^{n} a_i y_i = 0, \quad a_i \geq 0$$
where a = (a₁, …, aₙ) in Eq. (21) is the vector of Lagrange multipliers, and w* and b* are the normal vector and offset of the optimal hyperplane; the solution must also satisfy Eq. (22):
$$a_i\big\{ y_i(w \cdot x_i + b) - 1 \big\} = 0, \quad i = 1, 2, \ldots, n$$

From the expression for w*, samples with aᵢ = 0 play no role in the classification; only samples with aᵢ > 0 contribute. The classification function is shown in Equation (24), and the sign of f(x) finally determines which class x belongs to:
$$f(x) = \sum_{i=1}^{n} y_i a_i (x \cdot x_i) + b^*$$

When the samples are not linearly separable, a nonlinear mapping can be used to map the data sample x into a high-dimensional space H, where the original-space function is evaluated as an inner product, eventually making the samples linearly separable in the new space. This makes it possible to apply linear algorithms in the high-dimensional feature space to analyse the nonlinear features of the samples. The key, therefore, is to find a suitable inner-product (kernel) function for the optimal hyperplane, and the objective function becomes:
$$\max W(a) = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i,j=1}^{n} a_i a_j y_i y_j K(x_i, x_j)$$

The corresponding classification function is shown in (26), where K is the kernel function:
$$f(x) = \sum_{i=1}^{n} y_i a_i K(x, x_i) + b^*$$
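Once the support vectors, multipliers aᵢ, and offset b* are known, Eq. (26) can be evaluated directly. A minimal sketch with hand-picked toy values and a Gaussian kernel (introduced formally in the next section); in practice the multipliers come from solving the dual problem above:

```python
# Sketch of the kernel decision function f(x) = sum_i y_i a_i K(x, x_i) + b*.
import numpy as np

def rbf(x, xi, sigma=1.0):                 # sigma is an assumed bandwidth
    return np.exp(-np.linalg.norm(x - xi) ** 2 / (2 * sigma ** 2))

def decision(x, support_x, support_y, alphas, b):
    return sum(a * y * rbf(x, xi)
               for a, y, xi in zip(alphas, support_y, support_x)) + b

support_x = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]  # toy support vectors
support_y = [-1, +1]
alphas, b = [0.8, 0.8], 0.0                # hand-picked, not from a real solver
x_new = np.array([0.9, 1.1])
print(np.sign(decision(x_new, support_x, support_y, alphas, b)))  # class +1
```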

Multi-class support vector machine construction

The traditional binary support vector machine only provides a classification algorithm for two-class problems, whereas this paper needs to classify multiple poses, so it cannot meet the need to classify multiple digital IP character actions. The multi-class support vector machine (MSVM), developed from the traditional binary support vector machine, can solve the multi-class digital IP character action recognition problem, and this paper therefore adopts the MSVM classification method to classify and recognise digital IP character actions.

Since the data samples are nonlinear, a kernel function can simplify the inner product operation in the mapping space and map the nonlinear training data into a high-dimensional space; this paper mainly uses the Gaussian function as the kernel:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$

Let the sample to be tested be X and let there be n pose classes, denoted T = {P1, P2, ⋯, Pn}, where Pn denotes the n-th pose class, and initialise V(P1) = V(P2) = ⋯ = V(Pn) = 0. If the classifier for (P1, P2) judges X to be class P1, then V(P1) = V(P1) + 1; if it judges X to be class P2, then V(P2) = V(P2) + 1. The remaining binary classifiers (P1, P3), (P2, P3), ⋯, (Pn−1, Pn), making n(n−1)/2 in total, then vote on X in turn. Finally, the class achieving max(V(P1), V(P2), ⋯, V(Pn)) is taken as the classification result for X.
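The voting scheme maps directly onto n(n−1)/2 pairwise classifiers. A minimal sketch on synthetic 24-dimensional "pose vectors", with scikit-learn's SVC (RBF kernel) standing in for each binary SVM; the data generation is purely illustrative:

```python
# Sketch of one-vs-one MSVM voting: one binary SVM per class pair, each vote
# incrementing V(P_k) for the winning class.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, dim, per_class = 4, 24, 50      # e.g. 24-dim pose vectors
X = np.vstack([rng.normal(loc=3 * k, size=(per_class, dim))
               for k in range(n_classes)])
y = np.repeat(np.arange(n_classes), per_class)

pair_clfs = {}                             # train n(n-1)/2 pairwise classifiers
for i, j in combinations(range(n_classes), 2):
    mask = (y == i) | (y == j)
    pair_clfs[(i, j)] = SVC(kernel="rbf", gamma="scale").fit(X[mask], y[mask])

def predict(x):
    votes = np.zeros(n_classes, dtype=int)  # V(P_1..P_n), all start at 0
    for clf in pair_clfs.values():
        votes[clf.predict(x[None, :])[0]] += 1
    return int(np.argmax(votes))            # class achieving max V(P_k)

print(predict(X[0]), y[0])                  # both 0 on this toy data
```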

Experimental results and analyses

In this experiment, a customised dataset is used for evaluation. The data were collected in the half-body mode of the AstarPro camera and contain a total of seven basic actions. The constructed multi-class support vector machine model is analysed in the experiments for validation.

Experimental set-up

The customised dataset contains 7 actions: standing at rest (Idle), walking (Walk), running (Run), raising the left hand (LHU), raising the right hand (RHU), waving the hand to the left (WHL), and waving the hand to the right (WHR). A total of 10 subjects were captured using AstarPro and numbered 1-10; for cross-validation, the samples from subjects 1-5 were used as the training set and the samples from subjects 6-10 as the test set.

Model training results

The proposed network model is trained and tested on the customised dataset. The accuracy and loss curves of the training are shown in Fig. 4: as the number of iterations increases, the two accuracy curves converge to 1 and the two loss curves converge to 0, indicating that the training of the designed network model is effective.

Figure 4.

The training results of the MSVM model in the training set and the test set

Action recognition effect

During the training of this multi-class support vector machine action recognition model, a confusion matrix is used to evaluate the accuracy of the seven actions so as to observe the recognition effect for each category; the matrix compares the predicted and true values of each category during supervised learning. The confusion matrix for the customised dataset is shown in Fig. 5. Of the different actions recognised by the MSVM model, the classification of two actions, raising the left hand (LHU) and waving the hand to the left (WHL), is completely accurate, and the classification accuracy of all other actions is above 90%.

Figure 5.

The confusion matrix of the custom data set

For multi-classification problems, model recognition rate is generally evaluated by three criteria: precision P, recall R, and F1 value. The recognition results for each action type under model training according to these three metrics are shown in Figure 6. The precision, recall, and F1 of the multi-class support vector machine action recognition model are all greater than 92% for every action. The recognition rates for walking and running, which are recurring dynamic feature sequences, are lower than those of the other actions, while all three metrics reach 100% for raising the left hand (LHU) and waving the hand to the left (WHL). Overall, the proposed MSVM-based action recognition method for digital IP characters has a high recognition rate, which verifies the feasibility of the algorithm for somatic interaction with digital IP characters of Min culture.

Figure 6.

Recognition results for each action type under model training
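Per-class precision, recall, and F1 follow directly from the confusion matrix. A minimal sketch (rows are true classes, columns are predicted classes); the matrix values are illustrative, not the paper's Fig. 5 results:

```python
# Sketch of per-class precision, recall and F1 from a confusion matrix.
import numpy as np

def per_class_metrics(cm):
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)       # TP / (TP + FP)
    recall = tp / cm.sum(axis=1)          # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

cm = np.array([[48, 2, 0],
               [1, 47, 2],
               [0, 3, 47]])               # illustrative 3-class matrix
p, r, f1 = per_class_metrics(cm)
print(np.round(p, 3), np.round(r, 3), np.round(f1, 3))
```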

Conclusion

The rich and colorful Min culture has heritage and artistic value and is a cultural treasure passed down through history. This project combines AR and VR technologies to study the design of digital IP characters for Min cultural heritage, analyses the design process of virtual digital IP characters, applies deep learning algorithms to build the two modules of character image generation and action recognition, and sets up experiments to explore the actual performance of the models.

The LAFITE model was used to build a digital IP image generation model for Min culture. The experiments found that the images generated by the LAFITE model had better quality and diversity, with the FID and IS metrics improved by 30%~316% and 6%~48%, respectively, compared with the other models, and the peak signal-to-noise ratio of the output Min cultural heritage digital IP characters exceeded 70 dB, higher than that of the other models. This shows that the constructed LAFITE model can generate high-quality digital IP character images of Min culture.

With the help of the Kinect platform for the pose representation of digital IP characters, a multi-class support vector machine is used to construct the action recognition model. The model has a high recognition rate for the seven actions, with precision, recall, and F1 values above 93%, and the recognition accuracy of two actions reaches 100%, which is conducive to the interactive presentation of the digital IP characters.

Creating digital IP characters of Min cultural heritage with virtual reality and augmented reality technology offers a visual, intuitive, interactive, and experiential presentation in the form of virtual character IP, which is conducive not only to the digital preservation and dissemination of Min culture but also to its inheritance and innovation among audiences.
