Construction of a Dynamic Interaction System for Digital Media Art Incorporating Affective Computing and Graph Neural Networks
Published Online: Mar 19, 2025
Received: Oct 12, 2024
Accepted: Jan 30, 2025
DOI: https://doi.org/10.2478/amns-2025-0472
© 2025 Tianxing Chen, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Humanity's advance into the digital era is unstoppable, and for artistic creation the integration of digital media technology promises striking effects. At the same time, we should recognize that while digital technology is profoundly reshaping artistic expression, it is also being shaped by traditional forms of artistic expression [1–2]. In China, digital media art has evolved from the practical application of basic functions to an initial, recognizable presentation of digital media art in its own right. At this stage it is practiced in an ever wider range of creative and applied fields, and its modes of expression have become increasingly diverse, steadily pushing digital media art toward maturity [3–5].
Digital media art refers to creative art that relies on computer and information technology and integrates research results from communication and information science, image editing and processing, the biological sciences, and other fields [6–7]. It mainly comprises the following forms: read-only CD-ROM art, digitized images, Internet art, virtual-reality artworks, digital audio works, virtualized bioscience works, and synthesized digital art [8–9]. Through integration with different disciplines and artistic factors, digital media art thus derives new forms of media expression, which both inherit traditional art and contribute innovation to the future development of art [10–11].
The dynamic elements of digital media art manifest themselves mainly in two respects. First, dynamic elements have gained wide application in promoting cultural dissemination, and technological innovation continues to open up greater room for breakthroughs in digital dynamics [12–13]. Computer animation concentrates this use of digital dynamic elements: it can not only reproduce analog dynamic segments but also innovate on that basis [14]. Second, digital dynamic elements are widely used in film and television production; it is their active participation that allows action to unfold continuously, so creators can not only reproduce action but also create new action [15–16].
Applying digital media art from an interaction perspective can effectively enhance user participation, immersion, and personalized experience, thereby improving both the artistic expressiveness and the viewing experience of digital media art [17–18]. Interaction design makes works more participatory: the audience can engage with a work through their own actions, sounds, touch, and other behaviors. This interactivity not only enriches the relationship between audience and work but also offers the pleasure of participation and exploration, making viewing more active and interesting. Through interaction design, works can provide a more immersive experience: by interacting with a work, viewers gain a deeper sense of involvement, the viewing process becomes more immersive, and the expressiveness and emotional force of the work are enhanced. Interaction design can also adjust a work's presentation according to the audience's behavior and feedback, so that each viewer receives a personalized experience; viewers influence the presentation of the work through their own behavior and thus obtain a unique viewing experience, strengthening the emotional resonance between artwork and audience. Finally, interaction design offers artists new methods of creation and forms of expression, allowing digital media artworks to present richer and more complex artistic expression; artists can create challenging and avant-garde works through interaction design, thereby driving the innovation and development of digital media art [19–22].
In conclusion, studying digital media art design from the perspective of interaction realizes its strong interactivity and immersion, allowing viewers to have richer and deeper artistic experiences, while also giving artists a broader creative space and expressive carrier and pushing digital media art toward a more diversified, open, and avant-garde field [23–24].
In this paper, we design a deep temporal modeling network for multimodal sentiment analysis to address the limited ability of RNNs, their variants, and the Transformer to model the time series of modal vectors. The deep temporal modeling network extracts the temporal correlations of modal vectors in depth and captures long-range dependencies within them. Emotion information that depends on temporal characteristics is modeled by a global attention local loop module and a text syntax graph convolution module, reflecting the coherence with which people express their emotions, and sentiment analysis and emotion state transfer experiments are conducted in the context of digital media art.
The multimodal sentiment analysis flow is shown in Fig. 1 and is divided into a training phase and a testing phase. The training phase starts by extracting features from the text, audio, and visual data of the training set; the extracted features of each modality are then used to train the multimodal sentiment analysis model, after which the model parameters are fixed for testing. In the testing phase, the same feature extraction tools as in the training phase are applied to the text, audio, and visual data of the test set, and the extracted features of each modality are fed into the multimodal sentiment analysis model with fixed parameters to output the multimodal sentiment prediction results.

Flow chart of multimodal sentiment analysis
Both BERT and GloVe are important tools and techniques in the field of natural language processing for representing and processing textual data [25].
BERT is a deep bidirectional representation model based on the Transformer architecture, proposed by Google, and has been highly successful in natural language processing. BERT acquires generic text representations by pre-training on large-scale text data, which can then be fine-tuned for downstream tasks such as question answering, text categorization, and named entity recognition. BERT introduces two tasks in the pre-training phase, masked language modeling and next-sentence prediction, in order to better capture the contextual information in text; because it encodes words in bidirectional context, it can better understand meaning within a sentence. GloVe is an unsupervised learning algorithm for generating word vectors, developed by researchers at Stanford University. It represents words as real-valued vectors by transforming global co-occurrence statistics of a corpus into word vectors. The basic idea of GloVe is to exploit the global word-word co-occurrence matrix and train on it to obtain a vector representation of each word. These vectors capture semantic relationships between words: semantically similar words lie closer together in the vector space. The vectors generated by GloVe can therefore be used for various natural language processing tasks, such as text categorization and semantic matching.
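As a minimal sketch of the text-encoding step, assuming the HuggingFace transformers implementation of BERT (the toolkit is not specified in the paper), the following extracts the last-layer head-token ([CLS]) vector that can serve as an initial sentence representation:

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumption: the bert-base-uncased checkpoint; the paper only states
# that a pre-trained BERT encoder is used.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "The interactive installation felt surprisingly moving."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

# Head-token ([CLS]) encoding from the last layer, used as the initial text vector.
text_vector = outputs.last_hidden_state[:, 0, :]   # shape: (1, 768)
```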
COVAREP and LibROSA are two important tools used in speech signal processing for acoustic feature extraction and audio analysis, respectively. COVAREP is an open-source acoustic feature extraction toolkit for extracting relevant acoustic features from speech signals, including fundamental frequency, sound intensity (energy), sound clarity (acoustic signal-to-noise ratio), and other parameters related to voice quality and speech characteristics. COVAREP also provides tools and methods for acoustic modeling of speech signals, which can be used in speech synthesis, speech recognition, emotion recognition, and related fields. LibROSA is a Python library for audio signal analysis and processing. It provides rich functionality, including loading audio files, extracting features, performing spectral analysis, and carrying out time-series operations. LibROSA can be used for preprocessing audio data, feature extraction, and the visualization and analysis of audio data, and it is widely used in speech recognition, music information retrieval, audio processing, and other fields.
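A minimal LibROSA sketch of the kind of acoustic features discussed above (fundamental frequency, energy, spectral features); the file name and feature choices are illustrative assumptions, and COVAREP itself, being a MATLAB/Octave toolkit, is not shown:

```python
import librosa
import numpy as np

# "clip.wav" is an assumed example file.
y, sr = librosa.load("clip.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # spectral envelope features
rms = librosa.feature.rms(y=y)                            # sound intensity (energy)
f0, voiced_flag, voiced_prob = librosa.pyin(              # fundamental frequency contour
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Stack frame-level features into a (time_steps, feature_dim) acoustic sequence.
T = min(mfcc.shape[1], rms.shape[1], len(f0))
acoustic_seq = np.vstack([mfcc[:, :T], rms[:, :T], np.nan_to_num(f0)[None, :T]]).T
```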
FACET and OpenFace are both tools for facial analysis and emotion recognition; they can analyze facial expressions in images or videos and extract features describing expressions, emotions, and facial movements. FACET is a facial expression analysis tool for recognizing and analyzing expressions in still images or videos. It is based on facial action coding theory and can recognize and track facial actions in images and map these actions to the corresponding expressions. FACET can extract rich facial features, such as eye blinks and mouth smiles, and then analyze these features to infer the emotional state of the face, such as happiness, sadness, or anger. FACET is commonly used in psychological research, marketing, user experience, and other fields to help understand people's emotions and reactions in different situations.
OpenFace is an open-source facial recognition and emotion recognition toolkit for analyzing facial expressions in images or videos. It provides a range of algorithms and tools for face detection, keypoint localization, face tracking, and emotion recognition. OpenFace is characterized by efficiency, accuracy, and a wide range of application areas; it can be used in human-computer interaction, facial authentication, emotion analysis, psychological research, and other fields. Its main capabilities, face detection, keypoint localization, face tracking, and facial expression recognition, help users better understand the information and emotional state conveyed by facial expressions.
Conventional neural networks can only process regular Euclidean-structured data, in which nodes have a fixed arrangement rule and order, such as 2-dimensional grids and 1-dimensional sequences. However, non-Euclidean data must be handled in many practical situations. Non-Euclidean data structures have no fixed arrangement rule or node order, so traditional deep learning models cannot be directly transferred to the task of processing them. If a convolutional neural network is applied directly, a convolution kernel cannot be defined on non-Euclidean data, because the neighbors of a central node are neither fixed in number nor in order, so translation invariance is not satisfied. Graphs, consisting of nodes and edges, are typical non-Euclidean data, and in practice various non-Euclidean problems can be abstracted into graph structures and then solved with graph neural networks (GNNs). For example, in transportation systems, graph-based network models can make efficient predictions from real-time road condition information; in computer vision, human-object interaction can be viewed as a graph structure to effectively categorize and recognize different parts of an image. A graph contains nodes, edges, and the overall graph structure, and GNN processing tasks are correspondingly defined at the node level, the edge level, and the whole-graph level.
The theoretical basis of GNNs is the fixed point theorem, also known as the contraction mapping theorem or contraction mapping principle. It is an important tool in the theory of metric spaces: it guarantees the existence and uniqueness of fixed points of certain self-mappings of a metric space and provides a constructive method for finding them [26].
The fixed point theorem reads as follows: let $(X, d)$ be a complete metric space and $T: X \to X$ a contraction mapping, i.e., there exists a constant $c \in [0, 1)$ such that $d(T(x), T(y)) \le c \, d(x, y)$ for all $x, y \in X$. Then $T$ has a unique fixed point $x^{*}$ satisfying $T(x^{*}) = x^{*}$, and for any starting point $x_{0}$ the iteration $x_{n+1} = T(x_{n})$ converges to $x^{*}$.
To apply the fixed point theory to graph neural networks, first denote by $H$ the matrix stacking the state vectors of all nodes, by $X$ the matrix of node and edge features, and by $F$ the global transition function that updates all node states simultaneously, so that $H^{t+1} = F(H^{t}, X)$.
At this point the fixed point theory means that, provided $F$ is a contraction mapping with respect to $H$, the iteration converges to a unique fixed point $H^{*} = F(H^{*}, X)$ regardless of the initial states, and this $H^{*}$ is taken as the final node representation.
Feeding a graph into a GNN to obtain a result requires two steps. The first is the propagation process, in which node states are updated over iterations; the second is the output process, in which the target output is produced from the final node representations. The most important is the propagation process, where it must be ensured that the state mapping on the entire graph is a contraction mapping.
The propagation process of a GNN obtains node representations through iteration. The vector representation of node $v$ is initialized as $h_{v}^{0}$ and is updated at each step from the node's own features, the features of its incident edges, and the states and features of its neighbors:
$$h_{v}^{t+1} = f\big(x_{v}, x_{co[v]}, h_{ne[v]}^{t}, x_{ne[v]}\big),$$
where $x_{v}$ denotes the features of node $v$, $x_{co[v]}$ the features of its incident edges, $h_{ne[v]}^{t}$ the states of its neighbors at step $t$, and $x_{ne[v]}$ the features of its neighbors. Provided the update $f$ defines a contraction on the whole graph, the iteration converges to the final node representation $h_{v}$.
In the output process of an ordinary GNN a simple differentiable function is used. For a node-level task, where the outputs of the nodes are independent of each other, the final node representation is mapped to the corresponding output by an output function $g$, i.e., $o_{v} = g(h_{v}, x_{v})$.
In the case of graph-level tasks, a supernode is set up that converts the graph-level task into a node-level one; the supernode is normally connected to all nodes by edges of a specific type, and the final representation of the supernode is then used to accomplish the graph-level task.
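To make the propagation and output processes concrete, here is a small self-contained sketch (not the original GNN implementation) that iterates a toy transition function F(H, X) until the node states stop changing and then reads out a graph-level representation; all dimensions, the damping factor, and the random weights are illustrative assumptions:

```python
import numpy as np

def propagate(adj, x, dim=8, damping=0.5, tol=1e-6, max_iter=200):
    """Fixed-point style propagation: iterate H <- F(H, X) until convergence."""
    n = adj.shape[0]
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(x.shape[1] + dim, dim))   # shared transition weights
    h = np.zeros((n, dim))                                    # initial node states
    deg = adj.sum(1, keepdims=True).clip(min=1)
    for _ in range(max_iter):
        neigh = adj @ h / deg                                 # aggregate neighbour states
        # Small weights and damping < 1 make F behave as a contraction in practice,
        # so the iteration settles on a fixed point H*.
        h_new = damping * np.tanh(np.hstack([x, neigh]) @ w)
        if np.abs(h_new - h).max() < tol:
            return h_new
        h = h_new
    return h

# Toy graph: 4 nodes on a path, 3-dimensional input features per node.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
x = np.eye(4)[:, :3]
node_states = propagate(adj, x)       # final node representations (node-level tasks)
graph_state = node_states.mean(0)     # simple read-out for graph-level tasks
```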
The Graph Attention Network (GAT) uses the attention mechanism to aggregate neighboring nodes and can automatically learn the importance of different neighbors. Compared with the Graph Convolutional Network (GCN), in which the importance of neighbors is fixed, GAT is considerably more flexible and in some cases can better improve the expressive power of graph neural network models.
The attention mechanism effectively allocates attention over information, assigning more attention to more important information. Query represents a condition or prior information, Source is the information to be processed by the system, and the Attention Value is the information extracted from Source by the attention mechanism given the Query. The information inside Source can be represented as key-value pairs. The attention computation can then be defined as equation (6):
$$\text{Attention}(\text{Query}, \text{Source}) = \sum_{i} \text{Similarity}(\text{Query}, \text{Key}_{i}) \cdot \text{Value}_{i} \quad (6)$$
The similarity in Eq. (6) measures how close Query and Key are; a common method is to take their inner product, and the closer the normalized inner product is to 1, the more the Query and Key overlap and the higher the similarity. In short, the attention mechanism is a weighted sum over all Value vectors, with the weights given by the similarity between the Query and the corresponding Keys.
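A short numerical sketch of Eq. (6): similarities between the Query and each Key are turned into weights (softmax normalization is a common choice and an assumption here) and used to form a weighted sum of the Values:

```python
import numpy as np

def attention_value(query, keys, values):
    """Weighted sum of Values, weights given by Query-Key similarity as in Eq. (6)."""
    scores = keys @ query                      # inner-product similarity
    weights = np.exp(scores - scores.max())    # softmax normalization (assumed choice)
    weights /= weights.sum()
    return weights @ values                    # the Attention Value

# Toy Source with 3 key-value pairs and a 4-dimensional Query.
keys = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0]])
values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
print(attention_value(query, keys, values))   # dominated by keys similar to the query
```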
To introduce the attention mechanism into the neighbor aggregation of a graph neural network, let the feature vector of node $i$ be $h_{i} \in \mathbb{R}^{F}$, where $F$ is the feature dimension.
Assume the current central node is $i$ and let $j \in \mathcal{N}_{i}$ denote one of its neighbors.
The feature vectors of nodes $i$ and $j$ are first transformed by a shared weight matrix $W$, and the unnormalized attention coefficient between them is computed and then normalized over the neighborhood:
$$e_{ij} = \text{LeakyReLU}\big(a^{\top}[W h_{i} \,\|\, W h_{j}]\big), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_{i}} \exp(e_{ik})},$$
where $a$ is a learnable attention vector and $\|$ denotes concatenation.
Following the weighted-summation idea of the attention mechanism, the updated feature vector $h_{i}'$ of the central node is computed as
$$h_{i}' = \sigma\Big(\sum_{j \in \mathcal{N}_{i}} \alpha_{ij} W h_{j}\Big),$$
where $\sigma$ is a nonlinear activation function.
To further enhance the expressive power of the attention layer, a multi-head attention mechanism can be used, i.e., $K$ independent attention heads are computed in parallel and their outputs are concatenated (or averaged in the final layer), as illustrated in the sketch below.
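A compact PyTorch sketch of a single-head graph attention layer and its multi-head extension under the standard GAT formulation; the feature dimensions, the LeakyReLU slope, and the toy path graph are illustrative assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer in the spirit of GAT (a sketch, not the paper's code)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared linear transform W
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scoring vector a
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, h, adj):
        # h: (N, in_dim) node features, adj: (N, N) adjacency with self-loops
        wh = self.W(h)                                            # (N, out_dim)
        n = wh.size(0)
        # e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for every node pair
        pairs = torch.cat([wh.unsqueeze(1).expand(n, n, -1),
                           wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = self.leaky_relu(self.a(pairs)).squeeze(-1)            # (N, N)
        e = e.masked_fill(adj == 0, float("-inf"))                # keep only real neighbours
        alpha = F.softmax(e, dim=-1)                              # normalised attention alpha_ij
        return F.elu(alpha @ wh)                                  # weighted aggregation

# Multi-head variant: run K independent heads and concatenate their outputs.
heads = nn.ModuleList([GATLayer(16, 8) for _ in range(4)])
h = torch.randn(5, 16)
adj = torch.eye(5) + torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
out = torch.cat([head(h, adj) for head in heads], dim=-1)         # (5, 32)
```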
In the multimodal sentiment analysis task, the input to the model is the original unimodal sequence $X_{m} \in \mathbb{R}^{T_{m} \times d_{m}}$ for each modality $m \in \{t, a, v\}$ (text, acoustic, visual), where $T_{m}$ is the sequence length and $d_{m}$ the feature dimension of modality $m$.
In this section, the main components of the sentiment analysis model based on deep temporal modeling network are introduced, which consists of three main modules: the global attention local loop module, the text syntax graph convolution module, and the multimodal adaptive fusion module. The main structure of the sentiment analysis model based on a deep temporal modeling network is shown in Figure 2.

Emotion analysis model based on deep temporal modeling network
First, the raw acoustic and visual data are processed into numerical sequence vectors using feature extractors (the visual and acoustic extractors are fixed, with no trainable parameters). For the text modality, sentences are input into the pre-trained BERT model for encoding, and the head-token encoding from the last layer is extracted as the initial text vector. After that, the visual and acoustic modal vectors capture long-distance dependencies among their internal elements through the global attention local loop module. For the textual modal vectors, the Spacy toolkit is first used to construct a syntactic dependency tree, which transforms the initially encoded text vectors into a tree structure and establishes grammatical links between words. These links in each sentence are used to construct a word-grammar adjacency matrix, turning the tree structure into a graph structure, and the long-range dependencies within the text modal vectors are then learned by a multilayer graph convolutional network. Finally, after the feature vectors of each modality are obtained, the multimodal adaptive fusion module assigns different weights to the three modalities to fuse them, and the fused feature vector is used to predict the final sentiment polarity.
For acoustic and visual modalities, after transforming them into numerical sequence vectors using a fixed feature extractor, the sentiment analysis model based on a deep temporal modeling network extracts long-range dependencies within the acoustic and visual vectors through local and global sequence modeling due to the rich time-series relationships embedded in the acoustic and visual modalities. The global attention local loop module returns a tensor with the same dimension as the input vectors. The global attention local loop module consists of two main computational stages, the local loop layer and the global attention layer, which are used to combine short-range temporal sequence modeling with long-range temporal sequence modeling, and the specific operations of each layer are described in detail below.
The local loop layer uses a unidirectional LSTM model for the modal vectors [27]. It mainly models the local information of the input sequence and models the short-term dependencies within the modal vectors.
In order to preserve each modality's own affective properties when extracting multimodal feature vectors, this paper uses residual connections to nonlinearly transform the modal vector representations. The residual connection consists of a summation operation and a layer normalization operation. The summation promotes the flow of information and the propagation of gradients in the model and improves training without introducing extra parameters or computation; layer normalization alleviates the vanishing-gradient problem in deep networks and improves the convergence speed and stability of the model. The formula is expressed as follows:
$$H_{\text{local}} = \text{LayerNorm}\big(X + \text{LSTM}(X)\big),$$
where $X$ is the input modal vector sequence and $\text{LSTM}(\cdot)$ denotes the output of the local loop layer.
A global attention layer is built on top of the local loop layer to capture long-term dependencies. Recent work in related directions has shown the good performance of the attention mechanism in learning long-term dependencies. The main reasons for choosing the multi-head attention mechanism in this model are as follows. First, the computational burden of the attention model is small, and the sequence length can be controlled by varying the size of the segmentation window. Second, with multi-head attention, the global dependencies between segments can be modeled directly, without memorizing the segments one by one as an RNN must. Finally, the output of the multi-head attention layer contains encoded representation information across the individual segments, which greatly enhances the expressive power of the model. Before entering the global attention layer, the output of the local loop layer is first processed as follows:
The output of the local loop layer is linearly projected into query, key, and value matrices $Q$, $K$, and $V$. In the case of a self-attention mechanism, $Q$, $K$, and $V$ all come from the same input, i.e., $Q = K = V = H_{\text{local}}$.
A subspace is defined as a segment with feature dimension $d_{k} = d / h$, where $d$ is the model dimension and $h$ the number of heads; each head performs scaled dot-product attention in its own subspace and the heads are concatenated:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V, \qquad \text{MultiHead}(Q, K, V) = \big[\text{head}_{1} \,\|\, \cdots \,\|\, \text{head}_{h}\big]W^{O}.$$
Similarly, in order not to lose the affective properties of the modality itself, residual connections are also used at the global attention layer to nonlinearly transform the modal vector representation:
$$H_{\text{global}} = \text{LayerNorm}\big(H_{\text{local}} + \text{MultiHead}(Q, K, V)\big).$$
The final output of the global attention local loop module is the residual sum of the local model output and the global model output.
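The following PyTorch sketch illustrates how the pieces described above can be combined: a unidirectional LSTM as the local loop layer, multi-head self-attention as the global attention layer, residual Add & LayerNorm around each, and a residual sum of the two outputs that keeps the input shape. Layer sizes and head counts are assumptions, not the paper's hyperparameters:

```python
import torch
import torch.nn as nn

class GlobalAttentionLocalLoop(nn.Module):
    """Sketch of the global-attention local-loop idea (illustrative dimensions)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)            # local loop layer
        self.norm_local = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_global = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, dim) acoustic or visual modal vectors
        local, _ = self.lstm(x)
        local = self.norm_local(x + local)                         # Add & LayerNorm
        glob, _ = self.attn(local, local, local)                   # self-attention: Q = K = V
        glob = self.norm_global(local + glob)                      # Add & LayerNorm
        return local + glob                                        # residual sum, same shape as x

module = GlobalAttentionLocalLoop(dim=32)
out = module(torch.randn(8, 50, 32))                               # (8, 50, 32)
```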
Generating textual representations with recurrent neural networks equipped with various attention mechanisms has achieved good results, but the complex syntactic relationships inside sentences make attention mechanisms susceptible to noise. Considering that the syntactic structure of sentences has a distinctly graph-like character, the newly emerging graph neural networks can capture long-range dependencies within sentences more accurately than temporal networks and the Transformer model.
Dependency syntax trees establish connections between the words of a sentence, thereby building a graph structure over the words. Dependency syntax theory assumes that there are syntactic links between the words of a sentence and that these links are constrained: the dominance relationship between a head word and its dependents is represented as a tree structure. In a dependency syntax tree, each lexical node has one and only one parent, and all nodes form a directed tree. The theory thus reduces the syntactic relation between words to a binary, asymmetric relation.
To encode the syntactic information in a sentence, we use the probability matrix of dependency arcs produced by the dependency parser. The dependency probability matrix can capture rich structural information by retaining all potential syntactic arcs. Our goal is to use this explicit syntactic information to encourage the model to learn a syntax-aware representation.
The Spacy toolkit is first used to construct a dependency syntax tree for each sentence, and an adjacency matrix $A$ is obtained from the syntactic relationships between words: $A_{ij} = 1$ if a dependency arc connects words $i$ and $j$ (with self-loops $A_{ii} = 1$), and $A_{ij} = 0$ otherwise. A multilayer graph convolutional network then updates the word representations over this graph, with layer $l+1$ computed as
$$H^{(l+1)} = \sigma\big(\tilde{A} H^{(l)} W^{(l)}\big),$$
where $\tilde{A}$ is the normalized adjacency matrix, $W^{(l)}$ the layer's weight matrix, and $\sigma$ a nonlinear activation function.
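A small sketch of how a word-grammar adjacency matrix can be built with the Spacy toolkit and passed through a plain graph-convolution layer; the English model name, the degree normalization, and the dimensions are illustrative assumptions rather than the paper's exact setup:

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def dependency_adjacency(sentence):
    """Symmetric word-grammar adjacency matrix from the dependency tree (with self-loops)."""
    doc = nlp(sentence)
    adj = np.eye(len(doc))
    for token in doc:
        if token.i != token.head.i:            # skip the root's self-arc
            adj[token.i, token.head.i] = 1.0
            adj[token.head.i, token.i] = 1.0
    return adj

def gcn_layer(adj, h, w):
    """One graph-convolution layer: degree-normalised neighbour aggregation, ReLU(Â H W)."""
    deg = adj.sum(1, keepdims=True)
    return np.maximum(0.0, (adj / deg) @ h @ w)

adj = dependency_adjacency("The installation reacts to the visitor's gestures.")
h = np.random.default_rng(0).normal(size=(adj.shape[0], 16))    # initial word vectors
w = np.random.default_rng(1).normal(scale=0.1, size=(16, 16))
h1 = gcn_layer(adj, h, w)                                       # word vectors after one layer
```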
Each of the three modalities, textual, visual, and acoustic, in the multimodal vectors is rich in emotional information and has a different impact on the final emotion score prediction. Therefore, we should allow each modality to contribute differently to the final sentiment polarity prediction result by assigning a weight to each modal feature vector. To alleviate the limitations imposed by mean-based aggregation, a more intuitive solution, inspired by the attention mechanism, is to assign a personalized weight to each modal feature vector.
In the multimodal adaptive fusion module, a two-layer neural network is used to model the attention score $s_{m}$ of each modality $m \in \{t, a, v\}$ from its feature vector $h_{m}$:
$$s_{m} = W_{2}\,\tanh\big(W_{1} h_{m} + b_{1}\big) + b_{2},$$
where $W_{1}$, $W_{2}$, $b_{1}$, and $b_{2}$ are learnable parameters shared across the modalities.
The final attention weights are obtained by normalizing the above attention scores with the Softmax function:
$$\alpha_{m} = \frac{\exp(s_{m})}{\sum_{m' \in \{t, a, v\}} \exp(s_{m'})}.$$
Finally, the fused feature vector of the three modalities is obtained as the weighted sum $h_{\text{fused}} = \sum_{m \in \{t, a, v\}} \alpha_{m} h_{m}$, which is fed into a fully connected layer to predict the final sentiment score.
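A minimal PyTorch sketch of the adaptive fusion idea: a shared two-layer scoring network, Softmax normalization over the three modalities, and a weighted sum feeding a prediction head; the hidden size and the regression head are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Sketch of multimodal adaptive fusion (illustrative dimensions, not the paper's code)."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        # Two-layer network that scores each modal feature vector.
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.head = nn.Linear(dim, 1)          # regression head for the sentiment score

    def forward(self, h_text, h_audio, h_visual):
        h = torch.stack([h_text, h_audio, h_visual], dim=1)    # (batch, 3, dim)
        weights = F.softmax(self.score(h), dim=1)              # per-modality attention weights
        fused = (weights * h).sum(dim=1)                       # weighted fusion
        return self.head(fused), weights.squeeze(-1)

fusion = AdaptiveFusion(dim=32)
pred, w = fusion(torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 32))
```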
The technical framework of this system combines a three-tier architecture with Web standards, object-oriented design, and Web 2.0 techniques. The three-tier architecture is shown in Figure 3.

System structure
Database design of the system. The back-end development tool of the system database is Microsoft SQL Server 2008; the database part uses the database management functions of SQL Server 2008 to create the various tables, views, stored procedures, triggers, and user account information. The tables include a user table, a user rights table, a message table, and so on. Database analysis and creation proceed as follows:
Work in the analysis phase: determine the functions of the system modules, reorganize the data with a relational database, and design the relational tables so that they maintain the necessary relationships with one another while achieving an optimal structure, avoiding data redundancy and problems that are difficult to manage and maintain. After the analysis of the data and data structures is completed, the database management functions built into SQL Server 2008 are used to create the corresponding tables and their relational structure, and to complete the views, constraints, indexes, rules, default values, triggers, and stored procedures of each part of the system. The creation process is as follows: label the data in the system; add the labeled fields to the data tables; identify the primary key fields in each table; create the data diagrams; create normalized data sets according to the normal forms; label the field information; label the relationships between tables to create foreign key fields; create the physical tables; and administer the database.
Add SQL Server and ASP.NET users and roles, authorize roles and permissions for users, and create groupings so that users can be divided into groups and managed by group. Create the security management specification and design the stored procedures and triggers. Optimize the configuration of SQL Server.
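The system itself is implemented with ASP.NET and SQL Server 2008; purely as an illustration of the kind of user table and parameterized access the data layer exposes, here is a hedged Python/pyodbc sketch in which the connection string, table name, and columns are all assumptions:

```python
import pyodbc

# Illustrative only: the actual system uses ASP.NET/ADO.NET, and the connection
# string is stored encrypted in the configuration file. Table and column names
# here are assumptions based on the tables listed above (user, rights, message).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=DigitalMediaArt;UID=app_user;PWD=***")
cur = conn.cursor()

# Create an assumed user table if it does not exist yet.
cur.execute("""
IF OBJECT_ID('dbo.Users', 'U') IS NULL
CREATE TABLE dbo.Users (
    UserId   INT IDENTITY PRIMARY KEY,
    UserName NVARCHAR(64) NOT NULL UNIQUE,
    PwdHash  VARBINARY(64) NOT NULL,
    RoleId   INT NOT NULL
)""")
conn.commit()

# Parameterised query: authorisation lookup for a given user.
cur.execute("SELECT RoleId FROM dbo.Users WHERE UserName = ?", ("alice",))
row = cur.fetchone()
```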
According to the objectives this system must achieve, its security policy is implemented in program code: authentication and authorization of users are handled by the application code.
Database authentication strategy
The client side of this system's application is authenticated by the SQL Server database when it connects to the database server. CAPICOM technology is applied to secure the database connection string, which is stored in the system configuration file.
In scientific research the construction of datasets plays a key role: variability among samples directly affects the output of model algorithms, so careful selection of training data is critical if a deep learning model is to achieve high accuracy. With the rapid development of multimodal sentiment analysis, the field has attracted extensive research interest in recent years and many high-quality datasets have emerged. To ensure the validity and comparability of the results, it is necessary to select high-quality datasets used by most researchers; in this context, this study selects two authoritative sentiment analysis datasets, CMU-MOSI and CMU-MOSEI, for experimental validation. The CMU-MOSI dataset covers the evaluations of 89 respondents, mostly around 25 years old, of their experiences with a variety of products; it contains video, audio, and corresponding transcribed text, with 2,199 samples for each modality.
The CMU-MOSEI dataset, in turn, expands on CMU-MOSI: it contains 22,852 annotated video clips from 1,000 different speakers covering 250 different topics, collected from online video-sharing sites. Its training, validation, and test sets contain 16,322, 1,871, and 4,659 samples, respectively. It is worth noting that the sentiment scores of both datasets lie in the interval [-3, 3], which provides a standardized evaluation scale for training and evaluating sentiment analysis models.
Figure 4 shows the basic information of the CMU-MOSEI dataset. The total number of dialogs is over 21,000.

Basic information of the CMU-MOSEI dataset
Figure 5 shows the basic information of the CMU-MOSI dataset. The least number is 94 for neutral attitude and the most is 489 for positive.

Basic information of the CMU-MOSI dataset
In order to better compare the experimental effects of the designed model, the following baseline models are used alongside the constructed multimodal sentiment analysis model in the experiments:
TFN: a tensor fusion network that captures intra- and inter-modality dynamics end-to-end; this tensor-based fusion strategy shows strong performance in both multimodal and unimodal sentiment analysis tasks.
LMF: a feature-level fusion method based on bilinear pooling theory that significantly improves multimodal sentiment recognition compared with simple concatenation operations.
MULT: adopts a specially designed pairwise cross-modal attention strategy to realize interaction between modalities, converting information from one modality into the form of another to achieve inter-modal transformation and understanding.
MISA: a modality-invariant and modality-specific representation learning approach that improves comprehension and performance by capturing both shared and unique features among modalities.
ICCN: uses Deep CCA to deeply associate text with audiovisual content and mine the connections between them, which not only enhances the model's comprehension but also offers a new perspective on complex multimodal tasks.
CMJRT: a Cross-Modal Joint Representation Transformer framework that realizes hierarchical inter-modal interaction by transferring joint bimodal representations to the unimodal level, learning inter-modal consistency while capturing intra-modal specificity.
In order to make a fair comparison, this study retrains the baseline models on the same feature datasets, compares the model constructed in this study with these baselines in the following experiments, and then analyzes the comparison results.
In this study we construct a multimodal sentiment analysis model with multilevel attention-enhanced fusion, combining the attention mechanism with two maximization constraints. To ensure the accuracy and efficacy of the model, a series of training and testing experiments was conducted on two authoritative multimodal sentiment analysis datasets, CMU-MOSI and CMU-MOSEI, which not only cover a rich range of sentiment expressions and multimodal information but also provide standardized sentiment scores, giving a reliable basis for performance evaluation. Table 1 compares the performance of the models on the CMU-MOSEI dataset and shows the effectiveness of the proposed model in the multimodal sentiment analysis task.
Performance comparison results of each model on the CMU-MOSEI dataset
| Model | MAE | Corr | Acc-7 (%) | Acc-2 (%) | F1 (%) |
|---|---|---|---|---|---|
| TFN | 0.645 | 0.746 | 47.7 | 77.8 | 78.2 |
| LMF | 0.661 | 0.684 | 48.4 | 79.0 | 80.1 |
| MULT | 0.614 | 0.731 | 50.3 | 82.6 | 83.4 |
| MISA | 0.565 | 0.768 | 51.9 | 83.8 | 83.5 |
| ICCN | 0.612 | 0.734 | 51.2 | 84.0 | 83.5 |
| CMJRT | 0.646 | 0.818 | 51.1 | 84.8 | 84.6 |
| This model | 0.526 | 0.872 | 54.4 | 86.7 | 87.2 |
As can be seen from Table 1, compared with TFN and LMF the proposed model improves the F1 score by 9% and 7.1%, respectively, raises the Acc-2 index by 8.9% and 7.7%, and gains 6.7% and 6% on the Acc-7 index. This indicates that the traditional data cascade fusion method has limitations in feature-level modal fusion and cannot parse and fully exploit the contextual relationships between modalities. To achieve more refined and hierarchical information fusion, cross-attention mechanisms such as those in MULT, MISA, and ICCN are needed; they can identify and strengthen the key semantic connections in different modalities and focus on the closely related information fragments shared across modalities. The differences in sentiment classification among the models that use attention mechanisms are not large: compared with MISA, MULT, and ICCN, the model in this paper improves the F1 score by 3.7%, 3.8%, and 3.7%, respectively; on the Acc-2 indicator it increases by 4.1%, 2.9%, and 2.7% over MULT, MISA, and ICCN, and on the Acc-7 indicator the improvements are 4.1%, 2.5%, and 3.2%, respectively.
Diversity and individualization
Emotional expression in digital media art focuses on individual expression and diverse experience. Artists generally express their emotions and views in innovative ways, displaying a distinctive style and personality. What is expressed is often joy, sadness, reflection, or thought, conveyed through images, sounds, animation, and so on.
Experimentation and innovation
The emotional expression of digital media art is characterized by experimentation and innovation. Generally, it creates new ways of expression by trying new materials, technologies and media, changes traditional artistic concepts and aesthetic standards, pursues unique and avant-garde artistic expressions, and constantly explores and breaks through traditional boundaries.
Depth and complexity
The emotional expression of digital media art seeks to express the depth and complexity of emotions. In digital media art works, designers will incorporate more emotions and trigger the audience’s thinking and emotional resonance through multi-level and multi-dimensional expression. At the same time, through the combination and use of visual, sound, symbols, and other elements, a rich and complex emotional atmosphere is created.
Interactivity
Digital media art emotional expression encourages audience participation and interaction. The works are no longer static displays, but interact with the audience through interactive installations, virtual reality, augmented reality and other means, triggering the audience’s emotional resonance and enhancing the audience’s sense of participation. The audience can interact with the works, thus obtaining a more personalized and rich emotional experience.
A series of simulation tests was conducted using real, ambiguous stimulus inputs and emotion channels identified by ID; their main purpose was to validate the model by exploring the effects of different personalities under the same music video stimuli. Figures 6 and 7 show two sets of comparison experiments.

Emotional curves of different personalities in same sad video

Emotional curves of different personalities in the same positive video
In the experiments, a negative music video and a positive music video were randomly selected as targets, and the emotion channels associated with them were identified by ID. The width of the channel is a quantitative index of uncertainty, and the larger the channel width is, the more ambiguous the expression of emotion in response to that video clip.
In Fig. 6, (a) to (c) show the emotion pipelines when the emotion transfer probability is 10, 0, and -10, respectively, and (d) shows the polarity score. A negative music video stimulus was randomly selected and its emotion pipeline was constructed by continuously feeding ambiguous stimuli into the model, yielding the emotion probability state transfer curves of optimists, neutrals, and pessimists for this video. Positive, optimistic viewers were found not to be prone to negative emotions when faced with a negative video stimulus, whereas neutral and negative viewers were more likely to have their negative emotions elicited, which is largely consistent with reality. For example, when the emotion transfer probability was 0, the positive affect pipeline stayed below 0.08 until 17 seconds and then slowly fell back after a brief rise at 18 seconds, while the neutral and negative pipelines remained consistently above 0.1.
Similarly, a positive music video was randomly selected to describe the emotional state transfer curves of people with different personalities, as shown in Fig. 7, which likewise produces reasonable results; (a) to (c) show the emotion pipeline at emotion transfer probabilities of 10, 0, and -10, respectively, and (d) shows the polarity score.
Under a positive video stimulus, negative people are not prone to positive emotions, whereas neutral and positive people are more likely to have positive emotions elicited. The negative pipeline stays below 0.25 in both (a) and (b) and never exceeds 0.5 in (c), while the positive and neutral pipelines each exceed 0.5 at some moments.
In the analysis of emotional state transfer, the simulations focus on showing that, because of the continuity of emotion (referred to here as the principle of emotional consistency), people with different personalities, regulated by different personality factors, exhibit different emotional changes under different stimuli. This offers a line of thought that machine expression of human-like emotion can draw upon.
In this paper, on the basis of multimodal sentiment feature extraction, a multimodal sentiment analysis model based on a deep temporal modeling network is constructed by combining it with the graph attention network algorithm, and sentiment analysis and emotion state transfer experiments are conducted. Compared with TFN and LMF, the model in this paper improves the F1 score by 9% and 7.1%, the Acc-2 index by 8.9% and 7.7%, and the Acc-7 index by 6.7% and 6%, which indicates that the traditional data cascade fusion method has limitations in feature-level modal fusion and cannot parse and fully exploit the contextual relationships between modalities; to achieve more fine-grained and hierarchical information fusion, a cross-attention mechanism like those used in MULT, MISA, and ICCN is required. In digital media art, a negative person under a positive video stimulus is not prone to positive emotions, whereas neutral and positive people are more likely to have positive emotions elicited.