Emotional analysis and semantic understanding of multimodal network language data
Published online: 31 March 2025
Received: 05 November 2024
Accepted: 13 February 2025
DOI: https://doi.org/10.2478/amns-2025-0818
© 2025 Chen Weimiao, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
The internet has become an essential part of everyday life, and as an important carrier of information, network language data now takes increasingly rich and diverse forms: it is no longer limited to text alone but spans many modalities such as images, audio and video. These multimodal data not only contain rich information content but also carry users' emotional attitudes and semantic intentions, which is of great significance for understanding user behavior, monitoring social public opinion and improving the human-computer interaction experience [1–2]. As a result, the emotional analysis and semantic interpretation of multimodal network language data have emerged as a key research focus in natural language processing (NLP) and artificial intelligence (AI).
As an important branch of NLP, sentiment analysis aims to identify and extract the emotional tendencies expressed in text, images, audio and other data, such as positive, negative or neutral [3]. With the emergence of multimodal data, effectively fusing the emotional information of different modalities to improve the accuracy and robustness of sentiment analysis has become an urgent problem [4–5]. At the same time, semantic understanding, as a key technology for effective communication between humans and machines, requires the machine to accurately interpret the semantic information in multimodal data and to respond or make decisions accordingly [6]. However, the complexity and heterogeneity of multimodal data pose great challenges for semantic understanding.
Different modalities capture different levels of emotional information: text can express abstract emotional concepts, while voice and images capture concrete emotional expressions [7–8]. In multimodal emotion analysis, the fusion strategy is key. Common fusion approaches include early fusion, late fusion and hybrid fusion. Early fusion combines features at the feature level, integrating the features from the various modalities together at the input stage [9]. For example, the feature vectors of different modalities can be fused into a single multimodal feature vector by simple splicing, addition or multiplication [10–11]. Multimodal sentiment analysis is widely used in many fields. In evaluating films and TV dramas, it can offer more precise assessments and recommendations by examining the emotions expressed in related online texts and videos [12–14]. Similarly, in the medical and health sector, different types of online feedback serve as valuable resources for multimodal sentiment analysis [15]. As a comprehensive form of emotional analysis, multimodal sentiment analysis can understand and analyze human emotional states more thoroughly and carefully by integrating information from multiple modalities [16].
This study constructs an efficient and accurate emotional analysis and semantic understanding model by fusing the data characteristics of different modes. This will not only help to improve the level of NLP technology, but also provide more intelligent and accurate services for social media monitoring, customer feedback analysis, intelligent customer service and other application fields.
This model realizes efficient and accurate emotional analysis and semantic understanding by integrating data features of text, image, audio and other modalities. It consists of two main components: a multimodal emotion analysis module and a multimodal semantic understanding module [17–18]. The overall structure is illustrated in Figure 1 below.

Figure 1. Overall framework of the model
This model can effectively improve the level of NLP technology and provide more intelligent and accurate services for social media monitoring, customer feedback analysis, intelligent customer service and other application fields by integrating multi-modal data characteristics and using deep learning methods for emotional analysis and semantic understanding.
Data preprocessing
For text data, preprocessing includes word segmentation, stop-word removal and word vector representation. Word segmentation is the key step in Chinese text processing: continuous text is divided into independent words. Words that contribute little to emotional analysis are then removed to reduce noise. Finally, words are mapped into a high-dimensional space with Word2Vec, which captures semantic relations and provides a rich text representation [19].
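As a minimal sketch of this text preprocessing pipeline (assuming jieba for Chinese word segmentation and gensim's Word2Vec implementation; the corpus and stop-word list below are placeholders, not from the paper):

```python
import jieba
from gensim.models import Word2Vec

# Hypothetical raw corpus and stop-word list (placeholders for illustration only)
corpus = ["这部电影太精彩了", "服务态度让人失望"]
stop_words = {"了", "的", "太"}

# 1) Word segmentation: split continuous Chinese text into independent words
segmented = [jieba.lcut(sentence) for sentence in corpus]

# 2) Stop-word removal: drop words that contribute little to sentiment
cleaned = [[w for w in sent if w not in stop_words] for sent in segmented]

# 3) Word2Vec: map words into a 300-dimensional semantic space (gensim >= 4.0 API)
w2v = Word2Vec(sentences=cleaned, vector_size=300, window=5, min_count=1, workers=4)
vector = w2v.wv[cleaned[0][0]]  # 300-dimensional embedding of the first word
```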
Images are scaled and cropped to ensure uniform size and to focus on relevant regions, and features such as edges, textures and colors are then extracted with convolutional neural networks (CNN) [20]. For audio data, Mel-frequency cepstral coefficients (MFCC) are extracted to capture both frequency and temporal characteristics (as shown in Figure 2), and long audio is segmented so that emotional changes can be captured accurately.
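A brief sketch of the MFCC extraction and segmentation step described above, assuming librosa and a hypothetical audio file; the 1-second segment length and 13 coefficients follow the experimental settings reported later:

```python
import librosa
import numpy as np

# Load audio (path is a placeholder); sr=None keeps the native sampling rate
y, sr = librosa.load("comment.wav", sr=None)

# Split the long audio into 1-second segments to track emotional changes over time
segment_len = sr  # samples per 1-second segment
segments = [y[i:i + segment_len] for i in range(0, len(y), segment_len)]

# Extract 13-dimensional MFCCs per full segment and average over frames,
# giving one 13-dim feature vector per second of audio (trailing partial segment dropped)
mfcc_per_segment = np.stack([
    librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13).mean(axis=1)
    for seg in segments if len(seg) == segment_len
])
print(mfcc_per_segment.shape)  # (n_segments, 13)
```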
Feature fusion
Feature fusion is a key step in multimodal sentiment analysis: it aims to combine the feature vectors of the different modalities effectively so as to make full use of the complementarity of multimodal data. In this model, feature-level fusion is adopted, that is, the feature vectors of text, image and audio are directly spliced into a joint feature vector [21]. This approach is straightforward and preserves the original information of each modality.
Let $F_t$, $F_v$ and $F_a$ denote the feature vectors of the text, image and audio modalities, respectively. The joint feature vector is obtained as

$$F = F_t \oplus F_v \oplus F_a$$

where $\oplus$ represents the splicing (concatenation) operation of vectors.
In order to balance the importance of the different modal features, each feature vector is assigned a weight before splicing:

$$F = w_t F_t \oplus w_v F_v \oplus w_a F_a$$

If all modalities are considered equally important, they are given the same weight, that is, $w_t = w_v = w_a$.
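A minimal sketch of this weighted feature-level fusion in PyTorch; the feature dimensions and weights are illustrative (the audio part assumes n = 5 one-second segments):

```python
import torch

# Illustrative per-modality feature vectors (dimensions follow the setup used later:
# 300-dim text, 2048-dim image, 13*n-dim audio with n = 5 segments here)
f_text  = torch.randn(300)
f_image = torch.randn(2048)
f_audio = torch.randn(13 * 5)

# Optional modality weights; equal weights treat all modalities as equally important
w_text, w_image, w_audio = 1.0, 1.0, 1.0

# Feature-level fusion: weight each modality, then splice (concatenate) into one vector
f_joint = torch.cat([w_text * f_text, w_image * f_image, w_audio * f_audio], dim=0)
print(f_joint.shape)  # torch.Size([2413])
```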
Model construction
Model construction is the core part of multimodal sentiment analysis: the fused feature vectors are classified by a deep learning model. In this model, a neural network with an enhanced attention mechanism is used for emotional analysis. The attention mechanism dynamically adjusts the importance of different features, so that the model pays more attention to the features that contribute most to emotional analysis. Specifically, a Transformer-based attention model is used, which has advantages in processing sequence data and capturing long-distance dependencies [22]. The Transformer-based attention model is shown in Figure 3.
In the process of model training, it is necessary to select the appropriate loss function and optimization algorithm to optimize the parameters. In multimodal emotion analysis, our task is to classify emotions into positive, neutral and negative categories, and a common loss function is cross entropy loss.
If $y = (y_1, \ldots, y_C)$ denotes the one-hot encoding of the true emotion label and $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_C)$ the predicted probability distribution over the $C$ classes, the cross entropy loss measures the discrepancy between these two distributions. In practical applications, the following simplified form is usually used, which is suitable for multi-classification problems:

$$L = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$

where $C$ is the number of emotion classes (here $C = 3$: positive, neutral and negative), $y_i \in \{0, 1\}$ indicates whether class $i$ is the true label, and $\hat{y}_i$ is the probability the model assigns to class $i$.

For the binary classification problem, the cross entropy loss function can be further simplified as:

$$L = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$$

where $y \in \{0, 1\}$ is the true label and $\hat{y}$ is the predicted probability of the positive class.
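As a quick check of these formulas, a small sketch comparing a manual implementation of the multi-class cross entropy with PyTorch's built-in version, which operates on unnormalized logits; the scores and labels are toy values:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # unnormalized scores for 3 classes (toy values)
target = torch.tensor([0])                   # true class index (e.g., 0 = positive)

# Manual multi-class cross entropy: L = -sum_i y_i * log(p_i) with one-hot y
probs = torch.softmax(logits, dim=-1)
one_hot = F.one_hot(target, num_classes=3).float()
manual_loss = -(one_hot * probs.log()).sum(dim=-1).mean()

# Built-in version takes raw logits and class indices directly
builtin_loss = F.cross_entropy(logits, target)
print(manual_loss.item(), builtin_loss.item())  # the two values agree
```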

Figure 2. Extracted MFCC features

Figure 3. Transformer model structure
Semantic representation
Multimodal semantic understanding can deeply understand the meaning and context of the content by integrating information from various modes such as text, image and audio, thus providing more accurate analysis and interpretation.
In multimodal semantic understanding, semantic representation is the cornerstone. For text data, the pre-trained BERT language model is employed to capture high-level semantic features; by encoding context in both directions, BERT captures the complex relationships between words [23]. For image data, the pre-trained ResNet model is used to extract high-level features of the image, such as objects and scenes. ResNet addresses the vanishing-gradient problem in deep networks through residual connections, so that very deep networks can be trained effectively. Figure 4 shows a ResNet residual block.
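A sketch of this feature extraction step, assuming the Hugging Face transformers library for BERT and torchvision for ResNet-50; the model names, input text and image path are illustrative placeholders:

```python
import torch
from transformers import BertTokenizer, BertModel
from torchvision import models, transforms
from PIL import Image

# --- Text: BERT encodes context bidirectionally; use the [CLS] vector as the sentence feature
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()
inputs = tokenizer("这部电影的配乐令人感动", return_tensors="pt")
with torch.no_grad():
    text_feat = bert(**inputs).last_hidden_state[:, 0, :]    # shape (1, 768)

# --- Image: ResNet-50 with the classification head removed yields 2048-dim features
resnet = models.resnet50(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("post.jpg").convert("RGB")).unsqueeze(0)  # placeholder path
with torch.no_grad():
    image_feat = resnet(img)                                  # shape (1, 2048)
```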
In order to unify the semantic representations of text and image in the same space, a mapping function is introduced to project their feature vectors into a common semantic space. Let the text feature vector be $h_t$ and the image feature vector be $h_v$; they are mapped as

$$z_t = f_t(h_t), \qquad z_v = f_v(h_v)$$

where $f_t$ and $f_v$ are learnable mapping functions (in the simplest case, linear projections) and $z_t$, $z_v$ are the resulting representations in the shared semantic space.
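One simple way to realize the mapping functions above is a pair of learnable linear projections, as in the following sketch (the 512-dimensional common space is an assumption for illustration):

```python
import torch.nn as nn

class CommonSpaceProjection(nn.Module):
    """Maps text (768-d BERT) and image (2048-d ResNet) features into one 512-d space."""
    def __init__(self, text_dim=768, image_dim=2048, common_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, common_dim)
        self.image_proj = nn.Linear(image_dim, common_dim)

    def forward(self, text_feat, image_feat):
        z_text = self.text_proj(text_feat)      # (batch, common_dim)
        z_image = self.image_proj(image_feat)   # (batch, common_dim)
        return z_text, z_image
```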
Semantic reasoning
In the semantic reasoning stage, a deep learning-based Seq2Seq model, specifically the Transformer, is adopted to carry out cross-modal semantic reasoning. The Transformer captures dependencies within and between sequences through self-attention and cross-attention mechanisms.
For given sequences of text and image feature vectors $Z_t$ and $Z_v$, the attention operation at the core of the Transformer can be expressed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Among them, $Q$, $K$ and $V$ are the query, key and value matrices obtained by linear projections of the input sequences, and $d_k$ is the dimension of the keys. In self-attention, $Q$, $K$ and $V$ all come from the same modality; in cross-attention, the queries come from one modality (e.g., text) while the keys and values come from the other (e.g., image), which allows information to flow across modalities.
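A compact sketch of the scaled dot-product attention underlying this cross-modal reasoning, with text features as queries and image features as keys and values; all shapes are illustrative:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

# Cross-attention: text tokens (queries) attend to image regions (keys/values)
z_text  = torch.randn(1, 16, 512)   # (batch, text sequence length, d_model)
z_image = torch.randn(1, 49, 512)   # (batch, image regions, d_model)
fused = scaled_dot_product_attention(z_text, z_image, z_image)
print(fused.shape)  # torch.Size([1, 16, 512]) -- text enriched with image context
```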
Context understanding
Context understanding is the key to multimodal semantic understanding. This paper uses an attention mechanism to capture information from the text context, the image context and the cross-modal context. Specifically, a multi-head attention mechanism is designed, which processes information in different representation subspaces in parallel. The multi-head attention feature fusion mechanism is shown in Figure 5.
Let $Q$, $K$ and $V$ be the query, key and value matrices built from the text, image and cross-modal context features. The multi-head attention is computed as

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$

Among them, $h$ is the number of attention heads, $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are the projection matrices of the $i$-th head, and $W^{O}$ is the output projection that fuses the concatenated head outputs.
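A sketch of the multi-head mechanism using PyTorch's nn.MultiheadAttention with 8 heads, mirroring the experimental configuration; the input shapes are illustrative, and batch_first requires PyTorch 1.9 or later:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

z_text  = torch.randn(1, 16, 512)   # text context (queries)
z_image = torch.randn(1, 49, 512)   # image context (keys/values)

# Each of the 8 heads attends in a different representation subspace in parallel;
# the head outputs are concatenated and linearly projected back to 512 dimensions.
fused, attn_weights = mha(query=z_text, key=z_image, value=z_image)
print(fused.shape, attn_weights.shape)  # (1, 16, 512), (1, 16, 49)
```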

Figure 4. Schematic diagram of a ResNet residual block

Figure 5. Multi-head attention feature fusion mechanism
In this experiment, a large-scale data set, the "Multimodal Emotion and Semantic Understanding Data Set (MUSE)", is selected, which integrates multimodal emotion analysis and semantic understanding. The data set comes from social media platforms and contains posts published by users; each post contains a text description, pictures and audio (such as voice comments). The data set contains 100,000 samples in total, and each sample is manually labeled with an emotional category (positive, negative or neutral) and semantic labels (such as event type and subject object).
The experiments were carried out on a server equipped with an NVIDIA Tesla V100 GPU and an Intel Xeon Gold 6148 CPU, with Ubuntu 18.04 as the operating system and PyTorch 1.7 as the deep learning framework. Text data is represented by 300-dimensional Word2Vec word vectors; ResNet-50 is used for image feature extraction, producing 2048-dimensional features; for audio, 13-dimensional MFCC features are extracted per second, with the segment length set to 1 second and the number of segments n dynamically adjusted according to the total audio duration. The fused joint feature vector therefore consists of three parts: 300 text dimensions, 2048 image dimensions and 13×n audio dimensions.
A Transformer with an enhanced attention mechanism is selected as the model structure, configured with 6 layers, 8 attention heads and a 512-dimensional hidden layer. To evaluate performance, the sentiment analysis task uses accuracy, recall and F1 value as metrics, while the semantic understanding part evaluates the recognition of semantic tags by precision, recall and F1 value.
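For reference, a sketch of how a classifier with this configuration (6 layers, 8 attention heads, 512-dimensional hidden layer, 3 emotion classes) could be assembled in PyTorch; projecting the fused feature vector into the 512-dimensional model space is an assumption of this sketch:

```python
import torch
import torch.nn as nn

class MultimodalSentimentClassifier(nn.Module):
    def __init__(self, joint_dim=2413, d_model=512, n_heads=8, n_layers=6, n_classes=3):
        super().__init__()
        self.input_proj = nn.Linear(joint_dim, d_model)           # project fused features
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)           # positive / neutral / negative

    def forward(self, joint_feat):                                # (batch, joint_dim)
        x = self.input_proj(joint_feat).unsqueeze(0)              # (seq=1, batch, d_model)
        x = self.encoder(x).squeeze(0)                            # (batch, d_model)
        return self.classifier(x)                                 # logits for 3 classes

model = MultimodalSentimentClassifier()
logits = model(torch.randn(4, 2413))   # batch of 4 fused vectors (300 + 2048 + 13*5)
print(logits.shape)                    # torch.Size([4, 3])
```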
The performance comparison of different models on the emotion analysis task (see Table 1 and Figure 6) shows that the multimodal fusion model is significantly superior to the single-modal models, with 85.2% accuracy, 86.7% recall and an 85.0% F1 value, which indicates that combining multi-source information such as text, image and audio can greatly improve the accuracy and comprehensiveness of emotion analysis. In contrast, the models using only text, image or audio reach F1 values of 78.2%, 81.8% and 81.7%, respectively. Although each single-modal model has its own strengths, none matches the multimodal fusion model, underscoring the value of integrating multimodal data to improve sentiment analysis performance.

Figure 6. Performance comparison of different models in the emotional analysis task
Table 1. Overall performance of different models in the affective analysis task

Model type | Accuracy | Recall | F1 value |
---|---|---|---|
Text only | 78.6% | 77.8% | 78.2% |
Image only | 80.5% | 83.2% | 81.8% |
Audio only | 82.1% | 81.4% | 81.7% |
Multimodal fusion | 85.2% | 86.7% | 85.0% |
The fusion of multimodal data significantly enhances the model's comprehensive understanding of emotion; in particular, image and audio information provide an important supplement to the interpretation of textual emotion, for example by judging emotional tendency from non-verbal cues such as facial expression, tone and intonation.
The BERT and ResNet models in the multimodal semantic understanding module effectively extract the deep features of text and images, while the self-attention and cross-attention mechanisms of the Transformer promote the fusion and reasoning of cross-modal information and improve the model's understanding of complex contexts. As shown in Table 2, the model achieves a precision of 79.3%, meaning that 79.3% of the semantic tags identified by the model are correct; the model therefore performs well in avoiding misclassification. The recall is 76.8%, slightly lower than the precision, meaning that the model identifies 76.8% of all the actual semantic tags; the slightly lower recall suggests that some semantic tags are missed. The F1 value, the harmonic mean of precision and recall, is 78.0%, showing that the model's performance on these two aspects is relatively balanced. Overall, the model is robust in semantic tag recognition, especially in precision, but the slightly lower recall indicates that there is still room for improvement in capturing all relevant semantic information.
Table 2. Performance of the model in semantic tag recognition

Metric | Value (%) |
---|---|
Precision | 79.3 |
Recall | 76.8 |
F1 value | 78.0 |
As can be seen from Figure 7, the model performs well in identifying the happiness category: most samples whose actual class is happiness are correctly predicted as happiness. There is some confusion in identifying the sadness category, with some actual-sadness samples wrongly predicted as happiness or anger. Recognition of the anger category is also relatively good, although a certain proportion of anger samples are wrongly predicted as sadness. In general, the model performs well on the happiness and anger categories but shows some confusion on the sadness category, which suggests that further optimization is needed, particularly in feature extraction and classification for distinguishing sadness from the other emotional categories.

Figure 7. Confusion matrix of the model on the data set
Figure 8 shows the distribution of semantic tags identified by the model on the data set. As can be seen from the figure, the "happiness" category is recognized most often, followed by the "surprise" and "anger" categories, while the "sadness" and "neutral" categories are relatively rare. This shows that the model performs best in identifying "happy" emotions but is relatively weak in identifying "sad" and "neutral" emotions.

Figure 8. Semantic tags identified by the model on the data set
Comparing the single-modality methods (text only, image only, audio only) with multimodal fusion, the multimodal fusion model performs best on all evaluation metrics. This is mainly because multimodal data provides more comprehensive information and reduces the ambiguity and misunderstanding that a single modality may introduce.
The application of the attention mechanism also significantly improves model performance. In particular, the Transformer dynamically adjusts the importance of features, so that the model focuses on the information that is crucial for emotional analysis and semantic understanding, thus improving the accuracy of its judgments. The multimodal emotion analysis and semantic understanding model designed in this experiment effectively integrates text, image and audio data and combines advanced deep learning techniques to realize efficient and accurate emotion recognition and semantic understanding, providing a valuable reference for future research in multimodal information processing.
This paper studies the emotional analysis and semantic understanding of multimodal network language data and constructs an efficient and accurate emotional analysis and semantic understanding model by integrating the data characteristics of text, image and audio. The experimental results indicate that the multimodal fusion model achieves a notable advantage over the single-modal models in emotion analysis, attaining 85.2% accuracy, 86.7% recall and an 85.0% F1 value, which shows that combining multiple information sources can greatly improve the accuracy and comprehensiveness of emotion analysis. In terms of semantic understanding, the precision of the model is 79.3%, the recall is 76.8% and the F1 value is 78.0%, showing good performance in avoiding misclassification but leaving room for improvement in capturing all relevant semantic information. Overall, this study achieves efficient and accurate emotion recognition and semantic understanding by effectively fusing multimodal data with advanced deep learning techniques, providing a valuable reference for future research in multimodal information processing.