
Multi-level Visual Communication Design Strategies and Practices for Digital Brand Image Construction

  
19 March 2025


Introduction

With the steady development of the market economy, China has become the world's factory, and a wide variety of products now compete on the same stage with similar products at home and abroad. To secure their own survival and development, enterprises are investing great energy in building their brand visualization systems [1-2]. However, owing to various constraints, many Chinese enterprises still lag considerably behind their foreign counterparts in brand image shaping. This gap is mainly reflected in the failure to integrate the brand image organically with the brand connotation and in incomplete visual systems, which leave the brand image rigid and lacking in vitality. The result is an endless cycle of brand rebuilding that makes it difficult to cultivate consumers' brand loyalty [3-4]. How to combine an enterprise's own characteristics with brand building organically, through scientific visual identity design, so as to prevail in the brand war is therefore of great significance [5-6].

On the other hand, the rapid development of digital media technology has allowed digital technology to penetrate various fields. Combining digital technology with brand design can activate the cultural industry and enhance cultural brand recognition and influence. The purpose of brand image design is more effective communication [7-9]. Digital image design builds a new communication bridge between the brand and consumers, endowing the brand image with the characteristics of the new era [10].

The enterprise is the most basic unit in the market economic system, and an enterprise's brand image refers to its distinctive name, symbols and their combinations [11]. The brand image is not only the identifier that distinguishes an enterprise from others, but also the visual embodiment of its cultural connotation and unique products [12]. A corporate brand image is a unique intangible asset that can effectively enrich the enterprise's cultural connotation and enhance the added value of its products. The traditional corporate brand image consists of elements such as the logo, VI, color, graphics and text, and is confined to two-dimensional presentation. In the context of digitalization, a single brand image composed only of visual symbols can no longer meet consumers' needs; consumer behavior involves not only the purchase of goods but, increasingly, the experience and service aspects of the consumption process [13-14]. In the digital-economy era, where everything is interconnected and shared in the cloud, upgrading the brand image by design means into a digital brand image suited to digital communication media is an inevitable choice for enterprises. The new digital brand image must be a visual image that integrates user experience and emotion, is suitable for multi-channel communication, and can be perceived by consumers [15-17].

Digital technology empowers brand image design mainly by enhancing its interactivity, flexibility and convenience; related research also examines how digital literacy can be improved through teaching and training to strengthen digital brand image design capability, and how traditional artistic and cultural elements can be incorporated into brand image design based on digital technology. Literature [18] discussed the kernel and norms of brand design in the digital context and concluded that digital technology overturns the perceptual, contextual and interactive aspects of brand image design. Literature [19] summarized the development of digital brand image design models and methods, pointing out that digital brand design emphasizes simplicity, interactivity and innovation and can respond quickly to changes in information from customers, markets and digital media. Starting from the concept of brand image and the perspective of brand visual image design, literature [20] examined how multimedia and Internet technologies contribute to brand visual image design and concluded that brand visual image design is developing in the direction of dynamism, interactivity and flexible imagery. Literature [21] analyzed the forms and characteristics of aesthetic elements in product styling and brand image design, reviewed the development and teaching effect of brand image design education, and put forward specific and effective suggestions for successful brand image design. Literature [22] discussed the value-perception and emotion-perception characteristics of brand image and concluded that brand image design should be combined with deep knowledge of the industry in which the brand operates and should pursue innovative, differentiated design in order to provide consumers with a good value experience. Literature [23] analyzed the visual communication characteristics of Florasis and concluded that a successful new Chinese brand image design should be rooted in traditional Chinese culture and integrated with modern visual communication design concepts, balancing the expression of traditional culture with enhanced artistic and aesthetic value.

Research on brand image visual design mainly studies the integration of visual metaphor, art design and perceptual identification, supported by practical cases. Literature [24] conceived a visual communication computing system with conceptual metaphor as its core logic to realize the display of information and the expression of ideas in brand image design, and confirmed the feasibility of the proposed system through theoretical analysis and practical feedback. Literature [25] took Disney's Mickey Mouse cartoon character brand in global visual communication practice as a research case to explore design practice that integrates brand image and public aesthetics on the basis of visual communication, deepening the understanding of brand image visual communication design guidelines and concepts and helping to present better aesthetic art design to the public. Literature [26] summarized four brand packaging design strategies involving emotional design, brand vision and material innovation, and concluded that interesting and emotional brand visual packaging design easily stimulates consumer empathy; the study provides an important reference for the playful and green development of brand packaging visual design. Literature [27] took the Morikawa Shi brand as a research case, explored the overall integration of emotional identification and artistic design in the brand's visual design, and pointed out that the brand's innovative strategies in visual communication design, user experience and music customization service make positive contributions to its innovative design.

This paper first describes the human visual system and its visual perception characteristics, and then, based on generative adversarial networks, designs a color migration algorithm (PCGAN) and a style transformation algorithm (SPGAN) for brand images. The PCGAN model contains a color feature extraction module that constructs the color style feature map of the reference image through primary color family clustering, encoding and broadcasting, and uses grayscale compression to remove color from the products in the image. A dual-input generator then performs multi-layer cross-feature fusion of the product image and the color style feature map to generate the product's color information. The SPGAN model incorporates an attention mechanism layer into the generator, substituting feature maps with attention for traditional convolutional feature maps to enhance the model's performance. Finally, a questionnaire survey is conducted and analyzed to investigate the influencing elements of multi-level visual communication design, which provides a basis for further optimization of digital brand image construction.

Human visual system and visual properties

Human beings perceive and recognize external things mainly through vision: more than 70% of the external information acquired by humans is obtained through the eyes, and this information is processed by the visual nerve center system before being transmitted to the brain. Understanding the composition and working mechanism of the human visual system (HVS), and the relationship between the visual characteristics of the human eye and the perception of image quality, is therefore crucial to the study of multi-level visual communication design strategies and, in turn, to the construction of a digital brand image.

Basic Components of the Human Visual System (HVS)

The basic components of the human visual system (HVS) are shown in Figure 1: the human eye system and the visual nerve center system. The human eye system mainly consists of the cornea, iris, lens and retina. The cornea's light-gathering effect helps focus external light onto the retina. Light enters the eye through the pupil at the center of the iris, whose size can be adjusted to control the amount of incoming light. The lens further converges the light onto the retina. The retina contains cone cells and rod cells, which are responsible for light-sensitive imaging. Cone cells are concentrated in the fovea, the central concave area of the retina; they are sensitive to bright light and color and provide high resolution. Rod cells are located at the edge of the retina and are sensitive to dim light; they have low resolution and cannot perceive color.

Figure 1.

Basic components of the human visual system

The visual nerve center system is a cluster of nerve cells in the cerebral cortex associated with the formation of vision; it performs lateral geniculate nucleus (LGN) processing and visual cortex processing of the information transmitted from the optic nerve. The LGN forms synapses with the axons of retinal ganglion cells and is connected to the visual cortex, receiving both electrical signals from the retina and feedback from the visual cortex; it relays and preliminarily processes the visual information arriving from the optic nerve. The visual cortex consists of the primary visual cortex (V1) and the extrastriate cortex (e.g., V2, V3, V4 and V5), which perform high-level hierarchical processing of visual information. First, V1 receives visual information from the LGN and extracts local features such as structure and color. The information is then passed iteratively to the extrastriate cortex (V2, V3, V4, V5, etc.) for processing; through increasingly high-level abstraction, finer outcomes such as image content recognition, comprehension, matching and memory are obtained.

When a human observes an object, reflected light passes through the human eye system and forms light signals on the retina; photoreceptor cells on the retina are stimulated by these signals to produce electrical signals, which the optic nerve transmits to the visual nerve center system for LGN processing and processing by the brain's visual cortex. The brain continues to analyze and respond to this information, so that a person can perceive the existence of things, understand them and remember them.

Visual perception characteristics of the human eye

With some understanding of the basic components of the HVS, scientists have studied the visual properties of the human eye in biology, neurology, and psychophysics, laying a theoretical foundation for modeling the visual system. Several common visual perception properties of the human eye are shown below:

Multiscale Visual Characterization

The human eye is a complex optical system regulated by the optic nerve. Throughout the process of observing and perceiving external things, the cells involved, from the photoreceptor cells on the retina to the visual cortex cells in the visual nerve center system, have different receptive fields oriented to different visual features such as orientation and frequency, and process them separately, making the HVS a system with multi-scale and multi-channel processing mechanisms.

Visual attention mechanisms

Humans can select visually salient regions or regions of interest (ROI) from natural scenes for prioritization. This active selective mental activity is known as the visual attention mechanism.

The visual attention mechanism is categorized into two modes: top-down and bottom-up. The top-down mode is driven by subjective consciousness: after receiving a specific instruction, visual attention is directed toward the target area of an image while other areas are disregarded. The bottom-up mode is driven by the salience of the image itself, such as its brightness and color; when the image contains content such as a specific shape or object, the human eye pays more attention to that region and perceives its detailed information.

Because of the visual attention mechanism of the human visual system, the human eye acquires different amounts of information from different areas of an image, and this visual characteristic has a great impact on visual communication design. Visual optimization applied to image areas that attract more attention elicits a stronger response from the human visual system and makes it easier to perceive a three-dimensional corporate digital brand image, whereas optimization applied to areas ignored by the visual attention mechanism is relatively unlikely to be perceived. Therefore, effectively simulating the visual attention mechanism, finding the regions of interest in an image and treating them differentially can improve the effectiveness of visual communication design and thus help build a multi-layered digital brand image for the enterprise.

Visual masking effects

The visual masking effect of the human eye is caused by mutual interference between visual information and includes luminance masking, texture masking and color masking. The luminance masking effect means that when the luminance of a target object is close to that of the background, it is difficult for the human eye to distinguish the two. The texture masking effect refers to the lower sensitivity of the human eye when observing images with complex textured backgrounds: the eye is insensitive to distortion in textured regions of an image but easily detects quality degradation in edge regions. The color masking effect arises because the human eye has different sensitivities to different colors, and is especially sensitive to red, green and blue. In addition, the eye's ability to perceive color changes and distinguish color distortion is linked to both color saturation and image brightness.

The visual masking effect prevents the human eye from perceiving changes below a certain threshold, which is known as the Just Noticeable Distortion (JND). In the research of multi-level visual communication design strategy, the JND threshold can help us distinguish which distortions in the image are detectable by HVS and which changes in the image information are not perceptible by HVS. According to the JND threshold, finding out the quality changes that can be perceived by HVS and ignoring the imperceptible quality changes can effectively reduce the complexity of the multilevel visual communication design strategy, so that the constructed digital brand image can be more in line with the human eye’s perception of image quality.

Brightness and Contrast Sensitivity Characteristics

The human visual system perceives luminance in a logarithmic relationship with light intensity, which makes the human eye insensitive to gray-scale changes in brighter image regions; as a result, the eye is less sensitive to absolute luminance and more likely to perceive changes in relative luminance. The luminance sensitivity of the human eye is usually expressed by the Weber-Fechner law, calculated as shown in Equation (1): $C_w = \frac{\Delta L}{L_b}$, where $\Delta L$ represents the minimum difference between the image content and the background brightness perceivable by the human eye, $L_b$ represents the background brightness, and $C_w$ is approximately a constant.

The contrast sensitivity property refers to the phenomenon where the human eye is unable to distinguish the blurring of edges within a certain degree due to its limited discriminatory ability. Contrast sensitivity is usually related to the spatial information and luminance contrast of the image region, which can be measured according to the frequency or strength of spatial signals. The luminance and contrast sensitivity properties of the human eye reflect the ability of HVS to discriminate the intensity of light and edge contrast when perceiving images, which lays a theoretical foundation for the study of multilevel visual communication design.

Multi-level visual communication design model for digital brand identity

In this chapter, based on Generative Adversarial Network (GAN), the color generation algorithm PCGAN based on style coding and the style transformation algorithm SPGAN are designed respectively, which realizes the multi-level visual communication design of the product from the perspectives of color generation and style transformation, so as to construct and display the digital brand image.

Generative Adversarial Networks

Generative Adversarial Networks (GANs) are an emerging technique for semi-supervised and unsupervised learning, capable of implicitly modeling high-dimensional data distributions; their most important feature is the training of a pair of networks that compete with each other through an adversarial game [28].

The adversarial game of a GAN can be formulated as solving a minimax problem on the objective function, which aims to map a noise distribution into the real data distribution. The objective function is shown in Equation (2): $\min_G \max_D V(G,D) = \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] + \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)]$, where $G(\cdot)$ denotes the generator function, $D(\cdot)$ denotes the discriminator function, $z$ denotes the random noise sample, and $x$ denotes the real sample.

The discriminator $D(x)$ can be regarded as a binary classifier whose goal is to make the output probability for real samples close to 1 and that for synthetic samples close to 0, while the generator's goal is to make the discriminator's output probability for synthetic samples close to 1. The optimization directions of the generator and the discriminator conflict, and the two form an adversarial pair that continually pulls the distribution of the synthetic samples closer to that of the real samples.

In practice, the training process iterates the parameters of the generator and the discriminator in an alternating fashion. For the generator, the goal is to minimize $\max_D V(G,D)$; for the discriminator, the objective is to maximize $V(G,D)$. Taking the derivative of $V(G,D)$ yields the optimal discriminator $D^*(x)$, shown in Equation (3): $D^*(x) = \frac{p_{data}(x)}{p_g(x) + p_{data}(x)}$, where $p_g(x)$ denotes the generated sample distribution and $p_{data}(x)$ denotes the true sample distribution.
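To make the alternating optimization concrete, the following is a minimal PyTorch sketch on a toy two-dimensional Gaussian; the network sizes, learning rates and the non-saturating generator update are illustrative assumptions, not the PCGAN/SPGAN architectures used later in this paper.

```python
import torch
import torch.nn as nn

# Toy 2-D data and tiny MLP networks; illustrative only, not the paper's models.
latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def real_batch(n=64):
    # Stand-in for p_data(x): a Gaussian blob centered at (2, -1).
    return torch.randn(n, data_dim) * 0.3 + torch.tensor([2.0, -1.0])

for step in range(1000):
    x = real_batch()
    z = torch.randn(x.size(0), latent_dim)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    opt_D.zero_grad()
    loss_D = bce(D(x), torch.ones(x.size(0), 1)) + \
             bce(D(G(z).detach()), torch.zeros(x.size(0), 1))
    loss_D.backward()
    opt_D.step()

    # Generator step: non-saturating form, maximize log D(G(z))
    # instead of minimizing log(1 - D(G(z))).
    opt_G.zero_grad()
    loss_G = bce(D(G(z)), torch.ones(x.size(0), 1))
    loss_G.backward()
    opt_G.step()
```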

Substituting the optimal discriminator $D^*(x)$ into Eq. (2), it can be shown that the optimization direction of the generator is to minimize the JS divergence between the real sample distribution $p_{data}(x)$ and the synthetic sample distribution $p_g(x)$. The JS divergence measures the distance between two distributions and has the properties of non-negativity and symmetry.

In practice, the function represented by $\max_D V(G,D)$ is a high-dimensional non-convex function, and the GAN finds its minimum by means of gradient descent, which is not necessarily the Nash equilibrium point. Feature matching can help the generator produce samples that are more consistent with the distribution of real data. For the image generation task, the purpose of feature matching is to reduce the distance between the synthesized image and the real image in the middle layer of the discriminator, and its loss function is shown in Equation (4): $L_{feature}(G,F) = \left\| \mathbb{E}_{x\sim p_{data}(x)} F(x) - \mathbb{E}_{z\sim p_z(z)} F(G(z)) \right\|_2^2$, where $F(x)$ denotes the feature map of sample $x$ in the middle layer of the discriminator.

GANs are prone to the problem of mode collapse during training, i.e., the generated samples are extremely similar and lack diversity. This problem can be mitigated by minibatch discrimination. Let $F(x_i)$ denote the feature vector of input sample $x_i$ in the middle layer of the discriminator; the matrix obtained by multiplying $F(x_i)$ with a learnable tensor $T \in \mathbb{R}^{n\times p\times q}$ is denoted as $M_i$, and the sum of the $L_1$ distances between the $b$th feature of $x_i$ and those of the other samples is denoted as $O(x_i)_b$, as shown in Equation (5): $O(x_i)_b = \sum_{j=1}^{n} \exp\left(-\left\| M_{i,b} - M_{j,b} \right\|_{L_1}\right)$

Eventually, each sample $x_i$ can be passed through the minibatch layer to compute the corresponding vector $O(x_i)$. The minibatch layer is added on top of the original discriminator, with $F(x_i)$ as input, $O(x_i)$ as output, and $T$ as the learnable parameter.

In addition to mode collapse, the original GAN is also prone to vanishing gradients during training. WGAN solves this problem at the theoretical level by replacing the JS divergence, which measures the distance between distributions, with the Wasserstein distance. The loss function of the generator in WGAN is shown in Equation (6): $W[p_g(x), p_{data}(x)] = \max_{f_w, \|f_w\|_L \le 1} \mathbb{E}_{x\sim p_{data}(x)}[f_w(x)] - \mathbb{E}_{x\sim p_g(x)}[f_w(x)]$, where the discriminator $f_w$ must satisfy the 1-Lipschitz constraint, which can be enforced by weight clipping during network training. Besides WGAN, WGAN-GP, BEGAN, EBGAN, etc. also use improved loss functions to stabilize GAN training.
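As a hedged illustration of the Wasserstein critic in Eq. (6), the sketch below shows one critic update with weight clipping; the critic architecture, the 2-D data dimensionality and the clip value 0.01 are assumptions for demonstration, not values from the paper.

```python
import torch
import torch.nn as nn

# Critic f_w (no sigmoid); weight clipping enforces the 1-Lipschitz constraint.
f_w = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt_f = torch.optim.RMSprop(f_w.parameters(), lr=5e-5)
clip_value = 0.01

def critic_step(x_real, x_fake):
    opt_f.zero_grad()
    # Maximize E[f_w(x_real)] - E[f_w(x_fake)]  <=>  minimize the negative.
    loss = -(f_w(x_real).mean() - f_w(x_fake).mean())
    loss.backward()
    opt_f.step()
    # Weight clipping ("weight trimming") after each update.
    for p in f_w.parameters():
        p.data.clamp_(-clip_value, clip_value)
    return -loss.item()  # current Wasserstein estimate

x_real = torch.randn(64, 2) + 2.0
x_fake = torch.randn(64, 2)
print(critic_step(x_real, x_fake))
```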

Product Color Migration Algorithm Based on Style Coding

The color migration task refers to transferring the color features of a reference image to a content image without destroying the content features of the original content image. In this section, a style-coding-based Product Color Generative Adversarial Network (PCGAN) is proposed, which encodes the color features of the reference image through a color feature extraction module and then reconstructs the color information of the product by fusing the color feature map with the features of the grayscaled product image through a dual-input generator. The overall network adopts a GAN architecture, which helps improve the realism and naturalness of the generated images.

Overall model structure

PCGAN consists of four main parts: a grayscale compression module, a color feature extraction module, a generator with bidirectional feature fusion, and a discriminator.

First, the color feature extraction module generates a local color feature map, which serves as the color-condition input to the subsequent generator. At the same time, the grayscale compression module performs graying and compression operations on the product region of the image to produce a grayscaled product image. Both the grayscaled product image and the local color feature map are then fed into a dual-input color generator, which adopts an encoder-decoder structure and is capable of generating the color information of the product. Finally, the generated image and the real image are fed into a PatchGAN discriminator for confidence scoring, which maps the input image into a probability matrix in a fully convolutional manner [29].

Gray scale compression module

Graying in PCGAN uses the weighted averaging method, which takes into account that the human eye is least sensitive to blue and most sensitive to green; the three RGB channels are therefore given different weights in the weighted average used to compute the gray value of each pixel, i.e.: $P_{gray} = 0.299\,P_{red} + 0.587\,P_{green} + 0.114\,P_{blue}$, where $P_{red}$, $P_{green}$ and $P_{blue}$ represent the red, green and blue channel values, respectively.

For images dominated by black or white tones, the grayscale image is very close to the original image and the color-removal effect is not achieved. To alleviate this problem, PCGAN performs grayscale compression on top of image graying, i.e., the gray values of the image are compressed into a fixed interval.
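The following short Python/NumPy sketch combines the weighted-average graying of Eq. (7) with a simple linear compression of the gray values into a fixed interval; the interval bounds [60, 200] are illustrative assumptions, since the paper does not state the exact interval.

```python
import numpy as np

def to_gray_compressed(img_rgb, lo=60, hi=200):
    """Weighted-average graying (Eq. (7)) followed by compression of the gray
    values into a fixed interval [lo, hi]; the bounds are illustrative."""
    img = img_rgb.astype(np.float32)
    gray = 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]
    # Linearly map [0, 255] into [lo, hi] so near-black / near-white products
    # still differ visibly from their original appearance.
    compressed = lo + gray / 255.0 * (hi - lo)
    return compressed.astype(np.uint8)

# Example: a random 64x64 RGB "product" image.
demo = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
g = to_gray_compressed(demo)
print(g.min(), g.max())
```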

Color Feature Extraction Module

The color feature extraction module can extract color features from the reference image and convert them into a local color feature map, which is divided into a color clustering module, a color coding module, and a color broadcasting module. Firstly, the color clustering module is responsible for extracting the main color system of the reference image, then the color feature coding module is responsible for compressing the main color system image into a one-dimensional color feature vector, and finally the color feature broadcasting module is responsible for broadcasting the feature vector to the product area of the image. Eventually, the color feature extraction module outputs a local color feature map, which is used as the color style input to the generator.

In order to extract the dominant colors in the reference image and remove the redundant color information, the first step of the color feature extraction module is to use the K-means algorithm to perform clustering operation on the pixel points with the following algorithmic flow:

Step1, randomly select K pixel points in the reference image as the initial K clustering centers.

Step2, traverse all the pixel points in the reference image and calculate the Euclidean distance between them and the K clustering centers, and the clustering center with the closest distance is used as the category of the current pixel point.

Step3, recalculate the clustering centers for each category, and the new clustering center is required to have the smallest sum of Euclidean distances from all other pixel points in the current category.

Step4, repeat Step2~ Step3 until the clustering center does not change.

The K centers obtained by the K-means clustering algorithm can be used as the primary color palette of the reference image. The palette is sorted according to the number of pixels in each category, and a primary color palette image is generated in proportion to the cluster sizes. The generated primary color image breaks the spatial correlation of the various colors in the original reference image, allowing the model to learn color information more freely. In this way, the dominant color family can be extracted not only from the whole image but also from a limited region of the image, i.e., the clustering operation is applied only to the pixels in that region.
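A minimal NumPy sketch of this palette extraction is given below; it follows Steps 1-4 but, as in standard K-means, recomputes each center as the cluster mean, and the cluster count k = 5 is an assumed value rather than the paper's setting.

```python
import numpy as np

def dominant_palette(img_rgb, k=5, iters=20, seed=0):
    """K-means over pixels (Steps 1-4 above); returns cluster centers sorted
    by cluster size, i.e. the primary color palette of the reference image."""
    rng = np.random.default_rng(seed)
    pixels = img_rgb.reshape(-1, 3).astype(np.float32)
    centers = pixels[rng.choice(len(pixels), k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to the nearest center (Euclidean distance).
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its cluster (standard K-means update).
        new_centers = np.array([pixels[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    counts = np.bincount(labels, minlength=k)
    order = counts.argsort()[::-1]          # sort palette by cluster size
    return centers[order].astype(np.uint8), counts[order]

ref = (np.random.rand(128, 128, 3) * 255).astype(np.uint8)
palette, sizes = dominant_palette(ref, k=5)
print(palette, sizes)
```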

To align the color features with the grayscaled region of the product image, PCGAN adopts an encode-then-broadcast approach. The color feature encoding module performs further feature extraction on the primary color image to obtain a one-dimensional color feature vector by convolution. The color feature broadcasting module randomly initializes a tensor with the same width and height as the product image and broadcasts the color feature vector to the specified region of that tensor. The resulting tensor is used as the color style input to the generator for feature fusion with the grayscaled product image. The introduction of the color feature encoding and broadcasting modules allows the model to perform color generation for multiple reference images and regions. In the model testing phase, the color feature extraction module can encode multiple reference images into different feature vectors and broadcast them to different product regions, thereby rendering multiple regions of an image with different color styles in one pass.

Color Generator for Bidirectional Feature Fusion

The generator part of PCGAN adopts a dual-input single-output structure, where the dual encoders encode the features of the color feature map and the grayscaled product image, respectively, and perform bi-directional feature fusion. The decoder reconstructs the fused features and outputs a new product image with the same color style as the reference image.

The dual encoder consists of a product feature encoder $E_p$ and a color feature encoder $E_c$. The input to $E_p$ is the grayscaled product image and the input to $E_c$ is the local color feature map. The structures of $E_p$ and $E_c$ are identical; each contains four downsampling convolutional layers, each followed by a ReLU activation layer. The output of each convolutional layer in $E_p$ and $E_c$ is passed through a SPADE module for bidirectional feature fusion [30]. The final generated image is used as the input to the discriminator. The discriminator in PCGAN uses the PatchGAN structure, which consists of five convolutional layers; the last four have a convolution stride of 2, so each of these layers downsamples the feature map by a factor of 2, for a total factor of 16. The output of the discriminator is a confidence score matrix, where each value represents the probability that its corresponding receptive field comes from a real product image.

SPADE is a spatially adaptive normalization layer that improves on the traditional batch normalization layer by learning two sets of transformation parameters $\gamma$ and $\beta$ by convolution; it not only performs normalization but also supplements sufficient semantic information.

First, the SPADE module performs a convolution operation on the semantic image to obtain the scale factor and translation factor. In a batch normalization layer, the scale factor and translation factor are two sets of network parameters that must be learned through network training. In addition, the scale and translation factors of the batch normalization layer are one-dimensional vectors, whereas those in SPADE are three-dimensional tensors with not only a channel dimension but also width and height dimensions, and are therefore spatially adaptive. The fact that batch normalization ignores the spatial dimensions is also one of the important reasons for the loss of image information. The output $f^i_{out}$ after SPADE normalization is shown in Equation (8): $f^i_{out} = \gamma^i_{c,h,w} \cdot \frac{f^i_{n,c,h,w} - \mu^i_c}{\sigma^i_c} + \beta^i_{c,h,w}$, where $\gamma^i$ is the scale factor, $\beta^i$ is the translation factor, $N$ is the batch size, and $f^i$ is the output feature map of the $i$th convolutional layer.

The mean $\mu$ and standard deviation $\sigma$ are calculated in the same way as in the batch normalization layer, as shown in Eqs. (9) and (10): $\mu^i_c = \frac{1}{N H_i W_i} \sum_{n,h,w} f^i_{n,c,h,w}$, $\sigma^i_c = \sqrt{\frac{1}{N H_i W_i} \sum_{n,h,w} \left(f^i_{n,c,h,w}\right)^2 - \left(\mu^i_c\right)^2}$.
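The sketch below implements a SPADE-style layer along the lines of Eqs. (8)-(10): per-channel batch statistics normalize the feature map, and spatially varying γ and β maps are predicted from a conditioning map by convolution. The hidden width and channel sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Minimal spatially-adaptive normalization in the spirit of Eqs. (8)-(10)."""
    def __init__(self, feat_channels, cond_channels, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, cond):
        # Per-channel batch statistics (mu, sigma of Eqs. (9)-(10)).
        mu = feat.mean(dim=(0, 2, 3), keepdim=True)
        sigma = feat.std(dim=(0, 2, 3), keepdim=True) + 1e-5
        normalized = (feat - mu) / sigma
        # Predict spatially varying scale and shift from the conditioning map.
        cond = F.interpolate(cond, size=feat.shape[2:], mode='nearest')
        h = self.shared(cond)
        return self.to_gamma(h) * normalized + self.to_beta(h)

# Fuse a 64-channel product feature map with a 3-channel color feature map.
spade = SPADE(feat_channels=64, cond_channels=3)
out = spade(torch.randn(2, 64, 32, 32), torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```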

Loss function design

The loss function of PCGAN contains three parts: the generative adversarial loss of the GAN, the perceptual loss between the real product image and the generated product image, and the reconstruction loss between the real image and the generated image. The generative adversarial loss $L_{adv}$ is computed as shown in Equation (11): $L_{adv}(G,D,x,y,c) = \mathbb{E}_{x\sim p_{data}(x),\, c\sim p_{data}(c)}[\log(1 - D(G(x, E(c))))] + \mathbb{E}_{y\sim p_{data}(y)}[\log D(y)]$, where $G(\cdot)$ denotes the generator function, $D(\cdot)$ denotes the discriminator function, $E(\cdot)$ denotes the color feature extraction module, $x$ denotes the locally grayscaled product image, $y$ denotes the real product image, and $c$ denotes the reference image.

The perceptual loss measures the Euclidean distance between the perceptual features of the generated image and the real image, and aims to penalize differences in semantic information between the two. PCGAN uses a VGG-16 network pre-trained on the ImageNet dataset to extract the perceptual features of the image. The perceptual loss $L_{per}$ is calculated as shown in Equation (12): $L_{per}(G,x,y,c) = \sum_j \frac{1}{C_j H_j W_j} \left\| \phi_j(y) - \phi_j(G(x, E(c))) \right\|_2^2$, where $\phi_j(y)$ denotes the output feature map of $y$ at the $j$th convolutional layer of VGG-16, $C_j$, $H_j$ and $W_j$ denote the number of channels, height and width of the $j$th feature map, respectively, and $\|\cdot\|_2$ denotes the L2 norm.

The reconstruction loss penalizes the color difference between the generated image and the real image, which is crucial for supervised color generation tasks. The reconstruction loss $L_{rec}$ used in this paper is the Manhattan distance, computed as shown in Equation (13): $L_{rec}(G,x,y,c) = \frac{1}{CHW} \left\| y - G(x, E(c)) \right\|_1$, where $\|\cdot\|_1$ denotes the L1 norm.

In summary, the objective function of PCGAN is given in Eq. (14), where $\lambda_1$ and $\lambda_2$ adjust the proportions of the three loss terms. When training the discriminator, the parameters of the generator are fixed and the optimization objective is to maximize the first loss; when training the generator, the parameters of the discriminator are fixed and the optimization objective is to minimize the weighted sum of the three losses. PCGAN reaches the final Nash equilibrium state by alternately training the generator and the discriminator: $\min_G \max_D L_{adv}(G,D,x,y,c) + \lambda_1 L_{per}(G,x,y,c) + \lambda_2 L_{rec}(G,x,y,c)$, where $\lambda_1$ and $\lambda_2$ are the weights of the perceptual loss and reconstruction loss, respectively.
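A compact sketch of this weighted objective from the generator's side is shown below; the λ values, the toy patch discriminator, the single VGG-16 layer used as φ and the randomly initialized VGG weights are all assumptions for illustration (the paper uses an ImageNet-pretrained VGG-16 and a full PatchGAN).

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Sketch of the weighted objective in Eq. (14); settings are illustrative.
lambda_per, lambda_rec = 1.0, 10.0
vgg = models.vgg16().features[:16].eval()   # phi; pretrained weights assumed in the paper
for p in vgg.parameters():
    p.requires_grad_(False)
D = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
                  nn.Conv2d(64, 1, 4, 2, 1), nn.Sigmoid())   # toy patch discriminator
l1 = nn.L1Loss()

def generator_loss(y_fake, y_real):
    adv = -torch.log(D(y_fake) + 1e-8).mean()          # adversarial term (non-saturating form)
    per = ((vgg(y_fake) - vgg(y_real)) ** 2).mean()     # perceptual term, Eq. (12), single layer j
    rec = l1(y_fake, y_real)                            # reconstruction term, Eq. (13)
    return adv + lambda_per * per + lambda_rec * rec

y_real = torch.rand(1, 3, 128, 128)
y_fake = torch.rand(1, 3, 128, 128, requires_grad=True)
print(generator_loss(y_fake, y_real).item())
```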

SPGAN-based product style transformation algorithm

In this section, the generative adversarial network technique is used to study the product style transformation problem, and by borrowing the idea of cyclic consistency loss and combining the human-computer interaction techniques and attention mechanisms, a specialized SPGAN network (Style Transformation of Product based on Generative Adversarial Networks) is designed to realize the task of product style transformation.

SPGAN network architecture

In the SPGAN network model, an input image from the B domain is passed through the first generator $G_{B2A}$, whose function is to transform images in the B domain into the A domain; the generated image is then passed through another generator $G_{A2B}$, whose function is to transform the image back into the B domain, producing the output image Cyclic_B. The input image and the output image are related by a cycle consistency loss function, so that the closer the two are, the better. The overall architecture and data flow of the SPGAN network are shown in Fig. 2.

Step1, generator $G_{B2A}$ translates the image Input_B into the target product image Generated_A, and Generated_A generates Cyclic_B through $G_{A2B}$; the closer Cyclic_B is to Input_B, the better the quality of Generated_A. The corresponding discriminator is used to identify whether the target product image is a real image or one generated by the generator.

Step2, SPGAN requires two loss functions for training: the reconstruction loss of the generator and the discrimination loss of the discriminator. The reconstruction loss makes the generated image Cyclic_B as similar as possible to the original image Input_B, while the discrimination loss is obtained by feeding both the generated fake image and the original real image into the discriminator.

Step3, both the attention model and dilated convolution are applied to the generator, and the human-computer interaction technique is applied to the generator loss function.

Figure 2.

Overall architecture and data flow of SPGAN network

Generator

The generators GA2B and GB2A have the same network structure, which consists of an encoder, converter and decoder.

The encoder includes three convolutional layers and uses a convolutional neural network to extract features from the input image. The converter converts the feature vectors of an image in the A domain into feature vectors in the B domain by combining the different features of the image; this paper uses six Resnet_block modules, which preserve the original image features during conversion. The decoder uses transposed convolutional layers to restore low-level features from the feature vectors and finally obtains the generated image; this paper uses two transposed convolutional layers and one convolutional layer. SPGAN adds a BN (Batch Norm) layer after each convolutional layer of the generator, which accelerates the convergence of the generator and also plays a certain regularization role.

Resnet_block is a neural network block consisting of two dilated convolutional layers, in which part of the input data is added directly to the output to ensure that information from the input of the earlier network layer is applied directly to the later layer, reducing the deviation of the output from the original input. The main purpose of Resnet_block is to preserve features of the original image such as the target's size, color and shape, so residual networks are well suited to these transformations.

Dilated convolution is a form of convolution that enlarges the receptive field by expanding the convolution kernel. It avoids the information loss caused by first convolving the image and then pooling. In this model, all the convolutional layers in the six Resnet_block modules are replaced with dilated convolutional layers to achieve better output.

Discriminator

The discriminator takes an image as input and predicts whether it is real or fake. The discriminator is a convolutional network that extracts features from the image and then determines whether the extracted features belong to a particular class through an additional convolutional layer that produces a one-dimensional output. The discriminator of SPGAN contains five convolutional layers.

Loss function of SPGAN

The basic principle of SPGAN is that an image from one domain can be mapped back to itself after two transformations, expressed as: $G_{B2A}(G_{A2B}(x)) = x$ and $G_{A2B}(G_{B2A}(x)) = x$.

Based on this cyclic idea, the loss function of SPGAN is divided into two parts: an adversarial loss that matches the distribution of the generated images with the distribution of images in the target domain, and a cycle consistency loss used to learn the two mapping functions. In this paper, the data distributions are denoted $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$, with the two mappings $G: X \to Y$ and $F: Y \to X$. In addition, two adversarial discriminators, $D_X$ and $D_Y$, are introduced to distinguish real images from generated images.

Adversarial loss

Ideally, the discriminator $D$ should determine as accurately as possible whether the input data is a real image, while the generator $G$ tries to deceive the discriminator so that it judges all generated fake images as real. The optimization objective is therefore to minimize the generator loss and maximize the discriminator loss. This loss is defined as the adversarial loss and is applied to both mapping functions; for the mapping $G: X \to Y$ and its discriminator $D_Y$, it is expressed as: $\min_G \max_{D_Y} L_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y\sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x\sim p_{data}(x)}[\log(1 - D_Y(G(x)))]$, where $\mathbb{E}$ represents the expected value over the corresponding distribution, and $p_{data}(x)$ and $p_{data}(y)$ represent the distributions of real samples in the X and Y domains. $D_Y$ is used to distinguish generated samples from real samples: for a real sample, the closer its output is to 1, the better, giving the term $\log D_Y(y)$; for a generated sample $G(x)$, the closer the discriminator output $D_Y(G(x))$ is to 0, the better, i.e., the total value is to be maximized. Similarly, for the mapping function $F: Y \to X$ and its discriminator $D_X$: $\min_F \max_{D_X} L_{GAN}(F, D_X, Y, X) = \mathbb{E}_{x\sim p_{data}(x)}[\log D_X(x)] + \mathbb{E}_{y\sim p_{data}(y)}[\log(1 - D_X(F(y)))]$.

Loss of cyclic consistency

SPGAN learns the two mappings $G$ and $F$ simultaneously and must be able to measure the difference between the original image and the image obtained by converting a class-X image to class Y and then back to class X with the other generator. The generators are used to reconstruct the images with the aim that the reconstructed image is as similar as possible to the original, i.e., $F(G(x)) \approx x$ and $G(F(y)) \approx y$; this can simply be measured with an L1 or L2 loss. The final cycle consistency loss is written as: $L_{cyc}(G, F, X, Y) = \mathbb{E}_{x\sim p_{data}(x)}\left[\left\| F(G(x)) - x \right\|_1\right] + \mathbb{E}_{y\sim p_{data}(y)}\left[\left\| G(F(y)) - y \right\|_1\right]$.

Overall loss function of SPGAN

Thus, the loss function of SPGAN consists of three parts, two adversarial losses and one cycle consistency loss, and is expressed as: $L = L_{GAN}(G, D_Y, X, Y) + L_{GAN}(F, D_X, Y, X) + \lambda L_{cyc}(G, F, X, Y)$.
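The sketch below assembles the adversarial and cycle-consistency terms from the generators' and discriminators' points of view; the generator/discriminator stand-ins and the weight λ = 10 are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()
bce = nn.BCELoss()
lam = 10.0   # weight of the cycle-consistency term; a common choice, assumed here

def spgan_generator_loss(G, F, D_X, D_Y, x, y):
    """Generator-side objective: fool both discriminators and reconstruct
    each image after a round trip (F(G(x)) ~ x and G(F(y)) ~ y)."""
    fake_y, fake_x = G(x), F(y)
    pred_y, pred_x = D_Y(fake_y), D_X(fake_x)
    adv = bce(pred_y, torch.ones_like(pred_y)) + bce(pred_x, torch.ones_like(pred_x))
    cyc = l1(F(fake_y), x) + l1(G(fake_x), y)
    return adv + lam * cyc

def spgan_discriminator_loss(D, real, fake):
    # Real images should be scored 1, generated images 0.
    pred_real, pred_fake = D(real), D(fake.detach())
    return bce(pred_real, torch.ones_like(pred_real)) + \
           bce(pred_fake, torch.zeros_like(pred_fake))

# Quick shape check with toy stand-ins for the generators and discriminators.
toy_gen = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Sigmoid())
toy_disc = nn.Sequential(nn.Conv2d(3, 1, 4, 2, 1), nn.Sigmoid())
x = torch.rand(1, 3, 64, 64)
y = torch.rand(1, 3, 64, 64)
print(spgan_generator_loss(toy_gen, toy_gen, toy_disc, toy_disc, x, y).item())
print(spgan_discriminator_loss(toy_disc, y, toy_gen(x)).item())
```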

Attention Mechanisms in SPGAN

The attention mechanism can strike a good balance between reducing the number of parameters and enlarging the receptive field while taking global information into account. Adding the attention mechanism to the generative adversarial network enables the network to better learn the detailed features of the product [31].

Attention Mechanism

The attention mechanism filters a small amount of key information out of all the available information and focuses attention on this important information while ignoring most of the unimportant information. The mechanism is mainly reflected in the calculation of weight coefficients: the larger the weight coefficient, the more attention is focused on the corresponding Value. That is, the weight coefficient reflects the importance of the information, and the Value is the corresponding information itself.

The elements in the Source are treated as a series of (Key, Value) data pairs. Given an element Query from the target, the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity between the Query and that Key, and the Values are then weighted and summed to obtain the final Attention value. That is, the attention mechanism weights and sums the Values of the elements in the Source, with the Query and each Key used to calculate the weight coefficient of the corresponding Value: $\text{Attention}(Query, Source) = \sum_{i=1}^{L_x} \text{Similarity}(Query, Key_i) \cdot Value_i$, where $L_x = \|Source\|$ is the length of the Source.

The attention mechanism comprises two processes: the first calculates the weight coefficients from the Query and each Key, and the second uses these coefficients to weight and sum the Values. The first process can be subdivided into two stages: the first stage calculates the similarity between the Query and each Key, and the second stage normalizes the raw scores from the first stage. The computational process of attention can therefore be abstracted into three stages.

In the first stage, different computational mechanisms and functions can be introduced to compute the similarity between the Query and a given $Key_i$; here the dot product of the two vectors is used: $\text{Similarity}(Query, Key_i) = Query \cdot Key_i$.

Because the range of the first-stage scores depends on the method used to produce them, the second stage introduces a Softmax calculation to normalize the scores obtained in the first stage, as shown in Equation (23): $a_i = \text{Softmax}(Similarity_i) = \frac{e^{Similarity_i}}{\sum_{j=1}^{L_x} e^{Similarity_j}}$.

This calculation organizes the raw scores into a probability distribution in which the weights of all elements sum to 1, while the intrinsic mechanism of Softmax further highlights the weights of the important elements.

The result $a_i$ of the second-stage calculation is the weight coefficient of $Value_i$, and the final Attention value is obtained by weighting and summing: $\text{Attention}(Query, Source) = \sum_{i=1}^{L_x} a_i \cdot Value_i$.
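A minimal NumPy sketch of this three-stage calculation (dot-product similarity, Softmax normalization, weighted sum of Values) is given below; the vector dimensions are arbitrary.

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product similarity, softmax normalization, weighted sum of Values;
    query: (d,), keys/values: (L_x, d)."""
    scores = keys @ query                      # stage 1: Similarity(Query, Key_i)
    weights = np.exp(scores - scores.max())    # stage 2: softmax (shifted for stability)
    weights /= weights.sum()
    return weights @ values                    # stage 3: weighted sum over Values

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
print(attention(q, K, V))
```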

Network structure and algorithm

The network structure of SPGAN replaces the traditional convolutional feature maps with feature maps with attention, borrowing the self-attention model of SAGAN and adding an attention layer in the 3rd and 4th residual blocks of the generator [32]. The feature map with attention is calculated as follows:

Step1: Let $W_g \in \mathbb{R}^{\bar{C}\times C}$, $W_f \in \mathbb{R}^{\bar{C}\times C}$ and $W_h \in \mathbb{R}^{C\times C}$ be the learned weight matrices, all obtained by ordinary $1\times 1$ convolutions, where $\bar{C} = C/8$. $f(x)$ and $g(x)$ are the two mappings that project the image into feature spaces, i.e.: $f(x) = W_f x$, $g(x) = W_g x$.

Step2: Use $\beta_{j,i}$ to indicate the extent to which the model attends to the $i$th position when synthesizing the $j$th region: $\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}$, where $s_{ij} = f(x_i)^T g(x_j)$.

Step3: The output of the attention layer is $o = (o_1, o_2, \ldots, o_j, \ldots, o_N) \in \mathbb{R}^{C\times N}$, i.e.: $o_j = \sum_{i=1}^{N} \beta_{j,i}\, h(x_i)$, where $h(x_i) = W_h x_i$.

Step4: Finally, the attention layer output $o$, multiplied by a scale parameter $\gamma$, is added to the input feature map $x$. The final output is shown in Equation (29): $y_i = \gamma o_i + x_i$, where $\gamma$ is initialized to 0.0001 and then gradually increased. This allows the model to learn simple tasks first and then slowly increase the complexity of the learning task by assigning more weight to non-local features.
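The following PyTorch sketch implements a self-attention layer of this SAGAN style, with f, g, h produced by 1×1 convolutions, β computed by Softmax over the similarity scores, and the residual output y = γo + x with γ initialized to 0.0001; the channel count is an assumed example.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention layer following Steps 1-4 above."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // 8, 1)   # W_f, 1x1 convolution
        self.g = nn.Conv2d(channels, channels // 8, 1)   # W_g, 1x1 convolution
        self.h = nn.Conv2d(channels, channels, 1)        # W_h, 1x1 convolution
        self.gamma = nn.Parameter(torch.tensor(1e-4))    # small initial scale

    def forward(self, x):
        b, c, hgt, wdt = x.shape
        n = hgt * wdt
        f = self.f(x).view(b, -1, n)                 # (b, c/8, N)
        g = self.g(x).view(b, -1, n)                 # (b, c/8, N)
        h = self.h(x).view(b, c, n)                  # (b, c,   N)
        # s_ij = f(x_i)^T g(x_j); softmax over i gives beta_{j,i}.
        beta = torch.softmax(torch.bmm(f.transpose(1, 2), g), dim=1)  # (b, N, N)
        o = torch.bmm(h, beta).view(b, c, hgt, wdt)  # o_j = sum_i beta_{j,i} h(x_i)
        return self.gamma * o + x                    # y = gamma * o + x

attn = SelfAttention2d(64)
print(attn(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```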

Analysis of model applications
Color migration experiment and result analysis
Data set preparation

To verify the effectiveness of the PCGAN-based product color migration model, this paper applies it to a specific mobile interface color migration task. In this section, user interface visual images from the RICO dataset are used to train the interface color visual style model.

The RICO dataset is the largest repository of mobile application interfaces today, specifically including 27 application categories (including sports, shopping, transportation, food, etc.) from the current mobile application market, more than 9,700 Android apps with high app ratings, and 72,219 user interfaces with design and interaction data (interface screenshots and interface visual hierarchy code). It discloses properties of visual images, text, structure, and interaction elements of more than 7200 unique user interfaces. Compared to other publicly available mobile application datasets, such as ERICA, Shirazi, and Alharbi, the RICO dataset is newer and has the advantage of more interface data, variety, and richness.

Model training results

With the specific PCGAN structure designed and the RICO dataset used for training, the experiments in this section divide the dataset into a training part and a testing part: the training part is divided into TrainA and TrainB, and the testing part into TestA and TestB. First, 80,000 screenshots of mobile application interfaces are randomly selected from the RICO dataset and classified into color styles according to the PCGAN model; 80% of the interface images of one color style are placed in TrainA as the target interfaces. Then, 80% of the interface images of a different color style are placed in TrainB as the migration-body interfaces. Finally, the remaining 20% of the interface images are placed in TestA and TestB, respectively. The objective of this experiment is to transfer the color scheme of the application interfaces in TrainB to the application interfaces in TrainA, and to verify the result using TestA and TestB.

For the parameter selection of the proposed PCGAN network, the hyperparameter of the cyclic consistency loss is α=30, the ontology consistency loss weight is β=10, and the gradient penalty of the discriminator loss is λ=10. In addition, data augmentation such as brightness variation, stretching, saturation, contrast and rotation is used to enhance the model, and the whole experiment is trained for 21,000 steps.

The evolution of the PCGAN adversary A loss DA_Loss, adversary B loss DB_Loss, generator A loss GA_Loss and generator B loss GB_Loss over the whole experiment is shown in Fig. 3 (a)~(d).

Figure 3.

Loss trend chart

As can be seen from Fig. 3, the adversary loss in model training is significantly smaller than the generator loss, especially in the early training period, when the difference between the two is extremely large. As the training step increases, both the generator loss and the adversary loss gradually converge to 0. However, the loss fluctuation of the generator is much larger than that of the adversary, i.e., the adversary loss changes more stably during training. In addition, comparing the two types of mobile application interface data, A and B, the fluctuation of the loss values for class A data is relatively small.

Analysis of experimental results

Peak Signal-to-Noise Ratio (PSNR) is a scientific metric used to measure image quality, calculated as follows: $MSE = \frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\left[ X(i,j) - Y(i,j) \right]^2$, $PSNR = 10 \log_{10} \frac{(2^n - 1)^2}{MSE}$, where $MSE$ denotes the mean square error between the two images, $X$ and $Y$ denote the pixel values at the corresponding coordinates of the two images, and $H$ and $W$ represent the height and width of the image, respectively. In the PSNR formula $n$ is 8, and a larger PSNR value indicates a sharper image.
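A direct NumPy implementation of these PSNR formulas (with n = 8) is sketched below for reference.

```python
import numpy as np

def psnr(x, y, n_bits=8):
    """PSNR per the MSE and PSNR formulas above; x, y are same-size images."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float('inf')          # identical images
    peak = (2 ** n_bits - 1) ** 2
    return 10 * np.log10(peak / mse)

a = (np.random.rand(64, 64) * 255).astype(np.uint8)
noisy = np.clip(a + np.random.normal(0, 5, a.shape), 0, 255).astype(np.uint8)
print(round(psnr(a, noisy), 2))
```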

Structural Similarity (SSIM) is a scientific metric used to calculate the structural similarity of two images, one being the original high-definition image and the other the distorted image; the SSIM of the two can be used as an indicator of how the distortion affects the quality of the original image. Experiments have shown that the results of this method agree with visual judgments of image quality. The SSIM value lies in the range [0,1], and a larger value means less image distortion.

For two given samples $x$ and $y$, their structural similarity is defined as: $SSIM(x,y) = \left[ l(x,y) \right]^{\alpha}\left[ c(x,y) \right]^{\beta}\left[ s(x,y) \right]^{\gamma}$

where: $l(x,y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$, $c(x,y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$, $s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$

$l(x,y)$ compares brightness, $c(x,y)$ compares contrast, $s(x,y)$ compares structure, and $\alpha, \beta, \gamma > 0$. $\mu_x, \mu_y, \sigma_x, \sigma_y$ are the means and standard deviations of $x$ and $y$, respectively, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $C_1, C_2, C_3$ are constants that ensure the stability of $l$, $c$ and $s$.
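The sketch below computes a simplified single-window SSIM with α = β = γ = 1 and C3 = C2/2, which collapses the l·c·s product into the familiar two-term form; practical implementations evaluate it over sliding windows and average, so this is only an approximation.

```python
import numpy as np

def ssim(x, y, C1=6.5025, C2=58.5225):
    """Global (single-window) SSIM per the definitions above, with
    alpha = beta = gamma = 1 and C3 = C2/2; C1 and C2 follow the usual
    (0.01*255)^2 and (0.03*255)^2 convention for 8-bit images."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
           ((mu_x**2 + mu_y**2 + C1) * (var_x + var_y + C2))

a = (np.random.rand(64, 64) * 255).astype(np.uint8)
print(round(ssim(a, a), 4))  # identical images -> 1.0
```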

In this paper, the image quality of the proposed PCGAN-based color style migration method is compared with that of the ImagineNet method when generating mobile application interfaces; the SSIM comparison results are shown in Fig. 4.

Figure 4.

SSIM trend chart

By analyzing the results of the test experiments, the mean PSNR of the application interfaces generated by the PCGAN model is 19.245, while that of the interfaces generated by the ImagineNet method is 16.214. The proposed method therefore improves on the ImagineNet method by 18.69% in terms of the PSNR metric.

As can be seen from Fig. 4, the method proposed in this paper finally achieves a structural similarity of 0.98 on the SSIM metric, and every element in the application interface can be easily recognized with no obvious missing contour information. In contrast, the ImagineNet method only achieves a structural similarity of 0.86, and some elements in the application interface lack obvious contour information and cannot be recognized. This shows that the PCGAN-based color migration method performs better than the ImagineNet method and is suitable for product color migration tasks.

Style transformation experiment and result analysis
Performance comparison experiment results

To demonstrate the effectiveness of the SPGAN-based product style transformation algorithm, this paper conducts a comparison test with the StarGAN model and the more recent StarGAN-v2 model. On the basis of 6,000 generated samples, the performance differences of the models are evaluated by calculating the Fréchet Inception Distance (FID) between the generated samples and the training dataset, where a smaller FID value indicates better generation quality. The FID values of the three models are shown in Table 1.

Table 1. The FID of the three models

Model FID
StarGAN 120.467
StarGAN v2 82.625
SPGAN 60.241

As can be seen from Table 1, SPGAN achieved the smallest FID of 60.241 among the three experimental models, while the StarGAN model and the StarGAN-v2 model of the same type achieved 120.467 and 82.625, respectively, so SPGAN provides a significant improvement in the quality of the generated images. This verifies that the SPGAN model in this paper can explicitly control the feature resolution of each scale in the network model, expand the receptive field of the convolutional network, and carry out multi-feature extraction and fusion, which makes the generated images more complete and detailed in the learning and embodiment of stylistic features, with correspondingly higher image generation quality.

Comparison of experimental results on model convergence stability

To verify that the improved SPGAN improves network training stability and generator convergence, this subsection compares the SPGAN model with the StarGAN model and the more recent StarGAN-v2 model. In the comparison experiments, the FID value (the smaller the FID, the better the generation quality of the model) is used as the evaluation criterion: the FID computed by each model is recorded after every 2,000 iterations during training on the seat dataset, for a total of 80,000 iterations, and the FID curves of model training are plotted as shown in Fig. 5. In Fig. 5, the purple, orange and cyan curves are the convergence trends of the FID values of the StarGAN, StarGAN-v2 and SPGAN models, respectively, during training.

Figure 5.

The FID value curve of the models training

As can be seen from Figure 5, for the StarGAN model, over the 80,000 iterations the FID reaches its lowest point at about 14,000 iterations; thereafter training fluctuates drastically and no clear convergence trend appears in the subsequent iterations. StarGAN-v2 is relatively stable and its overall curve shows a downward trend, but the model still exhibits large fluctuations during convergence and oscillates within the convergence interval. SPGAN converges more slowly in the initial iterations, but its curve shows a more stable convergence trend in the first 30,000 iterations; in the last 50,000 iterations the model shows some fluctuations, but they are small and the curve remains relatively smooth. Observing the curves confirms that the SPGAN model in this paper has a clear advantage in training stability.

In addition, it can be observed in Fig. 5 that the SPGAN curve is at the lowest position in the graph among the three types of model training results, which indicates that SPGAN has a better convergence of the generator and higher quality of the generated images while the model is stably trained.

Subjective Questionnaire Survey on Product Multi-Level Visual Communication Design Elements

This section explores the quantitative impact of the three main visual elements of color, brand logo and visual layout on product design by means of a questionnaire survey, taking a product's packaging design as the object, and links these impacts to product design effectiveness and digital brand image building in order to explore in depth how these elements affect the various aspects of product design.

Evaluation of the effect of color elements

The descriptive statistics and normality test results for the product color elements are shown in Table 2. Based on these statistics, the overall brand evaluation scores of this group of packages are relatively high, reaching mean values of 4.13 and 4.16, respectively. Among the brand evaluation indicators, the contribution to shaping the brand image received the highest rating of 4.16, reflecting the effectiveness of the package design in creating a strong brand impression. The lower score for emphasizing brand uniqueness indicates a lack of brand distinctiveness. The highest ratings for color design were given to attractiveness and aesthetics, indicating that the package's use of color is notably attractive and representative. The small standard deviations of the ratings show that the evaluation participants were quite consistent in their recognition of the package. The slightly lower rating of 3.89 for color in terms of overall visual coherence may be due to the large amount of information in the package design, which to some extent distracts from the focus of the marketing objectives.

Table 2. Descriptive statistics and normality test results of product color elements

Dimension Item Mean Std. Deviation Skewness Kurtosis
Color elements P101: Reasonable color matching in overall visual sense 3.89 0.819 -0.134 -0.673
P102: The overall color collocation is reasonable and has strong appeal 4.01 0.691 -0.119 -0.516
P103: The colors are beautiful and expressive 4.08 0.817 -0.746 0.912
Psychological evaluation P104: Easy to attract visual attention 4.12 0.792 -0.607 -0.117
P105: Arouse people’s interest 3.91 0.685 0.004 -0.963
P106: Conveying information is easy to understand 3.93 0.762 -0.527 0.639
P107: Impressive packaging brand 3.93 0.832 -0.512 0.622
P108: Willingness to like and purchase 4.04 0.749 -0.117 -0.642
Brand evaluation P109: The packaging has prominent brand features 4.13 0.837 -0.634 0.947
P110: The packaging is beneficial for shaping the brand image 4.16 0.885 -0.583 -0.359
Evaluation of the effect of brand logo elements

The results of the evaluation of the brand logo elements of the product are shown in Table 3. In this group's packaging evaluation, the logo elements received scores of 3.94, 3.87 and 4.03 for the overall layout matching, the appropriateness of the size and proportion of the brand logo, and the prominence of the brand logo, respectively. This indicates that, within the overall design, the salience of the brand logo performed best, while its size, proportion and overall layout rationality did not differ much in the evaluation. Among the psychological impact ratings, this group of packages scored highest on "liking and willingness to buy", which is mainly attributed to the brand's high level of awareness; the ratings for arousing interest and leaving a strong impression of the packaging brand were also relatively high. This suggests that consumers pay close attention to brands they are familiar with and have a good understanding of the brand's message.

Table 3. Descriptive statistics and normality test results of brand logo elements

Dimension Item Mean Std. Deviation Skewness Kurtosis
Marker element P201: The overall visual combination of brand logo and packaging is reasonable 3.94 0.662 -0.192 0.212
P202: The size and proportion of the brand logo are appropriate and reasonable, and representative 3.87 0.685 -0.035 -0.295
P203: Highlighting the brand logo 4.03 0.729 -0.377 -0.297
Psychological evaluation P204: Easy to attract visual attention 3.99 0.804 -0.373 -0.179
P205: Arouse people’s interest 4.04 0.751 -0.405 -0.267
P206: Conveying information is easy to understand 3.98 0.734 -0.241 -0.078
P207: Impressive packaging brand 4.04 0.692 -0.127 -1.012
P208: Willingness to like and purchase 4.14 0.695 -0.178 -0.764
Brand evaluation P209: The packaging has prominent brand features 4.15 0.691 -0.523 0.354
P210: The packaging is beneficial for shaping the brand image 4.01 0.742 -0.352 -0.043
Evaluation of the effect of visual layout elements

The descriptive statistics and normality test results for the visual layout elements of the product are shown in Table 4. According to the scoring results in Table 4, the visual layout scores 3.91, 3.96 and 4.10 on the three aspects of reasonable overall collocation, helping content understanding, and memorability, respectively. Because these scores are similar, changing the visual layout elements has only a small effect on the overall presentation of the package, and this group of packages is therefore suitable for varying the visual layout elements in the comparative analysis. In terms of psychological evaluation, this type of packaging carries less information, yet it scores higher on attracting visual attention, triggering preference and stimulating purchase desire. However, it scores lower on the brand-effect items, mainly because the package lacks a prominent logo and clear copy, resulting in weaker performance in conveying the brand message and highlighting the brand image.

Table 4. Descriptive statistics and normality test results of visual layout elements

Dimension Item Mean Std. Deviation Skewness Kurtosis
Visual layout elements P301: The visual layout is reasonably matched overall 3.91 0.682 -0.245 0.033
P302: Visual layout helps with content understanding 3.96 0.749 0.072 -1.005
P303: The visual layout is unforgettable 4.10 0.755 -0.744 0.933
Psychological evaluation P304: Easy to attract visual attention 4.12 0.652 -0.282 -0.171
P305: Arouse people’s interest 4.01 0.771 -0.467 0.525
P306: Conveying information is easy to understand 4.07 0.639 -0.251 -0.425
P307: Impressive packaging brand 4.14 0.662 -0.283 0.187
P308: Willingness to like and purchase 4.16 0.706 -0.137 -0.763
Brand evaluation P309: The packaging has prominent brand features 4.11 0.706 -0.085 -0.525
P310: The packaging is beneficial for shaping the brand image 4.02 0.667 -0.259 -0.634
Conclusion

In this paper, based on the characteristics of the human visual system and visual perception, together with generative adversarial networks, we design a multi-level visual communication design strategy built on the color migration algorithm PCGAN and the style transformation algorithm SPGAN, so as to construct a digital brand image.

The quality of images generated by the proposed PCGAN-based color style migration method is compared with that of the ImagineNet method on mobile application interfaces. The mean peak signal-to-noise ratio (PSNR) values of the PCGAN model and the ImagineNet method are 19.245 and 16.214, respectively, so the PCGAN model improves the PSNR metric by 18.69% compared with ImagineNet. Meanwhile, in terms of the structural similarity (SSIM) metric, the PCGAN model finally reaches 0.98: every element in the application interface can be easily recognized and no obvious contour information is missing. The ImagineNet method, by contrast, reaches only 0.86, and some elements in the application interface lack obvious contour information and cannot be recognized. This shows that the PCGAN-based color migration method in this paper outperforms the ImagineNet method and is better suited to the color migration task for products.
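As an illustration only, the PSNR and SSIM figures quoted above could be measured with the reference implementations in scikit-image, as in the minimal sketch below; it assumes the reference and generated interface screenshots are same-sized uint8 RGB arrays and is not the authors' exact evaluation pipeline.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def interface_quality(reference: np.ndarray, generated: np.ndarray):
    """Return (PSNR, SSIM) for a pair of uint8 RGB interface images."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    # channel_axis=-1 marks the RGB axis (scikit-image >= 0.19).
    ssim = structural_similarity(reference, generated,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim

Averaging these two metrics over a set of interface screenshots would yield per-method means comparable to the values reported in this section.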

In addition, the SPGAN model achieves the smallest Fréchet Inception Distance (FID) of 60.241 in the style migration experiment for products, whereas the StarGAN model and the StarGAN-v2 model of the same type achieve 120.467 and 82.625, respectively. SPGAN thus delivers a significant improvement in generated image quality, verifying the effectiveness of the SPGAN model in this paper for product image style migration.

Finally, taking the packaging design of one product as an example, the influencing factors of the product's multi-level visual communication design are explored. The analysis shows that, in product packaging design based on visual attention mechanisms, color matching, typography, visual paths, logo design and packaging form are all key elements. Color matching requires sharp contrast and harmony, which can trigger consumers' emotional response and visual attention. The typographic layout needs to guide the consumer's eye so that key information can be accessed quickly. The brand logo, as a key identification element, needs to be simple, unique and able to attract consumers' attention. The packaging form should be tailored to product attributes and market needs while considering ease of use and environmental friendliness. Together, these factors influence consumers' visual attention, which in turn affects their choices and purchasing decisions.

Funding:

This paper is a phased outcome of the 2023 education and teaching research project (major project) of Fujian undergraduate colleges and universities: "Exploration on the training mode reform of first-class applied talents of art majors under the background of strategic engineering of comprehensive reform of higher education" (project number: FBJY20230310).
