
Research and Application of Key Technology of Safety Situational Awareness for the Whole Process of Grid Infrastructure Construction Based on Edge-Side Scene Recognition and Knowledge Fusion

Feb 03, 2025


Introduction

With societal development and progress, the demand for electricity continues to rise, and the pace of power grid infrastructure construction has accelerated, attracting increased attention to construction safety. During the actual construction of power grid infrastructure, various safety incidents occur frequently, underscoring the importance of safety management in these projects [1-3]. To effectively address safety issues, it is essential to analyze existing problems, establish comprehensive safety management strategies, eliminate potential hazards, ensure construction quality and safety, and facilitate smooth progress in power grid infrastructure projects [4-7].

In complex ground environments, timely and accurate recognition of personnel behavior is critical for safety monitoring and risk prevention. The development of smart cities and intelligent surveillance technology has led to an increase in demand for real-time recognition of personnel behavior in construction, traffic, and industrial scenes. These scenes often contain irregular obstacles, complex environmental factors, and uncertain lighting conditions, making recognition challenging. Traditional safety monitoring systems typically rely on visible light image data for analysis and recognition. In complex ground environments, it can be difficult to capture subtle behavioral features of personnel when relying solely on visual information. Additionally, in cases of poor lighting or occlusion, recognition accuracy decreases significantly. Consequently, effectively recognizing personnel behavior in complex environments has become an urgent issue in the field of safety monitoring.

Currently, behavior recognition technology mainly relies on visible light image data, using deep learning methods such as convolutional neural networks (CNNs) to extract image features for analysis. However, in complex scenes, relying solely on visible light images faces challenges from occlusion, lighting changes, and missing spatial information, which degrade recognition performance. Initial attempts at multimodal data fusion have therefore emerged in the behavior recognition field, suggesting that fusing 3D point cloud data with image data can effectively improve recognition capability, although challenges remain in data alignment, feature extraction, and the optimization of fusion strategies. In response, this paper proposes a behavior recognition method based on the fusion of 3D point cloud and visible light image data. The method first aligns and pre-processes both modalities, then employs a deep neural network model to extract multimodal features, resulting in a unified behavior recognition framework. Combining the spatial information of 3D point clouds with the texture features of image data allows the system to achieve higher accuracy and stability in complex ground environments. A deep neural network model is constructed and trained with a multimodal feature fusion strategy, making it more adaptable to environmental complexity and dynamism.

This study’s importance lies in improving the accuracy and reliability of personnel behavior recognition in complex ground scenes by combining 3D point cloud and visible light image data. Firstly, the fusion of multimodal data can effectively capture subtle changes in the environment and dynamic behavior of personnel, providing real-time, comprehensive support for safety monitoring. Secondly, the proposed deep neural network model leverages rich multimodal features, overcoming the limitations of traditional methods in complex environments and thus improving recognition performance. This has important practical implications for safety management, risk prevention, and emergency response.

Personnel Positioning Method Based on Multi-Sensor Fusion

The aim of this chapter is to enhance the accuracy and reliability of on-site safety monitoring by utilizing multi-sensor fusion methods for personnel positioning in power construction sites. Through the fusion of data from visual and non-visual sensors, precise positioning and dynamic tracking of personnel at the construction site can be achieved. Additionally, the method calculates the safe distance between different objects, providing real-time support for safety management on-site.

Sensor Fusion Scheme Design

The design of the sensor data fusion scheme is introduced in this section, with a focus on the fusion strategy between visual and non-visual sensors. In this study, visual sensors capture two-dimensional image information of personnel, while non-visual sensors provide three-dimensional spatial information, compensating for the limitations of the visual sensors.

The fusion of visual and non-visual sensors employs a weighted fusion algorithm that aligns the data from the different sensors in space to obtain more precise personnel location coordinates. Let the position detected by the visual sensor be $P_v = (x_v, y_v)$ and the position detected by the non-visual sensor be $P_n = (x_n, y_n, z_n)$. The fused coordinate position $P_f$ can be expressed as:

$$P_f = w_v P_v + w_n P_n$$

Where: $w_v$ and $w_n$ are the weight coefficients of the visual and non-visual sensor data, satisfying $w_v + w_n = 1$. The weights are set through a calibration process that minimizes the localization error after fusion.

For localization accuracy, the error $E$ of the fused coordinates can be measured using the Root Mean Square Error (RMSE):

$$E = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(P_{f_i} - P_{r_i}\right)^2}$$

Where: $P_{f_i}$ is the $i$-th fused localization point, $P_{r_i}$ is the corresponding true position, and $N$ is the number of sampling points.

Additionally, safety distance detection is achieved by calculating the Euclidean distance between objects at the construction site. The distance $d$ between two workers, or between a worker and a piece of equipment, is calculated as:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$

If $d$ is less than the preset safety distance threshold $D_{\text{safe}}$, an alarm mechanism is triggered.
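As an illustration of the fusion and evaluation steps above, the following Python sketch (using NumPy, with illustrative weights and synthetic positions rather than the calibrated values used in this work, and assuming the visual detection has already been lifted into the common three-dimensional frame) computes the weighted fusion and the localization RMSE.

```python
import numpy as np

def fuse_positions(p_vision, p_nonvision, w_v=0.4, w_n=0.6):
    """Weighted fusion P_f = w_v * P_v + w_n * P_n, with w_v + w_n = 1.

    p_vision is assumed to have been lifted into the same 3D frame as the
    non-visual detection before fusion (illustrative assumption)."""
    assert abs(w_v + w_n - 1.0) < 1e-9
    return w_v * np.asarray(p_vision, dtype=float) + w_n * np.asarray(p_nonvision, dtype=float)

def localization_rmse(fused_points, true_points):
    """E = sqrt(1/N * sum ||P_f_i - P_r_i||^2) over N sampling points."""
    fused = np.asarray(fused_points, dtype=float)
    true = np.asarray(true_points, dtype=float)
    return np.sqrt(np.mean(np.sum((fused - true) ** 2, axis=1)))

# Illustrative check with synthetic detections (not measured data).
p_v = np.array([2.1, 3.0, 0.0])   # visual detection, lifted to 3D (hypothetical)
p_n = np.array([2.0, 3.1, 1.6])   # non-visual (lidar) detection (hypothetical)
p_f = fuse_positions(p_v, p_n)
print("fused position:", p_f)
print("RMSE:", localization_rmse([p_f], [[2.0, 3.05, 1.5]]))
```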

Personnel Position Detection and Tracking

Based on the multi-sensor fusion method, this section introduces how to achieve accurate position calibration and continuous tracking of personnel on-site.

At the construction site, the initial position of personnel is first detected using the visual sensors and then corrected in three dimensions using data from the non-visual sensors. The final position estimate of personnel is:

$$P_{\text{final}} = \alpha P_{\text{vision}} + (1 - \alpha) P_{\text{non\_vision}}$$

Where: $P_{\text{vision}}$ is the position detected by the visual sensor, $P_{\text{non\_vision}}$ is the position from the non-visual sensor, and $\alpha$ is the fusion coefficient, tuned experimentally to achieve optimal accuracy.

To ensure the safety of personnel at the construction site, this study computes the safety distances between objects in real time. When two objects $A$ and $B$ are detected, the three-dimensional distance $d_{AB}$ is calculated as:

$$d_{AB} = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2 + (z_A - z_B)^2}$$

When $d_{AB} < D_{\text{safe}}$, a warning signal is triggered to alert personnel to take safety measures.
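The per-frame tracking and safety-distance check can be sketched as follows; the fusion coefficient, the safety threshold, and the example positions are illustrative assumptions, not values from this study.

```python
import numpy as np
from itertools import combinations

D_SAFE = 5.0   # hypothetical safety-distance threshold in metres
ALPHA = 0.5    # hypothetical fusion coefficient (tuned experimentally in practice)

def fuse_frame(vision_positions, nonvision_positions, alpha=ALPHA):
    """P_final = alpha * P_vision + (1 - alpha) * P_non_vision, per tracked object."""
    v = np.asarray(vision_positions, dtype=float)
    n = np.asarray(nonvision_positions, dtype=float)
    return alpha * v + (1.0 - alpha) * n

def safety_alarms(positions, d_safe=D_SAFE):
    """Return index pairs whose Euclidean distance d_AB is below the threshold."""
    alarms = []
    for (i, a), (j, b) in combinations(enumerate(positions), 2):
        d_ab = np.linalg.norm(np.asarray(a) - np.asarray(b))
        if d_ab < d_safe:
            alarms.append((i, j, d_ab))
    return alarms

# One illustrative frame with two people and one piece of equipment.
vision = [[0.0, 0.0, 1.7], [3.0, 0.5, 1.6], [3.5, 0.0, 2.0]]
nonvision = [[0.1, 0.0, 1.6], [3.1, 0.4, 1.7], [3.4, 0.1, 2.1]]
tracked = fuse_frame(vision, nonvision)
for i, j, d in safety_alarms(tracked):
    print(f"WARNING: objects {i} and {j} are {d:.2f} m apart (< {D_SAFE} m)")
```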

Experimentation and Result Analysis

Since three-dimensional spatial perception of the power construction site is required, and both the horizontal and vertical fields of view must be taken into account, a laser radar (lidar) is selected in this article; the sensor finally used is the Livox AVIA.

As shown in Figure 1, the AVIA scanning paths do not repeat, so within the field of view the area covered by the laser beams grows over time and the field-of-view coverage improves significantly as scanning progresses. As time accumulates, the density of the point cloud generated by the radar increases, exhibiting an accumulation characteristic. This reduces the probability of missing objects within the field of view and enables more detail to be detected in the scanned area. As the scanning time increases, the AVIA samples nearly all areas within the field of view, supporting fine-grained perception of the power construction site.

Figure 1.

Point cloud characterization

The effectiveness of the spatial positioning of personnel can be seen in Figure 2: the distance measurement in the second frame is normal, without any abrupt changes. This confirms that the multi-sensor information is used effectively for accurate spatial positioning of personnel, providing essential spatial information for the subsequent safety monitoring algorithms.

Figure 2.

Personnel spatial positioning effect

Multimodal Fusion Method of 3D Point Cloud and Visible Light Image Data

The multimodal fusion methods used to integrate 3D point cloud and visible light image data are detailed in this chapter. The aim is to leverage the complementary advantages of both data types for improved recognition accuracy in complex ground scenes, allowing for both detailed spatial analysis and accurate behavior recognition.

Data Alignment and Preprocessing

Data alignment and preprocessing are essential for ensuring the accuracy and coherence of the multimodal fusion process, particularly when combining data from different sensors.

Spatial Coordinate Alignment: To align the 3D point cloud data with the 2D image data, spatial transformation matrices are employed. Given a point $P_{3D} = (x, y, z)$ in the 3D point cloud and its corresponding 2D point $P_{2D} = (u, v)$ in the image, a projection matrix $M$ is used to transform 3D coordinates to 2D coordinates: $P_{2D} = M \cdot P_{3D}$, where $M$ incorporates the rotation and translation parameters calibrated from the relative positions of the sensors.
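A minimal sketch of this projection step is given below, assuming a pinhole camera model with hypothetical intrinsic and extrinsic calibration values; in practice these parameters come from the joint calibration of the two sensors.

```python
import numpy as np

# Hypothetical pinhole intrinsics K and lidar-to-camera extrinsics [R | t];
# real values are obtained from sensor calibration, not assumed here.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)
t = np.array([[0.0], [0.0], [0.1]])

# 3x4 projection matrix M = K [R | t].
M = K @ np.hstack([R, t])

def project_to_image(points_3d):
    """Map 3D lidar points (N, 3) to pixel coordinates (N, 2) via P_2D = M * P_3D."""
    pts = np.asarray(points_3d, dtype=float)
    homo = np.hstack([pts, np.ones((pts.shape[0], 1))])   # homogeneous coordinates
    uvw = (M @ homo.T).T
    return uvw[:, :2] / uvw[:, 2:3]                       # divide by depth

print(project_to_image([[1.0, 0.5, 10.0]]))
```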

Temporal Synchronization: Since different sensors may operate at different frequencies, temporal synchronization is achieved by matching timestamps or using interpolation methods to align data frames. This ensures that the point cloud and image data represent the same moment in time, enabling accurate fusion.

Point cloud data often contain noise and outliers, especially in outdoor environments. Statistical outlier removal and voxel grid filtering are applied to eliminate noisy points and reduce the size of the point cloud while preserving its key structural features. To ensure consistent scaling across data sources, scale normalization is performed on the point cloud and image data. The scale normalization factor $S$ is calculated as:

$$S = \frac{\text{Reference Scale}}{\text{Actual Scale}}$$

Where: "Reference Scale" is a predetermined value based on the scene dimensions, and "Actual Scale" is the measured scale of the current data.
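The preprocessing steps described above can be sketched as follows; the voxel size, the neighbourhood size, and the outlier criterion are illustrative choices, and the brute-force neighbour search is used only for clarity.

```python
import numpy as np

def voxel_downsample(points, voxel_size=0.1):
    """Keep one representative point (the centroid) per occupied voxel."""
    points = np.asarray(points, dtype=float)
    idx = np.floor(points / voxel_size).astype(int)
    _, inverse = np.unique(idx, axis=0, return_inverse=True)
    counts = np.bincount(inverse)
    out = np.zeros((counts.size, 3))
    for d in range(3):
        out[:, d] = np.bincount(inverse, weights=points[:, d]) / counts
    return out

def remove_statistical_outliers(points, k=8, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbours is more
    than std_ratio standard deviations above the global mean (brute force)."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    knn_mean = np.sort(dists, axis=1)[:, 1:k + 1].mean(axis=1)
    keep = knn_mean < knn_mean.mean() + std_ratio * knn_mean.std()
    return points[keep]

def scale_normalize(points, reference_scale, actual_scale):
    """Apply S = reference_scale / actual_scale to bring data to a common scale."""
    return np.asarray(points, dtype=float) * (reference_scale / actual_scale)

# Illustrative usage on a synthetic cloud.
cloud = np.random.default_rng(0).uniform(0, 10, size=(500, 3))
clean = remove_statistical_outliers(voxel_downsample(cloud, voxel_size=0.5))
print(clean.shape)
```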

Feature-Level Fusion Method

In feature-level fusion, distinct features are extracted from both point cloud and image data and then combined to form a richer feature representation for behavior recognition.

Structural and spatial features are extracted from the point cloud using techniques such as voxelization and down-sampling. In practice, 3D convolutional neural networks (3D CNNs) are often used to capture features from the voxelized data, representing spatial details. For the image data, conventional 2D CNNs are used to extract texture, color, and shape information. The extracted image features $F_{\text{img}}$ and point cloud features $F_{\text{pc}}$ can then be represented as:

$$F_{\text{img}} = \text{CNN}_{2D}(\text{Image}), \qquad F_{\text{pc}} = \text{CNN}_{3D}(\text{PointCloud})$$

To improve the model's ability to focus on significant regions, a cross-modal attention mechanism is employed. This mechanism assigns weights to different regions of the feature maps according to their importance. If $\alpha$ and $\beta$ denote the attention weights for the point cloud and image data, respectively, the weighted fusion is defined as:

$$F_{\text{fused}} = \alpha F_{\text{pc}} + \beta F_{\text{img}}, \qquad \alpha + \beta = 1$$
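A minimal PyTorch sketch of such a weighted cross-modal fusion is shown below; the feature dimension and the way the two attention scores are produced (a single linear layer followed by a softmax, which enforces α + β = 1) are illustrative assumptions rather than the exact architecture used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Learn scalar attention weights (alpha, beta) with alpha + beta = 1 and
    fuse point-cloud and image feature vectors of a common dimension."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.score = nn.Linear(2 * feat_dim, 2)  # one score per modality (illustrative)

    def forward(self, f_pc, f_img):
        weights = F.softmax(self.score(torch.cat([f_pc, f_img], dim=-1)), dim=-1)
        alpha, beta = weights[..., 0:1], weights[..., 1:2]
        return alpha * f_pc + beta * f_img       # F_fused = alpha*F_pc + beta*F_img

fusion = CrossModalFusion(feat_dim=256)
fused = fusion(torch.randn(4, 256), torch.randn(4, 256))  # batch of 4 samples
print(fused.shape)  # torch.Size([4, 256])
```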

The planar fitting of points with smaller height values in the point cloud is accomplished using the traditional Random Sample Consensus (RANSAC) algorithm. During ground fitting, the point cloud is first divided into regions, and then the lower points in each region along the height direction are selected for fitting. The results are shown in Figure 3 and Figure 4.
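A compact NumPy sketch of RANSAC ground fitting on the lower points of the cloud is given below; the height quantile, iteration count, and distance threshold are illustrative values.

```python
import numpy as np

def ransac_ground_plane(points, height_quantile=0.3, iters=200,
                        dist_thresh=0.05, seed=0):
    """Fit a ground plane with RANSAC using only the lower points of the cloud.

    In practice the cloud is first split into regions and the lower points of
    each region are used; here the whole cloud is treated as one region."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    low = points[points[:, 2] <= np.quantile(points[:, 2], height_quantile)]
    best_inliers, best_model = None, None
    for _ in range(iters):
        p1, p2, p3 = low[rng.choice(len(low), 3, replace=False)]
        normal = np.cross(p2 - p1, p3 - p1)
        norm = np.linalg.norm(normal)
        if norm < 1e-8:            # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal @ p1
        dist = np.abs(low @ normal + d)        # point-to-plane distances
        inliers = dist < dist_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (normal, d)
    return best_model, low[best_inliers]

# Illustrative usage on a synthetic, nearly flat cloud.
xy = np.random.default_rng(1).uniform(0, 10, size=(300, 2))
cloud = np.column_stack([xy, np.random.default_rng(2).normal(0, 0.02, 300)])
model, ground_points = ransac_ground_plane(cloud)
print(len(ground_points), "fitted ground points")
```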

Figure 3.

Point cloud ground extraction

Figure 4.

Ground fitting effect (blue dots are fitted ground points)

Behavior recognition model based on deep neural network

The behavior recognition model in this study combines convolutional neural networks (CNNs) and 3D convolutional networks to take full advantage of the features of the different modalities, improving the accuracy and robustness of recognition. The combination of the two networks is explained in detail below.

Convolutional Neural Network (CNN)

The Convolutional Neural Network (CNN) is primarily used to process static image data, extracting spatial features through multiple convolutional and pooling layers. Its basic structure typically includes the following components:

Convolutional Layer: Extracts features from the input image through convolution operations. Let the input be $X$ and the convolution kernel be $W$; the convolution operation can be expressed as $Z = X * W + b$, where $Z$ is the result of the convolution, $b$ is the bias term, and $*$ denotes the convolution operation.

Activation Function: The ReLU (Rectified Linear Unit) function is commonly used to introduce non-linearity: $A = \max(0, Z)$.

Pooling Layer: Reduces the dimensionality of the features through down-sampling, lowering computational complexity. Max pooling is commonly used: $P = \max_{i \in \text{window}} A[i]$.

Fully Connected Layer: Maps the extracted features to the final classification output: $Y = f(W_f P + b_f)$, where $W_f$ is the weight matrix of the fully connected layer, $b_f$ is the bias term, and $f$ is the activation function.
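The listed components can be assembled into a small image branch as in the following PyTorch sketch; the channel sizes and the assumed 3×64×64 input resolution are illustrative.

```python
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    """Convolution -> ReLU -> max pooling -> fully connected, mirroring the
    components listed above (channel sizes and input size are illustrative)."""

    def __init__(self, num_features=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # Z = X * W + b
            nn.ReLU(),                                   # A = max(0, Z)
            nn.MaxPool2d(2),                             # P = max over window
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(32 * 16 * 16, num_features)  # Y = f(W_f P + b_f)

    def forward(self, x):
        return torch.relu(self.fc(self.features(x).flatten(1)))

print(ImageBranch()(torch.randn(2, 3, 64, 64)).shape)  # torch.Size([2, 128])
```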

3D Convolutional Network

3D convolutional networks are suitable for processing temporal data and can extract spatial and temporal features simultaneously. The basic structure includes:

3D Convolutional Layer: Similar to 2D convolution, but the convolution is performed over three dimensions simultaneously. Let the input be $X \in \mathbb{R}^{C \times D \times H \times W}$ and the convolution kernel be $K \in \mathbb{R}^{C \times D_k \times H_k \times W_k}$; the convolution operation can be expressed as:

$$Z[t, h, w] = \sum_{c=1}^{C}\sum_{i=1}^{D_k}\sum_{j=1}^{H_k}\sum_{k=1}^{W_k} X[c, t+i, h+j, w+k]\, K[c, i, j, k]$$

3D Pooling Layer: Similar to 2D max pooling, it down-samples across all three dimensions to reduce data dimensionality. 3D max pooling can be expressed as:

$$P[t, h, w] = \max_{i, j, k \,\in\, \text{window}} Z[t+i, h+j, w+k]$$
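To make the notation concrete, the following NumPy sketch transcribes the 3D convolution and 3D max-pooling formulas directly (single output channel, no padding or stride); it is a didactic implementation, not the optimized layers used in practice.

```python
import numpy as np

def conv3d_single(X, K):
    """Z[t,h,w] = sum_c sum_i sum_j sum_k X[c, t+i, h+j, w+k] * K[c, i, j, k],
    a direct (unoptimised) transcription of the 3D convolution above."""
    C, D, H, W = X.shape
    _, Dk, Hk, Wk = K.shape
    Z = np.zeros((D - Dk + 1, H - Hk + 1, W - Wk + 1))
    for t in range(Z.shape[0]):
        for h in range(Z.shape[1]):
            for w in range(Z.shape[2]):
                Z[t, h, w] = np.sum(X[:, t:t + Dk, h:h + Hk, w:w + Wk] * K)
    return Z

def maxpool3d(Z, s=2):
    """P[t,h,w] = max over an s x s x s window of Z (non-overlapping windows)."""
    D, H, W = (dim // s for dim in Z.shape)
    return Z[:D * s, :H * s, :W * s].reshape(D, s, H, s, W, s).max(axis=(1, 3, 5))

X = np.random.rand(2, 8, 8, 8)       # C x D x H x W input clip (illustrative)
K = np.random.rand(2, 3, 3, 3)       # C x Dk x Hk x Wk kernel (illustrative)
print(maxpool3d(conv3d_single(X, K)).shape)   # (3, 3, 3)
```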

Combination of CNN and 3D Convolutional Networks

In this model, the combination of CNN and 3D Convolutional Networks is primarily achieved through the following means:

First, the CNN is used to extract spatial features from the visible light images; a 3D convolutional network then processes the dynamic features generated from the temporal data.

Next, the features extracted by the CNN and the 3D convolutional network are concatenated or weighted to form a combined feature vector:

$$F_{\text{combined}} = \alpha F_{\text{CNN}} + (1 - \alpha) F_{3D}$$

where $F_{\text{combined}}$ is the combined feature, $F_{\text{CNN}}$ and $F_{3D}$ are the features extracted by the CNN and the 3D convolutional network, respectively, and $\alpha$ is the fusion coefficient.

Finally, a fully connected layer classifies the combined features to produce the behavior recognition output.

Through this structural combination, the model can effectively capture dynamic behaviors in complex scenes, improving the accuracy and robustness of behavior recognition.
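A minimal PyTorch sketch of this combination is given below; the branch architectures, feature dimension, number of classes, and the fixed fusion coefficient α are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BehaviorRecognizer(nn.Module):
    """Combine a 2D CNN image branch and a 3D CNN temporal branch via weighted
    feature fusion F_combined = alpha*F_CNN + (1-alpha)*F_3D, then classify."""

    def __init__(self, feat_dim=128, num_classes=5, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.cnn2d = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(16, feat_dim))
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool3d(1),
            nn.Flatten(), nn.Linear(16, feat_dim))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, image, clip):
        f_cnn = self.cnn2d(image)                      # spatial features
        f_3d = self.cnn3d(clip)                        # spatio-temporal features
        fused = self.alpha * f_cnn + (1 - self.alpha) * f_3d
        return self.classifier(fused)                  # behaviour logits

model = BehaviorRecognizer()
logits = model(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 8, 64, 64))
print(logits.shape)  # torch.Size([2, 5])
```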

Recognition of Personnel Behavior in Different Work Scenarios
Recognition of Personnel Behavior in Hoisting Operations

In hoisting operations, appropriate construction equipment (such as cranes) is used. Therefore, when recognizing dangerous behaviors of personnel in hoisting scenarios, it is essential to consider not only whether the personnel are wearing safety protective equipment but also the potential hazards that the construction equipment poses to them. In hoisting scenarios, the boom and the hoisted load are the two pieces of equipment that pose the greatest risk to workers.

To obtain the spatial position of the boom, this project treats it as a straight line in space, similar to the spatial positioning of personnel. The boom's rotated detection frame and the point cloud data are used to extract the points that fall within the boom frame through intrinsic and extrinsic parameter mapping. The RANSAC (Random Sample Consensus) algorithm is then used to fit a straight line: since two points determine a line, two randomly selected points are used in each iteration to compute the line equation, and the distance from every point to that line is calculated. If the distance is less than a specified limit, the point is counted as an inlier for that iteration.

After multiple iterations, the line from the iteration with the largest number of inliers is taken as the boom line. The highest point in the vertical direction is considered the boom's top vertex, and the lowest point in the vertical direction is considered the boom's base point.
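The boom line fitting can be sketched as follows in NumPy; the iteration count and inlier threshold are illustrative values.

```python
import numpy as np

def ransac_line(points, iters=300, dist_thresh=0.1, seed=0):
    """Fit a 3D line to the boom points: sample two points per iteration, count
    inliers by point-to-line distance, keep the model with the most inliers,
    then take the highest/lowest inliers as the boom top and base."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    best_inliers = None
    for _ in range(iters):
        p1, p2 = points[rng.choice(len(points), 2, replace=False)]
        direction = p2 - p1
        norm = np.linalg.norm(direction)
        if norm < 1e-8:
            continue
        direction /= norm
        # Distance of every point to the line through p1 with this direction.
        dist = np.linalg.norm(np.cross(points - p1, direction), axis=1)
        inliers = dist < dist_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    boom = points[best_inliers]
    top = boom[np.argmax(boom[:, 2])]     # highest point: boom top vertex
    base = boom[np.argmin(boom[:, 2])]    # lowest point: boom base
    return boom, top, base

# Illustrative usage on a synthetic, tilted boom with noise.
t = np.linspace(0, 10, 100)[:, None]
pts = t * np.array([0.5, 0.0, 0.8]) + np.random.default_rng(1).normal(0, 0.02, (100, 3))
_, top, base = ransac_line(pts)
print("top:", top, "base:", base)
```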

The results of the boom line fitting are shown in Figure 5. The boom line can be accurately fitted in space through point cloud and image data fusion and can be projected onto the image using the intrinsic and extrinsic parameters.

Figure 5.

Fitted boom line in point cloud (red points are boom points)

After fitting the boom line, its highest and lowest points can be determined: the highest point is the top of the boom and the lowest point is its base. The boom's radius of rotation is then calculated by projecting these two points onto the ground.

In this paper, the rendered point cloud of the crane site is processed using a point cloud voxelization scheme: the hoisting scene is divided into voxels, and the labels can be determined once rendering is complete, allowing a training set to be created for model training. Hoisting scene data collected from the electric power construction site are used for testing, and the results are shown in Figure 6. During tower assembly, the crane lifts components for installation, and the results show that the segmentation model can accurately segment the hoisted objects and that the bounding box of the hoisted-load point cloud can be visualized on the image.

Figure 6.

Point cloud segmentation results

In on-site use, the point cloud within 20 meters of the boom center is sent to the point cloud segmentation model to segment the hanging load. Once the load's point cloud is obtained, three behaviors can be identified: standing under the load, standing on the load, and holding the load by hand. When the load is raised, the longest distance from the center of the load point cloud to a vertex of its bounding box is projected onto the ground, and a circle with this radius is drawn around the ground projection of the load center. If the person's ground projection falls within this circle and the person's height is less than the height of the lowest point of the load, the person is judged to be standing under the load; if the projection falls within the circle and the person's height is greater than the height of the lowest point of the load, the person is judged to be standing on the load; and if the person is less than 0.5 meters from the load in space, the person is considered too close, with a risk of holding the load by hand.
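The three load-related checks can be sketched as follows; apart from the 0.5 m proximity threshold taken from the text, the details (bounding-box handling and the single point used to represent the person) are illustrative simplifications.

```python
import numpy as np

def load_hazards(load_points, person_position, person_height, min_load_z,
                 hold_dist=0.5):
    """Flag 'under the load', 'on the load' and 'holding the load' risks.

    load_points: segmented point cloud of the hanging load (N, 3);
    person_position: fused 3D position of the person (illustrative single point);
    the 0.5 m proximity threshold follows the text."""
    load_points = np.asarray(load_points, dtype=float)
    person_position = np.asarray(person_position, dtype=float)
    center = load_points.mean(axis=0)
    xmin, ymin = load_points[:, :2].min(axis=0)
    xmax, ymax = load_points[:, :2].max(axis=0)
    corners = np.array([[xmin, ymin], [xmin, ymax], [xmax, ymin], [xmax, ymax]])
    # Radius: farthest horizontal distance from the load centre to a box vertex.
    radius = np.max(np.linalg.norm(corners - center[:2], axis=1))
    in_circle = np.linalg.norm(person_position[:2] - center[:2]) <= radius

    alerts = []
    if in_circle and person_height < min_load_z:
        alerts.append("person standing under the hanging object")
    if in_circle and person_height >= min_load_z:
        alerts.append("person standing on the hanging object")
    if np.min(np.linalg.norm(load_points - person_position, axis=1)) < hold_dist:
        alerts.append("person too close: risk of holding the hanging object")
    return alerts

# Illustrative usage with a synthetic raised load and a person below it.
load = np.random.default_rng(0).uniform(-1, 1, (100, 3)) + [5.0, 5.0, 4.0]
print(load_hazards(load, person_position=[5.2, 5.1, 0.0],
                   person_height=1.75, min_load_z=load[:, 2].min()))
```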

In addition, to detect possible skewed or slanted lifting during hoisting, this project uses the rope point cloud output by the segmentation model: the point cloud of the unbranched section of the rope is selected, the straight-line direction of the rope is found through singular value decomposition, and the angle between this direction and the ground normal vector is calculated. If the angle exceeds the set threshold, skewed or slanted lifting is considered to have occurred.
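A small NumPy sketch of the rope-direction check is given below; the angle threshold and the synthetic rope used for the demonstration are illustrative.

```python
import numpy as np

def skewed_lift_angle(rope_points, ground_normal=np.array([0.0, 0.0, 1.0])):
    """Principal direction of the (unbranched) rope point cloud via SVD and its
    angle to the ground normal, in degrees."""
    rope_points = np.asarray(rope_points, dtype=float)
    centered = rope_points - rope_points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]                                   # dominant direction
    cos_angle = abs(direction @ ground_normal) / np.linalg.norm(ground_normal)
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

ANGLE_THRESHOLD = 10.0   # hypothetical threshold in degrees

# Synthetic rope tilted about 15 degrees from vertical, for illustration only.
t = np.linspace(0, 3, 200)[:, None]
rope = t * np.array([np.sin(np.radians(15)), 0.0, np.cos(np.radians(15))])
rope += np.random.default_rng(0).normal(scale=0.01, size=rope.shape)
if skewed_lift_angle(rope) > ANGLE_THRESHOLD:
    print("WARNING: skewed or slanted lifting detected")
```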

Personnel behavior identification in foundation pit operation scenes

Excavators are mainly used in foundation pit operation scenes. Similar to hoisting operations, when identifying dangerous behaviors of personnel in foundation pit operations, the potential dangers caused by construction equipment to personnel must also be considered.

According to the safe operation specifications, it is dangerous for personnel to be within the rotation radius of the digging arm during foundation pit operations. To assess whether personnel behavior is safe in foundation pit operations, the spatial relationship between personnel and the digging arm must be determined. As with the straight-line fitting of the crane boom, the straight-line fitting of the digging arm in this project uses a combined 3D-2D method. The difference is that the crane boom has only one rotated detection frame, whereas the digging arm has more degrees of freedom and therefore multiple rotated frames; the digging arm frames must be matched to determine the start and end points of the arm.

As shown in Figure 7, when performing arm detection, the excavator is first detected in the whole image, and arm detection is then performed within the detected excavator. The entire arm is divided into three parts: the base section corresponds to frame 0, the middle section to frame 1, and the section close to the bucket to frame 2. As the figure shows, arm detection does not always return exactly these three frames; because of viewing angle or occlusion, more or fewer than three frames may be detected. The arm frames therefore need to be matched so that the frames belonging to the same arm are obtained. First, it is determined whether frame 1 exists. If frame 1 exists together with frame 0 or frame 2 (or both), the subsequent judgments continue. In this scene, two frame-2 detections are found, but only the left frame 2 forms a complete arm with frames 0 and 1, so the two frame-2 candidates must be screened to find the one that matches frames 0 and 1. In this project, 2D rotated detection frames are combined with 3D point cloud information to complete the matching of the arm.

Figure 7.

Digging arm rotating frame

Specifically, after confirming that frame 1 exists, the spatial position of frame 1 is calculated using the distance-measurement principle used for personnel positioning. Because the excavator arm is a single connected arm, adjacent arm frames have an overlap area; the center point of this overlap area can therefore be calculated and back-projected into space using the intrinsic and extrinsic parameters to form a ray. The distance from the points in each candidate frame 2 to this ray is then calculated, and the minimum value is taken. If the minimum value is within the set threshold, the frame 2 corresponding to that minimum is considered the matched frame 2. The matching frame 0 can be determined in the same way. After the correct excavator arm frames have been determined, the point cloud inside them can be extracted to perform straight-line fitting of the excavator arm. Taking the figure above as an example, there are three arm frames (0, 1, and 2). Because the excavator arm is curved in space and cannot be fitted directly with a straight line, the arm point cloud is first projected onto the ground and the straight-line fitting is then performed.
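The frame-matching step can be sketched as follows: the overlap-region centre is back-projected to a ray, and the candidate frame whose points pass closest to that ray is selected. The calibration values and the distance threshold are hypothetical.

```python
import numpy as np

def backproject_ray(pixel, K, R, t):
    """Ray in lidar/world coordinates through a pixel, given camera intrinsics K
    and extrinsics [R | t] (hypothetical calibration values supplied by caller)."""
    uv1 = np.array([pixel[0], pixel[1], 1.0])
    dir_cam = np.linalg.inv(K) @ uv1                 # direction in camera frame
    origin = -R.T @ np.asarray(t, dtype=float)       # camera centre in world frame
    direction = R.T @ dir_cam
    return origin, direction / np.linalg.norm(direction)

def match_frame(candidate_point_sets, overlap_pixel, K, R, t, max_dist=0.5):
    """Pick the candidate frame whose points come closest to the back-projected
    ray through the overlap-region centre; reject if above the threshold."""
    origin, direction = backproject_ray(overlap_pixel, K, R, t)
    best_idx, best_dist = None, np.inf
    for idx, pts in enumerate(candidate_point_sets):
        pts = np.asarray(pts, dtype=float)
        # Point-to-ray distance: || (p - origin) x direction ||.
        d = np.linalg.norm(np.cross(pts - origin, direction), axis=1).min()
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx if best_dist < max_dist else None
```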

After fitting the digging arm line, the points of the line close to the base and far from the base can be identified: the point near the base is taken as the starting point and the point far from the base as the end point. These two points are then projected onto the ground to calculate the rotation radius of the digging arm.

First, the spatial position of the person obtained from the personnel distance measurement is projected onto the ground, and the distance between the person's projection point and the ground projection of the arm's base point is determined. If this distance is less than the rotation radius of the digging arm, the person is judged to be within the rotation radius, which is a dangerous behavior. Figure 8 shows a schematic diagram of the alarm for individuals within the rotation radius of the excavator arm. The rotation radius is determined from the straight-line fitting of the excavator arm, the person's position in the construction scene is determined by spatial positioning, and it can then be judged whether the person is within the rotation radius of the excavator arm so that an alarm is issued in time.

Figure 8.

Digging arm radius

Experimental analysis

This section first verifies the personnel positioning results, which are divided into spatial positioning based on the laser radar and image positioning based on the camera. For personnel spatial positioning, this project measures the distance between two people in an open field: standard measuring equipment is used to measure the distance first, and the personnel distance-measurement method proposed in this article is then applied.

The distance between the two people obtained using the personnel spatial positioning technique in this article is 32.67 meters, while the measurement from the field equipment is 32.62 meters, giving an error of 5 centimeters (see Figure 9). This error is within the acceptable range for actual use and does not affect the algorithm or the final result of personnel behavior identification. For personnel image positioning, experiments are conducted on people 50 meters away.

Figure 9.

Personnel spatial positioning

In order to verify the feasibility of the proposed scheme, the Weizmann dataset is used as the sample base for model training, and five behaviors (bowing down, stooping, pendulum, striding, and reaching out) are used as the training and test actions for power infrastructure construction.

The experiment is divided into two parts: (1) the algorithm is analyzed to determine its accuracy in identifying the five basic actions; (2) it is compared against a control group using a support vector machine recognition algorithm without the ensemble learning algorithm. Both the experimental group and the control group use the same training data and hardware configuration. Table 1 shows the accuracy for the five basic actions.

Table 1. Recognition accuracy of the basic actions

Action       Recognition accuracy (%)
Bow down     71.36
Stooping     74.82
Pendulum     71.53
Stride       74.13
Reach out    73.93
Average      73.154

According to the table, bowing down has the lowest accuracy (71.36%), while stooping has the highest at 74.82%; the average recognition accuracy over the five actions is 73.154%. The difference arises because the actions involve different numbers of body parts: in bowing down only the head changes position, and only slightly, whereas in stooping the head, chest, arms, and waist all change considerably, so its recognition accuracy is higher.

In the comparison experiment, the root mean square error (RMSE) is used as the evaluation criterion. The RMSE comparison is shown in Figure 10. The RMSE of both the experimental group and the control group continues to decline after 50 iterations, but the change is slow, indicating that the model has converged, with an RMSE value of approximately 0.0038.

Figure 10.

The comparison of RMSE

Conclusion

This study proposes a human behavior recognition method based on the fusion of 3D point cloud and visible light image data, aiming to solve the problem of insufficient recognition accuracy of traditional single data sources in complex ground scenes. Through the fusion of multi-modal data, this method can effectively utilize the spatial information provided by 3D point clouds and the texture features of visible light images to enhance the analysis and recognition capabilities of human behavior.

Experimental results show that, in complex scenarios, the proposed method is significantly superior to methods relying on a single data source in terms of both the accuracy and the robustness of behavior recognition. This demonstrates that, in complex environments, the effective fusion of 3D point clouds and visible light images can overcome the limitations of traditional methods and provide more refined and reliable support for behavior recognition.

This study presents an innovative solution for safety monitoring and risk prevention, highlighting the significance of multimodal data fusion in practical applications. Future research can examine behavior recognition in more complex scenarios, enhance the model's generalization and practicality, and provide more comprehensive technical support for safety management.
