
3D Occupancy Network Modelling for Multi-view Image Fusion Techniques in Autonomous Driving

Introduction

Nowadays, autonomous driving technology is a hot topic in the automotive industry and in science and technology, and it is one of the important development directions for future transportation. Autonomous driving, also known as driverless driving, is a technology that uses computers, sensors and artificial intelligence to enable vehicles to travel autonomously without human control. It allows a vehicle to understand its surroundings through sensors, maps, and other data sources, and to use this information to make decisions, including how to accelerate, brake, turn, and avoid obstacles [1]. The gradual application of autonomous driving technology can effectively reduce traffic accidents, improve transportation efficiency, and enhance the driving and travelling experience [2].

In the field of autonomous driving, high-quality 3D reconstruction is important for achieving more accurate scene understanding and interaction. According to the way data are acquired, 3D reconstruction techniques can be divided into two categories: active and passive [3]. Active 3D reconstruction techniques mainly use a transmitter to emit a specific signal towards the surface of an object; a receiver then receives the reflected signal, and the 3D coordinates of the object surface are calculated by measuring information such as the propagation time or phase of the signal [4-7]. Passive 3D reconstruction techniques calculate the 3D coordinates of the object surface by capturing the light naturally reflected from it and using image processing and computer vision techniques [8-9]. However, active 3D reconstruction methods suffer from higher costs, increased system complexity, restricted operating range, security and privacy concerns, and insufficient real-time performance, while traditional 3D reconstruction methods are often limited by high computational complexity, low reconstruction accuracy, and sensitivity to scene and lighting conditions [10-14]. 3D reconstruction based on multi-view images requires only common visual sensors to conveniently acquire image information and therefore has great development potential and research value.

Literature [15] proposes a single-stage multi-view multimodal 3D target detector (MVMM) that fuses image information through a data-level fusion method with point cloud colouring, extracts semantic and geometric information from it, and generates 3D detection results using the captured coloured points and range-view features. Literature [16] proposes a multi-view reprojection structure to estimate the pose of the self-driving vehicle and surrounding objects; it uses a 2D detection box to extend the orientation and dimensions of the 3D bounding box of the detected object and then employs a perspective geometric regression method to obtain the multi-view dimensions, orientations, and reprojection box of the 3D reconstruction layer, enabling accurate recovery of the 3D box even when the 2D box is truncated. Literature [17] combines sensor fusion, hierarchical multi-view networks, and traditional heuristics to build an autonomous vehicle target detection system, efficiently fusing 2D RGB images with 3D point cloud data through a hierarchical multi-view proposal network (HMVPN) and then generating a candidate set of driving proposals through machine learning and heuristics. Literature [18] shows that the recording and linking of multi-view images can be used to reconstruct 3D objects from 2D images, employing a self-organising mapping approach to build a voxel-grid-like structure with powerful 3D reconstruction capability that recovers 3D objects from event-based stereo data without requiring any a priori knowledge of camera parameters. Literature [19] explored an optimisation scheme for a deep-learning-based multi-view stereo (MVS) system, using neural network channel pruning to remove redundant parameters and to compress and accelerate the highly complex 3D reconstruction model; this not only improves the 3D reconstruction rate of the MVS system but also improves the accuracy and integrity of the model. Literature [20] constructed UniScene, a unified multi-camera pre-training framework, which then plays a larger role in 3D target detection and semantic scene completion tasks and addresses the spatial and temporal neglect of multi-camera systems in monocular 2D pre-training, providing important practical value for the realisation of autonomous driving.

The article utilises multi-view image fusion technology to acquire image data, determines the structural parameters of the vehicle, and applies the Ackermann steering principle to construct the vehicle kinematic model. Starting from the definition of a 3D occupancy network, a multi-scale mechanism is formulated to enrich the extracted features, the 3D occupancy network is constructed through Transformer-based feature fusion, and cross-entropy is selected as the loss function of the network. The network is then iteratively trained and optimised on the research data, the experimental environment and network parameters are set up, and, in order to verify the effectiveness of the model, the network is used to carry out experimental simulation and analysis of vehicle detection and path planning.

Three-dimensional occupancy networks for autonomous driving
Multi-view image fusion techniques
Automatic white balance for lens colour deviation

The source image is first transformed from the RGB colour space to the YCbCr colour space, and near-white-point regions consisting of candidate reference white points are defined based on the statistical properties of the colours. These candidate points are selected by dividing the image into 12 regions and calculating, for all pixels within each region, the mean absolute errors $D_b$ and $D_r$ of the blue chromaticity $C_b(i,j)$ and red chromaticity $C_r(i,j)$ components, as shown in the following equations: $D_b = \sum_{i,j}\big(|C_b(i,j) - M_b|\big)/N$, $D_r = \sum_{i,j}\big(|C_r(i,j) - M_r|\big)/N$

where $M_b$ and $M_r$ correspond to the mean values of $C_b(i,j)$ and $C_r(i,j)$, respectively, over the given region, and $N$ is the number of pixels. The near-white-point region consists of the points that satisfy the following inequality constraints: $|C_b(i,j) - (M_b + D_b \times \mathrm{sign}(M_b))| < 1.5 \times D_b$, $|C_r(i,j) - (1.5 \times M_r + D_r \times \mathrm{sign}(M_r))| < 1.5 \times D_r$

Then the averages of the accumulated chromaticity statistics of all near-white-point regions are taken as the final $\bar{M}_b$, $\bar{M}_r$, $\bar{D}_b$ and $\bar{D}_r$. Based on luminance, the top 10% of the candidate point set within the near-white-point region is selected as the final reference white points.

The gain of each channel, $R_{gain}$, $G_{gain}$ and $B_{gain}$, can be calculated by the following formulas: $R_{gain} = Y_{max}/R_{avg}$, $G_{gain} = Y_{max}/G_{avg}$, $B_{gain} = Y_{max}/B_{avg}$

where $R_{avg}$, $G_{avg}$ and $B_{avg}$ are the average values of the R, G and B channels over the reference white points selected above, and $Y_{max}$ is the maximum luminance value of the image pixels. This white balance algorithm has low complexity and is insensitive to the block size, making it well suited for automated operation.
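For concreteness, the following is a minimal NumPy sketch of the white-balance procedure described above; the 4×3 block split, the top-10% luminance selection and the BT.601 colour conversion are illustrative assumptions rather than the authors' exact configuration.

```python
import numpy as np

def auto_white_balance(rgb: np.ndarray) -> np.ndarray:
    """Near-white-point white balance sketch (see the equations above).

    rgb: H x W x 3 image. Returns a gain-corrected float image.
    """
    img = rgb.astype(np.float64)
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    # RGB -> YCbCr (BT.601)
    Y = 0.299 * R + 0.587 * G + 0.114 * B
    Cb = -0.168736 * R - 0.331264 * G + 0.5 * B
    Cr = 0.5 * R - 0.418688 * G - 0.081312 * B

    H, W = Y.shape
    mask = np.zeros((H, W), dtype=bool)
    # Divide the image into 12 regions and collect candidate white points.
    for bi in range(4):
        for bj in range(3):
            ys = slice(bi * H // 4, (bi + 1) * H // 4)
            xs = slice(bj * W // 3, (bj + 1) * W // 3)
            cb, cr = Cb[ys, xs], Cr[ys, xs]
            Mb, Mr = cb.mean(), cr.mean()
            Db = np.abs(cb - Mb).mean()          # mean absolute error of Cb
            Dr = np.abs(cr - Mr).mean()          # mean absolute error of Cr
            if Db < 1e-6 or Dr < 1e-6:           # nearly flat block, skip it
                continue
            near_white = (np.abs(cb - (Mb + Db * np.sign(Mb))) < 1.5 * Db) & \
                         (np.abs(cr - (1.5 * Mr + Dr * np.sign(Mr))) < 1.5 * Dr)
            mask[ys, xs] |= near_white

    ys_idx, xs_idx = np.nonzero(mask)
    if ys_idx.size == 0:                         # no candidates: leave image unchanged
        return img
    # Keep the brightest 10% of the candidates as the reference white points.
    lum = Y[ys_idx, xs_idx]
    keep = lum >= np.percentile(lum, 90)
    ref_y, ref_x = ys_idx[keep], xs_idx[keep]

    Ymax = Y.max()
    gains = np.array([Ymax / R[ref_y, ref_x].mean(),
                      Ymax / G[ref_y, ref_x].mean(),
                      Ymax / B[ref_y, ref_x].mean()])
    return np.clip(img * gains, 0, 255)
```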

Lens distortion correction

The optical principle of the equivalent spherical model method is shown in Fig. 1. P is the ideal pinhole imaging plane, and the imaginary projection surface S is a sphere centred at the optical centre of the lens with the focal length f as its radius. The intersection of the optical axis with the plane P is the distortion centre O, which is also the point of tangency between the spherical surface S and the plane P. For a scene target point B, if the lens were distortion-free, the point A1 projected onto the plane P by the pinhole model would be the undistorted point; the intersection of the extended projection ray with S is A, which corresponds to the pixel A2 on the imaging plane P where the distortion occurs.

Figure 1.

Optical principle of equivalent spherical model method

Assuming OA = OA2, define the distances OA = L1 and OA1 = L2, which correspond to the distances from the distorted imaging point and the ideal imaging point to the distortion centre O, respectively. From the equivalent spherical model, the following relationships are obtained: $\alpha = \arctan(L_2/f)$, $L_1 = 2f\sin(\alpha/2)$

It follows that: $L_2 = f\tan\!\left(2\arcsin\dfrac{L_1}{2f}\right)$

In general, the distortion centre can simply be taken to be the geometric centre of the distorted image. If the width of the image is W and the height is H, and a planar Cartesian coordinate system is established with the upper-left corner of the image as the origin, then the coordinates of point O are $(W/2, H/2)$. The purpose of distortion correction is to restore every distorted point A2 (with coordinates $(\bar{x},\bar{y})$) to its corresponding ideal position A1 (with coordinates $(x,y)$). The whole procedure is as follows:

Calculate the distortion distance L1: $L_1 = \sqrt{(\bar{x} - W/2)^2 + (\bar{y} - H/2)^2}$

Calculate the angle β between OA and the horizontal direction: $\beta = \arctan\left(\dfrac{\bar{x} - W/2}{\bar{y} - H/2}\right)$

Calculate the coordinates of the corrected pixel point A1: $x = W/2 + L_2\cos\beta$, $y = H/2 + L_2\sin\beta$

Substituting Eqs. (6), (7), and (8) into Eq. (9) gives the spatial coordinate mapping relationship: $x = W/2 + f\tan\!\left(2\arcsin\dfrac{L_1}{2f}\right)\cos\!\left(\arctan\dfrac{\bar{x}-W/2}{\bar{y}-H/2}\right)$, $y = H/2 + f\tan\!\left(2\arcsin\dfrac{L_1}{2f}\right)\sin\!\left(\arctan\dfrac{\bar{x}-W/2}{\bar{y}-H/2}\right)$

The correction of the distorted image is then achieved by traversing the pixels of the distorted image, applying the coordinate transformation of Eq. (10), and recovering the grey values with an interpolation algorithm.
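The following is a minimal NumPy sketch of this correction. For resampling convenience it applies the mapping in the reverse direction: for each pixel of the corrected image it computes the corresponding position in the distorted image via $\alpha = \arctan(L_2/f)$ and $L_1 = 2f\sin(\alpha/2)$, then samples the grey value by bilinear interpolation; the focal length f (in pixels) is assumed to be known.

```python
import numpy as np

def correct_distortion(distorted: np.ndarray, f: float) -> np.ndarray:
    """Equivalent-spherical-model correction sketch.

    distorted: H x W grey image; f: focal length in pixels (assumed known).
    """
    H, W = distorted.shape
    cx, cy = W / 2.0, H / 2.0                       # distortion centre O
    xs, ys = np.meshgrid(np.arange(W, dtype=np.float64),
                         np.arange(H, dtype=np.float64))
    dx, dy = xs - cx, ys - cy
    L2 = np.hypot(dx, dy)                           # ideal distance to the centre
    alpha = np.arctan(L2 / f)
    L1 = 2.0 * f * np.sin(alpha / 2.0)              # distorted distance to the centre
    scale = np.divide(L1, L2, out=np.ones_like(L2), where=L2 > 0)
    src_x = cx + dx * scale                         # position in the distorted image
    src_y = cy + dy * scale

    # Bilinear interpolation of the grey values at (src_x, src_y).
    x0 = np.clip(np.floor(src_x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(src_y).astype(int), 0, H - 2)
    wx, wy = src_x - x0, src_y - y0
    img = distorted.astype(np.float64)
    top = img[y0, x0] * (1 - wx) + img[y0, x0 + 1] * wx
    bot = img[y0 + 1, x0] * (1 - wx) + img[y0 + 1, x0 + 1] * wx
    return top * (1 - wy) + bot * wy
```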

Panoramic stitching of multi-view images

The construction of the weighted undirected graph over the overlapping region and of the two search trees is shown in Fig. 2. For the overlapping region B between the neighbouring images If and Ig to be stitched, a weighted undirected graph G(V,E) is created. Each pixel is used as a vertex, and together they form the vertex set V; E is the set of edges formed between neighbouring vertices, and each edge is assigned a weight. In addition, two terminal nodes are added: the super-source S and the super-sink T. The two terminals are connected to all vertices to form new edges, whose weights are set to infinity, thus constructing the two search trees.

Figure 2.

Weighted undirected graph of the overlapping region and the two search trees

Let Lp be the label of point p in the segmentation result image I, indicating the source image to which pixel p belongs; the grey value is then simply $I(p) = I_{L_p}(p)$. The following energy functional is thus constructed: $E(p, L_p) = \sum_{p} D_p(p, L_p) + \sum_{(p,q)\in B} V_{p,q}(p, q, L_p, L_q)$

where the terms V and D denote the smoothing term and the data term, respectively, and Lp, Lq denote the image source labels of pixels p and q, respectively.

The data term $D_p(p, L_p)$ indicates whether pixel p is located in the valid region of the image indicated by Lp; this is necessary because the true overlapping region is the red border in Fig. 2 rather than a regular rectangle. Its mathematical definition is: $D_p(p, L_p) = \begin{cases} +\infty, & p \notin I_{L_p} \\ 0, & p \in I_{L_p} \end{cases}$

The smoothing term $V_{p,q}(p,q,L_p,L_q)$ represents the variability between two neighbouring pixels p and q in the overlapping region and can be measured in several ways. The main metrics are the following:

Colour space, denoted as: $V_{p,q}(p,q,L_p,L_q) = \|I_{L_p}(p) - I_{L_q}(p)\|^2 + \|I_{L_p}(q) - I_{L_q}(q)\|^2$

where $I_{L_p}(p)$ is the grey value of pixel p in image $I_{L_p}$ and $I_{L_q}(p)$ is the grey value of pixel p in image $I_{L_q}$. Here, only the difference between the grey values of pixels in the overlapping region of the two images is considered.

The gradient space, denoted as: $V_{p,q}(p,q,L_p,L_q) = \|\nabla I_{L_p}(p) - \nabla I_{L_q}(p)\|^2 + \|\nabla I_{L_p}(q) - \nabla I_{L_q}(q)\|^2$

where $\nabla I_{L_p}(p)$ is the gradient of point p in the RGB colour domain of image $I_{L_p}$, of which generally only six components need to be considered. The variability measure of the gradient domain can weaken the effect of colour differences caused by exposure differences.

Colour space + gradient space, denoted as: $V_{p,q}(p,q,L_p,L_q) = \alpha V_{\mathrm{colour}} + \beta V_{\mathrm{gradient}}$

where α and β are proportionality constants whose values depend on the relative importance of the colour and gradient domain metrics. Combining the two allows the energy functional to describe the pixel variability between images more accurately.
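The sketch below illustrates, under stated assumptions, how a combined colour + gradient weight could be computed for one edge (p, q) of the graph G(V,E); the finite-difference gradient and the example values of α and β are illustrative choices, and the resulting weights would then feed a standard max-flow/min-cut solver to find the optimal seam.

```python
import numpy as np

def seam_edge_weight(If: np.ndarray, Ig: np.ndarray, p, q,
                     alpha: float = 0.5, beta: float = 0.5) -> float:
    """Combined colour + gradient smoothing term V_{p,q} for one graph edge.

    If, Ig: the two overlapping grey images (float arrays) registered to the
    overlap region B; p, q: neighbouring pixel coordinates as (row, col).
    """
    def grad(img, pt):
        r, c = pt                                   # central differences, clamped at borders
        gy = img[min(r + 1, img.shape[0] - 1), c] - img[max(r - 1, 0), c]
        gx = img[r, min(c + 1, img.shape[1] - 1)] - img[r, max(c - 1, 0)]
        return np.array([gx, gy])

    # Colour-space variability: grey-value differences at both endpoints.
    v_colour = (If[p] - Ig[p]) ** 2 + (If[q] - Ig[q]) ** 2
    # Gradient-space variability: gradient differences at both endpoints.
    v_grad = np.sum((grad(If, p) - grad(Ig, p)) ** 2) + \
             np.sum((grad(If, q) - grad(Ig, q)) ** 2)
    return float(alpha * v_colour + beta * v_grad)
```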

Image Fusion Algorithm (DMS)

Building on the panoramic stitching of multi-view images, and in order to improve the completeness of the image information, the image fusion algorithm processes the differing part and the common part of the two images separately: the differing part is obtained by differencing, which preserves the completeness of the image, while the common part is obtained by multiplication, which improves the signal-to-noise ratio; the two parts are then normalised and summed to obtain the fused image. The DMS method avoids the redundancy of the fused image caused by repeated in-phase superposition and has a clear advantage in suppressing background clutter [21]. Let the two normalised single-view images be denoted $a(x,y)$ and $b(x,y)$, the fused image $f(x,y)$, and let $(x_m, y_n)$ denote the $(m,n)$th pixel, with $p_a(x_m,y_n)$ and $p_b(x_m,y_n)$ the corresponding pixels of the two original images. The steps of the DMS image fusion algorithm are as follows:

Find the identical part and normalise it: $p_{fs}(x_m,y_n) = \dfrac{p_a(x_m,y_n)\,p_b(x_m,y_n)}{\max\big(p_a(x,y)\,p_b(x,y)\big)}$

Find the differing part and normalise it: $p_{fd}(x_m,y_n) = \dfrac{|p_a(x_m,y_n) - p_b(x_m,y_n)|}{\max\big(|p_a(x,y) - p_b(x,y)|\big)}$

Sum and normalise: $p_f(x_m,y_n) = \dfrac{p_{fs}(x_m,y_n) + p_{fd}(x_m,y_n)}{\max\big(p_{fs}(x,y) + p_{fd}(x,y)\big)}$

where $p_f(x_m,y_n)$ represents the pixel value of the $(m,n)$th pixel of the fused image, $|\cdot|$ denotes the absolute value, and max takes the maximum over all pixels as the normalisation criterion. To fuse two images with different viewpoints, all pixels of the two images are traversed with the above three-step operation to obtain the common part and the differing part, which are normalised and summed to yield the fused image.
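A minimal sketch of the three-step DMS fusion follows, assuming the two inputs are already registered and normalised to [0, 1]; the small epsilon guarding against division by zero is an implementation detail added here.

```python
import numpy as np

def dms_fusion(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """DMS fusion of two registered single-view images normalised to [0, 1]."""
    eps = 1e-12                                  # guard against division by zero
    same = a * b                                 # common part (multiplication)
    same = same / (same.max() + eps)             # normalised p_fs
    diff = np.abs(a - b)                         # differing part (difference)
    diff = diff / (diff.max() + eps)             # normalised p_fd
    fused = same + diff                          # sum of the two parts
    return fused / (fused.max() + eps)           # normalised p_f
```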

Vehicle kinematics modelling

Building on the multi-view image fusion technique, the outer contour of the vehicle is simplified to a rectangle according to its basic structural parameters, the kinematic state equations of the vehicle are established based on the Ackermann steering principle, and the relationship between the four vertices of the body and the centre point of the rear axle is derived from the vehicle geometry, laying the foundation for the construction of the subsequent three-dimensional occupancy network.

Vehicle structural parameters

The structural parameters of the vehicle are shown in Table 1. The size of the vehicle’s structural parameters has a great influence on parking planning and control, and the complex and varied outer contour of the vehicle can be simplified due to the low vehicle speed during the parking process, which produces very little pitch and roll. Taking the maximum length and maximum width of the vehicle as the length and width of the rectangle and replacing the shape of the vehicle’s outer contour with a rectangle, this simplified treatment not only facilitates the subsequent analysis of vehicle kinematics but also makes intelligent driving safer.

Vehicle structure parameters

Vehicle parameter Symbol Unit Value
Vehicle width w m 0.99
Wheelbase l m 1.00
Front overhang lf m 0.38
Rear overhang lr m 0.49
Minimum turning radius Rg m 2.45
Maximum front wheel steering angle δmax rad 0.5236
Ackermann steering principle

Ackermann steering geometry is the theoretical basis for establishing the vehicle kinematics model, and its prerequisites are that the front wheel positioning angle of the vehicle is zero and that no wheel slips during steering [22]. The equivalent front wheel angle is shown schematically in Figure 3. When an equal-crank four-bar linkage is used, the steering angle of the inner wheel is 2°~4° larger than that of the outer wheel, so that during steering the circular trajectories of the four wheels share a common centre at a point O on the extension line of the rear axle; point O is the instantaneous steering centre of the four wheels.

Figure 3.

Schematic diagram of equivalent front wheel angle

The front and rear track widths of a vehicle are usually equal or differ only slightly, and the geometric relationship between the inner wheel angle α and the outer wheel angle β of the front axle is: $\cot\alpha = \dfrac{l_w}{l} + \cot\beta$

where $l$ is the wheelbase of the vehicle and $l_w$ is the track width.

According to the bicycle model, the trajectories of the front and rear wheels of the vehicle can be represented by two equivalent wheels located at the centres of the front and rear axles, where the equivalent steering angle is δ, and the simplified relationship between the front wheel angles is: $\cot\alpha + \cot\beta = 2\cot\delta$
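As a small illustration, the sketch below solves the two relations above for the inner and outer wheel angles given the equivalent angle δ; the closed-form expressions $\cot\alpha = \cot\delta + l_w/(2l)$ and $\cot\beta = \cot\delta - l_w/(2l)$ follow directly from adding and subtracting the two equations, and the example values only approximate Table 1.

```python
import math

def ackermann_angles(delta: float, l: float, lw: float):
    """Inner and outer front wheel angles from the equivalent angle delta.

    delta: equivalent front wheel angle (rad), l: wheelbase, lw: track width.
    """
    cot_delta = 1.0 / math.tan(delta)
    alpha = math.atan(1.0 / (cot_delta + lw / (2.0 * l)))   # inner wheel angle
    beta = math.atan(1.0 / (cot_delta - lw / (2.0 * l)))    # outer wheel angle
    return alpha, beta

# Example with Table-1-like values (wheelbase 1.00 m, track width ~0.99 m).
alpha, beta = ackermann_angles(delta=0.3, l=1.00, lw=0.99)
```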

Vehicle kinematics modelling

The vehicle kinematics model studies the relationship between vehicle position, speed and time from a geometric point of view. Based on Ackermann's steering principle, the kinematic model is established with the centre point of the rear axle as the origin, as shown in Fig. 4. The following basic assumptions are made:

Tyre side-slip is neglected; the tyres roll on the ground without slipping.

External loads on the vehicle are neglected; the vehicle body and the suspension system are treated as rigid.

Consider only the vehicle moving on the horizontal plane and ignore the motion perpendicular to the direction of the road surface.

Based on Ackermann’s steering principle, the motion states of the vehicle’s left and right tyres are described equivalently as one tyre.

Figure 4.

Kinematics model of vehicle

In the inertial coordinate system, i.e. the global coordinate system, the coordinates of the centre points of the front and rear axles of the vehicle are denoted by $(x_f, y_f)$ and $(x_r, y_r)$, the velocities of the centre points of the front and rear axles are denoted by $v_f$ and $v_r$, and the heading angle of the vehicle is denoted by θ.

Firstly, the differential equation is established at the centre point of the rear axle. Since the vehicle kinematics model considers only the longitudinal motion of the front and rear wheels and neglects lateral side-slip, the following constraint is obtained: $\dot{x}_r\sin\theta - \dot{y}_r\cos\theta = 0$

Then the following relationship holds between the coordinates of the centre points of the front and rear axles: $x_r = x_f - l\cos\theta,\quad y_r = y_f - l\sin\theta$

Differentiating Eq. (22) with respect to time gives: $\dot{x}_r = \dot{x}_f + l\dot{\theta}\sin\theta,\quad \dot{y}_r = \dot{y}_f - l\dot{\theta}\cos\theta$

Substituting Eq. (23) into Eq. (21) gives: $\dot{y}_f\cos\theta - \dot{x}_f\sin\theta = \dot{\theta}\,l$

The differential equation is then established at the centre of the front axle; differentiating the lateral and longitudinal displacements of the front axle centre with respect to time gives the velocity components: $\dot{x}_f = v_f\cos(\theta + \delta),\quad \dot{y}_f = v_f\sin(\theta + \delta)$

Substituting Eq. (25) into Eq. (24) and applying the sine angle-sum formula gives: $\dot{\theta} = \dfrac{v_f\sin\delta}{l}$

Integrating Eq. (26) over time gives: $\theta = \dfrac{v_f t\sin\delta}{l}$

Substituting Eqs. (25) and (26) into Eq. (23) gives: $\dot{x}_r = v_f\cos\delta\cos\theta,\quad \dot{y}_r = v_f\cos\delta\sin\theta$

The velocities of the centre points of the front and rear axles satisfy: $v_r = v_f\cos\delta$

Taking the velocity of the rear axle centre point as the velocity of the whole vehicle and combining Eqs. (27)-(29), the vehicle kinematics model is obtained: $\dot{x} = v_r\cos\theta,\quad \dot{y} = v_r\sin\theta,\quad \dot{\theta} = \dfrac{v_r\tan\delta}{l}$

Integrating x and y in Eq. (30) over time, the coordinates of the centre point of the rear axle at time t are: $x_r(t) = \int_0^t v_r\cos\!\left(\dfrac{v_r t\tan\delta}{l}\right)dt = \dfrac{l\sin\!\left(\frac{v_r t\tan\delta}{l}\right)}{\tan\delta},\quad y_r(t) = \int_0^t v_r\sin\!\left(\dfrac{v_r t\tan\delta}{l}\right)dt = \dfrac{l}{\tan\delta} - \dfrac{l\cos\!\left(\frac{v_r t\tan\delta}{l}\right)}{\tan\delta}$

Squaring and adding both sides of Eq. (31) yields the geometric relationship for the trajectory of the rear axle centre point: $x_r(t)^2 + \big(y_r(t) - l\cot\delta\big)^2 = (l\cot\delta)^2$

From Eq. (32) it can be seen that when the vehicle moves with a fixed front wheel angle, its trajectory is a circle whose radius depends on the vehicle parameters: the wheelbase and the front wheel angle determine the radius, while the travelling speed has no influence on it. This law provides the theoretical basis for the subsequent construction of a three-dimensional occupancy network for autonomous driving.
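The sketch below integrates the kinematic model of Eq. (30) with a simple forward-Euler scheme and checks the resulting rear-axle trajectory against the circle of Eq. (32); the wheelbase, steering angle and speed used here are illustrative values, and the tolerance only accounts for the integration error.

```python
import math

def simulate_rear_axle(vr: float, delta: float, l: float, t_end: float, dt: float = 0.001):
    """Forward-Euler integration of the kinematic model of Eq. (30).

    vr: rear-axle speed, delta: fixed front wheel angle (rad), l: wheelbase.
    Returns the rear-axle trajectory starting from x = y = theta = 0.
    """
    x, y, theta = 0.0, 0.0, 0.0
    traj = [(x, y)]
    for _ in range(int(t_end / dt)):
        x += vr * math.cos(theta) * dt
        y += vr * math.sin(theta) * dt
        theta += vr * math.tan(delta) / l * dt
        traj.append((x, y))
    return traj

# Check against the closed-form circular trajectory of Eq. (32).
l, delta = 1.00, 0.3                       # illustrative wheelbase and steering angle
radius = l / math.tan(delta)
x_end, y_end = simulate_rear_axle(vr=1.0, delta=delta, l=l, t_end=3.0)[-1]
assert abs(x_end ** 2 + (y_end - radius) ** 2 - radius ** 2) < 1e-2
```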

Construction of a three-dimensional occupancy network
Definitions

A 3D occupancy network is a deep-learning-based 3D reconstruction method that represents 3D surfaces as the continuous decision boundary of a deep neural network classifier [23]. Such a representation can encode high-resolution 3D outputs without excessive memory consumption, encodes 3D structure efficiently, and can infer models from different kinds of inputs, which makes it widely applicable to 3D target detection.

Multi-scale mechanisms

In the construction of 3D occupancy networks for multi-view image fusion in autonomous driving scenarios, the diversity of object sizes and occlusion levels poses a serious challenge; the aim is therefore to capture richer detail information. In this paper, the outputs of the last three convolutional layers of the backbone network (VoxelNet) for LiDAR data are selected as inputs, and these outputs are downsampled by 2x, 4x, and 8x, respectively, to obtain 3D voxel features at different scales. These 3D voxel features are then projected into 2D multi-scale BEV feature maps of dimensions $(H/2)\times(W/2)\times d$, $(H/4)\times(W/4)\times d$ and $(H/8)\times(W/8)\times d$, where H, W and d represent the height, width and depth of the original voxel features, respectively. The multi-scale BEV feature maps are denoted by F1, F2 and F3, and the original voxel feature by F0. The projection process can be represented by the following equations: $F_1 = P(D_2(F_0)),\quad F_2 = P(D_4(F_0)),\quad F_3 = P(D_8(F_0))$

where $F_0$ represents the original LiDAR voxel feature, $P(\cdot)$ the projection operation from the 3D voxel features to the 2D BEV feature map, and $D_k(\cdot)$ the k-fold downsampling operation.
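A minimal PyTorch sketch of this multi-scale mechanism is given below; average pooling for $D_k(\cdot)$ and mean-pooling over the height axis for the projection $P(\cdot)$ are illustrative choices, since the paper does not spell out the exact operators.

```python
import torch
import torch.nn.functional as F

def multi_scale_bev(voxel_feat: torch.Tensor):
    """Downsample the voxel features by 2x/4x/8x and project each to a BEV map.

    voxel_feat: (B, d, Z, H, W) LiDAR voxel features F0.
    """
    bev_maps = []
    for k in (2, 4, 8):
        down = F.avg_pool3d(voxel_feat, kernel_size=k, stride=k)   # D_k(F0)
        bev_maps.append(down.mean(dim=2))                          # P(.): (B, d, H/k, W/k)
    return bev_maps                                                # [F1, F2, F3]

# Example: F0 with d = 64 channels on a 16 x 128 x 128 voxel grid.
F1, F2, F3 = multi_scale_bev(torch.randn(1, 64, 16, 128, 128))
assert F1.shape == (1, 64, 64, 64) and F3.shape == (1, 64, 16, 16)
```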

Transformer-based feature fusion

Most existing data-level point cloud image fusion methods rely heavily on calibration matrices to establish point-to-point associations between point clouds and images. However, points in point clouds are much sparser compared to image pixels, which means that fusion methods based on point-to-point associations lose a large number of semantic features in the image and are sensitive to calibration errors. To overcome these limitations, this paper proposes a Transformer-based fusion mechanism that is able to adaptively fuse LiDAR and camera features without relying on point-to-point associations. Specifically, the fusion mechanism in this study uses two layers of Transformer decoders to fuse LiDAR and camera features. These decoder layers learn the correlation between the Object Query and the two modal features and output the 3D bounding box parameters of the object. The structure of the fusion mechanism is shown in Figure 5. On the left is the first Transformer decoder layer, located in the point cloud branch, and on the right is the second Transformer decoder layer, located in the image branch.

Figure 5.

The structure of the fusion mechanism

The first decoder layer takes as input a set of Object Query and frequency-enhanced LiDAR BEV feature maps and outputs an initial bounding box for each Query. The Object Query is a learnable embedding that represents the object to be detected. The frequency-enhanced BEV feature maps are obtained by using the FAFEM module on the LiDAR BEV feature maps. The first decoder layer applies the self-attention module and the cross-attention module to the Object Query and the frequency-enhanced BEV feature map, respectively. The self-attention module allows the Query to interact with each other to capture the global context, while the cross-attention module enables the Query to focus on relevant regions in the frequency-enhanced BEV feature map. The output of the first decoder layer is a set of refined Object Queries encoding frequency, spatial and semantic information from the point cloud. The initial bounding box is then predicted by applying a feed-forward network (FFN).

The second decoder layer takes the refined Object Query and the frequency-enhanced image feature maps as input and outputs the final bounding box for each Query. The frequency-enhanced image feature maps are obtained by using the FAFEM module on the image feature maps. The second decoder layer similarly applies the self-attention module and the cross-attention module to the Object Query and the frequency-enhanced image feature maps, respectively. The self-attention module further filters the Query by combining information from other Queries, while the cross-attention module adaptively fuses the Query with relevant image features using frequency, spatial and contextual relationships. The output of the second decoder layer is a set of Object Query fused with LiDAR and image information. The final bounding box is predicted by applying another FFN.

The self-attention module, cross-attention module and FFN are computed as follows:

Self-attention module: $Q = W_Q X,\quad K = W_K X,\quad V = W_V X,\quad A = \mathrm{Softmax}\!\left(\dfrac{QK^T}{\sqrt{d}}\right),\quad Y = AV$

where X is the input Query set, $W_Q$, $W_K$ and $W_V$ are the learnable projection matrices, d is the dimension of the Query, A is the attention matrix, and Y is the output Query set.

Cross-attention module: $Q' = W_Q' Y,\quad K' = W_K' F',\quad V' = W_V' F',\quad A' = \mathrm{Softmax}\!\left(\dfrac{Q'K'^T}{\sqrt{d'}}\right),\quad Z = A'V'$

where Y is the output Query set of the self-attention module, F′ is the frequency-enhanced feature map from the LiDAR or image modality, $W_Q'$, $W_K'$ and $W_V'$ are the learnable projection matrices, d′ is the dimension of the feature map, A′ is the attention matrix, and Z is the output Query set of the cross-attention module.

FFN: $B = W_2\,\mathrm{ReLU}(W_1 Z + b_1) + b_2$

where Z is the set of output Queries from the cross-attention module, $W_1$, $W_2$, $b_1$ and $b_2$ are the learnable parameters, and B is the final bounding box prediction for each Query.
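The sketch below assembles the two-stage fusion with standard PyTorch building blocks; nn.TransformerDecoderLayer already combines the self-attention, cross-attention and FFN sub-modules described above, and the feature dimension, number of queries and 7-parameter box encoding are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TwoStageFusionDecoder(nn.Module):
    """Two Transformer decoder layers: LiDAR branch first, image branch second."""
    def __init__(self, d: int = 256, num_queries: int = 200, box_dim: int = 7):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d)      # learnable Object Queries
        self.dec_lidar = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.dec_image = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.ffn_init = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, box_dim))
        self.ffn_final = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, box_dim))

    def forward(self, lidar_feat: torch.Tensor, image_feat: torch.Tensor):
        # lidar_feat, image_feat: (B, N_tokens, d) flattened feature maps.
        B = lidar_feat.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.dec_lidar(q, lidar_feat)      # refine queries with LiDAR features
        init_boxes = self.ffn_init(q)          # initial bounding boxes
        q = self.dec_image(q, image_feat)      # fuse refined queries with image features
        final_boxes = self.ffn_final(q)        # final bounding boxes
        return init_boxes, final_boxes

# Example with flattened feature maps of 1024 LiDAR and 2048 image tokens.
model = TwoStageFusionDecoder()
init_b, final_b = model(torch.randn(2, 1024, 256), torch.randn(2, 2048, 256))
assert final_b.shape == (2, 200, 7)
```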

Loss function

There are many types of loss functions, including the classification error rate, the mean square error, and the cross-entropy loss function. The classification error rate divides the number of wrong predictions by the total number of predictions; it is easy to understand but not very informative, since in many cases it cannot distinguish between models. The mean square error is another common loss function, which averages the loss values of all samples, as shown in formula (47). However, in classification problems where sigmoid/softmax is used to obtain probability values, the mean square error loss combined with gradient descent leads to a very slow learning rate at the start of training. That is: $\mathrm{MSE} = \dfrac{1}{n}\sum_{i}^{n}(\hat{y}_i - y_i)^2$

Therefore, RRNBNet uses the cross-entropy loss function, a frequently used loss function in classification problems that measures the discrepancy between two probability distributions. The formula for the cross-entropy loss in the multi-class case is shown in (48), where M denotes the number of categories, $y_{ic}$ takes the value 1 if the true category of sample i is c and 0 otherwise, and $p_{ic}$ denotes the predicted probability that sample i belongs to category c. That is: $L = \dfrac{1}{N}\sum_i L_i = -\dfrac{1}{N}\sum_i\sum_{c=1}^{M} y_{ic}\log(p_{ic})$

Because the model uses softmax to obtain its output, the combination of softmax and cross-entropy loss is good at learning inter-class information: learning is faster when the model is poor and slower when the model is already good. It can also be seen from the formula that the cross-entropy loss function pays more attention to the accuracy of the prediction for the correct label and ignores the other, incorrect predictions, which may lead to scattering of the learned features.
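A small NumPy sketch of the multi-class cross-entropy of Eq. (48) follows, assuming the network outputs are already softmax probabilities; only the probability assigned to the true class of each sample enters the loss, which is exactly the behaviour discussed above.

```python
import numpy as np

def cross_entropy_loss(probs: np.ndarray, labels: np.ndarray) -> float:
    """Multi-class cross-entropy of Eq. (48).

    probs: (N, M) softmax outputs p_ic; labels: (N,) integer class indices.
    """
    N = probs.shape[0]
    eps = 1e-12                                    # numerical guard for log(0)
    true_class_prob = probs[np.arange(N), labels]  # p_ic where y_ic = 1
    return float(-np.mean(np.log(true_class_prob + eps)))

# Example: 3 samples, 4 classes.
p = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.20, 0.50, 0.20, 0.10],
              [0.05, 0.05, 0.10, 0.80]])
y = np.array([0, 1, 3])
print(cross_entropy_loss(p, y))   # ~ (0.357 + 0.693 + 0.223) / 3 = 0.424
```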

Realisation and training

Two versions of the network were designed for vehicle detection and path planning objectives, trained and tested separately. The detailed formulation is shown below:

Vehicle detection

Consistent with the vehicle target detection range, the network processes the point cloud within the range $[x \in (c_1,c_2),\; y \in (c_3,c_4),\; z \in (c_5,c_6)]$ metres, with a grid size of $d_L = 0.22$ m and $d_W = 0.22$ m, a maximum of T = 1000 points per grid cell, and a maximum of 10,000 grid cells. After the network extracts features for the grid cells, 10,000 grid features are obtained and, combined with the coordinates of each cell, scattered into a sparse feature map of size 496 × 432 in which the proportion of non-empty cells is less than 5.6%. Each module of the network is then configured; the last module outputs a feature map with 128 channels and size 256 × 128 × 3. Accordingly, the first-layer convolution parameters of all detection branches are (256, 128, 3), and the input channel number of the second-layer convolution is 128.
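The following is a rough sketch of the grid assignment described above; the concrete x/y ranges are illustrative values chosen only so that the resulting grid is 496 × 432 cells at 0.22 m resolution, since the paper gives the range symbolically as $[x(c_1,c_2), y(c_3,c_4), z(c_5,c_6)]$.

```python
import numpy as np

def voxelize(points: np.ndarray,
             x_range=(0.0, 95.04), y_range=(-54.56, 54.56),
             d_l: float = 0.22, d_w: float = 0.22,
             max_points: int = 1000, max_voxels: int = 10000):
    """Grid (pillar) assignment sketch for the detection branch.

    points: (N, 4) LiDAR points (x, y, z, intensity).
    """
    nx = int(round((x_range[1] - x_range[0]) / d_l))   # 432 cells along x
    ny = int(round((y_range[1] - y_range[0]) / d_w))   # 496 cells along y
    ix = ((points[:, 0] - x_range[0]) / d_l).astype(int)
    iy = ((points[:, 1] - y_range[0]) / d_w).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

    voxels = {}                                        # (iy, ix) -> list of points
    for pt, cx, cy in zip(points[valid], ix[valid], iy[valid]):
        cell = voxels.setdefault((cy, cx), [])
        if len(cell) < max_points:                     # cap points per grid cell (T)
            cell.append(pt)
        if len(voxels) >= max_voxels:                  # cap the number of non-empty cells
            break
    return voxels, (ny, nx)                            # sparse cells and grid size 496 x 432
```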

Path Planning

Compared with the single-feature trajectory prediction method, which uses only positioning information, the multi-feature combined trajectory prediction method combines vehicle positioning information and vehicle state information. The collected data are combined as the input parameters of the network, and the input sequence is denoted by I, where $I_i\ (i = 1, 2, \dots, n)$ is the i-th multi-feature input, x is the vehicle longitude, y is the vehicle latitude, φ is the vehicle yaw angle, v is the vehicle speed, a is the vehicle acceleration, fl is the left front wheel speed, fr is the right front wheel speed, rl is the left rear wheel speed, and rr is the right rear wheel speed. That is:

$I = \begin{bmatrix} I_1 \\ I_2 \\ \vdots \\ I_n \end{bmatrix} = \begin{bmatrix} x_1 & y_1 & \varphi_1 & v_1 & a_1 & fl_1 & fr_1 & rl_1 & rr_1 \\ x_2 & y_2 & \varphi_2 & v_2 & a_2 & fl_2 & fr_2 & rl_2 & rr_2 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_n & y_n & \varphi_n & v_n & a_n & fl_n & fr_n & rl_n & rr_n \end{bmatrix}$

The two trajectory point gaps were analysed separately at a system data acquisition frequency of 40 Hz, with a time step of 30 ms for the input data, an input data sequence $I = [I_1, I_2, I_3, \dots, I_n]$ and a prediction output $O = [O_1, O_2]$.
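A minimal sketch of how such an input sequence could be assembled from logged signals is given below; the dictionary-of-arrays log format is an assumption made here purely for illustration.

```python
import numpy as np

def build_input_sequence(log: dict, n: int) -> np.ndarray:
    """Stack the multi-feature inputs I_i = (x, y, phi, v, a, fl, fr, rl, rr).

    log: dict of equal-length 1-D arrays keyed by the feature names below,
    sampled at the system acquisition frequency; n: sequence length.
    """
    keys = ["x", "y", "phi", "v", "a", "fl", "fr", "rl", "rr"]
    return np.stack([np.asarray(log[k][:n]) for k in keys], axis=1)   # shape (n, 9)

# Example with a synthetic 30-step log.
rng = np.random.default_rng(0)
log = {k: rng.normal(size=30) for k in ["x", "y", "phi", "v", "a", "fl", "fr", "rl", "rr"]}
I = build_input_sequence(log, n=30)
assert I.shape == (30, 9)
```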

Analysis of 3D Occupancy Network applications
Analysis of vehicle detection in autonomous driving

In order to verify the feasibility of the 3D occupancy network proposed above, this paper conducts experiments on dataset (A) to analyse and compare the detection effects of different models. This section first introduces the experimental environment, then analyses the experimental results in terms of loss, accuracy, and vehicle distance detection, and verifies the effectiveness of the 3D occupancy network on the basis of these results.

Environment for model training

To optimise the loss function, the Adam optimiser with one-cycle strategy is used with an initial learning rate of 1 × 10−4. On an NVIDIA RTX 4090Ti graphics card, 8 samples are processed per batch, and all samples are trained a total of 200 times. The individual loss weights are set to: α = 2, β = 1, w0 = 1, w1 = 2, and w2, w3, w4, w5, w6 = 0.2. The IoU threshold for non-maximal suppression is set to 0.1 and the scoring threshold is set to 0.3 for inference.

Loss and accuracy analyses

Loss value

The network was trained with a batch size of 16 and 200 epochs, and obtained high recognition accuracy while avoiding overfitting. The training process was visualised using TensorBoard from the TensorFlow third-party library. The training-set loss curves are shown in Figure 6, where (a)~(c) are the regression loss (CIOU Loss), confidence loss (Focal Loss), and category loss (cross-entropy) curves, respectively; the validation-set loss curves are shown in Figure 7, where the horizontal coordinate is the epoch and the vertical coordinate is the loss value. Taken together, Figures 6 and 7 show that, on both the training set and the validation set, the cross-entropy loss selected in this paper converges faster than CIOU Loss and Focal Loss, as reflected in its value falling from an initial 0.045 to a final 0.003.

Accuracy

In order to further validate the superiority of the 3D occupancy network proposed in this chapter for autonomous driving, a comparative analysis is carried out. With the same training period and learning rate, the same vehicle detection dataset (A), and data augmentation in the form of geometric transformation + Cutout + Mosaic superposition, training is performed in the same experimental environment. The comparison of vehicle detection accuracy (mAP) is shown in Fig. 8, where the vehicle types include cars, vans, trucks, and trams, and the compared algorithms are CNN, RNN, LSTM, and DNN. As reflected in Fig. 8, the performance of 3D-occupancy-network-based vehicle detection (0.9219) is much better than that of CNN (0.7533), RNN (0.8053), LSTM (0.9016), and DNN (0.92198), indicating that the 3D occupancy network delivers excellent detection performance, detects dense overlapping targets, and exhibits almost no missed or false detections in real road scenes. In terms of detection effect, this demonstrates that the 3D occupancy network has better detection performance and good robustness for vehicle target detection tasks.

Figure 6.

Training set loss function

Figure 7.

Verification set loss function

Figure 8.

Comparison of vehicle detection accuracy

Vehicle distance detection analysis

In order to verify the effectiveness of the 3D occupancy network in detecting the distance to the vehicle ahead and to evaluate its ranging accuracy in a real road environment, a static ranging experiment on the vehicle ahead was designed and carried out. The experimental environment is an actual road scene; the pitch angle of the camera is set to 2° and its installation height is 1.5 m. Multiple data points are set up: the initial distance between the vehicle to be measured and the camera is 5 m, and a data point is placed every 5 m up to 100 m. The results of the vehicle distance detection analysis are shown in Table 2. From the data in Table 2 it can be seen that within 70 m the distance detection error of the three-dimensional occupancy network is kept below 5%. As the distance increases, the detection error reaches a maximum of 6.59%, while the overall detection error is 3.72%. This indicates that, at distances of 70 m or less, the three-dimensional occupancy network keeps the ranging error for the vehicle ahead below 5%, enabling longitudinal distance detection of the vehicle ahead in the ego lane so that a driverless car can automatically decelerate when the vehicle ahead is too close.

Vehicle distance detection analysis results

Actual distance/m Predicted distance/m Absolute error/m Prediction error/%
5 5.11 0.11 2.15%
10 10.31 0.31 3.01%
15 15.25 0.25 1.64%
20 20.43 0.43 2.10%
25 25.91 0.91 3.51%
30 31.16 1.16 3.72%
35 35.85 0.85 2.37%
40 41.09 1.09 2.65%
45 43.20 -1.80 -4.17%
50 52.20 2.20 4.21%
55 57.76 2.76 4.78%
60 62.36 2.36 3.78%
65 67.76 2.76 4.07%
70 73.30 3.30 4.50%
75 79.65 4.65 5.84%
80 84.71 4.71 5.56%
85 89.89 4.89 5.44%
90 96.13 6.13 6.38%
95 101.38 6.38 6.29%
100 107.06 7.06 6.59%

The calculation process is as follows, taking 5 m as an example (the others are analogous): $\text{Prediction error} = \dfrac{5.11 - 5}{5.11} = 2.15\%$

Path Planning Analysis in Autonomous Driving
Description of the experiment

Maps 1, 2 and 3 are created for overtaking with few static obstacles, overtaking with many static obstacles, and right turning with many static obstacles, respectively. On this basis, Map 4 is created for overtaking with multiple dynamic obstacles and some static obstacles. These maps are representative of typical road conditions and can effectively validate the effectiveness and adaptability of the three-dimensional occupancy network.

Analysis of results

The path post-processing results under the three maps are shown in Fig. 9, and the average post-processing results over 30 simulations are shown in Table 3. The reverse-optimised path is obtained by reverse optimisation of the preliminary path produced by the 3D occupancy network, reducing the path cost while ensuring driving safety. The spline-smoothed path in Fig. 9 is obtained from the reverse-optimised path by smoothing with a cubic B-spline subject to the maximum steering angle constraint of the vehicle, which resolves the problem that individual corners of the reverse-optimised path are still too large; the smoothed path is closer to an actual vehicle driving path. The path length reduction achieved by reverse optimisation is evident, and the cubic B-spline has a very noticeable smoothing effect while also reducing the path length. To evaluate path smoothness, the number of path segments at which curvature continuity is broken is used as the evaluation parameter: the smaller this number, the higher the smoothness. The smoothness for Maps 1~3 is improved by 94.19%, 94.85% and 95.69%, respectively. In short, the proposed 3D occupancy network significantly improves the efficiency of path planning in the three map scenarios designed on the basis of urban intersections.

Figure 9.

Path post-processing under the three maps

Average post-processing results over 30 simulations

Map Path Path length/m Length reduction/% Number of segments Smoothness improvement
1 Preliminary path 1174.42 - 8.6 -
  Reverse-optimised path 1172.43 0.17 3.5 59.30%
  Spline-smoothed path 1171.38 0.26 0.5 94.19%
2 Preliminary path 1179.26 - 9.7 -
  Reverse-optimised path 1174.32 0.42 3.5 63.92%
  Spline-smoothed path 1172.21 0.60 0.5 94.85%
3 Preliminary path 1141.08 - 11.6 -
  Reverse-optimised path 1105.21 3.14 4.5 61.21%
  Spline-smoothed path 1094.61 4.07 0.5 95.69%

Taking Map 1 as an example, the length reduction and smoothness improvement are calculated as follows: $\text{Length reduction} = \dfrac{1174.42 - 1172.43}{1174.42} = 0.17\%,\qquad \text{Smoothness improvement} = \dfrac{8.6 - 3.5}{8.6} = 59.30\%$

Combined with the simulations of the first three map scenes, it can be clearly seen that the three-dimensional occupancy network markedly improves path search time. To further validate the adaptability of the algorithm, an overtaking path-finding simulation with multiple dynamic obstacles and some static obstacles is carried out in the Map 4 scene. For better visualisation, and taking into account that the 3D occupancy network performs global planning, the results are displayed frame by frame. Map 4 is drawn to the same scale as the real road: vehicles X and Y drive at 60 km/h and 40 km/h, respectively, in the directions of the ground arrows, while vehicle Z is stationary due to a fault. The frame-by-frame simulation results under Map 4, from T = 0 s to T = 5 s, are shown in Fig. 10. The average planning time is 514.25 ms, which effectively verifies that the 3D occupancy network can quickly and efficiently plan feasible paths as obstacle positions change, and satisfies the practical engineering requirements of path planning for self-driving vehicles in various environments.

Figure 10.

Frame-by-frame simulation results under Map 4

Conclusion

The development of science and technology has made research in the field of automatic driving a key concern. This paper proposes 3D occupancy network modelling for automatic driving and verifies and analyses the network from the two aspects of vehicle detection and path planning. The following research results are obtained:

The cross-entropy loss function selected in this paper has the fastest and most stable convergence, its value falling from an initial 0.045 to a final 0.003. In addition, the vehicle detection accuracy of the proposed network is much better than that of CNN, RNN, LSTM, and DNN, which indicates that the network is clearly superior for vehicle detection and can better help avoid vehicle collisions and accidents during automatic driving.

Under four different road conditions, the network in this paper can always produce a reasonable path plan quickly, with an average planning time of 514.25 ms, which satisfies the requirements of safe automatic driving.
