3D Occupancy Network Modelling for Multi-view Image Fusion Techniques in Autonomous Driving
Published online: 05 Feb 2025
Received: 27 Aug 2024
Accepted: 17 Dec 2024
DOI: https://doi.org/10.2478/amns-2025-0060
© 2025 Xingyu Hu et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Autonomous driving technology is currently a hot topic in both the automotive industry and the wider technology sector, and it is one of the important development directions for future transportation. Autonomous driving, also known as driverless driving, uses computers, sensors, and artificial intelligence to enable vehicles to travel autonomously without human control. The technology allows a vehicle to understand its surroundings through sensors, maps, and other data sources and to use this information to make decisions, including how to accelerate, brake, turn, and avoid obstacles [1]. The gradual application of autonomous driving technology can effectively reduce traffic accidents, improve transportation efficiency, and enhance the driving and travelling experience [2].
In the field of autonomous driving, high-quality 3D reconstruction is important for achieving more accurate scene understanding and interaction. According to the way data are acquired, 3D reconstruction techniques can be divided into two categories: active and passive [3]. Active 3D reconstruction techniques use a transmitter to emit a specific signal towards the surface of an object; a receiver then captures the reflected signal, and the 3D coordinates of the object surface are calculated from information such as the propagation time or phase of the signal [4-7]. Passive 3D reconstruction techniques calculate the 3D coordinates of the object surface by capturing the light naturally reflected from it and applying image processing and computer vision techniques [8-9]. However, active 3D reconstruction methods suffer from higher cost, increased system complexity, a restricted operating range, security and privacy concerns, and insufficient real-time performance, while traditional 3D reconstruction methods are often limited by high computational complexity, low reconstruction accuracy, and sensitivity to scene and lighting conditions [10-14]. 3D reconstruction based on multi-view images requires only common visual sensors to conveniently provide image input and therefore has great development potential and research value.
Literature [15] proposes a single-stage multi-view multimodal 3D target detector (MVMM) that fuses image information through a data-level fusion method based on point cloud colouring, extracts semantic and geometric information from it, and generates 3D results from the captured coloured points and range-view features. Literature [16] proposes a multi-view reprojection structure to estimate the pose of the self-driving vehicle and surrounding objects; it uses a 2D detection box to infer the orientation and dimensions of the 3D bounding box of the detected object and then employs a perspective geometric regression method to obtain the multi-view dimensions, orientations, and reprojection box of the 3D reconstruction layer, enabling accurate recovery of the 3D box even when the 2D box is truncated. Literature [17] combines sensor fusion, hierarchical multi-view networks, and traditional heuristics to build a target detection system for autonomous vehicles, efficiently fusing 2D RGB images with 3D point cloud data through a hierarchical multi-view proposal network (HMVPN) and then generating a candidate set of driving suggestions through machine learning and heuristics. Literature [18] shows that the recording and linking of multi-view images can be used to reconstruct 3D objects from 2D images with a self-organising mapping approach that builds a voxel-grid-like structure; the method has a powerful ability to recover 3D objects from event-based stereo data and does not require any a priori knowledge of camera parameters. Literature [19] explored an optimisation scheme for a deep-learning-based multi-view stereo (MVS) system, using neural network channel pruning to remove redundant parameters and thereby compress and accelerate the highly complex 3D reconstruction model; this not only improves the 3D reconstruction rate of the MVS system but also improves the accuracy and completeness of the model. Literature [20] constructed UniScene, a unified multi-camera pre-training framework that benefits 3D target detection and semantic scene completion tasks and addresses the spatial and temporal neglect of multi-camera systems in monocular 2D pre-training, providing important practical value for the realisation of autonomous driving.
This article uses multi-view image fusion technology to acquire image data, determines the structural parameters of the vehicle, and applies the Ackermann steering principle to construct the vehicle kinematic model. Starting from the definition of a 3D occupancy network, a multi-scale mechanism is formulated to enrich the extracted features, the 3D occupancy network is built through Transformer-based feature fusion, and cross-entropy is selected as the loss function of the network. The network is then iteratively trained and optimised on the research data, the experimental environment and network parameters are set up, and, to verify the effectiveness of the model, the network is used to carry out experimental simulation and analysis of vehicle detection and path planning.
The source image is first transformed into a chromaticity-based colour space. The average of the cumulative chromaticity mean values over all near-white-point regions is then taken as the final reference white, from which the gain of each channel is computed.
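The exact transform and gain equations are not reproduced in the source. As a rough illustration only, the following NumPy/OpenCV sketch implements one common near-white-point white-balance scheme; the function name, the `ratio` parameter, and the choice of YCrCb space are assumptions, not the paper's notation.

```python
import cv2
import numpy as np

def near_white_point_balance(bgr, ratio=0.10):
    """Rough near-white-point white balance (illustrative sketch only)."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    y, cr, cb = ycrcb[..., 0], ycrcb[..., 1], ycrcb[..., 2]

    # Chromaticity distance from the neutral point (128, 128).
    d = np.abs(cr - 128.0) + np.abs(cb - 128.0)

    # Keep the brightest low-chromaticity pixels as near-white candidates.
    candidates = (d <= np.quantile(d, ratio)) & (y >= np.quantile(y, 1.0 - ratio))

    bgr_f = bgr.astype(np.float32)
    means = bgr_f[candidates].mean(axis=0)          # per-channel mean over near-white pixels
    gains = means.max() / np.maximum(means, 1e-6)   # per-channel gain

    return np.clip(bgr_f * gains, 0, 255).astype(np.uint8)
```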
The optical principle of the equivalent spherical model method is shown in Fig. 1.

Optical principle of equivalent spherical model method
From the assumptions of the equivalent spherical model, the relationship between distorted and corrected coordinates is easily obtained. In general, the centre of distortion can simply be taken as the geometric centre of the distorted image. Given the width and height of the image, the correction proceeds in three steps: calculate the distortion distance, calculate the corresponding angle, and calculate the coordinates of the corrected pixel point.
Substituting Eqs. (6), (7), and (8) into Eq. (9) gives the spatial coordinate mapping relationship:
Then traversing the pixel points of the distorted image, performing coordinate transformation using equation (10), and then recovering the grey values using the interpolation algorithm, the correction of the distorted image can be achieved.
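The spatial mapping of Eq. (10) is not reproduced in the source, so the sketch below only illustrates the correction loop this paragraph describes (distortion centre, radius, angle, inverse mapping, interpolation). The `mapping` callable is a hypothetical stand-in for the equivalent-spherical-model relation, and the function name is assumed.

```python
import cv2
import numpy as np

def undistort_radial(img, mapping):
    """Correct a distorted image given a radial coordinate mapping.

    `mapping(r)` returns, for each corrected-image radius r (pixels), the
    corresponding radius in the distorted image; it stands in for Eq. (10).
    """
    h, w = img.shape[:2]
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0      # distortion centre ~ geometric centre

    # Corrected (target) pixel grid.
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    dx, dy = xs - cx, ys - cy
    r = np.hypot(dx, dy)                        # distortion distance
    theta = np.arctan2(dy, dx)                  # angle

    # Source coordinates in the distorted image, then bilinear interpolation.
    r_src = mapping(r)
    map_x = (cx + r_src * np.cos(theta)).astype(np.float32)
    map_y = (cy + r_src * np.sin(theta)).astype(np.float32)
    return cv2.remap(img, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```

For example, `undistort_radial(img, lambda r: r * (1 + 5e-4 * r))` applies a simple quadratic radial stretch as a placeholder mapping.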
For the overlapping region, a weighted-edge undirected graph and two search trees are constructed, as shown in Fig. 2.

Construction of the weighted undirected graph over the overlapping region and the two search trees
The energy to be minimised consists of a data term and a smoothing term. The smoothing term can be defined in three ways: in colour space, in gradient space, or in the combined colour + gradient space.
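As an illustration of the three smoothing-term variants just listed, the sketch below computes a per-pixel cost from colour differences, gradient-magnitude differences, or their sum over the overlap region; the function name and the particular difference and gradient operators are assumptions, not the paper's definitions.

```python
import numpy as np

def seam_smoothing_cost(img_a, img_b, mode="colour+gradient"):
    """Per-pixel smoothing cost over the overlap region of two images."""
    a = img_a.astype(np.float32)
    b = img_b.astype(np.float32)

    colour = np.linalg.norm(a - b, axis=-1)               # colour-space difference

    # Simple forward-difference gradient magnitude, summed over channels.
    def grad_mag(x):
        gx = np.diff(x, axis=1, append=x[:, -1:])
        gy = np.diff(x, axis=0, append=x[-1:, :])
        return np.sqrt(gx ** 2 + gy ** 2).sum(axis=-1)

    gradient = np.abs(grad_mag(a) - grad_mag(b))           # gradient-space difference

    if mode == "colour":
        return colour
    if mode == "gradient":
        return gradient
    return colour + gradient                                # colour + gradient space
```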
On the basis of panoramic stitching of the multi-view images, and in order to improve the completeness of the image information, the image fusion algorithm processes the differing part and the common part of the two images separately: the difference is modelled to obtain the differing part of the two images and ensure completeness, while multiplication is used to obtain the common part of the two images and improve the signal-to-noise ratio; the two parts are then normalised and summed to obtain the fused image. The DMS method avoids the redundancy caused by repeated in-phase superposition and has a clear advantage in suppressing background clutter [21]. Assume that two normalised single-view images are given. The common parts are first found and normalised:
The differing parts are then found and normalised:
Finally, the two parts are summed and normalised:
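A minimal sketch of the DMS idea described above, assuming simple min-max normalisation and element-wise operations on two registered single-view images; the function name `dms_fuse` is illustrative.

```python
import numpy as np

def dms_fuse(img1, img2, eps=1e-6):
    """Difference-Multiplication-Sum (DMS) fusion of two normalised views.

    The common part is the (normalised) product of the two images, which
    raises the signal-to-noise ratio; the difference part is the
    (normalised) absolute difference, which preserves complementary detail;
    the fused image is the normalised sum of the two parts.
    """
    def normalise(x):
        x = x.astype(np.float32)
        return (x - x.min()) / (x.max() - x.min() + eps)

    a, b = normalise(img1), normalise(img2)
    same = normalise(a * b)            # identical part (multiplication)
    diff = normalise(np.abs(a - b))    # difference part
    return normalise(same + diff)      # summed and normalised
```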
On the basis of the multi-view image fusion technology, the outer contour of the vehicle is simplified to a rectangle according to its basic structural parameters, the kinematic state equations of the vehicle are established on the basis of the Ackermann steering principle, and the relationship between the four vertices of the body and the centre point of the rear axle is deduced from the geometry of the vehicle, laying the foundation for the construction of the subsequent three-dimensional occupancy network.
The structural parameters of the vehicle are shown in Table 1. The vehicle's structural parameters have a great influence on parking planning and control. Because the vehicle speed during parking is low, producing very little pitch and roll, the complex and varied outer contour of the vehicle can be simplified: the maximum length and maximum width of the vehicle are taken as the length and width of a rectangle, and the outer contour is replaced by that rectangle. This simplification not only facilitates the subsequent analysis of vehicle kinematics but also makes intelligent driving safer.
Vehicle structure parameters
Vehicle parameter | Unit | Numerical value |
---|---|---|
Vehicle width | m | 0.99 |
Wheelbase | m | 1.00 |
Front overhang | m | 0.38 |
Rear overhang | m | 0.49 |
Minimum turning radius | m | 2.45 |
Maximum front wheel steering angular velocity | rad/s | 0.5236 |
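To make the rectangle simplification concrete, the sketch below computes the four body-corner coordinates from the rear-axle centre pose using the parameters of Table 1. The paper's exact geometric relations are not reproduced; the frame conventions, function names, and dataclass here are assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class VehicleParams:
    width: float = 0.99          # vehicle width (m), from Table 1
    wheelbase: float = 1.00      # wheelbase (m)
    front_overhang: float = 0.38 # front overhang (m)
    rear_overhang: float = 0.49  # rear overhang (m)

def body_corners(x, y, theta, p=VehicleParams()):
    """Four corners of the rectangular body given the rear-axle centre pose.

    (x, y) is the rear-axle centre and theta the heading. In the body frame
    the rectangle spans [-rear_overhang, wheelbase + front_overhang]
    longitudinally and +/- width/2 laterally.
    """
    xf = p.wheelbase + p.front_overhang     # front edge, body frame
    xr = -p.rear_overhang                   # rear edge, body frame
    half_w = p.width / 2.0
    local = np.array([[xf,  half_w],        # front-left
                      [xf, -half_w],        # front-right
                      [xr, -half_w],        # rear-right
                      [xr,  half_w]])       # rear-left
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return local @ rot.T + np.array([x, y])
```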
Ackermann steering geometry is the theoretical basis for establishing the vehicle kinematics model; its prerequisites are a zero front-wheel alignment angle and the absence of slip at all wheels during steering [22]. A schematic of the equivalent front wheel angle is shown in Figure 3. When a four-bar linkage with equal-length cranks is used, the steering angle of the inner wheels is 2°~4° larger than that of the outer wheels, so that during steering the circular trajectories of the four wheels share a common centre located at a point on the extension line of the vehicle's rear axle.

Schematic diagram of equivalent front wheel angle
The front and rear wheel tracks of a vehicle are usually equal or differ only slightly, so the geometric relationship between the inner wheel angle and the outer wheel angle can be obtained from Fig. 3. According to the bicycle model, the trajectories of the front and rear wheels can then be represented by two equivalent wheels located at the centres of the front and rear axles, with a corresponding equivalent steering angle.
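For reference, the standard Ackermann steering relations take the following form (the paper's own equations are not reproduced in the source; here $\delta_i$ and $\delta_o$ denote the inner and outer wheel angles, $B$ the track width, $L$ the wheelbase, $R$ the turning radius of the rear-axle centre, and $\delta$ the equivalent front wheel angle):

$$\cot\delta_o - \cot\delta_i = \frac{B}{L}, \qquad \tan\delta = \frac{L}{R}.$$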
The vehicle kinematics model studies the relationship between vehicle position, speed, and time from a geometric point of view. Based on the Ackermann steering principle, the kinematics model with the rear-axle centre point as the origin is established as shown in Fig. 4; building the model requires the following basic assumptions:
1. Tyre side-slip is neglected; the tyres roll purely on the ground.
2. External loads on the vehicle are neglected; the vehicle body and suspension system are treated as rigid.
3. Only motion in the horizontal plane is considered; motion perpendicular to the road surface is ignored.
4. Based on the Ackermann steering principle, the motion states of the left and right tyres are described equivalently by a single tyre.

Kinematics model of vehicle
In the inertial coordinate system, i.e. the global coordinate system, the position of the vehicle is described by the coordinates of the centre points of its front and rear axles.
First, the differential equations are established at the centre point of the rear axle. Since the kinematics model considers only the longitudinal motion of the front and rear wheels and neglects lateral slip, the following equations can be obtained:
Then the following relationship exists between the coordinates of the centre points of the front and rear axles of the vehicle:
Deriving Eq. (22) with respect to time gives:
Substituting Eq. (23) into Eq. (21) gives:
The differential equations are likewise established at the centre of the front axle: differentiating its lateral and longitudinal displacements with respect to time gives the lateral and longitudinal velocity components:
Substituting Eq. (25) into Eq. (24) and applying the sine angle-sum identity gives:
Integrating Eq. (26) over time gives:
Substituting equations (25) and (26) into equation (24) gives:
The velocity at the centre point of the front and rear axles of the vehicle satisfies the following relationship:
Taking the velocity of the rear-axle centre point as the velocity of the whole vehicle and combining Eqs. (27) and (29), the vehicle kinematics model is obtained:
Integrating Eq. (30) over time gives:
Squaring and adding both sides of Eq. (31) simultaneously yields the geometric relationship for the trajectory of the centre point of the rear axle of the vehicle:
From Equation (32), it can be seen that when the vehicle moves with a fixed front wheel angle, its trajectory is a circle whose radius is determined by the vehicle parameters: the wheelbase and the front wheel angle determine the radius, while the travelling speed has no effect on it. This law provides the theoretical basis for the subsequent construction of a three-dimensional occupancy network oriented towards autonomous driving.
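As a sanity check on this circular-trajectory property, the following sketch integrates a standard rear-axle kinematic bicycle model (a stand-in for Eq. (30), with assumed variable names); with a fixed steering angle the simulated trajectory is a circle of radius $L/\tan\delta$ regardless of the chosen speed.

```python
import numpy as np

def simulate_bicycle(v, delta, wheelbase=1.00, dt=0.01, steps=2000):
    """Integrate the rear-axle-centre kinematic (bicycle) model.

    State is (x, y, theta); with fixed speed v and fixed equivalent front
    wheel angle delta the trajectory is a circle of radius
    R = wheelbase / tan(delta), independent of v.
    """
    x, y, theta = 0.0, 0.0, 0.0
    traj = []
    for _ in range(steps):
        x += v * np.cos(theta) * dt
        y += v * np.sin(theta) * dt
        theta += v * np.tan(delta) / wheelbase * dt
        traj.append((x, y))
    return np.array(traj)

# Two different speeds trace (approximately) the same circle:
r1 = simulate_bicycle(v=1.0, delta=0.3)
r2 = simulate_bicycle(v=2.0, delta=0.3, steps=1000)
```

Both `r1` and `r2` trace the same circle of radius 1.00/tan(0.3) ≈ 3.23 m.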
A 3D occupancy network is a deep-learning-based 3D reconstruction method that represents 3D surfaces as the continuous decision boundary of a deep neural network classifier [23]. Such a representation can encode high-resolution 3D outputs without excessive memory consumption, encodes 3D structure efficiently, and can infer models from different kinds of inputs, which makes it widely applicable to 3D target detection.
In the construction of 3D occupancy networks for multi-view image fusion in autonomous driving scenarios, the diversity of object sizes and occlusion levels poses a serious challenge; the aim is therefore to construct the 3D occupancy network by capturing richer detail information. In this paper, the outputs of the last three convolutional layers of the LiDAR backbone network (VoxelNet) are selected as inputs; these outputs are downsampled by 2x, 4x, and 8x, respectively, to obtain 3D voxel features at different scales. These 3D voxel features are then projected into 2D multi-scale BEV feature maps of corresponding dimensions.
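The projection from 3D voxel features to BEV maps can be realised in several ways; the sketch below shows one common choice (folding the height dimension into the channel dimension). Tensor shapes are chosen purely for illustration and are not taken from the paper.

```python
import torch

def voxel_to_bev(voxel_feats):
    """Project multi-scale 3D voxel features to 2D BEV feature maps.

    voxel_feats: list of tensors of shape (B, C, D, H, W) taken from the
    last three backbone stages (2x, 4x, 8x downsampled). The height axis D
    is folded into the channel axis, a common way to obtain a BEV map.
    """
    bev_maps = []
    for f in voxel_feats:
        b, c, d, h, w = f.shape
        bev_maps.append(f.reshape(b, c * d, h, w))   # (B, C*D, H, W) BEV map
    return bev_maps

# Toy example with three scales of a (B, C, D, H, W) voxel grid:
feats = [torch.randn(1, 64, 8, 200, 200),
         torch.randn(1, 128, 4, 100, 100),
         torch.randn(1, 256, 2, 50, 50)]
bev = voxel_to_bev(feats)   # shapes: (1,512,200,200), (1,512,100,100), (1,512,50,50)
```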
Most existing data-level point cloud image fusion methods rely heavily on calibration matrices to establish point-to-point associations between point clouds and images. However, points in point clouds are much sparser compared to image pixels, which means that fusion methods based on point-to-point associations lose a large number of semantic features in the image and are sensitive to calibration errors. To overcome these limitations, this paper proposes a Transformer-based fusion mechanism that is able to adaptively fuse LiDAR and camera features without relying on point-to-point associations. Specifically, the fusion mechanism in this study uses two layers of Transformer decoders to fuse LiDAR and camera features. These decoder layers learn the correlation between the Object Query and the two modal features and output the 3D bounding box parameters of the object. The structure of the fusion mechanism is shown in Figure 5. On the left is the first Transformer decoder layer, located in the point cloud branch, and on the right is the second Transformer decoder layer, located in the image branch.

The structure of the fusion mechanism
The first decoder layer takes as input a set of Object Query and frequency-enhanced LiDAR BEV feature maps and outputs an initial bounding box for each Query. The Object Query is a learnable embedding that represents the object to be detected. The frequency-enhanced BEV feature maps are obtained by using the FAFEM module on the LiDAR BEV feature maps. The first decoder layer applies the self-attention module and the cross-attention module to the Object Query and the frequency-enhanced BEV feature map, respectively. The self-attention module allows the Query to interact with each other to capture the global context, while the cross-attention module enables the Query to focus on relevant regions in the frequency-enhanced BEV feature map. The output of the first decoder layer is a set of refined Object Queries encoding frequency, spatial and semantic information from the point cloud. The initial bounding box is then predicted by applying a feed-forward network (FFN).
The second decoder layer takes the refined Object Query and the frequency-enhanced image feature maps as input and outputs the final bounding box for each Query. The frequency-enhanced image feature maps are obtained by using the FAFEM module on the image feature maps. The second decoder layer similarly applies the self-attention module and the cross-attention module to the Object Query and the frequency-enhanced image feature maps, respectively. The self-attention module further filters the Query by combining information from other Queries, while the cross-attention module adaptively fuses the Query with relevant image features using frequency, spatial and contextual relationships. The output of the second decoder layer is a set of Object Query fused with LiDAR and image information. The final bounding box is predicted by applying another FFN.
The self-attention module, cross-attention module and FFN are computed as follows:
Self-attention module:
Cross-attention module:
FFN:
where Z is the set of output queries from the cross-attention module.
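The following PyTorch sketch mirrors the two-layer decoder structure described above: self-attention over the Object Queries, cross-attention first to LiDAR BEV tokens and then to image tokens, each stage followed by an FFN head. The frequency-enhancement (FAFEM) module and the paper's exact attention formulas are omitted; all class names, dimensions, and the 7-parameter box head are assumptions.

```python
import torch
import torch.nn as nn

class FusionDecoderLayer(nn.Module):
    """Query self-attention, cross-attention to one modality's features,
    then a feed-forward network (residual connections with post-norm)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, queries, feats):
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.norm2(q + self.cross_attn(q, feats, feats)[0])
        return self.norm3(q + self.ffn(q))

class TwoStageFusion(nn.Module):
    """Stage 1 attends to LiDAR BEV tokens, stage 2 to image tokens;
    each stage is followed by an FFN head predicting box parameters."""
    def __init__(self, d_model=256, n_queries=200, box_dim=7):
        super().__init__()
        self.queries = nn.Embedding(n_queries, d_model)   # learnable Object Queries
        self.lidar_layer = FusionDecoderLayer(d_model)
        self.image_layer = FusionDecoderLayer(d_model)
        self.head1 = nn.Linear(d_model, box_dim)          # initial boxes
        self.head2 = nn.Linear(d_model, box_dim)          # final boxes

    def forward(self, bev_tokens, img_tokens):
        q = self.queries.weight.unsqueeze(0).expand(bev_tokens.size(0), -1, -1)
        q = self.lidar_layer(q, bev_tokens)
        boxes_init = self.head1(q)
        q = self.image_layer(q, img_tokens)
        return boxes_init, self.head2(q)
```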
There are many types of loss functions, including the classification error rate, the mean square error, and the cross-entropy loss function. The classification error rate is computed by dividing the number of incorrect predictions by the total number of samples; it is easy to understand but not very informative, since in many cases it cannot distinguish between models. The mean square error is another common loss function; it averages the loss values over all samples, as shown in formula (47). In classification problems, however, where sigmoid/softmax is used to obtain probability values, the mean square error loss leads to a very slow learning rate with gradient descent at the start of training. That is:
Therefore, RRNBNet uses the cross-entropy loss function, which is frequently used in classification problems, mainly to measure the discrepancy between two probability distributions. The formula for cross-entropy loss in the multi-class case is shown in (48).
Because the model uses softmax to obtain its output, the combination of softmax and cross-entropy loss is good at learning inter-class information, which makes learning faster when the model fits poorly and slower when it fits well. It can also be seen from the formula that the cross-entropy loss focuses on the accuracy of the prediction for the correct label and ignores the other, incorrect predictions, which may lead to scattering of the learned features.
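The "fast when poor, slow when good" behaviour follows from the gradient of softmax plus cross-entropy with respect to the logits being p − y, as the short check below illustrates (PyTorch, illustrative values only):

```python
import torch
import torch.nn.functional as F

# Gradient of cross-entropy (softmax + NLL) w.r.t. the logits is p - y:
# large when the prediction is poor, small when it is good.
logits = torch.tensor([[2.0, 0.5, -1.0]], requires_grad=True)
target = torch.tensor([2])                       # true class is the worst-scored one

loss = F.cross_entropy(logits, target)           # softmax + negative log-likelihood
loss.backward()

p = F.softmax(logits, dim=-1)
y = F.one_hot(target, num_classes=3).float()
print(logits.grad)                               # equals p - y
print(p - y)
```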
Two versions of the network were designed for vehicle detection and path planning objectives, trained and tested separately. The detailed formulation is shown below:
Vehicle detection: consistent with the vehicle target detection range, the network in this paper processes the point cloud within a specified detection range.
Path planning: compared with the single-feature trajectory prediction method, which uses only positioning information, the multi-feature combined trajectory prediction method combines vehicle positioning information with vehicle state information. The collected data are combined as the input parameters of the network, forming the input sequence.
The gaps between the two kinds of trajectory points were analysed separately, with a system data acquisition frequency of 40 Hz, a time step of 30 ms for the input data, and an input data sequence of corresponding length.
In order to verify the feasibility of the 3D occupancy network proposed above, this paper focuses on experiments using dataset (A) to analyse and compare the detection effects of different models. In this section, the experimental environment is first introduced, and the experimental results are analysed in terms of loss and accuracy analysis, vehicle distance detection, and the effectiveness of the 3D occupancy network is verified based on the results of the experimental analysis.
To optimise the loss function, the Adam optimiser with a one-cycle strategy is used with an initial learning rate of 1 × 10−4. On an NVIDIA RTX 4090Ti graphics card, 8 samples are processed per batch and the network is trained for a total of 200 epochs. The individual loss weights are set to:
Loss value: The network was trained with a batch size of 16 over 200 epochs, obtaining high recognition accuracy while avoiding overfitting. The training process was visualised with TensorBoard from the TensorFlow library. The training-set loss curves are shown in Figure 6, where (a)~(c) are the regression loss (CIoU Loss), confidence loss (Focal Loss), and category loss (cross-entropy loss) curves, respectively; the validation-set loss curves are shown in Figure 7, where the horizontal axis is the epoch and the vertical axis is the loss value. Together, Figures 6 and 7 show that, on both the training set and the validation set, the cross-entropy loss selected in this paper converges faster than the CIoU Loss and Focal Loss, falling from an initial 0.045 to a final 0.003.
Accuracy: To further validate the superiority of the proposed 3D occupancy network for autonomous driving, a comparative analysis is carried out. With the same training period and learning rate, the same vehicle detection dataset (A), and data augmentation in the form of superimposed geometric transformation + Cutout + Mosaic, the models are trained in the same experimental environment. The comparison of vehicle detection accuracy (mAP) is shown in Fig. 8, where the vehicle types include cars, vans, trucks, and trams, and the compared algorithms are CNN, RNN, LSTM, and DNN. As Fig. 8 reflects, the performance of 3D-occupancy-network-based vehicle detection (0.9219) is much better than that of CNN (0.7533), RNN (0.8053), LSTM (0.9016), and DNN (0.92198), indicating that the 3D occupancy network delivers excellent detection performance, detects dense overlapping targets, and shows almost no missed or false detections in real road scenes. In terms of detection effect, this demonstrates that the 3D occupancy network achieves better detection and good robustness in vehicle target detection tasks.

Training set loss function

Validation set loss function

Comparison of vehicle detection accuracy
In order to verify the effectiveness of the 3D occupancy network for detecting the distance to the vehicle in front and to evaluate its ranging accuracy in a real road environment, a static ranging experiment was designed and carried out. The experimental environment is an actual road scene, the pitch angle of the camera is set to 2°, and its installation height is 1.5 m. Multiple data points are set up: the initial distance between the test vehicle and the camera is 5 m, and further points are placed at 5 m intervals up to 100 m. The results of the vehicle distance detection analysis are shown in Table 2. Based on the data in Table 2, within 70 m the distance detection error of the three-dimensional occupancy network is kept below 5%; as the distance increases, the error reaches a maximum of 6.59%, while the overall detection error is 3.72%. This indicates that, at distances of 70 m or less, the three-dimensional occupancy network keeps the ranging error for the vehicle ahead below 5%, enabling longitudinal distance detection of the vehicle ahead in the ego lane so that a driverless car can automatically decelerate when the vehicle in front is too close.
Analysis results of the vehicle distance detection test
Actual distance/m | Predicted distance/m | Absolute error/m | Prediction error/% |
---|---|---|---|
5 | 5.11 | 0.11 | 2.15% |
10 | 10.31 | 0.31 | 3.01% |
15 | 15.25 | 0.25 | 1.64% |
20 | 20.43 | 0.43 | 2.10% |
25 | 25.91 | 0.91 | 3.51% |
30 | 31.16 | 1.16 | 3.72% |
35 | 35.85 | 0.85 | 2.37% |
40 | 41.09 | 1.09 | 2.65% |
45 | 43.20 | -1.80 | -4.17% |
50 | 52.20 | 2.20 | 4.21% |
55 | 57.76 | 2.76 | 4.78% |
60 | 62.36 | 2.36 | 3.78% |
65 | 67.76 | 2.76 | 4.07% |
70 | 73.30 | 3.30 | 4.50% |
75 | 79.65 | 4.65 | 5.84% |
80 | 84.71 | 4.71 | 5.56% |
85 | 89.89 | 4.89 | 5.44% |
90 | 96.13 | 6.13 | 6.38% |
95 | 101.38 | 6.38 | 6.29% |
100 | 107.06 | 7.06 | 6.59% |
The calculation process is as follows, taking 5m as an example, and the others are the same:
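Judging from the tabulated values, the prediction error appears to be the absolute error divided by the predicted distance; for the 5 m point this gives

$$\text{absolute error} = 5.11 - 5 = 0.11\ \text{m}, \qquad \text{prediction error} = \frac{0.11}{5.11} \times 100\% \approx 2.15\%,$$

which matches Table 2.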
Maps 1, 2, and 3 are created for overtaking with few static obstacles, overtaking with many static obstacles, and right turning with many static obstacles, respectively; on this basis, Map 4 is created for overtaking with multiple dynamic obstacles and some static obstacles. These maps are representative of typical road conditions and can effectively validate the effectiveness and adaptability of the three-dimensional occupancy network.
The path post-processing process under the three maps is shown in Fig. 9, and the average post-processing results over 30 simulations are shown in Table 3. The reverse-optimised path is obtained by reverse optimisation of the preliminary path produced by the 3D occupancy network, in order to reduce the path cost while ensuring driving safety. The spline-smoothed path in Fig. 9 is obtained from the reverse-optimised path by smoothing with a cubic B-spline under the constraint of the vehicle's maximum steering angle, which resolves the problem that some individual corners of the reverse-optimised path are still too sharp; the smoothed path better matches the actual driving path of a vehicle. The reduction of path length by reverse optimisation is obvious, and the cubic B-spline has a very noticeable smoothing effect while also reducing the path length. To evaluate path smoothness, the number of path segments (each segment being curvature-continuous) is used as the evaluation parameter: the smaller the number of segments, the higher the smoothness. The smoothness for Maps 1~3 is improved by 94.19%, 94.85%, and 95.69%, respectively. In short, the proposed 3D occupancy network significantly improves the efficiency of path planning in the three map scenarios designed around urban intersections.

Path post-processing under the three maps
Average post-processing results over 30 simulations
Map | Path | Path length/m | Length reduction/% | Number of path segments | Smoothness improvement |
---|---|---|---|---|---|
1 | Primary path | 1174.42 | - | 8.6 | - |
 | Reverse optimisation path | 1172.43 | 0.17 | 3.5 | 59.30% |
 | Spline smooth path | 1171.38 | 0.26 | 0.5 | 94.19% |
2 | Primary path | 1179.26 | - | 9.7 | - |
 | Reverse optimisation path | 1174.32 | 0.42 | 3.5 | 63.92% |
 | Spline smooth path | 1172.21 | 0.60 | 0.5 | 94.85% |
3 | Primary path | 1141.08 | - | 11.6 | - |
 | Reverse optimisation path | 1105.21 | 3.14 | 4.5 | 61.21% |
 | Spline smooth path | 1094.61 | 4.07 | 0.5 | 95.69% |
Taking map 1 as an example, the length reduction and smoothness improvement calculation process is shown below:
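Using the Table 3 values for Map 1, the spline-smoothed path gives

$$\text{length reduction} = \frac{1174.42 - 1171.38}{1174.42} \times 100\% \approx 0.26\%, \qquad \text{smoothness improvement} = \frac{8.6 - 0.5}{8.6} \times 100\% \approx 94.19\%,$$

consistent with the tabulated 0.26% and 94.19%; the reverse-optimised path is computed in the same way.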
The simulations on the first three map scenes clearly show that the three-dimensional occupancy network markedly improves the path search time. To further validate the adaptability of the algorithm, an overtaking path-finding simulation with multiple dynamic obstacles and some static obstacles is carried out in the Map 4 scene. For better visualisation, and taking into account the global planning of the three-dimensional occupancy network, the results are displayed frame by frame. Map 4 is drawn to scale from a real road; in the map, vehicles X and Y travel at 60 km/h and 40 km/h, respectively, in the directions of the arrows on the ground, while vehicle Z is stationary due to a breakdown. The frame-by-frame simulation results under Map 4, from T=0 s to T=5 s, are shown in Fig. 10. The average planning time is 514.25 ms, which effectively verifies that the 3D occupancy network can quickly and efficiently plan effective and feasible paths as obstacle positions change, and meets the practical engineering requirements of path planning for self-driving vehicles in various environments.

Frame-by-frame simulation results under Map 4
The development of science and technology has made research in the field of autonomous driving a key concern today. This paper proposes 3D occupancy network modelling for autonomous driving and verifies and analyses the network from the two aspects of vehicle detection and path planning. The following conclusions can be drawn:
The cross-entropy loss function selected in this paper has the fastest and most stable convergence, its value falling from an initial 0.045 to a final 0.003. In addition, the vehicle detection accuracy of the proposed network is much better than that of CNN, RNN, LSTM, and DNN, which indicates that the network is well suited to vehicle detection and can better help avoid vehicle collisions and accidents during automatic driving. Under four different road conditions, the network can always plan a reasonable path quickly, with an average planning time of 514.25 ms, which satisfies the requirements of safe automatic driving.