INTRODUCTION

With the development of robot technology, robots such as floor-sweeping robots and shopping-mall service robots are increasingly entering our daily lives. Mobile robots are the product of the cross-fusion of multiple disciplines and technologies. Among these technologies, SLAM (Simultaneous Localization and Mapping) is essential for mobile robots. SLAM means that the robot builds a map of its surrounding environment in real time from sensor data, without any prior knowledge, and infers its own position from that map. From the 1980s to the present, more and more sensors have been used in SLAM, from early sonar, to 2D/3D lidar, to monocular, binocular, RGB-D, ToF and other cameras. Compared with lidar, the cameras used in visual SLAM have become the focus of current SLAM research due to their low price, light weight, rich image information, and wide range of applications. Stereo cameras generally consist of two pinhole cameras placed horizontally. Compared with monocular vision, which suffers from scale uncertainty and degenerates under pure rotation, binocular cameras can directly compute pixel depth. At the same time, compared with RGB-D cameras, stereo cameras image ambient light directly and can be used both indoors and outdoors. Compared with lidar, the main disadvantages of the camera as a SLAM sensor are that images blur when the camera moves too fast, and that the camera fails in scenes with insufficient texture and few feature points.

To address the above problems of visual SLAM systems, this paper proposes an algorithm that fuses an IMU with visual SLAM. The IMU provides a good initial pose for the system and, during camera motion, compensates for the shortcomings of visual SLAM, ensuring accurate camera pose estimation when the camera moves quickly or the environment lacks texture.

RELATED WORKS
Camera coordinate system

Camera models generally involve four coordinate systems: a pixel coordinate system, an image coordinate system, a world coordinate system, and a camera coordinate system, as shown in Figure 1.

Figure 1. Camera-related coordinate systems

Among them, $O_w X_w Y_w Z_w$ is the world coordinate system. The world coordinate system is the reference coordinate system of the visual SLAM system; the camera trajectory and the positions of map points are described in this coordinate system. Its unit is m.

$O_i\text{-}xy$ is the image coordinate system. Its origin is the intersection of the camera's optical axis with the image plane. Its unit is mm.

$O_c X_c Y_c Z_c$ is the camera coordinate system. It takes the camera's optical center as the origin; the directions parallel to the $x$ and $y$ axes of the image coordinate system are taken as the $X_c$ and $Y_c$ axes, and the direction perpendicular to the image plane is the $Z_c$ axis. Its unit is m.

$O\text{-}uv$ is the pixel coordinate system. Its origin is generally the upper-left corner of the image, with the $u$ axis pointing right, parallel to the $x$ axis, and the $v$ axis pointing down, parallel to the $y$ axis. Its unit is the pixel.

Camera projection model

The camera maps points of the three-dimensional world onto the two-dimensional image plane; this process is generally described by the pinhole model. Under the pinhole model, assume there is a space point $P$ with coordinates $[X, Y, Z]^T$. After projection through the pinhole $O$, the point $P$ falls on the imaging plane $o\text{-}xy$ at the imaging point $p$, with coordinates $[x, y]^T$. Let the distance from the imaging plane to the pinhole be the focal length $f$. According to the principle of similar triangles:
$$\frac{Z}{f} = \frac{X}{x} = \frac{Y}{y}$$

So we can get:
$$x = f\frac{X}{Z}, \qquad y = f\frac{Y}{Z}$$

The pixel coordinate system differs from the imaging plane by a scaling and a translation of the origin. Suppose the pixel coordinates are scaled by $\alpha$ on the $u$ axis and by $\beta$ on the $v$ axis, and the origin is translated by $[c_x, c_y]^T$; then:
$$u = \alpha x + c_x, \qquad v = \beta y + c_y$$

Substituting equation (3) into equation (2) and letting $f_x = \alpha f$, $f_y = \beta f$ gives:
$$u = f_x\frac{X}{Z} + c_x, \qquad v = f_y\frac{Y}{Z} + c_y$$

The unit of $f$ is m and the units of $\alpha$ and $\beta$ are pixel/m, so the units of $f_x$ and $f_y$ are pixels. Written in matrix form:
$$\begin{pmatrix}u\\v\\1\end{pmatrix} = \frac{1}{Z}\begin{pmatrix}f_x & 0 & c_x\\ 0 & f_y & c_y\\ 0 & 0 & 1\end{pmatrix}\begin{pmatrix}X\\Y\\Z\end{pmatrix} \triangleq \frac{1}{Z}KP$$

Among them, the matrix K is called the internal parameter matrix of the camera, and P is the coordinate representation of the space point in the camera coordinate system.

Let the coordinate $P$ of the space point in the camera coordinate system correspond to the coordinate $P_w$ in the world coordinate system; using the coordinate transformation between the two frames we obtain:
$$ZP_{uv} = Z\begin{bmatrix}u\\v\\1\end{bmatrix} = K(RP_w + t) = KTP_w$$

Among them, T represents the pose of the camera relative to the world coordinate system, and can also be called the external parameter of the camera. In summary, the pinhole camera model uses the triangle similarity relationship to obtain the relationship between space points and pixels, which is a relatively ideal model. In practice, there will be errors in the manufacture and installation of optical lenses, which will affect the propagation of light during the imaging process and cause distortion in the images collected by the camera. Here we mainly consider radial distortion and tangential distortion.

Radial distortion is caused by the shape of the lens, and the distortion increases as the distance between the pixel and the center of the image increases. Therefore, a polynomial function can be used to describe the change before and after the distortion, that is, quadratic and higher-order polynomial functions of the distance between the pixel and the image center can be used for correction. The coordinate change before and after radial distortion correction is:
$$\begin{cases}x_{corrected} = x\left(1 + k_1r^2 + k_2r^4 + k_3r^6\right)\\ y_{corrected} = y\left(1 + k_1r^2 + k_2r^4 + k_3r^6\right)\end{cases}$$

Among them, $[x, y]^T$ are the coordinates of the uncorrected point, and $[x_{corrected}, y_{corrected}]^T$ are the coordinates of the point after the distortion is corrected. $r$ is the distance from the point $(x, y)$ to the origin. $k_1$, $k_2$, and $k_3$ are the three radial distortion parameters, which are usually obtained through the calibration step.

For tangential distortion, the cause is that the lens and the imaging plane cannot be kept strictly parallel during camera assembly. Tangential distortion can be corrected using two additional parameters, $p_1$ and $p_2$:
$$\begin{cases}x_{corrected} = x + 2p_1xy + p_2\left(r^2 + 2x^2\right)\\ y_{corrected} = y + 2p_2xy + p_1\left(r^2 + 2y^2\right)\end{cases}$$

Considering the two types of distortion together, the correct position of a pixel in the pixel coordinate system can be found through five distortion coefficients:
$$\begin{cases}x_{corrected} = x\left(1 + k_1r^2 + k_2r^4 + k_3r^6\right) + 2p_1xy + p_2\left(r^2 + 2x^2\right)\\ y_{corrected} = y\left(1 + k_1r^2 + k_2r^4 + k_3r^6\right) + 2p_2xy + p_1\left(r^2 + 2y^2\right)\end{cases}$$

In summary, the parameters describing the camera model mainly include the intrinsic parameters $f_x$, $f_y$, $c_x$, $c_y$ in the camera's internal parameter matrix, and the distortion correction parameters $k_1$, $k_2$, $k_3$, $p_1$, $p_2$.
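To make the model concrete, the following minimal Python sketch (NumPy only; all numerical values are illustrative) projects a point given in the camera coordinate system to pixel coordinates. Distortion is applied to the normalized image coordinates before the intrinsics, which is one common convention; the coefficients are assumed to be expressed for that convention.

```python
import numpy as np

def project_point(P_c, fx, fy, cx, cy, dist):
    """Project a 3D point P_c = [X, Y, Z] (camera frame) to pixel coordinates,
    applying radial (k1, k2, k3) and tangential (p1, p2) distortion to the
    normalized image coordinates before the intrinsic parameters."""
    k1, k2, k3, p1, p2 = dist
    X, Y, Z = P_c
    x, y = X / Z, Y / Z                       # normalized pinhole projection
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_d = y * radial + 2 * p2 * x * y + p1 * (r2 + 2 * y * y)
    u = fx * x_d + cx                         # apply intrinsics: u = fx*x + cx
    v = fy * y_d + cy
    return np.array([u, v])

# Illustrative intrinsics/distortion and a point one metre in front of the camera.
print(project_point([0.1, -0.05, 1.0], 458.0, 457.0, 320.0, 240.0,
                    (-0.28, 0.07, 0.0, 1e-4, -2e-5)))
```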

Stereo camera ranging principle

A binocular camera generally consists of two pinhole cameras placed horizontally, with both cameras observing the same object. The aperture centers of the two cameras lie on one axis, and the distance between them is called the baseline $b$ of the binocular camera. Consider a space point $P$, which is imaged in both the left-eye camera and the right-eye camera, denoted $P_L$ and $P_R$. Because of the camera baseline, these two imaging positions differ. Denoting the horizontal image coordinates of the left and right projections as $u_L$ and $u_R$, the similarity of triangles gives:
$$\frac{z - f}{z} = \frac{b - u_L + u_R}{b}$$

We can get:
$$z = \frac{fb}{d}, \qquad d = u_L - u_R$$
where $d$ is called the disparity.

The above model is an ideal model, which aims to explain the principle of measuring the actual three-dimensional point depth of the binocular camera. In practical applications, due to factors such as manufacturing and installation, it is difficult to achieve that the imaging planes of the binocular cameras are strictly on the same plane and the optical axes are strictly parallel. Therefore, before using a binocular camera for measurement, it should be calibrated to obtain the left and right camera internal parameters and the relative position relationship between the left and right cameras.
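As a small illustration of the relation $z = fb/d$ (the function name and values below are hypothetical), a disparity map can be converted to depth as follows:

```python
import numpy as np

def disparity_to_depth(disparity, f_px, baseline_m, min_disp=1e-6):
    """Convert a disparity map d = u_L - u_R (pixels) to depth via z = f*b/d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)    # zero disparity -> point at infinity
    valid = disparity > min_disp
    depth[valid] = f_px * baseline_m / disparity[valid]
    return depth

# Example: a 20-pixel disparity with f = 700 px and a 12 cm baseline gives ~4.2 m.
print(disparity_to_depth([[20.0, 0.0]], 700.0, 0.12))
```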

POSE ESTIMATION ALGORITHM

At present, methods for fusing a monocular vision sensor and an IMU can be divided into two types: loose coupling and tight coupling [1]. In loose coupling, the vision sensor and the IMU are treated as two separate modules, each of which computes its own pose estimate, and the estimates are then fused, for example with an EKF [2]. Tight coupling refers to jointly optimizing the visual and IMU data in a single nonlinear optimization to obtain the pose estimate. Because tight coupling makes full use of each sensor's data, this paper adopts tight coupling to fuse vision and IMU data. First, the purely visual feature-point method is used to estimate the camera pose. Then, during camera motion, if the number of extracted feature points falls below a threshold, the camera motion cannot be estimated, or the estimated rotation and translation exceed a threshold, the camera pose is estimated by fusing the IMU; otherwise, feature points alone are still used to estimate the camera pose.

Pose estimation using pure visual information

The ORB (Oriented FAST and Rotated BRIEF) algorithm was proposed by Ethan Rublee et al. in 2011 [3]. An ORB feature consists of a FAST keypoint and a BRIEF descriptor: it adds orientation and scale invariance to the FAST corner, and describes the feature with a binary BRIEF descriptor. Feature matching is performed by comparing the descriptors of the feature points. With known pixel matches between the left and right camera images, a binocular camera can directly obtain the corresponding 3D position of a pixel. Therefore, a stereo-camera-based SLAM system can use known 3D points and their matched projections in the current frame to obtain the current camera pose, without needing to solve for camera motion using epipolar geometry [4].
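As an illustration of this kind of front end (not the paper's exact implementation), OpenCV's ORB detector and a brute-force Hamming matcher can produce the descriptor matches described above; the image file names below are placeholders:

```python
import cv2

# Load two grayscale frames (file names are placeholders).
img1 = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# ORB = oriented FAST keypoints + rotated BRIEF binary descriptors.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are compared with the Hamming distance;
# cross-checking keeps only mutually best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} ORB matches found")
```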

This paper first uses the EPnP method [5] to solve the camera pose. EPnP makes effective use of the matching-point information and allows the camera pose to be optimized iteratively. Given the coordinates $\{P_i^w, i = 1, 2, \dots, n\}$ of $n$ space points in the world coordinate system and their corresponding coordinates $\{P_i^c, i = 1, 2, \dots, n\}$ observed in the image, EPnP solves for the rotation matrix $R$ and translation vector $t$ of the camera motion. Four non-coplanar virtual control points are chosen in the world coordinate system, with homogeneous coordinates $\{C_i^w \mid i = 1, 2, 3, 4\}$. The relationship between the world coordinates of the space points and the control points is:
$$P_i^w = \sum_{j=1}^{4}\alpha_{ij}C_j^w, \quad \text{with} \quad \sum_{j=1}^{4}\alpha_{ij} = 1$$

Once the virtual control points are determined, and given that the four control points are not coplanar, the weights $\{\alpha_{ij}, j = 1, \dots, 4\}$ are uniquely determined. In the camera coordinate system, the same weighted-sum relationship holds:
$$P_i^c = \sum_{j=1}^{4}\alpha_{ij}C_j^c$$

Substituting equation (13) into the camera model gives:
$$\forall i,\quad s_i\begin{bmatrix}u_i\\v_i\\1\end{bmatrix} = KP_i^c = K\sum_{j=1}^{4}\alpha_{ij}C_j^c = \begin{bmatrix}f_x & 0 & c_x\\0 & f_y & c_y\\0 & 0 & 1\end{bmatrix}\sum_{j=1}^{4}\alpha_{ij}\begin{bmatrix}x_j^c\\y_j^c\\z_j^c\end{bmatrix}$$

The image coordinates $u_i$, $v_i$ in equation (13) are known, so:
$$s_i = \sum_{j=1}^{4}\alpha_{ij}z_j^c$$

From equations (13) and (14):
$$\begin{cases}\displaystyle\sum_{j=1}^{4}\alpha_{ij}f_x x_j^c + \alpha_{ij}\left(c_x - u_i\right)z_j^c = 0\\ \displaystyle\sum_{j=1}^{4}\alpha_{ij}f_y y_j^c + \alpha_{ij}\left(c_y - v_i\right)z_j^c = 0\end{cases}$$

To obtain the coordinates of the space points in the camera coordinate system, it is assumed that the $\alpha_{ij}$ in the camera coordinate system are the same as the $\alpha_{ij}$ in the world coordinate system; the problem then reduces to finding the coordinates of the four control points in the camera frame. Stacking the two constraints above for all $n$ points yields the linear system:
$$MX = 0$$

Among them, $M$ is a $2n\times 12$ matrix, and $X = \left[C_1^{cT}, C_2^{cT}, C_3^{cT}, C_4^{cT}\right]^T$ is the vector of 12 unknowns to be solved. The solution lies in the null space of $M$ and can be written as:
$$X = \sum_{i=1}^{N}\beta_i v_i$$

Here $v_i$ are the right singular vectors of $M$ whose corresponding singular values are zero; they can be obtained as the eigenvectors of $M^TM$ with zero eigenvalues. $N$ is the dimension of the null space of $M^TM$, and $\beta_i$ are the coefficients to be determined.

Depending on the positions of the reference points, the null-space dimension $N$ of the matrix $M^TM$ may take the values 1, 2, 3, or 4. Since the distances between the control points are the same in the world coordinate system and in the camera coordinate system, six constraints can be obtained, from which the undetermined coefficients can be solved.

When $N = 1$, according to the constraint:
$$\left\|\beta v^{[i]} - \beta v^{[j]}\right\|^2 = \left\|C_i^w - C_j^w\right\|^2$$

and so:
$$\beta = \frac{\sum_{[i,j]\in[1,4]}\left\|v^{[i]} - v^{[j]}\right\|\cdot\left\|C_i^w - C_j^w\right\|}{\sum_{[i,j]\in[1,4]}\left\|v^{[i]} - v^{[j]}\right\|^2}$$

When $N = 2$:
$$\left\|\beta_1 v_1^{[i]} + \beta_2 v_2^{[i]} - \left(\beta_1 v_1^{[j]} + \beta_2 v_2^{[j]}\right)\right\|^2 = \left\|C_i^w - C_j^w\right\|^2$$

Since $\beta_1$ and $\beta_2$ appear in the equation only in quadratic terms, let $\beta = \left[\beta_1^2, \beta_1\beta_2, \beta_2^2\right]^T$ and let the vector $\rho$ collect all $\left\|C_i^w - C_j^w\right\|^2$, which gives the equation:
$$L\beta = \rho$$

Where $L$ is a $6\times 3$ matrix composed of $v_1$ and $v_2$.

When N = 3, L is a 6×6 matrix.

In summary, the coordinates of the reference points in the camera coordinate system can be obtained as the initial value of the optimization; the optimization variable is $\beta = [\beta_1, \beta_2, \dots, \beta_N]^T$, and the objective function is:
$$\mathrm{Error}(\beta) = \sum_{(i,j)\ \mathrm{s.t.}\ i<j}\left(\left\|C_i^c - C_j^c\right\|^2 - \left\|C_i^w - C_j^w\right\|^2\right)$$

The $\beta$ that yields the smallest error is selected, the vector $X$ is recovered, and the coordinates of the control points in the camera coordinate system are restored. The coordinates of the reference points in the camera coordinate system are then obtained from the barycentric coefficients. Finally, from the coordinates of this set of points in the two coordinate systems, the pose transformation between the two coordinate systems is obtained. The solution steps are as follows:

Compute the centroids:
$$p_c^c = \frac{\sum_i p_i^c}{n}, \qquad p_c^w = \frac{\sum_i p_i^w}{n}$$

Subtract the centroids:
$$q_i^c = p_i^c - p_c^c, \qquad q_i^w = p_i^w - p_c^w$$

Calculate the $H$ matrix:
$$H = \sum_{i=1}^{n} q_i^c \left(q_i^w\right)^T$$

Perform SVD decomposition of the $H$ matrix:
$$H = U\Sigma V^T$$

Calculate the rotation $R$:
$$R = UV^T$$

If $\det(R) < 0$, then $R(2,:) = -R(2,:)$.

Calculate the displacement $t$:
$$t = p_c^c - Rp_c^w$$
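The steps above can be written as the following NumPy sketch (variable names are illustrative; the reflection case is handled with the common $U\,\mathrm{diag}(1,1,-1)\,V^T$ correction):

```python
import numpy as np

def align_point_sets(P_c, P_w):
    """Estimate R, t such that P_c ≈ R @ P_w + t, given (n, 3) arrays of the
    same reference points expressed in the camera and world frames."""
    c_c, c_w = P_c.mean(axis=0), P_w.mean(axis=0)   # step 1: centroids
    q_c, q_w = P_c - c_c, P_w - c_w                  # step 2: centred points
    H = q_c.T @ q_w                                  # step 3: H = sum q_i^c (q_i^w)^T
    U, _, Vt = np.linalg.svd(H)                      # step 4: SVD of H
    R = U @ Vt                                       # step 5: R = U V^T
    if np.linalg.det(R) < 0:                         # reflection case
        U[:, -1] *= -1
        R = U @ Vt
    t = c_c - R @ c_w                                # t = p_c^c - R p_c^w
    return R, t
```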

Taking the EPnP solution as the initial value, g2o is used to optimize the camera pose nonlinearly. A least-squares problem is constructed to find the best camera pose:
$$\xi^{*} = \arg\min_{\xi}\frac{1}{2}\sum_{i=1}^{n}\left\|u_i - \frac{1}{s_i}K\exp\!\left(\xi^{\wedge}\right)P_i\right\|_2^2$$

Among them, $u_i$ is the pixel coordinate of the projected point, $K$ is the camera intrinsic matrix, $\xi$ is the Lie-algebra representation of the camera pose, and $P_i$ is the space point coordinate.
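As a rough equivalent of this pipeline (not the paper's implementation), OpenCV provides an EPnP solver plus a Levenberg-Marquardt reprojection-error refinement; in the sketch below the data are synthetic and cv2.solvePnPRefineLM stands in for the g2o-based refinement used in the paper:

```python
import cv2
import numpy as np

# Synthetic example: project known 3D points with a known pose, then recover it.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                                   # no distortion, for clarity
rvec_gt = np.array([0.1, -0.2, 0.05])                # ground-truth rotation (Rodrigues vector)
tvec_gt = np.array([0.3, -0.1, 2.0])                 # ground-truth translation

pts_3d = np.random.uniform(-1, 1, (20, 3)) + [0.0, 0.0, 4.0]   # points in front of the camera
pts_2d, _ = cv2.projectPoints(pts_3d, rvec_gt, tvec_gt, K, dist)
pts_2d = pts_2d.reshape(-1, 2)

# 1) EPnP initialization, 2) Levenberg-Marquardt refinement of the reprojection error.
ok, rvec, tvec = cv2.solvePnP(pts_3d, pts_2d, K, dist, flags=cv2.SOLVEPNP_EPNP)
rvec, tvec = cv2.solvePnPRefineLM(pts_3d, pts_2d, K, dist, rvec, tvec)

R, _ = cv2.Rodrigues(rvec)                           # rotation matrix of the recovered pose
print(rvec.ravel(), tvec.ravel())                    # should be close to the ground truth
```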

Camera pose estimation method based on IMU

The measurement frequency of the IMU is usually higher than the frequency at which the camera captures images. For example, the binocular camera used in this paper has a frame rate of up to 60 FPS, while the IMU rate is up to 500 Hz. This difference results in multiple IMU measurements between two image frames. Therefore, to fuse the information of the two sensors, the IMU data must be pre-integrated [6]: only the IMU measurements between two image timestamps are integrated to obtain a relative pose, and the integration result is saved for later joint optimization. The IMU-based camera pose estimation method involves three coordinate systems: the world coordinate system, the IMU coordinate system, and the camera coordinate system. All poses and feature point coordinates are ultimately expressed in the world coordinate system. During the calculation, this paper converts state quantities from the camera coordinate system to the IMU coordinate system, and then to the world coordinate system. In this paper, the letter W denotes the world coordinate system, the letter B denotes the IMU (body) coordinate system, $R_{WB}$ denotes the rotation matrix from the IMU coordinate system to the world coordinate system, and $p_{WB}$ denotes the translation vector from the IMU coordinate system to the world coordinate system.

The measured angular velocity and acceleration of the IMU are:
$${}_{B}\tilde{\omega}_{WB}(t) = {}_{B}\omega_{WB}(t) + b_g(t) + \eta_g(t)$$
$${}_{B}\tilde{a}_{WB}(t) = R_{WB}^{T}(t)\left({}_{W}a(t) - {}_{W}g\right) + b_a(t) + \eta_a(t)$$

Among them, $b_a(t)$ and $b_g(t)$ represent the biases of the accelerometer and gyroscope respectively, $\eta_a(t)$ and $\eta_g(t)$ represent the noise of the accelerometer and gyroscope respectively, and ${}_{W}g$ represents the gravity vector in the world coordinate system.

The time derivatives of the rotation, velocity, and translation are:
$$\dot{R}_{WB} = R_{WB}\,{}_{B}\omega_{WB}^{\wedge}, \qquad {}_{W}\dot{v}_{WB} = {}_{W}a_{WB}, \qquad {}_{W}\dot{p}_{WB} = {}_{W}v_{WB}$$

The rotation, velocity, and translation in the world coordinate system can be obtained by direct integration:
$$R_{WB}(t+\Delta t) = R_{WB}(t)\,\mathrm{Exp}\!\left(\int_{t}^{t+\Delta t}{}_{B}\omega_{WB}(\tau)\,d\tau\right)$$
$${}_{W}v(t+\Delta t) = {}_{W}v(t) + \int_{t}^{t+\Delta t}{}_{W}a(\tau)\,d\tau$$
$${}_{W}p(t+\Delta t) = {}_{W}p(t) + \int_{t}^{t+\Delta t}{}_{W}v(\tau)\,d\tau + \iint_{t}^{t+\Delta t}{}_{W}a(\tau)\,d\tau^{2}$$

Applying equation (32) in discrete time with Euler integration:
$$R_{WB}(t+\Delta t) = R_{WB}(t)\,\mathrm{Exp}\!\left({}_{B}\omega_{WB}(t)\Delta t\right)$$
$${}_{W}v(t+\Delta t) = {}_{W}v(t) + {}_{W}a(t)\Delta t$$
$${}_{W}p(t+\Delta t) = {}_{W}p(t) + {}_{W}v(t)\Delta t + \tfrac{1}{2}\,{}_{W}a(t)\Delta t^{2}$$

The IMU model is obtained from equations (30) and (33):
$$R(t+\Delta t) = R(t)\,\mathrm{Exp}\!\left(\left(\tilde{\omega}(t) - b_g(t) - \eta_{gd}(t)\right)\Delta t\right)$$
$$v(t+\Delta t) = v(t) + g\Delta t + R(t)\left(\tilde{a}(t) - b_a(t) - \eta_{ad}(t)\right)\Delta t$$
$$p(t+\Delta t) = p(t) + v(t)\Delta t + \tfrac{1}{2}g\Delta t^{2} + \tfrac{1}{2}R(t)\left(\tilde{a}(t) - b_a(t) - \eta_{ad}(t)\right)\Delta t^{2}$$
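A minimal sketch of one step of this discrete integration (assuming SciPy's Rotation for the Exp map and a simple gravity constant; noise terms are omitted and all names are illustrative) might look like:

```python
import numpy as np
from scipy.spatial.transform import Rotation

GRAVITY = np.array([0.0, 0.0, -9.81])   # gravity vector in the world frame (assumed)

def integrate_imu_step(R, v, p, acc_meas, gyro_meas, ba, bg, dt):
    """One Euler step of the discrete IMU model (noise terms omitted).
    R, v, p: current world-frame rotation (3x3), velocity and position.
    acc_meas, gyro_meas: raw accelerometer / gyroscope measurements (body frame).
    ba, bg: accelerometer / gyroscope biases."""
    a_world = R @ (acc_meas - ba) + GRAVITY            # R(t)(a~ - b_a) + g
    p_new = p + v * dt + 0.5 * a_world * dt * dt       # p + v*dt + 1/2*g*dt^2 + 1/2*R(a~-b_a)*dt^2
    v_new = v + a_world * dt                           # v + g*dt + R(a~-b_a)*dt
    R_new = R @ Rotation.from_rotvec((gyro_meas - bg) * dt).as_matrix()  # R*Exp((w~-b_g)*dt)
    return R_new, v_new, p_new
```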

Suppose there are two image frames at times $t_i$ and $t_j$, with $t_j > t_i$. The IMU pre-integration observation model is then:
$$\Delta\tilde{R}_{ij} = R_i^{T}R_j\,\mathrm{Exp}\!\left(\delta\phi_{ij}\right)$$
$$\Delta\tilde{v}_{ij} = R_i^{T}\left(v_j - v_i - g\Delta t_{ij}\right) + \delta v_{ij}$$
$$\Delta\tilde{p}_{ij} = R_i^{T}\left(p_j - p_i - v_i\Delta t_{ij} - \tfrac{1}{2}g\Delta t_{ij}^{2}\right) + \delta p_{ij}$$

Among them, $\delta\phi_{ij}$, $\delta v_{ij}$, and $\delta p_{ij}$ are the noise terms of the pre-integrated rotation, velocity, and translation, respectively.

For the pose between two adjacent frames, this paper still uses a nonlinear optimization method to fuse the IMU information and the visual information. The state quantities to be optimized are:
$$\theta = \left\{R_{WB}^{j},\ {}_{W}p_{B}^{j},\ {}_{W}v_{B}^{j},\ b_g^{j},\ b_a^{j}\right\}$$

In equation (36), $R_{WB}^{j}$, ${}_{W}v_{B}^{j}$, and ${}_{W}p_{B}^{j}$ are the rotation, velocity, and translation of the IMU coordinate system relative to the world coordinate system at time $j$, and $b_g^{j}$ and $b_a^{j}$ are the random-walk biases of the gyroscope and accelerometer at time $j$, respectively.

Therefore, the optimal state quantity $\theta^{*}$ is solved by jointly minimizing the visual reprojection error and the IMU measurement error:
$$\theta^{*} = \arg\min_{\theta}\left(\sum_{k}E_{\mathrm{proj}}(k, j) + E_{\mathrm{IMU}}(i, j)\right)$$
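Purely as a structural illustration (not the paper's optimizer), the joint objective can be sketched as a sum of squared visual reprojection residuals and one squared IMU pre-integration residual; the residual callables and scalar weights below are placeholders for the actual error terms and information matrices:

```python
import numpy as np

def joint_cost(theta, proj_residuals, imu_residual, w_proj=1.0, w_imu=1.0):
    """Sum of squared visual reprojection residuals plus one squared IMU
    pre-integration residual; scalar weights stand in for the inverse
    measurement covariances used by a real optimizer."""
    e_proj = sum(w_proj * float(r(theta) @ r(theta)) for r in proj_residuals)
    e_imu = w_imu * float(imu_residual(theta) @ imu_residual(theta))
    return e_proj + e_imu

# Toy usage with placeholder residual functions of a 3-element state.
theta0 = np.zeros(3)
proj = [lambda th: th[:2] - np.array([0.5, -0.2])]     # fake 2D reprojection error
imu = lambda th: th - np.array([0.1, 0.1, 0.1])        # fake IMU residual
print(joint_cost(theta0, proj, imu))
```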

EXPERIMENTAL DESIGN

The host computer of the experimental platform is a laptop running Ubuntu 16.04, with 8 GB of RAM and a Core i5-8250U processor with a base frequency of 1.6 GHz. The robot platform is a Dashgo D1 mobile robot platform that supports the ROS development framework; its overall size is Φ406 mm × 210 mm and the diameter of its driving wheels is 125 mm. The binocular camera used is a MYNT EYE D1000-IR-120/Color.

The experiments in this paper mainly target the positioning accuracy of the robot. The evaluation index is the root mean square error (RMSE) of the robot position:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\|\hat{p}_i - p_i\right\|^{2}}$$

Where $\hat{p}_i$ is the estimated robot position and $p_i$ is the actual robot position.
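For completeness, the RMSE used as the evaluation metric can be computed from an estimated and a ground-truth trajectory as in the sketch below (the trajectories are placeholders):

```python
import numpy as np

def position_rmse(est, gt):
    """Root mean square error between estimated and ground-truth positions,
    given (N, 3) arrays of robot positions in metres."""
    err = np.linalg.norm(np.asarray(est) - np.asarray(gt), axis=1)
    return np.sqrt(np.mean(err ** 2))

# Toy example: a constant 2 cm offset gives an RMSE of 0.02 m.
gt = np.zeros((100, 3))
est = gt + np.array([0.02, 0.0, 0.0])
print(position_rmse(est, gt))
```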

Figure 2. Robot straight driving positioning experiment

In this paper, robot positioning experiments are performed both in a corridor environment with few distinctive features and in a feature-rich indoor environment. In the corridor environment, the mobile robot carrying the experimental equipment travels 10 m at a constant speed along the camera's forward direction, and the positioning accuracy of pure vision and of vision fused with the IMU are recorded separately. In the feature-rich indoor environment, a straight-line experiment is also performed: the mobile robot moves 5 m forward along the camera's forward direction at a constant speed, but at 2.5 times the speed of the previous experiment. Multiple trials are performed and the results recorded.

EXPERIMENTAL RESULTS

Robot operating environment | Pure visual RMSE/m | Visual fusion IMU RMSE/m
Low-texture corridor environment | 0.0746 | 0.02122
Feature-rich environment | 0.1024 | 0.06502

From the experimental results, it can be seen that the stereo-vision positioning error with IMU fusion is smaller than the pure-vision positioning error, which indicates that visual positioning fused with an IMU is more accurate than vision-only positioning in low-texture environments and during fast robot motion.

CONCLUSION

In this paper, robot positioning technology is studied, and a robot positioning method based on binocular vision fused with an IMU is proposed. Compared with the pure-vision robot localization method, the proposed method is more robust in low-texture environments and under fast robot motion. The experimental results show that the visual positioning method fused with an IMU overcomes the defects of pure visual positioning to a certain extent and improves the positioning accuracy of the robot.
