Research on face feature point detection algorithm under computer vision

Introduction

Face recognition technology, as a kind of biometrics, identifies a person by acquiring and analyzing facial features [1]. With in-depth study of face recognition methods, current algorithms have reached a high level: recognition efficiency and accuracy are now very high, and face recognition has become the mainstream biometric technology, widely used in practice. For example, face recognition is applied to daily security issues such as login authentication [2], attendance [3], purchase payment [4], and criminal identification [5] in public security systems. The main components of a face recognition system, which are also its research directions, are divided into the following parts: face detection, face alignment (key point detection), face feature extraction, and face classification and recognition [6]. Among them, face feature point detection, also known as face key point detection or face alignment, aims to process already detected face images in more detail by computer and to locate the specific coordinates of each key part of the face, such as the eyebrows, eyes, mouth, nose, and face contour [7]. Accurate face feature point localization facilitates the extraction of facial features, which is very meaningful for automatic face recognition, expression recognition and analysis, and face modeling [8-10]. At the same time, excellent face alignment algorithms are robust to interference in the image, i.e., they adapt well to problems such as illumination change, background change, face angle change, and occlusion, and can still accurately locate the keypoints when image quality is poor [11-12]. Therefore, face keypoint detection is of great significance for dealing with various problems in face recognition and is a very important part of the face recognition field. Face recognition, as an important research topic in the application of biometrics, involves a wide range of fields, including computer applications, computer vision, perceptual science, image processing, psychology, statistics, pattern recognition, and neurology [13-16].

This paper focuses on the face feature point detection algorithm under computer vision. A cascade regression model is built, a face pose change module is designed based on weak invariance and pose-indexed features, and the model is kept under incremental learning by training the regressors at every cascade level to improve feature extraction accuracy. By comparing the experimental results with those of existing methods and by testing pupil localization accuracy and detection speed, it is verified that the cascade regression algorithm can extract face feature points quickly and accurately.

Research on face feature point detection algorithm based on cascade regression

The accuracy of face feature point detection relies on the continuous training and testing of the cascade regression algorithm. The following sections describe face feature points, the cascade regression algorithm, and its learning and training process.

Face feature points and face shape
Face Feature Point Detection

Face feature points are a set of points with special semantics on a face image, such as points on the cheeks, eyebrows, and lips. For ease of exposition, the set of all face images is denoted as $\mathbb{I} = \{I\}$ in this chapter.

Following the customary terminology in the field of face feature point detection, in this paper, we refer to the vector S formed by combining the coordinates of all the face feature points on I as the shape of the face on the image, or, without ambiguity, we simply refer to it as the shape. For ease of exposition, this section first abstracts the formal definition of face shape from the customary symbolic notation:

Definition 1 (Face Shape) For a given $p$-point feature labeling scheme, let $x_i \in \mathbb{R}^2$ denote the coordinates of the $i$-th feature point on I, and let the vector $S = (x_1, x_2, \ldots, x_p) \in \mathbb{R}^{2p}$ denote the entirety of the p feature points on I.

Face feature point detection is usually learned on a given dataset to obtain the corresponding model, and the average face shape usually has a special meaning in a carefully structured dataset. For the sake of the narrative of this chapter, this section gives definitions of the face feature point dataset and the average face shape:

Definition 2 (Face Feature Point Dataset and Mean Shape) A face feature point dataset consists of a series of images and the face shapes on those images, denoted as $D = \{(I_i, S_i) \mid i = 1, 2, \ldots, N\}$. The mean of the face shapes in D is called the mean shape, denoted as $\bar{S} = \sum_{i=1}^{N} S_i / N$.

In a standard face feature point dataset, the sample size is usually large, and the angles and poses of the faces in the images are widely and uniformly distributed, so $\bar{S}$ carries a statistically meaningful average and can usually be used as a standard, normal face shape. The standard 68 points of the 300-W dataset available on the i-bug website are the average shape on the 300-W dataset. In what follows, $\bar{S}$ is used directly as the standard face shape without further elaboration.
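As a small illustration, the mean shape of Definition 2 reduces to a single averaging operation once the shapes are stacked into a matrix. The following is a minimal sketch, assuming (our assumption, not the paper's) that the N shape vectors are stored row-wise in a NumPy array:

```python
import numpy as np

def mean_shape(shapes: np.ndarray) -> np.ndarray:
    """Mean shape S_bar = (1/N) * sum_i S_i (Definition 2).

    shapes: (N, 2p) array; row i holds S_i = (x_1, y_1, ..., x_p, y_p) of image I_i.
    """
    return shapes.mean(axis=0)
```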

Face shape indexing in pixel coordinates

In algorithms of the cascade regression class, face shape-indexed features are often used to describe how well the current face feature points match the image. Such features are constructed with the help of face shape indexing of pixel coordinates, i.e., for different face shapes, the positions of these pixels should have similar semantics. For example, when the k pixels used have coordinates $p_1, p_2, \ldots, p_k$ in the coordinate system of the average shape $\bar{S}$, and they have coordinates $p'_1, p'_2, \ldots, p'_k$ with respect to any given shape S, the coordinates of $p'_1, p'_2, \ldots, p'_k$ with respect to S should be the same as the coordinates of $p_1, p_2, \ldots, p_k$ with respect to $\bar{S}$.

On different face images, after shape indexing, the positions of the corresponding points relative to the face shapes are geometrically invariant and have some semantic information.

Shape indexing can be realized with the help of various geometric means. For ease of exposition, in this paper the process of mapping a set of reference points $p_1, p_2, \ldots, p_k$ to the corresponding points $p'_1, p'_2, \ldots, p'_k$ by indexing with shape S is abbreviated as
$$p_1, p_2, \ldots, p_k \xrightarrow{S} p'_1, p'_2, \ldots, p'_k \tag{1}$$

In the ERT algorithm, the shape-indexed pixel difference feature is obtained by computing the difference of gray values between shape-indexed pixel points under condition (1).
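To make the construction concrete, the sketch below computes shape-indexed pixel-difference features under simplifying assumptions of ours: a grayscale image stored as a 2D array, a global similarity transform (rotation, scale, translation) standing in for the shape-indexing geometry, and reference points given in the mean-shape coordinate frame. All helper names are illustrative; the ERT paper's exact indexing differs in detail.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares 2D similarity mapping landmark set src to dst, both (p, 2)."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    denom = (src_c ** 2).sum()
    a = (src_c * dst_c).sum() / denom                          # scale * cos(angle)
    b = (src_c[:, 0] * dst_c[:, 1] - src_c[:, 1] * dst_c[:, 0]).sum() / denom
    R = np.array([[a, -b], [b, a]])                            # scale * rotation
    return R, dst.mean(0) - src.mean(0) @ R.T

def pixel_difference_features(img, shape, mean_shape_pts, ref_pts):
    """Map reference points from the mean-shape frame into the image (Eq. (1)),
    then form all pairwise gray-value differences."""
    R, t = similarity_transform(mean_shape_pts, shape)
    pts = ref_pts @ R.T + t                                    # indexed pixels
    h, w = img.shape
    xs = np.clip(pts[:, 0].astype(int), 0, w - 1)
    ys = np.clip(pts[:, 1].astype(int), 0, h - 1)
    vals = img[ys, xs].astype(float)
    return (vals[:, None] - vals[None, :]).ravel()             # k*k features
```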

Description of the cascade regression algorithm
General framework of the algorithm

The regression-based face feature point detection algorithm can be divided into two parts: extracting features from the input image and mapping the extracted features to face feature points with a learned regressor. However, due to the complexity of the feature point distribution, it is very difficult to learn a single strong regressor that maps the features directly to the exact locations, and the results are often unsatisfactory. Therefore, drawing on the idea of ensemble learning, researchers propose to obtain a strong regressor by combining multiple weak regressors to realize face feature point detection. The algorithmic framework can be divided into a training phase and a testing phase.

Training phase

In the training phase of the algorithm, it can be divided into three main modules: generating training samples, training the optimal weak regressors, and updating the training samples. The framework of the algorithm is shown in Figure 1.

Each training sample can be described as $\tau = (I, \theta_c, \hat{\theta})$, where I is a face image, $\theta_c$ is the current feature point distribution of the sample, and $\hat{\theta}$ is the true feature point distribution. Training samples are generated simply by randomly producing multiple different initial feature point distributions $\theta_c$ for each face image I and its corresponding true distribution $\hat{\theta}$, yielding the training sample set T.

After determining the training sample set T, the pose-indexed features need to be extracted and the weak regressor trained. A regressor's training samples can be denoted as $s = (x, \delta\theta)$, where x is the extracted pose-indexed feature vector and $\delta\theta$ is the residual from the current feature point distribution to the true one. Training a regressor is thus essentially learning a mapping f on the training set S such that each sample $s = (x, \delta\theta)$ satisfies $\delta\theta = f(x)$. A weak regressor likewise learns a regression function f from the samples, but unlike a full regressor it is not required to satisfy $\delta\theta = f(x)$ exactly; it only needs to satisfy $D(f(x)) < D(\delta\theta)$, where D measures the magnitude of the residual. In other words, a weak regressor only has to bring the current feature point distribution closer to the true one. The choice of weak regressor is not unique: different regression algorithms, and different parameters within one algorithm, can be used to train different weak regressors. The algorithm needs to find the optimal weak regressor during training. Denoting the feature point distribution after weak regressor R updates sample τ as $\theta_\tau^R$, the optimal weak regressor is defined by equation (2):
$$R^* = \arg\min_R \sum_{\tau \in T} \left\| \hat{\theta}_\tau - \theta_\tau^R \right\|^2 \tag{2}$$

where $\hat{\theta}_\tau$ denotes the true face feature point distribution of sample τ. Eq. (2) selects as the optimal weak regressor the one that minimizes the sum of errors over all training samples after the update. The specific details of pose-indexed feature extraction and weak regressor training are presented later.

After the optimal weak regressor $R^*$ is trained, each training sample $\tau \in T$ is adjusted to $\tau_1 = (I, \theta_\tau^{R^*}, \hat{\theta})$, forming a new training set $T_1$, on which the next optimal weak regressor is trained. This process is repeated until all sample errors fall below a set threshold.
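Putting the three modules together, the training phase can be sketched as below. The helpers `random_init`, `extract_features`, and the candidate-regressor factory are hypothetical stand-ins for components the paper leaves abstract:

```python
import numpy as np

def train_cascade(images, true_shapes, n_inits, make_candidates, err_threshold):
    """Sketch: generate samples, pick the optimal weak regressor (Eq. (2)),
    update the samples, and repeat until all errors fall below the threshold."""
    samples = [(I, random_init(truth), truth)          # several random inits
               for I, truth in zip(images, true_shapes)
               for _ in range(n_inits)]
    cascade = []
    while True:
        feats = [extract_features(I, cur) for I, cur, truth in samples]
        resid = [truth - cur for _, cur, truth in samples]
        # Eq. (2): among differently parameterised weak regressors, keep the
        # one minimising the summed error over all training samples.
        best = min((c.fit(feats, resid) for c in make_candidates()),
                   key=lambda R: sum(np.sum((d - R.predict(x)) ** 2)
                                     for x, d in zip(feats, resid)))
        cascade.append(best)
        samples = [(I, cur + best.predict(x), truth)   # update each sample
                   for (I, cur, truth), x in zip(samples, feats)]
        if max(np.sum((truth - cur) ** 2) for _, cur, truth in samples) < err_threshold:
            return cascade
```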

Testing phase

In the testing phase of the algorithm, the face feature points of the input face image are initialized, the pose-indexed features are extracted according to the current feature point distribution, and the extracted features are fed to the trained weak regressor to obtain a set of feature point distribution residuals, which update the current feature point distribution. This process is repeated for each weak regressor to progressively approximate the true face feature point distribution. The iteration process of the algorithm can be represented by equation (3):
$$S^t = S^{t-1} + R^t\big(h(I, S^{t-1})\big), \quad t = 1, \ldots, T \tag{3}$$

where S denotes the feature point distribution, $S^t$ denotes the feature point distribution after the $t$-th round of regression, R denotes the weak regressor, I the input face image, and h the feature extraction operator. For the cascade regression algorithm to accurately update the face feature point distribution, the feature extraction operator h is required to produce pose-indexed features with weak invariance. The block diagram of the algorithm is shown in Fig. 2.
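A minimal sketch of this test-time iteration, reusing the conventions of the training sketch (`extract_features` plays the role of the operator h):

```python
def detect(img, cascade, init_shape, extract_features):
    """Eq. (3): S^t = S^{t-1} + R^t(h(I, S^{t-1})), t = 1, ..., T."""
    S = init_shape.copy()
    for R in cascade:                  # one update per trained weak regressor
        x = extract_features(img, S)   # pose-indexed features of current shape
        S = S + R.predict(x)           # add the predicted residual
    return S
```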

Figure 1.

Training block diagram of cascade regression algorithm

Figure 2.

Test block diagram of cascade regression algorithm

Pose indexing features and weak invariance

Pose-indexed features are a feature extraction method; compared with common feature extraction algorithms such as HOG and SIFT, their most important characteristic is that they incorporate pose change information into the feature extraction process. For face images, this pose information is implicitly contained in the face feature point distribution. Pose-indexed features are the key to the good performance of the cascade regression framework, and the overall robustness of the algorithm can be substantially improved by designing reasonable pose-indexed features.

Formally, the pose-indexed feature extraction operator for a face image can be defined as $h : \mathbb{I} \times \Theta \to \mathbb{R}^n$, where Θ denotes the set of face feature point distributions, $\mathbb{I}$ the set of input face images, and n the dimension of the output features. From this definition it follows that numerous different pose-indexed features can be extracted for an input face image and a specified set of feature point distributions. Not all pose-indexed features are valid; according to the weak invariance assumption, as long as the designed pose-indexed features satisfy weak invariance to some extent, the cascade regression algorithm can converge to an accurate result after a certain number of iterations.

Weak invariance is a property of pose-indexed features, indicating that consistent features can be extracted from images of the same face under different poses. To facilitate the description of weak invariance, this paper defines the camera mapping $C : \mathbb{R}^3 \times T \to \mathbb{R}^2$, which maps a 3D point $p_r \in \mathbb{R}^3$ in the head coordinate system, under a rotational change $t \in T$, to a pixel $p_i \in \mathbb{R}^2$ on the 2D image plane; the face image captured under pose t is denoted $I_t$. A set of face feature points $P_f \subset \mathbb{R}^3$ under a specific pose t can then be represented after mapping to the 2D image as $\theta_t = C(P_f, t)$. A shape-indexed feature with weak invariance is one that satisfies Eq. (4) for any $t_0, t_1 \in T$:
$$h(\theta_{t_0}, I_{t_0}) = h(\theta_{t_1}, I_{t_1}) \tag{4}$$

That is, satisfying weak invariance means that consistent features can be extracted from face images under different poses for the same set of feature points $P_f$ in 3D space.

Considering that pose-indexed features are computed very frequently during both training and testing of the cascade regression framework, the features should be designed to be as simple as possible so that the algorithm can achieve real-time face feature point detection. Based on pixel difference features, the algorithm first randomly selects P points as reference points on the face image, then estimates rotation and scaling parameters from the current face feature point distribution to transform the reference points, and finally uses the gray-value differences between pairs of transformed reference points to form a feature of dimension P². The key to the weak invariance of the pixel difference feature is transforming the reference points with the estimated rotation and scaling parameters. However, since the pixel difference feature uses global coordinates directly, the reference point deviation becomes large under strong expression or head pose changes. To alleviate this problem, the global coordinates of a reference point are no longer used directly; instead, the reference point is determined by its relative coordinates with respect to its nearest feature point. A further improvement selects the reference point randomly on the line between two feature points. Experimental results show that these improvements all enhance the robustness of the pixel difference feature to head motion to some extent, as sketched below.
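The two improvements can be sketched as follows; the array layouts and helper names are our illustrative assumptions:

```python
import numpy as np

def index_by_nearest_landmark(shape, offsets, nearest_idx):
    """First improvement: a reference point is an offset from its nearest landmark.

    shape: (L, 2) current landmarks; offsets: (k, 2) offsets fixed in the
    mean-shape frame; nearest_idx: (k,) nearest-landmark index per point.
    """
    return shape[nearest_idx] + offsets

def index_on_landmark_segment(shape, pair_idx, alphas):
    """Second improvement: sample each reference point on the line between two
    landmarks, p = (1 - alpha) * x_i + alpha * x_j."""
    i, j = pair_idx[:, 0], pair_idx[:, 1]
    return (1 - alphas)[:, None] * shape[i] + alphas[:, None] * shape[j]
```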

Offline learning process

Data distribution statistics are used to train the regressors at all cascade levels, so that the cascade regression framework can perform incremental learning, gradually correcting the initial shape toward the true shape annotation and improving the accuracy of face feature point localization. Since incremental learning must be built on an existing offline model, this section first introduces the learning process of the offline model.

The N images in the training set and their corresponding true shape annotations are denoted as $I = \{I_i\}_{i=1}^{N}$ and $S^* = \{s_i^*\}_{i=1}^{N}$ respectively, where $s_i^* = [x_1, y_1, \ldots, x_l, y_l, \ldots, x_L, y_L]^T$ denotes the position vector of the L feature points and $[x_l, y_l]$ the coordinates of the $l$-th feature point. All samples are initialized as $S^0 = \{s_i^0\}_{i=1}^{N}$, usually set to the average of all true shape vectors in the training set. In the method of this section, SIFT features are extracted locally at each feature point in order and concatenated to form the shape-indexed feature, denoted $\Phi(I, s_i) \in \mathbb{R}^d$. The goal of cascade regression is to model the mapping between the shape-indexed features $\Phi(I, s_i)$ and the shape displacements $\delta s_i^t = s_i^{t-1} - s_i^*$, so that the initial shapes are progressively corrected toward the true shape annotations. The optimization at cascade level t is:
$$\arg\min_{G^t} \sum_{i=1}^{N} \left\| \delta s_i^t - G^t\big(\Phi(I, s_i^{t-1})\big) \right\|_2^2 \tag{5}$$
where $G(\cdot)$ denotes the extreme learning machine (ELM) that regresses the shape. Let $x^t = \Phi(I, s^{t-1})$; the regression function can then be expressed as $G^t(x) = \sum_{k=1}^{K} \beta_k G(a_k, b_k, x^t)$, where K is the number of hidden-layer nodes of the ELM. Representing the current shape displacements of all samples as the matrix $\Delta S^t = [\delta s_1^t, \ldots, \delta s_N^t]^T$, the optimization problem above is rewritten as:
$$\tilde{\beta}^t = \arg\min_{\beta^t} \left\| \Delta S^t - H^t \beta^t \right\|_2^2 \tag{6}$$

where $H^t = H(a_1^t, \ldots, a_K^t, b_1^t, \ldots, b_K^t, x_1, \ldots, x_N)$ is the hidden-layer output matrix, and the Sigmoid function is chosen as the neuron activation: $G(a, b, x) = \frac{1}{1 + \exp(-(a \cdot x + b))}$. Solving Equation (6) yields the ELM network parameters of the $t$-th cascade level, $G^t = [a^t, b^t, \tilde{\beta}^t]$, which are saved as a regressor. The shape displacement predicted by the trained regressor is used to update the current shape via equation (7):
$$s_i^t = s_i^{t-1} - G^t\big(\Phi(I_i, s_i^{t-1})\big) \tag{7}$$
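A minimal sketch of one cascade level of this ELM regression (Eqs. (5)-(7)): the random hidden-layer weights are fixed, only β is solved in closed form, and the SIFT-based feature extraction Φ is abstracted as a given matrix, which is our simplification:

```python
import numpy as np

def train_elm_level(X, dS, K, rng=np.random.default_rng(0)):
    """X: (N, d) shape-indexed features; dS: (N, 2L) displacements s^{t-1} - s*."""
    a = rng.standard_normal((K, X.shape[1]))       # hidden-layer input weights
    b = rng.standard_normal(K)                     # hidden-layer biases
    H = 1.0 / (1.0 + np.exp(-(X @ a.T + b)))       # Sigmoid activations, (N, K)
    beta, *_ = np.linalg.lstsq(H, dS, rcond=None)  # Eq. (6): min ||dS - H beta||^2
    return a, b, beta

def apply_elm_level(X, a, b, beta, shapes):
    """Eq. (7): s^t = s^{t-1} - G^t(Phi(I, s^{t-1}))."""
    H = 1.0 / (1.0 + np.exp(-(X @ a.T + b)))
    return shapes - H @ beta
```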

The learning process of traditional cascade regression methods requires iteratively training each level's regressor in sequence until the error between the current shape and the true shape no longer decreases. However, such a sequential approach makes model learning slow and does not allow incremental updates of each regressor. As shown in Fig. 3, the shape displacement space $\delta s^t$ at level t depends entirely on the computed output of the level t-1 regressor. If a new sample $\{\delta s_{new}, x_{new}\}$ arrives for incremental learning, the first-level regressor can easily be updated using the ELM's incremental learning approach to produce a new regressor $G_{new}^1$. But all samples, old and new, $\{\{s_i^1\}_{i=1}^{N}, s_{new}^1\}$, must then be recomputed with the updated regressor $G_{new}^1$, so the second-level regressor must be retrained on the outputs of $G_{new}^1$, and likewise for all subsequent levels. The incremental update procedure under this sequential training scheme is therefore very time-consuming and computationally expensive. To solve this problem, the approach in this section samples a shape displacement space from a known mixed Gaussian distribution, replacing the shape displacement space obtained from the previous cascade level's computation. As shown in Fig. 4, this strategy allows all regressors in the cascade framework to be incrementally updated in parallel when needed, improving learning efficiency. The statistical parameters of the mixed Gaussian distribution of shape displacements in each cascade layer, namely the mean and variance of the shape displacement space, are collected during the sequential training described above.

Figure 3.

The training manner of conventional CR

Figure 4.

Parallel update method based on mixed Gaussian distribution sampling

Specifically, the input space of level t contains N shape displacement vectors $\{\delta s_i\}_{i=1}^{N}$. Since the shapes of faces captured in natural environments are usually oriented in unconstrained directions, it is assumed that J Gaussian distributions describe the input space. The probability distribution of a sample's shape displacement is assumed to be:
$$p(\delta s_i^t \mid \Theta^t) = \sum_{j=1}^{J} \pi_j N(\delta s_i^t \mid \mu_j, \Sigma_j) \tag{8}$$
where $\pi_j, j = 1, \ldots, J$ denotes the probability that the shape displacement vector $\delta s_i^t$ belongs to the $j$-th sub-distribution, and $N(\delta s_i \mid \mu_j, \Sigma_j)$ denotes the $j$-th Gaussian distribution function with mean $\mu_j$ and variance $\Sigma_j$. The set of Gaussian mixture parameters at level t is thus denoted $\Theta^t = \{\pi_1, \pi_2, \ldots, \pi_J, \mu_1, \mu_2, \ldots, \mu_J, \Sigma_1, \Sigma_2, \ldots, \Sigma_J\}$, i.e., the mean, variance, and occurrence probability of each sub-distribution in the mixture model. The model parameters in Eq. (8) are typically learned with the iterative EM algorithm, which alternately solves for the expectation and the parameter maximizers until $\log p(\Delta S^t \mid \Theta^t) = \log \prod_{i=1}^{N} p(\delta s_i^t \mid \Theta^t)$ converges to a local maximum, where $\Delta S^t = \{\delta s_1^t, \delta s_2^t, \ldots, \delta s_N^t\}$. The solution process is briefly described below (the superscript t denoting the layer is omitted for brevity).

E-step: Fix the parameters and solve for the posterior probability, i.e., the normalized probability that the $i$-th shape displacement vector $\delta s_i$ comes from the $j$-th sub-distribution; at iteration m it is given by Eq. (9):
$$z_{ji} = \frac{\pi_j^m \, p(\delta s_i \mid \mu_j^m, \Sigma_j^m)}{\sum_{l=1}^{J} \pi_l^m \, p(\delta s_i \mid \mu_l^m, \Sigma_l^m)} \tag{9}$$

M-step: The model parameters $\Theta^{m+1} = \{\pi_1^{m+1}, \ldots, \pi_J^{m+1}, \mu_1^{m+1}, \ldots, \mu_J^{m+1}, \Sigma_1^{m+1}, \ldots, \Sigma_J^{m+1}\}$ are estimated from the estimates $z_{ji}$ under the constraint $\sum_{j=1}^{J} \pi_j = 1$, where $\mu \in \mathbb{R}^{2L}$ and $\Sigma \in \mathbb{R}^{2L \times 2L}$. These distribution parameters are updated through equation (10):
$$\pi_j^{m+1} = \frac{1}{N} \sum_{i=1}^{N} z_{ji}, \qquad \mu_j^{m+1} = \frac{\sum_{i=1}^{N} z_{ji} \, \delta s_i}{\sum_{i=1}^{N} z_{ji}}, \qquad \Sigma_j^{m+1} = \frac{\sum_{i=1}^{N} z_{ji} \, (\delta s_i - \mu_j^{m+1})(\delta s_i - \mu_j^{m+1})^T}{\sum_{i=1}^{N} z_{ji}} \tag{10}$$
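The two steps can be sketched compactly; to keep the sketch short we assume diagonal covariances, which the text does not require:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(dS, pi, mu, sigma):
    """One EM iteration (Eqs. (9)-(10)) for the mixture of shape displacements.

    dS: (N, 2L) displacement vectors; pi: (J,) weights; mu: (J, 2L) means;
    sigma: (J, 2L) diagonal variances.
    """
    J = len(pi)
    # E-step (Eq. (9)): posterior z_ji that sample i belongs to component j.
    dens = np.stack([multivariate_normal.pdf(dS, mu[j], np.diag(sigma[j]))
                     for j in range(J)])                  # (J, N)
    z = pi[:, None] * dens
    z /= z.sum(axis=0, keepdims=True)
    # M-step (Eq. (10)): re-estimate weights, means, and variances.
    Nj = z.sum(axis=1)                                    # effective counts
    pi = Nj / dS.shape[0]
    mu = (z @ dS) / Nj[:, None]
    sigma = np.stack([(z[j, :, None] * (dS - mu[j]) ** 2).sum(0) / Nj[j]
                      for j in range(J)])
    return pi, mu, sigma
```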

Solving for the mixed Gaussian distribution with the EM algorithm better characterizes the distribution of shape displacement vectors at level t. On the HELEN dataset, the trained shape displacement vectors and the computed mixed Gaussian distribution (J = 4) quantify the current distribution of shape displacement vectors more accurately than a single Gaussian fitted in the face shape space. The distribution statistics $\{\Theta^1, \Theta^2, \ldots, \Theta^T\}$ of the T cascade layers are collected during offline training.

With the distribution statistics of the data in all T cascade layers known, all the level regressors can be updated in parallel based on these statistics. As shown in Fig. 4, the input shape displacement space of every cascade layer is generated by sampling from its known data distribution, and all the extreme learning machines run in parallel, which not only improves learning efficiency but also facilitates incremental learning of the whole cascade regression framework.
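A minimal sketch of this parallel update strategy, reusing `train_elm_level` from the earlier sketch and assuming, purely for illustration, that each level's feature matrix is supplied alongside its sampled displacements:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def sample_displacements(theta, n, rng):
    """Draw n displacements from one layer's stored mixture statistics Theta^t
    (weights, means, diagonal variances, matching the EM sketch above)."""
    pi, mu, sigma = theta
    comp = rng.choice(len(pi), size=n, p=pi)       # pick a sub-distribution
    return mu[comp] + rng.standard_normal((n, mu.shape[1])) * np.sqrt(sigma[comp])

def update_all_levels(thetas, feats_per_level, n, K, seed=0):
    """Retrain every cascade level independently: its input displacement space
    is sampled from Theta^t instead of taken from the output of level t-1."""
    rng = np.random.default_rng(seed)
    jobs = [(X, sample_displacements(th, n, rng))
            for X, th in zip(feats_per_level, thetas)]
    with ThreadPoolExecutor() as pool:             # all levels update in parallel
        return list(pool.map(lambda j: train_elm_level(j[0], j[1], K), jobs))
```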

Application of cascade regression algorithm in face feature point detection

Through the testing and training of the cascade regression algorithm, an algorithm model capable of continuous learning and updating is obtained. In the following sections it is compared with existing detection methods and applied to pupil localization accuracy experiments, verifying the practical prospects of the cascade regression algorithm.

Comparison of experimental results between the algorithm in this paper and existing methods

In order to further validate the effectiveness of this paper’s algorithm, the experimental results of this paper’s algorithm are compared with those of existing methods on the LFPW and LFW face datasets.

Figure 5 shows the average relative localization error of this paper's algorithm for each face feature point on the LFPW dataset, together with the statistical results of the sample consistency method on the same dataset and the results of manual annotation. It can be seen that this paper's algorithm exceeds the accuracy of manual labeling on all feature points, which also means that it is more stable than manual labeling. Compared with the sample consistency method, this paper's algorithm is very close in localization accuracy on stable feature points, and achieves higher localization accuracy on unstable feature points (i.e., points such as the center of the eyebrow, where the texture features are relatively indistinct), with an accuracy improvement of more than 10%.

Figure 5.

Comparison of average relative positioning errors of face feature points

Figure 6 shows the cumulative error distribution curves of this paper's algorithm and the sample consistency method. To further compare the localization accuracy of the detectors themselves, the figure also shows the cumulative error distribution curves of the support vector machine used in this paper's algorithm and of the one used by the sample consistency method, where the support vector machine localization result of this paper's algorithm is given by the weighted Mean-Shift algorithm. The red curve represents this paper's algorithm with face shape constraints, the green curve the sample consistency method with face shape constraints, the blue curve the cumulative error distribution given by the support vector machine of the sample consistency method, and the purple curve the cumulative error distribution given by the support vector machine of this paper's algorithm.

Figure 6.

Cumulative relative error distribution curve and comparison

It can be seen that, because the training set used here to train the support vector machine with probabilistic output has fewer samples and the parameters were not carefully tuned, its cumulative error distribution is worse than that of the sample consistency method's support vector machine: for almost all values of the relative localization error, its detection rate is lower. After adding the face shape constraint, however, the cumulative error distribution curve of this paper's algorithm is better than that of the sample consistency method, meaning that for any given value of relative localization error the detection rate of this paper's algorithm is higher. This shows that introducing face shape constraints on top of a weak localization detector can greatly improve face feature point localization accuracy, and illustrates the importance of face shape constraints, especially higher-order constraints, for face feature point localization.

Fig. 7 shows the detection rate of this paper's algorithm for each feature point on the LFW face dataset, compared with the results of the sample consistency method, the conditional random forest method, and other methods. It can be seen that this paper's algorithm improves the detection rate considerably over the other methods, with all detection rates above 90%.

Figure 7.

Detection rate of each feature point on LFW face data set

Experiments and analysis of pupil localization accuracy

The analysis and comparison above verify that this paper's algorithm achieves a high detection rate in face feature point detection. A pupil localization accuracy test is now carried out with the algorithm to further analyze its detection error and the corresponding accuracy at small pixel scales.

The error analysis is computed in Matlab by plotting the squared distance between the manually calibrated points and those computed by the program: the vertical coordinate is the squared distance (in squared pixels) between the manual calibration point and the algorithm's result, and the horizontal coordinate is the pixel point index. The pupil localization accuracy graph is shown in Figure 8.

Figure 8.

Pupil precision location map

By analyzing Figure 8, the following conclusions can be drawn:

1. Overall, the accuracy of pupil center detection for both the left and right eyes is within a distance of 3 pixels, and left eye corner detection is likewise within 3 pixels. Right eye corner detection accuracy is within 5-6 pixels, with most results within 5 pixels.

2. The lower detection accuracy of the right eye corner stems from the fact that the eye corners of the test samples are not clear; they remain difficult to confirm even by direct inspection.

3. Analysis of the test samples shows that the pupil localization error arises because illumination produces regions of white pixels inside the pupil area; as a result, the computed center of the dark pupil region is shifted off-center.

4. The test accuracy at large and small pixel scales is similar, mainly because the same detection algorithm is used for both.

This test clearly shows that, although this paper's algorithm is affected to a certain extent by factors such as the clarity of the face feature points and the brightness of the illumination, it still maintains high detection accuracy when detecting small-pixel face feature points such as the pupil and the eye corner.

Comprehensive assessment of cascade regression algorithms

Combined with the previous tests and evaluation, it can be seen that the cascade regression algorithm has high accuracy in face feature point detection and can be well applied to face detection. In the following, the cascade regression algorithm is comprehensively evaluated together with the detection speed index to clarify its application prospects in face detection.

There are many evaluation indexes for face detection algorithms, among which detection speed is one that measures whether an algorithm can process face feature points quickly in real scenarios. We test the forward propagation speed and single detection speed of this paper's algorithm on the AFW, PASCAL Faces, FDDB, and other datasets, obtaining the results in Table 1.

Table 1. Face detection module evaluation

Test set Forward propagation time/ms Detection time/ms
AFW 7.39 10.76
PASCAL Faces 7.25 12.28
FDDB 7.48 11.68
CASIA-WebFace 7.27 10.80
WIDER Face 7.42 11.59
MALF 7.36 12.15

The face detection module of this paper's algorithm has about 5M parameters. Combined with Table 1, it can be seen that a single forward propagation on one GTX 1080Ti GPU takes less than 7.5 ms, and a single detection takes about 11.5 ms on average, i.e., a detection speed of about 87 FPS (1000 / 11.5 ≈ 87). The previous sections verified that this paper's algorithm has high detection accuracy, and it simultaneously maintains the advantage of fast detection speed. This means the algorithm can detect different face feature points both quickly and accurately, and therefore has broad practical application prospects in real life.

Conclusion

This paper focuses on the advantages of the cascade regression algorithm in face feature point detection. The overall framework of the cascade regression algorithm is built through training and testing, combining the weak invariance of pose-indexed features with the training of the regressors at all cascade levels to improve the accuracy of face feature point detection. Through relevant experiments and comparative analysis of the results, it is concluded that the detection rate of the cascade regression algorithm exceeds 90%, the pupil detection accuracy is about 3 pixels, and the detection speed is about 87 FPS, giving it the advantages of fast detection speed and high detection accuracy.

This paper studies the face feature point detection algorithm from the perspective of computer vision, providing scientific and effective support for the application of the cascade regression algorithm in real life. The research results show that the cascade regression algorithm can quickly and accurately recognize different face feature points and has good application prospects in the field of face recognition.