
Gender determination from periocular images using deep learning based EfficientNet architecture



Introduction

Human gender detection has become popular in recent years because of its wide range of applications. Previously, gender detection was accomplished using all facial features: eyes, eyebrows, nose, lips, etc. The eye and its surrounding region, including the sclera, brows, lashes, eyelids and skin, is referred to as the periocular region, as shown in Figure 1. Periocular biometrics can switch between using the iris and the region around the eyes. The periocular area is unique to each individual, and periocular biometrics can be successfully used for recognition purposes [1]. Iris recognition cannot be used with visible-light cameras, and the face masks widely worn today are a disadvantage for facial recognition; periocular biometrics can be used in such conditions [2]. In addition, the periocular region is not affected by variations in facial expression, plastic surgery or aging.

Fig. 1

Periocular region.

When the image is taken from a distance, the determination can be based on the periocular region; when the image is taken very close up, the facial information captured is weak, and making use of the iris gives better results. Iris information can be used to estimate gender, age, emotions, ethnicity and so on [3, 4]. Gender determination is important in social situations: it may help identify the gender of a person, without registering their identity, and hence grant them entry to certain social settings. The recognition of biometric attributes from the iris pattern, such as age, ethnicity, or gender, has been analysed by many researchers [5,6,7,8]. Better predictions are, however, obtained if both the iris and the periocular region are available for a given individual [9,10,11].

Deep learning has greatly improved the accuracy of such models and thereby their predictions [12, 13]. There are two ways to obtain a CNN-based gender prediction technique: training a CNN model from scratch, and transfer learning [14]. Transfer learning has two instances:

fine-tuning the top layers of a pre-trained network,

using the bottleneck features of a pre-trained model.

In transfer learning, the features acquired by a base network, trained on a base dataset, are reused to train the target network on a target dataset. The features learned by the base network should also be suitable for the target network in order for this to work well.

Training from scratch: a CNN model consists of several kinds of layers.

Conv2D layer: a 2-D convolutional layer, used to capture features from the raw input image. The first layer captures low-level features such as colour, gradient orientation and edges; a greater number of Conv2D layers is used to extract higher-level features.

Pooling layer: reduces the number of parameters and the number of computations. It is also referred to as the down-sampling or sub-sampling layer.

Flatten Layer: This layer converts the multi-dimensional arrays into a single continuous linear vector.

Fully connected layer: The neurons in this layer receive input from all the neurons in the previous layer. It is also called the dense layer.

Transfer learning: Transfer learning is a powerful tool for training a model [14] when the target dataset is smaller than the base dataset. However, the features acquired from the base dataset should also be compatible with the target dataset.

Bottleneck features: A pre-trained network detects features that are useful for pattern recognition, such as edges, spots, or ridges. Only the convolutional part of the model is run, and a fully connected layer is defined separately on top of the learned features.

Fine Tuning: In fine tuning, the final layer of the architecture is replaced with custom layers. The final layers are different for different problems and can be tailored according to our needs.
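To make these two instances concrete, the following is a minimal Keras sketch, anticipating the EfficientNetB1 base used later in this work; the number of unfrozen layers and the learning rate are illustrative assumptions, not settings from our experiments.

```python
from tensorflow import keras

# Pre-trained base without its classification top; global average pooling
# turns the final feature maps into a single feature vector.
base = keras.applications.EfficientNetB1(
    include_top=False, weights="imagenet",
    input_shape=(240, 240, 3), pooling="avg")

# Bottleneck features: freeze the base entirely and train only a new
# classifier on top of the features it already detects.
base.trainable = False
model = keras.Sequential([base, keras.layers.Dense(2, activation="softmax")])

# Fine tuning: additionally unfreeze the top layers of the base and retrain
# them, typically at a low learning rate ("-20" is an illustrative choice).
for layer in base.layers[-20:]:
    layer.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```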

EfficientNet was introduced in 2019 by Tan and Le of Google AI [15]; it proposes an efficient scaling method that gives better accuracy, achieving high accuracy on both ImageNet classification and common transfer-learning tasks. When a model has already been trained for a particular task and we make changes to it to fit our own task, this is known as fine tuning. We fine-tune the EfficientNetB1 architecture for gender classification using periocular images.

Related works

In 2009, Park et al. [16] investigated the feasibility of using the periocular region of an individual as a biometric trait and confirmed it. They extracted features from the periocular region using texture and point operators, and assessed the performance using Local Binary Patterns (LBP), Histograms of Oriented Gradients (HOG), and the Scale-Invariant Feature Transform (SIFT). Bakshi et al. [17] proposed a multi-scale local feature-based periocular recognition technique, in which high-dimensional features are extracted from the iris and the periocular region in the visible spectrum. Tapia et al. [4] fused features extracted from near-infrared and RGB images to improve the accuracy of gender classification. Three techniques were used to extract features: texture (Local Binary Patterns), shape (Histogram of Oriented Gradients) and intensity (pixel values). A Random Forest classifier was then used to classify the images, achieving an accuracy of 90%.

Zhao et al. [18] improved periocular recognition rates with a Semantics-Assisted Convolutional Neural Network (SCNN), which recovers semantic information from comprehensive periocular features. It requires less computational time for the matching process, but more training time, as the technique involves learning from scratch. Proenca et al. [2] excluded the sclera and iris for periocular recognition in visible-light data; the ocular parts are separated from the periocular region using an ocular segmentation algorithm on the learning dataset. Bobeldyk et al. [8] investigated the sex-prediction accuracy associated with four different regions: the normalized iris-only region, the iris-only region, the iris-excluded ocular region, and the extended ocular region. The extracted feature set was classified as male or female using a Support Vector Machine (SVM); it was concluded that the iris together with the periocular region gives better accuracy. Kevin et al. used off-the-shelf features for periocular recognition. In recent years, many researchers have taken advantage of transfer learning to build stronger and more accurate CNN models; Sharif et al. [13] used the OverFeat network to solve a diverse range of recognition tasks.

Proposed model of the work
CNN from scratch
Pre-processing

In the beginning, the eye images, taken in the visible spectrum (400 nm–700 nm), are preprocessed. Preprocessing helps us handle the images better by bringing them to the same size and format. The images are resized to 75 × 75 pixels using bilinear interpolation, without affecting their aspect ratio. The images are converted to grayscale (graylevel) images, so that a minimum of information is stored per pixel; this reduces the computational requirements. To avoid the possibility of exploding gradients and to improve the convergence speed, the images are normalized by dividing by 255. Figure 2 shows the general CNN architecture.

Fig. 2

CNN architecture.
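The following sketch illustrates these preprocessing steps in TensorFlow; the function name is an illustrative choice, and the padded resize is one way of reaching 75 × 75 without distorting the aspect ratio, rather than the exact implementation used here.

```python
import tensorflow as tf

# Minimal sketch of the preprocessing described above: bilinear resize to
# 75 x 75, conversion to grayscale, and normalization by 255.
def preprocess_image(path):
    img = tf.io.decode_image(tf.io.read_file(path), channels=3,
                             expand_animations=False)
    img = tf.image.resize_with_pad(img, 75, 75, method="bilinear")
    img = tf.image.rgb_to_grayscale(img)   # one value per pixel
    return img / 255.0                     # scale to [0, 1]
```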

CNN feature extraction

The normalized images are used as the input to the CNN feature extraction module. The CNN model is built from scratch. The training dataset is divided into batches of size 10; the batch size is the number of data points used at a time to train the model, and a smaller batch size makes learning faster and more efficient. The model has 2 convolutional-pooling (conv-pooling) layers, 1 flattening layer, and 3 fully connected layers. Multiple kernels (filters) are used to extract features from an image. The first convolutional layer contains 16 kernels of size 3 × 3 × 1 with stride (1, 1); the second contains 32 kernels of size 3 × 3 × 1 with stride (1, 1). An epoch is one pass of the whole dataset through the entire network; the number of epochs is set to 10 for our model. The hyper-parameters are tuned using Bayesian optimization, with 5-fold cross-validation to evaluate each candidate setting.

An image can be represented as a cuboid having length, width and height. Sliding a filter across the whole image produces a new representation of the image with new dimensions; this process is called convolution. The convolutional layer extracts the attributes from the periocular images, and hence is an integral element of a CNN. The kernels generate a kernel map or feature map using the convolution operator (*). The initial layer extracts low-level features, whereas succeeding convolutional layers extract higher-level features. The pooling layer reduces the dimensions of the feature map output by the convolutional layers. The model uses max-pooling, a pooling operation that takes the maximum value over each patch of the feature map. The output of the convolutional-pooling layers is given as the input to the flattening layer, which converts the 2-D arrays from the pooling layer into a single long continuous linear vector. The combination of convolutions and other inputs is given by

$$x_j^L = f\Big(\sum_{i \in M_j} x_i^{L-1} * k_{ij}^L + b_j^L\Big),$$

where

$x_j^L$ — the $j$-th output feature map of the $L$-th layer,

$b_j^L$ — the additive bias for the $j$-th map in the $L$-th layer,

$k_{ij}^L$ — the kernel connecting the $i$-th input map to the $j$-th output map in the $L$-th layer,

$M_j$ — the set of input maps (images).

The output of the flattening layer is fed into the dense (fully connected) layers. In a dense layer, each neuron receives input from all the neurons in the previous layer. To avoid over-fitting of the data, dropout layers with a dropout rate of 0.2 are added between the dense layers. The dense layers classify the image based on the output of the previous layers. LeakyReLU is used as the activation for the first two dense layers, while softmax is used for the last dense layer.
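The architecture described above can be sketched in Keras as follows. The kernel counts, strides, dropout rate and dense-layer activations follow the text; the pooling window, the convolutional activations and the dense-layer widths are assumptions, as the text does not specify them.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal sketch of the scratch-built CNN described above.
model = keras.Sequential([
    layers.Input(shape=(75, 75, 1)),
    layers.Conv2D(16, (3, 3), strides=(1, 1)), layers.LeakyReLU(),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), strides=(1, 1)), layers.LeakyReLU(),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128), layers.LeakyReLU(),
    layers.Dropout(0.2),
    layers.Dense(64), layers.LeakyReLU(),
    layers.Dropout(0.2),
    layers.Dense(2, activation="softmax"),   # 0 = female, 1 = male
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training follows the settings given in the text:
# model.fit(x_train, y_train, batch_size=10, epochs=10,
#           validation_data=(x_val, y_val))
```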

EfficientNetB1
Pre-processing

In the first step, the eye images, considered in the visible spectrum (400 nm–700 nm), are pre-processed. The images are resized to 240 × 240 pixels using bilinear interpolation, without affecting their aspect ratio. EfficientNetB1 expects input data in the range [0, 255] and input images of shape (240, 240, 3); thus, the images are not converted to grayscale.
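A sketch of this input pipeline follows, with the same illustrative caveats as before: RGB images resized to 240 × 240 with bilinear interpolation, and pixel values kept in [0, 255] because the network rescales its inputs internally.

```python
import tensorflow as tf

# Minimal sketch of the EfficientNetB1 input pipeline described above.
def preprocess_rgb(path):
    img = tf.io.decode_image(tf.io.read_file(path), channels=3,
                             expand_animations=False)
    return tf.image.resize_with_pad(img, 240, 240, method="bilinear")
```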

Feature extraction

Transfer learning involves reusing knowledge previously attained by other models in the current model. This reduces the time and effort of building a new model and tends to give better results, as the model has already been trained on a large number of data points from another dataset. The EfficientNet-B1 architecture consists of a stem layer, a series of blocks, and custom-made final layers. The stem consists of an input layer, a rescaling layer that scales the pixel values, a normalization layer, a zero-padding layer, a Conv2D layer, a batch-normalization layer and an activation layer. The initial layers which form the stem are shown in Figure 3.

Fig. 3

Stem layer for EfficientNet-B1 [19].

After the stem layer, there are seven blocks which are shown in Figure 4.

Fig. 4

Architecture for EfficientNet-B1 [19].

The modules used in the architecture are shown in Figure 5.

Fig. 5

Modules for EfficientNet-B1 [19].

The above-mentioned layers have features learnt from ImageNet images. The feature maps from the EfficientNet-B1 architecture are then passed to the top layer. The top (final) layer is replaced with custom layers which classify the images into two classes: 0 (female) and 1 (male). The custom model has an AveragePooling2D layer and a dense layer. In AveragePooling2D, pooling is applied in 2 × 2 patches, and the average is calculated for each patch with a stride of 2. Softmax activation is used in the last layer; the softmax function calculates the relative probabilities and determines whether the given image is that of a male or a female eye:

$$\mathrm{softmax}(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)},$$

where $z_i$ is the value from the $i$-th neuron in the dense layer.
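A minimal Keras sketch of this custom top is given below; the frozen base and the Flatten layer are assumptions about details not stated above.

```python
from tensorflow import keras
from tensorflow.keras import layers

# EfficientNetB1 (ImageNet weights) as the feature extractor, followed by
# the custom top described above: 2x2 average pooling with stride 2 and a
# softmax dense layer.
base = keras.applications.EfficientNetB1(include_top=False,
                                         weights="imagenet",
                                         input_shape=(240, 240, 3))
base.trainable = False
model = keras.Sequential([
    base,
    layers.AveragePooling2D(pool_size=(2, 2), strides=2),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),   # 0 = female, 1 = male
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```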

Experimental results
Database

The images are downloaded from Kaggle [20]; the women's and men's face images are taken from [21]. The eye regions were extracted by [20] with the help of haarcascade_eye.xml. The dataset contains 9220 periocular images in the visible spectrum: 5058 images of male eyes and 4162 images of female eyes. Out of these, 418 male-eye and 504 female-eye images are randomly selected for the study.
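For illustration, eye regions can be extracted with OpenCV's Haar cascade as follows; the file paths are placeholders, and this is a sketch of the procedure rather than the exact script used in [20].

```python
import cv2

# Load OpenCV's bundled eye cascade and crop each detected eye region.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
img = cv2.imread("face.jpg")                       # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1,
                                             minNeighbors=5):
    cv2.imwrite(f"eye_{x}_{y}.png", img[y:y + h, x:x + w])
```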

Performance metric

The performance of the work is reported and compared using classification accuracy, which measures how well our model performs:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where

TP — True Positive,

TN — True Negative,

FP — False Positive,

FN — False Negative.

Some of the other metrics used are precision, recall and F1-score:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
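These metrics can be computed, for example, with scikit-learn; the label arrays below are placeholders standing in for the test labels and the model's predictions.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Placeholder ground-truth labels and predictions (0 = female, 1 = male).
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

print(accuracy_score(y_true, y_pred))              # classification accuracy
print(confusion_matrix(y_true, y_pred))            # TP/TN/FP/FN counts
print(classification_report(y_true, y_pred,        # precision, recall, F1
                            target_names=["female", "male"]))
```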

Experimental setup

The experiments have been conducted on an Intel® Core i3 workstation with 4 GB RAM. Google Colab has been used to build the models. Colab is a Python development environment that runs in the browser on Google Cloud; it provides an Nvidia K80/T4 GPU, 12 GB RAM, a 0.82 GHz memory clock, and 4.1 TFLOPS. The models are built using TensorFlow and Keras, open-source software libraries that provide Python interfaces for artificial neural networks and for machine learning and artificial intelligence. Google Colab uses Python 3.6. The dataset is divided into training, validation and test sets: 80% of the data is used to train the model, 10% for validation and the remaining 10% for testing. We use the CNN model built from scratch in the present work and EfficientNetB1 to capture characteristics of the periocular images. EfficientNetB1 can divide images into 1000 categories, as it is trained on over a million images from ImageNet. The weights of the EfficientNet-B1 model are used to train the custom model built on top of it.
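The 80/10/10 split can be obtained, for instance, with scikit-learn; the arrays below are placeholders for the preprocessed images and labels, and the stratified split is an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the preprocessed periocular images
# and their gender labels.
images = np.zeros((922, 75, 75, 1), dtype=np.float32)
labels = np.random.randint(0, 2, size=922)

# 80% train, then split the remaining 20% evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    images, labels, test_size=0.2, random_state=42, stratify=labels)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)
```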

Performance analysis
CNN from scratch

The CNN model built from scratch attains an accuracy of 94.46%. The classification report obtained using the CNN model is shown in Table 1.

Classification report for CNN model built from scratch.

Label Precision Recall F1-score

1 (Male) 0.94 0.96 0.95
0 (Female) 0.95 0.93 0.94

Out of 418 male eyes, 387 are predicted correctly and 31 incorrectly. Out of 504 female eyes, 484 are predicted correctly and 20 incorrectly. Figure 6 shows a subset of the dataset, and the confusion matrix obtained by classifying the data with the CNN model built from scratch is shown in Figure 7.

Fig. 6

A subset of the dataset.

The model accuracy changes with each epoch: it increases with the number of epochs and then reaches a saturation point. The model accuracy obtained using the CNN model built from scratch is shown in Figure 8.

Fig. 7

Confusion matrix for CNN model built from scratch.

The model loss obtained using the CNN model built from scratch is shown in Figure 9.

Fig. 8

Model accuracy for CNN built from scratch.

EfficientNetB1

For the fine-tuned EfficientNetB1, the accuracy obtained is 97.94%. The classification report obtained using the fine-tuned EfficientNetB1 is shown in Table 2.

Classification report for fine-tuned EfficientNetB1.

Label Precision Recall F1-score

1 (Male) 0.97 0.99 0.98
0 (Female) 0.99 0.97 0.98

Out of 418 male eyes, 404 are predicted correctly and 14 incorrectly. Out of 504 female eyes, 499 are predicted correctly and 5 incorrectly. The accuracy of the model can be assessed with the help of the confusion matrix given in Figure 10.

Fig. 9

Model loss for CNN built from scratch.

The model accuracy and loss obtained using EfficientNetB1 are shown in Figures 11 and 12, respectively.

Fig. 10

Confusion matrix for EfficientNetB1.

Fig. 11

Model accuracy for EfficientNetB1.

Fig. 12

Model loss for EfficientNetB1.

Comparison

EfficientNetB1 gives better accuracy than the CNN built from scratch, and all the other metrics confirm that the fine-tuned EfficientNetB1 is the better model. The difference in accuracy is due to the different features acquired at each level: the layers of the CNN built from scratch learn relatively simple features, whereas EfficientNetB1, already trained on over a million images, has acquired more advanced features. Accuracy is initially low because only simple features have been learned and complex features cannot yet be identified. With an increasing number of epochs, more features are learnt, and the accuracy of both models increases.

Conclusion

In the present work, we have addressed periocular recognition in the visible spectrum from a deep learning perspective. The work is the first of its kind to use the EfficientNetB1 architecture for gender determination from periocular images. We compared a CNN built from scratch with a fine-tuned EfficientNetB1 model; this comparison is made to show the strength and versatility of transfer learning. Our experiments show that the model works better when using off-the-shelf features. Transfer learning is hence more precise and less time consuming: building a CNN model from scratch for every purpose is tiresome and needs high computational power, whereas a pre-trained model can easily be fine-tuned for our requirements. This work can be further extended to other EfficientNet variants to better understand their classification capabilities, and to images in different spectra.

Declarations
Conflict of interest 

The authors declare that there is no conflict of interest regarding the publication of this paper.

Author’s contributions

V.B.N.-Writing-Original Draft, Methodology, Writing-Review, Investigation and Visualization; B.R.-Resources, Validation, Conceptualization and Formal analysis. P.V.-Editing and Supervision. All authors read and approved the final submitted version of this manuscript.

Funding

Not applicable.

Acknowledgement

The authors thank the Editors and Reviewers for their valuable suggestions and guidance.

Data availability statement

All data that support the findings of this study are included within the article.

Using of AI tools

The authors declare that they have not used Artificial Intelligence (AI) tools in the creation of this article.
