Human gender detection has become popular in recent years because of its wide range of applications. Previously, gender detection was based on the full set of facial features, such as the eyes, eyebrows, nose and lips. The eye and its surrounding region, including the sclera, brows, lashes, eyelids and skin, is referred to as the periocular region, as shown in Figure 1. Periocular biometrics can draw on either the iris or the region around the eye. The periocular area is unique to each individual and can be used successfully for recognition [1]. Iris recognition is difficult with visible light cameras, and the face masks that are widely worn today are a disadvantage for facial recognition. Periocular biometrics can be used in such conditions [2]. In addition, the periocular region is largely unaffected by facial expression variation, plastic surgery or aging.
Periocular region.
When the image is taken from a distance, the determination can be based on the periocular region; when the image is taken at very close range, little facial information is captured, so the iris gives better results. Iris information can be used to estimate gender, age, emotion, ethnicity and so on [3, 4]. Gender determination is important in social settings: it can help grant a person entry to certain venues on the basis of gender without registering that person's identity. The recognition of biometric features such as age, ethnicity or gender from the iris pattern has been analysed by many researchers [5,6,7,8]. Better predictions are obtained, however, when both the iris and the periocular region are available for a given individual [9,10,11].
Deep learning has greatly improved model accuracy and hence the quality of predictions [12,13]. There are two ways to obtain a CNN-based sex prediction technique: training a CNN model from scratch, and transfer learning [14]. Transfer learning has two variants: fine-tuning the top layers of a pre-trained network, and using the bottleneck features of a pre-trained model.
In transfer learning, the features learned by a base network trained on a base dataset are reused to train the target network on a target dataset. For this to work well, the features learned by the base network should also be suitable for the target network.
Training from scratch: A CNN model is built from several types of layers.
Conv2D layer: This is a 2-D convolutional layer. It extracts relevant features from the raw input image. The first layer captures low-level features such as colour, gradient orientation and edges. Stacking more Conv2D layers extracts higher-level features.
Pooling layer: This layer reduces the number of parameters and the number of computations. It is also referred to as the downsampling or subsampling layer.
Flatten layer: This layer converts the multi-dimensional arrays into a single continuous linear vector.
Fully connected layer: The neurons in this layer receive input from all the neurons in the previous layer. It is also called a dense layer.
Transfer learning: Transfer learning is a powerful way to train a model [14] when the target dataset is smaller than the base dataset. However, the features acquired from the base dataset should also be compatible with the target dataset.
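To make the layer types listed above concrete, the following is a minimal pure-Python sketch of a convolution, a pooling step and a flatten step on a toy image. It is an illustration only: the 5x5 image, the 2x2 edge kernel and the choice of max pooling are assumptions made for the example and are not taken from the model in this work.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map: Matrix, size: int = 2) -> Matrix:
    """Keep the largest value in each non-overlapping size x size patch."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def flatten(feature_map: Matrix) -> List[float]:
    """Convert a 2-D feature map into a single linear vector."""
    return [value for row in feature_map for value in row]


# Toy image (assumption): dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds to a left-to-right increase in intensity.
kernel = [[-1, 1], [-1, 1]]

features = conv2d_valid(image, kernel)  # 4x4 map, non-zero only at the edge
pooled = max_pool2d(features)           # 2x2 map after downsampling
vector = flatten(pooled)                # [2, 0, 2, 0]
```

The convolution responds only where the intensity changes, pooling halves each spatial dimension, and flattening produces the vector that a dense layer would consume.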
Bottleneck features: A pre-trained network detects features that are useful for pattern recognition, such as edges, spots or ridges. Only the convolutional part of the pre-trained model is used, and a fully connected layer is defined separately on the learned features.
Fine-tuning: In fine-tuning, the final layers of the architecture are replaced with custom layers. The final layers differ from problem to problem and can be tailored to the task at hand.
EfficientNet was introduced in 2019 by Tan and Le at Google AI [15]. It proposes an efficient method that gives better accuracy, and it achieves high accuracy both on ImageNet and on common image classification transfer learning tasks. We fine-tune the EfficientNetB1 architecture for gender classification using periocular images. Fine-tuning means taking a model that has already been trained for a particular task and modifying it to fit a new task.
In 2009, Park et al. [16] investigated the feasibility of using periocular images of an individual as a biometric trait and confirmed it. They extracted features from the periocular region using texture and point operators. Performance was assessed using Local Binary Patterns (LBP), Histograms of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT). For periocular recognition, Bakshi et al. [17] proposed a multi-scale local feature, in which high-dimensional features are extracted from the iris and the periocular region in the visible spectrum. Tapia et al. [4] fused features extracted from near-infrared and RGB images to improve the accuracy of gender classification. Three kinds of features were extracted: texture (Local Binary Patterns), shape (Histogram of Oriented Gradients) and intensity (pixel values). A random forest classifier was then used to classify the images, achieving an accuracy of 90%.
Using a Semantics-Assisted Convolutional Neural Network (SCNN), Zhao et al. [18] improved periocular recognition rates. Semantic information is recovered using the comprehensive periocular features. The technique requires less computational time for matching, but more training time because it learns from scratch. Proenca et al. [2] excluded the sclera and iris for periocular recognition in visible light data. The ocular parts are separated from the periocular part in the learning dataset using an ocular segmentation algorithm. Bobeldyk et al. [8] investigated the sex-prediction accuracy of four different regions: the normalized iris-only region, the iris-only region, the iris-excluded ocular region and the extended ocular region. The extracted feature set is classified as male or female using a Support Vector Machine (SVM). They concluded that the iris together with the periocular region gives better accuracy. Kevin et al. used off-the-shelf features for periocular recognition. In recent years, many researchers have taken advantage of transfer learning to build stronger and more accurate CNN models. Sharif et al. [13] used the OverFeat network to solve a diverse range of recognition tasks.
In the beginning, the images of the eye taken in Visible Spectrum (400nm–700nm) are preprocessed. Preprocessing helps us to better handle the images by bringing them to the same size and format. The images are resized as 75
CNN architecture.
The normalized images are the input to the CNN feature extraction module. The CNN model is built from scratch. The training dataset is divided into batches of 10, where the batch size is the number of data points used at a time to train the model. A smaller batch size updates the weights more frequently and needs less memory per training step. The model has 2 convolutional-pooling (conv-pooling) layers, 1 flattening layer and 3 fully connected layers. Multiple kernels (filters) are used to extract features from an image. The first convolutional layer contains 16 kernels of size 3
The output of the flattening layer is fed into the dense (fully connected) layers. In a dense layer, each neuron receives input from all the neurons in the previous layer. To avoid over-fitting, dropout layers with a dropout rate of 0.2 are added between the dense layers. The last dense layer classifies the image based on the output of the previous layers. LeakyReLU is used as the activation for the first two dense layers, while softmax is used for the last dense layer.
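The architecture described above (two conv-pooling stages, one flattening layer, three dense layers with dropout of 0.2 in between, LeakyReLU and softmax) could be sketched in Keras roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the function name, the input shape passed by the caller, the 3x3 kernel shape, the 32 kernels in the second convolution, the ReLU activation in the convolutions, the use of max pooling and the dense layer widths (64 and 32) are all hypothetical choices that the text does not specify.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_scratch_cnn(input_shape, num_classes=2):
    """Two conv-pooling stages, one flatten layer and three dense layers."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        # First conv-pooling stage: 16 kernels (kernel shape 3x3 is an assumption).
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(2),
        # Second conv-pooling stage: 32 kernels is an assumption.
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        # First two dense layers use LeakyReLU; widths are assumptions.
        layers.Dense(64),
        layers.LeakyReLU(),
        layers.Dropout(0.2),
        layers.Dense(32),
        layers.LeakyReLU(),
        layers.Dropout(0.2),
        # Last dense layer uses softmax over the two classes.
        layers.Dense(num_classes, activation="softmax"),
    ])
```

Calling `build_scratch_cnn((64, 64, 3))`, with a placeholder input shape, returns a model whose output is a pair of class probabilities per image. Training with a batch size of 10 would then be a matter of passing `batch_size=10` to `model.fit`.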
In the first step, the images of the eye taken in the Visible Spectrum (400nm–700nm) are pre-processed. The images are resized to (240, 240) pixels using bilinear interpolation, without affecting their aspect ratio. EfficientNetB1 expects input pixel values in the range [0, 255] and input images of shape (240, 240, 3). The images are therefore not converted to grayscale.
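One way to resize to 240 by 240 pixels without distorting the aspect ratio is to scale the longer side to 240 and pad the remainder. The sketch below shows that geometry and a single bilinear lookup in pure Python. Padding is an assumption here, since the text does not say how the aspect ratio is preserved, and both function names are hypothetical. In TensorFlow, `tf.image.resize_with_pad` performs a comparable operation.

```python
def letterbox_geometry(width, height, target=240):
    """Return the scaled size and the padding that fit an image into target x target."""
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return {
        "size": (new_w, new_h),
        # (left, top, right, bottom) padding in pixels
        "padding": (pad_left, pad_top,
                    target - new_w - pad_left, target - new_h - pad_top),
    }


def bilinear_sample(image, x, y):
    """Interpolate a single-channel image (list of rows) at fractional (x, y)."""
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(image[0]) - 1)
    y1 = min(y0 + 1, len(image) - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


# A hypothetical 480x360 source image is scaled to 240x180 and padded
# with 30 rows above and below.
geometry = letterbox_geometry(480, 360)

# The value halfway between four neighbouring pixels is their average.
centre = bilinear_sample([[0, 10], [20, 30]], 0.5, 0.5)  # 15.0
```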
Transfer learning reuses knowledge previously acquired by other models in the current model. This reduces the time and effort needed to build a new model and tends to give better results, since the model has already been trained on a large number of data points from another dataset. The EfficientNet-B1 architecture used here begins with the stem layer and ends with the custom-made final layers. The stem layer consists of an input layer, a rescaling layer that rescales the pixel values, a normalization layer, a zero-padding layer, a Conv2D layer, a batch normalization layer and an activation layer. The initial layers that form the stem layer are shown in Figure 3.
Stem layer for EfficientNet-B1 [19].
After the stem layer, there are seven blocks which are shown in Figure 4.
Architecture for EfficientNet-B1 [19].
The modules used in the architecture are shown in Figure 5.
Modules for EfficientNet-B1 [19].
The layers described above carry features learnt from ImageNet images. The feature maps from the EfficientNet-B1 architecture are passed on to the top layer. The top layer, which is the final layer, is replaced with custom layers that classify the images into two classes: 0 (female) and 1 (male). The custom model has an AveragePooling2D layer and a dense layer. In AveragePooling2D, the average is calculated over 2x2 patches with a stride of 2. Softmax activation is used in the last layer. The softmax function calculates the relative probabilities of the two classes, which determine whether the given image is that of a male or a female eye.
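A pre-trained EfficientNetB1 base with the custom head described above could be assembled in Keras along the following lines. This is a hedged sketch rather than the authors' code: the function name and the `freeze_base` switch are hypothetical, and the `Flatten` layer between the pooling and dense layers is an assumption added so that the dense layer receives a vector.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_periocular_classifier(weights="imagenet", freeze_base=False):
    """EfficientNetB1 without its original top, followed by a two-class head."""
    base = tf.keras.applications.EfficientNetB1(
        include_top=False,           # drop the 1000-class ImageNet classifier
        weights=weights,             # "imagenet" loads the pre-trained weights
        input_shape=(240, 240, 3),
    )
    # freeze_base=True keeps the pre-trained features fixed (hypothetical option).
    base.trainable = not freeze_base

    inputs = tf.keras.Input(shape=(240, 240, 3))  # pixel values in [0, 255]
    x = base(inputs)
    x = layers.AveragePooling2D(pool_size=2, strides=2)(x)
    x = layers.Flatten()(x)                       # assumption: not named in the text
    outputs = layers.Dense(2, activation="softmax")(x)  # 0 = female, 1 = male
    return tf.keras.Model(inputs, outputs)
```

The Keras EfficientNet models include their own rescaling and normalization layers, which is why the inputs stay in the [0, 255] range. Calling `build_periocular_classifier()` downloads the ImageNet weights on first use; passing `weights=None` builds the same graph with random initialisation.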
The images are downloaded from Kaggle [20]. Women and men’s faces are downloaded from [21]. The eyes are highlighted by [20] with the help of
The performance of the models is reported and compared using classification accuracy, the fraction of test images that a model classifies correctly.
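$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.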
The other metrics used are precision, recall and F1-score.
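For reference, these metrics can be computed from the four confusion-matrix counts using their standard definitions, as in the minimal sketch below. The function name is hypothetical and the counts in the example are made up for illustration; they are not results from this work.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision: of the samples predicted positive, how many are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly positive samples, how many were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1-score: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=8, fp=2, fn=1, tn=9)
```

In a two-class problem the per-class rows of a classification report are obtained by treating each class in turn as the positive class.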
The experiments were conducted on a workstation with an Intel® Core i3 CPU and 4 GB of RAM. Google Colab, a Python development environment that runs in the browser on Google Cloud, was used to build the models. It provides an Nvidia K80 / T4 GPU, 12 GB RAM, a 0.82 GHz memory clock and 4.1 TFLOPS. The models are built using TensorFlow and Keras. TensorFlow is an open-source software library for machine learning and artificial intelligence, and Keras is an open-source library that provides a Python interface for artificial neural networks. Google Colab uses Python 3.6. The dataset is split into training, validation and test sets: 80% of the data is used for training, 10% for validation and the remaining 10% for testing. We use a CNN model built from scratch in the present work and EfficientNetB1 to extract features from the periocular images. EfficientNetB1 can classify images into 1000 categories, as it is trained on a million images from ImageNet. The weights of the EfficientNet-B1 model are used to train the custom model built on top of EfficientNetB1.
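The 80/10/10 split described above can be sketched with the standard library as follows. The function name, the fixed seed and the file names are hypothetical, and the text does not state whether the split was shuffled or stratified, so the random shuffle is an assumption.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the items and split them into training, validation and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains, about 10%
    return train, val, test


# Hypothetical file names standing in for the periocular images.
image_paths = [f"eye_{i:04d}.jpg" for i in range(1000)]
train_set, val_set, test_set = split_dataset(image_paths)
```

With 1000 items this yields 800 training, 100 validation and 100 test samples, with no item appearing in more than one set.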
The CNN model built from scratch attains an accuracy of 94.46%. The classification report obtained with the CNN model is shown in Table 1.
Classification report for CNN model built from scratch.
Label | Precision | Recall | F1-score |
---|---|---|---|
1 (Male) | 0.94 | 0.96 | 0.95 |
0 (Female) | 0.95 | 0.93 | 0.94 |
Out of 418 male eyes, 387 are predicted correctly and 31 are predicted incorrectly. Out of 504 female eyes, 484 are predicted correctly and 20 are predicted incorrectly. The confusion matrix obtained by classifying the data with the CNN model built from scratch is shown in Figure 6.
A subset of the dataset.
The model accuracy changes with each epoch. It increases with the number of epochs and then reaches a saturation point. The model accuracy obtained with the CNN model built from scratch is shown in Figure 7.
Confusion matrix for CNN model built from scratch.
The model loss obtained with the CNN model built from scratch is shown in Figure 8.
Model accuracy for CNN built from scratch.
For the fine-tuned EfficientNetB1, the accuracy obtained is 97.94%. The classification report obtained with the fine-tuned EfficientNetB1 is shown in Table 2.
Classification report for fine-tuned EfficientNetB1.
Label | Precision | Recall | F1-score |
---|---|---|---|
1 (Male) | 0.97 | 0.99 | 0.98 |
0 (Female) | 0.99 | 0.97 | 0.98 |
Out of 418 male eyes, 404 are predicted correctly and 14 are predicted incorrectly. Out of 504 female eyes, 499 are predicted correctly and 5 are predicted incorrectly. The accuracy of the model can be determined with the help of the confusion matrix given in Figure 9.
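The reported accuracies can be cross-checked against the per-class counts quoted in the text. The short sketch below does this for both models; the function and variable names are hypothetical, while the counts are those stated above.

```python
def accuracy_from_counts(per_class):
    """Overall accuracy from {class: (correct, incorrect)} counts."""
    correct = sum(right for right, _ in per_class.values())
    total = sum(right + wrong for right, wrong in per_class.values())
    return correct / total


# (correct, incorrect) predictions per class, as quoted in the text.
scratch_cnn = {"male": (387, 31), "female": (484, 20)}
efficientnet_b1 = {"male": (404, 14), "female": (499, 5)}

scratch_accuracy = accuracy_from_counts(scratch_cnn)           # 871 / 922
efficientnet_accuracy = accuracy_from_counts(efficientnet_b1)  # 903 / 922
```

Both ratios agree with the reported figures of roughly 94.5% and 97.9% to within rounding.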
Model loss for CNN built from scratch.
The model accuracy and loss obtained with EfficientNetB1 are shown in Figures 10, 11 and 12.
Confusion matrix for fine-tuned EfficientNetB1.
Model accuracy for EfficientNetB1.
Model loss for EfficientNetB1.
The fine-tuned EfficientNetB1 gives better accuracy than the CNN built from scratch. The other metrics (precision, recall and F1-score) also indicate that the fine-tuned EfficientNetB1 is the better model. The difference in accuracy is due to the different features each model acquires. The layers of the CNN built from scratch learn features that are relatively simple to learn, whereas EfficientNetB1 has already been trained on a million images and has hence acquired more advanced features. Accuracy is low in the initial epochs because only simple features have been learnt and complex features cannot yet be identified. With an increasing number of epochs, more features are learnt and the accuracy increases in both models.
In the present work, we have addressed periocular recognition in the visible spectrum from a deep learning perspective. To the best of our knowledge, this work is the first to use the EfficientNetB1 architecture for gender determination from periocular images. We compared a CNN built from scratch with a fine-tuned EfficientNetB1 model to show the strength and versatility of transfer learning. Our experiment shows that the model works better with off-the-shelf features; transfer learning is hence more accurate and less time-consuming. Building a CNN model from scratch for every purpose is laborious and needs high computational power, whereas a pre-trained model can easily be adapted through transfer learning by fine-tuning it for the task at hand. This work can be extended to other EfficientNet models to better understand their classification capabilities, and to images in different spectra.
The authors declare that there is no conflict of interest regarding the publication of this paper.
V.B.N.: Writing (original draft), Methodology, Writing (review), Investigation and Visualization; B.R.: Resources, Validation, Conceptualization and Formal analysis; P.V.: Editing and Supervision. All authors read and approved the final submitted version of this manuscript.
Not applicable.
Thank you so much to the Editors and Reviewers for their valuable suggestions and guidance.
All data that support the findings of this study are included within the article.
The authors declare that they have not used Artificial Intelligence (AI) tools in the creation of this article.