
Deep Learning for Sign Language Recognition: A Comparative Review

Jun 15, 2024


Introduction

Communication plays an essential role in individuals’ lives, enabling them to gain and exchange knowledge, interact, develop social relationships, and express feelings and needs. While most humans communicate verbally, those with limited verbal abilities need to communicate using sign language (SL). Sign language is a visual language used by deaf individuals; it relies on various parts of the body, including the fingers, hands, arms, head, torso, and facial expressions, rather than the vocal tract, to convey information [1]. According to the World Federation of the Deaf, there are more than seventy million deaf people around the world who use more than 300 sign languages [2]. However, sign language is not widely known among individuals with typical hearing and communication abilities, and only a few of them are able to understand or learn it. This reveals a genuine communication gap between deaf individuals and the rest of society. Automated recognition and translation of sign language would help break down these barriers by providing a comfortable communication platform between deaf and hearing individuals and giving deaf individuals the same opportunities to obtain information as everyone else [3].

Machine translation demonstrates a remarkable capacity for overcoming linguistic barriers, particularly through the use of Deep Learning (DL), a branch of the field. Deep learning exhibits outstanding performance in diverse domains, including image classification, pattern recognition, and many other applications [4]. The advancement of DL networks has brought a significant surge in performance, particularly in video-related tasks such as human action recognition, motion capture, and gesture recognition [5,6,7]. DL techniques offer remarkable attributes that make them highly advantageous for Sign Language Recognition (SLR). This is primarily attributed to their hidden layers, which autonomously extract latent features, as well as their capacity to effectively handle the intricate nature of hand gestures in sign language. This is achieved by leveraging extensive datasets, enabling accurate outcomes without the time-consuming processes that often characterize conventional translation methods [8].

This paper reviews various deep learning models used to recognize sign language, highlights the key challenges encountered when applying deep learning to sign language recognition, and identifies the issues that remain unresolved. Additionally, it provides suggestions for overcoming challenges that, to the best of our knowledge, have not yet been solved.

Motivation

The communication gap that exists between hearing and deaf individuals is the most important motivation for designing and building an interpreter to facilitate communication between them. When embarking on the design of such a translator, a comprehensive set of objectives must be taken into account. These include ensuring accuracy, speed, efficiency, scalability, and other factors that contribute to delivering a satisfactory translation outcome for both parties involved. However, numerous challenges have been identified in the realm of sign language recognition, necessitating the development of an efficient and robust system to address various issues related to environmental conditions, movement speed, occlusions, and adherence to linguistic rules. Deep-learning-based sign language recognition models have gained significant interest in the last few years due to the quality of the recognition and translation that they provide and their ability to deal with the various sign language recognition challenges.

Contribution

The main contributions of this work are:

Provide a description of important concepts related to sign language, including acquisition methods, types of sign language, and a description of many public datasets in different languages around the world.

Identify the various challenges and problems encountered in the implementation of sign language recognition using DL.

Review more than 140 related works for DL-based sign language recognition from the year 2018 to 2022.

Classify these relevant works according to the specific problem addressed and the technique or method employed to overcome the specified challenge or problem.

Paper Organization

This paper is organized into eight main sections as described in Fig1. To facilitate a smoother reading of this review, a detailed description of each section is presented below:

Introduction: Provides a brief introduction to sign language and deep learning, describes the motivation behind this review, introduces the main contributions, and illustrates the main layout of this work.

Sign language Overview: Provides a comprehensive overview of sign language, encompassing its historical context and the fundamental principles employed in its construction and development. Additionally, it includes a description of the various forms used to represent letters, words, and sentences in sign language, as well as the acquisition methods employed for capturing sign language.

Deep Learning Background: Introduces the historical background of DL networks, their structures, properties, layers, and commonly utilized architectures.

Sign Language Recognition Challenges Using Deep Learning: Describes the main challenges and problems facing the recognition of sign language using DL.

Sign Language Public Datasets Description: Presents an overview of widely accessible sign language datasets, encompassing various languages and types (such as images and videos) and available in different formats (including RGB, depth, and skeleton data). Additionally, it provides a description of public action datasets related to sign language.

Deep learning-based sign language-related works: Introduces a considerable number of promising related works for sign language using DL techniques from 2018 to 2022 that are organized based on the type of problem being addressed.

Discussion: Discusses the results and methods utilized by the presented related works.

Conclusion: Concludes the review and presents the conclusions reached from this survey of sign language recognition using DL, in addition to a set of recommendations for future research in this area.

Figure 1:

Paper Organization

Sign Language Overview

Sign language (SL) serves as a crucial means of communication for individuals who experience difficulties in speaking or hearing. Unlike spoken language, understanding sign language does not rely on auditory perception, nor does it involve vocalization. Instead, sign language is primarily conveyed through the simultaneous combination of hand shapes, orientations, and movements, as well as facial expressions, making it a visual language [9]. Historically, linguistic studies of sign language began in the 1970s [10] and show that, like spoken languages, it arranges elementary units called phonemes into meaningful semantic units that carry linguistic information. Sign language is not derived from spoken languages; instead, it has its own independent vocabulary and grammatical constructions [11]. However, the signs used by deaf individuals possess an internal structure similar to spoken words. Just as a limited number of sounds can generate hundreds of thousands of words in spoken languages, signs are formed from a finite set of gestural features. Thus, signs are not merely gestures; they are actually groups of linguistically significant features. There is a common misapprehension that there is only a single, universal sign language. Just like spoken languages, sign languages evolve and grow naturally across time and space [12]. Many countries have their own national sign languages, and there are also regional variations and local dialects. Moreover, signs do not have a one-to-one mapping to specific words. Therefore, sign language recognition is a complex process that extends beyond a simple substitution of individual signs with their corresponding spoken language counterparts. This is attributed to the fact that sign languages possess distinct vocabularies and grammatical structures that are not confined to any particular spoken language. Furthermore, even within regions where the same spoken language is used, there can be significant variations in the sign languages employed [13].

Sign Language Acquisition Modalities

The signs of sign language must be captured to provide input for the recognition system, and there are various acquisition techniques that provide several types of input, such as images, video, and signals. Basically, the main acquisition methods for any sign language recognition system depend on one of the following techniques.

1- Vision-Based: In this type of system, signs are captured using one or more image-capturing devices, in the form of single images or a video stream; in some cases an active device is used to collect depth information, which accurately represents the distance between the image plane and the relevant object in the scene [14]. This category is easy to use and has a low computational cost. There are many imaging devices for capturing signs in the form of RGB and depth data, including [15]:

Single camera: Uses only one camera, such as a webcam, digital camera, video camera, or smartphone camera.

Stereo camera: Combines multiple monocular or thermal cameras to capture depth information.

Active methods: Utilizes the projection of structured light using devices such as Kinect and Leap Motion Controller (LMC), which are 3D cameras that can gather movement and skeletal data.

Other methods, such as body markers in the form of colored gloves, wristbands, and LED lights.

Generally, the major advantages of vision-based methods are that they are inexpensive, convenient, and non-intrusive. The user simply needs to communicate using sign language naturally in front of an image-capturing device. This makes it suitable for real-time applications [16]; a minimal capture sketch is given after the list of problems below. However, the use of vision-based input suffers from a set of problems, including [17]:

Too much redundant information causes low recognition efficiency.

Low recognition accuracy, due to occlusion and motion blur.

Variations in signing style between individuals result in poor generalization of algorithms.

A small recognizable vocabulary, because large-vocabulary datasets contain many similar words.

Challenging issues related to time, speed, and overlapping signs.

A need for additional feature extraction methods to operate correctly.
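To make the vision-based pipeline concrete, the following is the minimal capture sketch mentioned above: it grabs a short frame sequence from a single webcam with OpenCV. The device index, frame count, and 224x224 target size are illustrative assumptions rather than values prescribed by any particular system.

```python
# Minimal sketch of vision-based sign acquisition with a single webcam.
import cv2

def capture_frames(device_index: int = 0, target_size=(224, 224), max_frames=64):
    """Grab a short sequence of RGB frames suitable as input to a recognizer."""
    cap = cv2.VideoCapture(device_index)
    frames = []
    try:
        while len(frames) < max_frames:
            ok, frame_bgr = cap.read()
            if not ok:
                break
            # OpenCV returns BGR; most DL pipelines expect RGB.
            frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            frames.append(cv2.resize(frame_rgb, target_size))
    finally:
        cap.release()
    return frames

if __name__ == "__main__":
    clip = capture_frames()
    print(f"Captured {len(clip)} frames")
```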

2- Hardware-Based: This type mainly depends on hardware devices that capture or sense the signs performed by the user when attached to the signer's arm, hand, or fingers, and convert these signs into signals, images, or in some cases video. Motion sensors are the most widely utilized devices and can track the movements, position, shape, and velocity of the fingers and hands [18]. Electronic gloves serve as the predominant sensor technology employed for capturing hand pose and associated motion. They are affixed to both hands to acquire precise data on hand movements and gestures. The hand’s position, orientation, and location are calculated precisely thanks to the hundreds of sensors embedded in the gloves. The most significant advantage of this method is its fast response [19], and it can be highly accurate. However, since it depends on costly sensors, it cannot be considered an affordable method for most deaf people. Moreover, some systems suffer from relatively low accuracy or complicated structures, and the insufficient amount of information provided by wearable sensors often affects overall performance. Some popular examples of sensors are described below [20]:

Inertial Measurement Unit (IMU): An electronic device employed to measure and report an object's specific force, angular rate, acceleration, and sometimes its orientation with respect to an inertial reference frame. It typically consists of a combination of accelerometers, gyroscopes, and sometimes magnetometers.

Electromyography (EMG): A device that uses electrodes placed on or inserted into the skin near the muscle of interest to measure the muscle’s electrical activity and employs this bio-signal to detect movements.

Wi-Fi and Radar: These devices mainly depend on radio waves, broad beam radar, or spectrogram to detect in-air signal strength variation. They are employed to monitor the movements and positions of the deaf by capturing the reflections of radio waves off their body or hand movements. Radar systems can provide data on the dynamics and trajectories of sign language gestures. This information then can be used for analysis or recognition purposes.

Others include flex sensors and ultrasonic, mechanical, electromagnetic, and haptic technologies.

In general, although these methods exhibit higher speed and accuracy, the necessity for individuals to wear sensors remains impractical for the following reasons [21]:

It may burden the users, because they must carry electronic devices with them when moving.

Portable electronic devices require a battery, which needs to be charged from time to time.

Specific equipment is required to process the signals acquired from the wearable devices.

3- Hybrid-based: In this type, vision-based cameras are combined with other types of sensors, such as infrared depth sensors, to acquire multi-modal information about the shapes of the hands [22]. This approach requires calibration between the hardware and vision-based modalities, which can be particularly challenging. The purpose of a hybrid system is to enhance data acquisition and accuracy and to reduce the challenges and problems of both vision- and hardware-based approaches [23].

Sign Language Types

Static: A specific hand configuration and pose, depicted through a single image, is employed for the recognition of fingerspelled gestures of alphabets and digits. This recognition process relies solely on still images as input to predict and generate the corresponding output, without incorporating any movement. It is considered rather inconvenient, due to the time required to perform a prediction each time an input is given, and it relies mainly on handshapes, hand positions, and facial expressions to convey meaning [24].

Dynamic: Refers to a variant of sign language, in which signs are produced with movement. This form of communication encompasses not only handshapes and positions but also incorporates the movement of hands, arms, and other body parts to convey meaning. To capture and represent this type of sign language, video streams are required [25]. There are certain words in sign language, such as in American Sign Language, which necessitate hand movements for proper representation, making it a dynamic form. It plays a vital role in facilitating communication, as well as establishing linguistic and cultural identities within the deaf community. Dynamic signs find application in various contexts, including everyday conversations, education, storytelling, performances, and broadcasting. Broadly speaking, dynamic signs can be categorized into two main types based on what they represent, be it individual words or complete sentences. These are described below [26]:

Isolated: The input dynamic signs represent individual words, with pauses occurring only between words.

Continuous: Continuous dynamic inputs are mainly employed to represent sentences, as they incorporate multiple signs performed consecutively without any pause between them [27].

Deep Learning Background

A deep neural network is basically a branch of Machine Learning (ML) that was originally inspired by, and resembles, the human nervous system and the structure of the brain. It is composed of several layers and nodes, in which the layers are processing units organized into input, hidden, and output layers. The nodes or units in every layer are linked to nodes in contiguous layers, and every connection has its own weight value. The inputs are multiplied by the corresponding weights and summed at every unit. The summation result then undergoes a transformation through some type of activation function, such as the sigmoid, hyperbolic tangent, or Rectified Linear Unit (ReLU) [28]; a minimal numerical sketch of this per-unit computation is given after the list below. Thus, DL stacks many learning layers to learn high-level abstractions in the data by approximating highly nonlinear functions, giving the learning algorithm the ability to learn hierarchical features from the input data. This feature learning has largely replaced hand-engineered features and owes its resurgence to effective optimization methods and powerful computational resources [29]. The powerful properties of DL enable it to take the lead in achieving the desired results, owing to a set of factors including [30]:

Feature learning refers to the capacity to acquire descriptive features from data that have an impact on other correlated tasks. This implies that numerous relevant factors are disentangled within these features, in contrast to handcrafted features that are designed to remain constant with respect to the targeted factors.

Hierarchical Representation: the features in this type of method are represented in a hierarchical format, in which simple features are represented in the lower layers and the higher layers learn increasingly complex features. This provides a successful encoding of both local and global properties in the final feature representation.

Distributed Representation: Signifies a many-to-many relationship in which the representations are distributed. This occurs because multiple neurons can represent a single factor, while one neuron can account for multiple factors. Such an arrangement mitigates the curse of dimensionality and offers a compact and comprehensive representation.

Large-Scale Datasets: DL is able to handle datasets with a vast number of samples and delivers outstanding performance in many domains.
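As the minimal numerical sketch promised above, the NumPy example below stacks two fully connected layers: inputs are multiplied by weights, summed with a bias, and passed through a ReLU activation. All shapes and random values are illustrative only.

```python
# A minimal numerical sketch of the per-unit computation described above.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def dense_layer(x, weights, bias, activation=relu):
    """One fully connected layer: activation(W @ x + b)."""
    return activation(weights @ x + bias)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                # input vector
w1, b1 = rng.normal(size=(8, 4)), np.zeros(8)         # hidden layer parameters
w2, b2 = rng.normal(size=(3, 8)), np.zeros(3)         # output layer parameters

hidden = dense_layer(x, w1, b1)                       # hidden representation
logits = dense_layer(hidden, w2, b2, activation=lambda z: z)
print(logits.shape)                                   # (3,)
```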

In recent years, DL methods have demonstrated exceptional performance, surpassing previous state-of-the-art ML techniques across various domains. One domain in which DL has emerged as a prominent methodology is computer vision, particularly in the context of sign language recognition. DL has provided novel solutions to challenges in sign language recognition and has become a leading approach in this field [31]. Many DL architectures have been utilized for sign language recognition in an accurate, fast, and efficient manner, due to their ability to deal with most challenges and with the complexity of sign language [32]. The most popular and widely utilized DL architectures are the Convolutional Neural Network (CNN), Deep Boltzmann Machine (DBM), Deep Belief Network (DBN), Auto Encoder (AE), Variational Auto Encoder (VAE), Generative Adversarial Network (GAN), and Recurrent Neural Network (RNN), including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) [33].
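As an illustration of how these architectures are commonly combined for sign language video, the sketch below pairs a small CNN (per-frame spatial features) with an LSTM (temporal modeling) in PyTorch. The layer sizes, clip length, and class count are placeholder assumptions and do not reproduce the architecture of any specific work cited in this review.

```python
# Hedged sketch of a CNN+LSTM classifier for isolated sign video clips.
import torch
import torch.nn as nn

class CNNLSTMSignClassifier(nn.Module):
    def __init__(self, num_classes: int = 64, feat_dim: int = 128):
        super().__init__()
        # Small CNN applied to every frame to extract spatial features.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim), nn.ReLU(),
        )
        # LSTM models the temporal sequence of per-frame features.
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, clips):                           # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        frame_feats = self.cnn(clips.flatten(0, 1))     # (batch*time, feat_dim)
        seq_feats, _ = self.lstm(frame_feats.view(b, t, -1))
        return self.head(seq_feats[:, -1])              # class logits from last step

model = CNNLSTMSignClassifier()
dummy = torch.randn(2, 16, 3, 112, 112)                 # 2 clips of 16 RGB frames
print(model(dummy).shape)                                # torch.Size([2, 64])
```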

Sign Language Recognition Challenges Using Deep Learning

Detection, tracking, pose estimation, gesture recognition, and pose recovery represent key sub-fields within sign language recognition. These sub-fields are extensively employed in human-computer interaction applications that utilize DL techniques. Nevertheless, the recognition or conversion of signs performed by deaf individuals using DL presents a range of challenges that can significantly impact the output results. These challenges include:

Feature Extraction

Feature extraction is the process used to select and/or combine variables into features, effectively reducing the amount of data that must be processed while still accurately and completely describing the original data. Its role is to find the most compact and informative set of features, enhancing the efficiency of data storage and processing. Defining feature vectors remains the most common and convenient means of data representation for classification and regression [34]. In the context of sign language recognition, the process of extracting pertinent features plays a vital and decisive role. Irrelevant features, on the other hand, can result in misclassification and erroneous recognition [35]. Within the realm of DL techniques for data classification, automatic feature extraction holds paramount importance. Integrating the various features extracted from both training and testing images without any data loss is a crucial step that greatly impacts the recognition accuracy of sign language. In general, two types of features are considered in sign language: manual and non-manual features. Manual features encompass the movements of the hands, fingers, and arms. Non-manual features, such as facial expressions, eye gaze, head movements, upper body motion, and positioning, are likewise a fundamental component of sign language [36]. The combination of manual and non-manual features offers a comprehensive representation of sign language. In the domain of DL, features can also be classified into two categories: spatial and temporal. Spatial features pertain to the geometric representation of shapes within a specific coordinate space, while temporal features account for time-related aspects of movement, especially when dealing with a sequence of images as input. By employing feature fusion and combining multiple types of features in the process of sign language recognition using DL, one can achieve the desired outcomes effectively [37].
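As one hedged example of extracting manual features automatically, the sketch below uses the MediaPipe Hands solution to obtain 3-D hand landmarks and flatten them into a fixed-length vector. The two-hand, 21-landmark layout and zero padding are illustrative choices, and the API shown is the legacy `mediapipe.solutions` interface, which may differ across library versions.

```python
# Hedged sketch: manual (hand landmark) feature extraction with MediaPipe.
import numpy as np
import mediapipe as mp

_hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)

def hand_landmark_features(rgb_frame: np.ndarray) -> np.ndarray:
    """Return a flat vector of 2 hands x 21 landmarks x (x, y, z), zero-padded."""
    feats = np.zeros((2, 21, 3), dtype=np.float32)
    result = _hands.process(rgb_frame)
    if result.multi_hand_landmarks:
        for h, hand in enumerate(result.multi_hand_landmarks[:2]):
            for i, lm in enumerate(hand.landmark):
                feats[h, i] = (lm.x, lm.y, lm.z)
    return feats.reshape(-1)   # 126-dimensional manual-feature vector
```

Such a vector could then be concatenated with non-manual features (e.g., face landmarks) before classification, which is one simple way to realize the feature fusion mentioned above.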

Environment Conditions

The variability in the environment during sign capture poses a significant technical challenge with a notable impact on sign language recognition. When capturing an image, numerous factors, such as lighting conditions (spectra, source distribution, and intensity) and camera characteristics (sensor response and lenses), exert their influence on the appearance of the captured sign. Additionally, skin reflectance properties and internal camera controls [38] further contribute to these effects. Moreover, noise originating from other elements present in the background and landmarks can also influence sign recognition outcomes.
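A minimal preprocessing sketch for such lighting and noise effects is shown below: median filtering suppresses sensor noise, and histogram equalization of the luminance channel reduces illumination variation. The kernel size and the choice of the YCrCb color space are illustrative assumptions, not a prescribed recipe.

```python
# Hedged sketch: simple illumination and noise normalization with OpenCV.
import cv2

def normalize_illumination(bgr_frame):
    denoised = cv2.medianBlur(bgr_frame, 3)                 # suppress sensor noise
    ycrcb = cv2.cvtColor(denoised, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    y_eq = cv2.equalizeHist(y)                              # equalize only the luminance
    return cv2.cvtColor(cv2.merge((y_eq, cr, cb)), cv2.COLOR_YCrCb2BGR)
```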

Movement

The movements in sign language are dynamic acts, exhibiting trajectories with distinct beginnings and ends. The representation of dynamic sign language involves both isolated and continuous signing; in the latter, signs are performed consecutively without pauses. This introduces challenges related to similarity and occlusion, arising from variations in hand movements and orientations, involving one or both hands at different angles and directions [39]. Determining each sign's precise beginning and end presents a significant hurdle, resulting in what are termed Movement Epenthesis (ME) or transition segments. These ME segments act as connectors between sequential signs when transitioning from the final position of one sign to the initial position of the next. However, ME segments do not convey any specific sign information; instead, they add to the complexity of recognizing continuous sign sequences. The lack of well-defined rules for making such transitions poses a significant challenge [40], demanding careful attention and a sound approach to address effectively.

Hand Segmentation and Tracking

The segmentation process stands out as one of the most formidable challenges in computer vision, especially in the context of sign language recognition, where the extraction of hands from video frames or images holds particular significance due to their critical role in the recognition process. To address this, image segmentation is employed to isolate relevant hand data while eliminating undesired elements, such as background and other objects in the input, which might conflict with the classifier operations [41]. Image segmentation restricts the data region, enabling the classifier to focus solely on the Region of Interest (ROI). Segmentation methods can be categorized as contextual and non-contextual. Contextual segmentation takes spatial relationships between features into account, often using edge detection techniques. Conversely, non-contextual segmentation does not consider spatial relationships; rather, it gathers pixels based on global attributes. Hand tracking can be viewed as a subfield of segmentation, and it typically poses challenges, particularly when the hand moves swiftly, leading to significant appearance changes within a few frames [42].
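The following is a hedged sketch of a non-contextual approach that gathers pixels by a global attribute, namely skin color in the YCrCb space. The chromatic bounds are commonly quoted heuristics and would normally be tuned per dataset and lighting condition.

```python
# Hedged sketch: non-contextual hand segmentation by skin color.
import cv2
import numpy as np

SKIN_LOW = np.array([0, 133, 77], dtype=np.uint8)      # (Y, Cr, Cb) lower bound
SKIN_HIGH = np.array([255, 173, 127], dtype=np.uint8)  # (Y, Cr, Cb) upper bound

def segment_hand(bgr_frame):
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, SKIN_LOW, SKIN_HIGH)
    # Morphological opening removes small speckles before masking the frame.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.bitwise_and(bgr_frame, bgr_frame, mask=mask), mask
```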

Classifier

In the realm of sign language recognition, the classifier's selection and design require meticulous attention. It is essential to carefully determine the architecture of the classifier, encompassing its layers and parameters, in order to steer clear of potential problems like overfitting or underfitting. The primary objective is to achieve optimal performance in classifying sign language. Furthermore, the classifier's ability to generalize effectively across diverse data types, rather than being confined to specific subsets, is of paramount importance [43].
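As a hedged sketch of two standard guards against overfitting and underfitting when designing such a classifier, the snippet below adds dropout inside a small fully connected network and wraps training with early stopping on a validation loss. The layer sizes, class count, and patience value are placeholders, and `train_step`/`val_loss_fn` are assumed user-supplied callables rather than parts of any specific system reviewed here.

```python
# Hedged sketch: dropout regularization and early stopping for a sign classifier.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(126, 256), nn.ReLU(), nn.Dropout(p=0.5),   # dropout against overfitting
    nn.Linear(256, 64),                                   # 64 sign classes (placeholder)
)

def train_with_early_stopping(model, train_step, val_loss_fn, patience=5, max_epochs=100):
    best, waited = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(model)                  # one epoch over the training set
        val_loss = val_loss_fn(model)      # loss on held-out signers/samples
        if val_loss < best:
            best, waited = val_loss, 0
        else:
            waited += 1
            if waited >= patience:         # stop before the model overfits
                break
    return model
```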

Time and Complexity

Real-time recognition of sign language is an important concern and one of the main problems that needs a practical solution in order to provide an efficient interpreter that closes the communication gap between the deaf community and the general public. The time problem arises from the need to process video data in real time or with minimal delay. The computational complexity, in both hardware and software, can be quite demanding and may present challenges for the deaf community to deal with effectively [44].

Sign Language Public Datasets

The availability of sign language datasets is limited and can be considered one of the main obstacles to designing an accurate recognition system: few datasets are available for sign language compared to gesture databases. Several sign language datasets have been created with many variations, such as regional differences, type of images (RGB or depth), type of acquisition method (images, video), and so on. Sign language differs from one region to another, just like spoken languages, and each type has its own properties and linguistic grammar. The most publicly available and utilized sign and gesture datasets in different languages are described in this section and categorized by language, as illustrated in Table 1 and Fig2.

Public sign language datasets

Dataset Language Equipment Modalities Signers Samples
ASL alphabets [45] American Webcam RGB images - 87,000
MNIST [46] American Webcam Grey images - 27,455
ASL Fingerspelling A [47] American Microsoft Kinect RGB and depth images 5 48,000
NYU [48] American Kinect RGB and depth images 36 81,009
ASL by Surrey [49] American Kinect RGB and depth images 23 130,000
Jochen-Triesch [50] American Cam Grey images with different background 24 720
MKLM [51] American Leap Motion device and a Kinect sensor RGB and depth images 14 1400
NTU-HD [52] American Kinect sensor RGB and depth images 10 1000
HUST [53] American Microsoft Kinect RGB and depth images 10 10880
RVL-SLLL [54] American Cam RGB video 14
ChicagoFSWild [55] American Collected online from YouTube RGB video 160 7,304
ASLG-PC12 [56] American Cam RGB video - 880
American Sign Language Lexicon Video (ASLLVD) [57] American Cam RGB videos of different angles 6 3,300
MU [58] American Cam RGB images with illumination variations in five different angles 5 2515
ASLID [59] American Web cam RGB images 6 809
KSU-SSL [60] Arabic Cam and Kinect RGB Videos with uncontrolled environment 40 16000
KArSL [61] Arabic Kinect V2 RGB video 3 75,300
ArSL by University of Sharjah [62] Arabic Analog camcorder RGB images 3 3450
JTD [63] Indian Webcam RGB images with 3 different backgrounds 24 720
IISL2020 [64] Indian Webcam RGB video with uncontrolled environment 16 12100
RWTH-PHOENIX-Weather 2014 [65] German Webcam RGB Video 9 8,257
SIGNUM [66] German Cam RGB Video 25 33210
DEVISIGN-D [67] Chinese Cam RGB videos 8 6000
DEVISIGN-L [67] Chinese Cam RGB videos 8 24000
CSL-500 [68] Chinese Cam RGB, depth and skeleton videos 50 25,000
Chinese Sign Language [69] Chinese Kinect RGB, depth and skeleton videos 50 125000
38 BdSL [70] Bengali Cam RGB images 320 12,160
Ishara-Lipi [71] Bengali Cam Greyscale images - 1800
ChaLearn14 [72] Italian Kinect RGB and depth video 940 940
Montalbano II [73] Italian Kinect RGB and depth video 20 940
UFOP–LIBRAS [74] Brazilian Kinect RGB, depth and skeleton videos 5 2800
AUTSL [75] Turkish Kinect v2 RGB, depth and skeleton videos 43 38,336
RKS-PERSIANSIGN [76] Persian Cam RGB video 10 10,000
LSA64 [77] Argentine Cam RGB video 10 3200
Polytropon (PGSL) [78] Greek Cam RGB video 6 840
KETI [79] Korean Cam RGB video 40 14,672
Figure 2:

Samples of sign language datasets.

Several critical factors contribute to the evaluation of sign language datasets. One such factor is the number of signers involved in performing the signs, which significantly impacts the dataset's diversity and subsequently affects the evaluation of recognition systems' generalization rate. Additionally, the quantity of distinct signs within the datasets, particularly in isolated and continuous formats, holds considerable importance. Furthermore, the number of samples per sign plays a crucial role in training systems that require an ample representation of each sign. Adequate sample representation helps improve the robustness and accuracy of the recognition systems. Moreover, when dealing with continuous datasets, annotating them with temporal information for continuous sentence components is very important. This temporal information is vital for effectively processing and understanding this type of dataset [80]. Although sign language recognition is one of the applications of gesture recognition, gesture datasets are seldom utilized for sign language recognition, for several reasons. First, the number of classes in gesture recognition datasets is somewhat limited. Second, sign language involves the simultaneous use of manual and non-manual gestures, posing challenges in annotating both types of gestures within a single gesture dataset. Moreover, sign language relies on hand gestures, while gesture datasets are broader and include gestures involving full-body movements. Additionally, gesture datasets lack the necessary details about the fingers, which are essential for developing accurate sign language recognition systems [81]. Nevertheless, despite these limitations, gesture datasets can still play a role in training sign recognition systems. In this context, Table 2 presents a comprehensive overview of various gesture datasets, and Fig3 illustrates some representative examples.

Gesture public datasets

Name Modality Device Signers Samples
LMDHG [82] RGB, and depth videos Kinect and 21 608
SHREC Shape Retrieval Contest (SHREC) [83] RGB, and depth videos Intel RealSense short range depth camera 28 2800
UTD–MHAD [84] RGB, depth and skeleton videos Kinect and wearable inertial sensor 8 861
The Multicamera Human Action Video Data (MuHAVi) [85] RGB video 8 camera views 14 1904
NUMA [86] RGB, depth and skeleton videos 10 Kinect with three different views 10 1493
WEIZMANN [87] Low resolution RGB video Camera with 10 different viewpoints 9 90
NTU RGB [88] RGB, depth and skeleton videos Kinect 40 56,880
Cambridge hand gesture [89] RGB video captured under five different illuminations Cam 9 900
VIVA [90] RGB, and depth videos Kinect 8 885
MSR [91] RGB, and depth videos Kinect 10 320
CAD-60 [92] RGB and depth video in different environments, such as a kitchen, a living room, and office Kinect 4 48
HDM05MoCap (motion capture) [93] RGB video Cam 5 2337
CMU [94] RGB images CAM 25 204
isoGD [95] RGB and depth videos Kinect 21 47,933
NVIDIA [96] RGB and depth video Kinect 8 885
G3D [97] RGB and depth video Kinect 16 1280
UT Kinect [98] RGB and depth video Kinect 10 200
First-Person [99] RGB and depth video RealSense SR300 cam 6 1,175
Jester [100] RGB Cam 25 148,092
EgoGesture [101] RGB and depth video Kinect 50 2,081
NUS II [102] RGB images with complex backgrounds, and various hand shapes and sizes Cam 40 2000
Figure 3:

Samples of gesture datasets

Deep Learning based Sign Language Recognition-Related Works

Numerous research efforts have been dedicated to the recognition and translation of sign language across diverse languages worldwide, aiming to facilitate its conversion into other communication forms used by individuals, such as text or sound. This study categorizes works on sign language recognition using DL according to the primary challenges encountered in recognition and the corresponding solutions proposed by each of the investigated works. Any sign language recognition system consists of key stages, which include sign acquisition, hand segmentation, tracking, preprocessing, feature extraction, and classification, as depicted in Fig4.

Figure 4:

The procedural stages of sign language recognition

In sign acquisition, the input modalities, as mentioned earlier, are either an image or a video stream captured using a vision-based device, or depth information collected using hardware-based equipment. The input may be in any format, including RGB color, greyscale, and binary. In general, DL techniques need a sufficient amount of high-quality data samples for training to be conducted.
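For illustration, the small helper below converts an RGB input frame into the other two formats mentioned above (greyscale and binary) with OpenCV; the fixed threshold of 127 is an arbitrary assumption and would normally be tuned or replaced by an adaptive method such as Otsu's.

```python
# Illustrative conversion of an RGB frame into greyscale and binary formats.
import cv2

def to_modalities(rgb_frame):
    grey = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    _, binary = cv2.threshold(grey, 127, 255, cv2.THRESH_BINARY)
    return grey, binary
```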

Accuracy is one of the most common performance measurements considered in any type of recognition system, in addition to error percentages such as the Equal Error Rate, Word Error Rate, and False Rate. Another evaluation metric, the Bilingual Evaluation Understudy (BLEU) score, is used to measure how well the resulting sentences match the entered sign language. A perfect match results in a score of 1.0, while the worst score, representing a complete mismatch, is 0.0; it is therefore also considered a measurement of translation accuracy and is widely used in machine translation systems [103].
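As a hedged sketch of how such a BLEU score can be computed in practice, the example below uses NLTK's sentence-level BLEU with smoothing; the reference and hypothesis token lists are invented for illustration and do not come from any dataset discussed here.

```python
# Hedged sketch: scoring a translated sentence with BLEU using NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["i", "am", "going", "to", "school"]]      # ground-truth translation (invented)
hypothesis = ["i", "am", "going", "school"]             # system output (invented)

score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))   # 1.0 is a perfect match, 0.0 a complete mismatch
```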

The related sign language works using DL are categorized below based on the type of problem addressed and the technique utilized to obtain the desired result.

Related Works on Preprocessing Problem

The acquired signs may exhibit issues such as low quality, noise, varying degrees of orientation, or excessive size. Therefore, the preprocessing step becomes indispensable to rectify these issues in sign images and videos, effectively eliminating any environmental influences that might have affected them, such as variations in illumination and color. This phase involves the application of filters and other techniques to adjust the size, orientation, and color, ensuring improved data quality for subsequent analysis and recognition. The primary advantage of preprocessing is enhancing the image quality, which enables efficient hand segmentation from the scene for effective feature extraction. In the case of video streams, preprocessing serves to eliminate redundant and similar frames from the input video, thereby increasing the processing speed of the neural network without sacrificing essential information. Many DL-based sign language recognition works overcome the environmental condition problem using a variety of techniques. Table 3 illustrates the most important related works, the environmental condition being addressed, and the proposed technique. Fig5 shows samples from the NUS II dataset that illustrate the environmental condition problem.
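The following is a minimal sketch of the frame-reduction step described above: frames that differ little from the previously kept frame are dropped using a mean absolute difference test. The threshold value is an illustrative assumption.

```python
# Hedged sketch: drop near-duplicate frames before feeding a video to the network.
import cv2
import numpy as np

def drop_redundant_frames(frames, diff_threshold=8.0):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = []
    prev_grey = None
    for frame in frames:
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_grey is None or np.mean(cv2.absdiff(grey, prev_grey)) > diff_threshold:
            kept.append(frame)
            prev_grey = grey
    return kept
```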

Related works on SLR using DL that address the various environmental conditions problem.

Author(s) Year Language Modality Type of condition Technique Results
[130] 2018 Bengali RGB images Variant background and skin colors Modified VGG net 84.68%
[134] 2018 American RGB images noise and missing data Augmentation 98.13%
[150] 2018 Indian RGB video Different viewing angles, background lighting, and distance Novel CNN 92.88%
[158] 2019 American Binary images Noise Erosion, closing, contour generation, and polygonal approximation, 96.83%
[159] 2019 American Depth image Variant illumination, and background Attain depth images 88.7%
[164] 2019 Chinese RGB, and depth video Variant illumination, and background Two-stream spatiotemporal network 96.7%
[173] 2019 Indian RGB, and depth video Variant illumination, background, and camera distance Four stream CNN 86.87%
[178] 2020 Arabic RGB images Variant illumination, and skin color DCNN 94.31%
[179] 2020 Arabic RGB videos Variant illumination, background, pose, scale, shape, position, and clothes Bi-directional Long Short-Term Memory (BiLSTM) 89.59%
[180] 2020 Arabic RGB Videos Variant illumination, clothes, position, scale, and speed 3DCNN and SoftMax function 87.69%
[182] 2020 Arabic RGB Videos Variations in heights and distances from camera Normalization 84.3%
[194] 2020 Arabic RGB images variant illumination, and background VGG16 and the ResNet152 with enhanced softmax layer 99%
[201] 2020 American Grayscale images illumination, and skin color Set the hand histogram 95%
[202] 2020 American RGB images Variant illumination, background DCNN 99.96%
[206] 2021 Indian RGB video Variant illuminations, camera positions, and orientations Google net+ BiLSTM 76.21%
[207] 2021 Indian RGB images Light and dark backgrounds DCNN with few numbers of parameters 99.96%
[209] 2021 American RGB video Noise Gaussian Blur 99.63%
[213] 2021 Korean Depth Videos Low resolution Augmentation 91%
[224] 2021 Bengali RGB images Variant backgrounds, camera angle, light contrast, and skin tone Conventional deep learning + Zero-shot learning ZSL 93.68%
[225] 2021 Arabic RGB video Variant illumination, background, and clothes Inception-BiLSTM 84.2%
[227] 2021 American Thermal images Varying illumination Adopt live images taken by a low-resolution thermal camera 99.52%
[229] 2021 Indian RGB video Varying illumination 3DCNN 88.24%
[230] 2021 American RGB video Noise, varying illumination Median filtering + histogram equalization 96%
[236] 2021 Arabic RGB images Variant illumination, and background Region-based Convolutional Neural Network (R-CNN) 93.4%
[239] 2022 Indian RGB video Variant illumination, and views Grey scale conversion and histogram equalization 98.7%
[241] 2022 Arabic RGB video Variant illumination, and background CNN+ RNN 98.8%
[249] 2022 Arabic Greyscale images Variant illumination, and background Sobel filter 97%
[253] 2022 Arabic RGB, and depth video Variant Background ResNet50-BiLSTM 99%
[259] 2022 American RGB, and depth images Noise and illumination variation Median filtering and histogram equalization 91.4%
[261] 2022 American Skeleton video Noise in video frames An innovative weighted least square (WLS) algorithm 97.98%
[270] 2022 English Wi-Fi signal Noise and uncleaned Wi-Fi signals. Principal Component Analysis (PCA) 95.03%
Figure 5:

Sample images (class 9) from NUS hand posture dataset-II (data subset A), showing the variations in hand posture sizes and appearances.

Another challenge arises when attempting to recognize signs, particularly in the dynamic type, where movement is considered one of the key phonological parameters in sign phonology. This pertains to the variations in hand location, speed, orientation, and angles during the signing process [104]. A consensus on how to characterize and organize movement types and their associated features in a phonological representation has been lacking. Due to divergent approaches and perspectives, there remains uncertainty about the most suitable and standardized way to define and categorize movements in sign language. In general, there are three main types of movements in sign language [105,106]:

Movement of the hands and arms: include waving, pointing, or tracing shapes in the air.

Movement of the body: include twisting, turning, or leaning to indicate direction or location.

Movement of the face and head: include nodding, shaking the head, or raising the eyebrows to convey different meanings or emotions.

Demonstrating sign language movement also involves significant challenges, including dealing with similar movement paths (trajectory) and occlusion. Arm trajectory formation refers to the principles and laws that invariantly govern the selection, planning, and generation of multi-joint movements, as well as to the factors that dictate their kinematics, namely geometrical and temporal features [107]. The sign language movement trajectory swerves to some extent due to the action speed and arm length of the user; even for the same user, psychological changes result in an inconsistent signing speed [108]. Movement trajectory recognition is a key element of sign language translation research and directly influences translation accuracy, since the same sign performed with different movement trajectories usually conveys different meanings, i.e., represents a different sign [109]. On the other hand, occlusion means that some fingers or parts of the hand are covered (not in view of the camera) or hidden by other parts of the scene, so the sign cannot be detected accurately [110]. Occlusion may appear in various configurations, including hand/hand and hand/face, depending on the movement and the captured scene. Occlusion greatly affects the segmentation procedure, especially skin-based segmentation techniques [111]. Table 4 summarizes the most important related DL works that handle these types of problems in sign language recognition.

Related works on SLR using DL that address movement orientation, trajectory, occlusion problems.

Author(s) Year Type of variation Language Signing mode Model Accuracy Error Rate
[129] 2018 similarities, and occlusion American Static DCNN 92.4%
[135] 2018 Movement Brazilian Isolated Long-term Recurrent Convolutional Networks 99% -
[138] 2018 size, shape, and position of the fingers or hands American Static CNN 82% -
[140] 2018 Hand movement American Isolated VGG 16 99% -
[144] 2018 Movement American Isolated Leap Motion Controller 88.79% -
[145] 2018 3D motion Indian Isolated Joint Angular Displacement Maps (JADMs) 92.14%
[150] 2018 head and hand movements Indian Continuous CNN 92.88% -
[155] 2019 Hand movement Indian Continuous Wearable systems to measure muscle intensity, hand orientation, motion, and position 92.50% -
[156] 2019 Variant hand orientations Chinese Continuous Hierarchical Attention Network (HAN) and Latent Space 82.7% -
[165] 2019 Similarity and trajectory Chinese Isolated Deep 3-d Residual ConvNet + BiLSTM 89.8% -
[166] 2019 orientation of camera, hand position and movement, inter hand relation Vietnamese Isolated DCNN 95.83%
[173] 2019 Movement, self-occlusions, orientation, and angles Indian Continuous Four stream CNN 86.87%
[174] 2019 Movement in different distance from the camera American Static Novel DNN 97.29% -
[176] 2020 Angles, distance, object size, and rotations Arabic Static Image Augmentation 90% 0.53
[180] 2020 fingers' configuration, hand's orientation, and its position to the body Arabic Isolated Multilayer perceptron+ Autoencoder 87.69%
[185] 2020 Hand Movement Persian Isolated Single Shot Detector (SSD) +CNN+LSTM 98.42%
[186] 2020 shape, orientation, and trajectory Greek Isolated Fully convolutional attention-based encoder-decoder 95.31% -
[192] 2020 Trajectory Greek Isolated incorporate the depth dimension in the coordinates of the hand joints 93.56% -
[195] 2020 finger angles and multi-finger movements Taiwanese Continuous Wristband with ten modified barometric sensors+ dual DCNN 97.5%
[196] 2020 movement of fingers and hands Chinese Isolated Motion data from IMU sensors 99.81% -
[197] 2020 finger movement Chinese Isolated Trigno Wireless sEMG acquisition system used to collect multichannel sEMG signals of forearm muscles 93.33%
[199] 2020 finger and arm motions, two-handed signs, and hand rotation Chinese Continuous Two armbands embedded with an IMU sensor and multi-channel sEMG sensors are attached on the forearms to capture both arm, and finger movements - 10.8%
[76] 2020 Hand occlusion Persian Isolated Skeleton detection 99.8%
[204] 2020 Trajectory Brazilian Isolated Convert the trajectory information into spherical coordinates 64.33%
[210] 2021 Trajectory Arabic Isolated Multi-Sign Language Ontology (MSLO) 94.5%
[213] 2021 Movement Korean Isolated 3DCNN 91%
[214] 2021 finger movement Chinese Isolated Design a low-cost data glove with simple hardware structure to capture finger movement and bending simultaneously 77.42%
[218] 2021 Skewing, and angle rotation Bengali Static DCNN 99.57 0.56
[219] 2021 Hand motion American Continuous Sensing Gloves 86.67%
[223] 2021 spatial appearance and temporal motion Chinese Continuous Lexical prediction network 91.72% 6.10
[226] 2021 finger self-occlusions, view invariance Indian Continuous Motion modelled deep attention network (M2DA-Net) 84.95%
[228] 2021 Occlusions of hand/hand, hands/face, or hands/upper body postures American Continuous Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs): deep Long Short-Term Memory (LSTM) as generator and LSTM with 3D Convolutional Neural Network (3D-CNN) as discriminator 97% 1.4
[230] 2021 Variant view American Isolated 3-D CNN’s cascaded 96%
[233] 2021 Hand occlusion, Italian Isolated LSTM+CNN 99.08%
[237] 2021 Finger occlusion, motion blurring, variant signing styles Chinese Continuous Dual Network built upon a Graph Convolutional Network (GCN) 98.08%
[239] 2022 self-structural characteristics, and occlusion Indian Continuous Dynamic Time Warping (DTW) 98.7%
[240] 2022 High similarity and complexity American Static DCNN 99.67% 0.0016
[241] 2022 Movement Arabic Isolated The difference function 98.8%
[259] 2022 Hand Occlusion American Static Re-formation layer in the CNN 91.40%
[260] 2022 Trajectory, hand shapes, and orientation American Isolated Media Pipe’s Landmarks with GRU 99%
[261] 2022 ambiguous and 3D double-hand motion trajectories American Isolated 3D extended Kalman filter (EKF) tracking, and approximation of a probability density function over a time frame. 97.98%
[262] 2022 Movement Turkish Continuous Motion History Images (MHI) generated from RGB video frames 94.83%
[264] 2022 Movement Argentinian Continuous Accumulative video motion (AVM) technique 91.8%
[269] 2022 orientation angle, prosodic, and similarity American Continuous Robust fast Fisher vector (FFV) in a deep Bi-LSTM 98.33%
[270] 2022 variant length, sequential patterns, English Isolated Novel Residual-Multi Head model 95.03%
Related Works on Segmentation and Tracking Problem

Detecting the signer's hand in a still image or tracking it in a video stream is challenging and affected by many factors discussed earlier in the preprocessing phase, such as environment, movement, hand shape, and occlusion. Hence, the careful choice of an appropriate segmentation technique is of utmost importance, as it profoundly influences the recognition of sign language and the work of the subsequent phases (feature extraction and classification). Hand segmentation identifies the beginning and end of each sign. This is necessary for accurate recognition and understanding of the signer's message. Through the process of segmenting the sign language input, the recognition system can concentrate on discerning individual signs and their respective meanings, thereby avoiding the interpretation of the entire continuous signing stream as a single sign. In addition to enhancing recognition accuracy, segmentation contributes to system efficiency and speed. By dividing the input into distinct signs, the system can process each sign independently, reducing computational complexity and improving response time. Furthermore, segmentation facilitates advancements in sign language recognition technology by enabling the creation of sign language corpora annotated with information about individual signs. Such resources are valuable for training and evaluating sign language recognition systems and conducting linguistic research on sign language structure and syntax. Various segmentation techniques are employed, including background subtraction [112], skin color detection [113], template matching [114], optical flow [115], and machine learning [116]. Table 5 presents DL-based sign language recognition works that focus on addressing the segmentation and tracking challenges to achieve optimal system performance.

Related works on SLR using DL that address segmentation problem.

Author(s) Year Input Modality Segmentation method Results
[131] 2018 RGB image HSV color model 99.85%
[148] 2018 RGB image Skin segmentation algorithm based on color information 94.7%
[149] 2018 RGB images k-means-based algorithm 94.37%
[158] 2019 RGB images Color segmentation by MLP network 96.83%
[159] 2019 Depth image Wrist line localization by algorithm-based thresholding 88.7%
[164] 2019 RGB, and depth video Aligned Random Sampling in Segments (ARSS) 96.7%
[168] 2019 RGB, and depth images Depth based segmentation using data of Kinect RGB-D camera 97.71%
[171] 2019 RGB video Design an adaptive temporal encoder to capture crucial RGB visemes and skeleton signees 94.7%
[179] 2020 RGB videos Hand semantic Segmentation named as DeepLabv3+ 89.59 %
[180] 2020 RGB Videos Novel method based on open pose 87.69 %
[182] 2020 RGB Videos Viola and Jones, and human body part ratios 84.3%
[183] 2020 RGB images Robert edge detection method 99.3 %
[185] 2020 RGB video SSD is a feed-forward convolutional network A Non-Maximum Suppression (NMS) step is used in the final step to estimate the final detection 98.42%
[187] 2020 RGB images Sobel edge detector, and skin color by thresholding 98.89%
[188] 2020 RGB images Open-CV with a Region of Interest (ROI) box in the driver program 93%
[189] 2020 RGB Videos Frame stream density compression (FSDC) algorithm 10.73 error
[199] 2020 RGB Videos Design an attention-based encoder-decoder model to realize end-to-end continuous SLR without segmentation 10.8% WER
[200] 2020 RGB images Single Shot Multi Box Detection (SSD) 99.90%
[209] 2021 RGB Video Canny 99.63%
[216] 2021 RGB images Erosion, Dilation, and Watershed Segmentation 99.7 %
[219] 2021 RGB Video Data sliding window 86.67%
[236] 2021 RGB images R-CNN 93%
[239] 2022 RGB videos Novel Adaptive Hough Transform (AHT) 98.7%
[246] 2022 RGB images, and video Grad Cam and Cam shift algorithm 99.85%
[248] 2022 Grey images YCbCr, HSV and watershed algorithm 99.60%,
[249] 2022 RGB images Sobel operator method 97 %
[263] 2022 RGB images Semantic 99.91%
[267] 2022 RGB images R-CNN 99.7%
[268] 2022 RGB video Mask is created by extracting the maximum connected region in the foreground assuming it to be the hand+ Canny method 99%
Related Works on Feature Extraction Problem

The feature extraction goal is to capture the most essential information about the sign language gestures while removing any redundant or irrelevant information that may be present in the input data. The process of feature extraction offers numerous advantages in sign language recognition. It enhances accuracy by effectively representing the distinctive characteristics of each sign and gesture, thereby facilitating the system's ability to differentiate between them. Moreover, feature extraction reduces both processing time and computational complexity, as the extracted features are typically represented in a more compact and informative manner compared to raw input data. Additionally, feature extraction confers robustness against noise and variability, as features can be designed to be invariant to specific types of variations, such as changes in lighting conditions or background clutter [117,118]. This enables the recognition system to maintain its performance even in challenging and diverse environments. Table 6 shows related DL works for sign language recognition that focus on solving the problem of features extraction.

Related works on SLR using DL that address feature extraction problem.

Author(s) Year Dataset Technique Signing mode Feature(s) Result
[130] 2018 Collected DCNN Static Hand shape 84.6%
[135] 2018 Collected 3D CNN Isolated spatiotemporal 99%
[138] 2018 ASL Finger Spelling CNN Static depth and intensity 82%
[141] 2018 RWTH-2014 3D Residual Convolutional Network (3D-ResNet) Continuous Spatial information, and temporal connections across frames 37.3 WER
[143] 2018 Collected 3D-CNNs Isolated spatiotemporal 88.7%
[144] 2018 Collected DCNN Isolated hand palm sphere radius, and position of hand palm and fingertip 88.79%
[149] 2018 ASL Finger Spelling Histograms of oriented gradients, and Zernike moments Static Hand shape 94.37%
[150] 2018 Collected CNN Continuous Hand shape 92.88%
[151] 2018 Collected 3DRCNN Continuous/Isolated motion, depth, and temporal 69.2%
[152] 2018 SHREC Leap Motion Controller (LMC) sensor Isolated, static finger bones of hands. 96.4%
[153] 2018 Collected Hybrid Discrete Wavelet Transform, Gabor filter, and histogram of distances from Centre of Mass Static Hand shape 76.25%
[154] 2018 Collected DCNN Static Facial expressions 89%
[156] 2019 Collected Two-stream 3-D CNN Continuous Spatiotemporal 82.7%
[158] 2019 Collected CNN Static Hand shape 96.83%
[79] 2019 Collected Open Pose library Continuous human key points (hand, face, body) 55.2%
[159] 2019 ASL fingerspelling PCA Net Static hand shape (corners, edges, blobs, or ridges) 88.7%
[161] 2019 SIGNUM Stacked temporal fusion layers in DCNN Continuous spatiotemporal 2.80 WER
[162] 2019 Collected Leap motion device Continuous, Isolated 3D positions of the fingertips 72.3%, 89%
[163] 2019 Collected CNN Static Hand shape 95%
[164] 2019 CSL D-shift Net Continuous spatial and temporal features 96.7%
[165] 2019 DEVISIGN_D B3D Res-Net Isolated spatiotemporal 89.8%
[166] 2019 Collected Local and GIST Descriptor Isolated Spatial and scene-based features 95.83%
[169] 2019 Collected Restricted Boltzmann Machine (RBM) Isolated Handshape, and network generated features 88.2%
[170] 2019 KSU-SSL 3D-CNN Isolated hand shape, position, orientation, and temporal dependence in consecutive frames 77.32%
[171] 2019 Collected C3D, and Kinect device Continuous Temporal, and skeleton 94.7%
[175] 2019 Collected Open Pose library with Kinect V2 Static 3D skeleton 98.9%.
[177] 2020 Ishara-Lipi Mobile Net V1 Isolated Two hands shape 95.71%
[178] 2020 Collected DCNN Static Hand shape 94.31%.
[179] 2020 Collected Single layer Convolutional Self-Organizing Map (CSOM) Isolated Hand shape 89.59%
[180] 2020 KSU-SSL Enhanced C3D architecture Isolated Spatiotemporal of hand and body 87.69 %
[182] 2020 KSU-SSL 3DCNN Isolated Spatiotemporal 84.3%
[185] 2020 Collected ResNet50 model Isolated Hand shape, Extra Spatial hand Relation (ESHR) features, and Hand Pose (HP), temporal. 98.42%
[186] 2020 Polytropon (PGSL) ResNet-18 Isolated Optical flow of skeletal, handshapes, and mouthing 95.31%
[187] 2020 Collected Discrete cosines transform, Zernike moment, scale-invariant feature transform, and social ski driver optimization algorithm Static Hand shape 98.89%
[189] 2020 RWTH-2014 Temporal convolution unit and dynamic hierarchical bidirectional GRU unit Continuous spatiotemporal 10.73% BLEU
[191] 2020 Collected Standard score normalization on the raw Channel State Information (CSI) acquired from the Wi-Fi device, and MIFS algorithm Static, and continuous The cross-cumulant features (unbiased estimates of covariance, normalized skewness, normalized kurtosis) 99.9%
[192] 2020 GSL Open Pose human joint detector Isolated 3D hand skeletal, and region of hand, and mouth 93.56%
[197] 2020 Collected Four channel surface electromyography (sEMG) signals Isolated time-frequency joint features 93.33%
[199] 2020 Collected Euler angle, Quaternion from IMU signal Continuous Hand rotation 10.8% WER
[76] 2020 RKS-PERSIANSIGN 3DCNNs Isolated Spatiotemporal 99.8%
[202] 2020 ASL fingerspelling A DCNN Static Hand Shape 99.96%
[203] 2020 Collected Construct a color-coded topographical descriptor from joint distances and angles, to be used in 2 streams (CNN) Isolated distance and angular 93.01%
[204] 2020 Collected Two CNN models and a descriptor based on Histogram of cumulative magnitudes Isolated Two hands, skeleton, and body 64.33%
[208] 2021 RWTH-2014T Semantic Focus of Interest Network with Face Highlight Module (SFoI-Net-FHM) Isolated Body and facial expression 10.89 BLEU
[210] 2021 Collected (ConvLSTM) Isolated Spatiotemporal 94.5%
[212] 2021 Collected ResNet50 Static Hand area, the length of the axis of the first eigenvector, and hand position changes 96.42%
[214] 2021 Collected f-CNN (fusion of 1-D CNN and 2-D CNN) Isolated Time and spatial-domain features of finger resistance movement 77.42%
[217] 2021 MU Modified Alex Net and VGG16 Static Hand edges and shape 99.82%
[222] 2021 Collected VGG net of six convolutional layers Static Hand shape 97.62%
[224] 2021 38 BdSL DenseNet201, and Linear Discriminant Analysis Static Hand shape 93.68%
[225] 2021 KSU-ArSL Bi-LSTM Isolated spatiotemporal 84.2%
[226] 2021 Collected Paired pooling network in view pair pooling net (VPPN) Isolated spatiotemporal 84.95%
[228] 2021 ASLLVD Bayesian Parallel Hidden Markov Model (BPaHMM) + stacked denoising variational autoencoders (SD-VAE) + PCA Continuous Shape of hand, palm, and face, along with their position, speed, and distance between them 97%
[230] 2021 ASLLVD Cascaded 3-D CNNs Isolated Spatiotemporal 96.0%
[231] 2021 Collected Leap Motion controller Static and Isolated Sphere radius, angles between fingers, and their distances 91.82%
[232] 2021 RWTH-2014 (3+2+1)D ResNet Continuous Height, motion of hand, and frame blurriness levels 23.30 WER
[233] 2021 Montalbano II AlexNet + Optical Flow (OF) + Scene Flow (SF) methods Isolated Pixel level, and hand pose 99.08%
[234] 2021 RWTH-2014 GAN Continuous Spatiotemporal 23.4 WER
[235] 2021 MNIST DCNN Static Hand shape 98.58%
[236] 2021 Collected R-CNN Static Hand shape 93%
[237] 2021 CSL-500 Multi-scale spatiotemporal attention network (MSSTA) Isolated Spatiotemporal 98.08%
[242] 2022 MNIST modified CapsNet Static Spatial, and orientations 99.60%
[243] 2022 RKS-PERSIANSIGN Singular value decomposition SVD Isolated 3D hand key points between the segments of each finger, and their angles. 99.5%
[244] 2022 Collected 2DCRNN + 3DCRNN Continuous Spatiotemporal features out of small patches 99%
[246] 2022 Collected Atrous convolution mechanism and semantic spatial multi-cue model Static, Isolated Pose, face, and hand, along with spatial and full-frame features 99.85%
[253] 2022 Collected 4 DNN models using 2D and 3D CNN Isolated Spatiotemporal 99%
[255] 2022 Collected Scale-Invariant Feature Transformation (SIFT) Static Corner, edges, rotation, blurring, and illumination. 97.89%
[256] 2022 Collected InceptionResNetV2 Isolated Hand shape 97%
[257] 2022 Collected Alex net Static Hand shape 94.81%
[258] 2022 Collected Sensor + mathematical equations + CNN Continuous Mean, magnitude of mean, variance, correlation, covariance, and frequency-domain features + spatiotemporal 0.088 WER
[260] 2022 Collected MediaPipe framework Isolated Hands, body, and face 99%
[261] 2022 Collected Bi-RNN network, maximal information correlation, and leap motion controller Isolated hand shape, orientation, position, and motion of 3D skeletal videos. 97.98%
[264] 2022 LSA64 dynamic motion network (DMN)+ Accumulative motion network (AMN) Isolated spatiotemporal 91.8%
[265] 2022 CSL-500 Spatial–temporal–channel attention (STCA) Isolated Spatiotemporal 97.45%
[268] 2022 Collected SURF (Speeded Up Robust Features) Isolated Distribution of the intensity content within the neighborhood of the interest point 99%
[269] 2022 Collected Thresholding and Fast Fisher Vector Encoding (FFV) Isolated Hand, palm, finger shape, and position and 3D skeletal hand characteristics 98.33%
Related Works on the Classification Problem

Classification is the final phase of any sign language recognition system and is used before transferring the sign language into another form of data, whether text or sound. In general, a particular sign is recognized by comparing it with the trained dataset, in which the data are categorized into their respective classes depending on the obtained feature vector. Moreover, the system can calculate the probability associated with each class, allowing the data to be assigned to the respective class based on the probability values. Overall, the classification conditions for sign language using DL involve selecting an appropriate data representation, feature extraction techniques, classification algorithms, and evaluation metrics, and ensuring sufficient and diverse training data. These factors collectively contribute to the accuracy and effectiveness of the sign language classification system. However, classification may suffer from problems such as overfitting. In the realm of DL, overfitting occurs when a neural network model becomes so specialized in learning from the training data that it fails to generalize effectively to new, unseen data. In other words, the model "memorizes" the training examples instead of learning the underlying patterns or relationships. When a DL model overfits, it performs very well on the training data but struggles to accurately predict or classify new instances that it has not encountered during training [119]. Various causes and indicators of overfitting exist, including high model complexity with numerous parameters, insufficient training data, lack of regularization, excessive training epochs, and reliance on the training data for evaluation [120]. To mitigate overfitting in deep models, several effective techniques can be employed, including regularization methods [121], the incorporation of dropout layers [122], early stopping criteria [123], data augmentation strategies [124], and increasing the training data [125]. These techniques help to enhance model generalization and prevent the adverse effects of overfitting. Table 7 summarizes some related work on sign language recognition systems using DL that focuses on solving the problem of overfitting.
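As a minimal illustration of how several of these counter-measures can be combined in practice, the following sketch builds a small image classifier for static signs with L2 regularization, a dropout layer, on-the-fly data augmentation, and an early-stopping criterion. It is not taken from any of the reviewed papers; the class count, input size, and dataset objects are placeholder assumptions.

```python
# Illustrative sketch (assumptions: 26-class fingerspelling task, 64x64 RGB input).
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

NUM_CLASSES = 26          # placeholder class count
IMG_SHAPE = (64, 64, 3)   # placeholder input size

augment = tf.keras.Sequential([          # data augmentation, active only during training
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
])

model = models.Sequential([
    layers.Input(shape=IMG_SHAPE),
    augment,
    layers.Conv2D(32, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),   # L2 regularization
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                                       # dropout layer
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(                 # early stopping criterion
    monitor="val_loss", patience=5, restore_best_weights=True)

# Hypothetical usage with placeholder tf.data pipelines:
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])
```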

Table 7: Related works on SLR using DL that address the overfitting problem.

Author(s) Year Dataset Model Technique Result
[129] 2018 NTU DCNN Augmentation 92.4%
[130] 2018 Collected Modified VGG net Dropout 84.68%
[132] 2018 Ishara-Lipi DCNN Dropout 94.88%
[133] 2018 Collected DCNN small convolutional filter sizes, Dropout, and learning strategy 85.3%
[136] 2018 HUST Deep Attention Network (DAN) data augmentation 73.4%
[142] 2018 ASL Finger Spelling A DNN Dense Net 90.3%
[143] 2018 Collected 3DCNN SGD 88.7%
[146] 2018 SIGNUM CNN-HMM hybrid Augmentation 7.4 WER
[157] 2019 Collected DCNN Augmentation 93.667%
[79] 2019 Collected ResNet-152 batch size, Augmentation 55.28%
[163] 2019 Collected VGG-16 Dropout 95%
[166] 2019 Collected DCNN Augmentation 95.83%
[167] 2019 Collected DCNN Dense Net 90.3%
[171] 2019 Collected LSTM Increase hidden state number 94.7%
[172] 2019 NVIDIA Squeeze-net Augmentation 83.29%
[173] 2019 G3D Four-stream CNN Sharing of multi-modal features with RGB spatial features during training, and dropout 86.87%
[175] 2019 Collected DCNN Augmentation 98.9%
[176] 2020 Collected DCNN Pooling Layer 90%
[181] 2020 Collected DCNN Epochs reduced to 30, and dropout added after each max-pooling layer 97.6%
[184] 2020 Collected CNN with 8 layers Augmentation 89.32%
[188] 2020 MNIST CNN Dropout 93%
[190] 2020 Collected Enhanced Alex Net Augmentation 89.48%
[191] 2020 Collected SVM Augmentation, and k-fold cross validation 99.9%
[193] 2020 KETI CNN+LSTM New data augmentation 96.2%
[194] 2020 Collected VGG16, and ResNet152 with enhanced softmax layer Augmentation 99%
[196] 2020 Collected RNN-LSTM dropout layer (DR) 99.81%
[201] 2020 Collected CNN dropout layer, and augmentation 95%
[203] 2020 NTU Two-stream CNN Randomness in feature-interlocking fusion with dropout 93.01%
[207] 2021 Jochen-Triesch's DCNN Two dropout layers 99.96%
[214] 2021 Collected Generic temporal convolutional network (TCN) Dropout 77.42%
[215] 2021 Collected DCNN Dropout 96.65%
[216] 2021 Collected DCNN Cyclical learning rate method 99.7%
[217] 2021 MU Modified AlexNet and VGG16 Augmentation 99.82%
[222] 2021 Collected CNN Dropout 97.62%
[229] 2021 Collected 3DCNN Dropout & Regularization 88.24%
[236] 2021 Collected ResNet-18 Zero-patience stopping criteria 93.4%
[238] 2021 Collected DCNN Synthetic Minority Oversampling Technique (SMOTE) 97%
[240] 2022 Collected DCNN Augmentation 99.67%
[253] 2022 Collected ResNet50-BiLSTM Augmentation 99%
[256] 2022 Collected LSTM, and GRU Dropout 97%
[263] 2022 BdSL CNN Augmentation 99.91%

Another critical issue that must be considered when designing a deep model for sign language recognition is generalization, which refers to the capability of a model to operate accurately on unseen data that is distinct from the training data. A model demonstrates a high degree of generalization ability by consistently achieving strong performance across a wide range of diverse and distinct datasets [126]. Consistent results across different datasets are an important characteristic for a model to be considered robust and reliable, demonstrating that it can be applied effectively to various real-world scenarios. Datasets can have different characteristics, biases, or noise levels; therefore, it is crucial to carefully evaluate and validate the model's performance on each specific dataset to ensure its reliability and generalization ability [127]. Table 8 presents relevant works in sign language recognition using DL, focusing on the model's generalization ability by evaluating its performance on diverse datasets.
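The following minimal sketch illustrates one way such a cross-dataset check can be organized: a model trained on one corpus is evaluated, without fine-tuning, on held-out sets drawn from other sources, and a large accuracy drop signals poor generalization. The dataset names and loaders are hypothetical and only indicate the intent; the works in Table 8 each follow their own evaluation protocols.

```python
# Illustrative sketch (assumed Keras model and assumed tf.data.Dataset loaders).
import tensorflow as tf

def evaluate_across_datasets(model, eval_sets):
    """eval_sets: dict mapping a dataset name to a tf.data.Dataset of
    (image, label) batches drawn from a different source or recording setup."""
    scores = {}
    for name, ds in eval_sets.items():
        loss, acc = model.evaluate(ds, verbose=0)   # no fine-tuning on the new corpus
        scores[name] = acc
    return scores

# Hypothetical usage:
# scores = evaluate_across_datasets(model, {
#     "dataset_A_same_signers": test_ds_a,
#     "dataset_B_new_signers":  test_ds_b,
#     "dataset_C_new_lighting": test_ds_c,
# })
# print(scores)   # comparable accuracies across the sets indicate robustness
```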

Table 8: Related works on SLR using DL that aim to achieve generalization.

Author(s) Year Datasets Technique Result
[129] 2018 ASL finger spelling A, NTU DCNN 92.4%, 99.7%
[134] 2018 NYU, MU, ASL Fingerspelling A, ASL Surrey Restricted Boltzmann Machine (RBM) 90.01%, 99.31%, 98.13%, 97.56%
[136] 2018 NTU, HUST DAN 98.5%, 73.4%
[143] 2018 Collected CSL, ChaLearn14 3D-CNN 88.7%, 95.3%
[145] 2018 Collected, MD05, CMU JADM+CNN 88.59%, 87.92%, 87.27%
[146] 2018 RWTH 2012, RWTH 2014, SIGNUM CNN-HMM hybrid 30.0 WER, 32.5 WER, 7.4 WER
[156] 2019 Collected, RWTH-2014 Hierarchical Attention Network (HAN) + Latent Space (LS-HAN) 82.7%, 61.6%
[161] 2019 RWTH-2014, SIGNUM DCNN 22.86 WER, 2.80 WER
[164] 2019 CSL, IsoGD Proposed multimodal two-stream CNN 96.7%, 63.78%
[165] 2019 DEVISIGN-D, Collected Deep 3-D Residual ConvNet + BiLSTM 89.8%, 86.9%
[170] 2019 KSU-SSL, ArSL, RVL-SLLL 3D-CNN 77.32%, 34.90%, 70%
[173] 2019 Collected RGB-D, MSR, UT Kinect, G3D Four-stream CNN 86.87%, 86.98%, 85.23%, 88.68%
[174] 2019 Jochen-Triesch, MKLM, novel SI-PSL Novel DNN 97.29%, 96.8%, 51.88%
[182] 2020 KSU-SSL, ArSL by University of Sharjah, RVL-SLLL 3DCNN 84.38%, 34.9%, 70%
[186] 2020 PGSL, ChicagoFSWild, RWTH-2014T DCNN 95.31%, 92.63%, 76.30%
[187] 2020 ASL, MU Deep Elman recurrent neural network 98.89%, 97.5%
[192] 2020 GSL, ChicagoFSWild CNN 93.56%, 91.38%
[76] 2020 NYU, First-Person, RKS-PERSIANSIGN CNN 4.64 error, 91.12%, 99.8%
[202] 2020 NUS, American fingerspelling A DCNN 94.7%, 99.96%
[203] 2020 HDM05, CMU, NTU, Collected Two-stream CNN 93.42%, 92.67%, 94.42%, 93.01%
[204] 2020 UTD–MHAD, IsoGD, Collected Linear SVM classifier 94.81%, 67.36%, 64.33%
[207] 2021 Collected RGB images, Jochen-Triesch's DCNN 99.96%, 100%
[210] 2021 LSA64, LSA, Collected 3DCNN 98.5%, 99.2%, 94.5%
[211] 2021 ASLG-PC12, RWTH-2014 GRU and LSTM with Bahdanau and Luong's attention mechanisms 66.59%, 19.56% BLEU
[221] 2021 ASL alphabet, ASL MNIST, MSL Optimized CNN based on PSO 99.58%, 99.58%, 99.10%
[225] 2021 KSU-ArSL, Jester, NVIDIA Inception-BiLSTM 84.2%, 95.8%, 86.6%
[226] 2021 Collected, NTU, MuHAVi, WEIZMANN, NUMA Motion-modelled deep attention network (M2DA-Net) 84.95%, 89.98%, 85.12%, 82.25%, 88.25%
[228] 2021 RWTH-2014, ASLLVD Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs) 73.9%, 97%
[232] 2021 RWTH-2014, Collected Bidirectional Encoder Representations from Transformers (BERT) + ResNet 20.1 WER, 23.30 WER
[233] 2021 Montalbano II, isoGD, MSR, CAD-60 LSTM+CNN 99.08%, 86.10%, 98.40%, 95.50%
[234] 2021 RWTH-2014, CSL, GSL GAN 23.4, 2.1, 2.26 WER
[237] 2021 CSL-500, DEVISIGN-L Dual network built upon a Graph Convolutional Network (GCN) 98.08%, 64.57%
[242] 2022 SLDD, MNIST Modified CapsNet architecture (SLR-CapsNet) 99.52%, 99.60%
[243] 2022 RKS-PERSIANSIGN, First-Person, ASVID, isoGD Single-shot detector, 2D convolutional neural network, singular value decomposition (SVD), and LSTM 99.5%, 91%, 93%, 86.1%
[247] 2022 Collected, Collected, ASL finger spelling DCNN + diffGrad optimizer 92.43%, 88.01%, 99.52%
[248] 2022 38 BdSL, Collected, Ishara-Lipi BenSignNet 94.00%, 99.60%, 99.60%
[251] 2022 Collected, Collected, Collected DCNN 99.41%, 99.48%, 99.38%
[254] 2022 Collected, Cambridge hand gesture Hybrid model based on VGG16-BiLSTM 83.36%, 97%
[255] 2022 Collected, MNIST, JTD, NUS Hybrid Fist CNN 97.89%, 95.68%, 94.90%, 95.87%
[256] 2022 ASL, GSL, AUTSL, IISL2020 LSTM+GRU 95.3%, 94%, 95.1%, 97.1%
[261] 2022 Collected, SHREC, LMDHG DLSTM 97.98%, 96.99%, 97.99%
[262] 2022 AUTSL, Collected 3D-CNN 93.53%, 94.83%
[265] 2022 CSL-500, Jester, EgoGesture Deep R(2+1)D 97.45%, 97.05%, 94%
[266] 2022 MU, HUST-ASL End-to-end fine-tuning of a pre-trained CNN model with score-level fusion 98.14%, 64.55%
[269] 2022 SHREC, Collected, LMDHG FFV-Bi-LSTM 92.99%, 98.33%, 93.08%

The choice of DL layers significantly influences the classification model's performance, as it determines the model's architecture and its ability to learn and represent intricate patterns in the input data. Selecting the right layers requires a comprehensive understanding of the data's characteristics, the problem complexity, and the resources available for training and inference. It often necessitates experimentation, tuning, and domain expertise to discover the optimal combination of layers that maximizes classification performance for a particular task [128]. In sign language recognition, numerous authors have designed and utilized deep models to achieve the desired performance levels, as depicted in Table 9.
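As a simple illustration of how the layer stack follows the input modality, the sketch below contrasts a plain 2D CNN for static hand images with a per-frame CNN followed by an LSTM for isolated sign videos. The layer types and sizes are arbitrary assumptions rather than any specific model from Table 9.

```python
# Illustrative sketch (assumed input sizes and class counts).
import tensorflow as tf
from tensorflow.keras import layers, models

def static_sign_cnn(num_classes, img_shape=(64, 64, 3)):
    # Spatial features only: convolution and pooling over a single frame.
    return models.Sequential([
        layers.Input(shape=img_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),
    ])

def isolated_sign_cnn_lstm(num_classes, frames=16, img_shape=(64, 64, 3)):
    # Spatiotemporal features: a per-frame CNN whose outputs feed an LSTM.
    return models.Sequential([
        layers.Input(shape=(frames, *img_shape)),
        layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D()),
        layers.TimeDistributed(layers.Flatten()),
        layers.LSTM(128),
        layers.Dense(num_classes, activation="softmax"),
    ])
```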

Table 9: Related works' classifiers employed in SLR using DL.

Author(s) Year Input modality Classifier Result
[129] 2018 Static DCNN 92.4%
[131] 2018 Static DCNN 99.85%
[133] 2018 Static DCNN 85.3%
[134] 2018 Static Restricted Boltzmann machine 98.13%
[135] 2018 Isolated LRCNs and 3D CNNs 99%
[136] 2018 Static DAN 73.4%
[137] 2018 Static CNNs of varying depths and stacked denoising autoencoders 92.83%
[139] 2018 Static DCNN 82.5%
[142] 2018 Static DCNN 90.3%
[145] 2018 Isolated DCNN 88.59%
[146] 2018 Continuous CNN-HMM hybrid 7.4 WER
[147] 2018 Static DCNN 98.05%
[151] 2018 Isolated 3DCNN and enhanced fully connected (FCRNN) 69.2%
[155] 2019 Continuous Deep Capsule networks and game theory 92.50%
[156] 2019 Continuous Hierarchical Attention Network (HAN) and Latent Space 82.7%
[157] 2019 Static DCNN 93.667%
[160] 2019 Static DCNN 97 %
[161] 2019 Continuous DCNN 2.80 WER
[162] 2019 Continuous, Isolated Modified LSTM 72.3%, 89%
[167] 2019 Isolated DCNN based on DenseNet 90.3%
[168] 2019 Static DCNN 97.71%
[176] 2020 Static DCNN 90%
[181] 2020 Static DCNN 97.6%
[184] 2020 Static Eight CNN layers + stochastic pooling, batch normalization, and dropout 89.32%
[185] 2020 Isolated Cascaded model (SSD, CNN, LSTM) 98.42%
[187] 2020 Static Deep Elman recurrent neural network 98.89%
[188] 2020 Static DCNN 93%
[190] 2020 Static Enhanced Alex Net 89.48%
[198] 2020 Static Multimodality fine-tuned VGG16 CNN + Leap Motion network 82.55%
[199] 2020 Continuous Multi-channel CNN 10.8 WER
[200] 2020 Static Hybrid model based on Inception v3 + SVM 99.90%
[201] 2020 Static 11 Layer CNN 95%
[205] 2021 Static Three-layered CNN model 90.8%
[206] 2021 Isolated Hybrid deep learning with convolutional LSTM and BiLSTM 76.21%
[209] 2021 Isolated DCNN + sentiment analysis 99.63%
[211] 2021 Continuous GRU+LSTM 19.56 BLEU
[214] 2021 Isolated Generic temporal convolutional network 77.42%
[215] 2021 Static DCNN 96.65%
[216] 2021 Static DCNN 99.7%
[220] 2021 Static Pretrained InceptionV3 + mini-batch gradient descent optimizer 85%
[221] 2021 Static PSO algorithm applied to find the optimal parameters of the convolutional neural network 99.58%
[223] 2021 Continuous Visual hierarchy to lexical sequence alignment network (H2SNet) 91.72%
[227] 2021 Static Novel lightweight deep learning model based on a bottleneck motivated by deep residual learning 99.52%
[228] 2021 Continuous Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs) 97%
[229] 2021 Isolated 3DCNN 88.24%
[232] 2021 Continuous Bidirectional Encoder Representations from Transformers (BERT) + ResNet 23.30 WER
[234] 2021 Continuous Generative Adversarial Network (SLRGAN) 23.4 WER
[238] 2021 Static DCNN 97%
[239] 2022 Static Optimized DCNN using a hybridization of Electric Fish Optimization (EFO) and the Whale Optimization Algorithm (WOA), called the Electric Fish-based Whale Optimization Algorithm (E-WOA) 98.7%
[241] 2022 Isolated CNN+ RNN 98.8%
[242] 2022 Static Modified CapsNet architecture, (SLR-CapsNet) 99.60%
[245] 2022 Static DCNN 99.52%
[247] 2022 Static DCNN+ diffGrad optimizer 88.01%
[250] 2022 Static DCNN 92%
[251] 2022 Static DCNN 99.38%
[252] 2022 Static Lightweight CNN 94.30%
[254] 2022 Isolated Hybrid model based on VGG16-BiLSTM 83.36%
Related Works on the Time and Delay Problem

In real-world classification scenarios using DL, time and delay are principal factors to consider. It is important to strike a balance between achieving accurate classification results and minimizing the time required. The specific requirements and constraints of the application, such as the desired response time or the available computational resources, should be considered when designing and deploying DL models. As a result, one of the major requirements that makes a sign language recognition system efficient is its recognition time, which can be measured as sketched below. Table 10 illustrates the related DL works for sign language recognition that focus on improving the recognition time.
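A minimal sketch of how per-sample recognition time can be measured is given below. The model object, input shape, and run counts are assumptions, and the timing convention (averaging repeated forward passes after a warm-up) is only one possible protocol rather than the one used by the works in Table 10.

```python
# Illustrative sketch (assumes a Keras-style model exposing predict()).
import time
import numpy as np

def average_latency_ms(model, sample, warmup=10, runs=100):
    """Run repeated timed forward passes on one input and report the mean latency in ms."""
    for _ in range(warmup):                 # warm-up passes exclude one-off setup cost
        model.predict(sample, verbose=0)
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(sample, verbose=0)
    return (time.perf_counter() - start) / runs * 1000.0

# Hypothetical usage with a model expecting 64x64 RGB frames:
# sample = np.random.rand(1, 64, 64, 3).astype("float32")
# print(f"{average_latency_ms(model, sample):.1f} ms per sign")
```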

Table 10: Related works on SLR using DL that aim to minimize the required time.

Discussion

Designing systems for recognizing sign language has become an emerging need in society and has attracted the attention of academics and practitioners, due to its significant role in eliminating the communication barriers between the hearing and deaf communities. However, many challenges appear when trying to design a sign language recognition system, such as dynamic gestures, environmental conditions, the availability of public datasets, and multidimensional feature vectors. Still, many researchers are attempting to develop accurate, generalized, reliable, and robust sign language recognition models using deep learning. Deep learning technology is widely applied in many fields and research areas, such as speech recognition, image processing, graphs, medicine, and computer vision. With the emergence of DL approaches, sign language recognition has managed to significantly improve its accuracy. From the previous tables, which illustrate some promising related works on sign language recognition using DL architectures, it is noticed that the most widely utilized deep architecture is the CNN. Convolutional Neural Networks (CNNs) exhibit a remarkable capacity to extract discriminative features from raw data, enabling them to achieve impressive results in several types of sign language recognition tasks. They demonstrate robustness and flexibility, being employed either independently or in combination with other architectures, such as Long Short-Term Memory (LSTM), to enhance performance in sign language recognition. Moreover, CNNs prove to be highly advantageous in handling multi-modality data, such as RGB-D data, skeleton information, and finger points. These modalities provide rich information about the signer's actions, and their utilization has been instrumental in addressing multiple challenges in sign language recognition. A set of related works focuses on solving only one type of problem facing sign language recognition using DL, such as [132, 137, 139, 141, 147, 148, 152, 153, 154, 160, 169, 177, 195, 198, 205, 208, 212, 218, 220, 231, 235, 244, 247, 250, 252, 257, 258, 266], while others try to solve multiple problems, such as [185, 199]. The most widely used feature is the spatiotemporal one, which depends on the hand shape and the location information of the hand [135, 143, 156, 161, 165, 180, 182, 189, 76, 210, 225, 226, 230, 234, 237, 244, 253, 264, 265]. However, there are works that make use of more than one type of feature in addition to the spatiotemporal ones, such as facial expression, skeleton, hand orientation, and angles [138, 141, 144, 151, 152, 79, 159, 162, 164, 166, 170, 171, 175, 185, 186, 191, 192, 197, 199, 203, 204, 208, 212, 228, 231, 232, 233, 245, 246, 255, 258, 260, 261, 268, 269]. Some works apply separate feature extraction techniques rather than depending only on the features extracted by DL, and still manage to obtain good recognition results [149, 152, 153, 79, 159, 162, 166, 169, 171, 175, 177, 179, 187, 189, 191, 192, 197, 199, 203, 204, 208, 228, 231, 233, 235, 237, 245, 246, 255, 258, 260, 261, 265, 268, 269]. Recent works, especially from 2020 onwards, focus on developing recognition systems for continuous sentences in sign language, which remains an open problem that gathers the most attention and has not been completely solved or employed in any commercial application.
Two factors may contribute to improved accuracy in continuous sign language recognition: feature extraction from the frame sequences of the input video, and the alignment between the features of every segment in the video and its corresponding sign label (a sketch of one common realization of this alignment is given at the end of this discussion). Acquiring features from video frames that are more descriptive and discriminative results in better performance. While recent models in continuous sign language recognition show an uptrend in performance by exploiting DL abilities in computer vision and Natural Language Processing (NLP), there is still much room for performance enhancement in this area. Among the main problems that many researchers deal with are trajectory [186, 192, 204, 210, 260] and occlusion [129, 173, 76, 226, 228, 233, 237, 239, 259]. Furthermore, selecting or designing an appropriate deep model to deal with a particular type of challenge in sign language recognition is itself a main challenge, addressed by a variety of studies in order to reach the desired accuracy goal. Others focus on solving classification problems such as overfitting, which leads to the failure of the system. Applying a recognition system to more than one dataset with different properties indicates high generalization and is one of the major factors that make a system highly effective. Thus, many researchers implement their sign language recognition systems on more than one dataset with considerable variation, although they do not achieve the same results across datasets, as in [129, 136, 143, 146, 156, 161, 164, 170, 182, 186, 204, 228, 234, 237, 254, 266]. Consequently, based on the information gathered from the preceding tables, deep learning stands out as a potent approach that has achieved the most impressive outcomes in sign language recognition. However, it is important to note that no existing research has comprehensively tackled all the associated challenges. Some studies prioritize achieving high accuracy without considering time constraints, while others concentrate on addressing feature extraction issues and functioning under various environmental conditions. Yet, there is a lack of consideration for the complexity and overall applicability of the models. In addition, a significant aspect not extensively discussed in the related works pertains to hardware cost and complexity, both of which exert a substantial impact on the efficiency of the recognition system, particularly in real-world applications.
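For the alignment problem referred to above, one technique widely used for sequence prediction without frame-level labels is the Connectionist Temporal Classification (CTC) objective; it is not discussed explicitly in this review, but the hedged sketch below indicates how a recurrent encoder over per-frame features could be trained against sentence-level gloss labels with a CTC loss. All shapes, sizes, and names are assumptions, not a specific model from the reviewed works.

```python
# Illustrative sketch (assumptions: 100-gloss vocabulary, 512-dim per-frame features
# from some CNN backbone; the last class index is reserved as the CTC blank).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_GLOSSES = 100      # placeholder vocabulary size
FEATURE_DIM = 512      # placeholder per-frame feature size

frame_features = layers.Input(shape=(None, FEATURE_DIM))     # variable-length video
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(frame_features)
gloss_probs = layers.Dense(NUM_GLOSSES + 1, activation="softmax")(x)   # +1 for blank
encoder = models.Model(frame_features, gloss_probs)

def ctc_loss(y_true, y_pred, input_len, label_len):
    # y_true: padded gloss-index sequences; y_pred: per-frame gloss probabilities;
    # input_len / label_len: lengths of each video and label sequence in the batch.
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)
```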

Conclusions and Future Work

Sign language recognition has come a long way from its beginnings in recognizing alphabets and digits to recognizing words and sentences. Recent sign language recognition systems achieve a degree of success in dealing with dynamic signs based on hand and body motion, captured by vision-based or hardware devices. The use of DL for sign language recognition has raised system performance to a higher level and confirmed its effectiveness in recognizing signs of different forms, including letters, words, and sentences, captured using different devices, and in converting them into another form such as text or sound. In this paper, related works on the use of DL for sign language recognition from 2018 to 2022 have been reviewed, and it is concluded that DL reaches the desired performance, with high results in many aspects. Nevertheless, there remains room for further improvement to develop a comprehensive system capable of effectively handling all the challenges encountered in sign language recognition. The goal is to achieve accurate and rapid results across various environmental conditions while utilizing diverse datasets. As future work, the primary objective is to address the issue of generalization and minimize the time needed for sign language recognition. Our objective is to present a deep learning model that can provide precise and highly accurate recognition outcomes for various types of sign language, encompassing both static and dynamic signs in different languages, including English, Arabic, Malay, and Chinese. Notably, this model aims to achieve these outcomes while minimizing hardware expenses and the required training time, with high recognition accuracy.