
Deep Learning for Sign Language Recognition: A Comparative Review

Jun 15, 2024


Introduction

Communication plays an essential role in individuals’ lives, enabling them to gain and exchange knowledge, interact, develop social relationships, and express feelings and needs. While most humans communicate verbally, those with limited verbal abilities need to communicate using sign language (SL). Sign language is a visual language used by deaf individuals; it relies on various parts of the body, including the fingers, hands, arms, head, torso, and facial expressions, rather than the vocal tract, to convey information [1]. According to the World Federation of the Deaf, there are more than seventy million deaf people around the world who use more than 300 sign languages [2]. However, sign language is not widely known among individuals with typical hearing and communication abilities, and only a few of them are able to understand or learn it. This reveals a genuine communication gap between deaf individuals and the rest of society. Automated recognition and translation of sign language would help break down these barriers by providing a comfortable communication platform between deaf and hearing individuals and giving deaf individuals the same opportunities to obtain information as everyone else [3].

Machine translation demonstrates a remarkable capacity for overcoming linguistic barriers, particularly through the use of Deep Learning (DL), a branch of the field. Deep learning exhibits outstanding performance in diverse domains, including image classification, pattern recognition, and many other applications [4]. The advancement of DL networks has brought a significant surge in performance, particularly in video-related tasks such as human action recognition, motion capture, and gesture recognition [5,6,7]. DL techniques offer remarkable attributes that make them highly advantageous for Sign Language Recognition (SLR). This is primarily attributed to their hidden layers, which autonomously extract latent features, as well as their capacity to effectively handle the intricate nature of hand gestures in sign language. This is achieved by leveraging extensive datasets, enabling accurate outcomes without the time-consuming processes that often characterize conventional translation methods [8].

This paper reviews various deep learning models used to recognize sign language, highlights the key challenges encountered when applying deep learning to sign language recognition, and identifies the issues that remain unresolved. Additionally, it provides suggestions for overcoming challenges that, to the best of our knowledge, have not yet been solved.

Motivation

The communication gap that exists between hearing and deaf individuals is the most important motivation for designing and building an interpreter to facilitate communication between them. When embarking on the design of such a translator, a comprehensive set of objectives must be taken into account. These include ensuring accuracy, speed, efficiency, scalability, and other factors that contribute to delivering a satisfactory translation outcome for both parties involved. However, numerous challenges have been identified in the realm of sign language recognition, necessitating the development of an efficient and robust system to address various issues related to environmental conditions, movement speed, occlusions, and adherence to linguistic rules. Deep-learning-based sign language recognition models have gained significant interest in the last few years due to the quality of the recognition and translation that they provide and their ability to deal with the various sign language recognition challenges.

Contribution

The main contributions of this work are:

Provide a description of important concepts related to sign language, including acquisition methods, types of sign language, and a description of many public datasets in different languages around the world.

Identify the various challenges and problems encountered in the implementation of sign language recognition using DL.

Review more than 140 related works for DL-based sign language recognition from the year 2018 to 2022.

Classify these relevant works according to the specific problem addressed and the technique or method employed to overcome the specified challenge or problem.

Paper Organization

This paper is organized into eight main sections as described in Fig1. To facilitate a smoother reading of this review, a detailed description of each section is presented below:

Introduction: Provides a brief introduction to sign language and deep learning, describes the motivation behind this review, introduces the main contributions, and illustrates the main layout of this work.

Sign language Overview: Provides a comprehensive overview of sign language, encompassing its historical context and the fundamental principles employed in its construction and development. Additionally, it includes a description of the various forms used to represent letters, words, and sentences in sign language, as well as the acquisition methods employed for capturing sign language.

Deep Learning Background: Introduces the historical background of DL networks, their structures, properties, layers, and commonly utilized architectures.

Sign Language Recognition Challenges Using Deep Learning: Describes the main challenges and problems facing the recognition of sign language using DL.

Sign Language Public Datasets Description: Presents an overview of widely accessible sign language datasets, encompassing various languages and types (such as images and videos) and available in different formats (including RGB, depth, and skeleton data). Additionally, it provides a description of public action datasets related to sign language.

Deep learning-based sign language-related works: Introduces a considerable number of promising related works for sign language using DL techniques from 2018 to 2022 that are organized based on the type of problem being addressed.

Discussion: Discusses the results and methods utilized by the presented related works.

Conclusion: Concludes the review and presents the conclusions reached from this survey of sign language recognition using DL, in addition to a set of recommendations for future research in this area.

Figure 1:

Paper Organization

Sign Language Overview

Sign language (SL) serves as a crucial means of communication for individuals who experience difficulties in speaking or hearing. Unlike spoken language, understanding sign language does not rely on auditory perception, nor does it involve vocalization. Instead, sign language is primarily conveyed through the simultaneous combination of hand shapes, orientations, and movements, as well as facial expressions, making it a visual language [9]. Historically, linguistic studies of sign language began in the 1970s [10] and show that, like spoken languages, it arranges elementary units called phonemes into meaningful semantic units that carry linguistic information. Sign language is not derived from spoken languages; instead, it has its own independent vocabulary and grammatical constructions [11]. However, the signs used by deaf individuals possess an internal structure similar to spoken words. Just as a limited number of sounds can generate hundreds of thousands of words in spoken languages, signs are formed from a finite set of gestural features. Thus, signs are not merely gestures; they are actually groups of linguistically significant features. There is a common misapprehension that there is only a single, universal sign language. Just like spoken languages, sign languages evolve and grow naturally across time and space [12]. Many countries have their own national sign languages, and there are also regional variations and local dialects. Moreover, signs do not have a one-to-one mapping to specific words. Therefore, sign language recognition is a complex process that extends beyond a simple substitution of individual signs with their corresponding spoken language counterparts. This is attributed to the fact that sign languages possess distinct vocabularies and grammatical structures that are not confined to any particular spoken language. Furthermore, even within regions where the same spoken language is used, there can be significant variations in the sign languages employed [13].

Sign Language Acquisition Modalities

The signs of sign language must be captured to provide input for the recognition system, and there are various acquisition techniques that provide several types of input, such as images, video, and signals. Basically, the main acquisition methods for any sign language recognition system depend on one of the following techniques.

1- Vision-Based: In this type of system, signs are captured using one or more image-capturing devices, in the form of single images or a video stream; in some cases an active device is used to collect depth information, which accurately represents the distance between the image plane and the relevant object in the scene [14]. This category is easy to use and has a low computational cost. There are many imaging devices for capturing signs in the form of RGB and depth data, including [15]:

Single camera: Uses only one camera, such as a webcam, digital camera, video camera, or smartphone camera.

Stereo camera: Combines multiple monocular or thermal cameras to capture depth information.

Active methods: Utilizes the projection of structured light using devices such as Kinect and Leap Motion Controller (LMC), which are 3D cameras that can gather movement and skeletal data.

Other methods, such as body markers in the form of colored gloves, wristbands, and LED lights.

Generally, the major advantages of vision-based methods are that they are inexpensive, convenient, and non-intrusive. The user simply needs to communicate using sign language naturally in front of an image-capturing device. This makes it suitable for real-time applications [16]; a minimal capture sketch is given after the list of problems below. However, the use of vision-based input suffers from a set of problems, including [17]:

Too much redundant information causes low recognition efficiency.

Low recognition accuracy, due to occlusion and motion blur.

Variations in signing style between individuals result in poor generalization of algorithms.

A small recognizable vocabulary, because large-vocabulary datasets contain many similar words.

Challenging issues related to time, speed, and overlapping signs.

A need for additional feature extraction methods to operate correctly.
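To make the vision-based pipeline concrete, the following is the minimal capture sketch mentioned above: it grabs a short frame sequence from a single webcam with OpenCV. The device index, frame count, and 224x224 target size are illustrative assumptions rather than values prescribed by any particular system.

```python
# Minimal sketch of vision-based sign acquisition with a single webcam.
import cv2

def capture_frames(device_index: int = 0, target_size=(224, 224), max_frames=64):
    """Grab a short sequence of RGB frames suitable as input to a recognizer."""
    cap = cv2.VideoCapture(device_index)
    frames = []
    try:
        while len(frames) < max_frames:
            ok, frame_bgr = cap.read()
            if not ok:
                break
            # OpenCV returns BGR; most DL pipelines expect RGB.
            frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            frames.append(cv2.resize(frame_rgb, target_size))
    finally:
        cap.release()
    return frames

if __name__ == "__main__":
    clip = capture_frames()
    print(f"Captured {len(clip)} frames")
```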

2- Hardware-Based: This type mainly depends on hardware devices that capture or sense the signs performed by the user when attached to the signer's arm, hand, or fingers, and convert these signs into signals, images, or in some cases video. Motion sensors are the most widely utilized devices and can track the movements, position, shape, and velocity of the fingers and hands [18]. Electronic gloves serve as the predominant sensor technology employed for capturing hand pose and associated motion. They are affixed to both hands to acquire precise data on hand movements and gestures. The hand’s position, orientation, and location are calculated precisely thanks to the hundreds of sensors embedded in the gloves. The most significant advantage of this method is its fast response [19], and it can be highly accurate. However, since it depends on costly sensors, it cannot be considered an affordable method for most deaf people. Moreover, some systems suffer from relatively low accuracy or complicated structures, and the insufficient amount of information provided by wearable sensors often affects overall performance. Some popular examples of sensors are described below [20]:

Inertial Measurement Unit (IMU): An electronic device employed to measure and report an object's specific force, angular rate, acceleration, and sometimes its orientation with respect to an inertial reference frame. It typically consists of a combination of accelerometers, gyroscopes, and sometimes magnetometers.

Electromyography (EMG): A device that uses electrodes placed on or inserted into the skin near the muscle of interest to measure the muscle’s electrical activity and employs this bio-signal to detect movements.

Wi-Fi and Radar: These devices mainly depend on radio waves, broad beam radar, or spectrogram to detect in-air signal strength variation. They are employed to monitor the movements and positions of the deaf by capturing the reflections of radio waves off their body or hand movements. Radar systems can provide data on the dynamics and trajectories of sign language gestures. This information then can be used for analysis or recognition purposes.

Others include flex sensors and ultrasonic, mechanical, electromagnetic, and haptic technologies.

In general, although these methods exhibit higher speed and accuracy, the necessity for individuals to wear sensors remains impractical for the following reasons [21]:

It may burden the users, because they must carry electronic devices with them when moving.

Portable electronic devices require a battery, which needs to be charged from time to time.

Specific equipment is required to process the signals acquired from the wearable devices.

3- Hybrid-based: In this type, vision-based cameras are combined with other types of sensors, such as infrared depth sensors, to acquire multi-modal information about the shapes of the hands [22]. This approach requires calibration between the hardware and vision-based modalities, which can be particularly challenging. The purpose of a hybrid system is to enhance data acquisition and accuracy and to reduce the challenges and problems of both vision- and hardware-based approaches [23].

Sign Language Types

Static: A specific hand configuration and pose, depicted through a single image, is employed for the recognition of fingerspelled gestures of alphabets and digits. This recognition process relies solely on still images as input to predict and generate the corresponding output, without incorporating any movement. It is considered rather inconvenient, due to the time required to perform a prediction each time an input is given, and it relies mainly on handshapes, hand positions, and facial expressions to convey meaning [24].

Dynamic: Refers to a variant of sign language, in which signs are produced with movement. This form of communication encompasses not only handshapes and positions but also incorporates the movement of hands, arms, and other body parts to convey meaning. To capture and represent this type of sign language, video streams are required [25]. There are certain words in sign language, such as in American Sign Language, which necessitate hand movements for proper representation, making it a dynamic form. It plays a vital role in facilitating communication, as well as establishing linguistic and cultural identities within the deaf community. Dynamic signs find application in various contexts, including everyday conversations, education, storytelling, performances, and broadcasting. Broadly speaking, dynamic signs can be categorized into two main types based on what they represent, be it individual words or complete sentences. These are described below [26]:

Isolated: The input dynamic signs represent individual words, with pauses occurring only between words.

Continuous: Continuous dynamic inputs are mainly employed to represent sentences, as they incorporate multiple signs performed consecutively without any pause between them [27].

Deep Learning Background

A deep neural network is basically a branch of Machine Learning (ML) that was originally inspired by, and resembles, the human nervous system and the structure of the brain. It is composed of several layers and nodes, in which the layers are processing units organized into input, hidden, and output layers. The nodes or units in every layer are linked to nodes in contiguous layers, and every connection has its own weight value. The inputs are multiplied by the corresponding weights and summed at every unit. The summation result then undergoes a transformation through some type of activation function, such as the sigmoid, hyperbolic tangent, or Rectified Linear Unit (ReLU) [28]; a minimal numerical sketch of this per-unit computation is given after the list below. Thus, DL stacks many learning layers to learn high-level abstractions in the data by approximating highly nonlinear functions, giving the learning algorithm the ability to learn hierarchical features from the input data. This feature learning has largely replaced hand-engineered features and owes its resurgence to effective optimization methods and powerful computational resources [29]. The powerful properties of DL enable it to take the lead in achieving the desired results, owing to a set of factors including [30]:

Feature learning refers to the capacity to acquire descriptive features from data that have an impact on other correlated tasks. This implies that numerous relevant factors are disentangled within these features, in contrast to handcrafted features that are designed to remain constant with respect to the targeted factors.

Hierarchical Representation: the features in this type of method are represented in a hierarchical format, in which simple features are represented in the lower layers and the higher layers learn increasingly complex features. This provides a successful encoding of both local and global properties in the final feature representation.

Distributed Representation: Signifies a many-to-many relationship in which the representations are distributed. This occurs because multiple neurons can represent a single factor, while one neuron can account for multiple factors. Such an arrangement mitigates the curse of dimensionality and offers a compact and comprehensive representation.

Large-Scale Datasets: DL is able to handle datasets with a vast number of samples and delivers outstanding performance in many domains.
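As the minimal numerical sketch promised above, the NumPy example below stacks two fully connected layers: inputs are multiplied by weights, summed with a bias, and passed through a ReLU activation. All shapes and random values are illustrative only.

```python
# A minimal numerical sketch of the per-unit computation described above.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def dense_layer(x, weights, bias, activation=relu):
    """One fully connected layer: activation(W @ x + b)."""
    return activation(weights @ x + bias)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                # input vector
w1, b1 = rng.normal(size=(8, 4)), np.zeros(8)         # hidden layer parameters
w2, b2 = rng.normal(size=(3, 8)), np.zeros(3)         # output layer parameters

hidden = dense_layer(x, w1, b1)                       # hidden representation
logits = dense_layer(hidden, w2, b2, activation=lambda z: z)
print(logits.shape)                                   # (3,)
```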

In recent years, DL methods have demonstrated exceptional performance, surpassing previous state-of-the-art ML techniques across various domains. One domain in which DL has emerged as a prominent methodology is computer vision, particularly in the context of sign language recognition. DL has provided novel solutions to challenges in sign language recognition and has become a leading approach in this field [31]. Many DL architectures have been utilized for sign language recognition in an accurate, fast, and efficient manner, due to their ability to deal with most challenges and with the complexity of sign language [32]. The most popular and widely utilized DL architectures are the Convolutional Neural Network (CNN), Deep Boltzmann Machine (DBM), Deep Belief Network (DBN), Auto Encoder (AE), Variational Auto Encoder (VAE), Generative Adversarial Network (GAN), and Recurrent Neural Network (RNN), including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) [33].
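As an illustration of how these architectures are commonly combined for sign language video, the sketch below pairs a small CNN (per-frame spatial features) with an LSTM (temporal modeling) in PyTorch. The layer sizes, clip length, and class count are placeholder assumptions and do not reproduce the architecture of any specific work cited in this review.

```python
# Hedged sketch of a CNN+LSTM classifier for isolated sign video clips.
import torch
import torch.nn as nn

class CNNLSTMSignClassifier(nn.Module):
    def __init__(self, num_classes: int = 64, feat_dim: int = 128):
        super().__init__()
        # Small CNN applied to every frame to extract spatial features.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim), nn.ReLU(),
        )
        # LSTM models the temporal sequence of per-frame features.
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, clips):                           # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        frame_feats = self.cnn(clips.flatten(0, 1))     # (batch*time, feat_dim)
        seq_feats, _ = self.lstm(frame_feats.view(b, t, -1))
        return self.head(seq_feats[:, -1])              # class logits from last step

model = CNNLSTMSignClassifier()
dummy = torch.randn(2, 16, 3, 112, 112)                 # 2 clips of 16 RGB frames
print(model(dummy).shape)                                # torch.Size([2, 64])
```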

Sign Language Recognition Challenges Using Deep Learning

Detection, tracking, pose estimation, gesture recognition, and pose recovery represent key sub-fields within sign language recognition. These sub-fields are extensively employed in human-computer interaction applications that utilize DL techniques. Nevertheless, the recognition or conversion of signs performed by deaf individuals using DL presents a range of challenges that can significantly impact the output results. These challenges include:

Feature Extraction

Feature extraction is the process used to select and/or combine variables into features, effectively reducing the amount of data that must be processed while still accurately and completely describing the original data. Its role is to find the most compact and informative set of features, enhancing the efficiency of data storage and processing. Defining feature vectors remains the most common and convenient means of data representation for classification and regression [34]. In the context of sign language recognition, the process of extracting pertinent features plays a vital and decisive role. Irrelevant features, on the other hand, can result in misclassification and erroneous recognition [35]. Within the realm of DL techniques for data classification, automatic feature extraction holds paramount importance. Integrating the various features extracted from both training and testing images without any data loss is a crucial step that greatly impacts the recognition accuracy of sign language. In general, two types of features are considered in sign language: manual and non-manual features. Manual features encompass the movements of the hands, fingers, and arms. Non-manual features, such as facial expressions, eye gaze, head movements, upper body motion, and positioning, are likewise a fundamental component of sign language [36]. The combination of manual and non-manual features offers a comprehensive representation of sign language. In the domain of DL, features can also be classified into two categories: spatial and temporal. Spatial features pertain to the geometric representation of shapes within a specific coordinate space, while temporal features account for time-related aspects of movement, especially when dealing with a sequence of images as input. By employing feature fusion and combining multiple types of features in the process of sign language recognition using DL, one can achieve the desired outcomes effectively [37].
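As one hedged example of extracting manual features automatically, the sketch below uses the MediaPipe Hands solution to obtain 3-D hand landmarks and flatten them into a fixed-length vector. The two-hand, 21-landmark layout and zero padding are illustrative choices, and the API shown is the legacy `mediapipe.solutions` interface, which may differ across library versions.

```python
# Hedged sketch: manual (hand landmark) feature extraction with MediaPipe.
import numpy as np
import mediapipe as mp

_hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)

def hand_landmark_features(rgb_frame: np.ndarray) -> np.ndarray:
    """Return a flat vector of 2 hands x 21 landmarks x (x, y, z), zero-padded."""
    feats = np.zeros((2, 21, 3), dtype=np.float32)
    result = _hands.process(rgb_frame)
    if result.multi_hand_landmarks:
        for h, hand in enumerate(result.multi_hand_landmarks[:2]):
            for i, lm in enumerate(hand.landmark):
                feats[h, i] = (lm.x, lm.y, lm.z)
    return feats.reshape(-1)   # 126-dimensional manual-feature vector
```

Such a vector could then be concatenated with non-manual features (e.g., face landmarks) before classification, which is one simple way to realize the feature fusion mentioned above.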

Environment Conditions

The variability in the environment during sign capture poses a significant technical challenge with a notable impact on sign language recognition. When capturing an image, numerous factors, such as lighting conditions (spectra, source distribution, and intensity) and camera characteristics (sensor response and lenses), exert their influence on the appearance of the captured sign. Additionally, skin reflectance properties and internal camera controls [38] further contribute to these effects. Moreover, noise originating from other elements present in the background and landmarks can also influence sign recognition outcomes.
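A minimal preprocessing sketch for such lighting and noise effects is shown below: median filtering suppresses sensor noise, and histogram equalization of the luminance channel reduces illumination variation. The kernel size and the choice of the YCrCb color space are illustrative assumptions, not a prescribed recipe.

```python
# Hedged sketch: simple illumination and noise normalization with OpenCV.
import cv2

def normalize_illumination(bgr_frame):
    denoised = cv2.medianBlur(bgr_frame, 3)                 # suppress sensor noise
    ycrcb = cv2.cvtColor(denoised, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    y_eq = cv2.equalizeHist(y)                              # equalize only the luminance
    return cv2.cvtColor(cv2.merge((y_eq, cr, cb)), cv2.COLOR_YCrCb2BGR)
```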

Movement

The movements in sign language are dynamic acts, exhibiting trajectories with distinct beginnings and ends. The representation of dynamic sign language involves both isolated and continuous signing; in the latter, signs are performed consecutively without pauses. This introduces challenges related to similarity and occlusion, arising from variations in hand movements and orientations, involving one or both hands at different angles and directions [39]. Determining each sign's precise beginning and end presents a significant hurdle, resulting in what are termed Movement Epenthesis (ME) or transition segments. These ME segments act as connectors between sequential signs when transitioning from the final position of one sign to the initial position of the next. However, ME segments do not convey any specific sign information; instead, they add to the complexity of recognizing continuous sign sequences. The lack of well-defined rules for making such transitions poses a significant challenge [40], demanding careful attention and a sound approach to address effectively.

Hand Segmentation and Tracking

The segmentation process stands out as one of the most formidable challenges in computer vision, especially in the context of sign language recognition, where the extraction of hands from video frames or images holds particular significance due to their critical role in the recognition process. To address this, image segmentation is employed to isolate relevant hand data while eliminating undesired elements, such as background and other objects in the input, which might conflict with the classifier operations [41]. Image segmentation restricts the data region, enabling the classifier to focus solely on the Region of Interest (ROI). Segmentation methods can be categorized as contextual and non-contextual. Contextual segmentation takes spatial relationships between features into account, often using edge detection techniques. Conversely, non-contextual segmentation does not consider spatial relationships; rather, it gathers pixels based on global attributes. Hand tracking can be viewed as a subfield of segmentation, and it typically poses challenges, particularly when the hand moves swiftly, leading to significant appearance changes within a few frames [42].
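The following is a hedged sketch of a non-contextual approach that gathers pixels by a global attribute, namely skin color in the YCrCb space. The chromatic bounds are commonly quoted heuristics and would normally be tuned per dataset and lighting condition.

```python
# Hedged sketch: non-contextual hand segmentation by skin color.
import cv2
import numpy as np

SKIN_LOW = np.array([0, 133, 77], dtype=np.uint8)      # (Y, Cr, Cb) lower bound
SKIN_HIGH = np.array([255, 173, 127], dtype=np.uint8)  # (Y, Cr, Cb) upper bound

def segment_hand(bgr_frame):
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, SKIN_LOW, SKIN_HIGH)
    # Morphological opening removes small speckles before masking the frame.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.bitwise_and(bgr_frame, bgr_frame, mask=mask), mask
```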

Classifier

In the realm of sign language recognition, the classifier's selection and design require meticulous attention. It is essential to carefully determine the architecture of the classifier, encompassing its layers and parameters, in order to steer clear of potential problems like overfitting or underfitting. The primary objective is to achieve optimal performance in classifying sign language. Furthermore, the classifier's ability to generalize effectively across diverse data types, rather than being confined to specific subsets, is of paramount importance [43].
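As a hedged sketch of two standard guards against overfitting and underfitting when designing such a classifier, the snippet below adds dropout inside a small fully connected network and wraps training with early stopping on a validation loss. The layer sizes, class count, and patience value are placeholders, and `train_step`/`val_loss_fn` are assumed user-supplied callables rather than parts of any specific system reviewed here.

```python
# Hedged sketch: dropout regularization and early stopping for a sign classifier.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(126, 256), nn.ReLU(), nn.Dropout(p=0.5),   # dropout against overfitting
    nn.Linear(256, 64),                                   # 64 sign classes (placeholder)
)

def train_with_early_stopping(model, train_step, val_loss_fn, patience=5, max_epochs=100):
    best, waited = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(model)                  # one epoch over the training set
        val_loss = val_loss_fn(model)      # loss on held-out signers/samples
        if val_loss < best:
            best, waited = val_loss, 0
        else:
            waited += 1
            if waited >= patience:         # stop before the model overfits
                break
    return model
```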

Time and Complexity

Real-time recognition of sign language is an important concern and one of the main problems that needs a practical solution in order to provide an efficient interpreter that closes the communication gap between the deaf community and the general public. The time problem arises from the need to process video data in real time or with minimal delay. The computational complexity, in both hardware and software, can be quite demanding and may present challenges for the deaf community to deal with effectively [44].

Sign Language Public Datasets

The availability of sign language datasets is limited and can be considered one of the main obstacles to designing an accurate recognition system: few datasets are available for sign language compared to gesture databases. Several sign language datasets have been created with many variations, such as regional differences, type of images (RGB or depth), type of acquisition method (images, video), and so on. Sign language differs from one region to another, just like spoken languages, and each type has its own properties and linguistic grammar. The most publicly available and utilized sign and gesture datasets in different languages are described in this section and categorized by language, as illustrated in Table 1 and Fig2.

Public sign language datasets

Dataset Language Equipment Modalities Signers Samples
ASL alphabets [45] American Webcam RGB images - 87,000
MNIST [46] American Webcam Grey images - 27,455
ASL Fingerspelling A [47] American Microsoft Kinect RGB and depth images 5 48,000
NYU [48] American Kinect RGB and depth images 36 81,009
ASL by Surrey [49] American Kinect RGB and depth images 23 130,000
Jochen-Triesch [50] American Cam Grey images with different background 24 720
MKLM [51] American Leap Motion device and a Kinect sensor RGB and depth images 14 1400
NTU-HD [52] American Kinect sensor RGB and depth images 10 1000
HUST [53] American Microsoft Kinect RGB and depth images 10 10880
RVL-SLLL [54] American Cam RGB video 14
ChicagoFSWild [55] American Collected online from YouTube RGB video 160 7,304
ASLG-PC12 [56] American Cam RGB video - 880
American Sign Language Lexicon Video (ASLLVD) [57] American Cam RGB videos of different angles 6 3,300
MU [58] American Cam RGB images with illumination variations in five different angles 5 2515
ASLID [59] American Web cam RGB images 6 809
KSU-SSL [60] Arabic Cam and Kinect RGB Videos with uncontrolled environment 40 16000
KArSL [61] Arabic Kinect V2 RGB video 3 75,300
ArSL by University of Sharjah [62] Arabic Analog camcorder RGB images 3 3450
JTD [63] Indian Webcam RGB images with 3 different backgrounds 24 720
IISL2020 [64] Indian Webcam RGB video with uncontrolled environment 16 12100
RWTH-PHOENIX-Weather 2014 [65] German Webcam RGB Video 9 8,257
SIGNUM [66] German Cam RGB Video 25 33210
DEVISIGN-D [67] Chinese Cam RGB videos 8 6000
DEVISIGN-L [67] Chinese Cam RGB videos 8 24000
CSL-500 [68] Chinese Cam RGB, depth and skeleton videos 50 25,000
Chinese Sign Language [69] Chinese Kinect RGB, depth and skeleton videos 50 125000
38 BdSL [70] Bengali Cam RGB images 320 12,160
Ishara-Lipi [71] Bengali Cam Greyscale images - 1800
ChaLearn14 [72] Italian Kinect RGB and depth video 940 940
Montalbano II [73] Italian Kinect RGB and depth video 20 940
UFOP–LIBRAS [74] Brazilian Kinect RGB, depth and skeleton videos 5 2800
AUTSL [75] Turkish Kinect v2 RGB, depth and skeleton videos 43 38,336
RKS-PERSIANSIGN [76] Persian Cam RGB video 10 10,000
LSA64 [77] Argentine Cam RGB video 10 3200
Polytropon (PGSL) [78] Greek Cam RGB video 6 840
KETI [79] Korean Cam RGB video 40 14,672
Figure 2:

Samples of sign language datasets.

Several critical factors contribute to the evaluation of sign language datasets. One such factor is the number of signers involved in performing the signs, which significantly impacts the dataset's diversity and subsequently affects the evaluation of recognition systems' generalization rate. Additionally, the quantity of distinct signs within the datasets, particularly in isolated and continuous formats, holds considerable importance. Furthermore, the number of samples per sign plays a crucial role in training systems that require an ample representation of each sign. Adequate sample representation helps improve the robustness and accuracy of the recognition systems. Moreover, when dealing with continuous datasets, annotating them with temporal information for continuous sentence components is very important. This temporal information is vital for effectively processing and understanding this type of dataset [80]. Although sign language recognition is one of the applications of gesture recognition, gesture datasets are seldom utilized for sign language recognition, for several reasons. First, the number of classes in gesture recognition datasets is somewhat limited. Second, sign language involves the simultaneous use of manual and non-manual gestures, posing challenges in annotating both types of gestures within a single gesture dataset. Moreover, sign language relies on hand gestures, while gesture datasets are broader and include gestures involving full-body movements. Additionally, gesture datasets lack the necessary details about the fingers, which are essential for developing accurate sign language recognition systems [81]. Nevertheless, despite these limitations, gesture datasets can still play a role in training sign recognition systems. In this context, Table 2 presents a comprehensive overview of various gesture datasets, and Fig3 illustrates some representative examples.

Gesture public datasets

Name Modality Device Signers Samples
LMDHG [82] RGB, and depth videos Kinect and 21 608
SHREC Shape Retrieval Contest (SHREC) [83] RGB, and depth videos Intel RealSense short range depth camera 28 2800
UTD–MHAD [84] RGB, depth and skeleton videos Kinect and wearable inertial sensor 8 861
The Multicamera Human Action Video Data (MuHAVi) [85] RGB video 8 camera views 14 1904
NUMA [86] RGB, depth and skeleton videos 10 Kinect with three different views 10 1493
WEIZMANN [87] Low resolution RGB video Camera with 10 different viewpoints 9 90
NTU RGB [88] RGB, depth and skeleton videos Kinect 40 56,880
Cambridge hand gesture [89] RGB video captured under five different illuminations Cam 9 900
VIVA [90] RGB, and depth videos Kinect 8 885
MSR [91] RGB, and depth videos Kinect 10 320
CAD-60 [92] RGB and depth video in different environments, such as a kitchen, a living room, and office Kinect 4 48
HDM05MoCap (motion capture) [93] RGB video Cam 5 2337
CMU [94] RGB images CAM 25 204
isoGD [95] RGB and depth videos Kinect 21 47,933
NVIDIA [96] RGB and depth video Kinect 8 885
G3D [97] RGB and depth video Kinect 16 1280
UT Kinect [98] RGB and depth video Kinect 10 200
First-Person [99] RGB and depth video RealSense SR300 cam 6 1,175
Jester [100] RGB Cam 25 148,092
EgoGesture [101] RGB and depth video Kinect 50 2,081
NUS II [102] RGB images with complex backgrounds, and various hand shapes and sizes Cam 40 2000
Figure 3:

Samples of gesture datasets

Deep Learning based Sign Language Recognition-Related Works

Numerous research efforts have been dedicated to the recognition and translation of sign language across diverse languages worldwide, aiming to facilitate its conversion into other communication forms used by individuals, such as text or sound. This study categorizes works on sign language recognition using DL according to the primary challenges encountered in recognition and the corresponding solutions proposed by each of the investigated works. Any sign language recognition system consists of key stages, which include sign acquisition, hand segmentation, tracking, preprocessing, feature extraction, and classification, as depicted in Fig4.

Figure 4:

The procedural stages of sign language recognition

In sign acquisition, the input modalities, as mentioned earlier, are either an image or a video stream captured using a vision-based device, or depth information collected using hardware-based equipment. The input may be in any format, including RGB color, greyscale, and binary. In general, DL techniques need a sufficient amount of high-quality data samples for training to be conducted.
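For illustration, the small helper below converts an RGB input frame into the other two formats mentioned above (greyscale and binary) with OpenCV; the fixed threshold of 127 is an arbitrary assumption and would normally be tuned or replaced by an adaptive method such as Otsu's.

```python
# Illustrative conversion of an RGB frame into greyscale and binary formats.
import cv2

def to_modalities(rgb_frame):
    grey = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    _, binary = cv2.threshold(grey, 127, 255, cv2.THRESH_BINARY)
    return grey, binary
```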

Accuracy is one of the most common performance measurements considered in any type of recognition system, in addition to error percentages such as the Equal Error Rate, Word Error Rate, and False Rate. Another evaluation metric, the Bilingual Evaluation Understudy (BLEU) score, is used to measure how well the resulting sentences match the entered sign language. A perfect match results in a score of 1.0, while the worst score, representing a complete mismatch, is 0.0; it is therefore also considered a measurement of translation accuracy and is widely used in machine translation systems [103].
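As a hedged sketch of how such a BLEU score can be computed in practice, the example below uses NLTK's sentence-level BLEU with smoothing; the reference and hypothesis token lists are invented for illustration and do not come from any dataset discussed here.

```python
# Hedged sketch: scoring a translated sentence with BLEU using NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["i", "am", "going", "to", "school"]]      # ground-truth translation (invented)
hypothesis = ["i", "am", "going", "school"]             # system output (invented)

score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))   # 1.0 is a perfect match, 0.0 a complete mismatch
```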

The related sign language works using DL are categorized below based on the type of problem addressed and the technique utilized to obtain the desired result.

Related Works on Preprocessing Problem

The acquired signs may exhibit issues such as low quality, noise, varying degrees of orientation, or excessive size. Therefore, the preprocessing step becomes indispensable to rectify these issues in sign images and videos, effectively eliminating any environmental influences that might have affected them, such as variations in illumination and color. This phase involves the application of filters and other techniques to adjust the size, orientation, and color, ensuring improved data quality for subsequent analysis and recognition. The primary advantage of preprocessing is enhancing the image quality, which enables efficient hand segmentation from the scene for effective feature extraction. In the case of video streams, preprocessing serves to eliminate redundant and similar frames from the input video, thereby increasing the processing speed of the neural network without sacrificing essential information. Many DL-based sign language recognition works overcome the environmental condition problem using a variety of techniques. Table 3 illustrates the most important related works, the environmental condition being addressed, and the proposed technique. Fig5 shows samples from the NUS II dataset that illustrate the environmental condition problem.
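The following is a minimal sketch of the frame-reduction step described above: frames that differ little from the previously kept frame are dropped using a mean absolute difference test. The threshold value is an illustrative assumption.

```python
# Hedged sketch: drop near-duplicate frames before feeding a video to the network.
import cv2
import numpy as np

def drop_redundant_frames(frames, diff_threshold=8.0):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = []
    prev_grey = None
    for frame in frames:
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_grey is None or np.mean(cv2.absdiff(grey, prev_grey)) > diff_threshold:
            kept.append(frame)
            prev_grey = grey
    return kept
```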

Related works on SLR using DL that address the various environmental conditions problem.

Author(s) Year Language Modality Type of condition Technique Results
[130] 2018 Bengali RGB images Variant background and skin colors Modified VGG net 84.68%
[134] 2018 American RGB images noise and missing data Augmentation 98.13%
[150] 2018 Indian RGB video Different viewing angles, background lighting, and distance Novel CNN 92.88%
[158] 2019 American Binary images Noise Erosion, closing, contour generation, and polygonal approximation, 96.83%
[159] 2019 American Depth image Variant illumination, and background Attain depth images 88.7%
[164] 2019 Chinese RGB, and depth video Variant illumination, and background Two-stream spatiotemporal network 96.7%
[173] 2019 Indian RGB, and depth video Variant illumination, background, and camera distance Four stream CNN 86.87%
[178] 2020 Arabic RGB images Variant illumination, and skin color DCNN 94.31%
[179] 2020 Arabic RGB videos Variant illumination, background, pose, scale, shape, position, and clothes Bi-directional Long Short-Term Memory (BiLSTM) 89.59%
[180] 2020 Arabic RGB Videos Variant illumination, clothes, position, scale, and speed 3DCNN and SoftMax function 87.69%
[182] 2020 Arabic RGB Videos Variations in heights and distances from camera Normalization 84.3%
[194] 2020 Arabic RGB images variant illumination, and background VGG16 and the ResNet152 with enhanced softmax layer 99%
[201] 2020 American Grayscale images illumination, and skin color Set the hand histogram 95%
[202] 2020 American RGB images Variant illumination, background DCNN 99.96%
[206] 2021 Indian RGB video Variant illuminations, camera positions, and orientations Google net+ BiLSTM 76.21%
[207] 2021 Indian RGB images Light and dark backgrounds DCNN with few numbers of parameters 99.96%
[209] 2021 American RGB video Noise Gaussian Blur 99.63%
[213] 2021 Korean Depth Videos Low resolution Augmentation 91%
[224] 2021 Bengali RGB images Variant backgrounds, camera angle, light contrast, and skin tone Conventional deep learning + Zero-shot learning ZSL 93.68%
[225] 2021 Arabic RGB video Variant illumination, background, and clothes Inception-BiLSTM 84.2%
[227] 2021 American Thermal images Varying illumination Adopt live images taken by a low-resolution thermal camera 99.52%
[229] 2021 Indian RGB video Varying illumination 3DCNN 88.24%
[230] 2021 American RGB video Noise, varying illumination Median filtering + histogram equalization 96%
[236] 2021 Arabic RGB images Variant illumination, and background Region-based Convolutional Neural Network (R-CNN) 93.4%
[239] 2022 Indian RGB video Variant illumination, and views Grey scale conversion and histogram equalization 98.7%
[241] 2022 Arabic RGB video Variant illumination, and background CNN+ RNN 98.8%
[249] 2022 Arabic Greyscale images Variant illumination, and background Sobel filter 97%
[253] 2022 Arabic RGB, and depth video Variant Background ResNet50-BiLSTM 99%
[259] 2022 American RGB, and depth images Noise and illumination variation Median filtering and histogram equalization 91.4%
[261] 2022 American Skeleton video Noise in video frames An innovative weighted least square (WLS) algorithm 97.98%
[270] 2022 English Wi-Fi signal Noise and uncleaned Wi-Fi signals. Principal Component Analysis (PCA) 95.03%
Figure 5:

Sample images (class 9) from NUS hand posture dataset-II (data subset A), showing the variations in hand posture sizes and appearances.

Another challenge arises when attempting to recognize signs, particularly in the dynamic type, where movement is considered one of the key phonological parameters in sign phonology. This pertains to the variations in hand location, speed, orientation, and angles during the signing process [104]. A consensus on how to characterize and organize movement types and their associated features in a phonological representation has been lacking. Due to divergent approaches and perspectives, there remains uncertainty about the most suitable and standardized way to define and categorize movements in sign language. In general, there are three main types of movements in sign language [105,106]:

Movement of the hands and arms: include waving, pointing, or tracing shapes in the air.

Movement of the body: include twisting, turning, or leaning to indicate direction or location.

Movement of the face and head: include nodding, shaking the head, or raising the eyebrows to convey different meanings or emotions.

Demonstrating sign language movement also involves significant challenges, including dealing with similar movement paths (trajectory) and occlusion. Arm trajectory formation refers to the principles and laws that invariantly govern the selection, planning, and generation of multi-joint movements, as well as to the factors that dictate their kinematics, namely geometrical and temporal features [107]. The sign language movement trajectory swerves to some extent due to the action speed and arm length of the user; even for the same user, psychological changes result in an inconsistent signing speed [108]. Movement trajectory recognition is a key element of sign language translation research and directly influences translation accuracy, since the same sign performed with different movement trajectories usually conveys different meanings, i.e., represents a different sign [109]. On the other hand, occlusion means that some fingers or parts of the hand are covered (not in view of the camera) or hidden by other parts of the scene, so the sign cannot be detected accurately [110]. Occlusion may appear in various configurations, including hand/hand and hand/face, depending on the movement and the captured scene. Occlusion greatly affects the segmentation procedure, especially skin-based segmentation techniques [111]. Table 4 summarizes the most important related DL works that handle these types of problems in sign language recognition.

Related works on SLR using DL that address movement orientation, trajectory, occlusion problems.

Author(s) Year Type of variation Language Signing mode Model Accuracy Error Rate
[129] 2018 similarities, and occlusion American Static DCNN 92.4%
[135] 2018 Movement Brazilian Isolated Long-term Recurrent Convolutional Networks 99% -
[138] 2018 size, shape, and position of the fingers or hands American Static CNN 82% -
[140] 2018 Hand movement American Isolated VGG 16 99% -
[144] 2018 Movement American Isolated Leap Motion Controller 88.79% -
[145] 2018 3D motion Indian Isolated Joint Angular Displacement Maps (JADMs) 92.14%
[150] 2018 head and hand movements Indian Continuous CNN 92.88% -
[155] 2019 Hand movement Indian Continuous Wearable systems to measure muscle intensity, hand orientation, motion, and position 92.50% -
[156] 2019 Variant hand orientations Chinese Continuous Hierarchical Attention Network (HAN) and Latent Space 82.7% -
[165] 2019 Similarity and trajectory Chinese Isolated Deep 3-d Residual ConvNet + BiLSTM 89.8% -
[166] 2019 orientation of camera, hand position and movement, inter hand relation Vietnamese Isolated DCNN 95.83%
[173] 2019 Movement, self-occlusions, orientation, and angles Indian Continuous Four stream CNN 86.87%
[174] 2019 Movement in different distance from the camera American Static Novel DNN 97.29% -
[176] 2020 Angles, distance, object size, and rotations Arabic Static Image Augmentation 90% 0.53
[180] 2020 fingers' configuration, hand's orientation, and its position to the body Arabic Isolated Multilayer perceptron+ Autoencoder 87.69%
[185] 2020 Hand Movement Persian Isolated Single Shot Detector (SSD) +CNN+LSTM 98.42%
[186] 2020 shape, orientation, and trajectory Greek Isolated Fully convolutional attention-based encoder-decoder 95.31% -
[192] 2020 Trajectory Greek Isolated incorporate the depth dimension in the coordinates of the hand joints 93.56% -
[195] 2020 finger angles and multi-finger movements Taiwanese Continuous Wristband with ten modified barometric sensors+ dual DCNN 97.5%
[196] 2020 movement of fingers and hands Chinese Isolated Motion data from IMU sensors 99.81% -
[197] 2020 finger movement Chinese Isolated Trigno Wireless sEMG acquisition system used to collect multichannel sEMG signals of forearm muscles 93.33%
[199] 2020 finger and arm motions, two-handed signs, and hand rotation Chinese Continuous Two armbands embedded with an IMU sensor and multi-channel sEMG sensors are attached on the forearms to capture both arm, and finger movements - 10.8%
[76] 2020 Hand occlusion Persian Isolated Skeleton detection 99.8%
[204] 2020 Trajectory Brazilian Isolated Convert the trajectory information into spherical coordinates 64.33%
[210] 2021 Trajectory Arabic Isolated Multi-Sign Language Ontology (MSLO) 94.5%
[213] 2021 Movement Korean Isolated 3DCNN 91%
[214] 2021 finger movement Chinese Isolated Design a low-cost data glove with simple hardware structure to capture finger movement and bending simultaneously 77.42%
[218] 2021 Skewing, and angle rotation Bengali Static DCNN 99.57 0.56
[219] 2021 Hand motion American Continuous Sensing Gloves 86.67%
[223] 2021 spatial appearance and temporal motion Chinese Continuous Lexical prediction network 91.72% 6.10
[226] 2021 finger self-occlusions, view invariance Indian Continuous Motion modelled deep attention network (M2DA-Net) 84.95%
[228] 2021 Occlusions of hand/hand, hands/face, or hands/upper body postures American Continuous Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs): deep Long Short-Term Memory (LSTM) as generator and LSTM with 3D Convolutional Neural Network (3D-CNN) as discriminator 97% 1.4
[230] 2021 Variant view American Isolated 3-D CNN’s cascaded 96%
[233] 2021 Hand occlusion, Italian Isolated LSTM+CNN 99.08%
[237] 2021 Finger occlusion, motion blurring, variant signing styles Chinese Continuous Dual Network built upon a Graph Convolutional Network (GCN) 98.08%
[239] 2022 self-structural characteristics, and occlusion Indian Continuous Dynamic Time Warping (DTW) 98.7%
[240] 2022 High similarity and complexity American Static DCNN 99.67% 0.0016
[241] 2022 Movement Arabic Isolated The difference function 98.8%
[259] 2022 Hand Occlusion American Static Re-formation layer in the CNN 91.40%
[260] 2022 Trajectory, hand shapes, and orientation American Isolated Media Pipe’s Landmarks with GRU 99%
[261] 2022 ambiguous and 3D double-hand motion trajectories American Isolated 3D extended Kalman filter (EKF) tracking, and approximation of a probability density function over a time frame. 97.98%
[262] 2022 Movement Turkish Continuous Motion History Images (MHI) generated from RGB video frames 94.83%
[264] 2022 Movement Argentinian Continuous Accumulative video motion (AVM) technique 91.8%
[269] 2022 orientation angle, prosodic, and similarity American Continuous Robust fast Fisher vector (FFV) in a deep Bi-LSTM 98.33%
[270] 2022 variant length, sequential patterns, English Isolated Novel Residual-Multi Head model 95.03%
Related Works on Segmentation and Tracking Problem

Detecting the signer's hand in a still image or tracking it in a video stream is challenging and affected by many factors discussed earlier in the preprocessing phase, such as environment, movement, hand shape, and occlusion. Hence, the careful choice of an appropriate segmentation technique is of utmost importance, as it profoundly influences the recognition of sign language and the work of the subsequent phases (feature extraction and classification). Hand segmentation identifies the beginning and end of each sign. This is necessary for accurate recognition and understanding of the signer's message. Through the process of segmenting the sign language input, the recognition system can concentrate on discerning individual signs and their respective meanings, thereby avoiding the interpretation of the entire continuous signing stream as a single sign. In addition to enhancing recognition accuracy, segmentation contributes to system efficiency and speed. By dividing the input into distinct signs, the system can process each sign independently, reducing computational complexity and improving response time. Furthermore, segmentation facilitates advancements in sign language recognition technology by enabling the creation of sign language corpora annotated with information about individual signs. Such resources are valuable for training and evaluating sign language recognition systems and conducting linguistic research on sign language structure and syntax. Various segmentation techniques are employed, including background subtraction [112], skin color detection [113], template matching [114], optical flow [115], and machine learning [116]. Table 5 presents DL-based sign language recognition works that focus on addressing the segmentation and tracking challenges to achieve optimal system performance.

Related works on SLR using DL that address segmentation problem.

Author(s) Year Input Modality Segmentation method Results
[131] 2018 RGB image HSV color model 99.85%
[148] 2018 RGB image Skin segmentation algorithm based on color information 94.7%
[149] 2018 RGB images k-means-based algorithm 94.37%
[158] 2019 RGB images Color segmentation by MLP network 96.83%
[159] 2019 Depth image Wrist line localization by algorithm-based thresholding 88.7%
[164] 2019 RGB, and depth video Aligned Random Sampling in Segments (ARSS) 96.7%
[168] 2019 RGB, and depth images Depth based segmentation using data of Kinect RGB-D camera 97.71%
[171] 2019 RGB video Design an adaptive temporal encoder to capture crucial RGB visemes and skeleton signees 94.7%
[179] 2020 RGB videos Hand semantic Segmentation named as DeepLabv3+ 89.59 %
[180] 2020 RGB Videos Novel method based on open pose 87.69 %
[182] 2020 RGB Videos Viola and Jones, and human body part ratios 84.3%
[183] 2020 RGB images Robert edge detection method 99.3 %
[185] 2020 RGB video SSD is a feed-forward convolutional network A Non-Maximum Suppression (NMS) step is used in the final step to estimate the final detection 98.42%
[187] 2020 RGB images Sobel edge detector, and skin color by thresholding 98.89%
[188] 2020 RGB images Open-CV with a Region of Interest (ROI) box in the driver program 93%
[189] 2020 RGB Videos Frame stream density compression (FSDC) algorithm 10.73 error
[199] 2020 RGB Videos Design an attention-based encoder-decoder model to realize end-to-end continuous SLR without segmentation 10.8% WER
[200] 2020 RGB images Single Shot Multi Box Detection (SSD) 99.90%
[209] 2021 RGB Video Canny 99.63%
[216] 2021 RGB images Erosion, Dilation, and Watershed Segmentation 99.7 %
[219] 2021 RGB Video Data sliding window 86.67%
[236] 2021 RGB images R-CNN 93%
[239] 2022 RGB videos Novel Adaptive Hough Transform (AHT) 98.7%
[246] 2022 RGB images, and video Grad Cam and Cam shift algorithm 99.85%
[248] 2022 Grey images YCbCr, HSV and watershed algorithm 99.60%,
[249] 2022 RGB images Sobel operator method 97 %
[263] 2022 RGB images Semantic 99.91%
[267] 2022 RGB images R-CNN 99.7%
[268] 2022 RGB video Mask is created by extracting the maximum connected region in the foreground assuming it to be the hand+ Canny method 99%
Related Works on Feature Extraction Problem

The feature extraction goal is to capture the most essential information about the sign language gestures while removing any redundant or irrelevant information that may be present in the input data. The process of feature extraction offers numerous advantages in sign language recognition. It enhances accuracy by effectively representing the distinctive characteristics of each sign and gesture, thereby facilitating the system's ability to differentiate between them. Moreover, feature extraction reduces both processing time and computational complexity, as the extracted features are typically represented in a more compact and informative manner compared to raw input data. Additionally, feature extraction confers robustness against noise and variability, as features can be designed to be invariant to specific types of variations, such as changes in lighting conditions or background clutter [117,118]. This enables the recognition system to maintain its performance even in challenging and diverse environments. Table 6 shows related DL works for sign language recognition that focus on solving the problem of features extraction.

Related works on SLR using DL that address feature extraction problem.

Author(s) Year Dataset Technique Signing mode Feature(s) Result
[130] 2018 Collected DCNN Static Hand shape 84.6%
[135] 2018 Collected 3D CNN Isolated spatiotemporal 99%
[138] 2018 ASL Finger Spelling CNN Static depth and intensity 82%
[141] 2018 RWTH-2014 3D Residual Convolutional Network (3D-ResNet) Continuous Spatial information, and temporal connections across frames 37.3 WER
[143] 2018 Collected 3D-CNNs Isolated spatiotemporal 88.7%
[144] 2018 Collected DCNN Isolated hand palm sphere radius, and position of hand palm and fingertip 88.79%
[149] 2018 ASL Finger Spelling Histograms of oriented gradients, and Zernike moments Static Hand shape 94.37%
[150] 2018 Collected CNN Continuous Hand shape 92.88%
[151] 2018 Collected 3DRCNN Continuous/Isolated motion, depth, and temporal 69.2%
[152] 2018 SHREC Leap Motion Controller (LMC) sensor Isolated, static finger bones of hands. 96.4%
[153] 2018 Collected Hybrid Discrete Wavelet Transform, Gabor filter, and histogram of distances from Centre of Mass Static Hand shape 76.25%
[154] 2018 Collected DCNN Static Facial expressions 89%
[156] 2019 Collected Two-stream 3-D CNN Continuous Spatiotemporal 82.7%
[158] 2019 Collected CNN Static Hand shape 96.83%
[79] 2019 Collected Open Pose library Continuous human key points (hand, face, body) 55.2%
[159] 2019 ASL fingerspelling PCA Net Static hand shape (corners, edges, blobs, or ridges) 88.7%
[161] 2019 SIGNUM Stacked temporal fusion layers in DCNN Continuous spatiotemporal 2.80 WER
[162] 2019 Collected Leap motion device Continuous, Isolated 3D positions of the fingertips 72.3%, 89%
[163] 2019 Collected CNN Static Hand shape 95%
[164] 2019 CSL D-shift Net Continuous spatial and temporal features 96.7%
[165] 2019 DEVISIGN_D B3D Res-Net Isolated spatiotemporal 89.8%
[166] 2019 Collected Local and GIST Descriptor Isolated Spatial and scene-based features 95.83%
[169] 2019 Collected Restricted Boltzmann Machine (RBM) Isolated Handshape, and network generated features 88.2%
[170] 2019 KSU-SSL 3D-CNN Isolated hand shape, position, orientation, and temporal dependence in consecutive frames 77.32%
[171] 2019 Collected C3D, and Kinect device Continuous Temporal, and skeleton 94.7%
[175] 2019 Collected Open Pose library with Kinect V2 Static 3D skeleton 98.9%.
[177] 2020 Ishara-Lipi Mobile Net V1 Isolated Two hands shape 95.71%
[178] 2020 Collected DCNN Static Hand shape 94.31%.
[179] 2020 Collected Single layer Convolutional Self-Organizing Map (CSOM) Isolated Hand shape 89.59%
[180] 2020 KSU-SSL Enhanced C3D architecture Isolated Spatiotemporal of hand and body 87.69 %
[182] 2020 KSU-SSL 3DCNN Isolated Spatiotemporal 84.3%
[185] 2020 Collected ResNet50 model Isolated Hand shape, Extra Spatial hand Relation (ESHR) features, and Hand Pose (HP), temporal. 98.42%
[186] 2020 Polytropon (PGSL) ResNet-18 Isolated Optical flow of skeletal, handshapes, and mouthing 95.31%
[187] 2020 Collected Discrete cosines transform, Zernike moment, scale-invariant feature transform, and social ski driver optimization algorithm Static Hand shape 98.89%
[189] 2020 RWTH-2014 Temporal convolution unit and dynamic hierarchical bidirectional GRU unit Continuous spatiotemporal 10.73% BLEU
[191] 2020 Collected Standard score normalization on the raw Channel State Information (CSI) acquired from the Wi-Fi device, and MIFS algorithm Static, and continuous The cross-cumulant features (unbiased estimates of covariance, normalized skewness, normalized kurtosis) 99.9%
[192] 2020 GSL Open Pose human joint detector Isolated 3D hand skeletal, and region of hand, and mouth 93.56%
[197] 2020 Collected Four channel surface electromyography (sEMG) signals Isolated time-frequency joint features 93.33%
[199] 2020 Collected Euler angle, Quaternion from IMU signal Continuous Hand rotation 10.8% WER
[76] 2020 RKS-PERSIANSIGN 3DCNNs Isolated Spatiotemporal 99.8%
[202] 2020 ASL fingerspelling A DCNN Static Hand Shape 99.96%
[203] 2020 Collected Construct a color-coded topographical descriptor from joint distances and angles, to be used in 2 streams (CNN) Isolated distance and angular 93.01%
[204] 2020 Collected Two CNN models and a descriptor based on Histogram of cumulative magnitudes Isolated Two hands, skeleton, and body 64.33%
[208] 2021 RWTH-2014T Semantic Focus of Interest Network with Face Highlight Module (SFoI-Net-FHM) Isolated Body and facial expression 10.89 BLEU
[210] 2021 Collected (ConvLSTM) Isolated Spatiotemporal 94.5%
[212] 2021 Collected ResNet50 Static Hand area, the length of the axis of the first eigenvector, and hand position changes 96.42%
[214] 2021 Collected f-CNN (fusion of 1-D CNN and 2-D CNN) Isolated Time and spatial-domain features of finger resistance movement 77.42%
[217] 2021 MU Modified Alex Net and VGG16 Static Hand edges and shape 99.82%
[222] 2021 Collected VGG net of six convolutional layers Static Hand shape 97.62%
[224] 2021 38 BdSL DenseNet201, and Linear Discriminant Analysis Static Hand shape 93.68%
[225] 2021 KSU-ArSL Bi-LSTM Isolated spatiotemporal 84.2%
[226] 2021 Collected Paired pooling network in view pair pooling net (VPPN) Isolated spatiotemporal 84.95%
[228] 2021 ASLLVD Bayesian Parallel Hidden Markov Model (BPaHMM) + stacked denoising variational autoencoders (SD-VAE) + PCA Continuous Shape of hand, palm, and face, along with their position, speed, and distance between them 97%
[230] 2021 ASLLVD Cascaded 3-D CNNs Isolated Spatiotemporal 96.0%
[231] 2021 Collected Leap Motion controller Static and Isolated Sphere radius, angles between fingers, and their distances 91.82%
[232] 2021 RWTH-2014 (3+2+1)D ResNet Continuous Height, motion of hand, and frame blurriness levels 23.30 WER
[233] 2021 Montalbano II AlexNet + Optical Flow (OF) + Scene Flow (SF) methods Isolated Pixel level, and hand pose 99.08%
[234] 2021 RWTH-2014 GAN Continuous Spatiotemporal 23.4 WER
[235] 2021 MNIST DCNN Static Hand shape 98.58%
[236] 2021 Collected R-CNN Static Hand shape 93%
[237] 2021 CSL-500 Multi-scale spatiotemporal attention network (MSSTA) Isolated Spatiotemporal 98.08%
[242] 2022 MNIST modified CapsNet Static Spatial, and orientations 99.60%
[243] 2022 RKS-PERSIANSIGN Singular value decomposition SVD Isolated 3D hand key points between the segments of each finger, and their angles. 99.5%
[244] 2022 Collected 2DCRNN + 3DCRNN Continuous Spatiotemporal features out of small patches 99%
[246] 2022 Collected Atrous convolution mechanism and semantic spatial multi-cue model Static, Isolated Pose, face, and hand, along with spatial and full-frame features 99.85%
[253] 2022 Collected 4 DNN models using 2D and 3D CNN Isolated Spatiotemporal 99%
[255] 2022 Collected Scale-Invariant Feature Transformation (SIFT) Static Corner, edges, rotation, blurring, and illumination. 97.89%
[256] 2022 Collected InceptionResNetV2 Isolated Hand shape 97%
[257] 2022 Collected Alex net Static Hand shape 94.81%
[258] 2022 Collected Sensor + mathematical equations + CNN Continuous Mean, magnitude of mean, variance, correlation, covariance, and frequency-domain features + spatiotemporal 0.088 WER
[260] 2022 Collected MediaPipe framework Isolated Hands, body, and face 99%
[261] 2022 Collected Bi-RNN network, maximal information correlation, and leap motion controller Isolated hand shape, orientation, position, and motion of 3D skeletal videos. 97.98%
[264] 2022 LSA64 dynamic motion network (DMN)+ Accumulative motion network (AMN) Isolated spatiotemporal 91.8%
[265] 2022 CSL-500 Spatial–temporal–channel attention (STCA) Isolated Spatiotemporal 97.45%
[268] 2022 Collected SURF (Speeded Up Robust Features) Isolated Distribution of the intensity content within the neighborhood of the interest point 99%
[269] 2022 Collected Thresholding and Fast Fisher Vector Encoding (FFV) Isolated Hand, palm, finger shape, and position and 3D skeletal hand characteristics 98.33%
Related Works on the Classification Problem

Classification is the final phase of any sign language recognition system and is used before transferring the sign language into another form of data, whether text or sound. In general, a particular sign is recognized by comparing it with the trained dataset, in which the data are categorized into their respective classes depending on the obtained feature vector. Moreover, the system can calculate the probability associated with each class, allowing the data to be assigned to the respective class based on the probability values. Overall, the classification conditions for sign language using DL involve selecting an appropriate data representation, feature extraction techniques, classification algorithms, and evaluation metrics, and ensuring sufficient and diverse training data. These factors collectively contribute to the accuracy and effectiveness of the sign language classification system. However, classification may suffer from problems such as overfitting. In the realm of DL, overfitting occurs when a neural network model becomes so specialized in learning from the training data that it fails to generalize effectively to new, unseen data. In other words, the model "memorizes" the training examples instead of learning the underlying patterns or relationships. When a DL model overfits, it performs very well on the training data but struggles to accurately predict or classify new instances that it has not encountered during training [119]. Various causes and indicators of overfitting exist, including high model complexity with numerous parameters, insufficient training data, lack of regularization, excessive training epochs, and reliance on the training data for evaluation [120]. To mitigate overfitting in deep models, several effective techniques can be employed, including regularization methods [121], the incorporation of dropout layers [122], early stopping criteria [123], data augmentation strategies [124], and increasing the training data [125]. These techniques help to enhance model generalization and prevent the adverse effects of overfitting. Table 7 summarizes some related work on sign language recognition systems using DL that focuses on solving the problem of overfitting.
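As a minimal illustration of how several of these counter-measures can be combined in practice, the following sketch builds a small image classifier for static signs with L2 regularization, a dropout layer, on-the-fly data augmentation, and an early-stopping criterion. It is not taken from any of the reviewed papers; the class count, input size, and dataset objects are placeholder assumptions.

```python
# Illustrative sketch (assumptions: 26-class fingerspelling task, 64x64 RGB input).
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

NUM_CLASSES = 26          # placeholder class count
IMG_SHAPE = (64, 64, 3)   # placeholder input size

augment = tf.keras.Sequential([          # data augmentation, active only during training
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
])

model = models.Sequential([
    layers.Input(shape=IMG_SHAPE),
    augment,
    layers.Conv2D(32, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),   # L2 regularization
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                                       # dropout layer
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(                 # early stopping criterion
    monitor="val_loss", patience=5, restore_best_weights=True)

# Hypothetical usage with placeholder tf.data pipelines:
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])
```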

Table 7: Related works on SLR using DL that address the overfitting problem.

Author(s) Year Dataset Model Technique Result
[129] 2018 NTU DCNN Augmentation 92.4%
[130] 2018 Collected Modified VGG net Dropout 84.68%
[132] 2018 Ishara-Lipi DCNN Dropout 94.88%
[133] 2018 Collected DCNN small convolutional filter sizes, Dropout, and learning strategy 85.3%
[136] 2018 HUST Deep Attention Network (DAN) data augmentation 73.4%
[142] 2018 ASL Finger Spelling A DNN Dense Net 90.3%
[143] 2018 Collected 3DCNN SGD 88.7%
[146] 2018 SIGNUM CNN-HMM hybrid Augmentation 7.4 WER
[157] 2019 Collected DCNN Augmentation 93.667%
[79] 2019 Collected ResNet-152 batch size, Augmentation 55.28%
[163] 2019 Collected VGG-16 Dropout 95%
[166] 2019 Collected DCNN Augmentation 95.83%
[167] 2019 Collected DCNN Dense Net 90.3%
[171] 2019 Collected LSTM Increase hidden state number 94.7%
[172] 2019 NVIDIA Squeeze-net Augmentation 83.29%
[173] 2019 G3D Four-stream CNN Sharing of multi-modal features with RGB spatial features during training, and dropout 86.87%
[175] 2019 Collected DCNN Augmentation 98.9%
[176] 2020 Collected DCNN Pooling Layer 90%
[181] 2020 Collected DCNN Epochs reduced to 30, and dropout added after each max-pooling layer 97.6%
[184] 2020 Collected CNN with 8 layers Augmentation 89.32%
[188] 2020 MNIST CNN Dropout 93%
[190] 2020 Collected Enhanced Alex Net Augmentation 89.48%
[191] 2020 Collected SVM Augmentation, and k-fold cross validation 99.9%
[193] 2020 KETI CNN+LSTM New data augmentation 96.2%
[194] 2020 Collected VGG16, and ResNet152 with enhanced softmax layer Augmentation 99%
[196] 2020 Collected RNN-LSTM dropout layer (DR) 99.81%
[201] 2020 Collected CNN dropout layer, and augmentation 95%
[203] 2020 NTU Two-stream CNN Randomness in feature-interlocking fusion with dropout 93.01%
[207] 2021 Jochen-Triesch's DCNN Two dropout layers 99.96%
[214] 2021 Collected Generic temporal convolutional network (TCN) Dropout 77.42%
[215] 2021 Collected DCNN Dropout 96.65%
[216] 2021 Collected DCNN Cyclical learning rate method 99.7%
[217] 2021 MU Modified AlexNet and VGG16 Augmentation 99.82%
[222] 2021 Collected CNN Dropout 97.62%
[229] 2021 Collected 3DCNN Dropout & Regularization 88.24%
[236] 2021 Collected ResNet-18 Zero-patience stopping criteria 93.4%
[238] 2021 Collected DCNN Synthetic Minority Oversampling Technique (SMOTE) 97%
[240] 2022 Collected DCNN Augmentation 99.67%
[253] 2022 Collected ResNet50-BiLSTM Augmentation 99%
[256] 2022 Collected LSTM, and GRU Dropout 97%
[263] 2022 BdSL CNN Augmentation 99.91%

Another critical issue that must be considered when designing a deep model for sign language recognition is generalization, which refers to the capability of a model to operate accurately on unseen data that is distinct from the training data. A model demonstrates a high degree of generalization ability by consistently achieving strong performance across a wide range of diverse and distinct datasets [126]. Consistent results across different datasets are an important characteristic for a model to be considered robust and reliable, demonstrating that it can be applied effectively to various real-world scenarios. Datasets can have different characteristics, biases, or noise levels; therefore, it is crucial to carefully evaluate and validate the model's performance on each specific dataset to ensure its reliability and generalization ability [127]. Table 8 presents relevant works in sign language recognition using DL, focusing on the model's generalization ability by evaluating its performance on diverse datasets.
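The following minimal sketch illustrates one way such a cross-dataset check can be organized: a model trained on one corpus is evaluated, without fine-tuning, on held-out sets drawn from other sources, and a large accuracy drop signals poor generalization. The dataset names and loaders are hypothetical and only indicate the intent; the works in Table 8 each follow their own evaluation protocols.

```python
# Illustrative sketch (assumed Keras model and assumed tf.data.Dataset loaders).
import tensorflow as tf

def evaluate_across_datasets(model, eval_sets):
    """eval_sets: dict mapping a dataset name to a tf.data.Dataset of
    (image, label) batches drawn from a different source or recording setup."""
    scores = {}
    for name, ds in eval_sets.items():
        loss, acc = model.evaluate(ds, verbose=0)   # no fine-tuning on the new corpus
        scores[name] = acc
    return scores

# Hypothetical usage:
# scores = evaluate_across_datasets(model, {
#     "dataset_A_same_signers": test_ds_a,
#     "dataset_B_new_signers":  test_ds_b,
#     "dataset_C_new_lighting": test_ds_c,
# })
# print(scores)   # comparable accuracies across the sets indicate robustness
```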

Table 8: Related works on SLR using DL that aim to achieve generalization.

Author(s) Year Datasets Technique Result
[129] 2018 ASL finger spelling A, NTU DCNN 92.4%, 99.7%
[134] 2018 NYU, MU, ASL Fingerspelling A, ASL Surrey Restricted Boltzmann Machine (RBM) 90.01%, 99.31%, 98.13%, 97.56%
[136] 2018 NTU, HUST DAN 98.5%, 73.4%
[143] 2018 Collected CSL, ChaLearn14 3D-CNN 88.7%, 95.3%
[145] 2018 Collected, MD05, CMU JADM+CNN 88.59%, 87.92%, 87.27%
[146] 2018 RWTH 2012, RWTH 2014, SIGNUM CNN-HMM hybrid 30.0 WER, 32.5 WER, 7.4 WER
[156] 2019 Collected, RWTH-2014 Hierarchical Attention Network (HAN) + Latent Space (LS-HAN) 82.7%, 61.6%
[161] 2019 RWTH-2014, SIGNUM DCNN 22.86 WER, 2.80 WER
[164] 2019 CSL, IsoGD Proposed multimodal two-stream CNN 96.7%, 63.78%
[165] 2019 DEVISIGN-D, Collected Deep 3-D Residual ConvNet + BiLSTM 89.8%, 86.9%
[170] 2019 KSU-SSL, ArSL, RVL-SLLL 3D-CNN 77.32%, 34.90%, 70%
[173] 2019 Collected RGB-D, MSR, UT Kinect, G3D Four-stream CNN 86.87%, 86.98%, 85.23%, 88.68%
[174] 2019 Jochen-Triesch, MKLM, novel SI-PSL Novel DNN 97.29%, 96.8%, 51.88%
[182] 2020 KSU-SSL, ArSL by University of Sharjah, RVL-SLLL 3DCNN 84.38%, 34.9%, 70%
[186] 2020 PGSL, ChicagoFSWild, RWTH-2014T DCNN 95.31%, 92.63%, 76.30%
[187] 2020 ASL, MU Deep Elman recurrent neural network 98.89%, 97.5%
[192] 2020 GSL, ChicagoFSWild CNN 93.56%, 91.38%
[76] 2020 NYU, First-Person, RKS-PERSIANSIGN CNN 4.64 error, 91.12%, 99.8%
[202] 2020 NUS, American fingerspelling A DCNN 94.7%, 99.96%
[203] 2020 HDM05, CMU, NTU, Collected Two-stream CNN 93.42%, 92.67%, 94.42%, 93.01%
[204] 2020 UTD–MHAD, IsoGD, Collected Linear SVM classifier 94.81%, 67.36%, 64.33%
[207] 2021 Collected RGB images, Jochen-Triesch's DCNN 99.96%, 100%
[210] 2021 LSA64, LSA, Collected 3DCNN 98.5%, 99.2%, 94.5%
[211] 2021 ASLG-PC12, RWTH-2014 GRU and LSTM with Bahdanau and Luong's attention mechanisms 66.59%, 19.56% BLEU
[221] 2021 ASL alphabet, ASL MNIST, MSL Optimized CNN based on PSO 99.58%, 99.58%, 99.10%
[225] 2021 KSU-ArSL, Jester, NVIDIA Inception-BiLSTM 84.2%, 95.8%, 86.6%
[226] 2021 Collected, NTU, MuHAVi, WEIZMANN, NUMA Motion-modelled deep attention network (M2DA-Net) 84.95%, 89.98%, 85.12%, 82.25%, 88.25%
[228] 2021 RWTH-2014, ASLLVD Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs) 73.9%, 97%
[232] 2021 RWTH-2014, Collected Bidirectional Encoder Representations from Transformers (BERT) + ResNet 20.1 WER, 23.30 WER
[233] 2021 Montalbano II, isoGD, MSR, CAD-60 LSTM+CNN 99.08%, 86.10%, 98.40%, 95.50%
[234] 2021 RWTH-2014, CSL, GSL GAN 23.4, 2.1, 2.26 WER
[237] 2021 CSL-500, DEVISIGN-L Dual network built upon a Graph Convolutional Network (GCN) 98.08%, 64.57%
[242] 2022 SLDD, MNIST Modified CapsNet architecture (SLR-CapsNet) 99.52%, 99.60%
[243] 2022 RKS-PERSIANSIGN, First-Person, ASVID, isoGD Single-shot detector, 2D convolutional neural network, singular value decomposition (SVD), and LSTM 99.5%, 91%, 93%, 86.1%
[247] 2022 Collected, Collected, ASL finger spelling DCNN + diffGrad optimizer 92.43%, 88.01%, 99.52%
[248] 2022 38 BdSL, Collected, Ishara-Lipi BenSignNet 94.00%, 99.60%, 99.60%
[251] 2022 Collected, Collected, Collected DCNN 99.41%, 99.48%, 99.38%
[254] 2022 Collected, Cambridge hand gesture Hybrid model based on VGG16-BiLSTM 83.36%, 97%
[255] 2022 Collected, MNIST, JTD, NUS Hybrid Fist CNN 97.89%, 95.68%, 94.90%, 95.87%
[256] 2022 ASL, GSL, AUTSL, IISL2020 LSTM+GRU 95.3%, 94%, 95.1%, 97.1%
[261] 2022 Collected, SHREC, LMDHG DLSTM 97.98%, 96.99%, 97.99%
[262] 2022 AUTSL, Collected 3D-CNN 93.53%, 94.83%
[265] 2022 CSL-500, Jester, EgoGesture Deep R(2+1)D 97.45%, 97.05%, 94%
[266] 2022 MU, HUST-ASL End-to-end fine-tuning of a pre-trained CNN model with score-level fusion 98.14%, 64.55%
[269] 2022 SHREC, Collected, LMDHG FFV-Bi-LSTM 92.99%, 98.33%, 93.08%

The choice of DL layers significantly influences the classification model's performance, as it determines the model's architecture and its ability to learn and represent intricate patterns in the input data. Selecting the right layers requires a comprehensive understanding of the data's characteristics, the problem complexity, and the resources available for training and inference. It often necessitates experimentation, tuning, and domain expertise to discover the optimal combination of layers that maximizes classification performance for a particular task [128]. In sign language recognition, numerous authors have designed and utilized deep models to achieve the desired performance levels, as depicted in Table 9.
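As a simple illustration of how the layer stack follows the input modality, the sketch below contrasts a plain 2D CNN for static hand images with a per-frame CNN followed by an LSTM for isolated sign videos. The layer types and sizes are arbitrary assumptions rather than any specific model from Table 9.

```python
# Illustrative sketch (assumed input sizes and class counts).
import tensorflow as tf
from tensorflow.keras import layers, models

def static_sign_cnn(num_classes, img_shape=(64, 64, 3)):
    # Spatial features only: convolution and pooling over a single frame.
    return models.Sequential([
        layers.Input(shape=img_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),
    ])

def isolated_sign_cnn_lstm(num_classes, frames=16, img_shape=(64, 64, 3)):
    # Spatiotemporal features: a per-frame CNN whose outputs feed an LSTM.
    return models.Sequential([
        layers.Input(shape=(frames, *img_shape)),
        layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D()),
        layers.TimeDistributed(layers.Flatten()),
        layers.LSTM(128),
        layers.Dense(num_classes, activation="softmax"),
    ])
```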

Table 9: Related works' classifiers employed in SLR using DL.

Author(s) Year Input modality Classifier Result
[129] 2018 Static DCNN 92.4%
[131] 2018 Static DCNN 99.85%
[133] 2018 Static DCNN 85.3%
[134] 2018 Static Restricted Boltzmann machine 98.13%
[135] 2018 Isolated LRCNs and 3D CNNs 99%
[136] 2018 Static DAN 73.4%
[137] 2018 Static CNNs of varying depths and stacked denoising autoencoders 92.83%
[139] 2018 Static DCNN 82.5%
[142] 2018 Static DCNN 90.3%
[145] 2018 Isolated DCNN 88.59%
[146] 2018 Continuous CNN-HMM hybrid 7.4 WER
[147] 2018 Static DCNN 98.05%
[151] 2018 Isolated 3DCNN and enhanced fully connected (FCRNN) 69.2%
[155] 2019 Continuous Deep Capsule networks and game theory 92.50%
[156] 2019 Continuous Hierarchical Attention Network (HAN) and Latent Space 82.7%
[157] 2019 Static DCNN 93.667%
[160] 2019 Static DCNN 97 %
[161] 2019 Continuous DCNN 2.80 WER
[162] 2019 Continuous, Isolated Modified LSTM 72.3%, 89%
[167] 2019 Isolated DCNN based on DenseNet 90.3%
[168] 2019 Static DCNN 97.71%
[176] 2020 Static DCNN 90%
[181] 2020 Static DCNN 97.6%
[184] 2020 Static Eight CNN layers + stochastic pooling, batch normalization, and dropout 89.32%
[185] 2020 Isolated Cascaded model (SSD, CNN, LSTM) 98.42%
[187] 2020 Static Deep Elman recurrent neural network 98.89%
[188] 2020 Static DCNN 93%
[190] 2020 Static Enhanced Alex Net 89.48%
[198] 2020 Static Multimodality fine-tuned VGG16 CNN + Leap Motion network 82.55%
[199] 2020 Continuous Multi-channel CNN 10.8 WER
[200] 2020 Static Hybrid model based on Inception v3 + SVM 99.90%
[201] 2020 Static 11 Layer CNN 95%
[205] 2021 Static Three-layered CNN model 90.8%
[206] 2021 Isolated Hybrid deep learning with convolutional LSTM and BiLSTM 76.21%
[209] 2021 Isolated DCNN + sentiment analysis 99.63%
[211] 2021 Continuous GRU+LSTM 19.56 BLEU
[214] 2021 Isolated Generic temporal convolutional network 77.42%
[215] 2021 Static DCNN 96.65%
[216] 2021 Static DCNN 99.7%
[220] 2021 Static Pretrained InceptionV3 + mini-batch gradient descent optimizer 85%
[221] 2021 Static PSO algorithm applied to find the optimal parameters of the convolutional neural network 99.58%
[223] 2021 Continuous Visual hierarchy to lexical sequence alignment network (H2SNet) 91.72%
[227] 2021 Static Novel lightweight deep learning model based on a bottleneck motivated by deep residual learning 99.52%
[228] 2021 Continuous Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs) 97%
[229] 2021 Isolated 3DCNN 88.24%
[232] 2021 Continuous Bidirectional Encoder Representations from Transformers (BERT) + ResNet 23.30 WER
[234] 2021 Continuous Generative Adversarial Network (SLRGAN) 23.4 WER
[238] 2021 Static DCNN 97%
[239] 2022 Static Optimized DCNN using a hybridization of Electric Fish Optimization (EFO) and the Whale Optimization Algorithm (WOA), called the Electric Fish-based Whale Optimization Algorithm (E-WOA) 98.7%
[241] 2022 Isolated CNN+ RNN 98.8%
[242] 2022 Static Modified CapsNet architecture, (SLR-CapsNet) 99.60%
[245] 2022 Static DCNN 99.52%
[247] 2022 Static DCNN+ diffGrad optimizer 88.01%
[250] 2022 Static DCNN 92%
[251] 2022 Static DCNN 99.38%
[252] 2022 Static Lightweight CNN 94.30%
[254] 2022 Isolated Hybrid model based on VGG16-BiLSTM 83.36%
Related Works on the Time and Delay Problem

In real-world classification scenarios using DL, time and delay are principal factors to consider. It is important to strike a balance between achieving accurate classification results and minimizing the time required. The specific requirements and constraints of the application, such as the desired response time or the available computational resources, should be considered when designing and deploying DL models. As a result, one of the major requirements that makes a sign language recognition system efficient is its recognition time, which can be measured as sketched below. Table 10 illustrates the related DL works for sign language recognition that focus on improving the recognition time.
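A minimal sketch of how per-sample recognition time can be measured is given below. The model object, input shape, and run counts are assumptions, and the timing convention (averaging repeated forward passes after a warm-up) is only one possible protocol rather than the one used by the works in Table 10.

```python
# Illustrative sketch (assumes a Keras-style model exposing predict()).
import time
import numpy as np

def average_latency_ms(model, sample, warmup=10, runs=100):
    """Run repeated timed forward passes on one input and report the mean latency in ms."""
    for _ in range(warmup):                 # warm-up passes exclude one-off setup cost
        model.predict(sample, verbose=0)
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(sample, verbose=0)
    return (time.perf_counter() - start) / runs * 1000.0

# Hypothetical usage with a model expecting 64x64 RGB frames:
# sample = np.random.rand(1, 64, 64, 3).astype("float32")
# print(f"{average_latency_ms(model, sample):.1f} ms per sign")
```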

Table 10: Related works on SLR using DL that aim to minimize the required time.

Discussion

Designing systems for recognizing sign language has become an emerging need in society and has attracted the attention of academics and practitioners, due to its significant role in eliminating the communication barriers between the hearing and deaf communities. However, many challenges appear when trying to design a sign language recognition system, such as dynamic gestures, environmental conditions, the availability of public datasets, and multidimensional feature vectors. Still, many researchers are attempting to develop accurate, generalized, reliable, and robust sign language recognition models using deep learning. Deep learning technology is widely applied in many fields and research areas, such as speech recognition, image processing, graphs, medicine, and computer vision. With the emergence of DL approaches, sign language recognition has managed to significantly improve its accuracy. From the previous tables, which illustrate some promising related works on sign language recognition using DL architectures, it is noticed that the most widely utilized deep architecture is the CNN. Convolutional Neural Networks (CNNs) exhibit a remarkable capacity to extract discriminative features from raw data, enabling them to achieve impressive results in several types of sign language recognition tasks. They demonstrate robustness and flexibility, being employed either independently or in combination with other architectures, such as Long Short-Term Memory (LSTM), to enhance performance in sign language recognition. Moreover, CNNs prove to be highly advantageous in handling multi-modality data, such as RGB-D data, skeleton information, and finger points. These modalities provide rich information about the signer's actions, and their utilization has been instrumental in addressing multiple challenges in sign language recognition. A set of related works focuses on solving only one type of problem facing sign language recognition using DL, such as [132, 137, 139, 141, 147, 148, 152, 153, 154, 160, 169, 177, 195, 198, 205, 208, 212, 218, 220, 231, 235, 244, 247, 250, 252, 257, 258, 266], while others try to solve multiple problems, such as [185, 199]. The most widely used feature is the spatiotemporal one, which depends on the hand shape and the location information of the hand [135, 143, 156, 161, 165, 180, 182, 189, 76, 210, 225, 226, 230, 234, 237, 244, 253, 264, 265]. However, there are works that make use of more than one type of feature in addition to the spatiotemporal ones, such as facial expression, skeleton, hand orientation, and angles [138, 141, 144, 151, 152, 79, 159, 162, 164, 166, 170, 171, 175, 185, 186, 191, 192, 197, 199, 203, 204, 208, 212, 228, 231, 232, 233, 245, 246, 255, 258, 260, 261, 268, 269]. Some works apply separate feature extraction techniques rather than depending only on the features extracted by DL, and still manage to obtain good recognition results [149, 152, 153, 79, 159, 162, 166, 169, 171, 175, 177, 179, 187, 189, 191, 192, 197, 199, 203, 204, 208, 228, 231, 233, 235, 237, 245, 246, 255, 258, 260, 261, 265, 268, 269]. Recent works, especially from 2020 onwards, focus on developing recognition systems for continuous sentences in sign language, which remains an open problem that gathers the most attention and has not been completely solved or employed in any commercial application.
Two factors may contribute to improved accuracy in continuous sign language recognition: feature extraction from the frame sequences of the input video, and the alignment between the features of every segment in the video and its corresponding sign label (a sketch of one common realization of this alignment is given at the end of this discussion). Acquiring features from video frames that are more descriptive and discriminative results in better performance. While recent models in continuous sign language recognition show an uptrend in performance by exploiting DL abilities in computer vision and Natural Language Processing (NLP), there is still much room for performance enhancement in this area. Among the main problems that many researchers deal with are trajectory [186, 192, 204, 210, 260] and occlusion [129, 173, 76, 226, 228, 233, 237, 239, 259]. Furthermore, selecting or designing an appropriate deep model to deal with a particular type of challenge in sign language recognition is itself a main challenge, addressed by a variety of studies in order to reach the desired accuracy goal. Others focus on solving classification problems such as overfitting, which leads to the failure of the system. Applying a recognition system to more than one dataset with different properties indicates high generalization and is one of the major factors that make a system highly effective. Thus, many researchers implement their sign language recognition systems on more than one dataset with considerable variation, although they do not achieve the same results across datasets, as in [129, 136, 143, 146, 156, 161, 164, 170, 182, 186, 204, 228, 234, 237, 254, 266]. Consequently, based on the information gathered from the preceding tables, deep learning stands out as a potent approach that has achieved the most impressive outcomes in sign language recognition. However, it is important to note that no existing research has comprehensively tackled all the associated challenges. Some studies prioritize achieving high accuracy without considering time constraints, while others concentrate on addressing feature extraction issues and functioning under various environmental conditions. Yet, there is a lack of consideration for the complexity and overall applicability of the models. In addition, a significant aspect not extensively discussed in the related works pertains to hardware cost and complexity, both of which exert a substantial impact on the efficiency of the recognition system, particularly in real-world applications.
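For the alignment problem referred to above, one technique widely used for sequence prediction without frame-level labels is the Connectionist Temporal Classification (CTC) objective; it is not discussed explicitly in this review, but the hedged sketch below indicates how a recurrent encoder over per-frame features could be trained against sentence-level gloss labels with a CTC loss. All shapes, sizes, and names are assumptions, not a specific model from the reviewed works.

```python
# Illustrative sketch (assumptions: 100-gloss vocabulary, 512-dim per-frame features
# from some CNN backbone; the last class index is reserved as the CTC blank).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_GLOSSES = 100      # placeholder vocabulary size
FEATURE_DIM = 512      # placeholder per-frame feature size

frame_features = layers.Input(shape=(None, FEATURE_DIM))     # variable-length video
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(frame_features)
gloss_probs = layers.Dense(NUM_GLOSSES + 1, activation="softmax")(x)   # +1 for blank
encoder = models.Model(frame_features, gloss_probs)

def ctc_loss(y_true, y_pred, input_len, label_len):
    # y_true: padded gloss-index sequences; y_pred: per-frame gloss probabilities;
    # input_len / label_len: lengths of each video and label sequence in the batch.
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)
```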

Conclusions and Future Work

Sign language recognition has come a long way from its beginnings in recognizing alphabets and digits to recognizing words and sentences. Recent sign language recognition systems achieve a degree of success in dealing with dynamic signs based on hand and body motion, captured by vision-based or hardware devices. The use of DL for sign language recognition has raised system performance to a higher level and confirmed its effectiveness in recognizing signs of different forms, including letters, words, and sentences, captured using different devices, and in converting them into another form such as text or sound. In this paper, related works on the use of DL for sign language recognition from 2018 to 2022 have been reviewed, and it is concluded that DL reaches the desired performance, with high results in many aspects. Nevertheless, there remains room for further improvement to develop a comprehensive system capable of effectively handling all the challenges encountered in sign language recognition. The goal is to achieve accurate and rapid results across various environmental conditions while utilizing diverse datasets. As future work, the primary objective is to address the issue of generalization and minimize the time needed for sign language recognition. Our objective is to present a deep learning model that can provide precise and highly accurate recognition outcomes for various types of sign language, encompassing both static and dynamic signs in different languages, including English, Arabic, Malay, and Chinese. Notably, this model aims to achieve these outcomes while minimizing hardware expenses and the required training time, with high recognition accuracy.