Article category: Article
Published online: 15 Jun 2024
Pages: 77 - 116
Received: 27 May 2024
Accepted: 05 Jun 2024
DOI: https://doi.org/10.2478/jsiot-2024-0006
Keywords
© 2023 Shahad Thamear Abd Al-Latief et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Communication plays an essential role with enormous effects on individuals’ lives, such as in gaining and exchanging knowledge, interacting, developing social relationships, and revealing feelings and needs. While most humans communicate verbally, there are those with limited verbal abilities who need to communicate using sign language (SL). Sign language is a visual language used by deaf individuals; it relies mainly on various parts of the body, including the fingers, hands, arms, head, torso, and facial expressions, to transfer information rather than the vocal tract [1]. According to the World Federation of the Deaf, there are more than seventy million deaf people around the world who use more than 300 types of sign language [2]. However, sign language is not widespread among individuals with typical hearing and communication abilities, and only a few of them are able to understand and learn it. This reveals a genuine communication gap between deaf individuals and the rest of society. Automated recognition and translation of sign language would help to break down these barriers by providing a comfortable communication platform between deaf and hearing individuals and giving deaf individuals the same opportunities to obtain information as everyone else [3]. Machine translation demonstrates a remarkable capacity for overcoming linguistic barriers, particularly through the utilization of Deep Learning (DL), a branch of the field. Deep learning exhibits outstanding and exceptional performance in diverse domains, including image classification, pattern recognition, and various other fields and applications [4]. The advancement of DL networks has witnessed a significant surge in performance, particularly in video-related tasks such as human action recognition, motion capture, and gesture recognition [5,6,7]. Basically, DL techniques offer remarkable attributes that render them highly advantageous in Sign Language Recognition (SLR). This is primarily attributed to their hidden layers, which autonomously extract latent features, as well as their capacity to effectively handle the intricate nature of hand gestures in sign language. This is achieved by leveraging extensive datasets, enabling the generation of accurate outcomes without time-consuming processes, a characteristic often lacking in conventional translation methods [8]. This paper presents a review of various deep learning models used to recognize sign language, spotlights the key challenges encountered in using deep learning for sign language recognition, and identifies the unresolved issues. Additionally, it provides some suggestions to overcome challenges that, to the best of our knowledge, have not yet been solved.
The communication gap that exists between hearing and deaf individuals is the most important motivation for designing and building an interpreter to facilitate communication between them. When embarking on the design of such a translator, a comprehensive set of objectives must be taken into account. These include ensuring accuracy, speed, efficiency, scalability, and other factors that contribute to delivering a satisfactory translation outcome for both parties involved. However, numerous challenges have been identified in the realm of sign language recognition, necessitating the development of an efficient and robust system to address various issues related to environmental conditions, movement speed, occlusions, and adherence to linguistic rules. Deep-learning-based sign language recognition models have gained significant interest in the last few years due to the quality of the recognition and translation that they provide and their ability to deal with the various sign language recognition challenges.
The main contributions of this work are:
Provide a description of important concepts related to sign language, including acquisition methods, types of sign language, and a description of many public datasets in different languages around the world.
Identify the various challenges and problems encountered in the implementation of sign language recognition using DL.
Review more than 140 related works for DL-based sign language recognition from the year 2018 to 2022.
Classify these relevant works according to the specific problem addressed and the technique or method employed to overcome the specified challenge or problem.
This paper is organized into eight main sections, as described in Fig. 1. To facilitate smoother reading of this review, a detailed description of each section is presented below:
Introduction: Provides a brief introduction to sign language and deep learning, describes the motivation behind this review, introduces the main contributions, and illustrates the main layout of this work.
Sign Language Overview: Provides a comprehensive overview of sign language, encompassing its historical context and the fundamental principles employed in its construction and development. Additionally, it includes a description of the various forms used to represent letters, words, and sentences in sign language, as well as the acquisition methods employed for capturing sign language.
Deep Learning Background: Introduces the historical background of DL networks, their structures, properties, layers, and commonly utilized architectures.
Sign Language Recognition Challenges Using Deep Learning: Describes the main challenges and problems facing the recognition of sign language using DL.
Sign Language Public Datasets Description: Presents an overview of widely accessible sign language datasets, encompassing various languages and types (such as images and videos) and available in different formats (including RGB, depth, and skeleton data). Additionally, it provides a description of public action datasets related to sign language.
Deep Learning-Based Sign Language Related Works: Introduces a considerable number of promising related works for sign language using DL techniques from 2018 to 2022, organized based on the type of problem being addressed.
Discussion: Discusses the results and methods utilized by the presented related works.
Conclusion: Concludes the review, illustrates the conclusions reached by performing this review of sign language recognition using DL, and presents a set of recommendations for future research in this area.

Paper Organization
Sign language (SL) serves as a crucial means of communication for individuals who experience difficulties in speaking or hearing. Unlike spoken language, understanding sign language does not rely on auditory perception, nor does it involve vocalization. Instead, sign language is primarily conveyed through the simultaneous combination of hand shapes, orientations, and movements, as well as facial expressions, making it a visual language [9]. Historically, the linguistic study of sign language started in the 1970s [10] and showed that it resembles spoken languages, in that it arranges elementary units called phonemes into meaningful units known as semantic units, which carry lingual information comprising different symbols and letters. Sign language is not derived from spoken languages; instead, it has its own independent vocabulary and grammatical constructions [11]. However, the signs used by individuals who are deaf possess an internal structure similar to spoken words. Just as a limited number of sounds can generate hundreds of thousands of words in spoken languages, signs are formed by a finite set of gestural features. Thus, signs are not merely gestures; they are actually a group of linguistically significant features. There is a common misapprehension that there is a single, universal sign language. Just like spoken languages, sign languages evolve and grow inherently across time and space [12]. Many countries have their own national sign languages. However, there is also regional variance and domestic dialects. Moreover, signs do not have a one-to-one mapping to specific words. Therefore, sign language recognition is a complex process that extends beyond a simple substitution of individual signs with their corresponding spoken language counterparts. This is attributed to the fact that sign languages possess distinct vocabularies and grammatical structures that are not confined to any particular spoken language. Furthermore, even within regions where the same spoken language is used, there can be significant variations in the sign languages employed [13].
The signs of sign language must be captured and acquired to provide input for the recognition system, and there are various acquisition techniques that provide several types of input, such as images, videos, and signals. Basically, the main acquisition methods for any sign language recognition system depend on one of these acquisition techniques.
Single camera: Refers to a filming technique or production method that involves using only one camera, such as a webcam, digital camera, video camera, or smartphone camera.
Stereo camera: Combines multiple monocular cameras, or thermal ones, to capture depth information.
Active methods: Utilize the projection of structured light using devices such as the Kinect and the Leap Motion Controller (LMC), which are 3D cameras that can gather movement and skeletal data.
Other methods: Such as body markers in colored gloves, wrist bands, and LED lights.
Generally, the major advantages of vision-based methods are that they are inexpensive, convenient, and non-intrusive. The user simply needs to communicate using sign language naturally in front of an image-capturing device. This makes them suitable for real-time applications [16]. However, the use of vision-based input suffers from a set of problems, including [17]:
Too much redundant information, which causes low recognition efficiency.
Low recognition accuracy due to occlusion and motion blur.
Variances in sign language style between individuals, resulting in poor generalization of algorithms.
A small recognizable vocabulary, because large-vocabulary datasets contain similar words.
Challenging issues regarding time, speed, and overlapping.
The need for additional feature extraction methods to operate correctly.
Inertial Measurement Unit (IMU): An electronic device employed to measure and report an object's specific force, position, angular rate, acceleration, and sometimes orientation with respect to an inertial reference frame. It typically consists of a combination of accelerometers, gyroscopes, and sometimes magnetometers.
Electromyography (EMG): A device that uses electrodes placed on or inserted into the skin near the muscle of interest to measure the electrical pulses of human muscles and employs the bio-signal to detect movements.
Wi-Fi and Radar: These devices mainly depend on radio waves, broad-beam radar, or spectrograms to detect in-air signal strength variation. They are employed to monitor the movements and positions of the deaf by capturing the reflections of radio waves off their body or hand movements. Radar systems can provide data on the dynamics and trajectories of sign language gestures, which can then be used for analysis or recognition purposes.
Others: Including flex sensors, ultrasonic, mechanical, electromagnetic, and haptic technologies.
In general, although these methods exhibit higher speed and accuracy, the necessity for individuals to wear sensors remains impractical for the following reasons [21]:
It may cause a burden on the users, because they must carry electronic devices with them when moving.
Portable electronic devices require a battery, which needs to be charged from time to time.
Specific equipment is required to process the signals acquired from the wearable devices.
Static: A specific hand configuration and pose, depicted through a single image, is employed for the recognition of fingerspelled gestures of alphabets and digits. This recognition process relies solely on still images as input to predict and generate the corresponding output, without incorporating any movement. It is considered to be very inconvenient, due to the time required to perform prediction each time an input is given, and depends basically on handshapes, hand positions, and facial expressions to convey meaning [24].
Dynamic: Refers to a variant of sign language, in which signs are produced with movement. This form of communication encompasses not only handshapes and positions but also incorporates the movement of hands, arms, and other body parts to convey meaning. To capture and represent this type of sign language, video streams are required [25]. There are certain words in sign language, such as in American Sign Language, which necessitate hand movements for proper representation, making it a dynamic form. It plays a vital role in facilitating communication, as well as establishing linguistic and cultural identities within the deaf community. Dynamic signs find application in various contexts, including everyday conversations, education, storytelling, performances, and broadcasting. Broadly speaking, dynamic signs can be categorized into two main types based on what they represent, be it individual words or complete sentences. These are described below [26]:
Isolated: The input dynamic signs are used to represent words by performing more than one sign each time, and pauses only occur between words.
Continuous: Continuous dynamic entries are mainly employed to represent sentences, because they incorporate more than one sign performed continuously without any pause between signs [27].
The Deep Neural Network is basically a branch of Machine Learning (ML) that was originally inspired by, and resembles, the human nervous system and the structure of the brain. It is composed of several layers and nodes, in which the layers are processing units organized into input, output, and hidden layers. The nodes or units in every layer are linked to nodes in contiguous layers, and every connection owns its singular weight value. The inputs are multiplied by the intended weights and summed at every unit. The summation result undergoes a transformation depending on some type of activation function, such as the sigmoid function, hyperbolic tangent, or Rectified Linear Unit (ReLU) [28]; a small numerical sketch of this computation is given after the list below. Thus, DL involves stacking many learning layers to learn high-level abstractions in the data by approximating highly nonlinear functions, giving the learning algorithm the ability to learn hierarchical features from the input data. This feature learning has largely replaced hand-engineered features and owes its resurgence to effective optimization methods and powerful computational resources [29]. DL's powerful properties give it the ability to take the lead in achieving the desired results, depending on a set of factors including [30]:
Feature learning: Refers to the capacity to acquire descriptive features from data that have an impact on other correlated tasks. This implies that numerous relevant factors are disentangled within these features, in contrast to handcrafted features that are designed to remain constant with respect to the targeted factors.
Hierarchical representation: The features are represented in a hierarchical format, in which the simple ones are represented in the lower layers and the higher layers learn increasingly complicated features. This provides a successful encoding of both local and global properties in the final feature representation.
Distributed representation: Signifies a many-to-many relationship where the representations are dispersed. This occurs because multiple neurons can express a single factor, while one neuron can account for multiple factors. Such an arrangement mitigates the curse of dimensionality and offers a compact and comprehensive representation.
Large-scale datasets: DL is able to deal with datasets containing a vast number of samples and gives outstanding performance in many domains.
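As a small numerical illustration of the layer computation described above (inputs multiplied by weights, summed, then passed through an activation), the following sketch uses plain NumPy with toy sizes and random weights chosen purely for illustration; it is not a model from any of the surveyed works.

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: one of the activation functions mentioned above
    return np.maximum(0.0, x)

def dense_forward(x, weights, biases):
    """One layer's computation: multiply inputs by their weights, sum, add the
    bias, then pass the result through the activation (ReLU here)."""
    return relu(weights @ x + biases)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                         # toy input vector
h1 = dense_forward(x, rng.normal(size=(8, 4)), np.zeros(8))    # hidden layer (8 units)
h2 = dense_forward(h1, rng.normal(size=(3, 8)), np.zeros(3))   # stacked second layer
print(h2)
```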
In recent years, DL methods have demonstrated exceptional performance, surpassing previous state-of-the-art ML techniques across various domains. One domain in which DL has emerged as a prominent methodology is computer vision, particularly in the context of sign language recognition. DL has provided novel solutions to challenges in sign language recognition and has become a leading approach in this field [31]. Many DL architectures have been utilized for sign language recognition in an accurate, fast, and efficient manner, due to their ability to deal with most challenges and the complexity of sign language [32]. The most popular and utilized DL architectures are the Convolutional Neural Network (CNN), Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), Auto Encoder (AE), Variational Auto Encoder (VAE), Generative Adversarial Network (GAN), and Recurrent Neural Network (RNN), including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) [33].
Detection, tracking, pose estimation, gesture recognition, and pose recovery represent key sub-fields within sign language recognition. These sub-fields are extensively employed in human-computer interaction applications that utilize DL techniques. Nevertheless, the recognition or conversion of signs performed by deaf individuals using DL presents a range of challenges that can significantly impact the output results. These challenges include:
Feature extraction is the process used to select and/or combine variables into features, effectively reducing the amount of data that must be processed while still accurately and completely describing the original data. It has a role in addressing the problem of finding the most compact and informative set of features, to enhance the efficiency of data storage and processing. Defining feature vectors remains the most common and convenient means of data representation for classification and regression [34]. In the context of sign language recognition, the process of extracting pertinent features plays a vital and decisive role. Irrelevant features, on the other hand, can result in misclassification and erroneous recognition [35]. Within the realm of DL techniques for data classification, automatic feature extraction holds paramount importance. Integrating various features extracted from both training and testing images without any data loss is a crucial step that greatly impacts the recognition accuracy of sign language. In general, two types of features are considered in sign language: manual and non-manual features. Manual features encompass the movements of hands, fingers, and arms. Non-manual features, on the other hand, represent a fundamental component of sign language [36], such as facial expressions, eye gaze, head movements, upper body motion, and positioning. The combination of manual and non-manual features offers a comprehensive representation of sign language. In the domain of DL, features can be classified into two categories: spatial and temporal. Spatial features pertain to the geometric representation of shapes within a specific coordinate space, while temporal features account for time-related aspects during movement, especially when dealing with a sequence of images as input. By employing feature fusion and combining multiple types of features in the process of sign language recognition using DL, one can achieve the desired outcomes effectively [37].
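As an illustration of such spatial-temporal feature fusion, the hedged sketch below (PyTorch, with a toy per-frame CNN, an LSTM over the frame features, and hypothetical layer sizes and class count) concatenates a spatial descriptor with a temporal one before classification. It is a minimal example of the idea, not any of the surveyed architectures.

```python
import torch
import torch.nn as nn

class SpatioTemporalFusion(nn.Module):
    def __init__(self, num_classes=26):                       # class count is illustrative
        super().__init__()
        # spatial branch: a tiny per-frame CNN producing a 16-d descriptor
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # temporal branch: an LSTM over the sequence of per-frame descriptors
        self.temporal = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
        self.classifier = nn.Linear(16 + 32, num_classes)      # operates on the fused vector

    def forward(self, clips):                                  # clips: (B, T, 3, H, W)
        b, t, c, h, w = clips.shape
        frame_feats = self.spatial(clips.view(b * t, c, h, w)).view(b, t, -1)
        spatial_feat = frame_feats.mean(dim=1)                 # spatial summary (B, 16)
        _, (hidden, _) = self.temporal(frame_feats)            # temporal summary (1, B, 32)
        fused = torch.cat([spatial_feat, hidden[-1]], dim=1)   # feature fusion (B, 48)
        return self.classifier(fused)

model = SpatioTemporalFusion()
dummy = torch.randn(2, 8, 3, 64, 64)                           # 2 clips of 8 RGB frames
print(model(dummy).shape)                                      # torch.Size([2, 26])
```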
The variability in the environment during sign capture poses a significant technical challenge with a notable impact on sign language recognition. When capturing an image, numerous factors, such as lighting conditions (spectra, source distribution, and intensity) and camera characteristics (sensor response and lenses), exert their influence on the appearance of the captured sign. Additionally, skin reflectance properties and internal camera controls [38] further contribute to these effects. Moreover, noise originating from other elements present in the background and landmarks can also influence sign recognition outcomes.
The movements in sign language are dynamic acts, exhibiting trajectories with distinct beginnings and ends. The representation of dynamic sign language involves both isolated and continuous signing, wherein signs are performed consecutively without pauses. This introduces challenges related to similarity and occlusion, arising from variations in hand movements and orientations, involving one or both hands in different angles and directions [39]. The determination of each sign's precise beginning and end presents a significant hurdle, resulting in what is termed Movement Epenthesis (ME) or transition segments. These ME segments act as connectors between sequential signs when transitioning from the final position of one sign to the initial position of the next. However, ME segments do not convey any specific sign information; instead, they contribute to the complexity of recognizing continuous sign sequences. The lack of well-defined rules for making such transitions poses a significant challenge [40], demanding careful attention and a demonstrable approach to address effectively.
The segmentation process stands out as one of the most formidable challenges in computer vision, especially in the context of sign language recognition, where the extraction of hands from video frames or images holds particular significance due to their critical role in the recognition process. To address this, image segmentation is employed to isolate relevant hand data while eliminating undesired elements, such as background and other objects in the input, which might conflict with the classifier operations [41]. Image segmentation restricts the data region, enabling the classifier to focus solely on the Region of Interest (ROI). Segmentation methods can be categorized as contextual and non-contextual. Contextual segmentation takes spatial relationships between features into account, often using edge detection techniques. Conversely, non-contextual segmentation does not consider spatial relationships; rather, it gathers pixels based on global attributes. Hand tracking can be viewed as a subfield of segmentation, and it typically poses challenges, particularly when the hand moves swiftly, leading to significant appearance changes within a few frames [42].
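As a concrete, deliberately simplified illustration of non-contextual, colour-based hand segmentation, the sketch below uses OpenCV HSV skin-colour thresholding and the largest contour to crop a hand region of interest. The threshold values are illustrative assumptions and would need tuning for real lighting conditions and skin tones.

```python
import cv2
import numpy as np

def segment_hand(frame_bgr):
    """Isolate a hand region via HSV skin-colour thresholding (one common
    non-contextual approach) and return the bounding-box ROI plus the mask."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower, upper = np.array([0, 30, 60]), np.array([25, 180, 255])   # illustrative skin range
    mask = cv2.inRange(hsv, lower, upper)                            # binary skin mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None, mask
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return frame_bgr[y:y + h, x:x + w], mask

# usage with a synthetic frame (a stand-in for a webcam capture)
roi, mask = segment_hand(np.random.randint(0, 255, (240, 320, 3), np.uint8))
```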
In the realm of sign language recognition, the classifier's selection and design require meticulous attention. It is essential to carefully determine the architecture of the classifier, encompassing its layers and parameters, in order to steer clear of potential problems like overfitting or underfitting. The primary objective is to achieve optimal performance in classifying sign language. Furthermore, the classifier's ability to generalize effectively across diverse data types, rather than being confined to specific subsets, is of paramount importance [43].
Real-time recognition of sign language is an important concern and one of the main problems that needs a practical solution in order to provide an efficient interpreter and bridge the communication gap between the deaf community and the general public. The time problem arises from the need to process video data in real time or with minimal delay. The computational complexity, both in hardware and software, can be quite demanding and may present challenges for the deaf community to deal with effectively [44].
The availability of sign language datasets is limited, and this can be considered one of the main obstacles in designing an accurate recognition system, since there are few datasets available for sign language compared to gesture databases. Several sign language datasets have been created with many variations, such as regional differences, type of images (RGB or depth), type of acquisition method (images, video), and so on. Sign language differs from one region to another, just like spoken languages, and each type has its own properties and linguistic grammar. The most publicly available and utilized sign and gesture datasets in different languages are described in this section and categorized depending on the type of language, as illustrated in Table 1 and Fig. 2.
Public sign language datasets
Dataset | Language | Equipment | Modalities | Signers | Samples |
---|---|---|---|---|---|
ASL alphabets [45] | American | Webcam | RGB images | - | 87,000 |
MNIST [46] | American | Webcam | Grey images | - | 27,455 |
ASL Fingerspelling A [47] | American | Microsoft Kinect | RGB and depth images | 5 | 48,000 |
NYU [48] | American | Kinect | RGB and depth images | 36 | 81,009 |
ASL by Surrey [49] | American | Kinect | RGB and depth images | 23 | 130,000 |
Jochen-Triesch [50] | American | Cam | Grey images with different background | 24 | 720 |
MKLM [51] | American | Leap Motion device and a Kinect sensor | RGB and depth images | 14 | 1400 |
NTU-HD [52] | American | Kinect sensor | RGB and depth images | 10 | 1000 |
HUST [53] | American | Microsoft Kinect | RGB and depth images | 10 | 10880 |
RVL-SLLL [54] | American | Cam | RGB video | 14 | |
ChicagoFSWild [55] | American | Collected online from YouTube | RGB video | 160 | 7,304 |
ASLG-PC12 [56] | American | Cam | RGB video | - | 880 |
American Sign Language Lexicon Video (ASLLVD) [57] | American | Cam | RGB videos of different angles | 6 | 3,300 |
MU [58] | American | Cam | RGB images with illumination variations in five different angles | 5 | 2515 |
ASLID [59] | American | Web cam | RGB images | 6 | 809 |
KSU-SSL [60] | Arabic | Cam and Kinect | RGB Videos with uncontrolled environment | 40 | 16000 |
KArSL [61] | Arabic | Kinect V2 | RGB video | 3 | 75,300 |
ArSL by University of Sharjah [62] | Arabic | Analog camcorder | RGB images | 3 | 3450 |
JTD [63] | Indian | Webcam | RGB images with 3 different backgrounds | 24 | 720 |
IISL2020 [64] | Indian | Webcam | RGB video with uncontrolled environment | 16 | 12100 |
RWTH-PHOENIX-Weather 2014 [65] | German | Webcam | RGB Video | 9 | 8,257 |
SIGNUM [66] | German | Cam | RGB Video | 25 | 33210 |
DEVISIGN-D [67] | Chinese | Cam | RGB videos | 8 | 6000 |
DEVISIGN-L [67] | Chinese | Cam | RGB videos | 8 | 24000 |
CSL-500 [68] | Chinese | Cam | RGB, depth and skeleton videos | 50 | 25,000 |
Chinese Sign Language [69] | Chinese | Kinect | RGB, depth and skeleton videos | 50 | 125000 |
38 BdSL [70] | Bengali | Cam | RGB images | 320 | 12,160 |
Ishara-Lipi [71] | Bengali | Cam | Greyscale images | - | 1800 |
ChaLearn14 [72] | Italian | Kinect | RGB and depth video | 940 | 940 |
Montalbano II [73] | Italian | Kinect | RGB and depth video | 20 | 940 |
UFOP–LIBRAS [74] | Brazilian | Kinect | RGB, depth and skeleton videos | 5 | 2800 |
AUTSL [75] | Turkish | Kinect v2 | RGB, depth and skeleton videos | 43 | 38,336 |
RKS-PERSIANSIGN [76] | Persian | Cam | RGB video | 10 | 10,000 |
LSA64 [77] | Argentine | Cam | RGB video | 10 | 3200 |
Polytropon (PGSL) [78] | Greek | Cam | RGB video | 6 | 840 |
KETI [79] | Korean | Cam | RGB video | 40 | 14,672 |

Samples of sign language datasets.
Several critical factors contribute to the evaluation of sign language datasets. One such factor is the number of signers involved in performing the signs, which significantly impacts the dataset's diversity and subsequently affects the evaluation of a recognition system's generalization rate. Additionally, the quantity of distinct signs within the datasets, particularly in isolated and continuous formats, holds considerable importance. Furthermore, the number of samples per sign plays a crucial role in training systems that require an ample representation of each sign. Adequate sample representation helps improve the robustness and accuracy of the recognition systems. Moreover, when dealing with continuous datasets, annotating them with temporal information for continuous sentence components is very important. This temporal information is vital for effectively processing and understanding this type of dataset [80]. Although sign language recognition is one of the applications of gesture recognition, gesture datasets are seldom utilized for sign language recognition, for several reasons. First, the number of classes in gesture recognition datasets is somewhat limited. Secondly, sign language involves the simultaneous use of manual and non-manual gestures, posing challenges in annotating both types of gestures within a single gesture dataset. Moreover, sign language relies on hand gestures, while gesture datasets are broader and include gestures involving full body movements. Additionally, gesture datasets lack the necessary details about hand fingers, which are essential for developing accurate sign language recognition systems [81]. Nevertheless, despite these limitations, gesture datasets can still play a role in training sign recognition systems. In this context, Table 2 presents a comprehensive overview of various gesture datasets, and Fig. 3 illustrates some representative examples.
Public gesture datasets
Name | Modality | Device | Signers | Samples |
---|---|---|---|---|
LMDHG [82] | RGB, and depth videos | Kinect and | 21 | 608 |
SHREC Shape Retrieval Contest (SHREC) [83] | RGB, and depth videos | Intel RealSense short range depth camera | 28 | 2800 |
UTD–MHAD [84] | RGB, depth and skeleton videos | Kinect and wearable inertial sensor | 8 | 861 |
The Multicamera Human Action Video Data (MuHAVi) [85] | RGB video | 8 camera views | 14 | 1904 |
NUMA [86] | RGB, depth and skeleton videos | 10 Kinect with three different views | 10 | 1493 |
WEIZMANN [87] | Low resolution RGB video | Camera with 10 different viewpoints | 9 | 90 |
NTU RGB [88] | RGB, depth and skeleton videos | Kinect | 40 | 56,880 |
Cambridge hand gesture [89] | RGB video captured under five different illuminations | Cam | 9 | 900 |
VIVA [90] | RGB, and depth videos | Kinect | 8 | 885 |
MSR [91] | RGB, and depth videos | Kinect | 10 | 320 |
CAD-60 [92] | RGB and depth video in different environments, such as a kitchen, a living room, and office | Kinect | 4 | 48 |
HDM05MoCap (motion capture) [93] | RGB video | Cam | 5 | 2337 |
CMU [94] | RGB images | CAM | 25 | 204 |
isoGD [95] | RGB and depth videos | Kinect | 21 | 47,933 |
NVIDIA [96] | RGB and depth video | Kinect | 8 | 885 |
G3D [97] | RGB and depth video | Kinect | 16 | 1280 |
UT Kinect [98] | RGB and depth video | Kinect | 10 | 200 |
First-Person [99] | RGB and depth video | RealSense SR300 cam | 6 | 1,175 |
Jester [100] | RGB | Cam | 25 | 148,092 |
EgoGesture [101] | RGB and depth video | Kinect | 50 | 2,081 |
NUS II [102] | RGB images with complex backgrounds, and various hand shapes and sizes | Cam | 40 | 2000 |

Samples of gesture datasets
Numerous research efforts have been dedicated to the recognition and translation of sign language across diverse languages worldwide, aiming to facilitate its conversion into other communication forms used by individuals, such as text or sound. This study categorizes the works on sign language recognition using DL according to the primary challenges encountered in recognition and the corresponding solutions proposed by each of the investigated works. Any sign language recognition system consists of key stages, which include sign acquisition, hand segmentation, tracking, preprocessing, feature extraction, and classification, as depicted in Fig. 4.

The procedural stages of sign language recognition
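As a purely illustrative sketch of how these stages chain together, the following Python stub pipeline (NumPy only, with placeholder logic in every stage, not any method from the surveyed works) shows acquisition, segmentation, preprocessing, feature extraction, and classification as separate replaceable steps.

```python
import numpy as np

def acquire_frames(num_frames=8, h=64, w=64):
    # stand-in for a camera or dataset loader
    return np.random.randint(0, 255, (num_frames, h, w, 3), np.uint8)

def segment_hand(frame):
    return frame[16:48, 16:48]                 # placeholder for a real hand ROI extractor

def preprocess(roi):
    return roi.mean(axis=-1) / 255.0           # grayscale conversion + normalisation

def extract_features(frames):
    # one flattened feature vector per frame (a real system would use a CNN, HOG, etc.)
    return np.stack([preprocess(segment_hand(f)).ravel() for f in frames])

def classify(features, num_classes=26):
    # placeholder classifier: random projection followed by argmax
    scores = features.mean(axis=0) @ np.random.randn(features.shape[1], num_classes)
    return int(scores.argmax())

print(classify(extract_features(acquire_frames())))
```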
In sign acquisition, the input modalities, as mentioned earlier, are either an image or a video stream captured using one of the vision-based capturing devices, or depth information collected using one of the hardware-based acquisition devices. The input modality may be in any format, including RGB color, greyscale, and binary. In general, DL techniques need a sufficient amount of high-quality data samples for training to be conducted.
Accuracy is one of the most common performance measurements considered in any type of recognition system, in addition to the percentage of error, which may be quantified using the Equal Error Rate, Word Error Rate, and False Rate. Another evaluation metric, the Bilingual Evaluation Understudy Score (BLEU), is used to measure the match between the resulting sentences and the entered sign language reference. A perfect match results in a score of 1.0, while the worst score, representing a complete mismatch, is 0.0; it is therefore also considered a measurement of accurate translation and is widely used in machine translation systems [103].
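For instance, the Word Error Rate mentioned above is the word-level edit distance between the produced sentence and the reference, divided by the reference length (WER = (S + D + I) / N). The short sketch below (plain Python, with illustrative strings) computes it; BLEU, by contrast, is usually taken from standard library implementations rather than re-derived.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed here as a word-level edit distance normalised by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# toy example: two word-level errors against a four-word reference -> WER = 0.5
print(word_error_rate("i need help please", "i need your help"))
```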
The related sign language works using DL are categorized in this work based on the type of problem addressed and the technique utilized to obtain the desired result.
The acquired signs may exhibit issues such as low quality, noise, varying degrees of orientation, or excessive size. Therefore, the preprocessing step becomes indispensable to rectify these issues in sign images and videos, effectively eliminating any environmental influences that might have affected them, such as variations in illumination and color. This phase involves the application of filters and other techniques to adjust the size, orientation, and color, ensuring improved data quality for subsequent analysis and recognition. The primary advantage of preprocessing is enhancing the image quality, which enables efficient hand segmentation from the scene for effective feature extraction. In the case of video streams, preprocessing serves to eliminate redundant and similar frames from the input video, thereby increasing the processing speed of the neural network without sacrificing essential information. Many DL-based sign language recognition works overcome the environmental condition problem using a variety of techniques. Table 3 illustrates the most important related works, the environmental condition being addressed, and the proposed technique. Fig. 5 shows a sample of the NUS II dataset to illustrate the environmental condition problem.
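A hedged sketch of a typical preprocessing pass of the kind described above is given below (OpenCV; the resize target, blur kernel, and frame-similarity threshold are illustrative assumptions, not values from any surveyed work): resize, convert to grayscale, equalize the histogram to reduce illumination effects, smooth noise, and drop near-duplicate frames.

```python
import cv2
import numpy as np

def preprocess_frame(frame_bgr, size=(128, 128)):
    """Resize, grayscale, illumination-normalize, and denoise one frame."""
    frame = cv2.resize(frame_bgr, size)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)               # reduce illumination variation
    return cv2.GaussianBlur(gray, (5, 5), 0)    # suppress sensor noise

def drop_redundant(frames, threshold=5.0):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = [frames[0]]
    for f in frames[1:]:
        if np.mean(cv2.absdiff(f, kept[-1])) > threshold:
            kept.append(f)
    return kept

# usage on stand-in frames (random arrays instead of a real video stream)
raw = [np.random.randint(0, 255, (240, 320, 3), np.uint8) for _ in range(10)]
processed = drop_redundant([preprocess_frame(f) for f in raw])
```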
Related works on SLR using DL that address the various environmental conditions problem.
Author(s) | Year | Language | Modality | Type of condition | Technique | Result |
---|---|---|---|---|---|---|
[130] | 2018 | Bengali | RGB images | Variant background and skin colors | Modified VGG net | 84.68% |
[134] | 2018 | American | RGB images | noise and missing data | Augmentation | 98.13% |
[150] | 2018 | Indian | RGB video | Different viewing angles, background lighting, and distance | Novel CNN | 92.88% |
[158] | 2019 | American | Binary images | Noise | Erosion, closing, contour generation, and polygonal approximation, | 96.83% |
[159] | 2019 | American | Depth image | Variant illumination, and background | Attain depth images | 88.7% |
[164] | 2019 | Chinese | RGB, and depth video | Variant illumination, and background | Two-stream spatiotemporal network | 96.7%
[173] | 2019 | Indian | RGB, and depth video | Variant illumination, background, and camera distance | Four stream CNN | 86.87% |
[178] | 2020 | Arabic | RGB images | Variant illumination, and skin color | DCNN | 94.31% |
[179] | 2020 | Arabic | RGB videos | Variant illumination, background, pose, scale, shape, position, and clothes | Bi-directional Long Short-Term Memory (BiLSTM) | 89.59% |
[180] | 2020 | Arabic | RGB Videos | Variant illumination, clothes, position, scale, and speed | 3DCNN and SoftMax function | 87.69% |
[182] | 2020 | Arabic | RGB Videos | Variations in heights and distances from camera | Normalization | 84.3% |
[194] | 2020 | Arabic | RGB images | variant illumination, and background | VGG16 and the ResNet152 with enhanced softmax layer | 99% |
[201] | 2020 | American | Grayscale images | illumination, and skin color | Set the hand histogram | 95% |
[202] | 2020 | American | RGB images | Variant illumination, background | DCNN | 99.96% |
[206] | 2021 | Indian | RGB video | Variant illuminations, camera positions, and orientations | Google net+ BiLSTM | 76.21% |
[207] | 2021 | Indian | RGB images | Light and dark backgrounds | DCNN with few numbers of parameters | 99.96% |
[209] | 2021 | American | RGB video | Noise | Gaussian Blur | 99.63% |
[213] | 2021 | Korean | Depth Videos | Low resolution | Augmentation | 91% |
[224] | 2021 | Bengali | RGB images | Variant backgrounds, camera angle, light contrast, and skin tone | Conventional deep learning + Zero-shot learning ZSL | 93.68% |
[225] | 2021 | Arabic | RGB video | Variant illumination, background, and clothes | Inception-BiLSTM | 84.2% |
[227] | 2021 | American | Thermal images | Varying illumination | Adopt live images taken by a low-resolution thermal camera | 99.52% |
[229] | 2021 | Indian | RGB video | Varying illumination | 3DCNN | 88.24% |
[230] | 2021 | American | RGB video | Noise, varying illumination | Median filtering + histogram equalization | 96% |
[236] | 2021 | Arabic | RGB images | Variant illumination, and background | Region-based Convolutional Neural Network (R-CNN) | 93.4% |
[239] | 2022 | Indian | RGB video | Variant illumination, and views | Grey scale conversion and histogram equalization | 98.7% |
[241] | 2022 | Arabic | RGB video | Variant illumination, and background | CNN+ RNN | 98.8% |
[249] | 2022 | Arabic | Greyscale images | Variant illumination, and background | Sobel filter | 97% |
[253] | 2022 | Arabic | RGB, and depth video | Variant Background | ResNet50-BiLSTM | 99% |
[259] | 2022 | American | RGB, and depth images | Noise and illumination variation | Median filtering and histogram equalization | 91.4% |
[261] | 2022 | American | Skeleton video | Noise in video frames | An innovative weighted least square (WLS) algorithm | 97.98% |
[270] | 2022 | English | Wi-Fi signal | Noise and uncleaned Wi-Fi signals. | Principal Component Analysis (PCA) | 95.03% |

Sample images (class 9) from NUS hand posture dataset-II (data subset A), showing the variations in hand posture sizes and appearances.
Another challenge arises when attempting to recognize signs, particularly in the dynamic type, where movement is considered one of the key phonological parameters in sign phonology. This pertains to the variations in hand location, speed, orientation, and angles during the signing process [104]. A consensus on how to characterize and organize movement types and their associated features in a phonological representation has been lacking. Due to divergent approaches and perspectives, there remains uncertainty about the most suitable and standardized way to define and categorize movements in sign language. In general, there are three main types of movements in sign language [105,106]:
Movement of the hands and arms: Includes waving, pointing, or tracing shapes in the air.
Movement of the body: Includes twisting, turning, or leaning to indicate direction or location.
Movement of the face and head: Includes nodding, shaking the head, or raising the eyebrows to convey different meanings or emotions.
The movements involved in demonstrating sign language also pose a significant challenge, which includes dealing with similar paths of movement (trajectory) and occlusion. Arm trajectory formation refers to the principles and laws that invariantly govern the selection, planning, and generation of multi-joint movements, as well as to the factors that dictate their kinematics, namely geometrical and temporal features [107]. The sign language movement trajectory swerves to some extent due to the action speed and arm length of the user; even for the same user, psychological changes result in inconsistent execution speed of sign language movements [108]. Movement trajectory recognition is a key link in sign language translation research and directly influences the accuracy of sign language translation, since the same sign performed with different movement trajectories predominantly refers to two different meanings, that is, it illustrates different signs [109]. On the other hand, occlusion means that some fingers or parts of the hand are covered (not in view of the camera) or hidden by other parts of the scene, so the sign cannot be detected accurately [110]. Occlusion may appear in various forms, including hand/hand and hand/face, depending on the movement and the captured scene. Occlusion has a great effect on the segmentation procedure, especially on skin segmentation techniques [111]. Table 4 summarizes the most important related DL works that handle these types of problems in sign language recognition.
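To illustrate what a movement trajectory looks like as data, the hedged sketch below assumes the MediaPipe Hands API (an assumption; no surveyed work is implied) and simply collects the wrist landmark frame by frame into a 2-D path that a sequence model or trajectory-matching technique could then consume.

```python
import numpy as np
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1)
trajectory = []
for _ in range(30):                                              # stand-in for video frames;
    frame_rgb = np.random.randint(0, 255, (240, 320, 3), np.uint8)   # random frames will yield
    result = hands.process(frame_rgb)                                # no detections in practice
    if result.multi_hand_landmarks:
        wrist = result.multi_hand_landmarks[0].landmark[0]       # landmark index 0 = wrist
        trajectory.append((wrist.x, wrist.y))                    # normalised image coordinates
path = np.array(trajectory)                                      # (num_detected_frames, 2)
```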
Related works on SLR using DL that address movement, orientation, trajectory, and occlusion problems.
Author(s) | Year | Type of variation | Language | Signing mode | Model | Accuracy | Error Rate |
---|---|---|---|---|---|---|---|
[129] | 2018 | similarities, and occlusion | American | Static | DCNN | 92.4% | |
[135] | 2018 | Movement | Brazilian | Isolated | Long-term Recurrent Convolutional Networks | 99% | - |
[138] | 2018 | size, shape, and position of the fingers or hands | American | Static | CNN | 82% | - |
[140] | 2018 | Hand movement | American | Isolated | VGG 16 | 99% | - |
[144] | 2018 | Movement | American | Isolated | Leap Motion Controller | 88.79% | - |
[145] | 2018 | 3D motion | Indian | Isolated | Joint Angular Displacement Maps (JADMs) | 92.14% | |
[150] | 2018 | head and hand movements | Indian | Continuous | CNN | 92.88% | - |
[155] | 2019 | Hand movement | Indian | Continuous | Wearable systems to measure muscle intensity, hand orientation, motion, and position | 92.50% | - |
[156] | 2019 | Variant hand orientations | Chinese | Continuous | Hierarchical Attention Network (HAN) and Latent Space | 82.7% | - |
[165] | 2019 | Similarity and trajectory | Chinese | Isolated | Deep 3-d Residual ConvNet + BiLSTM | 89.8% | - |
[166] | 2019 | orientation of camera, hand position and movement, inter hand relation | Vietnamese | Isolated | DCNN | 95.83% |
[173] | 2019 | Movement, self-occlusions, orientation, and angles | Indian | Continuous | Four stream CNN | 86.87% |
[174] | 2019 | Movement in different distance from the camera | American | Static | Novel DNN | 97.29% | - |
[176] | 2020 | Angles, distance, object size, and rotations | Arabic | Static | Image Augmentation | 90% | 0.53 |
[180] | 2020 | fingers' configuration, hand's orientation, and its position to the body | Arabic | Isolated | Multilayer perceptron+ Autoencoder | 87.69% | |
[185] | 2020 | Hand Movement | Persian | Isolated | Single Shot Detector (SSD) +CNN+LSTM | 98.42% | |
[186] | 2020 | shape, orientation, and trajectory | Greek | Isolated | Fully convolutional attention-based encoder-decoder | 95.31% | - |
[192] | 2020 | Trajectory | Greek | Isolated | incorporate the depth dimension in the coordinates of the hand joints | 93.56% | - |
[195] | 2020 | finger angles and Multi finger movements | Taiwanese | Continuous | Wristband with ten modified barometric sensors + dual DCNN | 97.5% |
[196] | 2020 | movement of fingers and hands | Chinese | Isolated | Motion data from IMU sensors | 99.81% | - |
[197] | 2020 | finger movement | Chinese | Isolated | Trigno Wireless sEMG acquisition system used to collect multichannel sEMG signals of forearm muscles | 93.33% | |
[199] | 2020 | finger and arm motions, two-handed signs, and hand rotation | Chinese | Continuous | Two armbands embedded with an IMU sensor and multi-channel sEMG sensors are attached on the forearms to capture both arm and finger movements | - | 10.8%
[76] | 2020 | Hand occlusion | Persian | Isolated | Skeleton detection | 99.8% | |
[204] | 2020 | Trajectory | Brazilian | Isolated | Convert the trajectory information into spherical coordinates | 64.33% | |
[210] | 2021 | Trajectory | Arabic | Isolated | Multi-Sign Language Ontology (MSLO) | 94.5% | |
[213] | 2021 | Movement | Korean | Isolated | 3DCNN | 91% | |
[214] | 2021 | finger movement | Chinese | Isolated | Design a low-cost data glove with simple hardware structure to capture finger movement and bending simultaneously | 77.42% |
[218] | 2021 | Skewing, and angle rotation | Bengali | Static | DCNN | 99.57% | 0.56
[219] | 2021 | Hand motion | American | Continuous | Sensing Gloves | 86.67% |
[223] | 2021 | spatial appearance and temporal motion | Chinese | Continuous | Lexical prediction network | 91.72% | 6.10
[226] | 2021 | finger self-occlusions, view invariance | Indian | Continuous | Motion modelled deep attention network (M2DA-Net) | 84.95% |
[228] | 2021 | Occlusions of hand/hand, hands/face, or hands/upper body postures | American | Continuous | Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs), with a deep Long Short-Term Memory (LSTM) as generator and an LSTM with a 3D Convolutional Neural Network (3D-CNN) as discriminator | 97% | 1.4
[230] | 2021 | Variant view | American | Isolated | 3-D CNN’s cascaded | 96% | |
[233] | 2021 | Hand occlusion, | Italian | Isolated | LSTM+CNN | 99.08% | |
[237] | 2021 | Finger occlusion, motion blurring, variant signing styles | Chinese | Continuous | Dual network built upon a Graph Convolutional Network (GCN) | 98.08% |
[239] | 2022 | self-structural characteristics, and occlusion | Indian | Continuous | Dynamic Time Warping (DTW) | 98.7% |
[240] | 2022 | High similarity and complexity | American | Static | DCNN | 99.67% | 0.0016 |
[241] | 2022 | Movement | Arabic | Isolated | The difference function | 98.8% | |
[259] | 2022 | Hand Occlusion | American | Static | Re-formation layer in the CNN | 91.40% | |
[260] | 2022 | Trajectory, hand shapes, and orientation | American | Isolated | Media Pipe’s Landmarks with GRU | 99% | |
[261] | 2022 | ambiguous and 3D double-hand motion trajectories | American | Isolated | 3D extended Kalman filter (EKF) tracking, and approximation of a probability density function over a time frame. | 97.98% | |
[262] | 2022 | Movement | Turkish | Continuous | Motion History Images (MHI) generated from RGB video frames | 94.83% |
[264] | 2022 | Movement | Argentine | Continuous | Propose an accumulative video motion (AVM) technique | 91.8% |
[269] | 2022 | orientation angle, prosodic, and similarity | American | Continuous | Develop a robust fast fisher vector (FFV) in a Deep Bi-LSTM | 98.33% |
[270] | 2022 | variant length, sequential patterns, | English | Isolated | Novel Residual-Multi Head model | 95.03% |
Detecting the signer's hand in a still image or tracking it in a video stream is challenging and affected by many factors discussed earlier in the preprocessing phase, such as environment, movement, hand shape, and occlusion. Hence, the careful choice of an appropriate segmentation technique is of utmost importance, as it profoundly influences the recognition of sign language and the work of the subsequent phases (feature extraction and classification). The hand segmentation identifies the beginning and end of each sign. This is necessary for accurate recognition and understanding of the signer's message. Through the process of segmenting the sign language input, the recognition system can concentrate on discerning individual signs and their respective meanings, thereby avoiding the interpretation of the entire continuous signing stream as a single sign. In addition to enhancing recognition accuracy, segmentation contributes to system efficiency and speed. By dividing the input into distinct signs, the system can process each sign independently, reducing computational complexity and improving response time. Furthermore, segmentation facilitates advancements in sign language recognition technology by enabling the creation of sign language corpora annotated with information about individual signs. Such resources are valuable for training and evaluating sign language recognition systems and conducting linguistic research on sign language structure and syntax. Various segmentation techniques are employed, including background subtraction [112], skin color detection [113], template matching [114], optical flow [115], and machine learning [116]. Table 5 presents DL for sign language recognition-related works that focus on addressing the segmentation and tracking challenges to achieve optimal system performance.
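As a hedged illustration of one of the listed techniques, the sketch below applies OpenCV's MOG2 background subtractor to a stand-in frame stream; the history and threshold values are illustrative assumptions, and a real hand segmenter would combine this cue with others surveyed in Table 5.

```python
import cv2
import numpy as np

# Background subtraction: the subtractor learns the (mostly static) background over
# time and flags moving pixels, such as hands and arms, as foreground.
subtractor = cv2.createBackgroundSubtractorMOG2(history=100, varThreshold=25)
for _ in range(30):                                   # stand-in for a video stream
    frame = np.random.randint(0, 255, (240, 320, 3), np.uint8)
    fg_mask = subtractor.apply(frame)                 # 0 = background, 255 = moving pixels
    fg_mask = cv2.medianBlur(fg_mask, 5)              # suppress speckle noise in the mask
```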
Related works on SLR using DL that address the segmentation problem.
Author(s) | Year | Input Modality | Segmentation method | Results |
---|---|---|---|---|
[131] | 2018 | RGB image | HSV color model | 99.85% |
[148] | 2018 | RGB image | Skin segmentation algorithm based on color information | 94.7% |
[149] | 2018 | RGB images | k-means-based algorithm | 94.37% |
[158] | 2019 | RGB images | Color segmentation by MLP network | 96.83% |
[159] | 2019 | Depth image | Wrist line localization by algorithm-based thresholding | 88.7% |
[164] | 2019 | RGB, and depth video | Aligned Random Sampling in Segments (ARSS) | 96.7% |
[168] | 2019 | RGB, and depth images | Depth based segmentation using data of Kinect RGB-D camera | 97.71% |
[171] | 2019 | RGB video | Design an adaptive temporal encoder to capture crucial RGB visemes and skeleton signees | 94.7% |
[179] | 2020 | RGB videos | Hand semantic Segmentation named as DeepLabv3+ | 89.59 % |
[180] | 2020 | RGB Videos | Novel method based on open pose | 87.69 % |
[182] | 2020 | RGB Videos | Viola and Jones, and human body part ratios | 84.3% |
[183] | 2020 | RGB images | Robert edge detection method | 99.3 % |
[185] | 2020 | RGB video | SSD is a feed-forward convolutional network A Non-Maximum Suppression (NMS) step is used in the final step to estimate the final detection | 98.42% |
[187] | 2020 | RGB images | Sobel edge detector, and skin color by thresholding | 98.89% |
[188] | 2020 | RGB images | Open-CV with a Region of Interest (ROI) box in the driver program | 93% |
[189] | 2020 | RGB Videos | Frame stream density compression (FSDC) algorithm | 10.73 error |
[199] | 2020 | RGB Videos | Design an attention-based encoder-decoder model to realize end-to-end continuous SLR without segmentation | 10.8% WER |
[200] | 2020 | RGB images | Single Shot Multi Box Detection (SSD) | 99.90% |
[209] | 2021 | RGB Video | Canny | 99.63% |
[216] | 2021 | RGB images | Erosion, Dilation, and Watershed Segmentation | 99.7 % |
[219] | 2021 | RGB Video | Data sliding window | 86.67% |
[236] | 2021 | RGB images | R-CNN | 93% |
[239] | 2022 | RGB videos | Novel Adaptive Hough Transform (AHT) | 98.7% |
[246] | 2022 | RGB images, and video | Grad Cam and Cam shift algorithm | 99.85% |
[248] | 2022 | Grey images | YCbCr, HSV and watershed algorithm | 99.60%, |
[249] | 2022 | RGB images | Sobel operator method | 97 % |
[263] | 2022 | RGB images | Semantic | 99.91% |
[267] | 2022 | RGB images | R-CNN | 99.7% |
[268] | 2022 | RGB video | Mask is created by extracting the maximum connected region in the foreground assuming it to be the hand+ Canny method | 99% |
The feature extraction goal is to capture the most essential information about the sign language gestures while removing any redundant or irrelevant information that may be present in the input data. The process of feature extraction offers numerous advantages in sign language recognition. It enhances accuracy by effectively representing the distinctive characteristics of each sign and gesture, thereby facilitating the system's ability to differentiate between them. Moreover, feature extraction reduces both processing time and computational complexity, as the extracted features are typically represented in a more compact and informative manner compared to raw input data. Additionally, feature extraction confers robustness against noise and variability, as features can be designed to be invariant to specific types of variations, such as changes in lighting conditions or background clutter [117,118]. This enables the recognition system to maintain its performance even in challenging and diverse environments. Table 6 shows related DL works for sign language recognition that focus on solving the feature extraction problem.
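As a small illustration of a compact, partially invariant descriptor of the kind discussed above, the sketch below applies a classic Histogram of Oriented Gradients (one of the hand-crafted techniques that also appears in Table 6) to a stand-in hand image; the parameters and input are illustrative assumptions, and the resulting vector could feed either a classical or a DL classifier instead of raw pixels.

```python
import numpy as np
from skimage.feature import hog

hand_img = np.random.rand(64, 64)                     # placeholder for a cropped grayscale hand ROI
descriptor = hog(hand_img, orientations=9,
                 pixels_per_cell=(8, 8), cells_per_block=(2, 2),
                 feature_vector=True)
print(descriptor.shape)                               # (1764,) - far smaller than 64*64 raw pixels
```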
Related works on SLR using DL that address the feature extraction problem.
Author(s) | Year | Dataset | Technique | Signing mode | Feature(s) | Result |
---|---|---|---|---|---|---|
[130] | 2018 | Collected | DCNN | static | Hand shape | 84.6% |
[135] | 2018 | Collected | 3D CNN | Isolated | spatiotemporal | 99% |
[138] | 2018 | ASL Finger Spelling | CNN | Static | depth and intensity | 82% |
[141] | 2018 | RWTH-2014 | 3D Residual Convolutional Network (3D-ResNet) | Continuous | Spatial information, and temporal connections across frames | 37.3
[143] | 2018 | Collected | 3D-CNNs | Isolated | spatiotemporal | 88.7% |
[144] | 2018 | Collected | DCNN | Isolated | hand palm sphere radius, and position of hand palm and fingertip | 88.79% |
[149] | 2018 | ASL Finger Spelling | Histograms of oriented gradients, and Zernike moments | Static | Hand shape | 94.37% |
[150] | 2018 | Collected | CNN | Continuous | Hand shape | 92.88%
[151] | 2018 | Collected | 3DRCNN | Continuous/Isolated | motion, depth, and temporal | 69.2%
[152] | 2018 | SHREC | Leap Motion Controller (LMC) sensor | Isolated, Static | finger bones of hands | 96.4%
[153] | 2018 | Collected | Hybrid Discrete Wavelet Transform, Gabor filter, and histogram of distances from Centre of Mass | Static | Hand shape | 76.25% |
[154] | 2018 | Collected | DCNN | Static | Facial expressions | 89% |
[156] | 2019 | Collected | Two-stream 3-D CNN | Continuous | Spatiotemporal | 82.7%
[158] | 2019 | Collected | CNN | Static | Hand shape | 96.83% |
[79] | 2019 | Collected | Open Pose library | Continuous | human key points (hand, face, body) | 55.2%
[159] | 2019 | ASL fingerspelling | PCA Net | Static | hand shape (corners, edges, blobs, or ridges) | 88.7%
[161] | 2019 | SIGNUM | Stacked temporal fusion layers in DCNN | Continuous | spatiotemporal | 2.80
[162] | 2019 | Collected | Leap motion device | Continuous, Isolated | 3D positions of the fingertips | 72.3%
[163] | 2019 | Collected | CNN | Static | Hand shape | 95% |
[164] | 2019 | CSL | D-shift Net | Continuous | spatial and temporal features | 96.7%
[165] | 2019 | DEVISIGN_D | B3D Res-Net | Isolated | spatiotemporal | 89.8% |
[166] | 2019 | Collected | Local and GIST Descriptor | Isolated | Spatial and scene-based features | 95.83% |
[169] | 2019 | Collected | Restricted Boltzmann Machine (RBM) | Isolated | Handshape, and network generated features | 88.2% |
[170] | 2019 | KSU-SSL | 3D-CNN | Isolated | hand shape, position, orientation, and temporal dependence in consecutive frames | 77.32% |
[171] | 2019 | Collected | C3D, and Kinect device | Continuous | Temporal, and Skeleton | 94.7%
[175] | 2019 | Collected | Open Pose library with Kinect V2 | Static | 3D skeleton | 98.9%. |
[177] | 2020 | Ishara-Lipi | Mobile Net V1 | Isolated | Two hands shape | 95.71% |
[178] | 2020 | Collected | DCNN | Static | Hand shape | 94.31%. |
[179] | 2020 | Collected | Single layer Convolutional Self-Organizing Map (CSOM) | Isolated | Hand shape | 89.59% |
[180] | 2020 | KSU-SSL | Enhanced C3D architecture | Isolated | Spatiotemporal of hand and body | 87.69 % |
[182] | 2020 | KSU-SSL | 3DCNN | Isolated | Spatiotemporal | 84.3% |
[185] | 2020 | Collected | ResNet50 model | Isolated | Hand shape, Extra Spatial hand Relation (ESHR) features, and Hand Pose (HP), temporal. | 98.42% |
[186] | 2020 | Polytropon (PGSL) | ResNet-18 | Isolated | Optical flow of skeletal, handshapes, and mouthing | 95.31% |
[187] | 2020 | Collected | Discrete cosines transform, Zernike moment, scale-invariant feature transform, and social ski driver optimization algorithm | Static | Hand shape | 98.89% |
[189] | 2020 | RWTH-2014 | Temporal convolution unit and dynamic hierarchical bidirectional GRU unit | Continuous | spatiotemporal | 10.73% BLEU
[191] | 2020 | Collected | Standard score normalization on the raw Channel State Information (CSI) acquired from the Wi-Fi device, and MIFS algorithm | Static and continuous | The cross-cumulant features (unbiased estimates of covariance, normalized skewness, normalized kurtosis) | 99.9%
[192] | 2020 | GSL | Open Pose human joint detector | Isolated | 3D hand skeletal, and region of hand, and mouth | 93.56% |
[197] | 2020 | Collected | Four channel surface electromyography (sEMG) signals | Isolated | time-frequency joint features | 93.33% |
[199] | 2020 | Collected | Euler angle, Quaternion from IMU signal | Continuous | Hand Rotation | 10.8% WER
[76] | 2020 | RKS-PERSIANSIGN | 3DCNNs | Isolated | Spatiotemporal | 99.8% |
[202] | 2020 | ASL fingerspelling A | DCNN | Static | Hand Shape | 99.96% |
[203] | 2020 | Collected | Construct a color-coded topographical descriptor from joint distances and angles, to be used in 2 streams (CNN) | Isolated | distance and angular | 93.01% |
[204] | 2020 | Collected | Two CNN models and a descriptor based on Histogram of cumulative magnitudes | Isolated | Two hands, skeleton, and body | 64.33% |
[208] | 2021 | RWTH-2014T | Semantic Focus of Interest Network with Face Highlight Module (SFoI-Net-FHM) | Isolated | Body and facial expression | 10.89 |
[210] | 2021 | Collected | (ConvLSTM) | Isolated | Spatiotemporal | 94.5% |
[212] | 2021 | Collected | ResNet50 | Static | Hand area, length of the first eigenvector axis, and hand position changes | 96.42% |
[214] | 2021 | Collected | f-CNN (fusion of 1-D CNN and 2-D CNN) | Isolated | Time- and spatial-domain features of finger resistance movement | 77.42% |
[217] | 2021 | MU | Modified Alex Net and VGG16 | Static | Hand edges and shape | 99.82% |
[222] | 2021 | Collected | VGG net of six convolutional layers | Static | Hand shape | 97.62% |
[224] | 2021 | 38 BdSL | DenseNet201, and Linear Discriminant Analysis | Static | Hand shape | 93.68% |
[225] | 2021 | KSU-ArSL | Bi-LSTM | Isolated | spatiotemporal | 84.2% |
[226] | 2021 | Collected | Paired pooling network in view pair pooling net (VPPN) | Isolated | spatiotemporal | 84.95% |
[228] | 2021 | ASLLVD | Bayesian Parallel Hidden Markov Model (BPaHMM) + stacked denoising variational autoencoders (SD-VAE) + PCA | Continuous | Shape of hand, palm, and face, along with their position, speed, and distance between them | 97% |
[230] | 2021 | ASLLVD | 3-D CNN’s cascaded | Isolated | spatiotemporal | 96.0% |
[231] | 2021 | Collected | Leap Motion controller | Static and Isolated | Sphere radius, angles between fingers, and distances between them | 91.82% |
[232] | 2021 | RWTH-2014 | (3+2+1)D ResNet | Continuous | Height, motion of hand, and frame blurriness levels | 23.30 WER |
[233] | 2021 | Montalbano II | AlexNet + Optical Flow (OF) + Scene Flow (SF) methods | Isolated | Pixel level and hand pose | 99.08% |
[234] | 2021 | RWTH-2014 | GAN | Continuous | Spatiotemporal | 23.4 WER |
[235] | 2021 | MNIST | DCNN | Static | Hand shape | 98.58% |
[236] | 2021 | Collected | R-CNN | Static | Hand shape | 93% |
[237] | 2021 | CSL-500 | Multi-scale spatiotemporal attention network (MSSTA) | Isolated | Spatiotemporal | 98.08% |
[242] | 2022 | MNIST | modified CapsNet | Static | Spatial, and orientations | 99.60% |
[243] | 2022 | RKS-PERSIANSIGN | Singular value decomposition (SVD) | Isolated | 3D hand key points between the segments of each finger, and their angles | 99.5% |
[244] | 2022 | Collected | 2DCRNN + 3DCRNN | Continuous | Spatiotemporal features of small patches | 99% |
[246] | 2022 | Collected | Atrous convolution mechanism and semantic spatial multi-cue model | Static, Isolated | Full frame, pose, face, and hand (spatial) | 99.85% |
[253] | 2022 | Collected | 4 DNN models using 2D and 3D CNN | Isolated | Spatiotemporal | 99% |
[255] | 2022 | Collected | Scale-Invariant Feature Transformation (SIFT) | Static | Corner, edges, rotation, blurring, and illumination. | 97.89% |
[256] | 2022 | Collected | InceptionResNetV2 | Isolated | Hand shape | 97% |
[257] | 2022 | Collected | AlexNet | Static | Hand shape | 94.81% |
[258] | 2022 | Collected | Sensor + mathematical equations + CNN | Continuous | Mean, magnitude of mean, variance, correlation, covariance, and frequency-domain features + spatiotemporal | 0.088 |
[260] | 2022 | Collected | MediaPipe framework | Isolated | Hands, body, and face | 99% |
[261] | 2022 | Collected | Bi-RNN network, maximal information correlation, and leap motion controller | Isolated | hand shape, orientation, position, and motion of 3D skeletal videos. | 97.98% |
[264] | 2022 | LSA64 | dynamic motion network (DMN)+ Accumulative motion network (AMN) | Isolated | spatiotemporal | 91.8% |
[265] | 2022 | CSL-500 | Spatial–temporal–channel attention (STCA) | Isolated | Spatiotemporal | 97.45% |
[268] | 2022 | Collected | SURF (Speeded Up Robust Features) | Isolated | distribution of the intensity material within the neighborhood of the interest point | 99% |
[269] | 2022 | Collected | Thresholding and Fast Fisher Vector Encoding (FFV) | Isolated | Hand, palm, finger shape, and position and 3D skeletal hand characteristics | 98.33% |
Classification is the final phase of any sign language recognition system and is applied before the sign language is transferred into another form of data, whether text or sound. In general, a particular sign is recognized by comparing it with the trained dataset, with the data categorized into the respective classes according to the feature vector obtained. Moreover, the system can calculate the probability associated with each class, allowing the data to be assigned to a class based on these probability values. Overall, classifying sign language with DL requires selecting an appropriate data representation, feature extraction techniques, classification algorithms, and evaluation metrics, and ensuring sufficient and diverse training data. These factors collectively determine the accuracy and effectiveness of the sign language classification system. However, classification can suffer from problems such as overfitting. In DL, overfitting occurs when a neural network model becomes so specialized in learning from the training data that it fails to generalize effectively to new, unseen data. In other words, the model "memorizes" the training examples instead of learning the underlying patterns or relationships. When a DL model overfits, it performs very well on the training data but struggles to accurately predict or classify new instances that it has not encountered during training [119]. Various causes and indicators of overfitting exist, including high model complexity with numerous parameters, insufficient training data, lack of regularization, excessive training epochs, and reliance on the training data for evaluation [120]. Several effective techniques can mitigate overfitting in deep models, including regularization methods [121], the incorporation of dropout layers [122], early stopping criteria [123], data augmentation strategies [124], and increasing the amount of training data [125]. These techniques help to enhance model generalization and prevent the adverse effects of overfitting. Table 7 summarizes related work on sign language recognition using DL that focuses on solving the problem of overfitting, and a brief illustrative sketch of these mitigation techniques follows the table.
Table 7: Related works on SLR using DL that address the overfitting problem.
Author(s) | Year | Dataset | Model | Technique | Result |
---|---|---|---|---|---|
[129] | 2018 | NTU | DCNN | Augmentation | 92.4% |
[130] | 2018 | Collected | Modified VGG net | Dropout | 84.68% |
[132] | 2018 | Ishara-Lipi | DCNN | Dropout | 94.88% |
[133] | 2018 | Collected | DCNN | small convolutional filter sizes, Dropout, and learning strategy | 85.3% |
[136] | 2018 | HUST | Deep Attention Network (DAN) | data augmentation | 73.4% |
[142] | 2018 | ASL Finger Spelling A | DNN | Dense Net | 90.3% |
[143] | 2018 | Collected | 3DCNN | SGD | 88.7% |
[146] | 2018 | SIGNUM | CNN-HMM hybrid | Augmentation | 7.4 error |
[157] | 2019 | Collected | DCNN | Augmentation | 93.667% |
[79] | 2019 | Collected | ResNet-152 | batch size, Augmentation | 55.28% |
[163] | 2019 | Collected | VGG-16 | Dropout | 95% |
[166] | 2019 | Collected | DCNN | Augmentation | 95.83% |
[167] | 2019 | Collected | DCNN | Dense Net | 90.3% |
[171] | 2019 | Collected | LSTM | Increase hidden state number | 94.7% |
[172] | 2019 | NVIDIA | Squeeze-net | Augmentation | 83.29% |
[173] | 2019 | G3D | Four-stream CNN | Sharing of multimodal features with RGB spatial features during training, and dropout | 86.87% |
[175] | 2019 | Collected | DCNN | Augmentation | 98.9% |
[176] | 2020 | Collected | DCNN | Pooling layer | 90% |
[181] | 2020 | Collected | DCNN | Epochs reduced to 30, and dropout added after each max-pooling layer | 97.6% |
[184] | 2020 | Collected | CNN with 8 layers | Augmentation | 89.32% |
[188] | 2020 | MNIST | CNN | Dropout | 93% |
[190] | 2020 | Collected | Enhanced Alex Net | Augmentation | 89.48% |
[191] | 2020 | Collected | SVM | Augmentation, and k-fold cross validation | 99.9% |
[193] | 2020 | KETI | CNN+LSTM | New data augmentation | 96.2% |
[194] | 2020 | Collected | VGG16, and ResNet152 with enhanced softmax layer | Augmentation | 99% |
[196] | 2020 | Collected | RNN-LSTM | dropout layer (DR) | 99.81% |
[201] | 2020 | Collected | CNN | dropout layer, and augmentation | 95% |
[203] | 2020 | NTU | 2 stream CNN | randomness in the features interlocking fusion with dropout | 93.01% |
[207] | 2021 | Jochen-Triesch’s | DCNN | two dropouts | 99.96% |
[214] | 2021 | Collected | Generic temporal convolutional network (TCN) | Dropout | 77.42% |
[215] | 2021 | Collected | DCNN | Dropout | 96.65% |
[216] | 2021 | Collected | DCNN | Cyclical learning rate method | 99.7% |
[217] | 2021 | MU | Modified AlexNet and VGG16 | Augmentation | 99.82% |
[222] | 2021 | Collected | CNN | Dropout | 97.62% |
[229] | 2021 | Collected | 3DCNN | Dropout & Regularization | 88.24% |
[236] | 2021 | Collected | ResNet-18 | Zero-patience stopping criteria | 93.4% |
[238] | 2021 | Collected | DCNN | Synthetic Minority Oversampling Technique (SMOTE) | 97% |
[240] | 2022 | Collected | DCNN | Augmentation | 99.67% |
[253] | 2022 | Collected | ResNet50-BiLSTM | Augmentation | 99% |
[256] | 2022 | Collected | LSTM, and GRU | Dropout | 97% |
[263] | 2022 | BdSL | CNN | Augmentation | 99.91% |
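To make the mitigation techniques summarized in Table 7 concrete, the following is a minimal sketch, assuming PyTorch and a placeholder static-sign dataset (26 classes, 64×64 RGB frames), of how data augmentation, dropout, weight decay (L2 regularization), and early stopping might be combined; it is illustrative only and does not reproduce any specific surveyed system.

```python
# Minimal sketch of overfitting countermeasures (PyTorch assumed):
# augmentation, dropout, weight decay, and early stopping.
import torch
import torch.nn as nn
from torchvision import transforms

NUM_CLASSES = 26  # placeholder, e.g. a static fingerspelling alphabet

# Data augmentation enlarges the effective training set
# (passed to the training Dataset in a full pipeline).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(64, scale=(0.8, 1.0)),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# A small CNN with dropout layers to reduce co-adaptation of features.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Dropout(0.25),
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 128), nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(128, NUM_CLASSES),
)

# Weight decay acts as L2 regularization on the parameters.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

def train_with_early_stopping(train_loader, val_loader, max_epochs=50, patience=5):
    """Stop training when validation loss has not improved for `patience` epochs."""
    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        # Validation pass decides whether to keep training.
        model.eval()
        val_loss, batches = 0.0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                val_loss += criterion(model(images), labels).item()
                batches += 1
        val_loss /= max(batches, 1)
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping
```

In practice, the dropout rates, augmentation strength, and patience value would be tuned on a validation split drawn either from the same signers or from held-out signers, depending on the evaluation protocol.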
Another critical issue that must be considered when designing a deep model for sign language recognition is generalization, which refers to the capability of a model to operate accurately on unseen data that is distinct from the training data. A model demonstrates a high degree of generalization by consistently achieving strong performance across a wide range of diverse and distinct datasets [126]. Consistent results across different datasets are an important characteristic of a robust and reliable model and demonstrate that it can be applied effectively to various real-world scenarios. Since datasets can differ in their characteristics, biases, and noise levels, it is crucial to carefully evaluate and validate the model's performance on each specific dataset to ensure its reliability and generalization ability [127]. Table 8 presents relevant works in sign language recognition using DL that focus on the model's generalization ability by evaluating its performance on diverse datasets; a brief illustrative sketch of such cross-dataset evaluation follows the table.
Table 8: Related works on SLR using DL that aim to achieve generalization.
Author(s) | Year | Datasets | Technique | Result |
---|---|---|---|---|
[129] | 2018 | ASL finger spelling A | DCNN | 92.4% |
[134] | 2018 | NYU | Restricted Boltzmann Machine (RBM) | 90.01% |
[136] | 2018 | NTU | DAN | 98.5% |
[143] | 2018 | Collected CSL | 3D-CNN | 88.7% |
[145] | 2018 | Collected | JADM+CNN | 88.59% |
[146] | 2018 | RWTH 2012 | CNN-HMM hybrid | 30.0 WER |
[156] | 2019 | Collected | Hierarchical Attention Network (HAN) + Latent Space (LS-HAN) | 82.7% |
[161] | 2019 | RWTH-2014 | DCNN | 22.86 WER |
[164] | 2019 | CSL | Proposed multimodal two-stream CNN | 96.7% |
[165] | 2019 | DEVISIGN-D | Deep 3-D Residual ConvNet + BiLSTM | 89.8% |
[170] | 2019 | KSU-SSL | 3D-CNN | 77.32% |
[173] | 2019 | Collected RGB-D | Four-stream CNN | 86.87% |
[174] | 2019 | Jochen-Triesch | Novel DNN | 97.29% |
[182] | 2020 | KSU-SSL | 3DCNN | 84.38% |
[186] | 2020 | PGSL | DCNN | 95.31% |
[187] | 2020 | ASL | Deep Elman recurrent neural network | 98.89% |
[192] | 2020 | GSL | CNN | 93.56% |
[76] | 2020 | NYU | CNN | 4.64 error |
[202] | 2020 | NUS | DCNN | 94.7% |
[203] | 2020 | HDM05 | 2-stream CNN | 93.42% |
[204] | 2020 | UTD-MHAD | Linear SVM classifier | 94.81% |
[207] | 2021 | Collected RGB images | DCNN | 99.96% |
[210] | 2021 | LSA64 | 3DCNN | 98.5% |
[211] | 2021 | ASLG-PC12 | GRU and LSTM with Bahdanau and Luong attention mechanisms | 66.59% |
[221] | 2021 | ASL alphabet, ASL MNIST, MSL | Optimized CNN based on PSO | 99.58% |
[225] | 2021 | KSU-ArSL | Inception-BiLSTM | 84.2% |
[226] | 2021 | Collected | Motion modelled deep attention network (M2DA-Net) | 84.95% |
[228] | 2021 | RWTH-2014 | Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs) | 73.9% |
[232] | 2021 | RWTH-2014 | Bidirectional Encoder Representations from Transformers (BERT) + ResNet | 20.1 |
[233] | 2021 | Montalbano II | LSTM+CNN | 99.08% |
[234] | 2021 | RWTH-2014 | GAN | 23.4 WER |
[237] | 2021 | CSL-500 | Dual Network upon a Graph Convolutional Network (GCN) | 98.08% |
[242] | 2022 | SLDD | Modified CapsNet architecture (SLR-CapsNet) | 99.52% |
[243] | 2022 | RKS-PERSIANSIGN | Single-shot detector, 2D CNN, singular value decomposition (SVD), and LSTM | 99.5% |
[247] | 2022 | Collected | DCNN + diffGrad optimizer | 92.43% |
[248] | 2022 | 38 BdSL | BenSignNet | 94.00% |
[251] | 2022 | Collected | DCNN | 99.41% |
[254] | 2022 | Collected | Hybrid model based on VGG16-BiLSTM | 83.36% |
[255] | 2022 | Collected | Hybrid Fist CNN | 97.89% |
[256] | 2022 | ASL | LSTM+GRU | 95.3% |
[261] | 2022 | Collected | DLSTM | 97.98% |
[262] | 2022 | AUTSL | 3D-CNN | 93.53% |
[265] | 2022 | CSL-500 | Deep R(2+1)D | 97.45% |
[266] | 2022 | MU | End-to-end fine-tuning of a pre-trained CNN model with score-level fusion | 98.14% |
[269] | 2022 | SHREC | FFV-Bi-LSTM | 92.99% |
The choice of DL layers significantly influences the classification model's performance, as it determines the model's architecture and its ability to learn and represent intricate patterns in the input data. Selecting the right layers requires a comprehensive understanding of the data's characteristics, the problem's complexity, and the resources available for training and inference. It often necessitates experimentation, tuning, and domain expertise to discover the combination of layers that maximizes classification performance for a particular task [128]. In sign language recognition, numerous authors have designed and utilized deep models to achieve the desired performance levels, as depicted in Table 9; a small illustrative sketch of one common layer combination follows the table.
Table 9: Classifiers employed in related SLR works using DL.
Author(s) | Year | Input modality | Classifier | Result |
---|---|---|---|---|
[129] | 2018 | Static | DCNN | 92.4% |
[131] | 2018 | Static | DCNN | 99.85% |
[133] | 2018 | Static | DCNN | 85.3% |
[134] | 2018 | Static | Restricted Boltzmann machine | 98.13% |
[135] | 2018 | Isolated | LRCNs and 3D CNNs | 99% |
[136] | 2018 | Static | DAN | 73.4% |
[137] | 2018 | Static | (CNNs) of variant depth sizes and stacked denoising autoencoders | 92.83% |
[139] | 2018 | Static | DCNN | 82.5% |
[142] | 2018 | Static | DCNN | 90.3% |
[145] | 2018 | Isolated | DCNN | 88.59% |
[146] | 2018 | Continuous | CNN-HMM hybrid | 7.4 error |
[147] | 2018 | Static | DCNN | 98.05% |
[151] | 2018 | Isolated | 3DCNN and enhanced fully connected (FCRNN) | 69.2% |
[155] | 2019 | Continuous | Deep capsule networks and game theory | 92.50% |
[156] | 2019 | Continuous | Hierarchical Attention Network (HAN) and Latent Space | 82.7% |
[157] | 2019 | Static | DCNN | 93.667% |
[160] | 2019 | Static | DCNN | 97% |
[161] | 2019 | Continuous | DCNN | 2.80 WER |
[162] | 2019 | Continuous, Isolated | Modified LSTM | 72.3% |
[167] | 2019 | Isolated | DCNN based on DenseNet | 90.3% |
[168] | 2019 | Static | DCNN | 97.71% |
[176] | 2020 | Static | DCNN | 90% |
[181] | 2020 | Static | DCNN | 97.6% |
[184] | 2020 | Static | Eight CNN layers + stochastic pooling, batch normalization, and dropout | 89.32% |
[185] | 2020 | Isolated | Cascaded model (SSD, CNN, LSTM) | 98.42% |
[187] | 2020 | Static | Deep Elman recurrent neural network | 98.89% |
[188] | 2020 | Static | DCNN | 93% |
[190] | 2020 | Static | Enhanced Alex Net | 89.48% |
[198] | 2020 | Static | Multimodality fine-tuned VGG16 CNN + Leap Motion network | 82.55% |
[199] | 2020 | Continuous | Multi-channel CNN | 10.8 WER |
[200] | 2020 | Static | Hybrid model based on Inception v3 + SVM | 99.90% |
[201] | 2020 | Static | 11-layer CNN | 95% |
[205] | 2021 | Static | Three-layered CNN model | 90.8% |
[206] | 2021 | Isolated | Hybrid deep learning model with convolutional LSTM and BiLSTM | 76.21% |
[209] | 2021 | Isolated | DCNN + sentiment analysis | 99.63% |
[211] | 2021 | Continuous | GRU+LSTM | 19.56 |
[214] | 2021 | Isolated | Generic temporal convolutional network | 77.42% |
[215] | 2021 | Static | DCNN | 96.65% |
[216] | 2021 | Static | DCNN | 99.7% |
[220] | 2021 | Static | Pretrained InceptionV3 + mini-batch gradient descent optimizer | 85% |
[221] | 2021 | Static | PSO algorithm applied to find the optimal parameters of the CNN | 99.58% |
[223] | 2021 | Continuous | Visual hierarchy to lexical sequence alignment network (H2SNet) | 91.72% |
[227] | 2021 | Static | Novel lightweight deep learning model based on bottleneck motivated from deep residual learning | 99.52% |
[228] | 2021 | Continuous | Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs) | 97% |
[229] | 2021 | Isolated | 3DCNN | 88.24% |
[232] | 2021 | Continuous | Bidirectional Encoder Representations from Transformers (BERT) + ResNet | 23.30 WER |
[234] | 2021 | Continuous | Generative Adversarial Network (SLRGAN) | 23.4 WER |
[238] | 2021 | Static | DCNN | 97% |
[239] | 2022 | Static | DCNN optimized by a hybrid of Electric Fish Optimization (EFO) and the Whale Optimization Algorithm (WOA), called the Electric Fish-based Whale Optimization Algorithm (E-WOA) | 98.7% |
[241] | 2022 | Isolated | CNN + RNN | 98.8% |
[242] | 2022 | Static | Modified CapsNet architecture (SLR-CapsNet) | 99.60% |
[245] | 2022 | Static | DCNN | 99.52% |
[247] | 2022 | Static | DCNN+ diffGrad optimizer | 88.01% |
[250] | 2022 | Static | DCNN | 92% |
[251] | 2022 | Static | DCNN | 99.38% |
[252] | 2022 | Static | Lightweight CNN | 94.30% |
[254] | 2022 | Isolated | Hybrid model based on VGG16-BiLSTM | 83.36% |
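To illustrate how the layer choice follows the input modality in Table 9, the sketch below (PyTorch assumed) combines convolutional layers for per-frame spatial structure with an LSTM for the temporal dimension of isolated or continuous signs; the frame size, hidden size, and class count are placeholder values rather than parameters taken from any cited work.

```python
# Minimal CNN+LSTM sketch for video-based sign classification (PyTorch assumed).
import torch
import torch.nn as nn

class CNNLSTMSignClassifier(nn.Module):
    def __init__(self, num_classes=100, hidden_size=256):
        super().__init__()
        # Per-frame spatial feature extractor (shared across time steps).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (batch*frames, 64)
        )
        # Temporal modelling over the sequence of frame features.
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, video):                  # video: (batch, frames, 3, H, W)
        b, t = video.shape[:2]
        frame_features = self.cnn(video.flatten(0, 1)).view(b, t, -1)
        _, (hidden, _) = self.lstm(frame_features)
        return self.classifier(hidden[-1])     # class scores per video clip

# Example: a batch of 2 clips, 16 frames each, 64x64 RGB frames.
model = CNNLSTMSignClassifier()
scores = model(torch.randn(2, 16, 3, 64, 64))  # -> shape (2, 100)
```

For static signs, the recurrent layer would be dropped entirely and the per-frame CNN features fed straight into the classifier, which mirrors the Static versus Isolated/Continuous split seen in the table.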
In real-world classification scenarios using DL, time and delay are principal factors to consider. It is important to strike a balance between achieving accurate classification results and minimizing the time required. The specific requirements and constraints of the application, such as the desired response time or the available computational resources, should be considered when designing and deploying DL models. As a result, one of the major requirements for an efficient sign language recognition system is the recognition time. Table 10 lists the related DL works for sign language recognition that focus on improving the recognition time, followed by a brief illustrative sketch of how inference latency can be measured.
Table 10: Related works on SLR using DL that aim to minimize the required time.
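To make the latency consideration concrete, the following minimal sketch (PyTorch assumed, with a placeholder model and input size) shows one simple way the average per-sample inference time of a recognizer might be measured; it is not drawn from any of the works summarized in Table 10.

```python
# Minimal inference-latency measurement sketch (PyTorch assumed).
import time
import torch
import torch.nn as nn

model = nn.Sequential(                      # placeholder recognizer
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 26),
)
model.eval()

def mean_inference_time(model, input_shape=(1, 3, 64, 64), warmup=5, runs=50):
    """Average forward-pass time in milliseconds for one input sample."""
    sample = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(warmup):             # warm-up passes are excluded from timing
            model(sample)
        start = time.perf_counter()
        for _ in range(runs):
            model(sample)
        elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs

print(f"Mean inference time: {mean_inference_time(model):.2f} ms per sample")
```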
Designing systems for recognizing sign language has become an emerging need in society and has attracted the attention of academics and practitioners, owing to its significant role in eliminating the communication barriers between the hearing and deaf communities. However, many challenges arise when designing a sign language recognition system, such as dynamic gestures, environmental conditions, the availability of public datasets, and multidimensional feature vectors. Still, many researchers are attempting to develop accurate, generalized, reliable, and robust sign language recognition models using deep learning. Deep learning technology is widely applied in many fields and research areas, such as speech recognition, image processing, graphs, medicine, and computer vision. With the emergence of DL approaches, sign language recognition has managed to significantly improve its accuracy. From the preceding tables, which illustrate some promising related works on sign language recognition using DL architectures, it is evident that the most widely utilized deep architecture is the CNN. Convolutional Neural Networks (CNNs) exhibit a remarkable capacity to extract discriminative features from raw data, enabling them to achieve impressive results in several types of sign language recognition tasks. They demonstrate robustness and flexibility, being employed either independently or in combination with other architectures, such as Long Short-Term Memory (LSTM), to enhance performance in sign language recognition. Moreover, CNNs prove to be highly advantageous in handling multi-modality data, such as RGB-D data, skeleton information, and finger points. These modalities provide rich information about the signer's actions, and their utilization has been instrumental in addressing multiple challenges in sign language recognition. A set of related works focuses on solving only one type of problem facing sign language recognition using DL, such as [132, 137, 139, 141, 147, 148, 152, 153, 154, 160, 169, 177, 195, 198, 205, 208, 212, 218, 220, 231, 235, 244, 247, 250, 252, 257, 258, 266], while others try to solve multiple problems, such as [185, 199]. The most widely used features are spatiotemporal ones, which depend on the hand shape and the location information of the hand [135, 143, 156, 161, 165, 180, 182, 189, 76, 210, 225, 226, 230, 234, 237, 244, 253, 264, 265]. However, some works make use of additional types of features beyond the spatiotemporal ones, such as facial expression, skeleton, hand orientation, and angles [138, 141, 144, 151, 152, 79, 159, 162, 164, 166, 170, 171, 175, 185, 186, 191, 192, 197, 199, 203, 204, 208, 212, 228, 231, 232, 233, 245, 246, 255, 258, 260, 261, 268, 269]. Some works apply separate feature extraction techniques, rather than depending only on the features extracted by the DL model, and still obtain strong recognition results [149, 152, 153, 79, 159, 162, 166, 169, 171, 175, 177, 179, 187, 189, 191, 192, 197, 199, 203, 204, 208, 228, 231, 233, 235, 237, 245, 246, 255, 258, 260, 261, 265, 268, 269]. Recent works, especially from 2020 onwards, focus on developing recognition systems for continuous sentences in sign language, which remains an open problem that attracts the most attention and has not yet been completely solved or employed in any commercial application.
Two factors that may contribute to improved accuracy in continuous sign language recognition are feature extraction from the frame sequences of the input video and the alignment between the features of every segment in the video and its corresponding sign label. Acquiring features from video frames that are more descriptive and discriminative results in better performance. While recent models in continuous sign language recognition show an upward trend in performance by exploiting DL capabilities in computer vision and Natural Language Processing (NLP), there is still much room for improvement in this area. Among the main problems that many researchers deal with are trajectory [186, 192, 204, 210, 260] and occlusion [129, 173, 76, 226, 228, 233, 237, 239, 259]. Furthermore, selecting or designing an appropriate deep model is one of the main challenges; a variety of research has addressed it in order to handle particular types of challenges in sign language recognition and reach the desired accuracy. Other works focus on solving classification problems such as overfitting, which can lead to the failure of the system. Applying a recognition system to more than one dataset with different properties (high generalization) is significant and is one of the major factors that make the system highly effective. Thus, many researchers implement their sign language recognition systems on more than one dataset with considerable variation, although they do not achieve the same results across datasets, as in [129, 136, 143, 146, 156, 161, 164, 170, 182, 186, 204, 228, 234, 237, 254, 266]. Consequently, based on the information gathered from the preceding tables, deep learning stands out as a potent approach that has achieved the most impressive outcomes in sign language recognition. However, it is important to note that no existing research has comprehensively tackled all the associated challenges. Some studies prioritize achieving high accuracy without considering time constraints, while others concentrate on addressing feature extraction issues and functioning in various environmental conditions, yet the complexity and overall applicability of the model are rarely considered. In addition, a significant aspect not extensively discussed in the related works pertains to hardware cost and complexity, both of which exert a substantial impact on the efficiency of the recognition system, particularly in real-world applications.
Sign language recognition has come a long way from its beginnings in recognizing alphabets and digits to handling words and sentences. Recent sign language recognition systems achieve a degree of success in dealing with dynamic signs based on hand and body motion, obtained from vision or from hardware devices. The use of DL for sign language recognition has lifted system performance to a higher level and confirmed its effectiveness in recognizing signs of different forms, including letters, words, and sentences, captured using different devices, and in converting them into another form such as text or sound. In this paper, related works on the use of DL for sign language recognition from 2018 to 2022 have been reviewed, and the conclusion reached is that DL attains the desired performance, with strong results in many respects. Nevertheless, there remains room for further improvement to develop a comprehensive system capable of effectively handling all the challenges encountered in sign language recognition. The goal is to achieve accurate and rapid results across various environmental conditions while utilizing diverse datasets. As future work, the primary objective is to address the issue of generalization and to minimize the time needed for sign language recognition. Our aim is to present a deep learning model that can provide precise and highly accurate recognition outcomes for various types of sign language, encompassing both static and dynamic signs in different languages, including English, Arabic, Malay, and Chinese. Notably, this model aims to achieve these outcomes while minimizing hardware expenses and the required training time while maintaining high recognition accuracy.