Deep Learning for Sign Language Recognition: A Comparative Review
15 Jun 2024
About this article
Article category: Article
Published: 15 Jun 2024
Page range: 77 - 116
Received: 27 May 2024
Accepted: 05 Jun 2024
DOI: https://doi.org/10.2478/jsiot-2024-0006
© 2023 Shahad Thamear Abd Al-Latief et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.

Classifiers employed in related works on SLR using DL
Author | Year | Input modality | Classifier | Result |
---|---|---|---|---|
[…] | 2018 | Static | DCNN | 92.4% |
[…] | 2018 | Static | DCNN | 99.85% |
[…] | 2018 | Static | DCNN | 85.3% |
[…] | 2018 | Static | Restricted Boltzmann machine | 98.13% |
[…] | 2018 | Isolated | LRCNs and 3D CNNs | 99% |
[…] | 2018 | Static | DAN | 73.4% |
[…] | 2018 | Static | CNNs of variant depth sizes and stacked denoising autoencoders | 92.83% |
[…] | 2018 | Static | DCNN | 82.5% |
[…] | 2018 | Static | DCNN | 90.3% |
[…] | 2018 | Isolated | DCNN | 88.59% |
[…] | 2018 | Continuous | CNN-HMM hybrid | 7.4 error |
[…] | 2018 | Static | DCNN | 98.05% |
[…] | 2018 | Isolated | 3DCNN and enhanced fully connected (FCRNN) | 69.2% |
[…] | 2019 | Continuous | Deep capsule networks and game theory | 92.50% |
[…] | 2019 | Continuous | Hierarchical Attention Network (HAN) and Latent Space | 82.7% |
[…] | 2019 | Static | DCNN | 93.667% |
[…] | 2019 | Static | DCNN | 97% |
[…] | 2019 | Continuous | DCNN | 2.80 WER |
[…] | 2019 | Continuous, isolated | Modified LSTM | 72.3% |
[…] | 2019 | Isolated | DenseNet-based DCNN | 90.3% |
[…] | 2019 | Static | DCNN | 97.71% |
[…] | 2020 | Static | DCNN | 90% |
[…] | 2020 | Static | DCNN | 97.6% |
[…] | 2020 | Static | Eight CNN layers + stochastic pooling, batch normalization and dropout | 89.32% |
[…] | 2020 | Isolated | Cascaded model (SSD, CNN, LSTM) | 98.42% |
[…] | 2020 | Static | Deep Elman recurrent neural network | 98.89% |
[…] | 2020 | Static | DCNN | 93% |
[…] | 2020 | Static | Enhanced AlexNet | 89.48% |
[…] | 2020 | Static | Multimodality fine-tuned VGG16 CNN + Leap Motion network | 82.55% |
[…] | 2020 | Continuous | Multi-channel CNN | 10.8 WER |
[…] | 2020 | Static | Hybrid model based on Inception v3 + SVM | 99.90% |
[…] | 2020 | Static | 11-layer CNN | 95% |
[…] | 2021 | Static | Three-layered CNN model | 90.8% |
[…] | 2021 | Isolated | Hybrid deep learning with convolutional LSTM + BiLSTM | 76.21% |
[…] | 2021 | Isolated | DCNN + sentiment analysis | 99.63% |
[…] | 2021 | Continuous | GRU + LSTM | 19.56 |
[…] | 2021 | Isolated | Generic temporal convolutional network | 77.42% |
[…] | 2021 | Static | DCNN | 96.65% |
[…] | 2021 | Static | DCNN | 99.7% |
[…] | 2021 | Static | Pretrained InceptionV3 + mini-batch gradient descent optimizer | 85% |
[…] | 2021 | Static | CNN with parameters tuned by the PSO algorithm | 99.58% |
[…] | 2021 | Continuous | Visual hierarchy to lexical sequence alignment network (H2SNet) | 91.72% |
[…] | 2021 | Static | Novel lightweight deep learning model based on a bottleneck motivated by deep residual learning | 99.52% |
[…] | 2021 | Continuous | Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs) | 97% |
[…] | 2021 | Isolated | 3DCNN | 88.24% |
[…] | 2021 | Continuous | Bidirectional encoder representations from transformers (BERT) + ResNet | 23.30 WER |
[…] | 2021 | Continuous | Generative Adversarial Network (SLRGAN) | 23.4 WER |
[…] | 2021 | Static | DCNN | 97% |
[…] | 2022 | Static | Optimized DCNN using a hybrid of Electric Fish Optimization (EFO) and Whale Optimization Algorithm (WOA), called Electric Fish based Whale Optimization Algorithm (E-WOA) | 98.7% |
[…] | 2022 | Isolated | CNN + RNN | 98.8% |
[…] | 2022 | Static | Modified CapsNet architecture (SLR-CapsNet) | 99.60% |
[…] | 2022 | Static | DCNN | 99.52% |
[…] | 2022 | Static | DCNN + diffGrad optimizer | 88.01% |
[…] | 2022 | Static | DCNN | 92% |
[…] | 2022 | Static | DCNN | 99.38% |
[…] | 2022 | Static | Lightweight CNN | 94.30% |
[…] | 2022 | Isolated | Hybrid model based on VGG16-BiLSTM | 83.36% |

Related works on SLR using DL that address the overfitting problem
Author(s) | Year | Dataset | Model | Technique | Result |
---|---|---|---|---|---|
[…] | 2018 | NTU | DCNN | Augmentation | 92.4% |
[…] | 2018 | Collected | Modified VGG net | Dropout | 84.68% |
[…] | 2018 | Ishara-Lipi | DCNN | Dropout | 94.88% |
[…] | 2018 | Collected | DCNN | Small convolutional filter sizes, dropout, and learning strategy | 85.3% |
[…] | 2018 | HUST | Deep Attention Network (DAN) | Data augmentation | 73.4% |
[…] | 2018 | ASL Finger Spelling A | DNN | DenseNet | 90.3% |
[…] | 2018 | Collected | 3DCNN | SGD | 88.7% |
[…] | 2018 | SIGNUM | CNN-HMM hybrid | Augmentation | 7.4 error |
[…] | 2019 | Collected | DCNN | Augmentation | 93.667% |
[…] | 2019 | Collected | ResNet-152 | Batch size, augmentation | 55.28% |
[…] | 2019 | Collected | VGG-16 | Dropout | 95% |
[…] | 2019 | Collected | DCNN | Augmentation | 95.83% |
[…] | 2019 | Collected | DCNN | DenseNet | 90.3% |
[…] | 2019 | Collected | LSTM | Increased number of hidden states | 94.7% |
[…] | 2019 | NVIDIA | SqueezeNet | Augmentation | 83.29% |
[…] | 2019 | G3D | Four stream CNN | Sharing multimodal features with RGB spatial features during training, and dropout | 86.87% |
[…] | 2019 | Collected | DCNN | Augmentation | 98.9% |
[…] | 2020 | Collected | DCNN | Pooling layer | 90% |
[…] | 2020 | Collected | DCNN | Epochs reduced to 30, and dropout added after each max-pooling layer | 97.6% |
[…] | 2020 | Collected | CNN with 8 layers | Augmentation | 89.32% |
[…] | 2020 | MNIST | CNN | Dropout | 93% |
[…] | 2020 | Collected | Enhanced AlexNet | Augmentation | 89.48% |
[…] | 2020 | Collected | SVM | Augmentation and k-fold cross-validation | 99.9% |
[…] | 2020 | KETI | CNN + LSTM | New data augmentation | 96.2% |
[…] | 2020 | Collected | VGG16 and ResNet152 with enhanced softmax layer | Augmentation | 99% |
[…] | 2020 | Collected | RNN-LSTM | Dropout layer (DR) | 99.81% |
[…] | 2020 | Collected | CNN | Dropout layer and augmentation | 95% |
[…] | 2020 | NTU | Two-stream CNN | Randomness in the features interlocking fusion with dropout | 93.01% |
[…] | 2021 | Jochen-Triesch | DCNN | Two dropouts | 99.96% |
[…] | 2021 | Collected | Generic temporal convolutional network (TCN) | Dropout | 77.42% |
[…] | 2021 | Collected | DCNN | Dropout | 96.65% |
[…] | 2021 | Collected | DCNN | Cyclical learning rate method | 99.7% |
[…] | 2021 | MU | Modified AlexNet and VGG16 | Augmentation | 99.82% |
[…] | 2021 | Collected | CNN | Dropout | 97.62% |
[…] | 2021 | Collected | 3DCNN | Dropout and regularization | 88.24% |
[…] | 2021 | Collected | ResNet-18 | Zero-patience stopping criteria | 93.4% |
[…] | 2021 | Collected | DCNN | Synthetic Minority Oversampling Technique (SMOTE) | 97% |
[…] | 2022 | Collected | DCNN | Augmentation | 99.67% |
[…] | 2022 | Collected | ResNet50-BiLSTM | Augmentation | 99% |
[…] | 2022 | Collected | LSTM and GRU | Dropout | 97% |
[…] | 2022 | BdSL | CNN | Augmentation | 99.91% |

Public sign language datasets
Dataset | Language | Equipment | Modalities | Signers | Samples |
---|---|---|---|---|---|
ASL alphabets | American | Webcam | RGB images | - | 87,000 |
MNIST | American | Webcam | Grey images | - | 27,455 |
ASL Fingerspelling A | American | Microsoft Kinect | RGB and depth images | 5 | 48,000 |
NYU | American | Kinect | RGB and depth images | 36 | 81,009 |
ASL by Surrey | American | Kinect | RGB and depth images | 23 | 130,000 |
Jochen-Triesch | American | Cam | Grey images with different backgrounds | 24 | 720 |
MKLM | American | Leap Motion device and a Kinect sensor | RGB and depth images | 14 | 1,400 |
NTU-HD | American | Kinect sensor | RGB and depth images | 10 | 1,000 |
HUST | American | Microsoft Kinect | RGB and depth images | 10 | 10,880 |
RVL-SLLL | American | Cam | RGB video | 14 | |
ChicagoFSWild | American | Collected online from YouTube | RGB video | 160 | 7,304 |
ASLG-PC12 | American | Cam | RGB video | - | 880 |
American Sign Language Lexicon Video (ASLLVD) | American | Cam | RGB videos of different angles | 6 | 3,300 |
MU | American | Cam | RGB images with illumination variations in five different angles | 5 | 2,515 |
ASLID | American | Webcam | RGB images | 6 | 809 |
KSU-SSL | Arabic | Cam and Kinect | RGB videos in an uncontrolled environment | 40 | 16,000 |
KArSL | Arabic | Kinect V2 | RGB video | 3 | 75,300 |
ArSL by University of Sharjah | Arabic | Analog camcorder | RGB images | 3 | 3,450 |
JTD | Indian | Webcam | RGB images with 3 different backgrounds | 24 | 720 |
IISL2020 | Indian | Webcam | RGB video in an uncontrolled environment | 16 | 12,100 |
RWTH-PHOENIX-Weather 2014 | German | Webcam | RGB video | 9 | 8,257 |
SIGNUM | German | Cam | RGB video | 25 | 33,210 |
DEVISIGN-D | Chinese | Cam | RGB videos | 8 | 6,000 |
DEVISIGN-L | Chinese | Cam | RGB videos | 8 | 24,000 |
CSL-500 | Chinese | Cam | RGB, depth and skeleton videos | 50 | 25,000 |
Chinese Sign Language | Chinese | Kinect | RGB, depth and skeleton videos | 50 | 125,000 |
38 BdSL | Bengali | Cam | RGB images | 320 | 12,160 |
Ishara-Lipi | Bengali | Cam | Greyscale images | - | 1,800 |
ChaLearn14 | Italian | Kinect | RGB and depth video | 940 | 940 |
Montalbano II | Italian | Kinect | RGB and depth video | 20 | 940 |
UFOP–LIBRAS | Brazilian | Kinect | RGB, depth and skeleton videos | 5 | 2,800 |
AUTSL | Turkish | Kinect v2 | RGB, depth and skeleton videos | 43 | 38,336 |
RKS-PERSIANSIGN | Persian | Cam | RGB video | 10 | 10,000 |
LSA64 | Argentine | Cam | RGB video | 10 | 3,200 |
Polytropon (PGSL) | Greek | Cam | RGB video | 6 | 840 |
KETI | Korean | Cam | RGB video | 40 | 14,672 |

Public gesture datasets
Name | Modality | Device | Signers | Samples |
---|---|---|---|---|
LMDHG | RGB and depth videos | Kinect and | 21 | 608 |
Shape Retrieval Contest (SHREC) | RGB and depth videos | Intel RealSense short range depth camera | 28 | 2,800 |
UTD–MHAD | RGB, depth and skeleton videos | Kinect and wearable inertial sensor | 8 | 861 |
The Multicamera Human Action Video Data (MuHAVi) | RGB video | 8 camera views | 14 | 1,904 |
NUMA | RGB, depth and skeleton videos | 10 Kinect with three different views | 10 | 1,493 |
WEIZMANN | Low resolution RGB video | Camera with 10 different viewpoints | 9 | 90 |
NTU RGB | RGB, depth and skeleton videos | Kinect | 40 | 56,880 |
Cambridge hand gesture | RGB video captured under five different illuminations | Cam | 9 | 900 |
VIVA | RGB and depth videos | Kinect | 8 | 885 |
MSR | RGB and depth videos | Kinect | 10 | 320 |
CAD-60 | RGB and depth video in different environments, such as a kitchen, a living room, and an office | Kinect | 4 | 48 |
HDM05MoCap (motion capture) | RGB video | Cam | 5 | 2,337 |
CMU | RGB images | Cam | 25 | 204 |
isoGD | RGB and depth videos | Kinect | 21 | 47,933 |
NVIDIA | RGB and depth video | Kinect | 8 | 885 |
G3D | RGB and depth video | Kinect | 16 | 1,280 |
UT Kinect | RGB and depth video | Kinect | 10 | 200 |
First-Person | RGB and depth video | RealSense SR300 cam | 6 | 1,175 |
Jester | RGB | Cam | 25 | 148,092 |
EgoGesture | RGB and depth video | Kinect | 50 | 2,081 |
NUS II | RGB images with complex backgrounds, and various hand shapes and sizes | Cam | 40 | 2,000 |

Related works on SLR using DL that address movement orientation, trajectory, and occlusion problems
Author(s) | Year | Type of variation | Language | Signing mode | Model | Accuracy | Error rate |
---|---|---|---|---|---|---|---|
[…] | 2018 | Similarities and occlusion | American | Static | DCNN | 92.4% | - |
[…] | 2018 | Movement | Brazilian | Isolated | Long-term Recurrent Convolutional Networks | 99% | - |
[…] | 2018 | Size, shape, and position of the fingers or hands | American | Static | CNN | 82% | - |
[…] | 2018 | Hand movement | American | Isolated | VGG 16 | 99% | - |
[…] | 2018 | Movement | American | Isolated | Leap Motion Controller | 88.79% | - |
[…] | 2018 | 3D motion | Indian | Isolated | Joint Angular Displacement Maps (JADMs) | 92.14% | - |
[…] | 2018 | Head and hand movements | Indian | Continuous | CNN | 92.88% | - |
[…] | 2019 | Hand movement | Indian | Continuous | Wearable systems to measure muscle intensity, hand orientation, motion, and position | 92.50% | - |
[…] | 2019 | Variant hand orientations | Chinese | Continuous | Hierarchical Attention Network (HAN) and Latent Space | 82.7% | - |
[…] | 2019 | Similarity and trajectory | Chinese | Isolated | Deep 3-d Residual ConvNet + BiLSTM | 89.8% | - |
[…] | 2019 | Orientation of camera, hand position and movement, inter-hand relation | Vietnamese | Isolated | DCNN | 95.83% | - |
[…] | 2019 | Movement, self-occlusions, orientation, and angles | Indian | Continuous | Four stream CNN | 86.87% | - |
[…] | 2019 | Movement at different distances from the camera | American | Static | Novel DNN | 97.29% | - |
[…] | 2020 | Angles, distance, object size, and rotations | Arabic | Static | Image augmentation | 90% | 0.53 |
[…] | 2020 | Finger configuration, hand orientation, and hand position relative to the body | Arabic | Isolated | Multilayer perceptron + autoencoder | 87.69% | - |
[…] | 2020 | Hand movement | Persian | Isolated | Single Shot Detector (SSD) + CNN + LSTM | 98.42% | - |
[…] | 2020 | Shape, orientation, and trajectory | Greek | Isolated | Fully convolutional attention-based encoder-decoder | 95.31% | - |
[…] | 2020 | Trajectory | Greek | Isolated | Incorporating the depth dimension in the coordinates of the hand joints | 93.56% | - |
[…] | 2020 | Finger angles and multi-finger movements | Taiwanese | Continuous | Wristband with ten modified barometric sensors + dual DCNN | 97.5% | - |
[…] | 2020 | Movement of fingers and hands | Chinese | Isolated | Motion data from IMU sensors | 99.81% | - |
[…] | 2020 | Finger movement | Chinese | Isolated | Trigno Wireless sEMG acquisition system used to collect multichannel sEMG signals of forearm muscles | 93.33% | - |
[…] | 2020 | Finger and arm motions, two-handed signs, and hand rotation | Chinese | Continuous | Two armbands with an embedded IMU sensor and multi-channel sEMG sensors, attached to the forearms to capture both arm and finger movements | - | 10.8% |
[…] | 2020 | Hand occlusion | Persian | Isolated | Skeleton detection | 99.8% | - |
[…] | 2020 | Trajectory | Brazilian | Isolated | Converting the trajectory information into spherical coordinates | 64.33% | - |
[…] | 2021 | Trajectory | Arabic | Isolated | Multi-Sign Language Ontology (MSLO) | 94.5% | - |
[…] | 2021 | Movement | Korean | Isolated | 3DCNN | 91% | - |
[…] | 2021 | Finger movement | Chinese | Isolated | Low-cost data glove with a simple hardware structure that captures finger movement and bending simultaneously | 77.42% | - |
[…] | 2021 | Skewing and angle rotation | Bengali | Static | DCNN | 99.57% | 0.56 |
[…] | 2021 | Hand motion | American | Continuous | Sensing gloves | 86.67% | - |
[…] | 2021 | Spatial appearance and temporal motion | Chinese | Continuous | Lexical prediction network | 91.72% | 6.10 |
[…] | 2021 | Finger self-occlusions, view invariance | Indian | Continuous | Motion modelled deep attention network (M2DA-Net) | 84.95% | - |
[…] | 2021 | Occlusions of hand/hand, hands/face, or hands/upper body postures | American | Continuous | Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs), with a deep Long Short-Term Memory (LSTM) as generator and an LSTM with 3D Convolutional Neural Network (3D-CNN) as discriminator | 97% | 1.4 |
[…] | 2021 | Variant view | American | Isolated | Cascaded 3-D CNNs | 96% | - |
[…] | 2021 | Hand occlusion | Italian | Isolated | LSTM + CNN | 99.08% | - |
[…] | 2021 | Finger occlusion, motion blurring, variant signing styles | Chinese | Continuous | Dual network built upon a Graph Convolutional Network (GCN) | 98.08% | - |
[…] | 2022 | Self-structural characteristics and occlusion | Indian | Continuous | Dynamic Time Warping (DTW) | 98.7% | - |
[…] | 2022 | High similarity and complexity | American | Static | DCNN | 99.67% | 0.0016 |
[…] | 2022 | Movement | Arabic | Isolated | The difference function | 98.8% | - |
[…] | 2022 | Hand occlusion | American | Static | Re-formation layer in the CNN | 91.40% | - |
[…] | 2022 | Trajectory, hand shapes, and orientation | American | Isolated | MediaPipe landmarks with GRU | 99% | - |
[…] | 2022 | Ambiguous and 3D double-hand motion trajectories | American | Isolated | 3D extended Kalman filter (EKF) tracking, and approximation of a probability density function over a time frame | 97.98% | - |
[…] | 2022 | Movement | Turkish | Continuous | Motion History Images (MHI) generated from RGB video frames | 94.83% | - |
[…] | 2022 | Movement | Argentine | Continuous | Accumulative video motion (AVM) technique | 91.8% | - |
[…] | 2022 | Orientation angle, prosodic, and similarity | American | Continuous | Robust fast Fisher vector (FFV) in deep Bi-LSTM | 98.33% | - |
[…] | 2022 | Variant length, sequential patterns | English | Isolated | Novel Residual-Multi Head model | 95.03% | - |

Related works on SLR using DL that aim to achieve generalization
Author(s) | Year | Datasets | Technique | Result |
---|---|---|---|---|
[…] | 2018 | ASL Finger Spelling A | DCNN | 92.4% |
[…] | 2018 | NYU | Restricted Boltzmann Machine (RBM) | 90.01% |
[…] | 2018 | NTU | DAN | 98.5% |
[…] | 2018 | Collected CSL | 3D-CNN | 88.7% |
[…] | 2018 | Collected | JADM + CNN | 88.59% |
[…] | 2018 | RWTH 2012 | CNN-HMM hybrid | 30.0 WER |
[…] | 2019 | Collected | Hierarchical Attention Network (HAN) + Latent Space (LS-HAN) | 82.7% |
[…] | 2019 | RWTH-2014 | DCNN | 22.86 WER |
[…] | 2019 | CSL | Proposed multimodal two-stream CNN | 96.7% |
[…] | 2019 | DEVISIGN-D | Deep 3-d Residual ConvNet + BiLSTM | 89.8% |
[…] | 2019 | KSU-SSL | 3D-CNN | 77.32% |
[…] | 2019 | Collected RGB-D | Four stream CNN | 86.87% |
[…] | 2019 | Jochen-Triesch | Novel DNN | 97.29% |
[…] | 2020 | KSU-SSL | 3DCNN | 84.38% |
[…] | 2020 | PGSL | DCNN | 95.31% |
[…] | 2020 | ASL | Deep Elman recurrent neural network | 98.89% |
[…] | 2020 | GSL | CNN | 93.56% |
[…] | 2020 | NYU | CNN | 4.64 error |
[…] | 2020 | NUS | DCNN | 94.7% |
[…] | 2020 | HDM05 | Two-stream CNN | 93.42% |
[…] | 2020 | UTD–MHAD | Linear SVM classifier | 94.81% |
[…] | 2021 | Collected RGB images | DCNN | 99.96% |
[…] | 2021 | LSA64 | 3DCNN | 98.5% |
[…] | 2021 | ASLG-PC12 | GRU and LSTM with Bahdanau and Luong attention mechanisms | 66.59% |
[…] | 2021 | ASL alphabet, ASL MNIST, MSL | Optimized CNN based on PSO | 99.58% |
[…] | 2021 | KSU-ArSL | Inception-BiLSTM | 84.2% |
[…] | 2021 | Collected | Motion modelled deep attention network (M2DA-Net) | 84.95% |
[…] | 2021 | RWTH-2014 | Novel hyperparameter-based optimized Generative Adversarial Networks (H-GANs) | 73.9% |
[…] | 2021 | RWTH-2014 | Bidirectional encoder representations from transformers (BERT) + ResNet | 20.1 |
[…] | 2021 | Montalbano II | LSTM + CNN | 99.08% |
[…] | 2021 | RWTH-2014 | GAN | 23.4 |
[…] | 2021 | CSL-500 | Dual network built upon a Graph Convolutional Network (GCN) | 98.08% |
[…] | 2022 | SLDD | Modified CapsNet architecture (SLR-CapsNet) | 99.52% |
[…] | 2022 | RKS-PERSIANSIGN | Single shot detector, 2D convolutional neural network, singular value decomposition (SVD), and LSTM | 99.5% |
[…] | 2022 | Collected | DCNN + diffGrad optimizer | 92.43% |
[…] | 2022 | 38 BdSL | BenSignNet | 94.00% |
[…] | 2022 | Collected | DCNN | 99.41% |
[…] | 2022 | Collected | Hybrid model based on VGG16-BiLSTM | 83.36% |
[…] | 2022 | Collected | Hybrid Fist CNN | 97.89% |
[…] | 2022 | ASL | LSTM + GRU | 95.3% |
[…] | 2022 | Collected | DLSTM | 97.98% |
[…] | 2022 | AUTSL | 3D-CNN | 93.53% |
[…] | 2022 | CSL-500 | Deep R(2+1)D | 97.45% |
[…] | 2022 | MU | End-to-end fine-tuning of a pre-trained CNN model with a score-level fusion technique | 98.14% |
[…] | 2022 | SHREC | FFV-Bi-LSTM | 92.99% |

Related works on SLR using DL that address the problem of varying environmental conditions
Author(s) | Year | Language | Modality | Type of condition | Technique | Result |
---|---|---|---|---|---|---|
[…] | 2018 | Bengali | RGB images | Variant background and skin colors | Modified VGG net | 84.68% |
[…] | 2018 | American | RGB images | Noise and missing data | Augmentation | 98.13% |
[…] | 2018 | Indian | RGB video | Different viewing angles, background lighting, and distance | Novel CNN | 92.88% |
[…] | 2019 | American | Binary images | Noise | Erosion, closing, contour generation, and polygonal approximation | 96.83% |
[…] | 2019 | American | Depth image | Variant illumination and background | Acquiring depth images | 88.7% |
[…] | 2019 | Chinese | RGB and depth video | Variant illumination and background | Two-stream spatiotemporal network | 96.7% |
[…] | 2019 | Indian | RGB and depth video | Variant illumination, background, and camera distance | Four stream CNN | 86.87% |
[…] | 2020 | Arabic | RGB images | Variant illumination and skin color | DCNN | 94.31% |
[…] | 2020 | Arabic | RGB videos | Variant illumination, background, pose, scale, shape, position, and clothes | Bi-directional Long Short-Term Memory (BiLSTM) | 89.59% |
[…] | 2020 | Arabic | RGB videos | Variant illumination, clothes, position, scale, and speed | 3DCNN and softmax function | 87.69% |
[…] | 2020 | Arabic | RGB videos | Variations in heights and distances from camera | Normalization | 84.3% |
[…] | 2020 | Arabic | RGB images | Variant illumination and background | VGG16 and ResNet152 with enhanced softmax layer | 99% |
[…] | 2020 | American | Grayscale images | Illumination and skin color | Setting the hand histogram | 95% |
[…] | 2020 | American | RGB images | Variant illumination and background | DCNN | 99.96% |
[…] | 2021 | Indian | RGB video | Variant illuminations, camera positions, and orientations | GoogLeNet + BiLSTM | 76.21% |
[…] | 2021 | Indian | RGB images | Light and dark backgrounds | DCNN with a small number of parameters | 99.96% |
[…] | 2021 | American | RGB video | Noise | Gaussian blur | 99.63% |
[…] | 2021 | Korean | Depth videos | Low resolution | Augmentation | 91% |
[…] | 2021 | Bengali | RGB images | Variant backgrounds, camera angle, light contrast, and skin tone | Conventional deep learning + zero-shot learning (ZSL) | 93.68% |
[…] | 2021 | Arabic | RGB video | Variant illumination, background, and clothes | Inception-BiLSTM | 84.2% |
[…] | 2021 | American | Thermal images | Varying illumination | Live images taken by a low-resolution thermal camera | 99.52% |
[…] | 2021 | Indian | RGB video | Varying illumination | 3DCNN | 88.24% |
[…] | 2021 | American | RGB video | Noise, varying illumination | Median filtering + histogram equalization | 96% |
[…] | 2021 | Arabic | RGB images | Variant illumination and background | Region-based Convolutional Neural Network (R-CNN) | 93.4% |
[…] | 2022 | Indian | RGB video | Variant illumination and views | Greyscale conversion and histogram equalization | 98.7% |
[…] | 2022 | Arabic | RGB video | Variant illumination and background | CNN + RNN | 98.8% |
[…] | 2022 | Arabic | Greyscale images | Variant illumination and background | Sobel filter | 97% |
[…] | 2022 | Arabic | RGB and depth video | Variant background | ResNet50-BiLSTM | 99% |
[…] | 2022 | American | RGB and depth images | Noise and illumination variation | Median filtering and histogram equalization | 91.4% |
[…] | 2022 | American | Skeleton video | Noise in video frames | An innovative weighted least square (WLS) algorithm | 97.98% |
[…] | 2022 | English | Wi-Fi signal | Noise and uncleaned Wi-Fi signals | Principal Component Analysis (PCA) | 95.03% |

Related works on SLR using DL that address the feature extraction problem
Author(s) | Year | Dataset | Technique | Signing mode | Feature(s) | Result |
---|---|---|---|---|---|---|
[…] | 2018 | Collected | DCNN | Static | Hand shape | 84.6% |
[…] | 2018 | Collected | 3D CNN | Isolated | Spatiotemporal | 99% |
[…] | 2018 | ASL Finger Spelling | CNN | Static | Depth and intensity | 82% |
[…] | 2018 | RWTH-2014 | 3D Residual Convolutional Network (3D-ResNet) | Continuous | Spatial information and temporal connections across frames | 37.3 |
[…] | 2018 | Collected | 3D-CNNs | Isolated | Spatiotemporal | 88.7% |
[…] | 2018 | Collected | DCNN | Isolated | Hand palm sphere radius, and position of hand palm and fingertip | 88.79% |
[…] | 2018 | ASL Finger Spelling | Histograms of oriented gradients and Zernike moments | Static | Hand shape | 94.37% |
[…] | 2018 | Collected | CNN | Continuous | Hand shape | 92.88% |
[…] | 2018 | Collected | 3DRCNN | Continuous/Isolated | Motion, depth, and temporal | 69.2% |
[…] | 2018 | SHREC | Leap Motion Controller (LMC) sensor | Isolated, static | Finger bones of hands | 96.4% |
[…] | 2018 | Collected | Hybrid discrete wavelet transform, Gabor filter, and histogram of distances from centre of mass | Static | Hand shape | 76.25% |
[…] | 2018 | Collected | DCNN | Static | Facial expressions | 89% |
[…] | 2019 | Collected | Two-stream 3-D CNN | Continuous | Spatiotemporal | 82.7% |
[…] | 2019 | Collected | CNN | Static | Hand shape | 96.83% |
[…] | 2019 | Collected | OpenPose library | Continuous | Human key points (hand, face, body) | 55.2% |
[…] | 2019 | ASL Finger Spelling | PCANet | Static | Hand shape (corners, edges, blobs, or ridges) | 88.7% |
[…] | 2019 | SIGNUM | Stacked temporal fusion layers in DCNN | Continuous | Spatiotemporal | 2.80 |
[…] | 2019 | Collected | Leap Motion device | Continuous, isolated | 3D positions of the fingertips | 72.3% |
[…] | 2019 | Collected | CNN | Static | Hand shape | 95% |
[…] | 2019 | CSL | D-shift Net | Continuous | Spatial, time, and temporal features | 96.7% |
[…] | 2019 | DEVISIGN-D | B3D ResNet | Isolated | Spatiotemporal | 89.8% |
[…] | 2019 | Collected | Local and GIST descriptor | Isolated | Spatial and scene-based features | 95.83% |
[…] | 2019 | Collected | Restricted Boltzmann Machine (RBM) | Isolated | Handshape and network-generated features | 88.2% |
[…] | 2019 | KSU-SSL | 3D-CNN | Isolated | Hand shape, position, orientation, and temporal dependence in consecutive frames | 77.32% |
[…] | 2019 | Collected | C3D and Kinect device | Continuous | Temporal and skeleton | 94.7% |
[…] | 2019 | Collected | OpenPose library with Kinect V2 | Static | 3D skeleton | 98.9% |
[…] | 2020 | Ishara-Lipi | MobileNet V1 | Isolated | Shape of two hands | 95.71% |
[…] | 2020 | Collected | DCNN | Static | Hand shape | 94.31% |
[…] | 2020 | Collected | Single layer Convolutional Self-Organizing Map (CSOM) | Isolated | Hand shape | 89.59% |
[…] | 2020 | KSU-SSL | Enhanced C3D architecture | Isolated | Spatiotemporal of hand and body | 87.69% |
[…] | 2020 | KSU-SSL | 3DCNN | Isolated | Spatiotemporal | 84.3% |
[…] | 2020 | Collected | ResNet50 model | Isolated | Hand shape, Extra Spatial hand Relation (ESHR) features, Hand Pose (HP), and temporal | 98.42% |
[…] | 2020 | Polytropon (PGSL) | ResNet-18 | Isolated | Optical flow of skeletal, handshapes, and mouthing | 95.31% |
[…] | 2020 | Collected | Discrete cosine transform, Zernike moment, scale-invariant feature transform, and social ski driver optimization algorithm | Static | Hand shape | 98.89% |
[…] | 2020 | RWTH-2014 | Temporal convolution unit and dynamic hierarchical bidirectional GRU unit | Continuous | Spatiotemporal | 10.73% BLEU |
[…] | 2020 | Collected | Standard-score normalization of the raw Channel State Information (CSI) acquired from the Wi-Fi device, and the MIFS algorithm | Static and continuous | Cross-cumulant features (unbiased estimates of covariance, normalized skewness, normalized kurtosis) | 99.9% |
[…] | 2020 | GSL | OpenPose human joint detector | Isolated | 3D hand skeletal, and region of hand and mouth | 93.56% |
[…] | 2020 | Collected | Four channel surface electromyography (sEMG) signals | Isolated | Time-frequency joint features | 93.33% |
[…] | 2020 | Collected | Euler angle, quaternion from IMU signal | Continuous | Hand rotation | 10.8% WER |
[…] | 2020 | RKS-PERSIANSIGN | 3DCNNs | Isolated | Spatiotemporal | 99.8% |
[…] | 2020 | ASL Finger Spelling A | DCNN | Static | Hand shape | 99.96% |
[…] | 2020 | Collected | Color-coded topographical descriptor built from joint distances and angles, used in a two-stream CNN | Isolated | Distance and angular | 93.01% |
[…] | 2020 | Collected | Two CNN models and a descriptor based on histogram of cumulative magnitudes | Isolated | Two hands, skeleton, and body | 64.33% |
[…] | 2021 | RWTH-2014T | Semantic Focus of Interest Network with Face Highlight Module (SFoI-Net-FHM) | Isolated | Body and facial expression | 10.89 |
[…] | 2021 | Collected | ConvLSTM | Isolated | Spatiotemporal | 94.5% |
[…] | 2021 | Collected | ResNet50 | Static | Hand area, length of the axis of the first eigenvector, and hand position changes | 96.42% |
[…] | 2021 | Collected | f-CNN (fusion of 1-D CNN and 2-D CNN) | Isolated | Time and spatial-domain features of finger resistance movement | 77.42% |
[…] | 2021 | MU | Modified AlexNet and VGG16 | Static | Hand edges and shape | 99.82% |
[…] | 2021 | Collected | VGG net of six convolutional layers | Static | Hand shape | 97.62% |
[…] | 2021 | 38 BdSL | DenseNet201 and Linear Discriminant Analysis | Static | Hand shape | 93.68% |
[…] | 2021 | KSU-ArSL | Bi-LSTM | Isolated | Spatiotemporal | 84.2% |
[…] | 2021 | Collected | Paired pooling network in view pair pooling net (VPPN) | Isolated | Spatiotemporal | 84.95% |
[…] | 2021 | ASLLVD | Bayesian Parallel Hidden Markov Model (BPaHMM) + stacked denoising variational autoencoders (SD-VAE) + PCA | Continuous | Shape of hand, palm, and face, along with their position, speed, and distance between them | 97% |
[…] | 2021 | ASLLVD | Cascaded 3-D CNNs | Isolated | Spatiotemporal | 96.0% |
[…] | 2021 | Collected | Leap Motion Controller | Static and isolated | Sphere radius, angles between fingers, and their distance | 91.82% |
[…] | 2021 | RWTH-2014 | (3+2+1)D ResNet | Continuous | Height, motion of hand, and frame blurriness levels | 23.30 |
[…] | 2021 | Montalbano II | AlexNet + Optical Flow (OF) + Scene Flow (SF) methods | Isolated | Pixel level and hand pose | 99.08% |
[…] | 2021 | RWTH-2014 | GAN | Continuous | Spatiotemporal | 23.4 |
[…] | 2021 | MNIST | DCNN | Static | Hand shape | 98.58% |
[…] | 2021 | Collected | R-CNN | Static | Hand shape | 93% |
[…] | 2021 | CSL-500 | Multi-scale spatiotemporal attention network (MSSTA) | Isolated | Spatiotemporal | 98.08% |
[…] | 2022 | MNIST | Modified CapsNet | Static | Spatial and orientations | 99.60% |
[…] | 2022 | RKS-PERSIANSIGN | Singular value decomposition (SVD) | Isolated | 3D hand key points between the segments of each finger, and their angles | 99.5% |
[…] | 2022 | Collected | 2DCRNN + 3DCRNN | Continuous | Spatiotemporal out of small patches | 99% |
[…] | 2022 | Collected | Atrous convolution mechanism and semantic spatial multi-cue model | Static, isolated | Pose, face, hand, spatial, and full-frame features | 99.85% |
[…] | 2022 | Collected | 4 DNN models using 2D and 3D CNN | Isolated | Spatiotemporal | 99% |
[…] | 2022 | Collected | Scale-Invariant Feature Transform (SIFT) | Static | Corner, edges, rotation, blurring, and illumination | 97.89% |
[…] | 2022 | Collected | InceptionResNetV2 | Isolated | Hand shape | 97% |
[…] | 2022 | Collected | AlexNet | Static | Hand shape | 94.81% |
[…] | 2022 | Collected | Sensor + mathematical equations + CNN | Continuous | Mean, magnitude of mean, variance, correlation, covariance, and frequency-domain features + spatiotemporal | 0.088 |
[…] | 2022 | Collected | MediaPipe framework | Isolated | Hands, body, and face | 99% |
[…] | 2022 | Collected | Bi-RNN network, maximal information correlation, and Leap Motion Controller | Isolated | Hand shape, orientation, position, and motion of 3D skeletal videos | 97.98% |
[…] | 2022 | LSA64 | Dynamic motion network (DMN) + accumulative motion network (AMN) | Isolated | Spatiotemporal | 91.8% |
[…] | 2022 | CSL-500 | Spatial–temporal–channel attention (STCA) | Isolated | Spatiotemporal | 97.45% |
[…] | 2022 | Collected | SURF (Speeded Up Robust Features) | Isolated | Distribution of the intensity material within the neighborhood of the interest point | 99% |
[…] | 2022 | Collected | Thresholding and Fast Fisher Vector Encoding (FFV) | Isolated | Hand, palm, and finger shape and position, and 3D skeletal hand characteristics | 98.33% |

Related works on SLR using DL that address the segmentation problem
Author(s) | Year | Input modality | Segmentation method | Result |
---|---|---|---|---|
[…] | 2018 | RGB image | HSV color model | 99.85% |
[…] | 2018 | RGB image | Skin segmentation algorithm based on color information | 94.7% |
[…] | 2018 | RGB images | k-means-based algorithm | 94.37% |
[…] | 2019 | RGB images | Color segmentation by MLP network | 96.83% |
[…] | 2019 | Depth image | Wrist line localization by algorithm-based thresholding | 88.7% |
[…] | 2019 | RGB and depth video | Aligned Random Sampling in Segments (ARSS) | 96.7% |
[…] | 2019 | RGB and depth images | Depth-based segmentation using data from a Kinect RGB-D camera | 97.71% |
[…] | 2019 | RGB video | Adaptive temporal encoder to capture crucial RGB visemes and skeleton signees | 94.7% |
[…] | 2020 | RGB videos | Hand semantic segmentation with DeepLabv3+ | 89.59% |
[…] | 2020 | RGB videos | Novel method based on OpenPose | 87.69% |
[…] | 2020 | RGB videos | Viola and Jones, and human body part ratios | 84.3% |
[…] | 2020 | RGB images | Roberts edge detection method | 99.3% |
[…] | 2020 | RGB video | SSD, a feed-forward convolutional network, with a Non-Maximum Suppression (NMS) step used at the end to estimate the final detection | 98.42% |
[…] | 2020 | RGB images | Sobel edge detector, and skin color by thresholding | 98.89% |
[…] | 2020 | RGB images | OpenCV with a Region of Interest (ROI) box in the driver program | 93% |
[…] | 2020 | RGB videos | Frame stream density compression (FSDC) algorithm | 10.73 error |
[…] | 2020 | RGB videos | Attention-based encoder-decoder model that realizes end-to-end continuous SLR without segmentation | 10.8% WER |
[…] | 2020 | RGB images | Single Shot MultiBox Detector (SSD) | 99.90% |
[…] | 2021 | RGB video | Canny | 99.63% |
[…] | 2021 | RGB images | Erosion, dilation, and watershed segmentation | 99.7% |
[…] | 2021 | RGB video | Data sliding window | 86.67% |
[…] | 2021 | RGB images | R-CNN | 93% |
[…] | 2022 | RGB videos | Novel Adaptive Hough Transform (AHT) | 98.7% |
[…] | 2022 | RGB images and video | Grad-CAM and CamShift algorithm | 99.85% |
[…] | 2022 | Grey images | YCbCr, HSV and watershed algorithm | 99.60% |
[…] | 2022 | RGB images | Sobel operator method | 97% |
[…] | 2022 | RGB images | Semantic segmentation | 99.91% |
[…] | 2022 | RGB images | R-CNN | 99.7% |
[…] | 2022 | RGB video | Mask created by extracting the maximum connected region in the foreground, assumed to be the hand + Canny method | 99% |