Open Access

Enhanced LSTM network with semi-supervised learning and data augmentation for low-resource ASR

Mar 04, 2025


Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.

Anton Ragni, Kate M Knill, Shakti P Rath, and Mark JF Gales. Data augmentation for low resource languages. In INTERSPEECH 2014: 15th Annual Conference of the International Speech Communication Association, pages 810–814. International Speech Communication Association (ISCA), 2014.

Shiyu Zhou, Shuang Xu, and Bo Xu. Multilingual end-to-end speech recognition with a single transformer on low-resource languages. arXiv preprint arXiv:1806.05059, 2018.

Cheng Yi, Jianzhong Wang, Ning Cheng, Shiyu Zhou, and Bo Xu. Applying wav2vec 2.0 to speech recognition in various low-resource languages. arXiv preprint arXiv:2012.12121, 2020.

Satwinder Singh, Ruili Wang, and Feng Hou. Improved meta learning for low resource speech recognition. In ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4798–4802. IEEE, 2022.

Ankit Kumar and Rajesh Kumar Aggarwal. A hybrid CNN-LiGRU acoustic modeling using raw waveform SincNet for Hindi ASR. Computer Science, 21(4), 2020.

A Kumar, T Choudhary, M Dua, and M Sabharwal. Hybrid end-to-end architecture for Hindi speech recognition system. In Proceedings of the International Conference on Paradigms of Communication, Computing and Data Sciences: PCCDS 2021, pages 267–276. Springer, 2022.

Ankit Kumar and Rajesh K Aggarwal. An investigation of multilingual TDNN-BLSTM acoustic modeling for Hindi speech recognition. International Journal of Sensors Wireless Communications and Control, 12(1):19–31, 2022.

Ali Bou Nassif, Ismail Shahin, Imtinan Attili, Mohammad Azzeh, and Khaled Shaalan. Speech recognition using deep neural networks: A systematic review. IEEE Access, 7:19143–19165, 2019.

Martijn Bartelds, Nay San, Bradley McDonnell, Dan Jurafsky, and Martijn Wieling. Making more of little data: Improving low-resource automatic speech recognition using data augmentation. arXiv preprint arXiv:2305.10951, 2023.

Ankit Kumar and Rajesh Kumar Aggarwal. An exploration of semi-supervised and language-adversarial transfer learning using hybrid acoustic model for Hindi speech recognition. Journal of Reliable Intelligent Environments, 8(2):117–132, 2022.

Jacob Kahn, Ann Lee, and Awni Hannun. Self-training for end-to-end speech recognition. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7084–7088. IEEE, 2020.

Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, and Takaaki Hori. Momentum pseudo-labeling for semi-supervised speech recognition. arXiv preprint arXiv:2106.08922, 2021.

Daniel S Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V Le. Improved noisy student training for automatic speech recognition. arXiv preprint arXiv:2005.09629, 2020.

Han Zhu, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, and Yonghong Yan. Alternative pseudo-labeling for semi-supervised automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.

Julia Mainzinger. Fine-tuning ASR models for very low-resource languages: A study on Mvskoke. Master's thesis, University of Washington, 2024.

Robert Jimerson, Zoey Liu, and Emily Prud'Hommeaux. An (unhelpful) guide to selecting the best ASR architecture for your under-resourced language. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1008–1016, 2023.

Shiyue Zhang, Ben Frey, and Mohit Bansal. How can NLP help revitalize endangered languages? A case study and roadmap for the Cherokee language. arXiv preprint arXiv:2204.11909, 2022.

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al. Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97):1–52, 2024.

Marieke Meelen, Alexander O'Neill, and Rolando Coto-Solano. End-to-end speech recognition for endangered languages of Nepal. In Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 83–93, 2024.

Panji Arisaputra, Alif Tri Handoyo, and Amalia Zahra. XLS-R deep learning model for multilingual ASR on low-resource languages: Indonesian, Javanese, and Sundanese. arXiv preprint arXiv:2401.06832, 2024.

Siqing Qin, Longbiao Wang, Sheng Li, Jianwu Dang, and Lixin Pan. Improving low-resource Tibetan end-to-end ASR by multilingual and multilevel unit modeling. EURASIP Journal on Audio, Speech, and Music Processing, 2022(1):2, 2022.

Kaushal Bhogale, Abhigyan Raman, Tahir Javed, Sumanth Doddapaneni, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh M Khapra. Effectiveness of mining audio and text pairs from public data for improving ASR systems for low-resource languages. In ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

Zoey Liu, Justin Spence, and Emily Prud'Hommeaux. Studying the impact of language model size for low-resource ASR. In Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 77–83, 2023.

Gueorgui Pironkov, Sean UN Wood, and Stéphane Dupont. Hybrid-task learning for robust automatic speech recognition. Computer Speech & Language, 64:101103, 2020.

Mohamed Tamazin, Ahmed Gouda, and Mohamed Khedr. Enhanced automatic speech recognition system based on enhancing power-normalized cepstral coefficients. Applied Sciences, 9(10):2166, 2019.

Syed Shahnawazuddin, KT Deepak, Gayadhar Pradhan, and Rohit Sinha. Enhancing noise and pitch robustness of children's ASR. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5225–5229. IEEE, 2017.

Jiri Malek, Jindrich Zdansky, and Petr Cerva. Robust automatic recognition of speech with background music. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5210–5214. IEEE, 2017.

Sheng-Chieh Lee, Jhing-Fa Wang, and Miao-Hia Chen. Threshold-based noise detection and reduction for automatic speech recognition system in human-robot interactions. Sensors, 18(7):2068, 2018.

Satyender Jaglan, Sanjeev Kumar Dhull, and Krishna Kant Singh. Tertiary wavelet model based automatic epilepsy classification system. International Journal of Intelligent Unmanned Systems, 11(1):166–181, 2023.

Yuzong Liu and Katrin Kirchhoff. Graph-based semisupervised learning for acoustic modeling in automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11):1946–1956, 2016.

Michael I Mandel and Jon Barker. Multichannel spatial clustering for robust far-field automatic speech recognition in mismatched conditions. In INTERSPEECH, pages 1991–1995, 2016.

Naoki Hirayama, Koichiro Yoshino, Katsutoshi Itoyama, Shinsuke Mori, and Hiroshi G Okuno. Automatic speech recognition for mixed dialect utterances by mixing dialect language models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(2):373–382, 2015.

Delu Zeng, Minyu Liao, Mohammad Tavakolian, Yulan Guo, Bolei Zhou, Dewen Hu, Matti Pietikäinen, and Li Liu. Deep learning for scene classification: A survey. arXiv preprint arXiv:2101.10531, 2021.

Harveen Singh Chadha, Anirudh Gupta, Priyanshi Shah, Neeraj Chhimwal, Ankur Dhuriya, Rishabh Gaur, and Vivek Raghavan. Vakyansh: ASR toolkit for low resource Indic languages. arXiv preprint arXiv:2203.16512, 2022.

Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. Advances in Neural Information Processing Systems, 33:8067–8077, 2020.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022–17033, 2020.

Lori Lamel, Jean-Luc Gauvain, and Gilles Adda. Lightly supervised and unsupervised acoustic model training. Computer Speech & Language, 16(1):115–129, 2002.

Ho Yin Chan and Phil Woodland. Improving broadcast news transcription by lightly supervised discriminative training. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I–737. IEEE, 2004.

Vimal Manohar, Hossein Hadian, Daniel Povey, and Sanjeev Khudanpur. Semi-supervised training of acoustic models using lattice-free MMI. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4844–4848. IEEE, 2018.

Thiago Fraga-Silva, Jean-Luc Gauvain, and Lori Lamel. Lattice-based unsupervised acoustic model training. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4656–4659. IEEE, 2011.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Language: English
Publication timeframe: Once per year
Journal Subjects: Engineering, Introductions and Overviews, Engineering, other