[Yu, D., & Deng, L. (2016). Automatic Speech Recognition. Springer London Limited.]
[Ruby, J. (2020). Automatic Speech Recognition and Machine Learning for Robotic Arm in Surgery. American Journal of Clinical Surgery, 2(1), 10-18.]
[Juang, B. H. (1991). Speech recognition in adverse environments. Computer Speech & Language, 5(3), 275-294. doi:10.1016/0885-2308(91)90011-E]
[Kitchenham, B., Brereton, O. P., Budgen, D., Turner, M., Bailey, J., & Linkman, S. (2009). Systematic literature reviews in software engineering – a systematic literature review. Information and Software Technology, 51(1), 7-15. doi:10.1016/j.infsof.2008.09.009]
[Brereton, P., Kitchenham, B. A., Budgen, D., Turner, M., & Khalil, M. (2007). Lessons from applying the systematic literature review process within the software engineering domain. Journal of Systems and Software, 80(4), 571-583. doi:10.1016/j.jss.2006.07.009]
[Shahin, M., Ahmed, B., McKechnie, J., Ballard, K., & Gutierrez-Osuna, R. (2014). A comparison of GMM-HMM and DNN-HMM based pronunciation verification techniques for use in the assessment of childhood apraxia of speech. In Fifteenth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2014-377]
[Tachbelie, M. Y., Abulimiti, A., Abate, S. T., & Schultz, T. (2020, May). DNN-based speech recognition for GlobalPhone languages. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8269-8273). IEEE. doi:10.1109/ICASSP40776.2020.9053144]
[Kadyan, V., & Kaur, M. (2020). SGMM-Based Modeling Classifier for Punjabi Automatic Speech Recognition System. In Smart Computing Paradigms: New Progresses and Challenges (pp. 149-155). Springer, Singapore.]
[Wu, Z., & Cao, Z. (2005). Improved MFCC-based feature for robust speaker identification. Tsinghua Science & Technology, 10(2), 158-161. doi:10.1016/S1007-0214(05)70048-1]
[Dahl, G. E., Yu, D., Deng, L., & Acero, A. (2011). Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 30-42. doi:10.1109/TASL.2011.2134090]
[Tsunoo, E., Kashiwagi, Y., Asakawa, S., & Kumakura, T. (2019). End-to-end adaptation with backpropagation through WFST for on-device speech recognition system. arXiv preprint arXiv:1905.07149.]
[Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., & Mohri, M. (2007, July). OpenFst: A general and efficient weighted finite-state transducer library. In International Conference on Implementation and Application of Automata (pp. 11-23). Springer, Berlin, Heidelberg. doi:10.1007/978-3-540-76336-9_3]
[Chen, Z., Jain, M., Wang, Y., Seltzer, M. L., & Fuegen, C. (2019, May). End-to-end contextual speech recognition using class language models and a token passing decoder. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6186-6190). IEEE. doi:10.1109/ICASSP.2019.8683573]
[Kneser, R., & Ney, H. (1993). Improved clustering techniques for class-based statistical language modelling. In Third European Conference on Speech Communication and Technology. doi:10.21437/Eurospeech.1993-229]
[Hall, K., Cho, E., Allauzen, C., Beaufays, F., Coccaro, N., Nakajima, K., ... & Zhang, L. (2015). Composition-based on-the-fly rescoring for salient n-gram biasing. In Sixteenth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2015-340]
[Wang, C., Wu, Y., Liu, S., Li, J., Lu, L., Ye, G., & Zhou, M. (2020). Low latency end-to-end streaming speech recognition with a scout network. arXiv preprint arXiv:2003.10369.]
[Sun, R. H., & Chol, R. J. (2020). Subspace Gaussian mixture based language modeling for large vocabulary continuous speech recognition. Speech Communication, 117, 21-27. doi:10.1016/j.specom.2020.01.001]
[Hermansky, H., Ellis, D. P., & Sharma, S. (2000, June). Tandem connectionist feature extraction for conventional HMM systems. In 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100) (Vol. 3, pp. 1635-1638). IEEE.]
[Xu, H., Su, H., Chng, E. S., & Li, H. (2014). Semi-supervised training for bottle-neck feature based DNN-HMM hybrid systems. In Fifteenth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2014-472]
[Yu, D., & Seltzer, M. L. (2011). Improved bottleneck features using pretrained deep neural networks. In Twelfth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2011-91]
[Tanaka, T., Masumura, R., Moriya, T., Oba, T., & Aono, Y. (2019). A Joint End-to-End and DNN-HMM Hybrid Automatic Speech Recognition System with Transferring Sharable Knowledge. In INTERSPEECH (pp. 2210-2214). doi:10.21437/Interspeech.2019-2263]
[Yu, D., & Seltzer, M. L. (2011). Improved bottleneck features using pretrained deep neural networks. In Twelfth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2011-91]
[Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.]
[Hori, T., Hori, C., Minami, Y., & Nakamura, A. (2007). Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1352-1365. doi:10.1109/TASL.2006.889790]
[Chiu, C. C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., ... & Bacchiani, M. (2018, April). State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4774-4778). IEEE. doi:10.1109/ICASSP.2018.8462105]
[Chan, W., Jaitly, N., Le, Q., & Vinyals, O. (2016, March). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4960-4964). IEEE. doi:10.1109/ICASSP.2016.7472621]
[Schuster, M., & Nakajima, K. (2012, March). Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5149-5152). IEEE. doi:10.1109/ICASSP.2012.6289079]
[Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Dean, J. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.]
[Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.]
[Nguyen, T. S., Stueker, S., Niehues, J., & Waibel, A. (2020, May). Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7689-7693). IEEE. doi:10.1109/ICASSP40776.2020.9054130]
[Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.]
[Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.]
[Weng, C., Cui, J., Wang, G., Wang, J., Yu, C., Su, D., & Yu, D. (2018, September). Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition. In Interspeech (pp. 761-765). doi:10.21437/Interspeech.2018-1030]
[Pham, N. Q., Nguyen, T. S., Niehues, J., Müller, M., Stüker, S., & Waibel, A. (2019). Very deep self-attention networks for end-to-end speech recognition. arXiv preprint arXiv:1904.13377.]
[Xu, H., Ding, S., & Watanabe, S. (2019, May). Improving end-to-end speech recognition with pronunciation-assisted sub-word modeling. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7110-7114). IEEE. doi:10.1109/ICASSP.2019.8682494]
[Sperber, M., Niehues, J., Neubig, G., Stüker, S., & Waibel, A. (2018). Self-attentional acoustic models. arXiv preprint arXiv:1803.09519.]
[Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.]
[Liu, A. H., Sung, T. W., Chuang, S. P., Lee, H. Y., & Lee, L. S. (2020, May). Sequence-to-Sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7879-7883). IEEE. doi:10.1109/ICASSP40776.2020.9053324]
[Unanue, I. J., Borzeshi, E. Z., Esmaili, N., & Piccardi, M. (2019). ReWE: Regressing word embeddings for regularization of neural machine translation systems. arXiv preprint arXiv:1904.02461.]
[Toshniwal, S., Kannan, A., Chiu, C. C., Wu, Y., Sainath, T. N., & Livescu, K. (2018, December). A comparison of techniques for language model integration in encoder-decoder speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 369-375). IEEE. doi:10.1109/SLT.2018.8639038]
[Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41-75. doi:10.1023/A:1007379606734]
[Pan, S. J., & Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359. doi:10.1109/TKDE.2009.191]
[Inaguma, H., Cho, J., Baskar, M. K., Kawahara, T., & Watanabe, S. (2019, May). Transfer learning of language-independent end-to-end ASR with language model fusion. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6096-6100). IEEE. doi:10.1109/ICASSP.2019.8682918]
[Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.]
[Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.]
[Zeyer, A., Irie, K., Schlüter, R., & Ney, H. (2018). Improved training of end-to-end attention models for speech recognition. arXiv preprint arXiv:1805.03294.]
[Sriram, A., Jun, H., Satheesh, S., & Coates, A. (2017). Cold fusion: Training seq2seq models together with language models. arXiv preprint arXiv:1708.06426.]
[Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H. C., ... & Bengio, Y. (2015). On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.]
[Denisov, P., & Vu, N. T. (2019). End-to-end multi-speaker speech recognition using speaker embeddings and transfer learning. arXiv preprint arXiv:1908.04737.]
[Watanabe, S., Hori, T., Kim, S., Hershey, J. R., & Hayashi, T. (2017). Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1240-1253. doi:10.1109/JSTSP.2017.2763455]
[Jia, Y., Zhang, Y., Weiss, R. J., Wang, Q., Shen, J., Ren, F., ... & Wu, Y. (2018). Transfer learning from speaker verification to multispeaker text-to-speech synthesis. arXiv preprint arXiv:1806.04558.]
[Seki, H., Hori, T., Watanabe, S., Roux, J. L., & Hershey, J. R. (2018). A purely end-to-end system for multi-speaker speech recognition. arXiv preprint arXiv:1805.05826.]
[Heigold, G., Vanhoucke, V., Senior, A., Nguyen, P., Ranzato, M. A., Devin, M., & Dean, J. (2013, May). Multilingual acoustic models using distributed deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 8619-8623). IEEE. doi:10.1109/ICASSP.2013.6639348]
[Kubo, Y., & Bacchiani, M. (2020, May). Joint phoneme-grapheme model for end-to-end speech recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6119-6123). IEEE. doi:10.1109/ICASSP40776.2020.9054557]
[Lee, J., Mansimov, E., & Cho, K. (2018). Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.]
[Xia, Y., Tian, F., Wu, L., Lin, J., Qin, T., Yu, N., & Liu, T. Y. (2017, December). Deliberation networks: Sequence generation beyond one-pass decoding. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 1782-1792).]
[Wang, D., & Zheng, T. F. (2015, December). Transfer learning for speech and language processing. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) (pp. 1225-1237). IEEE. doi:10.1109/APSIPA.2015.7415532]
[Yu, C., Chen, Y., Li, Y., Kang, M., Xu, S., & Liu, X. (2019). Cross-language end-to-end speech recognition research based on transfer learning for the low-resource Tujia language. Symmetry, 11(2), 179. doi:10.3390/sym11020179]
[Ozeki, M., & Okatani, T. (2014, November). Understanding convolutional neural networks in terms of category-level attributes. In Asian Conference on Computer Vision (pp. 362-375). Springer, Cham. doi:10.1007/978-3-319-16808-1_25]
[Zeghidour, N., Usunier, N., Synnaeve, G., Collobert, R., & Dupoux, E. (2018). End-to-end speech recognition from the raw waveform. arXiv preprint arXiv:1806.07098.]
[Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., ... & Zhu, Z. (2016, June). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning (pp. 173-182). PMLR.]
[Abad, A., Bell, P., Carmantini, A., & Renals, S. (2020, May). Cross-lingual transfer learning for zero-resource domain adaptation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6909-6913). IEEE. doi:10.1109/ICASSP40776.2020.9054468]
[Hsu, J. Y., Chen, Y. J., & Lee, H. Y. (2020, May). Meta learning for end-to-end low-resource speech recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7844-7848). IEEE. doi:10.1109/ICASSP40776.2020.9053112]
[Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., & Hadsell, R. (2018). Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960.]
[Finn, C., Abbeel, P., & Levine, S. (2017, July). Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (pp. 1126-1135). PMLR.]
[Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376). doi:10.1145/1143844.1143891]
[Matsuura, K., Mimura, M., Sakai, S., & Kawahara, T. (2020). Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition. arXiv preprint arXiv:2005.09256.]
[Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.]
[Zhang, S., Do, C. T., Doddipatla, R., & Renals, S. (2020, May). Learning Noise Invariant Features Through Transfer Learning For Robust End-to-End Speech Recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7024-7028). IEEE. doi:10.1109/ICASSP40776.2020.9053169]
[Chen, Z., & Yang, H. (2020, June). Yi Language Speech Recognition using Deep Learning Methods. In 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC) (Vol. 1, pp. 1064-1068). IEEE. doi:10.1109/ITNEC48623.2020.9084771]
[Rashmi, S., Hanumanthappa, M., & Reddy, M. V. (2018). Hidden Markov Model for speech recognition system—a pilot study and a naive approach for speech-to-text model. In Speech and Language Processing for Human-Machine Communications (pp. 77-90). Springer, Singapore. doi:10.1007/978-981-10-6626-9_9]
[Karafiát, M., Baskar, M. K., Veselý, K., Grézl, F., Burget, L., & Černocký, J. (2018, April). Analysis of multilingual BLSTM acoustic model on low and high resource languages. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5789-5793). IEEE. doi:10.1109/ICASSP.2018.8462083]
[Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328-339. doi:10.1109/29.21701]
[Kim, S., Hori, T., & Watanabe, S. (2017, March). Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4835-4839). IEEE. doi:10.1109/ICASSP.2017.7953075]
[Aung, M. A. A., & Pa, W. P. (2020, February). Time Delay Neural Network for Myanmar Automatic Speech Recognition. In 2020 IEEE Conference on Computer Applications (ICCA) (pp. 1-4). IEEE. doi:10.1109/ICCA49400.2020.9022808]
[Ragni, A., Knill, K. M., Rath, S. P., & Gales, M. J. (2014, September). Data augmentation for low resource languages. In INTERSPEECH 2014: 15th Annual Conference of the International Speech Communication Association (pp. 810-814). International Speech Communication Association (ISCA). doi:10.21437/Interspeech.2014-207]
[Gokay, R., & Yalcin, H. (2019, March). Improving low resource Turkish speech recognition with data augmentation and TTS. In 2019 16th International Multi-Conference on Systems, Signals & Devices (SSD) (pp. 357-360). IEEE. doi:10.1109/SSD.2019.8893184]
[Ko, T., Peddinti, V., Povey, D., & Khudanpur, S. (2015). Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2015-711]
[Tachibana, H., Uenoyama, K., & Aihara, S. (2018, April). Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4784-4788). IEEE. doi:10.1109/ICASSP.2018.8461829]
[Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., ... & Zhu, Z. (2016, June). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning (pp. 173-182). PMLR.]
[Available: https://github.com/SeanNaren/deepspeech.pytorch, accessed 25.05.2020.]
[Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376). doi:10.1145/1143844.1143891]
[Heafield, K., Pouzyrevsky, I., Clark, J. H., & Koehn, P. (2013, August). Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 690-696).]
[Thai, B., Jimerson, R., Ptucha, R., & Prud'hommeaux, E. (2020, May). Fully Convolutional ASR for Less-Resourced Endangered Languages. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) (pp. 126-130).]
[Miao, Y., Gowayyed, M., Na, X., Ko, T., Metze, F., & Waibel, A. (2016, March). An empirical exploration of CTC acoustic models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2623-2627). IEEE. doi:10.1109/ICASSP.2016.7472152]
[Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.]