[Yu, D., & Deng, L. (2016). Automatic Speech Recognition. Springer London Limited.]
[Ruby, J. (2020). Automatic Speech Recognition and Machine Learning for Robotic Arm in Surgery. American Journal of Clinical Surgery, 2(1), 10-18.]
[Juang, B. H. (1991). Speech recognition in adverse environments. Computer Speech & Language, 5(3), 275-294. doi:10.1016/0885-2308(91)90011-E]
[Kitchenham, B., Brereton, O. P., Budgen, D., Turner, M., Bailey, J., & Linkman, S. (2009). Systematic literature reviews in software engineering – a systematic literature review. Information and Software Technology, 51(1), 7-15. doi:10.1016/j.infsof.2008.09.009]
[Brereton, P., Kitchenham, B. A., Budgen, D., Turner, M., & Khalil, M. (2007). Lessons from applying the systematic literature review process within the software engineering domain. Journal of Systems and Software, 80(4), 571-583. doi:10.1016/j.jss.2006.07.009]
[Shahin, M., Ahmed, B., McKechnie, J., Ballard, K., & Gutierrez-Osuna, R. (2014). A comparison of GMM-HMM and DNN-HMM based pronunciation verification techniques for use in the assessment of childhood apraxia of speech. In Fifteenth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2014-377]
[Tachbelie, M. Y., Abulimiti, A., Abate, S. T., & Schultz, T. (2020, May). DNN-based speech recognition for GlobalPhone languages. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8269-8273). IEEE. doi:10.1109/ICASSP40776.2020.9053144]
[Kadyan, V., & Kaur, M. (2020). SGMM-Based Modeling Classifier for Punjabi Automatic Speech Recognition System. In Smart Computing Paradigms: New Progresses and Challenges (pp. 149-155). Springer, Singapore.]
[Wu, Z., & Cao, Z. (2005). Improved MFCC-based feature for robust speaker identification. Tsinghua Science & Technology, 10(2), 158-161. doi:10.1016/S1007-0214(05)70048-1]
[Dahl, G. E., Yu, D., Deng, L., & Acero, A. (2011). Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 30-42. doi:10.1109/TASL.2011.2134090]
[Tsunoo, E., Kashiwagi, Y., Asakawa, S., & Kumakura, T. (2019). End-to-end adaptation with backpropagation through WFST for on-device speech recognition system. arXiv preprint arXiv:1905.07149.]
[Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., & Mohri, M. (2007, July). OpenFst: A general and efficient weighted finite-state transducer library. In International Conference on Implementation and Application of Automata (pp. 11-23). Springer, Berlin, Heidelberg. doi:10.1007/978-3-540-76336-9_3]
[Chen, Z., Jain, M., Wang, Y., Seltzer, M. L., & Fuegen, C. (2019, May). End-to-end contextual speech recognition using class language models and a token passing decoder. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6186-6190). IEEE. doi:10.1109/ICASSP.2019.8683573]
[Kneser, R., & Ney, H. (1993). Improved clustering techniques for class-based statistical language modelling. In Third European Conference on Speech Communication and Technology. doi:10.21437/Eurospeech.1993-229]
[Hall, K., Cho, E., Allauzen, C., Beaufays, F., Coccaro, N., Nakajima, K., ... & Zhang, L. (2015). Composition-based on-the-fly rescoring for salient n-gram biasing. In Sixteenth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2015-340]
[Wang, C., Wu, Y., Liu, S., Li, J., Lu, L., Ye, G., & Zhou, M. (2020). Low latency end-to-end streaming speech recognition with a scout network. arXiv preprint arXiv:2003.10369.]
[Sun, R. H., & Chol, R. J. (2020). Subspace Gaussian mixture based language modeling for large vocabulary continuous speech recognition. Speech Communication, 117, 21-27. doi:10.1016/j.specom.2020.01.001]
[Hermansky, H., Ellis, D. P., & Sharma, S. (2000, June). Tandem connectionist feature extraction for conventional HMM systems. In 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100) (Vol. 3, pp. 1635-1638). IEEE.]
[Xu, H., Su, H., Chng, E. S., & Li, H. (2014). Semi-supervised training for bottle-neck feature based DNN-HMM hybrid systems. In Fifteenth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2014-472]
[Yu, D., & Seltzer, M. L. (2011). Improved bottleneck features using pretrained deep neural networks. In Twelfth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2011-91]
[Tanaka, T., Masumura, R., Moriya, T., Oba, T., & Aono, Y. (2019). A Joint End-to-End and DNN-HMM Hybrid Automatic Speech Recognition System with Transferring Sharable Knowledge. In INTERSPEECH (pp. 2210-2214). doi:10.21437/Interspeech.2019-2263]
[Yu, D., & Seltzer, M. L. (2011). Improved bottleneck features using pretrained deep neural networks. In Twelfth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2011-91]
[Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.]
[Hori, T., Hori, C., Minami, Y., & Nakamura, A. (2007). Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1352-1365. doi:10.1109/TASL.2006.889790]
[Chiu, C. C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., ... & Bacchiani, M. (2018, April). State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4774-4778). IEEE. doi:10.1109/ICASSP.2018.8462105]
[Chan, W., Jaitly, N., Le, Q., & Vinyals, O. (2016, March). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4960-4964). IEEE. doi:10.1109/ICASSP.2016.7472621]
[Schuster, M., & Nakajima, K. (2012, March). Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5149-5152). IEEE. doi:10.1109/ICASSP.2012.6289079]
[Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Dean, J. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.]
[Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.]
[Nguyen, T. S., Stueker, S., Niehues, J., & Waibel, A. (2020, May). Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7689-7693). IEEE. doi:10.1109/ICASSP40776.2020.9054130]
[Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.]
[Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.]
[Weng, C., Cui, J., Wang, G., Wang, J., Yu, C., Su, D., & Yu, D. (2018, September). Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition. In Interspeech (pp. 761-765). doi:10.21437/Interspeech.2018-1030]
[Pham, N. Q., Nguyen, T. S., Niehues, J., Müller, M., Stüker, S., & Waibel, A. (2019). Very deep self-attention networks for end-to-end speech recognition. arXiv preprint arXiv:1904.13377.]
[Xu, H., Ding, S., & Watanabe, S. (2019, May). Improving end-to-end speech recognition with pronunciation-assisted sub-word modeling. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7110-7114). IEEE. doi:10.1109/ICASSP.2019.8682494]
[Sperber, M., Niehues, J., Neubig, G., Stüker, S., & Waibel, A. (2018). Self-attentional acoustic models. arXiv preprint arXiv:1803.09519.]
[Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.]
[Liu, A. H., Sung, T. W., Chuang, S. P., Lee, H. Y., & Lee, L. S. (2020, May). Sequence-to-Sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7879-7883). IEEE. doi:10.1109/ICASSP40776.2020.9053324]
[Unanue, I. J., Borzeshi, E. Z., Esmaili, N., & Piccardi, M. (2019). ReWE: Regressing word embeddings for regularization of neural machine translation systems. arXiv preprint arXiv:1904.02461.]
[Toshniwal, S., Kannan, A., Chiu, C. C., Wu, Y., Sainath, T. N., & Livescu, K. (2018, December). A comparison of techniques for language model integration in encoder-decoder speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 369-375). IEEE. doi:10.1109/SLT.2018.8639038]
[Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41-75. doi:10.1023/A:1007379606734]
[Pan, S. J., & Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359. doi:10.1109/TKDE.2009.191]
[Inaguma, H., Cho, J., Baskar, M. K., Kawahara, T., & Watanabe, S. (2019, May). Transfer learning of language-independent end-to-end ASR with language model fusion. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6096-6100). IEEE. doi:10.1109/ICASSP.2019.8682918]
[Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.]
[Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.]
[Zeyer, A., Irie, K., Schlüter, R., & Ney, H. (2018). Improved training of end-to-end attention models for speech recognition. arXiv preprint arXiv:1805.03294.]
[Sriram, A., Jun, H., Satheesh, S., & Coates, A. (2017). Cold fusion: Training seq2seq models together with language models. arXiv preprint arXiv:1708.06426.]
[Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H. C., ... & Bengio, Y. (2015). On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.]
[Denisov, P., & Vu, N. T. (2019). End-to-end multi-speaker speech recognition using speaker embeddings and transfer learning. arXiv preprint arXiv:1908.04737.]
[Watanabe, S., Hori, T., Kim, S., Hershey, J. R., & Hayashi, T. (2017). Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1240-1253. doi:10.1109/JSTSP.2017.2763455]
[Jia, Y., Zhang, Y., Weiss, R. J., Wang, Q., Shen, J., Ren, F., ... & Wu, Y. (2018). Transfer learning from speaker verification to multispeaker text-to-speech synthesis. arXiv preprint arXiv:1806.04558.]
[Seki, H., Hori, T., Watanabe, S., Roux, J. L., & Hershey, J. R. (2018). A purely end-to-end system for multi-speaker speech recognition. arXiv preprint arXiv:1805.05826.]
[Heigold, G., Vanhoucke, V., Senior, A., Nguyen, P., Ranzato, M. A., Devin, M., & Dean, J. (2013, May). Multilingual acoustic models using distributed deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 8619-8623). IEEE. doi:10.1109/ICASSP.2013.6639348]
[Kubo, Y., & Bacchiani, M. (2020, May). Joint phoneme-grapheme model for end-to-end speech recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6119-6123). IEEE. doi:10.1109/ICASSP40776.2020.9054557]
[Lee, J., Mansimov, E., & Cho, K. (2018). Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.]
[Xia, Y., Tian, F., Wu, L., Lin, J., Qin, T., Yu, N., & Liu, T. Y. (2017, December). Deliberation networks: Sequence generation beyond one-pass decoding. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 1782-1792).]
[Wang, D., & Zheng, T. F. (2015, December). Transfer learning for speech and language processing. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) (pp. 1225-1237). IEEE. doi:10.1109/APSIPA.2015.7415532]
[Yu, C., Chen, Y., Li, Y., Kang, M., Xu, S., & Liu, X. (2019). Cross-language end-to-end speech recognition research based on transfer learning for the low-resource Tujia language. Symmetry, 11(2), 179. doi:10.3390/sym11020179]
[Ozeki, M., & Okatani, T. (2014, November). Understanding convolutional neural networks in terms of category-level attributes. In Asian Conference on Computer Vision (pp. 362-375). Springer, Cham. doi:10.1007/978-3-319-16808-1_25]
[Zeghidour, N., Usunier, N., Synnaeve, G., Collobert, R., & Dupoux, E. (2018). End-to-end speech recognition from the raw waveform. arXiv preprint arXiv:1806.07098.]
[Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., ... & Zhu, Z. (2016, June). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning (pp. 173-182). PMLR.]
[Abad, A., Bell, P., Carmantini, A., & Renals, S. (2020, May). Cross-lingual transfer learning for zero-resource domain adaptation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6909-6913). IEEE. doi:10.1109/ICASSP40776.2020.9054468]
[Hsu, J. Y., Chen, Y. J., & Lee, H. Y. (2020, May). Meta learning for end-to-end low-resource speech recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7844-7848). IEEE. doi:10.1109/ICASSP40776.2020.9053112]
[Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., & Hadsell, R. (2018). Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960.]
[Finn, C., Abbeel, P., & Levine, S. (2017, July). Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (pp. 1126-1135). PMLR.]
[Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376). doi:10.1145/1143844.1143891]
[Matsuura, K., Mimura, M., Sakai, S., & Kawahara, T. (2020). Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition. arXiv preprint arXiv:2005.09256.]
[Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.]
[Zhang, S., Do, C. T., Doddipatla, R., & Renals, S. (2020, May). Learning Noise Invariant Features Through Transfer Learning For Robust End-to-End Speech Recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7024-7028). IEEE. doi:10.1109/ICASSP40776.2020.9053169]
[Chen, Z., & Yang, H. (2020, June). Yi Language Speech Recognition using Deep Learning Methods. In 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC) (Vol. 1, pp. 1064-1068). IEEE. doi:10.1109/ITNEC48623.2020.9084771]
[Rashmi, S., Hanumanthappa, M., & Reddy, M. V. (2018). Hidden Markov Model for speech recognition system—a pilot study and a naive approach for speech-to-text model. In Speech and Language Processing for Human-Machine Communications (pp. 77-90). Springer, Singapore. doi:10.1007/978-981-10-6626-9_9]
[Karafiát, M., Baskar, M. K., Veselý, K., Grézl, F., Burget, L., & Černocký, J. (2018, April). Analysis of multilingual BLSTM acoustic model on low and high resource languages. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5789-5793). IEEE. doi:10.1109/ICASSP.2018.8462083]
[Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328-339. doi:10.1109/29.21701]
[Kim, S., Hori, T., & Watanabe, S. (2017, March). Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4835-4839). IEEE. doi:10.1109/ICASSP.2017.7953075]
[Aung, M. A. A., & Pa, W. P. (2020, February). Time Delay Neural Network for Myanmar Automatic Speech Recognition. In 2020 IEEE Conference on Computer Applications (ICCA) (pp. 1-4). IEEE. doi:10.1109/ICCA49400.2020.9022808]
[Ragni, A., Knill, K. M., Rath, S. P., & Gales, M. J. (2014, September). Data augmentation for low resource languages. In INTERSPEECH 2014: 15th Annual Conference of the International Speech Communication Association (pp. 810-814). International Speech Communication Association (ISCA). doi:10.21437/Interspeech.2014-207]
[Gokay, R., & Yalcin, H. (2019, March). Improving low resource Turkish speech recognition with data augmentation and TTS. In 2019 16th International Multi-Conference on Systems, Signals & Devices (SSD) (pp. 357-360). IEEE. doi:10.1109/SSD.2019.8893184]
[Ko, T., Peddinti, V., Povey, D., & Khudanpur, S. (2015). Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association. doi:10.21437/Interspeech.2015-711]
[Tachibana, H., Uenoyama, K., & Aihara, S. (2018, April). Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4784-4788). IEEE. doi:10.1109/ICASSP.2018.8461829]
[Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., ... & Zhu, Z. (2016, June). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning (pp. 173-182). PMLR.]
[Available: https://github.com/SeanNaren/deepspeech.pytorch, accessed 25.05.2020.]
[Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376). doi:10.1145/1143844.1143891]
[Heafield, K., Pouzyrevsky, I., Clark, J. H., & Koehn, P. (2013, August). Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 690-696).]
[Thai, B., Jimerson, R., Ptucha, R., & Prud'hommeaux, E. (2020, May). Fully Convolutional ASR for Less-Resourced Endangered Languages. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) (pp. 126-130).]
[Miao, Y., Gowayyed, M., Na, X., Ko, T., Metze, F., & Waibel, A. (2016, March). An empirical exploration of CTC acoustic models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2623-2627). IEEE. doi:10.1109/ICASSP.2016.7472152]
[Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.]