Volume 9 (2019): Issue 4 (October 2019)
Journal Details
Format: Journal
eISSN: 2449-6499
First Published: 30 Dec 2014
Publication timeframe: 4 times per year
Languages: English
Access type: Open Access

Performance Evaluation of Deep Neural Networks Applied to Speech Recognition: RNN, LSTM and GRU

Published Online: 30 Aug 2019
Volume & Issue: Volume 9 (2019) - Issue 4 (October 2019)
Page range: 235 - 245
Received: 29 Sep 2018
Accepted: 10 Mar 2019
Abstract

Deep Neural Networks (DNNs) are neural networks with many hidden layers. DNNs are becoming popular in automatic speech recognition tasks, which combine an acoustic model with a language model. Standard feedforward neural networks cannot handle speech data well since they have no way to feed information from a later layer back to an earlier layer. Thus, Recurrent Neural Networks (RNNs) have been introduced to take temporal dependencies into account. However, the shortcoming of RNNs is that they cannot handle long-term dependencies due to the vanishing/exploding gradient problem. Therefore, Long Short-Term Memory (LSTM) networks were introduced; these are a special case of RNNs that take both long-term and short-term dependencies in speech into account. Similarly, GRU (Gated Recurrent Unit) networks are an improvement of LSTM networks that also take long-term dependencies into consideration. Thus, in this paper, we evaluate RNN, LSTM, and GRU to compare their performance on a reduced TED-LIUM speech data set. The results show that LSTM achieves the best word error rates; however, GRU optimization is faster while achieving word error rates close to those of LSTM.
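The models in the abstract are compared by word error rate (WER). As a minimal illustration (not the authors' implementation), WER can be computed as the word-level Levenshtein distance between the reference and hypothesis transcripts, normalized by the number of reference words:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum number of word substitutions,
    deletions, and insertions needed to turn the hypothesis into
    the reference, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # substitution / match
                           dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1)        # insertion
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, a hypothesis that drops one word from a six-word reference scores a WER of 1/6. Standard scoring tools apply the same edit-distance definition, typically after text normalization (casing, punctuation) that this sketch omits.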

