Journal Details
Format: Journal
eISSN: 1338-4287
ISSN: 0021-5597
First Published: 05 Mar 2010
Publication timeframe: 2 times per year
Languages: English
Access type: Open Access

TEDxSK and JumpSK: A New Slovak Speech Recognition Dedicated Corpus

Published Online: 24 Jan 2018
Volume & Issue: Volume 68 (2017) - Issue 2 (December 2017)
Page range: 346 - 354
Abstract

This paper describes a new Slovak corpus dedicated to speech recognition, built from TEDx talks and Jump Slovakia lectures. The proposed speech database consists of 220 talks and lectures with a total duration of about 58 hours. The annotated speech database was generated automatically in an unsupervised manner, using acoustic speech segmentation based on principal component analysis and automatic speech transcription with two complementary speech recognition systems. Evaluation data consisting of 50 manually annotated talks and lectures, with a total duration of about 12 hours, were created to evaluate the quality of Slovak speech recognition. Through unsupervised automatic annotation of the TEDx talks and Jump Slovakia lectures, we obtained 21.26% of new speech segments with a word error rate of approximately 9.44%, suitable for retraining or adapting previously trained acoustic models.
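The word error rate quoted above is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used by the authors; the function name `word_error_rate` and the example sentences are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Illustrative helper (hypothetical name); words are split on whitespace
    and compared exactly, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion in five words.
example_wer = word_error_rate(
    "dobrý deň vitajte na prednáške",
    "dobrý den vitajte prednáške",
)
print(example_wer)  # 0.4
```

Under this definition, a 9.44% word error rate corresponds to roughly one word-level error in every ten to eleven reference words.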
