
Automatic statistical evaluation of quality of unit selection speech synthesis with different prosody manipulations


References

[1] A. Zelenik and Z. Kacic, “Multi-Resolution Feature Extraction Algorithm in Emotional Speech Recognition”, Elektronika ir Elektrotechnika, vol. 21, no. 5, pp. 54–58, 2015, DOI: 10.5755/j01.eee.21.5.13328.

[2] M. Grůber and J. Matoušek, “Listening-Test-Based Annotation of Communicative Functions for Expressive Speech Synthesis”, P. Sojka, A. Horák, I. Kopeček, K. Pala (eds.): Text, Speech, and Dialogue (TSD 2010), LNCS, vol. 6231, pp. 283–290, Springer, 2010.

[3] P. C. Loizou, “Speech Quality Assessment”, W. Tao et al. (eds.): Multimedia Analysis, Processing and Communications, Studies in Computational Intelligence, vol. 346, pp. 623–654, Springer, Berlin, Heidelberg, 2011, DOI: 10.1007/978-3-642-19551-8_23.

[4] H. Ye and S. Young, “High Quality Voice Morphing”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Montreal, Canada, 17–21 May 2004, DOI: 10.1109/ICASSP.2004.1325909.

[5] M. Adiban, B. BabaAli and S. Shehnepoor, “Statistical Feature Embedding for Heart Sound Classification”, Journal of Electrical Engineering, vol. 70, no. 4, pp. 259–272, 2019, DOI: 10.2478/jee-2019-0056.

[6] B. Boilović, B. M. Todorović and M. Obradović, “Text-Independent Speaker Recognition using Two-Dimensional Information Entropy”, Journal of Electrical Engineering, vol. 66, no. 3, pp. 169–173, 2015, DOI: 10.1515/jee-2015-0027.

[7] C. Y. Lee and Z. J. Lee, “A Novel Algorithm Applied to Classify Unbalanced Data”, Applied Soft Computing, vol. 12, pp. 2481–2485, 2012, DOI: 10.1016/j.asoc.2012.03.051.

[8] R. Vích, J. Nouza and M. Vondra, “Automatic Speech Recognition Used for Intelligibility Assessment of Text-to-Speech Systems”, A. Esposito et al. (eds.): Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction, LNCS, vol. 5042, pp. 136–148, Springer, 2008.

[9] M. Cerňak, M. Rusko and M. Trnka, “Diagnostic Evaluation of Synthetic Speech using Speech Recognition”, Proceedings of the 16th International Congress on Sound and Vibration (ICSV16), Kraków, Poland, 5–9 July 2009, p. 6, https://pdfs.semanticscholar.org/502b/f1d8bfb0cc90cd3defcc9d479d9a97b23b66.pdf.

[10] S. Möller and J. Heimansberg, “Estimation of TTS Quality in Telephone Environments Using a Reference-free Quality Prediction Model”, Second ISCA/DEGA Tutorial and Research Workshop on Perceptual Quality of Systems, Berlin, Germany, September 2006, pp. 56–60, ISCA Archive, http://www.isca-speech.org/archive_open/pqs2006.

[11] D.-Y. Huang, “Prediction of Perceived Sound Quality of Synthetic Speech”, Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2011), Xi’an, China, October 18–21, 2011, p. 6, http://www.apsipa.org/proceedings2011/pdf/APSIPA100.pdf.

[12] S. Möller et al., “Comparison of Approaches for Instrumentally Predicting the Quality of Text-To-Speech Systems”, INTERSPEECH 2010, pp. 1325–1328, 2010, DOI: 10.21437/Interspeech.2010-413, https://www.isca-speech.org/archive/archive_papers/interspeech_2010/i10_1325.pdf.

[13] F. Hinterleitner et al., “Predicting the Quality of Synthesized Speech using Reference-Based Prediction Measures”, Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung, Session: Sprachsynthese-Evaluation und Prosodie, 2011, pp. 99–106, TUDpress, Dresden, http://www.essv.de/paper.php?id=14.

[14] J. P. H. van Santen, “Segmental Duration and Speech Timing”, Y. Sagisaka, N. Campbell, N. Higuchi (eds.): Computing Prosody, Springer, New York, NY, pp. 225–248, 1997, DOI: 10.1007/978-1-4612-2258-3_15.

[15] C. M. Bishop, “Pattern Recognition and Machine Learning”, Springer, 2006.

[16] V. Rodellar-Biarge, D. Palacios-Alonso, V. Nieto-Lluis and P. Gomez-Vilda, “Towards the search of detection speech-relevant features for stress”, Expert Systems, vol. 32, no. 6, pp. 710–718, 2015, DOI: 10.1111/exsy.12109.

[17] A. J. Hunt and A. W. Black, “Unit Selection in a Concatenative Speech Synthesis System using a Large Speech Database”, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Atlanta (Georgia, USA), pp. 373–376, 1996, DOI: 10.1109/ICASSP.1996.541110.

[18] J. Kala and J. Matoušek, “Very Fast Unit Selection using Viterbi Search with Zero-Concatenation-Cost Chains”, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), Florence, Italy, pp. 2569–2573, 2014.

[19] M. Jůzová, D. Tihelka and R. Skarnitzl, “Last Syllable Unit Penalization in Unit Selection TTS”, K. Ekštein and V. Matoušek (eds.): Text, Speech, and Dialogue (TSD 2017), LNAI, vol. 10415, pp. 317–325, 2017, DOI: 10.1007/978-3-319-64206-2_36.

[20] D. Tihelka, Z. Hanzlíček, M. Jůzová, J. Vít, J. Matoušek and M. Grůber, “Current State of Text-to-Speech System ARTIC: A Decade of Research on the Field of Speech Technologies”, P. Sojka, A. Horák, I. Kopeček and K. Pala (eds.): Text, Speech, and Dialogue (TSD 2018), LNAI, vol. 11107, pp. 369–378, 2018, DOI: 10.1007/978-3-030-00794-2_40.

[21] Z. Hanzlíček, J. Vít and D. Tihelka, “WaveNet-Based Speech Synthesis Applied to Czech – A Comparison with the Traditional Synthesis Methods”, P. Sojka, A. Horák, I. Kopeček and K. Pala (eds.): Text, Speech, and Dialogue (TSD 2018), LNAI, vol. 11107, pp. 445–452, 2018, DOI: 10.1007/978-3-030-00794-2_48.

[22] J. Vít, Z. Hanzlíček and J. Matoušek, “Czech Speech Synthesis with Generative Neural Vocoder”, K. Ekštein (ed.): Text, Speech, and Dialogue (TSD 2019), LNAI, vol. 11697, pp. 307–315, 2019, DOI: 10.1007/978-3-030-27947-9_26.

[23] J. Matoušek, D. Tihelka and J. Psutka, “New Slovak Unit-Selection Speech Synthesis in ARTIC TTS System”, Proceedings of the International Multiconference of Engineers and Computer Scientists (IMECS), San Francisco, USA, 2011.
