Open Access

Interpreting Convolutional Layers in DNN Model Based on Time–Frequency Representation of Emotional Speech

About this article

A. Karim, A. Mishra, M. H. Newton, and A. Sattar, Machine learning interpretability: A science rather than a tool, CoRR, vol. abs/1807.06722, 2018.

M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 818–833.

S. Das, N. N. Lønfeldt, A. K. Pagsberg, and L. H. Clemmensen, Towards interpretable and transferable speech emotion recognition: Latent representation based analysis of features, methods and corpora, 2021.

Q. Zhang, Y. N. Wu, and S.-C. Zhu, Interpretable convolutional neural networks, in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, June 2018.

K. V. V. Girish, S. Konjeti, and J. Vepa, Interpretability of speech emotion recognition modelled using self-supervised speech and text pre-trained embeddings, in Proc. Interspeech 2022, 2022, pp. 4496–4500.

M. Colussi and S. Ntalampiras, Interpreting deep urban sound classification using layer-wise relevance propagation, CoRR, vol. abs/2111.10235, 2021.

E. Jing, Y. Liu, Y. Chai, J. Sun, S. Samtani, Y. Jiang, and Y. Qian, A deep interpretable representation learning method for speech emotion recognition, Information Processing and Management, vol. 60, no. 6, p. 103501, 2023.

G. Beguš and A. Zhou, Interpreting intermediate convolutional layers in unsupervised acoustic word classification, in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8207–8211.

G. Beguš and A. Zhou, Interpreting intermediate convolutional layers of CNNs trained on raw speech, CoRR, vol. abs/2104.09489, 2021.

T. Nguyen, M. Raghu, and S. Kornblith, Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth, in 9th International Conference on Learning Representations – ICLR 2021, Virtual Event, Austria, May 3–7, 2021.

P. Tzirakis, G. Trigeorgis, M. Nicolaou, B. Schuller, and S. Zafeiriou, End-to-end multimodal emotion recognition using deep neural networks, IEEE Journal of Selected Topics in Signal Processing, vol. PP, Apr. 2017.

G. Beguš and A. Zhou, Interpreting intermediate convolutional layers of generative CNNs trained on waveforms, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 3214–3229, 2022.

L. Smietanka and T. Maka, DNN architectures and audio representations comparison for emotional speech classification, in 2021 International Conference on Software, Telecommunications and Computer Networks (SoftCOM). Split, Hvar, Croatia: IEEE, Sep. 2021.

F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, A database of German emotional speech, in Proceedings of Interspeech, Lisbon, 2005, pp. 1517–1520.

S. R. Livingstone and F. A. Russo, The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English, PLOS ONE, 2018.

T. Lidy and A. Schindler, CQT-based convolutional neural networks for audio scene classification. Budapest, Hungary: DCASE, Sep. 2016.

J. C. Brown, Calculation of a constant Q spectral transform, The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, January 1991.

P. Boersma, Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound, Proceedings of the Institute of Phonetic Sciences, University of Amsterdam, vol. 17, pp. 97–110, 1993.

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626.

M. Abadi et al., TensorFlow: Large-scale machine learning on heterogeneous distributed systems, 2016.

A. F. Agarap, Deep learning using rectified linear units (ReLU), arXiv preprint arXiv:1803.08375, 2018.

S. Latif, R. Rana, S. Khalifa, R. Jurdak, and B. W. Schuller, Multitask learning from augmented auxiliary data for improving speech emotion recognition, IEEE Transactions on Affective Computing, pp. 1–13, 2022.

Y. Liu, H. Sun, W. Guan, Y. Xia, Y. Li, M. Unoki, and Z. Zhao, A discriminative feature representation method based on cascaded attention network with adversarial strategy for speech emotion recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1063–1074, 2023.

E. Guizzo, T. Weyde, S. Scardapane, and D. Comminiello, Learning speech emotion representations in the quaternion domain, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1200–1212, 2023.

N. T. Pham, D. N. M. Dang, and S. D. Nguyen, Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognition, 2021.

Y. L. Bouali, O. B. Ahmed, and S. Mazouzi, Cross-modal learning for audio-visual emotion recognition in acted speech, in 2022 6th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 2022, pp. 1–6.

C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, IEMOCAP: Interactive emotional dyadic motion capture database, Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, Dec. 2008.

S. Kakouros, T. Stafylakis, L. Mošner, and L. Burget, Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing, in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.

K. Dupuis and M. K. Pichora-Fuller, Recognition of emotional speech for younger and older talkers: Behavioural findings from the Toronto Emotional Speech Set, Canadian Acoustics, vol. 39, no. 3, pp. 182–183, Sep. 2011.

S. Jothimani and K. Premalatha, MFF-SAug: Multi feature fusion with spectrogram augmentation of speech emotion recognition using convolution neural network, Chaos, Solitons & Fractals, vol. 162, p. 112512, 2022.

J. Nagi, F. Ducatelle, G. A. Di Caro, D. Cireşan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber, and L. M. Gambardella, Max-pooling convolutional neural networks for vision-based hand gesture recognition, in 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), 2011, pp. 342–347.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, in International Conference on Learning Representations – ICLR’2015, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in IEEE Conference on Computer Vision and Pattern Recognition – CVPR’2016, 2016, pp. 770–778.

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861, 2017.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the inception architecture for computer vision, in IEEE Conference on Computer Vision and Pattern Recognition – CVPR’2016, 2016, pp. 2818–2826.

U. Masashi, K. Miho, K. Maori, K. Shunsuke, and A. Masato, How the temporal amplitude envelope of speech contributes to urgency perception, in Proceedings of the 23rd International Congress on Acoustics, ser. Proceedings of the International Congress on Acoustics. Aachen, Germany: International Commission for Acoustics (ICA), 2019, pp. 1739–1744.

P. Ríos-López, M. T. Molnar, M. Lizarazu, and M. Lallier, The role of slow speech amplitude envelope for speech processing and reading development, Frontiers in Psychology, vol. 8, 2017.

K. Stevens, Acoustic Phonetics, ser. Current Studies in Linguistics. London: MIT Press, 2000.

N. Hellbernd and D. Sammler, Prosody conveys speaker’s intentions: Acoustic cues for speech act perception, Journal of Memory and Language, vol. 88, pp. 70–86, 2016.

S. Pearsell and D. Pape, The effects of different voice qualities on the perceived personality of a speaker, Frontiers in Communication, vol. 7, 2023.

M. Nishio and S. Niimi, Changes in speaking fundamental frequency characteristics with aging, The Japan Journal of Logopedics and Phoniatrics, vol. 46, pp. 136–144, Apr. 2005.

H. Deng and D. O’Shaughnessy, Voiced-unvoiced-silence speech sound classification based on unsupervised learning, in 2007 IEEE International Conference on Multimedia and Expo, 2007, pp. 176–179.

eISSN: 2449-6499
Language: English
Frequency: 4 issues per year
Journal subjects: Computer Sciences, Databases and Data Mining, Artificial Intelligence