November 27, 2022


The Internet Generation

Speech Emotion Recognition using Self-Supervised Features

Speech emotion recognition (SER) can be used in call center conversation analysis, mental health applications, or spoken dialogue systems.

Audio recordings can also be used for automatic speech emotion recognition. Image credit: Alex Regan via Wikimedia, CC BY 2.0.

A recently published paper formulates the SER problem as a mapping from the continuous speech domain into the discrete domain of categorical emotion labels.

Researchers use the Upstream + Downstream architecture design paradigm to enable simple use and integration of a large variety of self-supervised features. The Upstream model, pre-trained in a self-supervised fashion, is responsible for feature extraction. The Downstream model is task-dependent and classifies the features produced by the Upstream model into categorical emotion labels.
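The paper does not ship code with this summary, but the Upstream + Downstream split can be illustrated with a toy sketch. Here the Upstream is a hypothetical stand-in for a pre-trained self-supervised model (such as wav2vec 2.0 or HuBERT) that turns a raw waveform into frame-level features, and the Downstream pools those features into an utterance-level vector and classifies it; all class names, dimensions, and the four-emotion label set are illustrative assumptions.

```python
import numpy as np

class Upstream:
    """Hypothetical stand-in for a self-supervised feature extractor.

    A real system would use a pre-trained model (e.g. wav2vec 2.0);
    here a fixed random projection maps each frame to a feature vector.
    """
    def __init__(self, feat_dim=8, frame_len=160):
        self.frame_len = frame_len
        rng = np.random.default_rng(0)
        self.proj = rng.standard_normal((frame_len, feat_dim))

    def extract(self, waveform):
        # Slice the waveform into fixed-length frames and project each one.
        n_frames = len(waveform) // self.frame_len
        frames = waveform[: n_frames * self.frame_len].reshape(n_frames, self.frame_len)
        return frames @ self.proj  # shape: (n_frames, feat_dim)

class Downstream:
    """Task-dependent head: aggregates frame-level features into an
    utterance-level vector (mean pooling) and applies a linear classifier."""
    def __init__(self, feat_dim=8, n_classes=4):
        rng = np.random.default_rng(1)
        self.w = rng.standard_normal((feat_dim, n_classes))

    def classify(self, frame_feats):
        utterance = frame_feats.mean(axis=0)  # utterance-level feature
        logits = utterance @ self.w
        return int(np.argmax(logits))         # index of predicted emotion

# Illustrative IEMOCAP-style four-class label set (assumption).
EMOTIONS = ["angry", "happy", "neutral", "sad"]

upstream, downstream = Upstream(), Downstream()
wave = np.sin(np.linspace(0, 100, 16000))  # 1 second of dummy 16 kHz audio
feats = upstream.extract(wave)             # (100, 8) frame-level features
pred = downstream.classify(feats)
print(EMOTIONS[pred])
```

The point of the modular split is that the Upstream can be swapped for any self-supervised model without touching the Downstream classifier.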

Experimental results demonstrate that despite using only the speech modality, the proposed system can reach results comparable to those obtained by multimodal systems, which use both Speech and Text modalities.

Self-supervised pre-trained features have consistently delivered state-of-the-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use/integration of a large variety of self-supervised features. Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed. These experiments investigate interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features, and back-end classification networks. The proposed monomodal speech-only based system not only achieves SOTA results, but also sheds light on the potential of powerful and well-finetuned self-supervised acoustic features to reach results similar to those achieved by SOTA multimodal systems using both Speech and Text modalities.
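The abstract mentions aggregating frame-level features into utterance-level features as one of the design axes studied. As a minimal illustration (not the paper's specific method), two common aggregation strategies are mean pooling and max pooling over the time axis:

```python
import numpy as np

# Aggregate frame-level features of shape (n_frames, feat_dim)
# into one utterance-level vector of shape (feat_dim,).
def mean_pool(frame_feats):
    return frame_feats.mean(axis=0)  # average over time

def max_pool(frame_feats):
    return frame_feats.max(axis=0)   # strongest activation over time

frames = np.array([[0.0, 1.0],
                   [2.0, 3.0],
                   [4.0, 5.0]])
print(mean_pool(frames))  # [2. 3.]
print(max_pool(frames))   # [4. 5.]
```

The choice of aggregation interacts with the back-end classifier, which is one of the relationships the paper's experiments probe.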

Research paper: Morais, E., Hoory, R., Zhu, W., Gat, I., Damasceno, M., and Aronowitz, H., "Speech Emotion Recognition using Self-Supervised Features", 2022. Link: arXiv:2202.03896