QCNN-SER: A Noise-Robust Quantum Convolutional Neural Network with Enhanced Cross-Domain Generalization for Speech Emotion Recognition

Published in 28th International Conference on Computer and Information Technology (ICCIT), Cox's Bazar, Bangladesh, 2025

Emotion recognition in speech is essential for human-computer interaction. This study introduces a refined hybrid Quantum Convolutional Neural Network (QCNN) framework designed to substantially enhance Speech Emotion Recognition (SER). It addresses major challenges such as speaker variability, noise robustness, and cross-domain applicability that often limit classical deep learning models. By utilizing quantum principles such as superposition and entanglement within a 6-qubit, 8-layer parameterized circuit, the model derives high-dimensional, noise-resistant features from speech signals. The method includes sophisticated preprocessing steps, such as band-pass filtering and energy-based Voice Activity Detection (VAD), followed by Mel-Frequency Cepstral Coefficient (MFCC) extraction. Evaluated on a diverse dataset of 4,515 samples across eight emotions (Angry, Calm, Disgust, Fear, Happy, Neutral, Sad, Surprised), the QCNN achieved an accuracy of 86%, surpassing classical approaches such as a CNN with attention (77%), a traditional QCNN (77.87%), and a hybrid quantum-classical network (76%). The model demonstrated balanced performance, with precision scores ranging from 0.79 to 1.0 (including a perfect 1.0 for Fear) and recall scores from 0.82 to 0.94. These results underscore the promise of quantum-enhanced neural networks in capturing complex, nonlinear speech emotion patterns and establishing a new standard for robust, generalizable SER systems.
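The front end of the pipeline (band-pass filtering followed by energy-based VAD) can be sketched as below. This is a minimal illustration, not the paper's implementation: the pass band (300-3400 Hz), frame length, hop size, and energy-threshold ratio are all assumed values, since the abstract does not specify them.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(signal, sr, low=300.0, high=3400.0, order=4):
    # Butterworth band-pass keeping the main speech band.
    # Cutoffs are illustrative assumptions, not the paper's settings.
    sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, signal)

def energy_vad(signal, frame_len=400, hop=160, threshold_ratio=0.1):
    # Energy-based VAD: keep frames whose short-time energy exceeds
    # a fixed fraction of the peak frame energy (assumed heuristic).
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    energies = np.array([np.sum(f ** 2) for f in frames])
    return energies > threshold_ratio * energies.max()

# Synthetic example: 1 s of low-level noise with a louder tone burst
# standing in for a voiced segment.
sr = 16000
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(sr)
x[6000:10000] += np.sin(2 * np.pi * 440 * np.arange(4000) / sr)

filtered = bandpass(x, sr)
voiced = energy_vad(filtered)  # boolean mask over frames
```

In a full pipeline, the frames flagged by the VAD mask would then be passed to MFCC extraction (e.g. via a library such as librosa) before feature encoding into the quantum circuit.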
