Attention-Based Deep Learning for Scalable Speech Emotion Recognition with Synthetic Bone-Conducted Speech
Published in IEEE 2nd International Conference on Computing, Applications and Systems (COMPAS 2025), Kushtia, Bangladesh, 2025
Speech emotion recognition (SER) is a critical technology that supports advances in human-computer interaction, mental health assessment, and personalized education. Despite significant progress, conventional SER methods relying on air-conducted (AC) speech remain vulnerable to environmental noise and recording inconsistencies, which impede robust real-world deployment. A promising alternative, bone-conducted (BC) speech, captures internal vocal tract vibrations and demonstrates inherent resilience to ambient noise. However, the scarcity of authentic BC datasets and the prohibitive cost of BC sensors severely constrain its practical utilization in SER research. Furthermore, existing efforts have yet to establish scalable approaches to harness BC speech characteristics without specialized acquisition hardware, limiting the generalizability of current models. To address these challenges, we propose a novel framework that synthesizes BC-like speech from standard AC recordings via a carefully designed Infinite Impulse Response (IIR) filter. This method enables cost-effective, large-scale augmentation of training data, circumventing the need for specialized BC sensors. The core of our approach is an attention-augmented convolutional neural network that effectively integrates local spectral feature extraction with long-range temporal modeling. To counteract class imbalance and improve generalization, the model employs class-weighted loss combined with label smoothing. Evaluated on the benchmark RAVDESS dataset, our framework achieves state-of-the-art results: a validation accuracy of 95.51%, balanced accuracy of 95.32%, Matthews Correlation Coefficient of 0.9487, and flawless class-wise AUC scores. A high mean Intersection over Union (IoU) of 0.9163 further confirms the model's precision and robustness across diverse emotional categories.
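The idea of synthesizing BC-like speech from AC recordings can be illustrated with a minimal sketch. BC speech is known to attenuate high-frequency content, so a low-pass IIR filter is a natural first approximation; note that the filter type, cutoff, and order below are illustrative assumptions, not the paper's actual filter design.

```python
# Hypothetical sketch: approximate bone-conducted (BC) speech from an
# air-conducted (AC) recording with a low-pass IIR (Butterworth) filter.
# The cutoff frequency and filter order here are assumed values for
# illustration only; the paper's exact IIR design is not reproduced.
import numpy as np
from scipy.signal import butter, lfilter

def synthesize_bc_like(ac_signal, sample_rate=16000, cutoff_hz=1000, order=4):
    """Apply a low-pass IIR filter to emphasize the low-frequency
    character of BC speech."""
    nyquist = sample_rate / 2
    # butter() returns the IIR numerator (b) and denominator (a) coefficients
    b, a = butter(order, cutoff_hz / nyquist, btype="low")
    return lfilter(b, a, ac_signal)

# Example: filter one second of synthetic audio at 16 kHz
rng = np.random.default_rng(0)
ac = rng.standard_normal(16000)
bc_like = synthesize_bc_like(ac)
```

In a training pipeline, such a transform could be applied to each AC utterance to produce an additional BC-like copy, effectively doubling the training data without BC sensors.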
Recommended citation: M. I. Shihab Shad, S. Khan, M. S. Hosain, A. Mahdi, M. C. Chanda and M. R. Hossain, "Attention-Based Deep Learning for Scalable Speech Emotion Recognition with Synthetic Bone-Conducted Speech," 2025 IEEE 2nd International Conference on Computing, Applications and Systems (COMPAS), Kushtia, Bangladesh, 2025, pp. 1-6, doi: 10.1109/COMPAS67506.2025.11381631.
Download Paper
