Enhancing Robustness and Accuracy of Bone-Conducted Speech Emotion Recognition via Transformer Models

Date:

In this talk, we present our research, “Enhancing Robustness and Accuracy of Bone-Conducted Speech Emotion Recognition via Transformer Models”.


Motivation

  • Conventional speech emotion recognition suffers from:
    • Information loss
    • Performance degradation
  • Existing methods need better temporal modeling
  • Transformer models capture long-range dependencies in speech
  • Goal: a robust bone-conducted speech emotion recognition system

Research Question

  • How can a Transformer model improve emotion recognition using bone-conducted speech?

Objectives

  • Develop a Transformer-based model
  • Optimize model performance
  • Evaluate using EmoBone dataset
  • Compare results with existing methods

Introduction

  • Speech conveys both linguistic content and emotional state
  • Emotion recognition important for:
    • Human-computer interaction
    • Healthcare
    • Call centers
    • Education and security
  • Speech Emotion Recognition (SER):
    • Detect emotions from speech signals
    • Uses acoustic features like pitch and energy
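As a concrete example of one such acoustic feature, short-time energy can be computed per frame. This is a minimal sketch; the 25 ms frame / 10 ms hop at 16 kHz are common defaults assumed here, not values from the paper.

```python
import numpy as np

def frame_energy(signal, frame_len=400, hop=160):
    """Short-time energy per frame -- a simple acoustic feature
    (alongside pitch) used in classical SER pipelines."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sum(f ** 2) for f in frames])

# One second of a 220 Hz tone at a 16 kHz sampling rate
sig = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
e = frame_energy(sig)
print(e.shape)  # (98,) -- one energy value per 25 ms frame
```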

Dataset Preparation

  • EmoBone bone-conducted speech dataset used
  • Emotion categories:
    • Happy
    • Angry
    • Sad
    • Calm
    • Neutral
    • Fear
    • Surprise
    • Disgust
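For illustration, the eight categories can be mapped to integer class labels for training. The names and ordering below are assumptions for the sketch, not necessarily the dataset's own encoding.

```python
# Hypothetical integer encoding for the eight EmoBone emotion categories
EMOTIONS = ["happy", "angry", "sad", "calm", "neutral", "fear", "surprise", "disgust"]
LABEL_TO_ID = {name: i for i, name in enumerate(EMOTIONS)}
ID_TO_LABEL = {i: name for name, i in LABEL_TO_ID.items()}
print(LABEL_TO_ID["sad"], ID_TO_LABEL[0])  # 2 happy
```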

Methodology

  • Audio preprocessing using torchaudio
  • Feature extraction using MFCC
  • Model architecture:
    • CNN feature extraction
    • Transformer encoder
    • Dense classification layer with softmax output
  • Training:
    • Cross-entropy loss
    • Learning-rate tuning
  • Evaluation:
    • Accuracy
    • Confusion matrix
    • Classification report
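The architecture bullets above can be sketched as a minimal PyTorch model: a CNN front end over MFCC frames, a Transformer encoder over time, and a dense softmax head trained with cross-entropy. Layer sizes, head counts, and the MFCC dimension here are illustrative assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class CNNTransformerSER(nn.Module):
    """Sketch of the described pipeline: CNN feature extraction,
    Transformer encoder, dense classification head."""
    def __init__(self, n_mfcc=40, d_model=128, n_heads=4, n_layers=2, n_classes=8):
        super().__init__()
        # 1-D convolution extracts local spectral patterns from MFCC frames
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mfcc, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Transformer encoder models long-range temporal dependencies
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)  # softmax is applied inside the loss

    def forward(self, x):                    # x: (batch, n_mfcc, time)
        h = self.cnn(x)                      # (batch, d_model, time)
        h = self.encoder(h.transpose(1, 2))  # (batch, time, d_model)
        return self.head(h.mean(dim=1))      # temporal mean pool -> class logits

model = CNNTransformerSER()
logits = model(torch.randn(2, 40, 100))      # 2 clips, 40 MFCCs, 100 frames
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 3]))  # training objective
print(logits.shape)  # torch.Size([2, 8])
```

In practice the MFCC input would come from `torchaudio.transforms.MFCC`, matching the preprocessing step listed above.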

Results

  • Transformer training shows steady accuracy improvement across epochs
  • Confusion matrix indicates strong per-class classification
  • ROC curves and per-class accuracy were also analyzed
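The confusion matrix and per-class accuracy referenced above can be computed in a few lines of NumPy. The toy labels below are illustrative only, not the paper's results.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy predictions over 3 of the 8 emotion classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]
cm = confusion_matrix(y_true, y_pred, 3)
per_class_acc = cm.diagonal() / cm.sum(axis=1)   # recall per emotion
overall_acc = cm.diagonal().sum() / cm.sum()     # 5/6 on the toy labels
print(per_class_acc, overall_acc)
```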

Discussion

  • Balanced dataset improves performance
  • Transformer handles temporal patterns effectively
  • Some confusion remains between acoustically similar emotions
  • Overall accuracy achieved: 99%

Achievements

  • State-of-the-art accuracy on EmoBone dataset
  • Improved classification of bone-conducted speech
  • Reduced information loss and performance degradation

Conclusion

  • Transformer model significantly improves SER performance
  • Effective for bone-conducted speech analysis
  • Suitable for real-world emotion recognition systems

Future Scope

  • Transfer learning
  • Multi-modal fusion
  • Larger datasets
  • Real-time emotion recognition systems

Thank You

Feel free to contribute, raise issues, or suggest improvements!