Enhancing Robustness and Accuracy of Bone-Conducted Speech Emotion Recognition via Transformer Models
Date:
In this talk, we explore the implementation of our research titled “Enhancing Robustness and Accuracy of Bone-Conducted Speech Emotion Recognition via Transformer Models”
Motivation
- Speech Emotion Recognition suffers from:
- Information loss
- Performance degradation
- Need better temporal modeling
- Transformer models capture long-range dependencies
- Goal: Bone-conducted speech emotion recognition system
Research Question
- How can a Transformer model improve emotion recognition using bone-conducted speech?
Objectives
- Develop a Transformer-based model
- Optimize model performance
- Evaluate using EmoBone dataset
- Compare results with existing methods
Introduction
- Speech conveys information and emotions
- Emotion recognition important for:
- Human-computer interaction
- Healthcare
- Call centers
- Education and security
- Speech Emotion Recognition (SER):
- Detect emotions from speech signals
- Uses acoustic features like pitch and energy
Dataset Preparation
- EmoBone dataset used
- Emotion categories:
- Happy
- Angry
- Sad
- Calm
- Neutral
- Fear
- Surprise
- Disgust
Methodology
- Audio preprocessing using torchaudio
- Feature extraction using MFCC
- Model architecture:
- CNN feature extraction
- Transformer encoder
- Dense layer with Softmax
- Training:
- Cross-entropy loss
- Learning rate optimization
- Evaluation:
- Accuracy
- Confusion matrix
- Classification report
Results
- Transformer shows steady accuracy improvement
- Confusion matrix indicates strong classification
- ROC curve and per-class accuracy analyzed
Discussion
- Balanced dataset improves performance
- Transformer handles temporal patterns effectively
- Some confusion between similar emotions
- Overall accuracy achieved: 99%
Achievements
- State-of-the-art accuracy on EmoBone dataset
- Improved classification of bone-conducted speech
- Reduced information loss and degradation
Conclusion
- Transformer model significantly improves SER performance
- Effective for bone-conducted speech analysis
- Suitable for real-world emotion recognition systems
Future Scope
- Transfer learning
- Multi-modal fusion
- Larger datasets
- Real-time emotion recognition systems
Thank You
Feel free to contribute, raise issues, or suggest improvements!
