# Bilingual Speech Emotion Recognition Model (Urdu + English)

## Model Overview
This model performs bilingual Speech Emotion Recognition (SER) on audio in both Urdu and English. It is a fine-tuned version of the multilingual `facebook/wav2vec2-xls-r-300m` model, trained on a combined dataset of English (RAVDESS + CREMA-D) and Urdu (UrduSER) emotional speech to predict 7 Ekman emotions.
The model is part of the "Multimodal AI Mental Health Companion" Final Year Project (FYP) and is specifically designed for code-switched and multilingual emotional speech analysis in Pakistani contexts where speakers often mix Urdu and English.
## Supported Emotions (7 Ekman Classes)
| Label ID | Emotion |
|---|---|
| 0 | anger |
| 1 | disgust |
| 2 | fear |
| 3 | joy |
| 4 | neutral |
| 5 | sadness |
| 6 | surprise |
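The table above corresponds to the `id2label` mapping convention used by Hugging Face classification heads (a minimal sketch; the uploaded model's config should carry the same mapping):

```python
# Label mapping for the 7 Ekman emotion classes, as in the table above.
id2label = {
    0: "anger", 1: "disgust", 2: "fear",
    3: "joy", 4: "neutral", 5: "sadness", 6: "surprise",
}
label2id = {name: idx for idx, name in id2label.items()}
```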
## Language Capabilities
| Language | Supported | Training Data | Notes |
|---|---|---|---|
| Urdu | ✅ Yes | UrduSER (~3,500 samples) | Primary target language |
| English | ✅ Yes | RAVDESS + CREMA-D (~8,900 samples) | Strong performance |
| Code-Switched | ⚠️ Partial | Not explicitly trained | May work due to bilingual base model |
## Model Details

- Model ID: `muhammadsuleman1533/urdu-ser-model`
- Task: Audio Classification / Speech Emotion Recognition
- Languages: Urdu & English (Bilingual)
- Base Model: `facebook/wav2vec2-xls-r-300m` (pre-trained on 128 languages)
- Framework: PyTorch + Hugging Face Transformers
- Model Size: 0.3B Parameters
- Developed by: Muhammad Suleman (Team Leader: Muhammad)
- License: MIT
## Why wav2vec2 XLS-R for Bilingual SER?
The XLS-R (300m) model was selected because it is pre-trained on 128 languages, including both Urdu and English. This makes it uniquely suited for:
- Cross-lingual transfer: Knowledge learned from high-resource English emotional speech (RAVDESS, CREMA-D) transfers to improve Urdu emotion recognition
- Bilingual robustness: The shared multilingual representations help handle code-switched Urdu-English speech common in urban Pakistani populations
- Low-resource adaptation: Leverages pre-trained Urdu speech features despite limited Urdu SER data availability
## Dataset Information
The model was trained on a bilingual hybrid corpus combining English and Urdu emotional speech to maximize generalization for both languages.
### Dataset Composition
| Dataset | Language | Samples | Speakers | Type | Emotion Labels |
|---|---|---|---|---|---|
| RAVDESS | English | ~1,440 | 24 (12M/12F) | Acted | 8 emotions |
| CREMA-D | English | ~7,442 | 91 (48M/43F) | Acted | 6 emotions |
| UrduSER | Urdu | ~3,500 | ~40 | Acted | 7 emotions |
| TOTAL | Bilingual | 12,376 | ~155 | Acted | 7 (standardized) |
### Language Distribution
| Language | Total Samples | Percentage |
|---|---|---|
| English | ~8,882 | 71.8% |
| Urdu | ~3,494 | 28.2% |
| Total | 12,376 | 100% |
### Data Split (Speaker-Disjoint)

A `GroupShuffleSplit` based on `speaker_id` was used to ensure zero speaker overlap between the train and test sets. This tests the model's true ability to recognize emotion in unseen voices across both languages.
| Split | Samples | English | Urdu |
|---|---|---|---|
| Training | 9,619 | ~6,900 | ~2,719 |
| Testing | 2,757 | ~1,982 | ~775 |
| Total | 12,376 | ~8,882 | ~3,494 |
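The speaker-disjoint protocol above can be sketched with scikit-learn's `GroupShuffleSplit` (toy metadata here; the real split ran over the full 12,376-sample table):

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy metadata: one row per utterance, grouped by speaker_id.
speaker_ids = [f"spk{i // 4}" for i in range(40)]   # 10 speakers, 4 clips each
labels = list(range(40))

# test_size is a fraction of SPEAKERS (groups), not samples, so every
# clip from a given speaker lands entirely in train or entirely in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.22, random_state=42)
train_idx, test_idx = next(gss.split(labels, groups=speaker_ids))

train_speakers = {speaker_ids[i] for i in train_idx}
test_speakers = {speaker_ids[i] for i in test_idx}
assert train_speakers.isdisjoint(test_speakers)  # zero speaker overlap
```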
### Emotion Distribution & Class Imbalance
The combined dataset is moderately imbalanced with surprise being heavily under-represented in both languages.
| Emotion | Total Samples | English | Urdu | Class Weight |
|---|---|---|---|---|
| anger | ~1,963 | ~1,400 | ~563 | 0.898 |
| disgust | ~1,963 | ~1,400 | ~563 | 0.898 |
| fear | ~1,963 | ~1,400 | ~563 | 0.899 |
| joy | ~1,963 | ~1,400 | ~563 | 0.898 |
| neutral | ~2,375 | ~1,800 | ~575 | 0.759 |
| sadness | ~1,963 | ~1,400 | ~563 | 0.899 |
| surprise | ~192 | ~192 | 0 | 8.588 |
**Important Notes:**

- Surprise class: Only present in the English datasets (RAVDESS). UrduSER does not contain surprise samples.
- Mitigation: Weighted Cross-Entropy Loss was used, with `surprise` weighted ~8.6× higher to compensate for the extreme under-representation.
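Inverse-frequency weighting of this kind can be sketched as follows (illustrative only: the counts below are the approximate totals from the table, while the weights actually used were computed on the training split, so the values differ slightly from the Class Weight column):

```python
import torch
import torch.nn as nn

# Approximate per-class sample counts from the table above.
counts = {"anger": 1963, "disgust": 1963, "fear": 1963, "joy": 1963,
          "neutral": 2375, "sadness": 1963, "surprise": 192}

# weight_c = N / (K * n_c): rare classes get proportionally larger weights.
total = sum(counts.values())
weights = torch.tensor([total / (len(counts) * n) for n in counts.values()])

# Weighted cross-entropy: misclassifying "surprise" now costs roughly
# nine times more than an average-frequency class.
criterion = nn.CrossEntropyLoss(weight=weights)
```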
## Training Details

### Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 10 (with Early Stopping Patience=3) |
| Batch Size | 2 (Effective 16 via Gradient Accumulation ×8) |
| Learning Rate | 2e-5 |
| Optimizer | AdamW |
| Loss Function | Weighted Cross-Entropy Loss |
| LR Scheduler | Cosine with 10% Warmup |
| Weight Decay | 0.01 |
| Max Audio Length | 10 seconds (160,000 samples @ 16kHz) |
| Mixed Precision | FP16 |
| Hardware | NVIDIA T4 GPU (Google Colab) |
| Training Duration | ~7-8 hours |
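The batch-size row deserves a note: with a per-device batch of 2 and accumulation ×8, gradients from eight micro-batches are summed before each optimizer step, which (for a mean-reduced loss, after scaling each micro-loss by 1/8) is numerically equivalent to one batch of 16. A minimal PyTorch sketch of that equivalence:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 7)            # toy stand-in for the classifier head
data = torch.randn(16, 4)
targets = torch.randint(0, 7, (16,))
loss_fn = nn.CrossEntropyLoss()

# Accumulate over 8 micro-batches of size 2 (effective batch 16).
model.zero_grad()
for i in range(8):
    xb, yb = data[2 * i:2 * i + 2], targets[2 * i:2 * i + 2]
    loss = loss_fn(model(xb), yb) / 8   # scale so summed grads equal the mean
    loss.backward()
accum_grad = model.weight.grad.clone()

# Reference: a single full batch of 16.
model.zero_grad()
loss_fn(model(data), targets).backward()
assert torch.allclose(accum_grad, model.weight.grad, atol=1e-5)
```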
### Freezing Strategy
To prevent catastrophic forgetting of pre-trained multilingual speech representations:
- Frozen: CNN feature extractor layers (`wav2vec2.feature_extractor`)
- Trainable: Transformer encoder layers + Classification head
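A name-based freeze of this kind can be sketched as follows (toy module so the snippet stays self-contained; on the real model, transformers also offers `model.freeze_feature_encoder()` for the same effect):

```python
import torch.nn as nn

# Toy stand-in mirroring the wav2vec2 submodule naming scheme.
class ToySER(nn.Module):
    def __init__(self):
        super().__init__()
        self.feature_extractor = nn.Conv1d(1, 8, kernel_size=3)  # CNN, frozen
        self.encoder = nn.Linear(8, 8)                           # trainable
        self.classifier = nn.Linear(8, 7)                        # trainable

model = ToySER()
# Freeze every parameter under the CNN feature extractor by name.
for name, param in model.named_parameters():
    if name.startswith("feature_extractor"):
        param.requires_grad = False

frozen = [n for n, p in model.named_parameters() if not p.requires_grad]
```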
### Preprocessing Pipeline
- Audio Loading: All files loaded with `librosa` at 16kHz mono
- Duration Filtering: Files outside 0.5-10 seconds filtered out (removed 6 corrupted files)
- Feature Extraction: `Wav2Vec2FeatureExtractor` from `facebook/wav2vec2-xls-r-300m`
- Label Encoding: 7 emotions mapped to numeric IDs (0-6)
- Standardization: All source emotions mapped to the 7 Ekman classes (e.g., "calm" → "neutral", "happy" → "joy")
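The duration filter and the 10-second cap can be sketched with numpy alone (synthetic audio stands in for real files; in the actual pipeline `librosa.load(path, sr=16000, mono=True)` handles loading and resampling, and the feature extractor handles padding):

```python
import numpy as np

SR = 16_000
MAX_LEN = 10 * SR          # 160,000 samples: the 10-second cap
MIN_LEN = int(0.5 * SR)    # 0.5-second floor from the duration filter

def preprocess(waveform):
    """Drop out-of-range clips, then zero-pad to exactly MAX_LEN samples."""
    if not (MIN_LEN <= len(waveform) <= MAX_LEN):
        return None                       # filtered out, as in step 2
    padded = np.zeros(MAX_LEN, dtype=np.float32)
    padded[: len(waveform)] = waveform
    return padded

clip = np.random.randn(3 * SR).astype(np.float32)   # synthetic 3-second clip
out = preprocess(clip)
```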
## Evaluation Metrics

### Overall Performance (Final Model)
The model was trained for 10 epochs with early stopping (patience=3). Training converged at epoch 6.7.
| Metric | Value |
|---|---|
| Accuracy | 0.573 (57.3%) |
| Weighted F1 | 0.554 |
| Macro F1 | 0.561 |
| Validation Loss | 1.208 |
### Per-Language Performance (Estimated)

Because per-language metrics were not computed separately on the mixed Urdu-English test set, the figures below are conservative estimates based on dataset composition:
| Language | Estimated Accuracy | Notes |
|---|---|---|
| English | ~60-65% | More training data (72% of corpus), better representation |
| Urdu | ~45-50% | Less data (28% of corpus) but benefits from multilingual transfer |
| Code-Switched | Unknown | Not explicitly trained, performance may vary |
### Class-wise Performance (Actual Results)
The following table shows the actual per-class performance on the 2,757 test samples:
| Emotion | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| anger | 0.690 | 0.880 | 0.774 | 433 |
| disgust | 0.387 | 0.281 | 0.325 | 431 |
| fear | 0.660 | 0.430 | 0.520 | 433 |
| joy | 0.593 | 0.524 | 0.556 | 433 |
| neutral | 0.510 | 0.842 | 0.635 | 562 |
| sadness | 0.688 | 0.367 | 0.479 | 433 |
| surprise | 0.464 | 1.000 | 0.634 | 32 |
| Weighted Avg | 0.583 | 0.573 | 0.554 | 2,757 |
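The weighted averages in the last row follow directly from the per-class rows: each metric is averaged using support as the weight. A quick arithmetic check:

```python
# Per-class F1 and support, copied from the table above.
f1 = [0.774, 0.325, 0.520, 0.556, 0.635, 0.479, 0.634]
support = [433, 431, 433, 433, 562, 433, 32]

# Support-weighted average, as reported in the "Weighted Avg" row.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)
print(round(weighted_f1, 3))  # 0.553, matching the reported 0.554 up to rounding
```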
### Key Observations

**Strong Performance (F1 > 0.60):**
- Anger (0.774): Excellent recall (88%) - the model rarely misses anger when present. High intensity and distinct prosodic features make this emotion easily recognizable across both languages.
- Neutral (0.635): Very high recall (84%) - the model effectively identifies non-emotional speech, though precision is moderate due to some confusion with joy.
- Surprise (0.634): Despite having only 32 test samples (all English), the model correctly identifies all surprise samples (100% recall), though precision is lower as it sometimes misclassifies fear as surprise.
**Moderate Performance (F1 0.50-0.60):**
- Joy (0.556): Moderate performance with balanced precision and recall. Some confusion with neutral speech.
- Fear (0.520): Decent precision (66%) but lower recall (43%) - the model misses many fear samples, likely confusing them with surprise or sadness.
**Weak Performance (F1 < 0.50):**
- Sadness (0.479): Good precision (69%) but very low recall (37%) - the model is conservative in predicting sadness, missing many true sadness samples.
- Disgust (0.325): The most challenging emotion. Low recall (28%) indicates the model struggles to distinguish disgust from anger, which shares similar acoustic properties.
### Training Progress
| Step | Training Loss | Validation Loss | Accuracy | F1 Weighted | F1 Macro |
|---|---|---|---|---|---|
| 300 | 11.562 | 1.926 | 0.157 | 0.043 | 0.039 |
| 900 | 10.939 | 1.806 | 0.351 | 0.269 | 0.226 |
| 1500 | 9.505 | 1.514 | 0.413 | 0.329 | 0.314 |
| 2100 | 8.083 | 1.445 | 0.435 | 0.367 | 0.365 |
| 2700 | 7.821 | 1.304 | 0.493 | 0.447 | 0.440 |
| 3300 | 7.096 | 1.420 | 0.479 | 0.421 | 0.406 |
| 3900 | 6.836 | 1.241 | 0.549 | 0.520 | 0.532 |
| 4500 | 6.129 | 1.208 | 0.573 | 0.554 | 0.560 |
| 5400 | 5.999 | 1.286 | 0.556 | 0.529 | 0.524 |
The model improved steadily through most of training, reaching its best weighted F1 (0.554) at step 4500 (epoch ~5.6); validation metrics declined at step 5400, and early stopping retained the best checkpoint.
## How to Use the Model

### Installation

```bash
pip install transformers torch librosa soundfile
```
## Evaluation Results (Self-Reported)

- Accuracy on UrduSER + RAVDESS + CREMA-D (Bilingual Urdu-English): 0.573
- F1 (weighted) on UrduSER + RAVDESS + CREMA-D (Bilingual Urdu-English): 0.554
- F1 (macro) on UrduSER + RAVDESS + CREMA-D (Bilingual Urdu-English): 0.561