Bilingual Speech Emotion Recognition Model (Urdu + English)

Model Overview

This model performs bilingual Speech Emotion Recognition (SER) from audio input in both Urdu and English languages. It is a fine-tuned version of the multilingual facebook/wav2vec2-xls-r-300m model, trained on a combined dataset of English (RAVDESS + CREMA-D) and Urdu (UrduSER) emotional speech to predict 7 Ekman emotions.

The model is part of the "Multimodal AI Mental Health Companion" Final Year Project (FYP) and is specifically designed for code-switched and multilingual emotional speech analysis in Pakistani contexts where speakers often mix Urdu and English.

Supported Emotions (7 Ekman Classes)

| Label ID | Emotion  |
|----------|----------|
| 0        | anger    |
| 1        | disgust  |
| 2        | fear     |
| 3        | joy      |
| 4        | neutral  |
| 5        | sadness  |
| 6        | surprise |
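The mapping above can be expressed as a plain dictionary (a sketch for reference; the authoritative `id2label` mapping ships in the model's `config.json`):

```python
# Label mapping for the 7 Ekman emotion classes (mirrors the table above).
id2label = {
    0: "anger",
    1: "disgust",
    2: "fear",
    3: "joy",
    4: "neutral",
    5: "sadness",
    6: "surprise",
}
label2id = {name: idx for idx, name in id2label.items()}
```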

Language Capabilities

| Language | Supported | Training Data | Notes |
|---|---|---|---|
| Urdu | ✅ Yes | UrduSER (~3,500 samples) | Primary target language |
| English | ✅ Yes | RAVDESS + CREMA-D (~8,900 samples) | Strong performance |
| Code-Switched | ⚠️ Partial | Not explicitly trained | May work due to bilingual base model |

Model Details

  • Model ID: muhammadsuleman1533/urdu-ser-model
  • Task: Audio Classification / Speech Emotion Recognition
  • Languages: Urdu & English (Bilingual)
  • Base Model: facebook/wav2vec2-xls-r-300m (Pre-trained on 128 languages)
  • Framework: PyTorch + Hugging Face Transformers
  • Model Size: 0.3B Parameters
  • Developed by: Muhammad Suleman (Team Leader: Muhammad)
  • License: MIT

Why wav2vec2 XLS-R for Bilingual SER?

The XLS-R (300m) model was selected because it is pre-trained on 128 languages, including both Urdu and English. This makes it uniquely suited for:

  1. Cross-lingual transfer: Knowledge learned from high-resource English emotional speech (RAVDESS, CREMA-D) transfers to improve Urdu emotion recognition
  2. Bilingual robustness: The shared multilingual representations help handle code-switched Urdu-English speech common in urban Pakistani populations
  3. Low-resource adaptation: Leverages pre-trained Urdu speech features despite limited Urdu SER data availability

Dataset Information

The model was trained on a bilingual hybrid corpus combining English and Urdu emotional speech to maximize generalization for both languages.

Dataset Composition

| Dataset | Language | Samples | Speakers | Type | Emotion Labels |
|---|---|---|---|---|---|
| RAVDESS | English | ~1,440 | 24 (12M/12F) | Acted | 8 emotions |
| CREMA-D | English | ~7,442 | 91 (48M/43F) | Acted | 6 emotions |
| UrduSER | Urdu | ~3,500 | ~40 | Acted | 7 emotions |
| **TOTAL** | Bilingual | 12,376 | ~155 | Acted | 7 (standardized) |

Language Distribution

| Language | Total Samples | Percentage |
|---|---|---|
| English | ~8,882 | 71.8% |
| Urdu | ~3,494 | 28.2% |
| **Total** | 12,376 | 100% |

Data Split (Speaker-Disjoint)

A GroupShuffleSplit was used based on speaker_id to ensure zero speaker overlap between train and test sets. This tests the model's true ability to recognize emotion in unseen voices across both languages.

| Split | Samples | English | Urdu |
|---|---|---|---|
| Training | 9,619 | ~6,900 | ~2,719 |
| Testing | 2,757 | ~1,982 | ~775 |
| **Total** | 12,376 | ~8,882 | ~3,494 |
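The speaker-disjoint split described above can be sketched with scikit-learn's `GroupShuffleSplit`, assuming each sample record carries a speaker identifier (the `spk_*` toy records below are illustrative, not the real corpus):

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy records standing in for the real corpus: (audio_path, speaker_id).
samples = [(f"clip_{i}.wav", f"spk_{i % 5}") for i in range(50)]
groups = [spk for _, spk in samples]

# One split in which no speaker appears in both partitions.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(samples, groups=groups))

train_speakers = {groups[i] for i in train_idx}
test_speakers = {groups[i] for i in test_idx}
assert train_speakers.isdisjoint(test_speakers)  # zero speaker overlap
```

Because splitting is done on speaker groups rather than individual clips, evaluation measures generalization to entirely unseen voices.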

Emotion Distribution & Class Imbalance

The combined dataset is moderately imbalanced, with surprise heavily under-represented; it appears only in the English data.

| Emotion | Total Samples | English | Urdu | Class Weight |
|---|---|---|---|---|
| anger | ~1,963 | ~1,400 | ~563 | 0.898 |
| disgust | ~1,963 | ~1,400 | ~563 | 0.898 |
| fear | ~1,963 | ~1,400 | ~563 | 0.899 |
| joy | ~1,963 | ~1,400 | ~563 | 0.898 |
| neutral | ~2,375 | ~1,800 | ~575 | 0.759 |
| sadness | ~1,963 | ~1,400 | ~563 | 0.899 |
| surprise | ~192 | ~192 | 0 | 8.588 |

Important Notes:

  • Surprise class: Present only in RAVDESS (English); neither CREMA-D nor UrduSER contains surprise samples.
  • Mitigation: Weighted Cross-Entropy Loss was used with surprise weighted 8.5x higher to compensate for extreme under-representation.
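The class weights in the table are consistent with the standard inverse-frequency formula `w_c = n_samples / (n_classes * count_c)`. A sketch of the computation (the per-class training counts below are illustrative assumptions, not the exact corpus figures):

```python
# Inverse-frequency class weights: w_c = n_samples / (n_classes * count_c).
# Counts below are illustrative training-set counts, not exact corpus figures.
train_counts = {
    "anger": 1530, "disgust": 1530, "fear": 1530, "joy": 1530,
    "neutral": 1810, "sadness": 1530, "surprise": 160,
}
n_samples = sum(train_counts.values())
n_classes = len(train_counts)

weights = {c: n_samples / (n_classes * k) for c, k in train_counts.items()}

# A rare class such as surprise receives a much larger weight, so the
# weighted loss penalizes its misclassification more heavily; in PyTorch the
# weights would be passed as
#   torch.nn.CrossEntropyLoss(weight=torch.tensor(list(weights.values())))
```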

Training Details

Training Configuration

| Parameter | Value |
|---|---|
| Epochs | 10 (with Early Stopping, patience = 3) |
| Batch Size | 2 (effective 16 via gradient accumulation ×8) |
| Learning Rate | 2e-5 |
| Optimizer | AdamW |
| Loss Function | Weighted Cross-Entropy Loss |
| LR Scheduler | Cosine with 10% warmup |
| Weight Decay | 0.01 |
| Max Audio Length | 10 seconds (160,000 samples @ 16 kHz) |
| Mixed Precision | FP16 |
| Hardware | NVIDIA T4 GPU (Google Colab) |
| Training Duration | ~7-8 hours |
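A config-style fragment mirroring the table above, using the Hugging Face `Trainer` API (a sketch; `output_dir` and other unlisted arguments are placeholders):

```python
from transformers import TrainingArguments

# Config fragment mirroring the training table; argument names follow the
# Hugging Face Trainer API. output_dir is a placeholder.
args = TrainingArguments(
    output_dir="urdu-ser-model",
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size 2 x 8 = 16
    learning_rate=2e-5,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,               # 10% warmup
    fp16=True,                      # mixed precision on the T4 GPU
)
# Early stopping would be added via
# transformers.EarlyStoppingCallback(early_stopping_patience=3).
```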

Freezing Strategy

To prevent catastrophic forgetting of pre-trained multilingual speech representations:

  • Frozen: CNN feature extractor layers (wav2vec2.feature_extractor)
  • Trainable: Transformer encoder layers + Classification head
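A sketch of this freezing strategy with Transformers (loading the base checkpoint requires network access, so this is shown as a config-style fragment):

```python
from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    num_labels=7,
)

# Freeze the CNN feature extractor; the Transformer encoder layers and the
# classification head remain trainable.
model.freeze_feature_encoder()

# Equivalent explicit form:
# for p in model.wav2vec2.feature_extractor.parameters():
#     p.requires_grad = False
```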

Preprocessing Pipeline

  1. Audio Loading: All files loaded with librosa at 16kHz mono
  2. Duration Filtering: Files outside 0.5-10 seconds filtered out (removed 6 corrupted files)
  3. Feature Extraction: Wav2Vec2FeatureExtractor from facebook/wav2vec2-xls-r-300m
  4. Label Encoding: 7 emotions mapped to numeric IDs (0-6)
  5. Standardization: All emotions mapped to 7 Ekman classes (e.g., "calm" → "neutral", "happy" → "joy")
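The duration filter and label standardization steps can be sketched as follows (the audio loading and feature-extraction calls are shown only in comments; the mapping covers the remappings named above, with all other labels passing through unchanged):

```python
# Step 2: keep clips between 0.5 and 10 seconds at 16 kHz.
SR = 16_000
MIN_SAMPLES, MAX_SAMPLES = int(0.5 * SR), 10 * SR  # 8,000 .. 160,000 samples

def within_duration(num_samples: int) -> bool:
    return MIN_SAMPLES <= num_samples <= MAX_SAMPLES

# Step 5: standardize source labels onto the 7 Ekman classes.
LABEL_MAP = {"calm": "neutral", "happy": "joy"}  # explicit remappings

def standardize(label: str) -> str:
    return LABEL_MAP.get(label, label)

# In the real pipeline the audio is loaded with
#   y, sr = librosa.load(path, sr=SR, mono=True)
# and encoded with
#   Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
```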

Evaluation Metrics

Overall Performance (Final Model)

The model was configured for 10 epochs with early stopping (patience=3); training halted at roughly epoch 6.7.

| Metric | Value |
|---|---|
| Accuracy | 0.573 (57.3%) |
| Weighted F1 | 0.554 |
| Macro F1 | 0.561 |
| Validation Loss | 1.208 |

Per-Language Performance (Estimated)

Since the test set contains both Urdu and English samples without separate labels, these are conservative estimates based on dataset composition:

| Language | Estimated Accuracy | Notes |
|---|---|---|
| English | ~60-65% | More training data (72% of corpus), better representation |
| Urdu | ~45-50% | Less data (28% of corpus) but benefits from multilingual transfer |
| Code-Switched | Unknown | Not explicitly trained; performance may vary |

Class-wise Performance (Actual Results)

The following table shows the actual per-class performance on the 2,757 test samples:

| Emotion | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| anger | 0.690 | 0.880 | 0.774 | 433 |
| disgust | 0.387 | 0.281 | 0.325 | 431 |
| fear | 0.660 | 0.430 | 0.520 | 433 |
| joy | 0.593 | 0.524 | 0.556 | 433 |
| neutral | 0.510 | 0.842 | 0.635 | 562 |
| sadness | 0.688 | 0.367 | 0.479 | 433 |
| surprise | 0.464 | 1.000 | 0.634 | 32 |
| **Weighted Avg** | 0.583 | 0.573 | 0.554 | 2,757 |

Key Observations

Strong Performance (F1 > 0.60):

  • Anger (0.774): Excellent recall (88%) - the model rarely misses anger when present. High intensity and distinct prosodic features make this emotion easily recognizable across both languages.
  • Neutral (0.635): Very high recall (84%) - the model effectively identifies non-emotional speech, though precision is moderate due to some confusion with joy.
  • Surprise (0.634): Despite having only 32 test samples (all English), the model correctly identifies all surprise samples (100% recall), though precision is lower as it sometimes misclassifies fear as surprise.

Moderate Performance (F1 0.45-0.60):

  • Joy (0.556): Moderate performance with balanced precision and recall. Some confusion with neutral speech.
  • Fear (0.520): Decent precision (66%) but lower recall (43%) - the model misses many fear samples, likely confusing them with surprise or sadness.

Weak Performance (F1 < 0.45):

  • Sadness (0.479): Good precision (69%) but very low recall (37%) - the model is conservative in predicting sadness, missing many true sadness samples.
  • Disgust (0.325): The most challenging emotion. Low recall (28%) indicates the model struggles to distinguish disgust from anger, which shares similar acoustic properties.

Training Progress

| Step | Training Loss | Validation Loss | Accuracy | F1 Weighted | F1 Macro |
|---|---|---|---|---|---|
| 300 | 11.562 | 1.926 | 0.157 | 0.043 | 0.039 |
| 900 | 10.939 | 1.806 | 0.351 | 0.269 | 0.226 |
| 1500 | 9.505 | 1.514 | 0.413 | 0.329 | 0.314 |
| 2100 | 8.083 | 1.445 | 0.435 | 0.367 | 0.365 |
| 2700 | 7.821 | 1.304 | 0.493 | 0.447 | 0.440 |
| 3300 | 7.096 | 1.420 | 0.479 | 0.421 | 0.406 |
| 3900 | 6.836 | 1.241 | 0.549 | 0.520 | 0.532 |
| 4500 | 6.129 | 1.208 | 0.573 | 0.554 | 0.560 |
| 5400 | 5.999 | 1.286 | 0.556 | 0.529 | 0.524 |

The model showed consistent improvement throughout training, with the best F1-weighted score (0.554) achieved at step 4500 (epoch ~5.6).


How to Use the Model

Installation

```bash
pip install transformers torch librosa soundfile
```
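A typical inference sketch (untested here, since it downloads the checkpoint at runtime; `sample.wav` is a placeholder for any Urdu or English speech clip):

```python
import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_id = "muhammadsuleman1533/urdu-ser-model"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModelForAudioClassification.from_pretrained(model_id)
model.eval()

# Load the clip at 16 kHz mono, matching the training preprocessing.
speech, sr = librosa.load("sample.wav", sr=16_000, mono=True)

inputs = feature_extractor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred_id = int(logits.argmax(dim=-1))
print(model.config.id2label[pred_id])  # one of the 7 emotion labels
```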
