# Bilingual Speech Emotion Recognition Model (Urdu + English)

## Model Overview
This model performs bilingual Speech Emotion Recognition (SER) on audio in both Urdu and English. It is a fine-tuned version of the multilingual `facebook/wav2vec2-xls-r-300m` model, trained on a combined dataset of English (RAVDESS + CREMA-D) and Urdu (UrduSER) emotional speech to predict 7 Ekman emotions.
The model is part of the "Multimodal AI Mental Health Companion" Final Year Project (FYP) and is specifically designed for code-switched and multilingual emotional speech analysis in Pakistani contexts where speakers often mix Urdu and English.
## Supported Emotions (7 Ekman Classes)
| Label ID | Emotion |
|---|---|
| 0 | anger |
| 1 | disgust |
| 2 | fear |
| 3 | joy |
| 4 | neutral |
| 5 | sadness |
| 6 | surprise |
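The table above corresponds to the `id2label` mapping convention used by Hugging Face classification heads (a minimal sketch; the uploaded model's config should carry the same mapping):

```python
# Label mapping for the 7 Ekman emotion classes, as in the table above.
id2label = {
    0: "anger", 1: "disgust", 2: "fear",
    3: "joy", 4: "neutral", 5: "sadness", 6: "surprise",
}
label2id = {name: idx for idx, name in id2label.items()}
```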
## Language Capabilities
| Language | Supported | Training Data | Notes |
|---|---|---|---|
| Urdu | ✅ Yes | UrduSER (~3,500 samples) | Primary target language |
| English | ✅ Yes | RAVDESS + CREMA-D (~8,900 samples) | Strong performance |
| Code-Switched | ⚠️ Partial | Not explicitly trained | May work due to bilingual base model |
## Model Details

- Model ID: `muhammadsuleman1533/urdu-ser-model`
- Task: Audio Classification / Speech Emotion Recognition
- Languages: Urdu & English (Bilingual)
- Base Model: `facebook/wav2vec2-xls-r-300m` (pre-trained on 128 languages)
- Framework: PyTorch + Hugging Face Transformers
- Model Size: 0.3B Parameters
- Developed by: Muhammad Suleman (Team Leader: Muhammad)
- License: MIT
## Why wav2vec2 XLS-R for Bilingual SER?
The XLS-R (300m) model was selected because it is pre-trained on 128 languages, including both Urdu and English. This makes it uniquely suited for:
- Cross-lingual transfer: Knowledge learned from high-resource English emotional speech (RAVDESS, CREMA-D) transfers to improve Urdu emotion recognition
- Bilingual robustness: The shared multilingual representations help handle code-switched Urdu-English speech common in urban Pakistani populations
- Low-resource adaptation: Leverages pre-trained Urdu speech features despite limited Urdu SER data availability
## Dataset Information
The model was trained on a bilingual hybrid corpus combining English and Urdu emotional speech to maximize generalization for both languages.
### Dataset Composition
| Dataset | Language | Samples | Speakers | Type | Emotion Labels |
|---|---|---|---|---|---|
| RAVDESS | English | ~1,440 | 24 (12M/12F) | Acted | 8 emotions |
| CREMA-D | English | ~7,442 | 91 (48M/43F) | Acted | 6 emotions |
| UrduSER | Urdu | ~3,500 | ~40 | Acted | 7 emotions |
| TOTAL | Bilingual | 12,376 | ~155 | Acted | 7 (standardized) |
### Language Distribution
| Language | Total Samples | Percentage |
|---|---|---|
| English | ~8,882 | 71.8% |
| Urdu | ~3,494 | 28.2% |
| Total | 12,376 | 100% |
### Data Split (Speaker-Disjoint)

A `GroupShuffleSplit` based on `speaker_id` was used to ensure zero speaker overlap between the train and test sets. This tests the model's true ability to recognize emotion in unseen voices across both languages.
| Split | Samples | English | Urdu |
|---|---|---|---|
| Training | 9,619 | ~6,900 | ~2,719 |
| Testing | 2,757 | ~1,982 | ~775 |
| Total | 12,376 | ~8,882 | ~3,494 |
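The speaker-disjoint protocol above can be sketched with scikit-learn's `GroupShuffleSplit` (toy metadata here; the real split ran over the full 12,376-sample table):

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy metadata: one row per utterance, grouped by speaker_id.
speaker_ids = [f"spk{i // 4}" for i in range(40)]   # 10 speakers, 4 clips each
labels = list(range(40))

# test_size is a fraction of SPEAKERS (groups), not samples, so every
# clip from a given speaker lands entirely in train or entirely in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.22, random_state=42)
train_idx, test_idx = next(gss.split(labels, groups=speaker_ids))

train_speakers = {speaker_ids[i] for i in train_idx}
test_speakers = {speaker_ids[i] for i in test_idx}
assert train_speakers.isdisjoint(test_speakers)  # zero speaker overlap
```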
### Emotion Distribution & Class Imbalance
The combined dataset is moderately imbalanced with surprise being heavily under-represented in both languages.
| Emotion | Total Samples | English | Urdu | Class Weight |
|---|---|---|---|---|
| anger | ~1,963 | ~1,400 | ~563 | 0.898 |
| disgust | ~1,963 | ~1,400 | ~563 | 0.898 |
| fear | ~1,963 | ~1,400 | ~563 | 0.899 |
| joy | ~1,963 | ~1,400 | ~563 | 0.898 |
| neutral | ~2,375 | ~1,800 | ~575 | 0.759 |
| sadness | ~1,963 | ~1,400 | ~563 | 0.899 |
| surprise | ~192 | ~192 | 0 | 8.588 |
**Important Notes:**

- Surprise class: Only present in the English datasets (RAVDESS). UrduSER does not contain surprise samples.
- Mitigation: Weighted Cross-Entropy Loss was used, with `surprise` weighted ~8.6× higher to compensate for the extreme under-representation.
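Inverse-frequency weighting of this kind can be sketched as follows (illustrative only: the counts below are the approximate totals from the table, while the weights actually used were computed on the training split, so the values differ slightly from the Class Weight column):

```python
import torch
import torch.nn as nn

# Approximate per-class sample counts from the table above.
counts = {"anger": 1963, "disgust": 1963, "fear": 1963, "joy": 1963,
          "neutral": 2375, "sadness": 1963, "surprise": 192}

# weight_c = N / (K * n_c): rare classes get proportionally larger weights.
total = sum(counts.values())
weights = torch.tensor([total / (len(counts) * n) for n in counts.values()])

# Weighted cross-entropy: misclassifying "surprise" now costs roughly
# nine times more than an average-frequency class.
criterion = nn.CrossEntropyLoss(weight=weights)
```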
## Training Details

### Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 10 (with Early Stopping Patience=3) |
| Batch Size | 2 (Effective 16 via Gradient Accumulation ×8) |
| Learning Rate | 2e-5 |
| Optimizer | AdamW |
| Loss Function | Weighted Cross-Entropy Loss |
| LR Scheduler | Cosine with 10% Warmup |
| Weight Decay | 0.01 |
| Max Audio Length | 10 seconds (160,000 samples @ 16kHz) |
| Mixed Precision | FP16 |
| Hardware | NVIDIA T4 GPU (Google Colab) |
| Training Duration | ~7-8 hours |
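The batch-size row deserves a note: with a per-device batch of 2 and accumulation ×8, gradients from eight micro-batches are summed before each optimizer step, which (for a mean-reduced loss, after scaling each micro-loss by 1/8) is numerically equivalent to one batch of 16. A minimal PyTorch sketch of that equivalence:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 7)            # toy stand-in for the classifier head
data = torch.randn(16, 4)
targets = torch.randint(0, 7, (16,))
loss_fn = nn.CrossEntropyLoss()

# Accumulate over 8 micro-batches of size 2 (effective batch 16).
model.zero_grad()
for i in range(8):
    xb, yb = data[2 * i:2 * i + 2], targets[2 * i:2 * i + 2]
    loss = loss_fn(model(xb), yb) / 8   # scale so summed grads equal the mean
    loss.backward()
accum_grad = model.weight.grad.clone()

# Reference: a single full batch of 16.
model.zero_grad()
loss_fn(model(data), targets).backward()
assert torch.allclose(accum_grad, model.weight.grad, atol=1e-5)
```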
### Freezing Strategy
To prevent catastrophic forgetting of pre-trained multilingual speech representations:
- Frozen: CNN feature extractor layers (`wav2vec2.feature_extractor`)
- Trainable: Transformer encoder layers + Classification head
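A name-based freeze of this kind can be sketched as follows (toy module so the snippet stays self-contained; on the real model, transformers also offers `model.freeze_feature_encoder()` for the same effect):

```python
import torch.nn as nn

# Toy stand-in mirroring the wav2vec2 submodule naming scheme.
class ToySER(nn.Module):
    def __init__(self):
        super().__init__()
        self.feature_extractor = nn.Conv1d(1, 8, kernel_size=3)  # CNN, frozen
        self.encoder = nn.Linear(8, 8)                           # trainable
        self.classifier = nn.Linear(8, 7)                        # trainable

model = ToySER()
# Freeze every parameter under the CNN feature extractor by name.
for name, param in model.named_parameters():
    if name.startswith("feature_extractor"):
        param.requires_grad = False

frozen = [n for n, p in model.named_parameters() if not p.requires_grad]
```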
### Preprocessing Pipeline
- Audio Loading: All files loaded with `librosa` at 16kHz mono
- Duration Filtering: Files outside 0.5-10 seconds filtered out (removed 6 corrupted files)
- Feature Extraction: `Wav2Vec2FeatureExtractor` from `facebook/wav2vec2-xls-r-300m`
- Label Encoding: 7 emotions mapped to numeric IDs (0-6)
- Standardization: All source emotions mapped to the 7 Ekman classes (e.g., "calm" → "neutral", "happy" → "joy")
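The duration filter and the 10-second cap can be sketched with numpy alone (synthetic audio stands in for real files; in the actual pipeline `librosa.load(path, sr=16000, mono=True)` handles loading and resampling, and the feature extractor handles padding):

```python
import numpy as np

SR = 16_000
MAX_LEN = 10 * SR          # 160,000 samples: the 10-second cap
MIN_LEN = int(0.5 * SR)    # 0.5-second floor from the duration filter

def preprocess(waveform):
    """Drop out-of-range clips, then zero-pad to exactly MAX_LEN samples."""
    if not (MIN_LEN <= len(waveform) <= MAX_LEN):
        return None                       # filtered out, as in step 2
    padded = np.zeros(MAX_LEN, dtype=np.float32)
    padded[: len(waveform)] = waveform
    return padded

clip = np.random.randn(3 * SR).astype(np.float32)   # synthetic 3-second clip
out = preprocess(clip)
```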
## Evaluation Metrics

### Overall Performance (Final Model)
The model was trained for 10 epochs with early stopping (patience=3). Training converged at epoch 6.7.
| Metric | Value |
|---|---|
| Accuracy | 0.573 (57.3%) |
| Weighted F1 | 0.554 |
| Macro F1 | 0.561 |
| Validation Loss | 1.208 |
### Per-Language Performance (Estimated)

Because per-language metrics were not computed separately on the mixed Urdu-English test set, the figures below are conservative estimates based on dataset composition:
| Language | Estimated Accuracy | Notes |
|---|---|---|
| English | ~60-65% | More training data (72% of corpus), better representation |
| Urdu | ~45-50% | Less data (28% of corpus) but benefits from multilingual transfer |
| Code-Switched | Unknown | Not explicitly trained, performance may vary |
### Class-wise Performance (Actual Results)
The following table shows the actual per-class performance on the 2,757 test samples:
| Emotion | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| anger | 0.690 | 0.880 | 0.774 | 433 |
| disgust | 0.387 | 0.281 | 0.325 | 431 |
| fear | 0.660 | 0.430 | 0.520 | 433 |
| joy | 0.593 | 0.524 | 0.556 | 433 |
| neutral | 0.510 | 0.842 | 0.635 | 562 |
| sadness | 0.688 | 0.367 | 0.479 | 433 |
| surprise | 0.464 | 1.000 | 0.634 | 32 |
| Weighted Avg | 0.583 | 0.573 | 0.554 | 2,757 |
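The weighted averages in the last row follow directly from the per-class rows: each metric is averaged using support as the weight. A quick arithmetic check:

```python
# Per-class F1 and support, copied from the table above.
f1 = [0.774, 0.325, 0.520, 0.556, 0.635, 0.479, 0.634]
support = [433, 431, 433, 433, 562, 433, 32]

# Support-weighted average, as reported in the "Weighted Avg" row.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)
print(round(weighted_f1, 3))  # 0.553, matching the reported 0.554 up to rounding
```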
### Key Observations

**Strong Performance (F1 > 0.60):**
- Anger (0.774): Excellent recall (88%) - the model rarely misses anger when present. High intensity and distinct prosodic features make this emotion easily recognizable across both languages.
- Neutral (0.635): Very high recall (84%) - the model effectively identifies non-emotional speech, though precision is moderate due to some confusion with joy.
- Surprise (0.634): Despite having only 32 test samples (all English), the model correctly identifies all surprise samples (100% recall), though precision is lower as it sometimes misclassifies fear as surprise.
**Moderate Performance (F1 0.50-0.60):**
- Joy (0.556): Moderate performance with balanced precision and recall. Some confusion with neutral speech.
- Fear (0.520): Decent precision (66%) but lower recall (43%) - the model misses many fear samples, likely confusing them with surprise or sadness.
**Weak Performance (F1 < 0.50):**
- Sadness (0.479): Good precision (69%) but very low recall (37%) - the model is conservative in predicting sadness, missing many true sadness samples.
- Disgust (0.325): The most challenging emotion. Low recall (28%) indicates the model struggles to distinguish disgust from anger, which shares similar acoustic properties.
### Training Progress
| Step | Training Loss | Validation Loss | Accuracy | F1 Weighted | F1 Macro |
|---|---|---|---|---|---|
| 300 | 11.562 | 1.926 | 0.157 | 0.043 | 0.039 |
| 900 | 10.939 | 1.806 | 0.351 | 0.269 | 0.226 |
| 1500 | 9.505 | 1.514 | 0.413 | 0.329 | 0.314 |
| 2100 | 8.083 | 1.445 | 0.435 | 0.367 | 0.365 |
| 2700 | 7.821 | 1.304 | 0.493 | 0.447 | 0.440 |
| 3300 | 7.096 | 1.420 | 0.479 | 0.421 | 0.406 |
| 3900 | 6.836 | 1.241 | 0.549 | 0.520 | 0.532 |
| 4500 | 6.129 | 1.208 | 0.573 | 0.554 | 0.560 |
| 5400 | 5.999 | 1.286 | 0.556 | 0.529 | 0.524 |
The model improved steadily through most of training, reaching its best weighted F1 (0.554) at step 4500 (epoch ~5.6); validation metrics declined at step 5400, and early stopping retained the best checkpoint.
## How to Use the Model

### Installation

```bash
pip install transformers torch librosa soundfile
```
## Evaluation Results (Self-Reported)

- Accuracy on UrduSER + RAVDESS + CREMA-D (Bilingual Urdu-English): 0.573
- F1 (weighted) on UrduSER + RAVDESS + CREMA-D (Bilingual Urdu-English): 0.554
- F1 (macro) on UrduSER + RAVDESS + CREMA-D (Bilingual Urdu-English): 0.561