# DiSER v2.0: Speech Emotion Recognition
Fine-tuned microsoft/wavlm-base-plus for
7-class speech emotion recognition, achieving 74.10% accuracy and
72.92% macro-F1 on a combined RAVDESS + TESS + CREMA-D + SAVEE corpus.
## Architecture
```
Raw waveform (4 s @ 16 kHz)
  ↓ WavLM Feature Extractor CNN (frozen)
  ↓ WavLM Transformer × 12 (fine-tuned, discriminative LR)
  ↓ Weighted Layer Aggregation [Yang et al. 2021, SUPERB]
  ↓ Attentive Statistics Pooling [Okabe et al. 2018] → μ ⊕ σ
  ↓ Classification Head (Linear → GELU → Dropout → LayerNorm, × 2, → Linear)
  ↓ 7-class emotion logits
```
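The pooling step collapses variable-length frame sequences into one fixed utterance vector: an attention-weighted mean μ concatenated with an attention-weighted standard deviation σ. A minimal NumPy sketch of attentive statistics pooling in the spirit of Okabe et al. (2018) follows; `W` and `v` stand in for the learned attention parameters and are illustrative names, not taken from the released code:

```python
import numpy as np

def attentive_stats_pooling(h, W, v):
    """Collapse frame-level features h of shape (T, D) into a fixed (2D,)
    utterance vector: attention-weighted mean μ concatenated with std σ."""
    e = np.tanh(h @ W) @ v                        # (T,) per-frame attention scores
    a = np.exp(e - e.max())
    a /= a.sum()                                  # softmax over the time axis
    mu = (a[:, None] * h).sum(axis=0)             # weighted mean μ
    var = (a[:, None] * (h - mu) ** 2).sum(axis=0)
    return np.concatenate([mu, np.sqrt(var + 1e-9)])  # μ ⊕ σ
```

With zero attention parameters every frame receives equal weight, so the layer degrades gracefully to plain mean/std pooling; the learned parameters let it emphasize emotionally salient frames instead.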
## Performance
| Metric | Score |
|---|---|
| Test Accuracy | 74.10% |
| Macro F1 | 72.92% |
| Weighted F1 | 73.23% |
| ONNX CPU latency | 844 ms / sample |
### Per-class recall

| Emotion | angry | disgust | fear | happy | neutral | sad | surprise |
|---|---|---|---|---|---|---|---|
| Recall | 84.1% | 77.7% | 82.4% | 30.6% | 74.3% | 82.1% | 95.7% |
## Training Details
- Training set: 11,491 samples (stratified speaker-independent split)
- Test set: 4,911 samples (no speaker overlap with train)
- Optimizer: AdamW with discriminative LRs (backbone=5e-06, head=3e-04)
- Scheduler: Cosine annealing with 8% linear warmup
- Loss: Focal Loss (γ=2.0) + effective-number class weights
- Augmentation: Gaussian noise, time stretching, pitch shift, Mixup
- Epochs trained: 4 (early stopping on macro-F1)
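The loss bullets combine two published ideas: effective-number class weights (Cui et al., 2019) rebalance the skewed corpus, and those weights scale a focal loss (Lin et al., 2017) that down-weights easy examples. A compact NumPy sketch under those assumptions (function names are illustrative; the released training code may differ):

```python
import numpy as np

def effective_number_weights(counts, beta=0.999):
    """Per-class weights w_c ∝ (1 - beta) / (1 - beta^n_c), normalized to mean 1."""
    eff = (1.0 - np.power(beta, np.asarray(counts, dtype=float))) / (1.0 - beta)
    w = 1.0 / eff
    return w * len(counts) / w.sum()

def focal_loss(probs, target, weights, gamma=2.0):
    """Single-sample focal loss: FL = -w_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    return -weights[target] * (1.0 - p_t) ** gamma * np.log(p_t)
```

Setting γ=0 and unit weights recovers plain cross-entropy; γ=2.0 (the value used here) shrinks the loss on confidently classified samples so training focuses on hard, often minority-class, examples.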
## Scientific References
| Method | Reference |
|---|---|
| WavLM backbone | Chen et al. (2022), IEEE JSTSP |
| Weighted layer aggregation | Yang et al. (2021), INTERSPEECH |
| Attentive statistics pooling | Okabe et al. (2018), INTERSPEECH |
| Focal Loss | Lin et al. (2017), ICCV |
| Effective-number weighting | Cui et al. (2019), CVPR |
| Mixup augmentation | Zhang et al. (2018), ICLR |
## Usage
```python
import torch
import librosa
from transformers import AutoFeatureExtractor

# Load
feature_extractor = AutoFeatureExtractor.from_pretrained("shrey416/DiSER")
model = ...  # instantiate the DiSER architecture and load weights from best_model.pt
model.eval()

# Infer on a 4 s window (64000 samples @ 16 kHz)
y, sr = librosa.load("audio.wav", sr=16000)
inputs = feature_extractor(y, sampling_rate=16000, return_tensors="pt",
                           padding="max_length", max_length=64000, truncation=True)
with torch.no_grad():
    logits = model(inputs["input_values"])

emotions = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
print(emotions[logits.argmax(-1).item()])
```
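The usage example prints only the arg-max label. To inspect the full class distribution, you can softmax the logits yourself; this is a plain NumPy helper for illustration, not part of the released code:

```python
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def emotion_probs(logits):
    """Map raw 7-class logits to an emotion-name → probability dict
    using a numerically stable softmax."""
    x = np.asarray(logits, dtype=float).ravel()
    z = np.exp(x - x.max())
    p = z / z.sum()
    return dict(zip(EMOTIONS, p))
```

Inspecting the full distribution is useful given the skewed per-class recall above: a low-margin "happy" prediction may deserve less trust than a high-margin "surprise" one.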