DiSER v2.0: Speech Emotion Recognition

Fine-tuned microsoft/wavlm-base-plus for 7-class speech emotion recognition, achieving 74.10% accuracy and 72.92% macro-F1 on a combined RAVDESS + TESS + CREMA-D + SAVEE corpus.

Architecture

```
Raw waveform (4 s @ 16 kHz)
→ WavLM Feature Extractor CNN (frozen)
→ WavLM Transformer × 12 (fine-tuned, discriminative LR)
→ Weighted Layer Aggregation [Yang et al. 2021, SUPERB]
→ Attentive Statistics Pooling [Okabe et al. 2018] → μ ⊕ σ
→ Classification Head (Linear → GELU → Dropout → LayerNorm, × 2 → Linear)
→ 7-class emotion logits
```
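The two pooling stages above can be sketched in PyTorch. This is an illustrative re-implementation of SUPERB-style weighted layer aggregation and attentive statistics pooling, not the repository's actual code; the class names, bottleneck size, and tensor layout `(layers, batch, time, dim)` are assumptions.

```python
import torch
import torch.nn as nn

class WeightedLayerAggregation(nn.Module):
    """Learnable softmax-weighted sum over transformer layer outputs (SUPERB-style)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: (num_layers, batch, time, dim)
        w = torch.softmax(self.weights, dim=0)
        return (w[:, None, None, None] * hidden_states).sum(dim=0)  # (batch, time, dim)

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and std over time (Okabe et al. 2018): output is μ ⊕ σ."""
    def __init__(self, dim: int, bottleneck: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.Tanh(), nn.Linear(bottleneck, 1)
        )

    def forward(self, x):
        # x: (batch, time, dim)
        alpha = torch.softmax(self.attn(x), dim=1)              # (batch, time, 1)
        mu = (alpha * x).sum(dim=1)                             # weighted mean, (batch, dim)
        var = (alpha * (x - mu.unsqueeze(1)) ** 2).sum(dim=1)   # weighted variance
        return torch.cat([mu, var.clamp(min=1e-9).sqrt()], dim=-1)  # (batch, 2*dim)
```

Concatenating μ and σ doubles the embedding size, so the classification head's first linear layer takes a 2×-dim input.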

Performance

| Metric | Score |
|---|---|
| Test accuracy | 74.10% |
| Macro F1 | 72.92% |
| Weighted F1 | 73.23% |
| ONNX CPU latency | 844 ms / sample |
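Latency figures like the one above are usually averaged over repeated runs after warmup. A minimal, framework-agnostic timing helper (hypothetical name `mean_latency_ms`, not from this repository) looks like:

```python
import time

def mean_latency_ms(run_once, n_warmup=3, n_runs=20):
    """Average wall-clock latency of a zero-arg callable, in milliseconds."""
    for _ in range(n_warmup):   # warm caches and lazy initialization before timing
        run_once()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / n_runs
```

To reproduce the ONNX number, `run_once` would wrap a single `session.run(...)` call on one padded 4 s input.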

Per-class recall

| Emotion | Recall |
|---|---|
| angry | 84.1% |
| disgust | 77.7% |
| fear | 82.4% |
| happy | 30.6% |
| neutral | 74.3% |
| sad | 82.1% |
| surprise | 95.7% |
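Per-class recall is the diagonal of the row-normalized confusion matrix: recall for class *c* = TP_c / (TP_c + FN_c). A small self-contained sketch (the helper name is illustrative; `sklearn.metrics.recall_score` with `average=None` computes the same thing):

```python
import numpy as np

def per_class_recall(y_true, y_pred, num_classes):
    """Recall per class from integer label arrays."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows = true class, cols = predicted class
    support = cm.sum(axis=1)               # samples per true class
    return np.divide(np.diag(cm), support,
                     out=np.zeros(num_classes, dtype=float), where=support > 0)
```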

Training Details

- Training set: 11,491 samples (stratified speaker-independent split)
- Test set: 4,911 samples (no speaker overlap with train)
- Optimizer: AdamW with discriminative LRs (backbone=5e-06, head=3e-04)
- Scheduler: Cosine annealing with 8% linear warmup
- Loss: Focal Loss (γ=2.0) + effective-number class weights
- Augmentation: Gaussian noise, time stretching, pitch shift, Mixup
- Epochs trained: 4 (early stopping on macro-F1)
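The loss recipe above combines Focal Loss (Lin et al. 2017) with effective-number class weights (Cui et al. 2019). A minimal sketch of both, not the repository's exact implementation; function names, the β value, and the mean-1 weight normalization are assumptions:

```python
import torch
import torch.nn.functional as F

def effective_number_weights(class_counts, beta=0.9999):
    """Per-class weights ∝ (1 - β) / (1 - β^n_c), normalized to mean 1 (Cui et al. 2019)."""
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    effective_num = (1.0 - beta ** counts) / (1.0 - beta)
    w = 1.0 / effective_num
    return w * len(counts) / w.sum()

def focal_loss(logits, targets, weights, gamma=2.0):
    """Class-weighted focal loss: (1 - p_t)^γ scales down the easy, confident examples."""
    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets, weight=weights, reduction="none")  # per-sample weighted CE
    pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()        # p of the true class
    return ((1.0 - pt) ** gamma * ce).mean()
```

With γ=0 and uniform weights this reduces to ordinary cross-entropy; larger γ focuses training on hard, misclassified samples, which helps with weak classes such as "happy".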

Scientific References

| Method | Reference |
|---|---|
| WavLM backbone | Chen et al. (2022), IEEE JSTSP |
| Weighted layer aggregation | Yang et al. (2021), INTERSPEECH |
| Attentive statistics pooling | Okabe et al. (2018), INTERSPEECH |
| Focal Loss | Lin et al. (2017), ICCV |
| Effective-number weighting | Cui et al. (2019), CVPR |
| Mixup augmentation | Zhang et al. (2018), ICLR |

Usage

```python
import torch, librosa
from transformers import AutoFeatureExtractor

# Load
feature_extractor = AutoFeatureExtractor.from_pretrained("shrey416/DiSER")
model = ...  # load the fine-tuned model from best_model.pt
model.eval()

# Infer: pad/truncate to the model's fixed 4 s (64,000-sample) input
y, sr = librosa.load("audio.wav", sr=16000)
inputs = feature_extractor(y, sampling_rate=16000, return_tensors="pt",
                           padding="max_length", max_length=64000, truncation=True)
with torch.no_grad():
    logits = model(inputs["input_values"])
emotions = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
print(emotions[logits.argmax(-1).item()])
```
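The snippet above truncates anything longer than 4 s. For longer recordings, one simple option is to split the waveform into overlapping 4 s windows, run each through the model, and average the logits. A sketch of the windowing step (the helper name `chunk_audio` and the 2 s hop are assumptions, not part of this repository):

```python
import numpy as np

def chunk_audio(y, sr=16000, win_s=4.0, hop_s=2.0):
    """Split a 1-D waveform into fixed-length windows; zero-pad if shorter than one window."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    if len(y) <= win:
        return [np.pad(y, (0, win - len(y)))]
    return [y[s:s + win] for s in range(0, len(y) - win + 1, hop)]
```

Each returned chunk is exactly 64,000 samples, so it can be fed to the feature extractor without further padding.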