DiSER v2.0: Speech Emotion Recognition

Fine-tuned microsoft/wavlm-base-plus for 7-class speech emotion recognition, achieving 74.10% accuracy and 72.92% macro-F1 on a combined RAVDESS + TESS + CREMA-D + SAVEE corpus.

Architecture

```
Raw waveform (4 s @ 16 kHz)
→ WavLM Feature Extractor CNN (frozen)
→ WavLM Transformer × 12 (fine-tuned, discriminative LR)
→ Weighted Layer Aggregation [Yang et al. 2021, SUPERB]
→ Attentive Statistics Pooling [Okabe et al. 2018] → μ ⊕ σ
→ Classification Head (Linear → GELU → Dropout → LayerNorm, × 2 → Linear)
→ 7-class emotion logits
```
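The two pooling stages above can be sketched in PyTorch. This is an illustrative re-implementation of SUPERB-style weighted layer aggregation and attentive statistics pooling, not the repository's actual code; the class names, bottleneck size, and tensor layout `(layers, batch, time, dim)` are assumptions.

```python
import torch
import torch.nn as nn

class WeightedLayerAggregation(nn.Module):
    """Learnable softmax-weighted sum over transformer layer outputs (SUPERB-style)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: (num_layers, batch, time, dim)
        w = torch.softmax(self.weights, dim=0)
        return (w[:, None, None, None] * hidden_states).sum(dim=0)  # (batch, time, dim)

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and std over time (Okabe et al. 2018): output is μ ⊕ σ."""
    def __init__(self, dim: int, bottleneck: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.Tanh(), nn.Linear(bottleneck, 1)
        )

    def forward(self, x):
        # x: (batch, time, dim)
        alpha = torch.softmax(self.attn(x), dim=1)              # (batch, time, 1)
        mu = (alpha * x).sum(dim=1)                             # weighted mean, (batch, dim)
        var = (alpha * (x - mu.unsqueeze(1)) ** 2).sum(dim=1)   # weighted variance
        return torch.cat([mu, var.clamp(min=1e-9).sqrt()], dim=-1)  # (batch, 2*dim)
```

Concatenating μ and σ doubles the embedding size, so the classification head's first linear layer takes a 2×-dim input.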

Performance

| Metric | Score |
|---|---|
| Test accuracy | 74.10% |
| Macro F1 | 72.92% |
| Weighted F1 | 73.23% |
| ONNX CPU latency | 844 ms / sample |
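Latency figures like the one above are usually averaged over repeated runs after warmup. A minimal, framework-agnostic timing helper (hypothetical name `mean_latency_ms`, not from this repository) looks like:

```python
import time

def mean_latency_ms(run_once, n_warmup=3, n_runs=20):
    """Average wall-clock latency of a zero-arg callable, in milliseconds."""
    for _ in range(n_warmup):   # warm caches and lazy initialization before timing
        run_once()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / n_runs
```

To reproduce the ONNX number, `run_once` would wrap a single `session.run(...)` call on one padded 4 s input.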

Per-class recall

| Emotion | Recall |
|---|---|
| angry | 84.1% |
| disgust | 77.7% |
| fear | 82.4% |
| happy | 30.6% |
| neutral | 74.3% |
| sad | 82.1% |
| surprise | 95.7% |
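Per-class recall is the diagonal of the row-normalized confusion matrix: recall for class *c* = TP_c / (TP_c + FN_c). A small self-contained sketch (the helper name is illustrative; `sklearn.metrics.recall_score` with `average=None` computes the same thing):

```python
import numpy as np

def per_class_recall(y_true, y_pred, num_classes):
    """Recall per class from integer label arrays."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows = true class, cols = predicted class
    support = cm.sum(axis=1)               # samples per true class
    return np.divide(np.diag(cm), support,
                     out=np.zeros(num_classes, dtype=float), where=support > 0)
```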

Training Details

- Training set: 11,491 samples (stratified speaker-independent split)
- Test set: 4,911 samples (no speaker overlap with train)
- Optimizer: AdamW with discriminative LRs (backbone=5e-06, head=3e-04)
- Scheduler: Cosine annealing with 8% linear warmup
- Loss: Focal Loss (γ=2.0) + effective-number class weights
- Augmentation: Gaussian noise, time stretching, pitch shift, Mixup
- Epochs trained: 4 (early stopping on macro-F1)
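The loss recipe above combines Focal Loss (Lin et al. 2017) with effective-number class weights (Cui et al. 2019). A minimal sketch of both, not the repository's exact implementation; function names, the β value, and the mean-1 weight normalization are assumptions:

```python
import torch
import torch.nn.functional as F

def effective_number_weights(class_counts, beta=0.9999):
    """Per-class weights ∝ (1 - β) / (1 - β^n_c), normalized to mean 1 (Cui et al. 2019)."""
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    effective_num = (1.0 - beta ** counts) / (1.0 - beta)
    w = 1.0 / effective_num
    return w * len(counts) / w.sum()

def focal_loss(logits, targets, weights, gamma=2.0):
    """Class-weighted focal loss: (1 - p_t)^γ scales down the easy, confident examples."""
    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets, weight=weights, reduction="none")  # per-sample weighted CE
    pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()        # p of the true class
    return ((1.0 - pt) ** gamma * ce).mean()
```

With γ=0 and uniform weights this reduces to ordinary cross-entropy; larger γ focuses training on hard, misclassified samples, which helps with weak classes such as "happy".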

Scientific References

| Method | Reference |
|---|---|
| WavLM backbone | Chen et al. (2022), IEEE JSTSP |
| Weighted layer aggregation | Yang et al. (2021), INTERSPEECH |
| Attentive statistics pooling | Okabe et al. (2018), INTERSPEECH |
| Focal Loss | Lin et al. (2017), ICCV |
| Effective-number weighting | Cui et al. (2019), CVPR |
| Mixup augmentation | Zhang et al. (2018), ICLR |

Usage

```python
import torch, librosa
from transformers import AutoFeatureExtractor

# Load
feature_extractor = AutoFeatureExtractor.from_pretrained("shrey416/DiSER")
model = ...  # load the fine-tuned model from best_model.pt
model.eval()

# Infer: pad/truncate to the model's fixed 4 s (64,000-sample) input
y, sr = librosa.load("audio.wav", sr=16000)
inputs = feature_extractor(y, sampling_rate=16000, return_tensors="pt",
                           padding="max_length", max_length=64000, truncation=True)
with torch.no_grad():
    logits = model(inputs["input_values"])
emotions = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
print(emotions[logits.argmax(-1).item()])
```
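The snippet above truncates anything longer than 4 s. For longer recordings, one simple option is to split the waveform into overlapping 4 s windows, run each through the model, and average the logits. A sketch of the windowing step (the helper name `chunk_audio` and the 2 s hop are assumptions, not part of this repository):

```python
import numpy as np

def chunk_audio(y, sr=16000, win_s=4.0, hop_s=2.0):
    """Split a 1-D waveform into fixed-length windows; zero-pad if shorter than one window."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    if len(y) <= win:
        return [np.pad(y, (0, win - len(y)))]
    return [y[s:s + win] for s in range(0, len(y) - win + 1, hop)]
```

Each returned chunk is exactly 64,000 samples, so it can be fed to the feature extractor without further padding.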