DiSER v2.0 – Speech Emotion Recognition (ONNX Model)
This is the ONNX export of the fine-tuned microsoft/wavlm-base-plus model for
7-class speech emotion recognition.
Performance (ONNX Runtime on CPU)
| Metric | Score |
|---|---|
| ONNX Test Accuracy | 15.17% |
| ONNX Macro F1 | 3.92% |
| ONNX CPU latency | 1162 ms / sample |
Original PyTorch Model Performance (on Test Set)
| Metric | Score |
|---|---|
| Test Accuracy | 75.73% |
| Macro F1 | 73.59% |
| Weighted F1 | 73.97% |
Per-class recall (from original PyTorch model)
| Emotion | Recall |
|---|---|
| angry | 92.2% |
| disgust | 84.0% |
| fear | 77.7% |
| happy | 24.6% |
| neutral | 91.9% |
| sad | 70.9% |
| surprise | 97.0% |
Training Details (Original PyTorch Model)
- Training set: 11,491 samples (stratified speaker-independent split)
- Test set: 4,911 samples (no speaker overlap with train)
- Optimizer: AdamW with discriminative LRs (backbone=2e-06, head=5e-05)
- Scheduler: Cosine annealing with 15% linear warmup
- Loss: Focal Loss (γ=2.0) + effective-number class weights
- Augmentation: Gaussian noise, time stretching, pitch shift, Mixup
- Epochs trained: 41 (early stopping on macro-F1)
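The loss setup listed above (Focal Loss with γ=2.0 plus effective-number class weights, per Lin et al. 2017 and Cui et al. 2019) can be sketched in NumPy. This is an illustrative sketch, not the actual training code; the function names and the `beta` value are assumptions.

```python
import numpy as np

def effective_number_weights(class_counts, beta=0.999):
    """Per-class weights from Cui et al. (2019): w_c proportional to (1 - beta) / (1 - beta^n_c)."""
    counts = np.asarray(class_counts, dtype=np.float64)
    weights = (1.0 - beta) / (1.0 - np.power(beta, counts))
    # Normalize so the weights sum to the number of classes
    return weights * len(counts) / weights.sum()

def focal_loss(logits, targets, class_weights, gamma=2.0):
    """Mean focal loss over a batch: -w_c * (1 - p_t)^gamma * log(p_t)."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    p_t = probs[np.arange(len(targets)), targets]          # probability of the true class
    w = class_weights[targets]
    # The (1 - p_t)^gamma factor down-weights easy, confidently-correct examples
    return float(np.mean(-w * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)))
```

Rare classes (e.g. the under-performing `happy` class) receive larger weights, and the focal term keeps well-classified samples from dominating the gradient.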
Scientific References
| Method | Reference |
|---|---|
| WavLM backbone | Chen et al. (2022), IEEE JSTSP |
| Weighted layer aggregation | Yang et al. (2021), INTERSPEECH |
| Attentive statistics pooling | Okabe et al. (2018), INTERSPEECH |
| Focal Loss | Lin et al. (2017), ICCV |
| Effective-number weighting | Cui et al. (2019), CVPR |
| Mixup augmentation | Zhang et al. (2018), ICLR |
Usage
```python
import onnxruntime as ort
import numpy as np
import librosa
from transformers import AutoFeatureExtractor  # needed for feature extraction

# Constants (must match the training config)
SAMPLE_RATE = 16000
MAX_DURATION = 4.0
MAX_LEN = int(SAMPLE_RATE * MAX_DURATION)
EMOTIONS = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

# Load the ONNX model and the matching feature extractor
session = ort.InferenceSession("ser_wavlm.onnx", providers=['CPUExecutionProvider'])
feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")

def load_and_preprocess(path: str) -> np.ndarray:
    y, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True, duration=MAX_DURATION + 1.0)
    if len(y) < 100:
        return np.zeros(MAX_LEN, dtype=np.float32)
    # Trim leading/trailing silence
    y_trim, _ = librosa.effects.trim(y, top_db=25)
    if len(y_trim) > 100:
        y = y_trim
    if len(y) >= MAX_LEN:
        # Center-crop long clips
        start = (len(y) - MAX_LEN) // 2
        y = y[start: start + MAX_LEN]
    else:
        # Reflect-pad short clips to MAX_LEN
        pad = MAX_LEN - len(y)
        y = np.pad(y, (pad // 2, pad - pad // 2), mode="reflect" if len(y) > 1 else "constant")
    # RMS-normalize to a target level of 0.1
    rms = np.sqrt(np.mean(y ** 2)) + 1e-9
    y = (y / rms) * 0.1
    return y.astype(np.float32)

def predict_onnx(audio_path: str):
    waveform = load_and_preprocess(audio_path)
    inputs = feature_extractor(
        waveform,
        sampling_rate=SAMPLE_RATE,
        return_tensors="np",  # ONNX Runtime expects NumPy arrays
        padding="max_length",
        max_length=MAX_LEN,
        truncation=True,
    )
    onnx_inputs = {"input_values": inputs["input_values"].astype(np.float32)}
    logits = session.run(None, onnx_inputs)[0]
    # Numerically stable softmax
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)
    predicted_id = int(np.argmax(probs, axis=1)[0])
    return EMOTIONS[predicted_id], float(probs[0, predicted_id])

# Example:
# emotion, confidence = predict_onnx("path/to/your/audio.wav")
# print(f"Predicted emotion: {emotion} with confidence {confidence:.2f}")
```