# DiSER v2.0 – Speech Emotion Recognition (ONNX Model)

This is the ONNX export of the fine-tuned microsoft/wavlm-base-plus model for 7-class speech emotion recognition.

## Performance (ONNX Runtime on CPU)

| Metric | Score |
|---|---|
| ONNX Test Accuracy | 15.17% |
| ONNX Macro F1 | 3.92% |
| ONNX CPU latency | 1162 ms / sample |

## Original PyTorch Model Performance (on Test Set)

| Metric | Score |
|---|---|
| Test Accuracy | 75.73% |
| Macro F1 | 73.59% |
| Weighted F1 | 73.97% |
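Macro F1 averages the per-class F1 scores with equal weight, while weighted F1 weights each class by its support, so a badly handled minority class hurts the macro score more. A minimal NumPy sketch of both metrics (the toy labels below are illustrative, not taken from the test set):

```python
import numpy as np

def macro_and_weighted_f1(y_true, y_pred, num_classes):
    """Return (macro F1, weighted F1) computed per class from scratch."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s, support = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        support.append(np.sum(y_true == c))
    f1s, support = np.array(f1s), np.array(support, dtype=float)
    return float(f1s.mean()), float((f1s * support).sum() / support.sum())

# Toy example: class 1 is rare and never predicted, so macro F1 drops harder
macro, weighted = macro_and_weighted_f1([0, 0, 0, 0, 1], [0, 0, 0, 0, 0], num_classes=2)
```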

### Per-class recall (original PyTorch model)

| Emotion | Recall |
|---|---|
| angry | 92.2% |
| disgust | 84.0% |
| fear | 77.7% |
| happy | 24.6% |
| neutral | 91.9% |
| sad | 70.9% |
| surprise | 97.0% |
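Per-class recall like the table above can be reproduced from model predictions with `sklearn.metrics.recall_score(..., average=None)`, or by hand as in this short sketch (the toy labels are illustrative only):

```python
import numpy as np

def per_class_recall(y_true, y_pred, num_classes):
    """Recall for class c = fraction of true-c samples predicted as c."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        recalls.append(float((y_pred[mask] == c).mean()) if mask.any() else 0.0)
    return recalls

# Toy example exercising 3 of the 7 emotion classes
y_true = [0, 0, 3, 3, 3, 6]
y_pred = [0, 1, 3, 0, 0, 6]
recalls = per_class_recall(y_true, y_pred, num_classes=7)
```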

## Training Details (Original PyTorch Model)

- Training set: 11,491 samples (stratified speaker-independent split)
- Test set: 4,911 samples (no speaker overlap with train)
- Optimizer: AdamW with discriminative learning rates (backbone = 2e-6, head = 5e-5)
- Scheduler: cosine annealing with 15% linear warmup
- Loss: Focal Loss (γ = 2.0) + effective-number class weights
- Augmentation: Gaussian noise, time stretching, pitch shifting, Mixup
- Epochs trained: 41 (early stopping on macro F1)
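The loss above combines Focal Loss (Lin et al., 2017) with effective-number class weights (Cui et al., 2019). A minimal NumPy sketch of both pieces — the class counts and logits below are hypothetical, not the actual training statistics:

```python
import numpy as np

def effective_number_weights(class_counts, beta=0.999):
    """Cui et al. (2019): w_c proportional to (1 - beta) / (1 - beta**n_c),
    normalized so the weights sum to the number of classes.
    Rare classes receive larger weights."""
    counts = np.asarray(class_counts, dtype=np.float64)
    w = (1.0 - beta) / (1.0 - np.power(beta, counts))
    return w / w.sum() * len(counts)

def focal_loss(logits, targets, class_weights, gamma=2.0):
    """Mean weighted focal loss: -w_c * (1 - p_t)**gamma * log(p_t).
    The (1 - p_t)**gamma factor down-weights easy, confident examples."""
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    p_t = probs[np.arange(len(targets)), targets]
    return float(np.mean(-class_weights[targets]
                         * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)))

# Hypothetical per-class sample counts for the 7 emotions
weights = effective_number_weights([3000, 700, 900, 1400, 2500, 1500, 1491])
logits = np.array([[3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
easy_loss = focal_loss(logits, np.array([0]), weights)  # confident, correct
hard_loss = focal_loss(logits, np.array([1]), weights)  # confident, wrong
```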

## Scientific References

| Method | Reference |
|---|---|
| WavLM backbone | Chen et al. (2022), IEEE JSTSP |
| Weighted layer aggregation | Yang et al. (2021), INTERSPEECH |
| Attentive statistics pooling | Okabe et al. (2018), INTERSPEECH |
| Focal Loss | Lin et al. (2017), ICCV |
| Effective-number weighting | Cui et al. (2019), CVPR |
| Mixup augmentation | Zhang et al. (2018), ICLR |
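Of the methods above, attentive statistics pooling (Okabe et al., 2018) is the one that turns variable-length frame features into a fixed utterance embedding: an attention network scores each frame, and the attention-weighted mean and standard deviation are concatenated. A minimal NumPy sketch under assumed (hypothetical) dimensions — the actual model's attention parameters are learned:

```python
import numpy as np

def attentive_stats_pooling(frames, W, b, v):
    """frames: (T, D) frame-level features.
    Scores e_t = v . tanh(W @ h_t + b); alpha = softmax(e).
    Returns the attention-weighted mean and std concatenated, shape (2D,)."""
    scores = v @ np.tanh(W @ frames.T + b[:, None])   # (T,)
    scores = scores - scores.max()                    # stable softmax
    alpha = np.exp(scores) / np.exp(scores).sum()     # (T,) attention weights
    mean = alpha @ frames                             # (D,) weighted mean
    var = alpha @ (frames - mean) ** 2                # (D,) weighted variance
    return np.concatenate([mean, np.sqrt(var + 1e-9)])

rng = np.random.default_rng(0)
T, D, H = 50, 8, 16  # frames, feature dim, attention hidden dim (hypothetical)
frames = rng.standard_normal((T, D))
pooled = attentive_stats_pooling(frames, rng.standard_normal((H, D)),
                                 rng.standard_normal(H), rng.standard_normal(H))
```

With `v = 0` the attention weights are uniform and the output reduces to the plain per-dimension mean and std, which is a handy sanity check.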

## Usage

```python
import onnxruntime as ort
import numpy as np
import librosa
from transformers import AutoFeatureExtractor  # needed for feature extraction

# Constants (should match the training config)
SAMPLE_RATE = 16000
MAX_DURATION = 4.0  # seconds
MAX_LEN = int(SAMPLE_RATE * MAX_DURATION)
EMOTIONS = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

# Load ONNX model and the matching feature extractor
session = ort.InferenceSession("ser_wavlm.onnx", providers=['CPUExecutionProvider'])
feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")

def load_and_preprocess(path: str) -> np.ndarray:
    y, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True, duration=MAX_DURATION + 1.0)
    if len(y) < 100:
        return np.zeros(MAX_LEN, dtype=np.float32)
    # Trim leading/trailing silence; keep the original if trimming removed too much
    y_trim, _ = librosa.effects.trim(y, top_db=25)
    if len(y_trim) > 100:
        y = y_trim
    if len(y) >= MAX_LEN:
        # Center-crop long clips
        start = (len(y) - MAX_LEN) // 2
        y = y[start: start + MAX_LEN]
    else:
        # Center-pad short clips by reflection
        pad = MAX_LEN - len(y)
        y = np.pad(y, (pad // 2, pad - pad // 2),
                   mode="reflect" if len(y) > 1 else "constant")
    # RMS-normalize to a fixed level
    rms = np.sqrt(np.mean(y ** 2)) + 1e-9
    y = (y / rms) * 0.1
    return y.astype(np.float32)

def predict_onnx(audio_path: str):
    waveform = load_and_preprocess(audio_path)
    inputs = feature_extractor(
        waveform,
        sampling_rate=SAMPLE_RATE,
        return_tensors="np",  # ONNX Runtime expects NumPy arrays
        padding="max_length",
        max_length=MAX_LEN,
        truncation=True,
    )
    onnx_inputs = {"input_values": inputs["input_values"].astype(np.float32)}
    logits = session.run(None, onnx_inputs)[0]
    # Numerically stable softmax over the class dimension
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)
    predicted_id = int(np.argmax(probs, axis=1)[0])
    return EMOTIONS[predicted_id], probs[0, predicted_id]

# Example usage:
# emotion, confidence = predict_onnx("path/to/your/audio.wav")
# print(f"Predicted emotion: {emotion} with confidence {confidence:.2f}")
```