+---
+license: mit
+tags:
+  - coreml
+  - tts
+  - vibevoice
+  - apple-silicon
+  - semantic-encoder
+---
+# VibeVoice Semantic Encoder (CoreML)
+Streaming semantic encoder for [VibeVoice](https://huggingface.co/microsoft/VibeVoice-1.5B) TTS, exported as a stateful CoreML MLPackage.
+The same package serves both the 1.5B and 7B models, which use identical encoder weights and produce a 128-dim output.
+## Usage
+Auto-downloaded by [vibevoice-mlx](https://github.com/gafiatulin/vibevoice-mlx) when CoreML is available:
+```bash
+pip install mlx coremltools soundfile transformers huggingface_hub safetensors
+git clone https://github.com/gafiatulin/vibevoice-mlx && cd vibevoice-mlx
+# CoreML semantic encoder is auto-downloaded on first use
+python run/e2e_pipeline.py --model microsoft/VibeVoice-1.5B --text "Hello!" --output hello.wav
+```
+Without CoreML (Linux, or no coremltools), the pipeline falls back to a pure MLX semantic encoder.
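The fallback described above can be pictured as a simple availability check. This is a minimal sketch, not the actual selection logic in vibevoice-mlx: the function name `pick_semantic_backend` and the returned labels are hypothetical, and it assumes "CoreML is available" means running on macOS with `coremltools` importable.

```python
import importlib.util
import platform


def pick_semantic_backend():
    """Return "coreml" when CoreML looks usable, otherwise "mlx".

    Hypothetical helper: the real pipeline may apply different checks.
    """
    on_macos = platform.system() == "Darwin"
    # find_spec returns None when the package is not installed.
    has_coremltools = importlib.util.find_spec("coremltools") is not None
    return "coreml" if (on_macos and has_coremltools) else "mlx"


print(pick_semantic_backend())
```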
+## Architecture
+- **Type**: Causal σ-VAE encoder with streaming conv caches
+- **Input**: 3200 audio samples (one speech frame at 24 kHz, about 133 ms of audio)
+- **Output**: 128-dim semantic features
+- **State**: 34 conv cache buffers (`ct.StateType`; requires iOS 18+ / macOS 15+)
+- **Compute units**: CPU_AND_GPU (ANE not supported for stateful models)
+- **Size**: 657 MB (fp16 weights)
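The properties listed above suggest a frame-by-frame loop in which one state object carries the conv caches between calls. The sketch below is an illustration under stated assumptions, not the code used by vibevoice-mlx: `split_into_frames` and `encode_stream` are hypothetical names, zero-padding the final frame is an assumption, and the input name, shape, and dtype are read from or guessed for the model spec rather than taken from this card. It uses the coremltools 8 calls `MLModel(...)`, `make_state()`, and `predict(..., state=...)`.

```python
FRAME_LEN = 3200       # samples per speech frame (from the list above)
SAMPLE_RATE = 24_000   # Hz


def split_into_frames(samples, frame_len=FRAME_LEN):
    """Split a 1-D sequence into fixed-size frames.

    The last frame is zero-padded (an assumption about tail handling).
    """
    samples = list(samples)
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        frame += [0.0] * (frame_len - len(frame))
        frames.append(frame)
    return frames


def encode_stream(mlpackage_path, frames):
    """Feed frames to the stateful model one at a time (macOS only)."""
    # Imported lazily so the framing helper works without these packages.
    import numpy as np
    import coremltools as ct

    model = ct.models.MLModel(
        mlpackage_path, compute_units=ct.ComputeUnit.CPU_AND_GPU
    )
    state = model.make_state()  # holds the conv caches across predict calls

    # Input name and shape come from the spec; a fixed shape holding
    # exactly FRAME_LEN values and a float32 dtype are assumptions.
    spec_input = model.get_spec().description.input[0]
    shape = tuple(spec_input.type.multiArrayType.shape)

    outputs = []
    for frame in frames:
        x = np.asarray(frame, dtype=np.float32).reshape(shape)
        outputs.append(model.predict({spec_input.name: x}, state=state))
    return outputs


frames = split_into_frames([1.0] * 8000)
print(len(frames), "frames of", FRAME_LEN / SAMPLE_RATE, "s each")
```

Reusing one `state` for the whole utterance is what makes the encoder streaming; creating a fresh state with `make_state()` would reset the caches, which is presumably what a new utterance needs.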
+## Performance
+| Backend  | Latency       | Pipeline RTF (1.5B INT8) |
+|----------|---------------|--------------------------|
+| CoreML   | 4.8 ms/frame  | 3.1x |
+| Pure MLX | 11.5 ms/frame | 2.6x |
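To put the per-frame latencies in context, they can be compared with the amount of audio each frame covers (3200 samples at 24 kHz). The arithmetic below is a worked sketch using only figures from this card; `realtime_share` is a hypothetical helper, and the result covers the encoder alone, not the pipeline RTF column.

```python
FRAME_LEN = 3200      # samples per frame
SAMPLE_RATE = 24_000  # Hz

# Audio covered by one frame, in milliseconds (about 133.3 ms).
frame_ms = FRAME_LEN / SAMPLE_RATE * 1000


def realtime_share(latency_ms):
    """Fraction of a frame's audio duration spent encoding that frame."""
    return latency_ms / frame_ms


coreml_share = realtime_share(4.8)   # CoreML row of the table
mlx_share = realtime_share(11.5)     # Pure MLX row of the table

print(f"frame duration: {frame_ms:.1f} ms")
print(f"CoreML encoder share: {coreml_share:.1%}")
print(f"MLX encoder share:    {mlx_share:.1%}")
```

Under these numbers the encoder uses only a small fraction of each frame's time budget on either backend, so the remaining pipeline stages account for most of the end-to-end cost.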
+## Source
+Built from [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B) using [vibevoice-coreml](https://github.com/gafiatulin/vibevoice-coreml) conversion scripts.