# KWS Zipformer 3M – CoreML INT8

Streaming, zero-shot, open-vocabulary keyword spotting for iOS / macOS / visionOS. Exported from icefall's KWS-finetuned Zipformer transducer (gigaspeech, 3.49M parameters) to CoreML with INT8 palettized weights, FP16 compute, and an iOS 17+ minimum deployment target.

Given an arbitrary list of English keywords at runtime (no retraining required), the model emits a match when it hears one. Runs ~26× real-time on Apple Silicon CPU + Neural Engine.
## Model

| property | value |
|---|---|
| Architecture | Zipformer2 encoder + stateless transducer decoder + joiner |
| Parameters | 3.49M |
| Quantization | INT8 k-means palettization (encoder + joiner); decoder FP16 |
| Compute precision | FP16 |
| Format | .mlmodelc (pre-compiled, ship-ready) |
| Min deployment target | iOS 17 / macOS 14 / visionOS 1 |
| Sample rate | 16 kHz |
| Feature | 80-dim Kaldi fbank, 25 ms window / 10 ms hop |
| Chunk size | 320 ms (16 output frames × 20 ms each) |
| Left context | 64 subsampled frames (~2.5 s) |
| Vocab | 500 BPE tokens |
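The front end above (80-dim log-mel fbank, 25 ms window / 10 ms hop, 16 kHz) can be approximated in plain numpy. This is a rough sketch only: it is not bit-exact with Kaldi fbank, which additionally applies dithering, a Povey window, and snip-edges framing; for exact features use a Kaldi-compatible extractor such as kaldi-native-fbank.

```python
import numpy as np

def hz_to_mel(hz):
    return 1127.0 * np.log(1.0 + hz / 700.0)

def mel_filterbank(num_bins=80, n_fft=512, sr=16000):
    # Triangular mel filters spanning 20 Hz .. Nyquist.
    mels = np.linspace(hz_to_mel(20), hz_to_mel(sr / 2), num_bins + 2)
    hz = 700.0 * (np.exp(mels / 1127.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((num_bins, n_fft // 2 + 1))
    for i in range(num_bins):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def fbank(wave, sr=16000, frame_len=400, hop=160, n_fft=512, num_bins=80):
    """Approximate 80-dim log-mel features: 25 ms window / 10 ms hop at 16 kHz."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hamming(frame_len)
    fb = mel_filterbank(num_bins, n_fft, sr)
    feats = np.empty((n_frames, num_bins), dtype=np.float32)
    for t in range(n_frames):
        frame = wave[t * hop : t * hop + frame_len] * window
        spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        feats[t] = np.log(fb @ spec + 1e-10)
    return feats
```

One second of 16 kHz audio yields 98 frames with these settings, so a 320 ms chunk plus right context maps onto the 45-frame encoder input.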
## Files

| file | size | description |
|---|---|---|
| encoder.mlmodelc | 3.3 MB | Zipformer2 streaming encoder (45 × 80 mel → 8 × 320 joiner-space) with 36 layer cache tensors + ConvNeXt pad + processed-lens state |
| decoder.mlmodelc | 525 KB | Stateless 2-token-context predictor + decoder_proj |
| joiner.mlmodelc | 160 KB | output_linear(tanh(enc + dec)) → 500 logits |
| bpe.model | 239 KB | SentencePiece BPE-500 tokenizer (icefall gigaspeech) |
| tokens.txt | 4.9 KB | Token id → subword map |
| commands_small.txt | 0.2 KB | Example keyword list (20 short commands) |
| commands_large.txt | 5.9 KB | Example keyword list (248 commands) |
| config.json | 3.8 KB | Fbank params, encoder cache shapes, default KWS thresholds |
## Performance

Measured on Apple Silicon (CPU + Neural Engine) with tuned defaults `ac_threshold=0.15`, `context_score=0.5`, `num_trailing_blanks=1`.
### Latency (per call)

| component | latency |
|---|---|
| Encoder (45 × 80 mel → 8 × 320 joiner-space) | 5.4 ms / 320 ms chunk |
| Decoder (1 step, 2-token context) | 0.08 ms |
| Joiner (1 frame → 500 logits) | 0.13 ms |
| RTF (encoder only, streaming) | 0.038 (~26× real-time) |
### Accuracy (LibriSpeech test-clean, 12 keywords, 158 positive + 60 negative utterances)
| keyword | N | recall |
|---|---|---|
| WHICH | 27 | 1.00 |
| LITTLE | 17 | 0.94 |
| BEFORE | 18 | 1.00 |
| GREAT | 19 | 0.95 |
| LIGHT | 15 | 1.00 |
| YOUNG | 15 | 1.00 |
| THROUGH | 15 | 0.93 |
| WORLD | 15 | 0.93 |
| ALWAYS | 15 | 0.93 |
| PEOPLE | 15 | 1.00 |
| THOUGHT | 15 | 0.87 |
| MISTER | 17 | 0.00 |
| TOTAL | 203 | 0.88 |
False positive rate: 0.27 / utterance on 60 random negative utterances. CoreML INT8 output agrees with the PyTorch FP32 reference on 99% of utterances; the drift introduced by INT8 palettization and FP16 compute is not systematic.
Note on "MISTER": SentencePiece tokenizes it as `[▁MI, S, TER]` (3 tokens). Stateless transducers rarely lock onto 3-token sequences in beam search. For production wake words, prefer single-token keywords.
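Before shipping a keyword list, it can therefore be worth flagging entries that tokenize into many BPE pieces. A small helper (hypothetical, not part of this bundle) sketches the idea; in practice `tokenize` would come from loading `bpe.model` with the sentencepiece library, and the stub tokenizations below are illustrative rather than actual `bpe.model` output (only the "MISTER" split is quoted from this card).

```python
def flag_long_keywords(keywords, tokenize, max_pieces=2):
    """Return keywords whose BPE tokenization exceeds max_pieces.

    `tokenize` maps a keyword to its list of BPE pieces; in practice:
        sp = spm.SentencePieceProcessor(model_file="bpe.model")
        tokenize = lambda w: sp.encode(w, out_type=str)
    """
    return {w: tokenize(w) for w in keywords if len(tokenize(w)) > max_pieces}

# Stub tokenizer: "MISTER" -> 3 pieces as documented above;
# "WHICH" -> 1 piece (illustrative).
demo = {"MISTER": ["\u2581MI", "S", "TER"], "WHICH": ["\u2581WHICH"]}
flagged = flag_long_keywords(demo, demo.get)
# "MISTER" is flagged; the single-piece keyword is not.
```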
## Memory

Loading all three .mlmodelc models costs a peak RSS delta of ≈63 MB on macOS. After a 218-utterance streaming workload, total memory growth is +127 MB.
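For reproducing such numbers, one simple approach is sampling peak RSS around model load via the stdlib `resource` module. A sketch (note the platform quirk: `ru_maxrss` is reported in bytes on macOS but kilobytes on Linux):

```python
import resource
import sys

def peak_rss_mb():
    """Peak resident set size of this process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is bytes on macOS, kilobytes on Linux.
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return peak / divisor

before = peak_rss_mb()
# ... load the three CompiledMLModel instances here ...
after = peak_rss_mb()
print(f"model load RSS delta: {after - before:.1f} MB")
```

Since `ru_maxrss` is a high-water mark, it only captures growth, which is what the numbers above report.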
## Usage

### Python (coremltools)

```python
import json
from pathlib import Path

import coremltools as ct
import numpy as np

model_dir = Path("./KWS-Zipformer-3M-CoreML-INT8")
config = json.loads((model_dir / "config.json").read_text())

encoder = ct.models.CompiledMLModel(str(model_dir / "encoder.mlmodelc"))
decoder = ct.models.CompiledMLModel(str(model_dir / "decoder.mlmodelc"))
joiner = ct.models.CompiledMLModel(str(model_dir / "joiner.mlmodelc"))

# Build zero state from config
state = {}
for name, shape in zip(config["encoder"]["layerStateNames"],
                       config["encoder"]["layerStateShapes"]):
    state[name] = np.zeros(shape, dtype=np.float32)
state["cached_embed_left_pad"] = np.zeros(
    config["encoder"]["cachedEmbedLeftPadShape"], dtype=np.float32)
state["processed_lens"] = np.zeros((1,), dtype=np.int32)

# Feed 45 mel frames (80-dim fbank) per chunk
x = np.zeros((1, 45, 80), dtype=np.float32)  # replace with real fbank
out = encoder.predict({"x": x, **state})
encoder_out = out["encoder_out"]  # (1, 8, 320) in joiner space
# Stream: feed encoder_out[0, t] into decoder/joiner + beam search,
# carrying the returned cache tensors forward as the next chunk's state
```
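The decoding step left open above can be sketched as a frame-synchronous greedy transducer search (plain argmax, no keyword boosting, single emission per frame). The `run_decoder` / `run_joiner` callables stand in for the real `decoder.predict` / `joiner.predict` calls, whose exact input/output tensor names are an assumption here and should be read from config.json.

```python
import numpy as np

def greedy_search(encoder_out, run_decoder, run_joiner, blank_id=0, context=2):
    """Greedy transducer decoding over one chunk.

    encoder_out: (T, joiner_dim) array of encoder frames.
    run_decoder: maps the last `context` token ids -> decoder output vector.
    run_joiner:  maps (enc_frame, dec_out) -> (vocab,) logits.
    """
    hyp = [blank_id] * context          # seed the 2-token context with blanks
    dec_out = run_decoder(hyp[-context:])
    for t in range(encoder_out.shape[0]):
        logits = run_joiner(encoder_out[t], dec_out)
        tok = int(np.argmax(logits))
        if tok != blank_id:             # emit and advance the predictor
            hyp.append(tok)
            dec_out = run_decoder(hyp[-context:])
    return hyp[context:]                # emitted token ids
```

The stateless predictor makes this cheap: the decoder is re-run only when a non-blank token is emitted, matching the sub-0.1 ms per-step decoder latency above.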
### Swift (speech-swift)

```swift
import SpeechSwift

let model = try await KWSZipformerModel.fromPretrained(
    "aufklarer/KWS-Zipformer-3M-CoreML-INT8"
)
let stream = model.streamingSession(keywords: ["HEY SONIQO", "STOP", "LIGHTS ON"])
try stream.feed(audio: pcm16k)
for match in stream.emissions {
    print("matched: \(match.phrase) at \(match.timestamp)")
}
```
## Reference Python decoder
The upstream Aho-Corasick ContextGraph + boost-cancellation beam search algorithm is available as a dependency-free pure-Python reference for porting to other runtimes.
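That reference is not reproduced here, but its core idea — an Aho-Corasick automaton built over each keyword's BPE token-id sequence and stepped once per emitted token — can be sketched as follows. Boost scores and partial-match cancellation from the upstream ContextGraph are omitted for brevity.

```python
from collections import deque

class KeywordTrie:
    """Minimal Aho-Corasick automaton over token-id sequences.

    Call step() with each token the decoder emits; a non-empty
    return value lists the keywords that just completed.
    """

    def __init__(self, keywords):
        # keywords: dict phrase -> list of token ids
        self.next = [{}]            # per-node goto transitions
        self.fail = [0]             # suffix (failure) links
        self.out = [[]]             # keywords ending at each node
        for phrase, tokens in keywords.items():
            node = 0
            for tok in tokens:
                node = self.next[node].setdefault(tok, self._new_node())
            self.out[node].append(phrase)
        self._build_fail_links()
        self.state = 0

    def _new_node(self):
        self.next.append({})
        self.fail.append(0)
        self.out.append([])
        return len(self.next) - 1

    def _build_fail_links(self):
        q = deque(self.next[0].values())      # depth-1 nodes fail to root
        while q:
            node = q.popleft()
            for tok, child in self.next[node].items():
                f = self.fail[node]
                while f and tok not in self.next[f]:
                    f = self.fail[f]
                self.fail[child] = self.next[f].get(tok, 0)
                # Inherit outputs reachable via the suffix link.
                self.out[child] += self.out[self.fail[child]]
                q.append(child)

    def step(self, tok):
        while self.state and tok not in self.next[self.state]:
            self.state = self.fail[self.state]
        self.state = self.next[self.state].get(tok, 0)
        return self.out[self.state]
```

A production port would attach per-node boost scores (cf. `context_score`) and subtract accumulated boosts when a partial match dies, which is the cancellation half of the upstream algorithm.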
## Source

Exported from icefall-kws-zipformer-gigaspeech-20240219 (Apache-2.0), specifically the exp-finetune/pretrained.pt KWS-finetuned checkpoint, published in pkufool/keyword-spotting-models v0.11.

The reference ONNX export is csukuangfj/sherpa-onnx-kws-zipformer-gigaspeech-3.3M-2024-01-01; this CoreML bundle matches its encoder / decoder / joiner outputs within FP16 tolerance (≤1e-3 on encoder, ≤1e-4 on decoder, ≤1e-3 on joiner).
## Links
- speech-swift β Apple SDK
- soniqo.audio β website
- blog