# KWS Zipformer 3M – CoreML INT8

Streaming, zero-shot, open-vocabulary keyword spotting for iOS / macOS / visionOS. Exported from icefall's KWS-finetuned Zipformer transducer (gigaspeech, 3.49M parameters) to CoreML with INT8 palettized weights, FP16 compute, and an iOS 17+ minimum deployment target.

Given an arbitrary list of English keywords at runtime (no retraining required), the model emits a match when it hears one. Runs ~26× real-time on Apple Silicon (CPU + Neural Engine).

## Model

| property | value |
|---|---|
| Architecture | Zipformer2 encoder + stateless transducer decoder + joiner |
| Parameters | 3.49M |
| Quantization | INT8 k-means palettization (encoder + joiner); decoder FP16 |
| Compute precision | FP16 |
| Format | `.mlmodelc` (pre-compiled, ship-ready) |
| Min deployment target | iOS 17 / macOS 14 / visionOS 1 |
| Sample rate | 16 kHz |
| Features | 80-dim Kaldi fbank, 25 ms window / 10 ms shift |
| Chunk size | 320 ms (8 output frames × 40 ms each) |
| Left context | 64 subsampled frames (~2.5 s) |
| Vocab | 500 BPE tokens |
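
For prototyping outside Apple's APIs, the 25/10 ms 80-dim log-mel front end can be approximated in pure NumPy. This is a sketch, not bit-exact with Kaldi fbank (icefall's front end uses a povey window, pre-emphasis, and dithering), so expect small feature-level differences:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, fmin=20.0):
    # Triangular filters spaced evenly on the mel scale.
    pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):
            fb[i, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i, j] = (r - j) / max(r - c, 1)
    return fb

def logmel_fbank(pcm, sr=16000, n_mels=80, frame_len=0.025,
                 frame_shift=0.010, n_fft=512):
    # 25 ms Hann-windowed frames every 10 ms -> power spectrum -> log-mel.
    win, hop = int(sr * frame_len), int(sr * frame_shift)
    n_frames = 1 + (len(pcm) - win) // hop
    frames = np.stack([pcm[i * hop:i * hop + win] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(win), n_fft)) ** 2
    return np.log(np.maximum(spec @ mel_filterbank(sr, n_fft, n_mels).T, 1e-10))
```

One second of 16 kHz audio yields 98 frames of 80 mel bins; a 320 ms chunk feeds 45 frames (including lookahead) to the encoder.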

## Files

| file | size | description |
|---|---|---|
| `encoder.mlmodelc` | 3.3 MB | Zipformer2 streaming encoder (45 × 80 mel → 8 × 320 joiner-space) with 36 layer cache tensors + ConvNeXt pad + processed-lens state |
| `decoder.mlmodelc` | 525 KB | Stateless 2-token-context predictor + decoder_proj |
| `joiner.mlmodelc` | 160 KB | `output_linear(tanh(enc + dec))` → 500 logits |
| `bpe.model` | 239 KB | SentencePiece BPE-500 tokenizer (icefall gigaspeech) |
| `tokens.txt` | 4.9 KB | Token id → subword map |
| `commands_small.txt` | 0.2 KB | Example keyword list (20 short commands) |
| `commands_large.txt` | 5.9 KB | Example keyword list (248 commands) |
| `config.json` | 3.8 KB | Fbank params, encoder cache shapes, default KWS thresholds |
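
A sketch of the `config.json` layout: the `encoder` keys match those read in the Python usage example, while the remaining field names and all placeholder values are illustrative, not the file's actual contents.

```json
{
  "features": { "sampleRate": 16000, "featureDim": 80, "frameLengthMs": 25, "frameShiftMs": 10 },
  "kws": { "ac_threshold": 0.15, "context_score": 0.5, "num_trailing_blanks": 1 },
  "encoder": {
    "layerStateNames": ["..."],
    "layerStateShapes": [["..."]],
    "cachedEmbedLeftPadShape": ["..."]
  }
}
```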

## Performance

Measured on Apple Silicon (CPU + Neural Engine) with the tuned defaults `ac_threshold=0.15`, `context_score=0.5`, `num_trailing_blanks=1`.

### Latency (per call)

| component | latency |
|---|---|
| Encoder (45 × 80 mel → 8 × 320 joiner-space) | 5.4 ms / 320 ms chunk |
| Decoder (1 step, 2-token context) | 0.08 ms |
| Joiner (1 frame → 500 logits) | 0.13 ms |
| RTF (encoder only, streaming) | 0.038 (~26× real-time) |

### Accuracy (LibriSpeech test-clean, 12 keywords, 158 positive + 60 negative utterances)

| keyword | N | recall |
|---|---|---|
| WHICH | 27 | 1.00 |
| LITTLE | 17 | 0.94 |
| BEFORE | 18 | 1.00 |
| GREAT | 19 | 0.95 |
| LIGHT | 15 | 1.00 |
| YOUNG | 15 | 1.00 |
| THROUGH | 15 | 0.93 |
| WORLD | 15 | 0.93 |
| ALWAYS | 15 | 0.93 |
| PEOPLE | 15 | 1.00 |
| THOUGHT | 15 | 0.87 |
| MISTER | 17 | 0.00 |
| **TOTAL** | **203** | **0.88** |

False positive rate: 0.27 / utterance on 60 random negative utterances. The CoreML INT8 output agrees with the PyTorch FP32 reference on 99% of utterances; the remaining disagreements are scattered rather than systematic quantization drift.

Note on "MISTER": SentencePiece tokenizes it as `[▁MI, S, TER]` (3 tokens). Stateless transducers rarely lock onto 3-token sequences in beam search. For production wake words, prefer single-token keywords.
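
Candidate keywords can be screened for token count up front. `vet_keywords` below is a hypothetical helper (not part of this bundle) written against any `encode(text) -> pieces` callable, such as a SentencePiece processor loaded from `bpe.model`:

```python
def vet_keywords(encode, keywords, max_tokens=2):
    """Flag keywords whose BPE split is too long to spot reliably.

    encode: callable mapping an uppercase phrase to a list of subword pieces.
    """
    report = {}
    for kw in keywords:
        pieces = encode(kw.upper())
        report[kw] = {"pieces": pieces, "ok": len(pieces) <= max_tokens}
    return report

# Wiring it to SentencePiece (standard sentencepiece Python API):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="bpe.model")
#   vet_keywords(lambda s: sp.encode(s, out_type=str), ["MISTER", "LIGHT"])
```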

## Memory

Loading all three .mlmodelc models adds a peak RSS delta of ≈63 MB on macOS. After a 218-utterance streaming workload, total growth is +127 MB.

## Usage

### Python (coremltools)

```python
import json
from pathlib import Path

import coremltools as ct
import numpy as np

model_dir = Path("./KWS-Zipformer-3M-CoreML-INT8")
config = json.loads((model_dir / "config.json").read_text())

encoder = ct.models.CompiledMLModel(str(model_dir / "encoder.mlmodelc"))
decoder = ct.models.CompiledMLModel(str(model_dir / "decoder.mlmodelc"))
joiner = ct.models.CompiledMLModel(str(model_dir / "joiner.mlmodelc"))

# Build zero state from config
state = {}
for name, shape in zip(config["encoder"]["layerStateNames"],
                       config["encoder"]["layerStateShapes"]):
    state[name] = np.zeros(shape, dtype=np.float32)
state["cached_embed_left_pad"] = np.zeros(
    config["encoder"]["cachedEmbedLeftPadShape"], dtype=np.float32)
state["processed_lens"] = np.zeros((1,), dtype=np.int32)

# Feed 45 mel frames (80-dim fbank) per chunk
x = np.zeros((1, 45, 80), dtype=np.float32)  # replace with real fbank
out = encoder.predict({"x": x, **state})
encoder_out = out["encoder_out"]  # (1, 8, 320) in joiner space
# Stream: feed encoder_out[0, t] into decoder/joiner + beam search
```
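
The per-frame decode loop can be sketched as a plain greedy search (the shipped reference uses beam search with keyword boosting instead). `decode_fn` and `join_fn` stand in for thin wrappers you would write around `decoder.predict` and `joiner.predict`; the wrapper shapes described here are assumptions, not documented model I/O:

```python
import numpy as np

def greedy_search(decode_fn, join_fn, encoder_out, blank_id=0, context_size=2):
    """Greedy (beam=1) transducer decode over one chunk of encoder frames.

    decode_fn: takes a (1, context_size) int32 array of the last emitted
               tokens, returns the decoder embedding.
    join_fn:   takes one encoder frame and the decoder embedding,
               returns a (vocab,) logit vector.
    """
    hyp = [blank_id] * context_size            # blank-padded start context
    dec_out = decode_fn(np.array([hyp[-context_size:]], dtype=np.int32))
    for t in range(encoder_out.shape[1]):      # iterate over output frames
        logits = join_fn(encoder_out[:, t], dec_out)
        tok = int(np.argmax(logits))
        if tok != blank_id:                    # emit and refresh decoder state
            hyp.append(tok)
            dec_out = decode_fn(np.array([hyp[-context_size:]], dtype=np.int32))
    return hyp[context_size:]
```

The decoder is only re-run when a non-blank token is emitted, which is why its 0.08 ms cost barely registers in the streaming RTF.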

### Swift (speech-swift)

```swift
import SpeechSwift

let model = try await KWSZipformerModel.fromPretrained(
    "aufklarer/KWS-Zipformer-3M-CoreML-INT8"
)
let stream = model.streamingSession(keywords: ["HEY SONIQO", "STOP", "LIGHTS ON"])
try stream.feed(audio: pcm16k)
for match in stream.emissions {
    print("matched: \(match.phrase) at \(match.timestamp)")
}
```

### Reference Python decoder

The upstream Aho-Corasick ContextGraph + boost-cancellation beam search algorithm is available as a dependency-free pure-Python reference for porting to other runtimes.
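
The core of that context graph is a keyword trie walked in lockstep with decoding. This is a deliberately simplified sketch: a plain trie over token ids, without the Aho-Corasick fail links or per-node boost scores of the upstream implementation.

```python
class ContextTrie:
    """Keyword trie over token ids, advanced one decoded token at a time.

    Every incoming token also restarts a fresh path from the root, so
    overlapping matches are still found without fail links.
    """

    def __init__(self, keywords):
        # keywords: list of token-id sequences, e.g. from the BPE tokenizer
        self.root = {}
        for kw in keywords:
            node = self.root
            for tok in kw:
                node = node.setdefault(tok, {})
            node["$"] = tuple(kw)  # leaf marker: a full keyword ends here

    def scan(self, tokens):
        matches, active = [], []
        for tok in tokens:
            active.append(self.root)   # a new match may start at any token
            next_active = []
            for node in active:
                child = node.get(tok)
                if child is None:
                    continue           # this partial match dies
                if "$" in child:
                    matches.append(child["$"])
                next_active.append(child)
            active = next_active
        return matches
```

In the real decoder the trie is consulted inside beam search, boosting hypotheses that advance along a keyword path and cancelling the boost when a path dies; the structure above is the part worth porting first.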

## Source

Exported from icefall-kws-zipformer-gigaspeech-20240219 (Apache-2.0), specifically the exp-finetune/pretrained.pt KWS-finetuned checkpoint, published in pkufool/keyword-spotting-models v0.11.

The reference ONNX export is csukuangfj/sherpa-onnx-kws-zipformer-gigaspeech-3.3M-2024-01-01; this CoreML bundle matches its encoder / decoder / joiner outputs within FP16 tolerance (≤1e-3 on encoder, ≤1e-4 on decoder, ≤1e-3 on joiner).
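
Those tolerances can be reproduced with a simple elementwise comparison once both runtimes' outputs are dumped to arrays; a minimal helper:

```python
import numpy as np

def max_abs_diff(a, b):
    """Largest elementwise deviation between two output tensors."""
    a, b = np.asarray(a, dtype=np.float32), np.asarray(b, dtype=np.float32)
    assert a.shape == b.shape, "compare like-shaped outputs"
    return float(np.max(np.abs(a - b)))
```

Run it per component (encoder, decoder, joiner) on identical inputs and compare against the thresholds above.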
