# KWS Zipformer 3M – CoreML INT8

Streaming, zero-shot, open-vocabulary keyword spotting for iOS / macOS / visionOS. Exported from icefall's KWS-finetuned Zipformer transducer (gigaspeech, 3.49M parameters) to CoreML with INT8 palettized weights, FP16 compute, and an iOS 17+ minimum deployment target.

Given an arbitrary list of English keywords at runtime (no retraining required), the model emits a match when it hears one. Runs ~26× real-time on Apple Silicon (CPU + Neural Engine).

## Model

| property | value |
|---|---|
| Architecture | Zipformer2 encoder + stateless transducer decoder + joiner |
| Parameters | 3.49M |
| Quantization | INT8 k-means palettization (encoder + joiner); decoder FP16 |
| Compute precision | FP16 |
| Format | `.mlmodelc` (pre-compiled, ship-ready) |
| Min deployment target | iOS 17 / macOS 14 / visionOS 1 |
| Sample rate | 16 kHz |
| Features | 80-dim Kaldi fbank, 25 ms window / 10 ms shift |
| Chunk size | 320 ms (8 output frames × 40 ms each) |
| Left context | 64 subsampled frames (~2.5 s) |
| Vocab | 500 BPE tokens |
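
For prototyping outside Apple's APIs, the 25/10 ms 80-dim log-mel front end can be approximated in pure NumPy. This is a sketch, not bit-exact with Kaldi fbank (icefall's front end uses a povey window, pre-emphasis, and dithering), so expect small feature-level differences:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, fmin=20.0):
    # Triangular filters spaced evenly on the mel scale.
    pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):
            fb[i, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i, j] = (r - j) / max(r - c, 1)
    return fb

def logmel_fbank(pcm, sr=16000, n_mels=80, frame_len=0.025,
                 frame_shift=0.010, n_fft=512):
    # 25 ms Hann-windowed frames every 10 ms -> power spectrum -> log-mel.
    win, hop = int(sr * frame_len), int(sr * frame_shift)
    n_frames = 1 + (len(pcm) - win) // hop
    frames = np.stack([pcm[i * hop:i * hop + win] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(win), n_fft)) ** 2
    return np.log(np.maximum(spec @ mel_filterbank(sr, n_fft, n_mels).T, 1e-10))
```

One second of 16 kHz audio yields 98 frames of 80 mel bins; a 320 ms chunk feeds 45 frames (including lookahead) to the encoder.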

## Files

| file | size | description |
|---|---|---|
| `encoder.mlmodelc` | 3.3 MB | Zipformer2 streaming encoder (45 × 80 mel → 8 × 320 joiner-space) with 36 layer cache tensors + ConvNeXt pad + processed-lens state |
| `decoder.mlmodelc` | 525 KB | Stateless 2-token-context predictor + decoder_proj |
| `joiner.mlmodelc` | 160 KB | `output_linear(tanh(enc + dec))` → 500 logits |
| `bpe.model` | 239 KB | SentencePiece BPE-500 tokenizer (icefall gigaspeech) |
| `tokens.txt` | 4.9 KB | Token id → subword map |
| `commands_small.txt` | 0.2 KB | Example keyword list (20 short commands) |
| `commands_large.txt` | 5.9 KB | Example keyword list (248 commands) |
| `config.json` | 3.8 KB | Fbank params, encoder cache shapes, default KWS thresholds |
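
A sketch of the `config.json` layout: the `encoder` keys match those read in the Python usage example, while the remaining field names and all placeholder values are illustrative, not the file's actual contents.

```json
{
  "features": { "sampleRate": 16000, "featureDim": 80, "frameLengthMs": 25, "frameShiftMs": 10 },
  "kws": { "ac_threshold": 0.15, "context_score": 0.5, "num_trailing_blanks": 1 },
  "encoder": {
    "layerStateNames": ["..."],
    "layerStateShapes": [["..."]],
    "cachedEmbedLeftPadShape": ["..."]
  }
}
```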

## Performance

Measured on Apple Silicon (CPU + Neural Engine) with the tuned defaults `ac_threshold=0.15`, `context_score=0.5`, `num_trailing_blanks=1`.

### Latency (per call)

| component | latency |
|---|---|
| Encoder (45 × 80 mel → 8 × 320 joiner-space) | 5.4 ms / 320 ms chunk |
| Decoder (1 step, 2-token context) | 0.08 ms |
| Joiner (1 frame → 500 logits) | 0.13 ms |
| RTF (encoder only, streaming) | 0.038 (~26× real-time) |

### Accuracy (LibriSpeech test-clean, 12 keywords, 158 positive + 60 negative utterances)

| keyword | N | recall |
|---|---|---|
| WHICH | 27 | 1.00 |
| LITTLE | 17 | 0.94 |
| BEFORE | 18 | 1.00 |
| GREAT | 19 | 0.95 |
| LIGHT | 15 | 1.00 |
| YOUNG | 15 | 1.00 |
| THROUGH | 15 | 0.93 |
| WORLD | 15 | 0.93 |
| ALWAYS | 15 | 0.93 |
| PEOPLE | 15 | 1.00 |
| THOUGHT | 15 | 0.87 |
| MISTER | 17 | 0.00 |
| **TOTAL** | **203** | **0.88** |

False positive rate: 0.27 / utterance on 60 random negative utterances. The CoreML INT8 output agrees with the PyTorch FP32 reference on 99% of utterances; the remaining disagreements are scattered rather than systematic quantization drift.

Note on "MISTER": SentencePiece tokenizes it as `[▁MI, S, TER]` (3 tokens). Stateless transducers rarely lock onto 3-token sequences in beam search. For production wake words, prefer single-token keywords.
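
Candidate keywords can be screened for token count up front. `vet_keywords` below is a hypothetical helper (not part of this bundle) written against any `encode(text) -> pieces` callable, such as a SentencePiece processor loaded from `bpe.model`:

```python
def vet_keywords(encode, keywords, max_tokens=2):
    """Flag keywords whose BPE split is too long to spot reliably.

    encode: callable mapping an uppercase phrase to a list of subword pieces.
    """
    report = {}
    for kw in keywords:
        pieces = encode(kw.upper())
        report[kw] = {"pieces": pieces, "ok": len(pieces) <= max_tokens}
    return report

# Wiring it to SentencePiece (standard sentencepiece Python API):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="bpe.model")
#   vet_keywords(lambda s: sp.encode(s, out_type=str), ["MISTER", "LIGHT"])
```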

## Memory

Loading all three .mlmodelc models adds a peak RSS delta of ≈63 MB on macOS. After a 218-utterance streaming workload, total growth is +127 MB.

## Usage

### Python (coremltools)

```python
import json
from pathlib import Path

import coremltools as ct
import numpy as np

model_dir = Path("./KWS-Zipformer-3M-CoreML-INT8")
config = json.loads((model_dir / "config.json").read_text())

encoder = ct.models.CompiledMLModel(str(model_dir / "encoder.mlmodelc"))
decoder = ct.models.CompiledMLModel(str(model_dir / "decoder.mlmodelc"))
joiner = ct.models.CompiledMLModel(str(model_dir / "joiner.mlmodelc"))

# Build zero state from config
state = {}
for name, shape in zip(config["encoder"]["layerStateNames"],
                       config["encoder"]["layerStateShapes"]):
    state[name] = np.zeros(shape, dtype=np.float32)
state["cached_embed_left_pad"] = np.zeros(
    config["encoder"]["cachedEmbedLeftPadShape"], dtype=np.float32)
state["processed_lens"] = np.zeros((1,), dtype=np.int32)

# Feed 45 mel frames (80-dim fbank) per chunk
x = np.zeros((1, 45, 80), dtype=np.float32)  # replace with real fbank
out = encoder.predict({"x": x, **state})
encoder_out = out["encoder_out"]  # (1, 8, 320) in joiner space
# Stream: feed encoder_out[0, t] into decoder/joiner + beam search
```
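
The per-frame decode loop can be sketched as a plain greedy search (the shipped reference uses beam search with keyword boosting instead). `decode_fn` and `join_fn` stand in for thin wrappers you would write around `decoder.predict` and `joiner.predict`; the wrapper shapes described here are assumptions, not documented model I/O:

```python
import numpy as np

def greedy_search(decode_fn, join_fn, encoder_out, blank_id=0, context_size=2):
    """Greedy (beam=1) transducer decode over one chunk of encoder frames.

    decode_fn: takes a (1, context_size) int32 array of the last emitted
               tokens, returns the decoder embedding.
    join_fn:   takes one encoder frame and the decoder embedding,
               returns a (vocab,) logit vector.
    """
    hyp = [blank_id] * context_size            # blank-padded start context
    dec_out = decode_fn(np.array([hyp[-context_size:]], dtype=np.int32))
    for t in range(encoder_out.shape[1]):      # iterate over output frames
        logits = join_fn(encoder_out[:, t], dec_out)
        tok = int(np.argmax(logits))
        if tok != blank_id:                    # emit and refresh decoder state
            hyp.append(tok)
            dec_out = decode_fn(np.array([hyp[-context_size:]], dtype=np.int32))
    return hyp[context_size:]
```

The decoder is only re-run when a non-blank token is emitted, which is why its 0.08 ms cost barely registers in the streaming RTF.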

### Swift (speech-swift)

```swift
import SpeechSwift

let model = try await KWSZipformerModel.fromPretrained(
    "aufklarer/KWS-Zipformer-3M-CoreML-INT8"
)
let stream = model.streamingSession(keywords: ["HEY SONIQO", "STOP", "LIGHTS ON"])
try stream.feed(audio: pcm16k)
for match in stream.emissions {
    print("matched: \(match.phrase) at \(match.timestamp)")
}
```

### Reference Python decoder

The upstream Aho-Corasick ContextGraph + boost-cancellation beam search algorithm is available as a dependency-free pure-Python reference for porting to other runtimes.
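
The core of that context graph is a keyword trie walked in lockstep with decoding. This is a deliberately simplified sketch: a plain trie over token ids, without the Aho-Corasick fail links or per-node boost scores of the upstream implementation.

```python
class ContextTrie:
    """Keyword trie over token ids, advanced one decoded token at a time.

    Every incoming token also restarts a fresh path from the root, so
    overlapping matches are still found without fail links.
    """

    def __init__(self, keywords):
        # keywords: list of token-id sequences, e.g. from the BPE tokenizer
        self.root = {}
        for kw in keywords:
            node = self.root
            for tok in kw:
                node = node.setdefault(tok, {})
            node["$"] = tuple(kw)  # leaf marker: a full keyword ends here

    def scan(self, tokens):
        matches, active = [], []
        for tok in tokens:
            active.append(self.root)   # a new match may start at any token
            next_active = []
            for node in active:
                child = node.get(tok)
                if child is None:
                    continue           # this partial match dies
                if "$" in child:
                    matches.append(child["$"])
                next_active.append(child)
            active = next_active
        return matches
```

In the real decoder the trie is consulted inside beam search, boosting hypotheses that advance along a keyword path and cancelling the boost when a path dies; the structure above is the part worth porting first.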

## Source

Exported from icefall-kws-zipformer-gigaspeech-20240219 (Apache-2.0), specifically the exp-finetune/pretrained.pt KWS-finetuned checkpoint, published in pkufool/keyword-spotting-models v0.11.

The reference ONNX export is csukuangfj/sherpa-onnx-kws-zipformer-gigaspeech-3.3M-2024-01-01; this CoreML bundle matches its encoder / decoder / joiner outputs within FP16 tolerance (≤1e-3 on encoder, ≤1e-4 on decoder, ≤1e-3 on joiner).
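
Those tolerances can be reproduced with a simple elementwise comparison once both runtimes' outputs are dumped to arrays; a minimal helper:

```python
import numpy as np

def max_abs_diff(a, b):
    """Largest elementwise deviation between two output tensors."""
    a, b = np.asarray(a, dtype=np.float32), np.asarray(b, dtype=np.float32)
    assert a.shape == b.shape, "compare like-shaped outputs"
    return float(np.max(np.abs(a - b)))
```

Run it per component (encoder, decoder, joiner) on identical inputs and compare against the thresholds above.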
