# MiniMax M2.7 — JANGTQ (MLX)
TurboQuant codebook quantization of MiniMax's 228B agentic MoE — routed experts at 2-bit via Lloyd-Max codebooks + Hadamard rotation, attention / embed / shared-expert / lm_head at 8-bit affine.
## Model Details
| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MoE (256 experts, top-8 active) + standard Q/K/V attention + partial RoPE |
| Total Parameters | 228.7 B |
| Active per Token | ~1.4 B |
| Profile | JANGTQ |
| Format | JANGTQ (codebook + Hadamard) — weight_format: mxtq in jang_config.json |
| Avg bits/param | ~2.15 |
| Disk | ~57 GB |
| Context length | 192 K tokens |
| Chat template | Always-reasoning (`<think>` opened at assistant start) |
## What is JANGTQ?
JANGTQ (JANG TurboQuant) is a codebook-based quantization format for MoE
models on Apple Silicon. Routed expert weights stay in a compact codebook +
Hadamard-rotated form at runtime — no decompression to affine — and the
matmul path uses custom Metal kernels that read packed uint32 weights, look
up centroids in a small codebook, and accumulate dot products against a
Hadamard-rotated input (QuIP# rotate-input-once math).
Compared with uniform 2-bit affine, the result is smaller on disk, higher quality, and runs at ~89 % of affine 2-bit speed.
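To make the rotate-input-once idea concrete, here is a minimal NumPy sketch of the pipeline described above: 2-bit indices packed into uint32 words, a small per-tensor codebook, and a dot product taken in the Hadamard-rotated basis. This is an illustration, not the JANGTQ Metal kernel — the quantile-based centroid fit is a crude stand-in for Lloyd-Max, and all names and shapes are hypothetical.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal

def pack_2bit(indices):
    # Pack sixteen 2-bit codebook indices into each uint32 word.
    idx = indices.astype(np.uint32).reshape(-1, 16)
    words = np.zeros(idx.shape[0], dtype=np.uint32)
    for i in range(16):
        words |= idx[:, i] << np.uint32(2 * i)
    return words

def unpack_2bit(words, n):
    out = np.empty((words.shape[0], 16), dtype=np.uint32)
    for i in range(16):
        out[:, i] = (words >> np.uint32(2 * i)) & np.uint32(0x3)
    return out.reshape(-1)[:n]

n = 64                              # toy dimension
rng = np.random.default_rng(0)
w = rng.standard_normal(n).astype(np.float32)
x = rng.standard_normal(n).astype(np.float32)

H = hadamard(n)
w_rot = H @ w                       # rotate weights once, offline

# 2-bit codebook: 4 centroids fit to the rotated weights
# (quantiles as a crude Lloyd-Max stand-in).
codebook = np.quantile(w_rot, [0.125, 0.375, 0.625, 0.875]).astype(np.float32)
indices = np.abs(w_rot[:, None] - codebook[None, :]).argmin(axis=1)
packed = pack_2bit(indices)

# Runtime: rotate the input once, look up centroids, accumulate.
x_rot = H @ x
w_hat = codebook[unpack_2bit(packed, n)]
y = w_hat @ x_rot                   # approximates w @ x, since H is orthonormal
print(float(y), float(w @ x))
```

Because the Hadamard matrix is orthonormal, the rotated dot product equals the original one exactly; quantization error comes only from snapping rotated weights to the 4-entry codebook, and the rotation spreads outliers so those centroids fit better.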
## Bit Allocation
| Component | Bits | Format |
|---|---|---|
| Routed expert MLP (gate / up / down) | 2 | JANGTQ codebook + Hadamard |
| Attention (Q / K / V / O) | 8 | Affine (nn.QuantizedLinear, group_size=64) |
| Shared expert | 8 | Affine |
| Embed tokens / LM head | 8 | Affine |
| Router gate | fp16 | Unquantized nn.Linear |
| RMSNorms / RoPE / biases | fp16 | Unquantized |
The routed experts hold ~98 % of the parameters and are the natural compression target. Everything else stays at 8-bit affine (or fp16), so the quality-critical paths keep high precision.
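A back-of-envelope check shows how this allocation yields the headline figures (~2.15 avg bits/param, ~57 GB on disk). The per-weight average is 2.12 bits; codebooks, affine scales, and fp16 norms account for the gap to the stated 2.15:

```python
# Back-of-envelope arithmetic for the headline numbers in the tables above.
routed_frac = 0.98            # routed experts, 2-bit codebook
other_frac = 1 - routed_frac  # attention / shared expert / embeddings, 8-bit affine

weight_bits = routed_frac * 2 + other_frac * 8   # per-weight average, no overhead
avg_bits = 2.15               # stated figure, incl. codebooks, scales, fp16 norms

total_params = 228.7e9
disk_gib = total_params * avg_bits / 8 / 2**30   # binary gigabytes
print(round(weight_bits, 2), round(disk_gib, 1))  # → 2.12 57.2
```

So the "~57 GB" disk figure is consistent with 228.7 B parameters at ~2.15 bits each, measured in binary gigabytes.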
## Important Settings
MiniMax M2.7 is an always-reasoning model. The chat template
unconditionally opens `<think>` at each assistant turn.
| Setting | Value | Notes |
|---|---|---|
| Temperature | 1.0 | Required — temp=0 can cause thinking loops |
| Top-P | 0.95 | |
| Top-K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| max_tokens | ≥ 8192 | Give reasoning room to converge |
Strip `<think>…</think>` from the response before using the final answer.
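A minimal sketch of that post-processing step. `strip_think` is a hypothetical helper, not part of jang-tools; dropping everything when the close tag is missing (i.e. the model ran out of tokens mid-thought) is one reasonable policy, not the only one:

```python
import re

def strip_think(text: str) -> str:
    # Keep only what follows the reasoning block. Because the chat template
    # opens <think> itself, the generated text may contain only the close tag.
    m = re.search(r"</think>", text)
    return text[m.end():].lstrip() if m else ""

raw = "<think>Five sentences, keep it simple.</think>Photosynthesis converts light into chemical energy."
print(strip_think(raw))  # → Photosynthesis converts light into chemical energy.
```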
## Usage
This model requires the jang-tools loader — stock `mlx_lm.load()` does not
recognize `weight_format: mxtq`. The loader applies Metal kernel
monkey-patches at load time (fused gate+up+SwiGLU, gather TQ, multi-block
Hadamard, router compile, QKV fusion).
```shell
pip install jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model_path = snapshot_download("OsaurusAI/MiniMax-M2.7-JANGTQ")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in five sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

out = generate(model, tokenizer, prompt, max_tokens=600,
               temperature=1.0, verbose=True)
```
### Swift — Osaurus / MLX Studio
Both clients auto-detect the JANGTQ runtime from `jang_config.json` and route
through the `MiniMaxJANGTQModel` class. Just load the repo — no extra flags.
## What's In This Repo
| File | Role |
|---|---|
| `model-*.safetensors` (61 shards, ~57 GB) | Weights — 2-bit routed TQ + 8-bit affine |
| `model.safetensors.index.json` | Shard index |
| `jangtq_runtime.safetensors` | Codebooks + Hadamard signs sidecar (Swift loader) |
| `jang_config.json` | JANG metadata + Tier-1 capabilities stamp (reasoning=qwen3, tool=minimax) |
| `config.json` | HF model config (minimax_m2, weight_format=mxtq, mxtq_bits=2) |
| `chat_template.jinja`, `tokenizer.*`, `vocab.json`, `merges.txt` | Tokenizer + chat template |
| `configuration_minimax_m2.py`, `modeling_minimax_m2.py` | HF custom code (untouched from upstream) |
| `osaurus-x-banner.png`, `mlx-studio-logo.png` | Branding assets |
## Parser Capabilities (Tier-1, auto-detected by Osaurus / vmlx)
```json
{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_tools": true,
  "supports_thinking": true,
  "family": "minimax_m2",
  "modality": "text",
  "cache_type": "kv"
}
```
`<think>` and `<tool_call>` are non-special tokens by design — the
application layer parses them. The `CapabilityDetector` in Osaurus and vmlx
reads this block verbatim and wires up the qwen3 reasoning parser and
minimax tool parser automatically, so streamed responses route
`reasoning_content` and `tool_calls` into the OpenAI-compatible SSE fields
instead of leaking into `content`.
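For clients that do not use those parsers, the routing can be sketched as a simple split on the tags. `route_response` is a hypothetical helper operating on a completed (non-streamed) response, not the Osaurus/vmlx implementation; it assumes the template already opened `<think>`, so the text starts inside the reasoning block:

```python
import re

def route_response(text):
    # Split a completed response into OpenAI-style fields.
    out = {"reasoning_content": "", "content": "", "tool_calls": []}
    reasoning, sep, rest = text.partition("</think>")
    out["reasoning_content"] = reasoning.strip()
    if not sep:  # think block never closed — nothing usable after it
        return out
    out["tool_calls"] = re.findall(r"<tool_call>(.*?)</tool_call>", rest, re.S)
    out["content"] = re.sub(r"<tool_call>.*?</tool_call>", "", rest, flags=re.S).strip()
    return out

demo = 'plan the answer</think>Here you go.<tool_call>{"name": "search"}</tool_call>'
print(route_response(demo))
```

A real streaming implementation would run the same logic as an incremental state machine over SSE chunks, but the field boundaries are identical.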
## License
MIT — see LICENSE.
## Credits
Created by Jinho Jang — eric@jangq.ai
Based on MiniMaxAI's MiniMax M2.7. JANGTQ quantization © JANGQ-AI.