MiniMax-M2.7-NVFP4-GB10-AC

Agentic + Coder recalibration of MiniMax-M2.7 NVFP4-GB10. Same architecture and quantization scheme as saricles/MiniMax-M2.7-NVFP4-GB10, but calibrated on a 7-dataset mix targeted at agentic tool-use and code-generation workloads instead of general chat. The two are parallel variants of the same quant approach — sibling releases, not a version chain.

Custom GB10 NVFP4 quantization of MiniMaxAI/MiniMax-M2.7 (230B, 256 MoE experts, top-K=8) targeted at NVIDIA DGX Spark (GB10) and Blackwell-family hardware. 141.05 GB on disk across 29 shards.

Why -AC? Why re-calibrate?

Post-training NVFP4 quantization depends on a calibration dataset to set per-layer activation scales (amax values). A 4-bit float format has 16 representable values — calibration determines how the full BF16 activation range at each layer is mapped to those 16 bins.

If the calibration data doesn't match the target workload, real-world activations that fall outside the calibrated range get clipped, which costs quality on exactly those inputs.
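As a toy illustration of the clipping mechanic (a minimal sketch, not the real NVFP4 kernel; the amax value and inputs are made up):

```python
# Toy fake-quantizer: how a calibrated amax maps activations onto the
# 16 NVFP4 (E2M1) code points, and where out-of-range values get clipped.

# The 8 non-negative E2M1 magnitudes; with a sign bit, 16 code points total.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_nvfp4(x, amax):
    scale = amax / 6.0                    # 6.0 is the largest E2M1 magnitude
    mag = min(abs(x) / scale, 6.0)        # anything beyond amax clips here
    nearest = min(E2M1, key=lambda v: abs(v - mag))
    return (nearest if x >= 0 else -nearest) * scale

# Calibrated on activations up to ±12:
print(quantize_nvfp4(3.1, amax=12.0))   # → 3.0 (in range: snaps to a nearby bin)
print(quantize_nvfp4(40.0, amax=12.0))  # → 12.0 (out of range: hard-clipped)
```

If the deployed workload routinely produces activations past the calibrated amax (here, anything beyond ±12), they all collapse into the top bin, which is exactly what workload-matched calibration avoids.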

  • NVFP4-GB10 calibrated on HuggingFaceH4/ultrachat_200k (general multi-turn English chat, 64 samples)
  • NVFP4-GB10-AC calibrated on a 7-dataset agentic + coder mix (896 samples queued, 888 after length filtering)

The -AC calibration mix is designed to align activation scales with the workloads the model will actually serve when deployed in agent frameworks like OpenClaw, Aider, or Claude Code-style assistants.

Calibration mix

128 samples per dataset, 49,152 (48K) max sequence length:

| Dataset | Samples | Domain |
|---|---|---|
| theblackcat102/evol-codealpaca-v1 | 128 | Code generation |
| Salesforce/xlam-function-calling-60k | 128 | Tool calling / function invocation |
| open-r1/Mixture-of-Thoughts (code) | 128 | Code reasoning |
| open-r1/Mixture-of-Thoughts (math) | 128 | Mathematical reasoning |
| open-r1/Mixture-of-Thoughts (science) | 128 | Scientific reasoning |
| SWE-bench/SWE-smith-trajectories (tool split) | 128 | Software-engineering agent trajectories |
| HuggingFaceH4/ultrachat_200k (train_sft) | 128 | General multi-turn chat coverage |
| Total queued | 896 | |
| Tokenized (post length filter) | 888 | 8 dropped as too short after tokenization |

The 7th dataset (ultrachat_200k) is intentional: without a general-chat anchor, calibration would bias exclusively toward code/tool/math distributions and degrade plain conversational quality. The mix preserves chat capability while shifting activation scales toward the agentic/coder workloads this quant is built for.
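A hypothetical sketch of the queue-then-filter bookkeeping described above (stand-in data and a whitespace "tokenizer"; the real pipeline pulls the actual corpora and uses the model tokenizer):

```python
import random

# Hypothetical sketch: 128 samples queued per source, then a
# post-tokenization length filter drops too-short samples.
def build_calib_mix(sources, per_ds=128, min_tokens=8, seed=0):
    rng = random.Random(seed)
    queued = []
    for name, samples in sources.items():
        picked = rng.sample(samples, per_ds) if len(samples) > per_ds else list(samples)
        queued.extend((name, s) for s in picked)
    # stand-in "tokenizer": whitespace split
    kept = [(n, s) for n, s in queued if len(s.split()) >= min_tokens]
    return queued, kept

# 7 fake datasets of 500 stand-in samples each
sources = {f"ds{i}": [f"sample {j} " + "tok " * (j % 20) for j in range(500)]
           for i in range(7)}
queued, kept = build_calib_mix(sources)
print(len(queued), len(kept))  # 896 queued; a few dropped by the length filter
```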

Model Details

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MiniMaxM2ForCausalLM (MoE, 256 experts, top-K=8) |
| Total Parameters | 230B |
| Active Parameters | ~10B per token |
| Hidden Layers | 62 |
| Hidden Size | 3,072 |
| Vocab Size | 200,064 |
| Max Position Embeddings | 196,608 (192K context) |
| Quantization | NVFP4 (4-bit floating point) with GB10-tuned ignore list |
| Format | compressed-tensors (safetensors) |
| Size on Disk | 141.05 GB across 29 shards |
| Deployment | 2× DGX Spark (does not fit in a single 128 GB Spark) |
| License | Non-commercial, inherited from MiniMaxAI/MiniMax-M2.7. See Use & License. |

Quantization Details

  • Method: Post-training quantization via NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.29.0)
  • Transformers: 4.57.6 (with Conv1D compatibility shim for post-4.57 module relocation)
  • Scheme: mtq.NVFP4_DEFAULT_CFG (algorithm=max, group_size=16) + GB10-tuned disable list applied post-calibration
  • Calibration: 7-dataset agentic + coder mix (see table above), 896 samples queued / 888 tokenized @ 49,152 max-seq
  • Ignore list (kept in BF16, from published hf_quant_config.json):
    • lm_head, *embed_tokens*
    • *block_sparse_moe.gate — MoE router gate (not per-expert gates)
    • *model.layers.0.* — first transformer block
    • *model.layers.61.* — last transformer block
  • Quantizer counts: 143,967 TensorQuantizer modules inserted, 51,327 disabled via ignore list, 92,640 active during calibration
  • GB10 specialization: self_attn stays QUANTIZED (vs. the standard NVFP4 reference configuration which keeps attention BF16) — the GB10 ignore list only covers the items listed above
  • Calibration run: Hugging Face Jobs, 8× NVIDIA A100 80 GB, ~10 hours wall-clock, single-phase (no wallclock-cap, no deferred samples, no OOMs)
  • Starvation check: 0 starved experts at end of calibration (every active quantizer received enough token traffic to produce a valid amax)
  • Recipe script: quantize-ac-protected.py — full three-phase recipe with OOM-defer protection, amax-only checkpointing, and inline export
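The ignore-list entries above are wildcard patterns over module names. A minimal sketch of how such patterns select modules (module names are illustrative; the actual matching lives inside the quantization tooling):

```python
from fnmatch import fnmatch

# Patterns from the published hf_quant_config.json ignore list.
IGNORE = [
    "lm_head",
    "*embed_tokens*",
    "*block_sparse_moe.gate",
    "*model.layers.0.*",
    "*model.layers.61.*",
]

def is_ignored(name):
    """True if the module stays in BF16 (skipped by NVFP4 quantization)."""
    return any(fnmatch(name, pat) for pat in IGNORE)

print(is_ignored("model.layers.3.block_sparse_moe.gate"))          # → True (router gate)
print(is_ignored("model.layers.3.block_sparse_moe.experts.7.w1"))  # → False (per-expert weight)
print(is_ignored("model.layers.0.self_attn.q_proj"))               # → True (first block)
print(is_ignored("model.layers.10.self_attn.q_proj"))              # → False (middle block)
```

Note how `*block_sparse_moe.gate` catches only the router gate, not the per-expert projections, and `*model.layers.0.*` catches layer 0 without also catching layer 10.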

Running on 2× DGX Spark (Tensor Parallel)

At 141.05 GB this model does not fit in a single DGX Spark's 128 GB unified memory. It runs with tensor-parallel-size=2 across two Sparks connected via their ConnectX-7 200 GbE link, orchestrated by Ray. The community reference container is eugr/spark-vllm-docker.

Quick start: run_vllm.sh is a ready-to-run wrapper — exports the tuned environment variables and invokes vllm serve with the working flag set.

Full deployment reference: DEPLOYMENT.md — the two deployment profiles I tested, measured numbers, and hardware/framework quirks specific to GB10 (SM 12.1) and multi-node Ray TP.

The short version: on GB10 the fastest NVFP4 MoE path is the Marlin backend (VLLM_NVFP4_GEMM_BACKEND=marlin, VLLM_USE_FLASHINFER_MOE_FP4=0), and if your workload is agentic (tool-calling, code generation, repeated-token-heavy) you should additionally enable ngram speculative decoding. See DEPLOYMENT.md for the full rationale and benchmark data.
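A minimal launch sketch under those settings (the exact flag set lives in run_vllm.sh and DEPLOYMENT.md; the port is a placeholder, and the speculative-decoding flags are omitted here because they vary by vLLM build):

```shell
# Fastest NVFP4 MoE path on GB10 per the notes above
export VLLM_NVFP4_GEMM_BACKEND=marlin
export VLLM_USE_FLASHINFER_MOE_FP4=0

# TP=2 across the two Sparks (Ray cluster assumed already up)
vllm serve saricles/MiniMax-M2.7-NVFP4-GB10-AC \
  --tensor-parallel-size 2 \
  --port 8000
```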

Client-side tips

Every client that calls this endpoint should set max_tokens ≥ 16384. The OpenAI SDK's default of 4096 will silently truncate tool-call JSON mid-string, which appears as "model forgot how to use tools" but is actually just a clipped response. Bump it.
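A small illustration of why the truncation looks like broken tool use: a tool-call payload cut mid-string stops being parseable JSON (the payload below is made up):

```python
import json

full = '{"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}'
clipped = full[:40]  # what the client sees when max_tokens truncates the response

json.loads(full)  # parses fine
try:
    json.loads(clipped)
    ok = True
except json.JSONDecodeError:
    ok = False
print(ok)  # → False: the agent framework sees a malformed tool call, not a short one
```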

When to choose -AC vs NVFP4-GB10

  • Use -AC for: agent frameworks (OpenClaw, Aider, Claude Code-style), tool-calling workloads, code-generation assistants, multi-turn reasoning over code/math.
  • Use NVFP4-GB10 for: general chat applications, scenarios where the calibration-dataset provenance matches the published NVFP4-GB10 benchmarks exactly.

Both variants are mechanically compatible (same vLLM invocation, same compressed-tensors format). Only the per-layer NVFP4 activation scales differ — size on disk, architecture, ignore list, and deployment are unchanged.

Performance

Measured 2026-04-19 on 2× NVIDIA DGX Spark (GB10, SM 12.1), TP=2 via Ray over the QSFP56 RoCE link, SoC firmware 2.148.24, --gpu-memory-utilization 0.88. vLLM 0.19.1rc1.dev241 via the eugr/spark-vllm-docker nightly image; benchmarked with llama-benchy v0.3.3. Both deployment profiles are documented in DEPLOYMENT.md. Numbers below are observed on this rig; your mileage depends on build, image, and workload.

Profile 1 — Throughput-stable (Marlin NVFP4 MoE, no speculation)

Benchmarked with llama-benchy v0.3.3, 3 runs per config, warm model, single client.

| Prompt (tok) | Gen (tok) | Prefill (tok/s) | Decode (tok/s) | TTFT (ms) |
|---|---|---|---|---|
| 512 | 128 | 1,128 | 35.44 | 454 |
| 512 | 256 | 1,248 | 35.86 | 410 |
| 1024 | 128 | 2,049 | 35.03 | 500 |
| 1024 | 256 | 2,132 | 34.50 | 480 |
| 4096 | 128 | 2,817 | 33.76 | 1,454 |
| 4096 | 256 | 3,314 | 33.45 | 1,236 |

API latency: 1.50 ms. Peak decode: 35.86 tok/s.

Profile 2 — Agentic (Marlin NVFP4 MoE + ngram speculative decoding)

Measured on a 12-prompt agent-flavored set (code generation, tool calls, short chat) — not a standard benchmark; it approximates real agent-framework traffic. Same hardware, same sampling, only the serving config differs.

| Metric | Throughput-stable profile | Agentic profile |
|---|---|---|
| Average decode across 12 prompts (tok/s) | 25.20 | 36.44 |
| Peak decode (tok/s) | 35.86 | 48.34 (code-04: async-pattern) |
| Total wall-clock for full prompt set (s) | 250.8 | 162.7 |
| Wall-clock speedup (Agentic vs Throughput-stable) | | 1.54× |

Per-task wall-clock highlights (DEPLOYMENT.md has the full breakdown): code-02 (MBPP-style) 2.13× faster; code-04 (async pattern) 1.90× faster; chat-03 (creative writing) 2.06× faster; tool-04 (don't-call-tool trap) 1.96× faster.

Why the two profiles differ: ngram speculative decoding wins big when responses contain repeated tokens (tool names, file paths, variable names, JSON keys reappearing) — which agent/code workloads have abundantly. On synthetic benchmarks with low token repetition (like llama-benchy's generated prompts), ngram's overhead slightly exceeds its savings and decode regresses. DEPLOYMENT.md documents this tradeoff and when to pick each profile.
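A rough way to see the difference is the fraction of n-grams that reappear later in a token stream, which is what ngram speculation exploits (toy whitespace tokenization; the strings are made up):

```python
def repeat_fraction(tokens, n=3):
    """Fraction of n-gram positions whose n-gram occurred earlier in the stream."""
    seen, hits, total = set(), 0, 0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        total += 1
        if gram in seen:
            hits += 1
        seen.add(gram)
    return hits / total if total else 0.0

# Code-flavored output repeats identifiers and call patterns heavily;
# low-repetition prose does not.
code = "return open(path).read() ; return open(path).read() ; return open(path).read()".split()
prose = "every word in this particular sentence appears exactly once indeed".split()
print(repeat_fraction(code), repeat_fraction(prose))  # → 0.5 0.0
```

Speculated ngram drafts only pay off on the repeated spans, which is why the agentic profile wins on code/tool traffic and regresses slightly on synthetic low-repetition prompts.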

Qualitative — agentic behavior vs NVFP4-GB10 sibling

Same prompt set, same sampling, compared against the general-chat-calibrated sibling variant:

| Task | NVFP4-GB10 tokens | NVFP4-GB10-AC tokens | Wall-clock speedup (AC) |
|---|---|---|---|
| "Answer directly; don't call the provided tool" (trap) | 718 | 44 | 14.7× |
| Multi-step meeting booking (3 tools) | 385 | 81 | 4.6× |
| Weather (single tool) | 73 | 51 | 2.5× |
| Parallel stock prices (parallel tool calls) | 176 | 121 | 1.4× |

AC is measurably more decisive on tool-use tasks: it emits cleaner, shorter tool calls and, crucially, doesn't over-invoke tools when direct answers suffice. Raw decode/prefill throughput is within noise of NVFP4-GB10, as expected, since the quant format is identical and only the activation scales differ; the meaningful delta is qualitative behavior on agentic tasks, not raw throughput.

Notes

  • See DEPLOYMENT.md for the full environment, flags, caveats, and why Marlin is the right MoE backend on SM 12.1 GB10 today.
  • For published standardized benchmarks (HumanEval, BFCL, MT-Bench, WildClawBench), see forthcoming evaluation runs.

Recommended Sampling Parameters

Per MiniMax documentation:

{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}
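Against the vLLM OpenAI-compatible endpoint, these map onto the request body like this (top_k and min_p are vLLM extensions to the OpenAI schema; the URL is a placeholder and max_tokens follows the client-side tip above):

```python
# Request body for the vLLM OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "saricles/MiniMax-M2.7-NVFP4-GB10-AC",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 16384,   # see "Client-side tips": the 4096 default truncates tool calls
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,           # vLLM-specific sampling field
    "min_p": 0.01,         # vLLM-specific sampling field
}
# import requests
# requests.post("http://localhost:8000/v1/chat/completions", json=payload)
```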

Target Hardware

Quantized for and tested on NVIDIA DGX Spark (GB10, 128 GB unified memory, 221 GB/s bandwidth). Should work on other Blackwell-class GPUs with NVFP4 tensor-core support. On Hopper-class hardware (H100/H200) the model will load and run, but the ignore list was tuned for Blackwell and will leave some performance on the table.

If you only have one DGX Spark

At 141.05 GB this model does not fit in a single Spark's 128 GB unified memory — it requires 2× Spark with tensor parallelism. If you have only one Spark, consider the REAP-pruned variant: saricles/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10 (98.9 GB, single-node deployment).

Use & License

This derivative inherits the license terms of the base model, MiniMaxAI/MiniMax-M2.7. The full license text is distributed in the LICENSE file in this repo.

Permitted free uses (from §5 of the base license): personal use — including self-hosted deployment for coding, development of applications, agents, tools, integrations, research, experimentation, or other personal purposes; use by non-profit organizations, academic institutions, and researchers for non-commercial research or educational purposes; and modifications for the uses above.

Commercial use requires authorization directly from MiniMax. If you intend to use this model (or any derivative) for commercial purposes — including offering products/services to third parties for a fee, commercial-product APIs, or commercial deployment — you must:

  1. Obtain prior written authorization from MiniMax by emailing api@minimax.io with subject line "M2.7 licensing", and
  2. Prominently display "Built with MiniMax M2.7" on the related website, user interface, blogpost, about page, or product documentation.

Prohibited uses (from the license appendix) — by using this model you agree not to use it to: generate or disseminate content prohibited by applicable laws, support any military purpose, exploit or harm minors, generate harmful misinformation intended to deceive, or promote discrimination or hate speech.

This quantization pipeline and the recipe script in this repo (quantize-ac-protected.py) are released under the same terms as the base model, as a derivative work.

Reproducibility

Full recipe script: quantize-ac-protected.py

The script implements a three-phase protected calibration pipeline:

  • Phase A — Calibration with per-sample OOM defer, amax-only checkpoints every N samples (60 MB each, versus ~460 GB per checkpoint if saving full state), optional two-phase bucket commit with sha256 markers, wallclock watchdog (soft + hard exit). Inline export at end on successful completion.
  • Phase B (fallback) — Resume from the latest good checkpoint, process deferred samples on a larger-memory GPU flavor, rescue starved experts, export.
  • Phase C (recovery only) — Re-export from a saved checkpoint if Phases A/B completed calibration but crashed during export.
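The amax-only checkpoint idea can be sketched as a simple state-dict filter: persist the calibration statistics (one small amax per quantizer), not the full model weights. Key naming here is illustrative; the real script's checkpoint format may differ.

```python
def amax_only(state_dict):
    """Keep only entries that are quantizer amax statistics."""
    return {k: v for k, v in state_dict.items() if k.endswith("._amax")}

state = {
    "model.layers.1.mlp.w1.weight": [0.0] * 4096,                  # huge tensor: skipped
    "model.layers.1.mlp.w1.input_quantizer._amax": 12.5,           # tiny stat: kept
    "model.layers.1.self_attn.q_proj.input_quantizer._amax": 8.0,  # tiny stat: kept
}
print(sorted(amax_only(state)))  # only the two _amax entries survive
```

Dropping the weight tensors is what turns a ~460 GB full checkpoint into a ~60 MB one.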

Env vars consumed by the recipe:

  • PHASE = A | B | C
  • INPUT_DIR — path to the BF16 source model
  • OUTPUT_DIR — export target (Phase A inline export + Phase B/C export)
  • TARGET_REPO_ID — HF Hub repo to publish the quantized model to
  • BUCKET_REPO_ID — HF Hub dataset repo used as a workspace for checkpoints (optional; remove Phase B/C if you don't want a bucket)
  • BUCKET_PREFIX — path prefix inside the bucket repo
  • NUM_CALIB_PER_DS (default 128)
  • MAX_SEQ (default 49152)
  • CKPT_EVERY (default 50)
  • WALLCLOCK_BUDGET_S (default 21600 = 6h; Phase A exits cleanly before cap)
  • STARVED_EXPERT_PCT_ABORT (default 1.0%)
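A hypothetical sketch of how the script could read these knobs (defaults match the list above; the PHASE default and the exact parsing are assumptions, not the script's actual code):

```python
import os

def recipe_config(env=os.environ):
    """Read the recipe knobs, falling back to the documented defaults."""
    return {
        "phase": env.get("PHASE", "A"),                   # assumed default
        "num_calib_per_ds": int(env.get("NUM_CALIB_PER_DS", 128)),
        "max_seq": int(env.get("MAX_SEQ", 49152)),
        "ckpt_every": int(env.get("CKPT_EVERY", 50)),
        "wallclock_budget_s": int(env.get("WALLCLOCK_BUDGET_S", 21600)),
        "starved_expert_pct_abort": float(env.get("STARVED_EXPERT_PCT_ABORT", 1.0)),
    }

cfg = recipe_config({})  # no overrides -> documented defaults
print(cfg["max_seq"], cfg["wallclock_budget_s"])  # → 49152 21600
```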

Run for this release:

  • Job: HF Jobs a100x8, single Phase A invocation
  • Duration: ~10 hours wall-clock (01:40 UTC start → 11:41 UTC Phase A DONE → inline export → publish)
  • Outcome: status=complete-published, deferred=0, starved=0
  • 14 amax-only checkpoints written during calibration (60 MB each), ckpt 14 is the final post-rescue state
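The starvation check amounts to counting active quantizers that never received a valid amax (the amax map below is made up for illustration):

```python
def starved_pct(amax_by_quantizer):
    """Return the starved quantizer names and their percentage of the total."""
    starved = [name for name, amax in amax_by_quantizer.items() if amax is None]
    pct = 100.0 * len(starved) / len(amax_by_quantizer)
    return starved, pct

amaxes = {"expert.0.w1": 9.2, "expert.1.w1": 4.4, "expert.2.w1": None}
starved, pct = starved_pct(amaxes)
# A run would abort if pct exceeds STARVED_EXPERT_PCT_ABORT; this release saw 0 starved.
print(starved, round(pct, 1))
```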