MiniMax-M2.7-NVFP4-GB10-AC

Agentic + Coder recalibration of MiniMax-M2.7 NVFP4-GB10. Same architecture and quantization scheme as saricles/MiniMax-M2.7-NVFP4-GB10, but calibrated on a 7-dataset mix targeted at agentic tool-use and code-generation workloads instead of general chat. The two are parallel variants of the same quant approach — sibling releases, not a version chain.

Custom GB10 NVFP4 quantization of MiniMaxAI/MiniMax-M2.7 (230B, 256 MoE experts, top-K=8) targeted at NVIDIA DGX Spark (GB10) and Blackwell-family hardware. 141.05 GB on disk across 29 shards.

Why -AC? Why re-calibrate?

Post-training NVFP4 quantization depends on a calibration dataset to set per-layer activation scales (amax values). A 4-bit float format has 16 representable values — calibration determines how the full BF16 activation range at each layer is mapped to those 16 bins.

If the calibration data doesn't match the target workload, real-world activations that fall outside the calibrated range get clipped, which costs quality on exactly those inputs.
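As a toy illustration of the clipping mechanic (a minimal sketch, not the real NVFP4 kernel; the amax value and inputs are made up):

```python
# Toy fake-quantizer: how a calibrated amax maps activations onto the
# 16 NVFP4 (E2M1) code points, and where out-of-range values get clipped.

# The 8 non-negative E2M1 magnitudes; with a sign bit, 16 code points total.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_nvfp4(x, amax):
    scale = amax / 6.0                    # 6.0 is the largest E2M1 magnitude
    mag = min(abs(x) / scale, 6.0)        # anything beyond amax clips here
    nearest = min(E2M1, key=lambda v: abs(v - mag))
    return (nearest if x >= 0 else -nearest) * scale

# Calibrated on activations up to ±12:
print(quantize_nvfp4(3.1, amax=12.0))   # → 3.0 (in range: snaps to a nearby bin)
print(quantize_nvfp4(40.0, amax=12.0))  # → 12.0 (out of range: hard-clipped)
```

If the deployed workload routinely produces activations past the calibrated amax (here, anything beyond ±12), they all collapse into the top bin, which is exactly what workload-matched calibration avoids.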

  • NVFP4-GB10 calibrated on HuggingFaceH4/ultrachat_200k (general multi-turn English chat, 64 samples)
  • NVFP4-GB10-AC calibrated on a 7-dataset agentic + coder mix (896 samples queued, 888 after length filtering)

The -AC calibration mix is designed to align activation scales with the workloads the model will actually serve when deployed in agent frameworks like OpenClaw, Aider, or Claude Code-style assistants.

Calibration mix

128 samples per dataset, 49,152 (48K) max sequence length:

| Dataset | Samples | Domain |
|---|---|---|
| theblackcat102/evol-codealpaca-v1 | 128 | Code generation |
| Salesforce/xlam-function-calling-60k | 128 | Tool calling / function invocation |
| open-r1/Mixture-of-Thoughts (code) | 128 | Code reasoning |
| open-r1/Mixture-of-Thoughts (math) | 128 | Mathematical reasoning |
| open-r1/Mixture-of-Thoughts (science) | 128 | Scientific reasoning |
| SWE-bench/SWE-smith-trajectories (tool split) | 128 | Software-engineering agent trajectories |
| HuggingFaceH4/ultrachat_200k (train_sft) | 128 | General multi-turn chat coverage |
| Total queued | 896 | |
| Tokenized (post length filter) | 888 | 8 dropped as too short after tokenization |

The 7th dataset (ultrachat_200k) is intentional: without a general-chat anchor, calibration would bias exclusively toward code/tool/math distributions and degrade plain conversational quality. The mix preserves chat capability while shifting activation scales toward the agentic/coder workloads this quant is built for.
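A hypothetical sketch of the queue-then-filter bookkeeping described above (stand-in data and a whitespace "tokenizer"; the real pipeline pulls the actual corpora and uses the model tokenizer):

```python
import random

# Hypothetical sketch: 128 samples queued per source, then a
# post-tokenization length filter drops too-short samples.
def build_calib_mix(sources, per_ds=128, min_tokens=8, seed=0):
    rng = random.Random(seed)
    queued = []
    for name, samples in sources.items():
        picked = rng.sample(samples, per_ds) if len(samples) > per_ds else list(samples)
        queued.extend((name, s) for s in picked)
    # stand-in "tokenizer": whitespace split
    kept = [(n, s) for n, s in queued if len(s.split()) >= min_tokens]
    return queued, kept

# 7 fake datasets of 500 stand-in samples each
sources = {f"ds{i}": [f"sample {j} " + "tok " * (j % 20) for j in range(500)]
           for i in range(7)}
queued, kept = build_calib_mix(sources)
print(len(queued), len(kept))  # 896 queued; a few dropped by the length filter
```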

Model Details

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MiniMaxM2ForCausalLM (MoE, 256 experts, top-K=8) |
| Total Parameters | 230B |
| Active Parameters | ~10B per token |
| Hidden Layers | 62 |
| Hidden Size | 3,072 |
| Vocab Size | 200,064 |
| Max Position Embeddings | 196,608 (192K context) |
| Quantization | NVFP4 (4-bit floating point) with GB10-tuned ignore list |
| Format | compressed-tensors (safetensors) |
| Size on Disk | 141.05 GB across 29 shards |
| Deployment | 2× DGX Spark (does not fit in a single 128 GB Spark) |
| License | Non-commercial, inherited from MiniMaxAI/MiniMax-M2.7. See Use & License. |

Quantization Details

  • Method: Post-training quantization via NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.29.0)
  • Transformers: 4.57.6 (with Conv1D compatibility shim for post-4.57 module relocation)
  • Scheme: mtq.NVFP4_DEFAULT_CFG (algorithm=max, group_size=16) + GB10-tuned disable list applied post-calibration
  • Calibration: 7-dataset agentic + coder mix (see table above), 896 samples queued / 888 tokenized @ 49,152 max-seq
  • Ignore list (kept in BF16, from published hf_quant_config.json):
    • lm_head, *embed_tokens*
    • *block_sparse_moe.gate — MoE router gate (not per-expert gates)
    • *model.layers.0.* — first transformer block
    • *model.layers.61.* — last transformer block
  • Quantizer counts: 143,967 TensorQuantizer modules inserted, 51,327 disabled via ignore list, 92,640 active during calibration
  • GB10 specialization: self_attn stays QUANTIZED (vs. the standard NVFP4 reference configuration which keeps attention BF16) — the GB10 ignore list only covers the items listed above
  • Calibration run: Hugging Face Jobs, 8× NVIDIA A100 80 GB, ~10 hours wall-clock, single-phase (no wallclock-cap, no deferred samples, no OOMs)
  • Starvation check: 0 starved experts at end of calibration (every active quantizer received enough token traffic to produce a valid amax)
  • Recipe script: quantize-ac-protected.py — full three-phase recipe with OOM-defer protection, amax-only checkpointing, and inline export
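The ignore-list entries above are wildcard patterns over module names. A minimal sketch of how such patterns select modules (module names are illustrative; the actual matching lives inside the quantization tooling):

```python
from fnmatch import fnmatch

# Patterns from the published hf_quant_config.json ignore list.
IGNORE = [
    "lm_head",
    "*embed_tokens*",
    "*block_sparse_moe.gate",
    "*model.layers.0.*",
    "*model.layers.61.*",
]

def is_ignored(name):
    """True if the module stays in BF16 (skipped by NVFP4 quantization)."""
    return any(fnmatch(name, pat) for pat in IGNORE)

print(is_ignored("model.layers.3.block_sparse_moe.gate"))          # → True (router gate)
print(is_ignored("model.layers.3.block_sparse_moe.experts.7.w1"))  # → False (per-expert weight)
print(is_ignored("model.layers.0.self_attn.q_proj"))               # → True (first block)
print(is_ignored("model.layers.10.self_attn.q_proj"))              # → False (middle block)
```

Note how `*block_sparse_moe.gate` catches only the router gate, not the per-expert projections, and `*model.layers.0.*` catches layer 0 without also catching layer 10.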

Running on 2× DGX Spark (Tensor Parallel)

At 141.05 GB this model does not fit in a single DGX Spark's 128 GB unified memory. It runs with tensor-parallel-size=2 across two Sparks connected via their ConnectX-7 200 GbE link, orchestrated by Ray. The community reference container is eugr/spark-vllm-docker.

Quick start: run_vllm.sh is a ready-to-run wrapper — exports the tuned environment variables and invokes vllm serve with the working flag set.

Full deployment reference: DEPLOYMENT.md — the two deployment profiles I tested, measured numbers, and hardware/framework quirks specific to GB10 (SM 12.1) and multi-node Ray TP.

The short version: on GB10 the fastest NVFP4 MoE path is the Marlin backend (VLLM_NVFP4_GEMM_BACKEND=marlin, VLLM_USE_FLASHINFER_MOE_FP4=0), and if your workload is agentic (tool-calling, code generation, repeated-token-heavy) you should additionally enable ngram speculative decoding. See DEPLOYMENT.md for the full rationale and benchmark data.
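A minimal launch sketch under those settings (the exact flag set lives in run_vllm.sh and DEPLOYMENT.md; the port is a placeholder, and the speculative-decoding flags are omitted here because they vary by vLLM build):

```shell
# Fastest NVFP4 MoE path on GB10 per the notes above
export VLLM_NVFP4_GEMM_BACKEND=marlin
export VLLM_USE_FLASHINFER_MOE_FP4=0

# TP=2 across the two Sparks (Ray cluster assumed already up)
vllm serve saricles/MiniMax-M2.7-NVFP4-GB10-AC \
  --tensor-parallel-size 2 \
  --port 8000
```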

Client-side tips

Every client that calls this endpoint should set max_tokens ≥ 16384. The OpenAI SDK's default of 4096 will silently truncate tool-call JSON mid-string, which appears as "model forgot how to use tools" but is actually just a clipped response. Bump it.
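A small illustration of why the truncation looks like broken tool use: a tool-call payload cut mid-string stops being parseable JSON (the payload below is made up):

```python
import json

full = '{"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}'
clipped = full[:40]  # what the client sees when max_tokens truncates the response

json.loads(full)  # parses fine
try:
    json.loads(clipped)
    ok = True
except json.JSONDecodeError:
    ok = False
print(ok)  # → False: the agent framework sees a malformed tool call, not a short one
```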

When to choose -AC vs NVFP4-GB10

  • Use -AC for: agent frameworks (OpenClaw, Aider, Claude Code-style), tool-calling workloads, code-generation assistants, multi-turn reasoning over code/math.
  • Use NVFP4-GB10 for: general chat applications, scenarios where the calibration-dataset provenance matches the published NVFP4-GB10 benchmarks exactly.

Both variants are mechanically compatible (same vLLM invocation, same compressed-tensors format). Only the per-layer NVFP4 activation scales differ — size on disk, architecture, ignore list, and deployment are unchanged.

Performance

Measured 2026-04-19 on 2× NVIDIA DGX Spark (GB10, SM 12.1), TP=2 via Ray over the QSFP56 RoCE link, SoC firmware 2.148.24, --gpu-memory-utilization 0.88. vLLM 0.19.1rc1.dev241 via the eugr/spark-vllm-docker nightly image; benchmarked with llama-benchy v0.3.3. Both deployment profiles are documented in DEPLOYMENT.md. Numbers below are observed on this rig; your mileage depends on build, image, and workload.

Profile 1 — Throughput-stable (Marlin NVFP4 MoE, no speculation)

Benchmarked with llama-benchy v0.3.3, 3 runs per config, warm model, single client.

| Prompt (tok) | Gen (tok) | Prefill (tok/s) | Decode (tok/s) | TTFT (ms) |
|---|---|---|---|---|
| 512 | 128 | 1,128 | 35.44 | 454 |
| 512 | 256 | 1,248 | 35.86 | 410 |
| 1024 | 128 | 2,049 | 35.03 | 500 |
| 1024 | 256 | 2,132 | 34.50 | 480 |
| 4096 | 128 | 2,817 | 33.76 | 1,454 |
| 4096 | 256 | 3,314 | 33.45 | 1,236 |

API latency: 1.50 ms. Peak decode: 35.86 tok/s.

Profile 2 — Agentic (Marlin NVFP4 MoE + ngram speculative decoding)

Measured on a 12-prompt agent-flavored set (code generation, tool calls, short chat) — not a standard benchmark; it approximates real agent-framework traffic. Same hardware, same sampling, only the serving config differs.

| Metric | Throughput-stable profile | Agentic profile |
|---|---|---|
| Average decode across 12 prompts (tok/s) | 25.20 | 36.44 |
| Peak decode (tok/s) | 35.86 | 48.34 (code-04: async-pattern) |
| Total wall-clock for full prompt set (s) | 250.8 | 162.7 |
| Wall-clock speedup (Agentic vs Throughput-stable) | | 1.54× |

Per-task wall-clock highlights (DEPLOYMENT.md has the full breakdown): code-02 (MBPP-style) 2.13× faster; code-04 (async pattern) 1.90× faster; chat-03 (creative writing) 2.06× faster; tool-04 (don't-call-tool trap) 1.96× faster.

Why the two profiles differ: ngram speculative decoding wins big when responses contain repeated tokens (tool names, file paths, variable names, JSON keys reappearing) — which agent/code workloads have abundantly. On synthetic benchmarks with low token repetition (like llama-benchy's generated prompts), ngram's overhead slightly exceeds its savings and decode regresses. DEPLOYMENT.md documents this tradeoff and when to pick each profile.
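A rough way to see the difference is the fraction of n-grams that reappear later in a token stream, which is what ngram speculation exploits (toy whitespace tokenization; the strings are made up):

```python
def repeat_fraction(tokens, n=3):
    """Fraction of n-gram positions whose n-gram occurred earlier in the stream."""
    seen, hits, total = set(), 0, 0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        total += 1
        if gram in seen:
            hits += 1
        seen.add(gram)
    return hits / total if total else 0.0

# Code-flavored output repeats identifiers and call patterns heavily;
# low-repetition prose does not.
code = "return open(path).read() ; return open(path).read() ; return open(path).read()".split()
prose = "every word in this particular sentence appears exactly once indeed".split()
print(repeat_fraction(code), repeat_fraction(prose))  # → 0.5 0.0
```

Speculated ngram drafts only pay off on the repeated spans, which is why the agentic profile wins on code/tool traffic and regresses slightly on synthetic low-repetition prompts.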

Qualitative — agentic behavior vs NVFP4-GB10 sibling

Same prompt set, same sampling, compared against the general-chat-calibrated sibling variant:

| Task | NVFP4-GB10 tokens | NVFP4-GB10-AC tokens | Wall-clock speedup (AC) |
|---|---|---|---|
| "Answer directly; don't call the provided tool" (trap) | 718 | 44 | 14.7× |
| Multi-step meeting booking (3 tools) | 385 | 81 | 4.6× |
| Weather (single tool) | 73 | 51 | 2.5× |
| Parallel stock prices (parallel tool calls) | 176 | 121 | 1.4× |

AC is measurably more decisive on tool-use tasks: it emits cleaner, shorter tool calls and, crucially, doesn't over-invoke tools when direct answers suffice. Raw decode/prefill throughput is within noise of NVFP4-GB10, as expected, since the quant format is identical and only the activation scales differ; the meaningful delta is qualitative behavior on agentic tasks, not raw throughput.

Notes

  • See DEPLOYMENT.md for the full environment, flags, caveats, and why Marlin is the right MoE backend on SM 12.1 GB10 today.
  • For published standardized benchmarks (HumanEval, BFCL, MT-Bench, WildClawBench), see forthcoming evaluation runs.

Recommended Sampling Parameters

Per MiniMax documentation:

{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}
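Against the vLLM OpenAI-compatible endpoint, these map onto the request body like this (top_k and min_p are vLLM extensions to the OpenAI schema; the URL is a placeholder and max_tokens follows the client-side tip above):

```python
# Request body for the vLLM OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "saricles/MiniMax-M2.7-NVFP4-GB10-AC",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 16384,   # see "Client-side tips": the 4096 default truncates tool calls
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,           # vLLM-specific sampling field
    "min_p": 0.01,         # vLLM-specific sampling field
}
# import requests
# requests.post("http://localhost:8000/v1/chat/completions", json=payload)
```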

Target Hardware

Quantized for and tested on NVIDIA DGX Spark (GB10, 128 GB unified memory, 221 GB/s bandwidth). Should work on other Blackwell-class GPUs with NVFP4 tensor-core support. On Hopper-class hardware (H100/H200) the model will load and run, but the ignore list was tuned for Blackwell and will leave some performance on the table.

If you only have one DGX Spark

At 141.05 GB this model does not fit in a single Spark's 128 GB unified memory — it requires 2× Spark with tensor parallelism. If you have only one Spark, consider the REAP-pruned variant: saricles/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10 (98.9 GB, single-node deployment).

Use & License

This derivative inherits the license terms of the base model, MiniMaxAI/MiniMax-M2.7. The full license text is distributed in the LICENSE file in this repo.

Permitted free uses (from §5 of the base license): personal use — including self-hosted deployment for coding, development of applications, agents, tools, integrations, research, experimentation, or other personal purposes; use by non-profit organizations, academic institutions, and researchers for non-commercial research or educational purposes; and modifications for the uses above.

Commercial use requires authorization directly from MiniMax. If you intend to use this model (or any derivative) for commercial purposes — including offering products/services to third parties for a fee, commercial-product APIs, or commercial deployment — you must:

  1. Obtain prior written authorization from MiniMax by emailing api@minimax.io with subject line "M2.7 licensing", and
  2. Prominently display "Built with MiniMax M2.7" on the related website, user interface, blogpost, about page, or product documentation.

Prohibited uses (from the license appendix) — by using this model you agree not to use it to: generate or disseminate content prohibited by applicable laws, support any military purpose, exploit or harm minors, generate harmful misinformation intended to deceive, or promote discrimination or hate speech.

This quantization pipeline and the recipe script in this repo (quantize-ac-protected.py) are released under the same terms as the base model, as a derivative work.

Reproducibility

Full recipe script: quantize-ac-protected.py

The script implements a three-phase protected calibration pipeline:

  • Phase A — Calibration with per-sample OOM defer, amax-only checkpoints every N samples (60 MB each, versus ~460 GB per checkpoint if saving full state), optional two-phase bucket commit with sha256 markers, wallclock watchdog (soft + hard exit). Inline export at end on successful completion.
  • Phase B (fallback) — Resume from the latest good checkpoint, process deferred samples on a larger-memory GPU flavor, rescue starved experts, export.
  • Phase C (recovery only) — Re-export from a saved checkpoint if Phases A/B completed calibration but crashed during export.
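The amax-only checkpoint idea can be sketched as a simple state-dict filter: persist the calibration statistics (one small amax per quantizer), not the full model weights. Key naming here is illustrative; the real script's checkpoint format may differ.

```python
def amax_only(state_dict):
    """Keep only entries that are quantizer amax statistics."""
    return {k: v for k, v in state_dict.items() if k.endswith("._amax")}

state = {
    "model.layers.1.mlp.w1.weight": [0.0] * 4096,                  # huge tensor: skipped
    "model.layers.1.mlp.w1.input_quantizer._amax": 12.5,           # tiny stat: kept
    "model.layers.1.self_attn.q_proj.input_quantizer._amax": 8.0,  # tiny stat: kept
}
print(sorted(amax_only(state)))  # only the two _amax entries survive
```

Dropping the weight tensors is what turns a ~460 GB full checkpoint into a ~60 MB one.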

Env vars consumed by the recipe:

  • PHASE = A | B | C
  • INPUT_DIR — path to the BF16 source model
  • OUTPUT_DIR — export target (Phase A inline export + Phase B/C export)
  • TARGET_REPO_ID — HF Hub repo to publish the quantized model to
  • BUCKET_REPO_ID — HF Hub dataset repo used as a workspace for checkpoints (optional; remove Phase B/C if you don't want a bucket)
  • BUCKET_PREFIX — path prefix inside the bucket repo
  • NUM_CALIB_PER_DS (default 128)
  • MAX_SEQ (default 49152)
  • CKPT_EVERY (default 50)
  • WALLCLOCK_BUDGET_S (default 21600 = 6h; Phase A exits cleanly before cap)
  • STARVED_EXPERT_PCT_ABORT (default 1.0%)
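A hypothetical sketch of how the script could read these knobs (defaults match the list above; the PHASE default and the exact parsing are assumptions, not the script's actual code):

```python
import os

def recipe_config(env=os.environ):
    """Read the recipe knobs, falling back to the documented defaults."""
    return {
        "phase": env.get("PHASE", "A"),                   # assumed default
        "num_calib_per_ds": int(env.get("NUM_CALIB_PER_DS", 128)),
        "max_seq": int(env.get("MAX_SEQ", 49152)),
        "ckpt_every": int(env.get("CKPT_EVERY", 50)),
        "wallclock_budget_s": int(env.get("WALLCLOCK_BUDGET_S", 21600)),
        "starved_expert_pct_abort": float(env.get("STARVED_EXPERT_PCT_ABORT", 1.0)),
    }

cfg = recipe_config({})  # no overrides -> documented defaults
print(cfg["max_seq"], cfg["wallclock_budget_s"])  # → 49152 21600
```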

Run for this release:

  • Job: HF Jobs a100x8, single Phase A invocation
  • Duration: ~10 hours wall-clock (01:40 UTC start → 11:41 UTC Phase A DONE → inline export → publish)
  • Outcome: status=complete-published, deferred=0, starved=0
  • 14 amax-only checkpoints written during calibration (60 MB each), ckpt 14 is the final post-rescue state
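The starvation check amounts to counting active quantizers that never received a valid amax (the amax map below is made up for illustration):

```python
def starved_pct(amax_by_quantizer):
    """Return the starved quantizer names and their percentage of the total."""
    starved = [name for name, amax in amax_by_quantizer.items() if amax is None]
    pct = 100.0 * len(starved) / len(amax_by_quantizer)
    return starved, pct

amaxes = {"expert.0.w1": 9.2, "expert.1.w1": 4.4, "expert.2.w1": None}
starved, pct = starved_pct(amaxes)
# A run would abort if pct exceeds STARVED_EXPERT_PCT_ABORT; this release saw 0 starved.
print(starved, round(pct, 1))
```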