# quant-olmoe-2bit

2-bit weight-only quantization of allenai/OLMoE-1B-7B-0125 via per-expert routing-conditioned Hessian calibration on top of trellis-coded quantization (QTIP). Produced end-to-end on a single 12 GB consumer GPU (RTX 4080 Laptop, ~18 h).

Code, full training/eval pipeline, and paper: https://github.com/Venugopalan2610/quant-olmoe

## What this repo contains

This repo ships only the 2-bit packed payloads and the HYB codebook LUT. It is not a standalone transformers-loadable model; you still need the base allenai/OLMoE-1B-7B-0125 checkpoint for the router, embeddings, lm_head, and layer norms, which are not quantized.

```
quantized/
  L00/
    attn_{q,k,v,o}_proj.pt
    expert_{00..63}_{gate,up,down}_proj.pt
  L01/
  ...
  L15/
codes/
  hyb_lut_init.npy
```

On-disk footprint: 3.49 GB (the full set of 2-bit .pt bitstreams plus the LUT). At runtime, however, the Phase A reference install path dequantizes everything to bf16 at load time, so combined with the base model's unquantized router, embeddings, and lm_head the live footprint is ~14 GB; native 2-bit kernels are planned as a follow-up.
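For intuition on the 3.49 GB figure: 2-bit codes pack four to a byte, i.e. 0.25 bytes per quantized weight. The repo's actual bitstreams use QTIP's trellis format, but the storage arithmetic can be illustrated with a minimal sketch (`pack_2bit`/`unpack_2bit` are illustrative names, not this repo's API):

```python
import numpy as np

def pack_2bit(codes: np.ndarray) -> np.ndarray:
    """Pack 2-bit codes (values 0..3) four per byte, lowest bits first."""
    assert codes.size % 4 == 0 and codes.max() < 4
    c = codes.reshape(-1, 4).astype(np.uint8)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)

def unpack_2bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_2bit: recover four 2-bit codes from each byte."""
    p = packed.astype(np.uint8)
    return np.stack([(p >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)

# Round trip: 1024 codes occupy exactly 256 bytes (0.25 bytes per weight)
codes = np.random.default_rng(0).integers(0, 4, size=1024)
packed = pack_2bit(codes)
assert packed.size == 256
assert np.array_equal(unpack_2bit(packed), codes)
```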

## Usage

```bash
git clone https://github.com/Venugopalan2610/quant-olmoe.git
cd quant-olmoe
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
git clone https://github.com/Dao-AILab/fast-hadamard-transform
pip install ./fast-hadamard-transform

# Download this repo's payloads into cache/quantized/ and cache/codes/
python -m src.eval.install_quantized --hf-repo Venugopalan2610/quant-olmoe-2bit

# Run PPL
python -m src.eval.run_ppl --config 2bit_noft --dataset wikitext2
python -m src.eval.run_ppl --config 2bit_noft --dataset c4
```

Evaluating perplexity on both datasets takes roughly 10 minutes on an RTX 4080 Laptop.
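For reference, the perplexity these scripts report is the standard metric: the exponential of the mean per-token negative log-likelihood. A minimal sketch (not the repo's implementation):

```python
import numpy as np

def perplexity(token_nlls) -> float:
    """Corpus perplexity = exp(mean per-token negative log-likelihood)."""
    nlls = np.asarray(token_nlls, dtype=np.float64)
    return float(np.exp(nlls.mean()))

# Sanity check: uniform predictions over a 4-symbol vocabulary give PPL 4
assert abs(perplexity([np.log(4.0)] * 100) - 4.0) < 1e-9
```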

## Results (Table 1 of the paper)

| Config                           | Bits | WikiText-2 PPL | C4 PPL |
|----------------------------------|------|----------------|--------|
| fp16 baseline                    | 16   | 6.65           | 12.24  |
| Per-expert H (this repo)         | 2    | 9.09           | 14.16  |
| Per-layer H, unweighted mean     | 2    | 9.21           | 14.43  |
| Per-layer H, token-weighted mean | 2    | 9.18           | 14.44  |

Zero-shot downstream (lm-eval-harness 0.4.11): HellaSwag acc_norm 64.8 (fp16 68.1), ARC-c 44.3 (46.1), ARC-e 67.9 (70.4), PIQA 77.9 (79.6).
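For readers unfamiliar with acc_norm: lm-eval-harness scores multiple-choice tasks by length-normalized log-probability, so longer answer continuations are not penalized for accumulating more negative log-likelihood. A hedged sketch of the selection rule (function name is illustrative, not the harness API):

```python
import numpy as np

def acc_norm_pick(choice_logprobs, choice_texts) -> int:
    """Pick the choice with the highest length-normalized log-probability,
    in the spirit of lm-eval-harness's acc_norm (normalized by byte length)."""
    scores = [lp / len(t.encode("utf-8"))
              for lp, t in zip(choice_logprobs, choice_texts)]
    return int(np.argmax(scores))

# A longer answer with lower raw logprob can still win after normalization:
# -10/3 ≈ -3.33 for "cat" vs -12/11 ≈ -1.09 for "a small cat"
assert acc_norm_pick([-10.0, -12.0], ["cat", "a small cat"]) == 1
```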

## Method (one-paragraph summary)

For each expert in each MoE layer, a routing-conditioned Hessian H_e = E[x x^T | expert e selected] is collected from the tokens the router actually dispatches to that expert during calibration (2048 seqs × 1024 tokens of RedPajama). Each expert's three projections are then quantized independently with QTIP's BlockLDLQ + HYB bitshift codebook (L=16, K=2, V=2, Q=9, T=16) against its own H_e. Nothing about the trellis quantizer, codebook, or LDL decomposition is modified. The only change from the dense QTIP recipe is the per-expert Hessian. See the paper for ablations against per-layer-mean and per-layer-token-weighted Hessians.
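The per-expert Hessian collection described above can be sketched as follows: a minimal NumPy illustration of H_e = E[x x^T | expert e selected], accumulated only over tokens the router dispatches to expert e. Function and variable names are hypothetical, not the repo's code:

```python
import numpy as np

def routing_conditioned_hessians(xs, expert_ids, num_experts):
    """Accumulate H_e = E[x x^T | expert e selected] from calibration
    activations xs of shape (N, d) and router assignments expert_ids of
    shape (N,). Each expert averages only over its own dispatched tokens."""
    n, d = xs.shape
    H = np.zeros((num_experts, d, d))
    counts = np.zeros(num_experts, dtype=np.int64)
    for x, e in zip(xs, expert_ids):
        H[e] += np.outer(x, x)
        counts[e] += 1
    for e in range(num_experts):
        if counts[e] > 0:          # leave unvisited experts at zero
            H[e] /= counts[e]
    return H, counts

rng = np.random.default_rng(0)
xs = rng.normal(size=(256, 8))
ids = rng.integers(0, 4, size=256)
H, counts = routing_conditioned_hessians(xs, ids, num_experts=4)
assert H.shape == (4, 8, 8) and counts.sum() == 256
```

Each H_e is then fed to QTIP's BlockLDLQ in place of the single dense-layer Hessian; by construction each H_e is symmetric positive semidefinite.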

## Hardware / software for reproduction

| Requirement  | Tested                                       |
|--------------|----------------------------------------------|
| GPU          | NVIDIA RTX 4080 Laptop, 12 GB, compute 8.9   |
| Min VRAM     | 12 GB (≥ 11.5 GB asserted at startup)        |
| CUDA toolkit | 12.8                                         |
| OS           | Ubuntu 22.04 under WSL2 (native Linux also ok) |
| Python       | 3.11                                         |

Other ≥12 GB Ampere/Ada cards (3060 12 GB, 3090, 4090, A10, L4) are expected to work without changes.

## Limitations

- **Not native 2-bit at runtime.** The Phase A install path dequantizes to bf16 at load time, so the live memory footprint is ~14 GB, not 3.49 GB. Native 2-bit MoE inference kernels are planned as a companion follow-up report.
- **Requires the base model.** The router, embeddings, lm_head, and layer norms are not shipped here; you still need ~14 GB of bf16 weights from allenai/OLMoE-1B-7B-0125.
- **English-only evaluation.** All reported numbers are WikiText-2, C4 (English), and standard English zero-shot tasks.

## License

Apache-2.0, matching the code repository. Base-model weights are governed by the allenai/OLMoE-1B-7B-0125 license; this repo distributes only transformed weights derived from that checkpoint.

## Citation (publication pending)

```bibtex
@article{iyengar2026qtipmoe,
  title  = {2-Bit MoE Quantization on Consumer GPUs via Per-Expert
            Hessian Calibration},
  author = {Iyengar, Venugopalan},
  year   = {2026}
}
```