# quant-olmoe-2bit

2-bit weight-only quantization of allenai/OLMoE-1B-7B-0125 via per-expert, routing-conditioned Hessian calibration on top of trellis-coded quantization (QTIP). Produced end-to-end on a single 12 GB consumer GPU (RTX 4080 Laptop, ~18 h).

Code, full training/eval pipeline, and paper: https://github.com/Venugopalan2610/quant-olmoe
## What this repo contains

This repo ships only the 2-bit packed payloads and the HYB codebook LUT. It is not a standalone `transformers`-loadable model; you still need the base allenai/OLMoE-1B-7B-0125 checkpoint for the router, embeddings, `lm_head`, and layer norms, which are not quantized.
```
quantized/
  L00/
    attn_{q,k,v,o}_proj.pt
    expert_{00..63}_{gate,up,down}_proj.pt
  L01/
    ...
  L15/
codes/
  hyb_lut_init.npy
```
On-disk footprint: 3.49 GB (the full set of 2-bit `.pt` bitstreams plus the LUT). Combined with the base model's unquantized router, embeddings, and `lm_head`, the installed runtime footprint is ~14 GB in bf16: the Phase A reference install path dequantizes at load time, and native 2-bit kernels are planned as a follow-up.
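As a rough illustration of what "dequantizes at load time" means, here is a minimal NumPy sketch of generic 2-bit LUT dequantization. This is a simplification, not the repo's actual decoder: the real payloads use QTIP's trellis/bitshift decode, and the function names below are hypothetical.

```python
import numpy as np

def unpack_2bit(packed: np.ndarray, n: int) -> np.ndarray:
    """Unpack n 2-bit codes from a uint8 array (four codes per byte,
    lowest bits first)."""
    codes = np.empty(packed.size * 4, dtype=np.uint8)
    for shift in range(4):
        codes[shift::4] = (packed >> (2 * shift)) & 0b11
    return codes[:n]

def dequantize(packed: np.ndarray, lut: np.ndarray, n: int) -> np.ndarray:
    """Map each 2-bit code to its codebook value via a 4-entry LUT."""
    return lut[unpack_2bit(packed, n)]

# Toy example: one byte holding the codes 0, 1, 2, 3.
lut = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)
packed = np.array([0b11100100], dtype=np.uint8)
weights = dequantize(packed, lut, 4)  # -> [-1.5, -0.5, 0.5, 1.5]
```

In the real install path the decoded values land in bf16 tensors shaped like the original projections, which is why the live footprint matches the fp16/bf16 model rather than the packed size.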
## Usage

```bash
git clone https://github.com/Venugopalan2610/quant-olmoe.git
cd quant-olmoe
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
git clone https://github.com/Dao-AILab/fast-hadamard-transform
pip install ./fast-hadamard-transform

# Download this repo into cache/quantized/ and cache/codes/
python -m src.eval.install_quantized --hf-repo Venugopalan2610/quant-olmoe-2bit

# Run PPL
python -m src.eval.run_ppl --config 2bit_noft --dataset wikitext2
python -m src.eval.run_ppl --config 2bit_noft --dataset c4
```

PPL eval takes ~10 minutes for both datasets on an RTX 4080 Laptop.
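For reference, the perplexity being reported is the exponentiated mean token-level negative log-likelihood. A minimal NumPy sketch of that computation (illustrative only; this is not the repo's `run_ppl` implementation, and real evals additionally handle tokenization and context windowing):

```python
import numpy as np

def perplexity(logits: np.ndarray, targets: np.ndarray) -> float:
    """PPL = exp(mean NLL of the target tokens).
    logits: [n_tokens, vocab]; targets: [n_tokens] int token ids."""
    z = logits - logits.max(axis=-1, keepdims=True)          # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

# Sanity check: uniform logits over a 4-token vocab give PPL = 4.
ppl = perplexity(np.zeros((5, 4)), np.zeros(5, dtype=int))   # -> 4.0
```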
## Results (Table 1 of the paper)

| Config | Bits | WikiText-2 PPL | C4 PPL |
|---|---|---|---|
| fp16 baseline | 16 | 6.65 | 12.24 |
| Per-expert H (this repo) | 2 | 9.09 | 14.16 |
| Per-layer H, unweighted mean | 2 | 9.21 | 14.43 |
| Per-layer H, token-weighted mean | 2 | 9.18 | 14.44 |

Zero-shot downstream (lm-eval-harness 0.4.11): HellaSwag acc_norm 64.8 (fp16: 68.1), ARC-Challenge 44.3 (46.1), ARC-Easy 67.9 (70.4), PIQA 77.9 (79.6).
## Method (one-paragraph summary)

For each expert in each MoE layer, a routing-conditioned Hessian H_e = E[x x^T | expert e selected] is collected from the tokens the router actually dispatches to that expert during calibration (2048 seqs × 1024 tokens of RedPajama). Each expert's three projections are then quantized independently with QTIP's BlockLDLQ + HYB bitshift codebook (L=16, K=2, V=2, Q=9, T=16) against its own H_e. Nothing about the trellis quantizer, codebook, or LDL decomposition is modified. The only change from the dense QTIP recipe is the per-expert Hessian. See the paper for ablations against per-layer-mean and per-layer-token-weighted Hessians.
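The routing-conditioned Hessian accumulation above can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions (activations and top-k expert ids already captured per token; `routed_hessian` is a hypothetical helper, not code from this repo):

```python
import numpy as np

def routed_hessian(xs: np.ndarray, expert_ids: np.ndarray, expert: int) -> np.ndarray:
    """H_e = E[x x^T | expert e selected]: mean outer product over only
    the tokens the router dispatched to `expert`.
    xs: [n_tokens, d] MoE-layer inputs; expert_ids: [n_tokens, k] top-k routing."""
    mask = (expert_ids == expert).any(axis=-1)   # tokens routed to this expert
    xe = xs[mask]                                # [n_e, d]
    return xe.T @ xe / max(xe.shape[0], 1)       # avoid div-by-zero for cold experts

# Toy check: 3 tokens of dim 2, top-1 routing to experts [0, 1, 0].
xs = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
expert_ids = np.array([[0], [1], [0]])
H0 = routed_hessian(xs, expert_ids, 0)
# Tokens 0 and 2 only: ([1,0]^T[1,0] + [1,1]^T[1,1]) / 2 = [[1.0, 0.5], [0.5, 0.5]]
```

The contrast with the per-layer baselines is that they would average (unweighted or token-weighted) over all tokens hitting the layer, regardless of which expert each token was routed to.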
## Hardware / software for reproduction
| Requirement | Tested |
|---|---|
| GPU | NVIDIA RTX 4080 Laptop, 12 GB, compute 8.9 |
| Min VRAM | 12 GB (≥ 11.5 GB asserted at startup) |
| CUDA toolkit | 12.8 |
| OS | Ubuntu 22.04 under WSL2 (native Linux also ok) |
| Python | 3.11 |
Other ≥12 GB Ampere/Ada cards (3060 12 GB, 3090, 4090, A10, L4) are expected to work without changes.
## Limitations

- Not native 2-bit at runtime. The Phase A install path dequantizes to bf16 at load time, so the live memory footprint is ~14 GB, not the 3.49 GB on-disk size. Native 2-bit MoE inference kernels are planned as a companion follow-up report.
- Requires the base model. The router, embeddings, and `lm_head` are not shipped here; you still need ~14 GB of bf16 weights from allenai/OLMoE-1B-7B-0125.
- English-only evaluation. All reported numbers are WikiText-2, C4 (English), and standard English zero-shot tasks.
## License
Apache-2.0, matching the code repository. Base-model weights are governed
by the allenai/OLMoE-1B-7B-0125 license; this repo distributes only
transformed weights derived from that checkpoint.
## Citation (publication pending)

```bibtex
@article{iyengar2026qtipmoe,
  title  = {2-Bit MoE Quantization on Consumer GPUs via Per-Expert Hessian Calibration},
  author = {Iyengar, Venugopalan},
  year   = {2026}
}
```