1.58-Bit FLUX Reproduction: Ternary Quantization + LoRA

Reproduction of 1.58-bit FLUX (Yang et al., 2024). Ternary ({-1, 0, +1}) quantization of FLUX.1-dev transformer with LoRA compensation, trained via offline flow-matching distillation.
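The core quantization step can be sketched in a few lines. The snippet below is an absmean-style ternary quantizer in the spirit of BitNet b1.58; the function name, the per-tensor scale, and the residual handling are illustrative assumptions, not this repo's actual API (see `models.ternary` for that):

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8):
    """Absmean ternary quantization (BitNet-b1.58-style; assumed, not
    necessarily this repo's exact scheme). Afterwards w ~= scale * codes."""
    scale = w.abs().mean().clamp(min=eps)      # per-tensor absmean scale
    codes = (w / scale).round().clamp_(-1, 1)  # snap each weight to {-1, 0, +1}
    return codes, scale

w = torch.randn(256, 256)
codes, scale = ternary_quantize(w)
residual = w - scale * codes  # quantization error the LoRA branch must absorb
```

At rank r, the LoRA branch adds two small dense matrices per layer that are trained via distillation to compensate this residual.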

Results

| Model | LoRA Rank | OOD CLIP (% of BF16) | Aesthetic | LPIPS |
|---|---|---|---|---|
| BF16 (baseline) | – | 100% | 5.842 | ref |
| V9b (best r64) | 64 | 90.0% | 5.939 | 0.664 |
| V10b (best r128) | 128 | 90.4% | 5.686 | 0.719 |
  • 30.8% inference VRAM reduction (33.83 → 23.41 GB peak)
  • Evaluated on 20 diverse out-of-distribution prompts
  • All experiments on single NVIDIA A100-SXM4-80GB
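The headline VRAM figure follows directly from the peak numbers above:

```python
# Peak inference VRAM from the results above, in GB
baseline_gb, ternary_gb = 33.83, 23.41
reduction = (baseline_gb - ternary_gb) / baseline_gb
print(f"{reduction:.1%}")  # → 30.8%
```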

Checkpoints

Model Weights (Ternary + LoRA)

| File | Version | Rank | Steps | OOD CLIP |
|---|---|---|---|---|
| ternary_distilled_r64_res1024_s4000_fm_lpips1e-01.pt | V7 | 64 | 4,000 | 88.9% |
| ternary_distilled_r64_res1024_s6000_fm_lpips1e-01.pt | V9b | 64 | 6,000 | 90.0% |
| ternary_distilled_r64_res1024_s8000_fm_lpips1e-01.pt | V9c | 64 | 8,000 | 88.8% |
| ternary_distilled_r128_res1024_s6000_fm_lpips1e-01.pt | V10 | 128 | 6,000 | 87.7% |
| ternary_distilled_r128_res1024_s12000_fm_lpips1e-01.pt | V10b | 128 | 12,000 | 90.4% |

Training Datasets

| File | Prompts | Images | Description |
|---|---|---|---|
| teacher_dataset_v7.pt | 1,002 | 1,374 | V7 teacher latents |
| teacher_dataset_v9b_combined.pt | 2,132 | 2,504 | V9b combined (V7 + 1,130 new) |
| teacher_dataset_v9c_combined.pt | 4,007 | 4,379 | V9c combined (V9b + 1,875 new) |
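"Offline" here means the teacher's latents are generated once and reused, so each training step reduces to a rectified-flow regression against them. A minimal sketch of one such step, assuming the standard straight-line interpolation and the 0.1 LPIPS weight suggested by the checkpoint filenames (function and argument names are hypothetical):

```python
import torch
import torch.nn.functional as F

def fm_distill_step(student, teacher_latents, text_emb,
                    lpips_fn=None, lpips_weight=0.1):
    """One offline flow-matching distillation step (illustrative sketch;
    the velocity target and LPIPS weight are assumptions, not the repo's API)."""
    noise = torch.randn_like(teacher_latents)
    t = torch.rand(teacher_latents.shape[0], device=teacher_latents.device)
    tb = t.view(-1, *([1] * (teacher_latents.dim() - 1)))
    x_t = (1 - tb) * teacher_latents + tb * noise  # straight-line interpolation
    v_target = noise - teacher_latents             # rectified-flow velocity target
    v_pred = student(x_t, t, text_emb)             # ternary + LoRA student forward
    loss = F.mse_loss(v_pred, v_target)
    if lpips_fn is not None:
        x0_hat = x_t - tb * v_pred                 # one-step estimate of the clean latent
        loss = loss + lpips_weight * lpips_fn(x0_hat, teacher_latents)
    return loss

# Shape check with a stand-in student that predicts zero velocity
latents = torch.randn(2, 16, 64, 64)
loss = fm_distill_step(lambda x, t, e: torch.zeros_like(x), latents, None)
```

Only the LoRA parameters receive gradients in this setup; the ternary codes stay frozen after quantization.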

Usage

```python
from diffusers import FluxPipeline
from models.ternary import quantize_to_ternary
import torch

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev",
                                    torch_dtype=torch.bfloat16).to("cuda")

# Replace the transformer's weights with ternary + LoRA modules
quantize_to_ternary(pipe.transformer, lora_rank=128, svd_init=False)

# Load the distilled checkpoint; only parameters that exist in the
# quantized transformer are copied over
ckpt = torch.load("ternary_distilled_r128_res1024_s12000_fm_lpips1e-01.pt",
                  map_location="cuda", weights_only=True)
params = dict(pipe.transformer.named_parameters())
for name, tensor in ckpt.items():
    if name in params:
        params[name].data.copy_(tensor.to(torch.bfloat16))

# Generate
image = pipe("A majestic lion resting on a savanna at golden hour",
             height=1024, width=1024, num_inference_steps=30,
             guidance_scale=3.5).images[0]
```

Key Findings

  1. Offline flow-matching distillation against pre-generated teacher latents consistently outperformed online distillation
  2. Data scaling follows a log₂ law up to the LoRA capacity ceiling: OOD CLIP % = 0.0115 × log₂(prompts) + 0.7744
  3. The scaling law breaks down at ~4,000 prompts for rank 64 (capacity saturation)
  4. Rank-128 LoRA breaks through the ceiling (90.0% → 90.4%) but needs 2× the training steps from a cold start
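The fitted law from finding 2 can be checked against the checkpoint table above; plugging in V7's 1,002 prompts recovers its measured 88.9%:

```python
import math

def predicted_ood_clip(num_prompts: int) -> float:
    # Fitted log2 scaling law (finding 2); valid below the ~4,000-prompt
    # rank-64 capacity ceiling noted in finding 3
    return 0.0115 * math.log2(num_prompts) + 0.7744

print(f"{predicted_ood_clip(1002):.1%}")  # → 88.9% (matches V7)
```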

Citation

@misc{ugonfor2026ternaryflux,
  title={Reproducing 1.58-Bit FLUX: Ternary Quantization with LoRA Compensation},
  author={Ugon For},
  year={2026},
  url={https://github.com/ugonfor/1.58bit-flux}
}

Acknowledgments

Based on 1.58-bit FLUX by Yang et al. Base model: FLUX.1-dev by Black Forest Labs.
