1.58-Bit FLUX Reproduction: Ternary Quantization + LoRA

Reproduction of 1.58-bit FLUX (Yang et al., 2024). Ternary ({-1, 0, +1}) quantization of FLUX.1-dev transformer with LoRA compensation, trained via offline flow-matching distillation.
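The core quantization step can be sketched in a few lines. The snippet below is an absmean-style ternary quantizer in the spirit of BitNet b1.58; the function name, the per-tensor scale, and the residual handling are illustrative assumptions, not this repo's actual API (see `models.ternary` for that):

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8):
    """Absmean ternary quantization (BitNet-b1.58-style; assumed, not
    necessarily this repo's exact scheme). Afterwards w ~= scale * codes."""
    scale = w.abs().mean().clamp(min=eps)      # per-tensor absmean scale
    codes = (w / scale).round().clamp_(-1, 1)  # snap each weight to {-1, 0, +1}
    return codes, scale

w = torch.randn(256, 256)
codes, scale = ternary_quantize(w)
residual = w - scale * codes  # quantization error the LoRA branch must absorb
```

At rank r, the LoRA branch adds two small dense matrices per layer that are trained via distillation to compensate this residual.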

Results

| Model | LoRA Rank | OOD CLIP (% of BF16) | Aesthetic | LPIPS |
|---|---|---|---|---|
| BF16 (baseline) | – | 100% | 5.842 | ref |
| V9b (best r64) | 64 | 90.0% | 5.939 | 0.664 |
| V10b (best r128) | 128 | 90.4% | 5.686 | 0.719 |
  • 30.8% inference VRAM reduction (33.83 → 23.41 GB peak)
  • Evaluated on 20 diverse out-of-distribution prompts
  • All experiments on single NVIDIA A100-SXM4-80GB
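The headline VRAM figure follows directly from the peak numbers above:

```python
# Peak inference VRAM from the results above, in GB
baseline_gb, ternary_gb = 33.83, 23.41
reduction = (baseline_gb - ternary_gb) / baseline_gb
print(f"{reduction:.1%}")  # → 30.8%
```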

Checkpoints

Model Weights (Ternary + LoRA)

| File | Version | Rank | Steps | OOD CLIP |
|---|---|---|---|---|
| ternary_distilled_r64_res1024_s4000_fm_lpips1e-01.pt | V7 | 64 | 4,000 | 88.9% |
| ternary_distilled_r64_res1024_s6000_fm_lpips1e-01.pt | V9b | 64 | 6,000 | 90.0% |
| ternary_distilled_r64_res1024_s8000_fm_lpips1e-01.pt | V9c | 64 | 8,000 | 88.8% |
| ternary_distilled_r128_res1024_s6000_fm_lpips1e-01.pt | V10 | 128 | 6,000 | 87.7% |
| ternary_distilled_r128_res1024_s12000_fm_lpips1e-01.pt | V10b | 128 | 12,000 | 90.4% |

Training Datasets

| File | Prompts | Images | Description |
|---|---|---|---|
| teacher_dataset_v7.pt | 1,002 | 1,374 | V7 teacher latents |
| teacher_dataset_v9b_combined.pt | 2,132 | 2,504 | V9b combined (V7 + 1,130 new) |
| teacher_dataset_v9c_combined.pt | 4,007 | 4,379 | V9c combined (V9b + 1,875 new) |
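"Offline" here means the teacher's latents are generated once and reused, so each training step reduces to a rectified-flow regression against them. A minimal sketch of one such step, assuming the standard straight-line interpolation and the 0.1 LPIPS weight suggested by the checkpoint filenames (function and argument names are hypothetical):

```python
import torch
import torch.nn.functional as F

def fm_distill_step(student, teacher_latents, text_emb,
                    lpips_fn=None, lpips_weight=0.1):
    """One offline flow-matching distillation step (illustrative sketch;
    the velocity target and LPIPS weight are assumptions, not the repo's API)."""
    noise = torch.randn_like(teacher_latents)
    t = torch.rand(teacher_latents.shape[0], device=teacher_latents.device)
    tb = t.view(-1, *([1] * (teacher_latents.dim() - 1)))
    x_t = (1 - tb) * teacher_latents + tb * noise  # straight-line interpolation
    v_target = noise - teacher_latents             # rectified-flow velocity target
    v_pred = student(x_t, t, text_emb)             # ternary + LoRA student forward
    loss = F.mse_loss(v_pred, v_target)
    if lpips_fn is not None:
        x0_hat = x_t - tb * v_pred                 # one-step estimate of the clean latent
        loss = loss + lpips_weight * lpips_fn(x0_hat, teacher_latents)
    return loss

# Shape check with a stand-in student that predicts zero velocity
latents = torch.randn(2, 16, 64, 64)
loss = fm_distill_step(lambda x, t, e: torch.zeros_like(x), latents, None)
```

Only the LoRA parameters receive gradients in this setup; the ternary codes stay frozen after quantization.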

Usage

```python
from diffusers import FluxPipeline
from models.ternary import quantize_to_ternary
import torch

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev",
                                    torch_dtype=torch.bfloat16).to("cuda")

# Replace the transformer's weights with ternary + LoRA modules
quantize_to_ternary(pipe.transformer, lora_rank=128, svd_init=False)

# Load the distilled checkpoint; only parameters that exist in the
# quantized transformer are copied over
ckpt = torch.load("ternary_distilled_r128_res1024_s12000_fm_lpips1e-01.pt",
                  map_location="cuda", weights_only=True)
params = dict(pipe.transformer.named_parameters())
for name, tensor in ckpt.items():
    if name in params:
        params[name].data.copy_(tensor.to(torch.bfloat16))

# Generate
image = pipe("A majestic lion resting on a savanna at golden hour",
             height=1024, width=1024, num_inference_steps=30,
             guidance_scale=3.5).images[0]
```

Key Findings

  1. Offline flow-matching distillation against pre-generated teacher latents consistently outperformed online distillation
  2. Data scaling follows a log₂ law up to the LoRA capacity ceiling: OOD CLIP % = 0.0115 × log₂(prompts) + 0.7744
  3. The scaling law breaks down at ~4,000 prompts for rank 64 (capacity saturation)
  4. Rank-128 LoRA breaks through the ceiling (90.0% → 90.4%) but needs 2× the training steps from a cold start
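The fitted law from finding 2 can be checked against the checkpoint table above; plugging in V7's 1,002 prompts recovers its measured 88.9%:

```python
import math

def predicted_ood_clip(num_prompts: int) -> float:
    # Fitted log2 scaling law (finding 2); valid below the ~4,000-prompt
    # rank-64 capacity ceiling noted in finding 3
    return 0.0115 * math.log2(num_prompts) + 0.7744

print(f"{predicted_ood_clip(1002):.1%}")  # → 88.9% (matches V7)
```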

Citation

@misc{ugonfor2026ternaryflux,
  title={Reproducing 1.58-Bit FLUX: Ternary Quantization with LoRA Compensation},
  author={Ugon For},
  year={2026},
  url={https://github.com/ugonfor/1.58bit-flux}
}

Acknowledgments

Based on 1.58-bit FLUX by Yang et al. Base model: FLUX.1-dev by Black Forest Labs.
