LLaMA 11.7B - 4-bit Quantized

📚 Paper • 🏠 GitHub

This is one of the checkpoints accompanying the paper 1-Bit-Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization. Instructions for running inference with the model can be found in the corresponding repository.

⚠️ IMPORTANT: This model is intended for research purposes only. It is provided as-is without warranties for production use.

Model Details

  • Architecture: LLaMA
  • Size: 11.7B (11,701,129,216 parameters)
  • Quantization: 4-bit k-means with absmax scaling per 64-element block
  • Centroids: defined per layer
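The quantization scheme above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a 16-entry per-layer codebook, normalizes each 64-element block by its absmax, and assigns every normalized value to its nearest centroid. The function name and exact normalization details are hypothetical.

```python
import numpy as np

def quantize_4bit_kmeans(w, centroids, block_size=64):
    """Sketch of 4-bit k-means quantization with per-block absmax scaling.

    `w` is a flat float array whose length is a multiple of `block_size`;
    `centroids` is a length-16 float codebook (one per 4-bit index).
    Returns (indices, scales): uint8 centroid indices per element and one
    float scale per block. This is an illustrative assumption, not the
    repository's actual kernel.
    """
    blocks = w.reshape(-1, block_size)
    # Per-block absmax scale maps each block into the centroid range.
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    normed = blocks / scales
    # Nearest-centroid assignment: index in 0..15 for each element.
    idx = np.abs(normed[..., None] - centroids[None, None, :]).argmin(axis=-1)
    return idx.astype(np.uint8), scales.squeeze(1)
```

Dequantization then reduces to `centroids[idx] * scale` per block, which is what makes inference cheap: only a 16-entry table lookup and one multiply per block.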

Directory Structure

.
├── config.json                  # HuggingFace model config
├── quantization_config.json     # Quantization parameters
├── generation_config.json       # Default generation settings
├── tokenizer.json               # Tokenizer files
└── model.safetensors            # Weights (quantized linear layers + bf16 embeddings/norms)

Weight Format

Linear layers are stored quantized, with three tensors each:

  • *.weight_packed: Packed uint8 indices (2 elements per byte)
  • *.scales: Per-block scale factors (bfloat16)
  • *.centroids: K-means centroids for this layer (float32)

Embeddings and norms are stored in bfloat16.
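Putting the three tensors together, dequantizing a layer can be sketched as follows. Assumptions are labeled in the code: in particular, the nibble order inside each packed byte (low nibble first) is a guess, not taken from the repository, and the function name is hypothetical.

```python
import numpy as np

def dequantize(weight_packed, scales, centroids, block_size=64):
    """Sketch: unpack two 4-bit indices per byte, look up centroids,
    and rescale per 64-element block.

    Nibble order (low nibble = first element) is an ASSUMPTION; check
    the repository's loader for the actual packing convention.
    """
    packed = np.asarray(weight_packed, dtype=np.uint8)
    lo = packed & 0x0F          # first element of each byte (assumed)
    hi = packed >> 4            # second element of each byte (assumed)
    idx = np.stack([lo, hi], axis=-1).reshape(-1)  # interleave nibbles
    # Table lookup into the per-layer codebook, then per-block rescale.
    values = np.asarray(centroids, dtype=np.float32)[idx]
    blocks = values.reshape(-1, block_size)
    return blocks * np.asarray(scales, dtype=np.float32).reshape(-1, 1)
```

Note that the `*.scales` tensors are stored in bfloat16 and the centroids in float32; a real loader would cast both (e.g. to the compute dtype) before the multiply.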
