LLaMA 11.7B - 4-bit Quantized

📚 Paper • 🏠 GitHub

This is one of the checkpoints accompanying the paper 1-Bit-Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization. Instructions for running inference with the model can be found in the corresponding repository.

⚠️ IMPORTANT: This model is intended for research purposes only. It is provided as-is without warranties for production use.

Model Details

  • Architecture: LLaMA
  • Size: 11.7B (11,701,129,216 parameters)
  • Quantization: 4-bit k-means with absmax scaling per 64-element block
  • Centroids: defined per layer
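The quantization scheme above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a 16-entry per-layer codebook, normalizes each 64-element block by its absmax, and assigns every normalized value to its nearest centroid. The function name and exact normalization details are hypothetical.

```python
import numpy as np

def quantize_4bit_kmeans(w, centroids, block_size=64):
    """Sketch of 4-bit k-means quantization with per-block absmax scaling.

    `w` is a flat float array whose length is a multiple of `block_size`;
    `centroids` is a length-16 float codebook (one per 4-bit index).
    Returns (indices, scales): uint8 centroid indices per element and one
    float scale per block. This is an illustrative assumption, not the
    repository's actual kernel.
    """
    blocks = w.reshape(-1, block_size)
    # Per-block absmax scale maps each block into the centroid range.
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    normed = blocks / scales
    # Nearest-centroid assignment: index in 0..15 for each element.
    idx = np.abs(normed[..., None] - centroids[None, None, :]).argmin(axis=-1)
    return idx.astype(np.uint8), scales.squeeze(1)
```

Dequantization then reduces to `centroids[idx] * scale` per block, which is what makes inference cheap: only a 16-entry table lookup and one multiply per block.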

Directory Structure

.
├── config.json                  # HuggingFace model config
├── quantization_config.json     # Quantization parameters
├── generation_config.json       # Default generation settings
├── tokenizer.json               # Tokenizer files
└── model.safetensors            # Weights (quantized linear layers + bf16 embeddings/norms)

Weight Format

Linear layers are stored quantized, with three tensors each:

  • *.weight_packed: Packed uint8 indices (2 elements per byte)
  • *.scales: Per-block scale factors (bfloat16)
  • *.centroids: K-means centroids for this layer (float32)

Embeddings and norms are stored in bfloat16.
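Putting the three tensors together, dequantizing a layer can be sketched as follows. Assumptions are labeled in the code: in particular, the nibble order inside each packed byte (low nibble first) is a guess, not taken from the repository, and the function name is hypothetical.

```python
import numpy as np

def dequantize(weight_packed, scales, centroids, block_size=64):
    """Sketch: unpack two 4-bit indices per byte, look up centroids,
    and rescale per 64-element block.

    Nibble order (low nibble = first element) is an ASSUMPTION; check
    the repository's loader for the actual packing convention.
    """
    packed = np.asarray(weight_packed, dtype=np.uint8)
    lo = packed & 0x0F          # first element of each byte (assumed)
    hi = packed >> 4            # second element of each byte (assumed)
    idx = np.stack([lo, hi], axis=-1).reshape(-1)  # interleave nibbles
    # Table lookup into the per-layer codebook, then per-block rescale.
    values = np.asarray(centroids, dtype=np.float32)[idx]
    blocks = values.reshape(-1, block_size)
    return blocks * np.asarray(scales, dtype=np.float32).reshape(-1, 1)
```

Note that the `*.scales` tensors are stored in bfloat16 and the centroids in float32; a real loader would cast both (e.g. to the compute dtype) before the multiply.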
