# LLaMA 11.7B - 4-bit Quantized
This is one of the checkpoints accompanying the paper *1-Bit-Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization*. Instructions for running inference with the model can be found in the corresponding repository.
⚠️ IMPORTANT: This model is intended for research purposes only. It is provided as-is, without warranties, for production use.
## Model Details
- Architecture: LLaMA
- Size: 11.7B (11,701,129,216 parameters)
- Quantization: 4-bit k-means with per-64 block absmax scaling
- Centroids: defined per layer
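The scheme above (per-64-element absmax scaling, then nearest-centroid assignment into a 16-entry per-layer codebook) can be sketched roughly as follows. This is an illustrative sketch, not the paper's actual implementation; all function and variable names are assumptions:

```python
import numpy as np

def quantize_block_kmeans(weights, centroids, block_size=64):
    """Illustrative 4-bit k-means quantization with per-block absmax
    scaling. `centroids` is the layer's 16-entry codebook (4 bits).
    Names and layout are assumptions, not the paper's API."""
    w = np.asarray(weights, dtype=np.float64).reshape(-1, block_size)
    # Per-64 absmax scale: normalizes each block into [-1, 1].
    scales = np.abs(w).max(axis=1, keepdims=True)
    normed = w / scales
    # Assign each normalized weight to its nearest centroid (0..15).
    idx = np.abs(normed[..., None] - centroids).argmin(axis=-1)
    return idx.astype(np.uint8), scales.squeeze(1)
```

Dequantization then reverses the two steps: look up each index in the codebook and multiply by the block's scale, i.e. `centroids[idx] * scales[:, None]`.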
## Directory Structure
.
├── config.json              # HuggingFace model config
├── quantization_config.json # Quantization parameters
├── generation_config.json   # Default generation settings
├── tokenizer.json           # Tokenizer files
└── model.safetensors        # Weights (quantized linear layers + bf16 embeddings/norms)
## Weight Format
Linear layers are quantized and each stores three tensors:

- `*.weight_packed`: packed uint8 indices (2 elements per byte)
- `*.scales`: per-block scale factors (bfloat16)
- `*.centroids`: k-means centroids for this layer (float32)
Embeddings and norms are stored in bfloat16.
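Given the three tensors above, dequantization amounts to unpacking two 4-bit indices from each byte, looking them up in the layer's centroid table, and rescaling per 64-element block. The sketch below assumes low-nibble-first packing, which is an assumption; the actual nibble order is not stated on this card:

```python
import numpy as np

def unpack_and_dequantize(weight_packed, scales, centroids, block_size=64):
    """Illustrative dequantization of the *.weight_packed / *.scales /
    *.centroids layout. Assumes the low nibble holds the earlier element."""
    low = weight_packed & 0x0F    # first index of each byte (assumed)
    high = weight_packed >> 4     # second index of each byte (assumed)
    idx = np.stack([low, high], axis=-1).reshape(-1)
    # Codebook lookup, then per-block rescale by the absmax scale factors.
    values = centroids[idx].reshape(-1, block_size)
    return (values * scales[:, None].astype(np.float32)).reshape(-1)
```

In the real checkpoint the scales are stored in bfloat16 and would be upcast before the multiply; numpy has no native bfloat16, so float32 scales stand in here.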