NanbeigeVLM
The full vision-language model after Stage 2 instruction fine-tuning on LLaVA-Instruct-150K. The LoRA weights have already been merged into the base model, so no adapter loading is needed at inference time.
Image → SigLIP so400m → AvgPool(729→196) → MLP Projector → Nanbeige4.1-3B → Text
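The AvgPool step above reduces the vision tower's 729 patch tokens (a 27×27 grid from SigLIP so400m) to 196 tokens (14×14) before the MLP projector, cutting the visual context the LLM must attend over by roughly 3.7×. A minimal sketch of that pooling step, assuming the standard 27×27 grid and so400m's hidden size of 1152 (the function name and exact tensor layout here are illustrative, not the repo's actual code):

```python
import torch
import torch.nn as nn

def pool_vision_tokens(tokens: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: (batch, 729, hidden) -> (batch, 196, hidden)."""
    b, n, h = tokens.shape
    side = int(n ** 0.5)                       # 27 for 729 tokens
    # Reshape the flat token sequence back into a 2D spatial grid.
    grid = tokens.transpose(1, 2).reshape(b, h, side, side)
    # Adaptive average pooling down to a 14x14 grid (196 tokens).
    pooled = nn.AdaptiveAvgPool2d(14)(grid)    # (b, h, 14, 14)
    return pooled.flatten(2).transpose(1, 2)   # (b, 196, h)

x = torch.randn(1, 729, 1152)                  # so400m hidden size is 1152
print(pool_vision_tokens(x).shape)             # torch.Size([1, 196, 1152])
```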
```python
from transformers import AutoModel, AutoTokenizer
from PIL import Image

# trust_remote_code is required: the vision tower, projector, and the
# describe() helper are defined in the repository's custom modeling code.
model = AutoModel.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM",
    trust_remote_code=True,
)
model.to("cuda")
model.eval()

tokenizer = AutoTokenizer.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM",
    trust_remote_code=True,
)
model.set_tokenizer(tokenizer)

# Run inference on a single image.
image = Image.open("photo.jpg").convert("RGB")
result = model.describe(image, prompt="What do you see in this image?")
print(result)
```
| | Stage 1 | Stage 2 |
|---|---|---|
| Dataset | LLaVA-CC3M-595K | LLaVA-Instruct-150K |
| Trainable | Projector only | Projector + LoRA (r=64) |
| LR | 2e-3 | 2e-5 |
| Hardware | A100 80GB | A100 80GB |
| Duration | ~6 hours | ~5 hours |
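Since the released checkpoint ships with the Stage 2 LoRA already merged, here is a minimal sketch of what merging means mathematically: a rank-64 adapter stores two small matrices A (r×in) and B (out×r), and merging folds the scaled low-rank update into the frozen base weight so inference needs no extra modules. All shapes and the `alpha` value below are illustrative, not the model's actual dimensions:

```python
import torch

out_f, in_f, r, alpha = 256, 512, 64, 128   # illustrative sizes, rank r=64
W = torch.randn(out_f, in_f)                # frozen base weight
A = torch.randn(r, in_f) * 0.01             # LoRA down-projection
B = torch.zeros(out_f, r)                   # LoRA up-projection (zero-init)

# Merged weight: W' = W + (alpha / r) * B @ A, same shape as W.
W_merged = W + (alpha / r) * (B @ A)
print(W_merged.shape)                       # torch.Size([256, 512])
```

With B zero-initialized (the usual LoRA init), the merged weight starts identical to the base weight; training moves B away from zero, and merging bakes that update in permanently.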