Nanbeige4.1-VLM

The full vision-language model after Stage 2 instruction fine-tuning on LLaVA-Instruct-150K. The LoRA weights have been merged into the base model, so no adapter loading is needed at inference time.

Architecture

Image → SigLIP so400m → AvgPool(729→196) → MLP Projector → Nanbeige4.1-3B → Text
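The pipeline above can be sketched in a few lines of PyTorch. This is a hypothetical illustration, not the repo's actual code: the 729 tokens correspond to SigLIP so400m's 27×27 patch grid, the pool reduces them to a 14×14 grid (196 tokens), and the MLP width (`LLM_DIM = 2048`) is a placeholder rather than the real Nanbeige4.1-3B hidden size.

```python
import torch
import torch.nn as nn

# Assumed dimensions: SigLIP so400m emits 729 patch tokens (27x27 grid,
# hidden size 1152). LLM_DIM is a placeholder, not the real model width.
VIT_TOKENS, VIT_DIM = 729, 1152
GRID, POOLED_GRID = 27, 14   # 27*27 = 729 tokens in, 14*14 = 196 out
LLM_DIM = 2048

pool = nn.AdaptiveAvgPool2d(POOLED_GRID)
projector = nn.Sequential(           # simple 2-layer MLP projector
    nn.Linear(VIT_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

feats = torch.randn(1, VIT_TOKENS, VIT_DIM)               # [B, 729, 1152]
grid = feats.transpose(1, 2).reshape(1, VIT_DIM, GRID, GRID)
pooled = pool(grid).flatten(2).transpose(1, 2)            # [B, 196, 1152]
visual_tokens = projector(pooled)                         # [B, 196, LLM_DIM]
print(visual_tokens.shape)  # torch.Size([1, 196, 2048])
```

The pooled visual tokens are then prepended to the text embeddings consumed by the language model.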

Usage

from transformers import AutoModel, AutoTokenizer
from PIL import Image

# trust_remote_code is required: the model class and its helpers
# (set_tokenizer, describe) are defined in this repository.
model = AutoModel.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM",
    trust_remote_code=True,
)
model.to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM",
    trust_remote_code=True,
)
model.set_tokenizer(tokenizer)

image = Image.open("photo.jpg").convert("RGB")  # ensure 3-channel input
result = model.describe(image, prompt="What do you see in this image?")
print(result)

Training Details

|           | Stage 1         | Stage 2                 |
|-----------|-----------------|-------------------------|
| Dataset   | LLaVA-CC3M-595K | LLaVA-Instruct-150K     |
| Trainable | Projector only  | Projector + LoRA (r=64) |
| LR        | 2e-3            | 2e-5                    |
| Hardware  | A100 80GB       | A100 80GB               |
| Duration  | ~6 hours        | ~5 hours                |
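For reference, a Stage 2 adapter setup could be expressed with PEFT roughly as below. Only r=64 comes from the table; lora_alpha, dropout, and the target modules are assumptions shown for illustration, not the card's actual training configuration.

```python
from peft import LoraConfig

# Hypothetical Stage 2 LoRA config: r=64 matches the table above;
# every other value here is an assumed placeholder.
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```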

Related Repos
