Nanbeige4.1-VLM

The full vision-language model after Stage 2 instruction fine-tuning on LLaVA-Instruct-150K. The LoRA weights have been merged into the base model, so no adapter loading is needed at inference time.

Architecture

Image → SigLIP so400m → AvgPool(729→196) → MLP Projector → Nanbeige4.1-3B → Text
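The pipeline above can be sketched in a few lines of PyTorch. This is a hypothetical illustration, not the repo's actual code: the 729 tokens correspond to SigLIP so400m's 27×27 patch grid, the pool reduces them to a 14×14 grid (196 tokens), and the MLP width (`LLM_DIM = 2048`) is a placeholder rather than the real Nanbeige4.1-3B hidden size.

```python
import torch
import torch.nn as nn

# Assumed dimensions: SigLIP so400m emits 729 patch tokens (27x27 grid,
# hidden size 1152). LLM_DIM is a placeholder, not the real model width.
VIT_TOKENS, VIT_DIM = 729, 1152
GRID, POOLED_GRID = 27, 14   # 27*27 = 729 tokens in, 14*14 = 196 out
LLM_DIM = 2048

pool = nn.AdaptiveAvgPool2d(POOLED_GRID)
projector = nn.Sequential(           # simple 2-layer MLP projector
    nn.Linear(VIT_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

feats = torch.randn(1, VIT_TOKENS, VIT_DIM)               # [B, 729, 1152]
grid = feats.transpose(1, 2).reshape(1, VIT_DIM, GRID, GRID)
pooled = pool(grid).flatten(2).transpose(1, 2)            # [B, 196, 1152]
visual_tokens = projector(pooled)                         # [B, 196, LLM_DIM]
print(visual_tokens.shape)  # torch.Size([1, 196, 2048])
```

The pooled visual tokens are then prepended to the text embeddings consumed by the language model.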

Usage

from transformers import AutoModel, AutoTokenizer
from PIL import Image

# trust_remote_code is required: the model class and its helpers
# (set_tokenizer, describe) are defined in this repository.
model = AutoModel.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM",
    trust_remote_code=True,
)
model.to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "SkyAsl/Nanbeige4.1-VLM",
    trust_remote_code=True,
)
model.set_tokenizer(tokenizer)

image = Image.open("photo.jpg").convert("RGB")  # ensure 3-channel input
result = model.describe(image, prompt="What do you see in this image?")
print(result)

Training Details

|           | Stage 1         | Stage 2                 |
|-----------|-----------------|-------------------------|
| Dataset   | LLaVA-CC3M-595K | LLaVA-Instruct-150K     |
| Trainable | Projector only  | Projector + LoRA (r=64) |
| LR        | 2e-3            | 2e-5                    |
| Hardware  | A100 80GB       | A100 80GB               |
| Duration  | ~6 hours        | ~5 hours                |
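For reference, a Stage 2 adapter setup could be expressed with PEFT roughly as below. Only r=64 comes from the table; lora_alpha, dropout, and the target modules are assumptions shown for illustration, not the card's actual training configuration.

```python
from peft import LoraConfig

# Hypothetical Stage 2 LoRA config: r=64 matches the table above;
# every other value here is an assumed placeholder.
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```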

Related Repos
