StarVLA-WM4A (LIBERO)

StarVLA-WM4A is a Vision-Language-Action (VLA) policy built on top of the StarVLA framework. It couples the Cosmos-Predict2 video world model as a frozen perception backbone with a lightweight flow-matching action DiT head (CosmoPredict2GR00T framework), and is fine-tuned on the full LIBERO manipulation suite.

It is trained on the joint LIBERO-Spatial / LIBERO-Object / LIBERO-Goal / LIBERO-10 task mix.

🤝 Please refer to the official StarVLA repository for installation, training recipes, and evaluation tooling. This repo hosts only the model weights and the minimal configuration required to load them.


✨ Highlights

| Property | Value |
| --- | --- |
| Framework | CosmoPredict2GR00T (StarVLA) |
| Perception backbone | nvidia/Cosmos-Predict2-2B-Video2World (frozen VAE + T5) |
| Action head | DiT-B, 16 layers, hidden size 1024 |
| Action dim / horizon | 7 / 8 (delta qpos + gripper) |
| State dim | 7 |
| Benchmark | LIBERO (4 task suites) |
| Training precision | bf16 mixed precision |
| LIBERO-Goal success rate | 92.0% (184 / 200, see below) |

📦 Files

StarVLA_WM4A/
β”œβ”€β”€ README.md                  # this file
β”œβ”€β”€ config.yaml                # minimal loadable config
β”œβ”€β”€ dataset_statistics.json    # action/state normalization stats
└── starvla_wm4a_libero.pt     # model weights (~14 GB)

🚀 Quick Start

1. Install StarVLA

Follow the installation instructions in the official repository:

git clone https://github.com/starVLA/starVLA.git
cd starVLA
# create the conda env and install deps; see the upstream README

2. Download the checkpoint

# Option A: huggingface-cli
huggingface-cli download JackAILab/StarVLA_WM4A \
    --local-dir ./pretrained/StarVLA_WM4A

# Option B: python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="JackAILab/StarVLA_WM4A",
    local_dir="./pretrained/StarVLA_WM4A",
)

You also need the Cosmos-Predict2 backbone that this model is built on:

huggingface-cli download nvidia/Cosmos-Predict2-2B-Video2World \
    --local-dir ./pretrained/Cosmos-Predict2-2B-Video2World

3. Run LIBERO evaluation

From the starVLA/ repo root:

# start the policy server with this checkpoint
CUDA_VISIBLE_DEVICES=0 python deployment/model_server/server_policy.py \
    --ckpt_path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --port 6694 \
    --use_bf16

# in a second shell (with the `libero` env activated):
python examples/LIBERO/eval_files/eval_libero.py \
    --args.pretrained-path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --args.host 127.0.0.1 \
    --args.port 6694 \
    --args.task-suite-name libero_goal \
    --args.num-trials-per-task 20 \
    --args.video-out-path results/eval_libero_goal

4. Load in Python

from starVLA.model.framework.base_framework import baseframework

policy = baseframework.from_pretrained(
    "./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt",
)
policy = policy.to("cuda").eval()

# predict a 7-DoF action chunk from an observation dict
# observation = {"image": [PIL.Image], "lang": "put the bowl on the plate", "state": np.ndarray[7]}
action_chunk = policy.predict_action([observation])  # -> shape [1, 8, 7]

Before loading, make sure the backbone paths in config.yaml (framework.world_model.base_wm, framework.qwenvl.base_vlm) point to your local copy of Cosmos-Predict2-2B-Video2World (or leave them as the HF repo id if your StarVLA build resolves HF paths directly).
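As a concrete companion to the comment in the snippet above, a minimal observation can be built like this (key names follow this README; Pillow and NumPy are assumed to be installed, and the image here is a blank placeholder):

```python
import numpy as np
from PIL import Image

# Minimal observation dict in the format predict_action expects
# (keys per this README: "image", "lang", "state").
observation = {
    "image": [Image.new("RGB", (224, 224))],  # single RGB view at 224x224
    "lang": "put the bowl on the plate",      # language instruction
    "state": np.zeros(7, dtype=np.float32),   # 7-D robot state
}
# action_chunk = policy.predict_action([observation])  # -> shape [1, 8, 7]
```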


🧪 Model Configuration

Key settings (see config.yaml for the full spec):

framework:
  name: CosmoPredict2GR00T
  world_model:
    base_wm: nvidia/Cosmos-Predict2-2B-Video2World
  action_model:
    action_model_type: DiT-B      # 16-layer DiT
    hidden_size: 1024
    action_dim: 7                 # (dx, dy, dz, droll, dpitch, dyaw, gripper)
    state_dim: 7
    future_action_window_size: 7  # predicts 8 actions per step
    action_horizon: 8
    repeated_diffusion_steps: 8
    num_inference_timesteps: 4
  enable_video_loss: false

trainer:
  max_train_steps: 80000
  num_warmup_steps: 3000
  learning_rate:
    base: 1.0e-05                 # backbone LR (frozen text/vae modules)
  lr_scheduler_type: cosine_with_min_lr
  freeze_modules: backbone.text_encoder, backbone.vae

  • Frozen modules: T5 text encoder and Cosmos VAE; only the DiT transformer and action head receive gradients.
  • Optimizer: AdamW, β = (0.9, 0.95), weight decay 1e-8, gradient clipping at 1.0.
  • Schedule: cosine with minimum LR, 3,000 warmup steps.
  • Precision: bf16 mixed precision with gradient checkpointing.
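The optimizer settings above can be sketched in PyTorch. This is illustrative only, not the StarVLA trainer itself; `model` is a stand-in for the trainable DiT + action head, with the T5 encoder and VAE assumed frozen beforehand:

```python
import torch

# Stand-in for the trainable parameters (DiT transformer + action head);
# in the real setup the T5 text encoder and VAE are frozen first.
model = torch.nn.Linear(8, 8)

# AdamW with the hyperparameters listed above.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-5,
    betas=(0.9, 0.95),
    weight_decay=1e-8,
)

# One illustrative step with gradient clipping at 1.0.
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```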

dataset_statistics.json contains the per-dimension action/state mean/std/min/max computed on the LIBERO Franka mix. These statistics are required at inference time to normalize inputs and un-normalize predicted actions (unnorm_key=franka).
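A sketch of how such statistics are typically applied at inference time; the JSON schema used here is an assumption (check dataset_statistics.json for the real key layout):

```python
import numpy as np

# Hypothetical stats layout: {"franka": {"action": {"mean": [...], "std": [...]}}}.
# The real dataset_statistics.json may nest its keys differently.
stats = {"franka": {"action": {"mean": [0.0] * 7, "std": [1.0] * 7}}}

def unnormalize_actions(chunk: np.ndarray, stats: dict, unnorm_key: str = "franka") -> np.ndarray:
    """Map a normalized [horizon, action_dim] chunk back to raw action units."""
    s = stats[unnorm_key]["action"]
    return chunk * np.asarray(s["std"]) + np.asarray(s["mean"])

chunk = np.ones((8, 7), dtype=np.float32)  # e.g. one predicted action chunk
raw = unnormalize_actions(chunk, stats)    # same shape, raw units
```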


πŸ† LIBERO-Goal Results

Evaluated with the standard StarVLA LIBERO pipeline: 20 rollouts per task across the 10 tasks of the libero_goal suite (200 rollouts total). The policy server runs in bf16 with 4 inference timesteps and an action chunk of 8.

Overall success rate: 92.0% (184 / 200)

| Task | Successes | Rate |
| --- | --- | --- |
| push_the_plate_to_the_front_of_the_stove | 20 / 20 | 100.0% |
| put_the_bowl_on_the_plate | 20 / 20 | 100.0% |
| put_the_wine_bottle_on_top_of_the_cabinet | 20 / 20 | 100.0% |
| turn_on_the_stove | 20 / 20 | 100.0% |
| open_the_middle_drawer_of_the_cabinet | 19 / 20 | 95.0% |
| put_the_bowl_on_top_of_the_cabinet | 19 / 20 | 95.0% |
| put_the_cream_cheese_in_the_bowl | 18 / 20 | 90.0% |
| put_the_bowl_on_the_stove | 17 / 20 | 85.0% |
| put_the_wine_bottle_on_the_rack | 16 / 20 | 80.0% |
| open_the_top_drawer_and_put_the_bowl_inside | 15 / 20 | 75.0% |
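The reported overall rate follows directly from the per-task counts; a quick sanity check:

```python
# Per-task success counts from the table above.
successes = [20, 20, 20, 20, 19, 19, 18, 17, 16, 15]
total = sum(successes)
rate = 100.0 * total / (len(successes) * 20)
print(total, rate)  # 184 92.0
```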

Reproduce with:

python examples/LIBERO/eval_files/eval_libero.py \
    --args.pretrained-path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --args.task-suite-name libero_goal \
    --args.num-trials-per-task 20

Evaluation on the other LIBERO suites (libero_spatial, libero_object, libero_10) is ongoing and will be appended here once the full sweep finishes.


📊 Training Data

Trained on the four LIBERO task suites in a balanced mixture, loaded through the StarVLA LeRobot data pipeline:

  • libero_spatial_no_noops_1.0.0_lerobot
  • libero_object_no_noops_1.0.0_lerobot
  • libero_goal_no_noops_1.0.0_lerobot
  • libero_10_no_noops_1.0.0_lerobot

All four are derived from the original LIBERO benchmark (see LIBERO) and wrapped into LeRobot format (see openvla/modified_libero_rlds for the upstream RLDS version).

Input: a single RGB view at 224 × 224, a language instruction, and the 7-D robot state. Output: a chunk of 8 future actions (delta qpos + gripper).
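The 8-action chunk is produced by the flow-matching head in num_inference_timesteps = 4 integration steps (see the config above). Schematically, and not the actual StarVLA implementation, this amounts to a short Euler integration of a learned velocity field from noise to actions:

```python
import numpy as np

def sample_action_chunk(velocity_fn, steps: int = 4, shape=(8, 7), seed: int = 0):
    """Toy flow-matching sampler: Euler-integrate dx/dt = v(x, t) from t=0 to t=1."""
    x = np.random.default_rng(seed).standard_normal(shape)  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step
    return x

# Toy velocity field standing in for the trained DiT head.
chunk = sample_action_chunk(lambda x, t: -x)
```

With v(x, t) = -x, each Euler step scales the sample by (1 - dt), so 4 steps shrink the initial noise by 0.75^4; the real head replaces this toy field with the trained transformer.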


📜 License

Released under the Apache 2.0 license.

This checkpoint is built on top of nvidia/Cosmos-Predict2-2B-Video2World; please also comply with the upstream Cosmos model license when using or redistributing these weights.


📖 Citation

If you use this checkpoint, please cite the StarVLA project and the Cosmos-Predict2 world model:

@misc{starvla2026,
  title  = {StarVLA: A Unified Vision-Language-Action Framework},
  author = {StarVLA Contributors},
  year   = {2026},
  url    = {https://github.com/starVLA/starVLA}
}

@misc{cosmospredict2,
  title  = {Cosmos-Predict2: A Video World Model for Robotics and Simulation},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World}
}
