StarVLA-WM4A (LIBERO)

StarVLA-WM4A is a Vision-Language-Action (VLA) policy built on top of the StarVLA framework. It couples the Cosmos-Predict2 video world model as a frozen perception backbone with a lightweight flow-matching action DiT head (CosmoPredict2GR00T framework), and is fine-tuned on the full LIBERO manipulation suite.

It is trained on the joint LIBERO-Spatial / LIBERO-Object / LIBERO-Goal / LIBERO-10 task mix.

🤝 Please refer to the official StarVLA repository for installation, training recipes, and evaluation tooling. This repo hosts only the model weights and the minimal configuration required to load them.


✨ Highlights

| Property | Value |
| --- | --- |
| Framework | CosmoPredict2GR00T (StarVLA) |
| Perception backbone | nvidia/Cosmos-Predict2-2B-Video2World (frozen VAE + T5) |
| Action head | DiT-B, 16 layers, hidden size 1024 |
| Action dim / horizon | 7 / 8 (delta qpos + gripper) |
| State dim | 7 |
| Benchmark | LIBERO (4 task suites) |
| Training precision | bf16 mixed precision |
| LIBERO-Goal success rate | 92.0% (184 / 200, see below) |

📦 Files

StarVLA_WM4A/
β”œβ”€β”€ README.md                  # this file
β”œβ”€β”€ config.yaml                # minimal loadable config
β”œβ”€β”€ dataset_statistics.json    # action/state normalization stats
└── starvla_wm4a_libero.pt     # model weights (~14 GB)

🚀 Quick Start

1. Install StarVLA

Follow the installation instructions in the official repository:

git clone https://github.com/starVLA/starVLA.git
cd starVLA
# create the conda env and install deps; see the upstream README

2. Download the checkpoint

# Option A: huggingface-cli
huggingface-cli download JackAILab/StarVLA_WM4A \
    --local-dir ./pretrained/StarVLA_WM4A

# Option B: python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="JackAILab/StarVLA_WM4A",
    local_dir="./pretrained/StarVLA_WM4A",
)

You also need the Cosmos-Predict2 backbone that this model is built on:

huggingface-cli download nvidia/Cosmos-Predict2-2B-Video2World \
    --local-dir ./pretrained/Cosmos-Predict2-2B-Video2World

3. Run LIBERO evaluation

From the starVLA/ repo root:

# start the policy server with this checkpoint
CUDA_VISIBLE_DEVICES=0 python deployment/model_server/server_policy.py \
    --ckpt_path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --port 6694 \
    --use_bf16

# in a second shell (with the `libero` env activated):
python examples/LIBERO/eval_files/eval_libero.py \
    --args.pretrained-path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --args.host 127.0.0.1 \
    --args.port 6694 \
    --args.task-suite-name libero_goal \
    --args.num-trials-per-task 20 \
    --args.video-out-path results/eval_libero_goal

4. Load in Python

from starVLA.model.framework.base_framework import baseframework

policy = baseframework.from_pretrained(
    "./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt",
)
policy = policy.to("cuda").eval()

# predict a 7-DoF action chunk from an observation dict
# observation = {"image": [PIL.Image], "lang": "put the bowl on the plate", "state": np.ndarray[7]}
action_chunk = policy.predict_action([observation])  # -> shape [1, 8, 7]

Before loading, make sure the backbone paths in config.yaml (framework.world_model.base_wm, framework.qwenvl.base_vlm) point to your local copy of Cosmos-Predict2-2B-Video2World (or leave them as the HF repo id if your StarVLA build resolves HF paths directly).
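As a concrete companion to the comment in the snippet above, a minimal observation can be built like this (key names follow this README; Pillow and NumPy are assumed to be installed, and the image here is a blank placeholder):

```python
import numpy as np
from PIL import Image

# Minimal observation dict in the format predict_action expects
# (keys per this README: "image", "lang", "state").
observation = {
    "image": [Image.new("RGB", (224, 224))],  # single RGB view at 224x224
    "lang": "put the bowl on the plate",      # language instruction
    "state": np.zeros(7, dtype=np.float32),   # 7-D robot state
}
# action_chunk = policy.predict_action([observation])  # -> shape [1, 8, 7]
```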


🧪 Model Configuration

Key settings (see config.yaml for the full spec):

framework:
  name: CosmoPredict2GR00T
  world_model:
    base_wm: nvidia/Cosmos-Predict2-2B-Video2World
  action_model:
    action_model_type: DiT-B      # 16-layer DiT
    hidden_size: 1024
    action_dim: 7                 # (dx, dy, dz, droll, dpitch, dyaw, gripper)
    state_dim: 7
    future_action_window_size: 7  # predicts 8 actions per step
    action_horizon: 8
    repeated_diffusion_steps: 8
    num_inference_timesteps: 4
  enable_video_loss: false

trainer:
  max_train_steps: 80000
  num_warmup_steps: 3000
  learning_rate:
    base: 1.0e-05                 # backbone LR (frozen text/vae modules)
  lr_scheduler_type: cosine_with_min_lr
  freeze_modules: backbone.text_encoder, backbone.vae

  • Frozen modules: T5 text encoder and Cosmos VAE; only the DiT transformer and action head receive gradients.
  • Optimizer: AdamW, β = (0.9, 0.95), weight decay 1e-8, gradient clipping at 1.0.
  • Schedule: cosine with minimum LR, 3,000 warmup steps.
  • Precision: bf16 mixed precision with gradient checkpointing.
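The optimizer settings above can be sketched in PyTorch. This is illustrative only, not the StarVLA trainer itself; `model` is a stand-in for the trainable DiT + action head, with the T5 encoder and VAE assumed frozen beforehand:

```python
import torch

# Stand-in for the trainable parameters (DiT transformer + action head);
# in the real setup the T5 text encoder and VAE are frozen first.
model = torch.nn.Linear(8, 8)

# AdamW with the hyperparameters listed above.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-5,
    betas=(0.9, 0.95),
    weight_decay=1e-8,
)

# One illustrative step with gradient clipping at 1.0.
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```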

dataset_statistics.json contains the per-dimension action/state mean/std/min/max computed on the LIBERO Franka mix. These statistics are required at inference time to normalize inputs and un-normalize predicted actions (unnorm_key=franka).
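A sketch of how such statistics are typically applied at inference time; the JSON schema used here is an assumption (check dataset_statistics.json for the real key layout):

```python
import numpy as np

# Hypothetical stats layout: {"franka": {"action": {"mean": [...], "std": [...]}}}.
# The real dataset_statistics.json may nest its keys differently.
stats = {"franka": {"action": {"mean": [0.0] * 7, "std": [1.0] * 7}}}

def unnormalize_actions(chunk: np.ndarray, stats: dict, unnorm_key: str = "franka") -> np.ndarray:
    """Map a normalized [horizon, action_dim] chunk back to raw action units."""
    s = stats[unnorm_key]["action"]
    return chunk * np.asarray(s["std"]) + np.asarray(s["mean"])

chunk = np.ones((8, 7), dtype=np.float32)  # e.g. one predicted action chunk
raw = unnormalize_actions(chunk, stats)    # same shape, raw units
```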


πŸ† LIBERO-Goal Results

Evaluated with the standard StarVLA LIBERO pipeline: 20 rollouts per task across the 10 tasks of the libero_goal suite (200 rollouts total). The policy server runs in bf16 with 4 inference timesteps and an action chunk of 8.

Overall success rate: 92.0% (184 / 200)

| Task | Successes | Rate |
| --- | --- | --- |
| push_the_plate_to_the_front_of_the_stove | 20 / 20 | 100.0% |
| put_the_bowl_on_the_plate | 20 / 20 | 100.0% |
| put_the_wine_bottle_on_top_of_the_cabinet | 20 / 20 | 100.0% |
| turn_on_the_stove | 20 / 20 | 100.0% |
| open_the_middle_drawer_of_the_cabinet | 19 / 20 | 95.0% |
| put_the_bowl_on_top_of_the_cabinet | 19 / 20 | 95.0% |
| put_the_cream_cheese_in_the_bowl | 18 / 20 | 90.0% |
| put_the_bowl_on_the_stove | 17 / 20 | 85.0% |
| put_the_wine_bottle_on_the_rack | 16 / 20 | 80.0% |
| open_the_top_drawer_and_put_the_bowl_inside | 15 / 20 | 75.0% |
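The reported overall rate follows directly from the per-task counts; a quick sanity check:

```python
# Per-task success counts from the table above.
successes = [20, 20, 20, 20, 19, 19, 18, 17, 16, 15]
total = sum(successes)
rate = 100.0 * total / (len(successes) * 20)
print(total, rate)  # 184 92.0
```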

Reproduce with:

python examples/LIBERO/eval_files/eval_libero.py \
    --args.pretrained-path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --args.task-suite-name libero_goal \
    --args.num-trials-per-task 20

Evaluation on the other LIBERO suites (libero_spatial, libero_object, libero_10) is ongoing and will be appended here once the full sweep finishes.


📊 Training Data

Trained on the four LIBERO task suites in a balanced mixture, loaded through the StarVLA LeRobot data pipeline:

  • libero_spatial_no_noops_1.0.0_lerobot
  • libero_object_no_noops_1.0.0_lerobot
  • libero_goal_no_noops_1.0.0_lerobot
  • libero_10_no_noops_1.0.0_lerobot

All four are derived from the original LIBERO benchmark (see LIBERO) and wrapped into LeRobot format (see openvla/modified_libero_rlds for the upstream RLDS version).

Input: a single RGB view at 224 × 224, a language instruction, and the 7-D robot state. Output: a chunk of 8 future actions (delta qpos + gripper).
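The 8-action chunk is produced by the flow-matching head in num_inference_timesteps = 4 integration steps (see the config above). Schematically, and not the actual StarVLA implementation, this amounts to a short Euler integration of a learned velocity field from noise to actions:

```python
import numpy as np

def sample_action_chunk(velocity_fn, steps: int = 4, shape=(8, 7), seed: int = 0):
    """Toy flow-matching sampler: Euler-integrate dx/dt = v(x, t) from t=0 to t=1."""
    x = np.random.default_rng(seed).standard_normal(shape)  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step
    return x

# Toy velocity field standing in for the trained DiT head.
chunk = sample_action_chunk(lambda x, t: -x)
```

With v(x, t) = -x, each Euler step scales the sample by (1 - dt), so 4 steps shrink the initial noise by 0.75^4; the real head replaces this toy field with the trained transformer.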


📜 License

Released under the Apache 2.0 license.

This checkpoint is built on top of nvidia/Cosmos-Predict2-2B-Video2World; please also comply with the upstream Cosmos model license when using or redistributing these weights.


📖 Citation

If you use this checkpoint, please cite the StarVLA project and the Cosmos-Predict2 world model:

@misc{starvla2026,
  title  = {StarVLA: A Unified Vision-Language-Action Framework},
  author = {StarVLA Contributors},
  year   = {2026},
  url    = {https://github.com/starVLA/starVLA}
}

@misc{cosmospredict2,
  title  = {Cosmos-Predict2: A Video World Model for Robotics and Simulation},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World}
}
