# StarVLA-WM4A (LIBERO)
StarVLA-WM4A is a Vision-Language-Action (VLA) policy built on top of the
[StarVLA](https://github.com/starVLA/starVLA) framework. It couples the
[Cosmos-Predict2](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
video world model as a frozen perception backbone with a lightweight
flow-matching action DiT head (the CosmoPredict2GR00T framework), and is
fine-tuned on the full LIBERO manipulation suite, i.e. the joint
LIBERO-Spatial / LIBERO-Object / LIBERO-Goal / LIBERO-10 task mix.
> 🤗 Please refer to the official [StarVLA](https://github.com/starVLA/starVLA) repository for installation, training recipes, and evaluation tooling. This repo only hosts the model weights and the minimal configuration required to load them.
## ✨ Highlights
| Property | Value |
|---|---|
| Framework | CosmoPredict2GR00T (StarVLA) |
| Perception backbone | nvidia/Cosmos-Predict2-2B-Video2World (frozen VAE + T5) |
| Action head | DiT-B, 16 layers, hidden=1024 |
| Action dim / horizon | 7 / 8 (delta qpos + gripper) |
| State dim | 7 |
| Benchmark | LIBERO (4 task suites) |
| Training precision | bf16 mixed precision |
| LIBERO-Goal success rate | 92.0% (184 / 200, see below) |
## 📦 Files
```
StarVLA_WM4A/
├── README.md                  # this file
├── config.yaml                # minimal loadable config
├── dataset_statistics.json    # action/state normalization stats
└── starvla_wm4a_libero.pt     # model weights (~14 GB)
```
## 🚀 Quick Start
### 1. Install StarVLA

Follow the installation instructions in the official repository:

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# create the conda env, install deps, etc. (see the upstream README)
```
### 2. Download the checkpoint

```bash
# Option A: huggingface-cli
huggingface-cli download JackAILab/StarVLA_WM4A \
  --local-dir ./pretrained/StarVLA_WM4A
```

```python
# Option B: python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="JackAILab/StarVLA_WM4A",
    local_dir="./pretrained/StarVLA_WM4A",
)
```

You also need the Cosmos-Predict2 backbone that this model is built on:

```bash
huggingface-cli download nvidia/Cosmos-Predict2-2B-Video2World \
  --local-dir ./pretrained/Cosmos-Predict2-2B-Video2World
```
### 3. Run LIBERO evaluation

From the `starVLA/` repo root:

```bash
# start the policy server with this checkpoint
CUDA_VISIBLE_DEVICES=0 python deployment/model_server/server_policy.py \
  --ckpt_path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
  --port 6694 \
  --use_bf16
```

```bash
# in a second shell (with the `libero` env activated):
python examples/LIBERO/eval_files/eval_libero.py \
  --args.pretrained-path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
  --args.host 127.0.0.1 \
  --args.port 6694 \
  --args.task-suite-name libero_goal \
  --args.num-trials-per-task 20 \
  --args.video-out-path results/eval_libero_goal
```
### 4. Load in Python

```python
from starVLA.model.framework.base_framework import baseframework

policy = baseframework.from_pretrained(
    "./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt",
)
policy = policy.to("cuda").eval()

# predict a 7-DoF action chunk from an observation dict
# observation = {"image": [PIL.Image], "lang": "put the bowl on the plate", "state": np.ndarray[7]}
action_chunk = policy.predict_action([observation])  # -> shape [1, 8, 7]
```

Before loading, make sure the backbone paths in `config.yaml`
(`framework.world_model.base_wm`, `framework.qwenvl.base_vlm`) point to your
local copy of Cosmos-Predict2-2B-Video2World (or leave them as the HF repo id
if your StarVLA build resolves HF paths directly).
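In deployment, the 8-step chunks returned by `predict_action` are typically consumed in a receding-horizon loop: execute only the first few actions of each chunk, then re-query the policy from the fresh observation. Below is a minimal sketch of that pattern; `run_rollout`, `execute_horizon`, and the `env` interface (`reset()`, `step(action) -> (obs, done)`) are illustrative assumptions, not StarVLA API.

```python
import numpy as np

def run_rollout(policy, env, max_steps=200, execute_horizon=4):
    """Receding-horizon control: query an 8-step action chunk, execute only
    the first `execute_horizon` actions, then replan from the new observation.
    `policy.predict_action` follows the Quick Start signature; `env` is a
    hypothetical wrapper with reset() and step(action) -> (obs, done)."""
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        chunk = np.asarray(policy.predict_action([obs]))  # shape [1, 8, 7]
        for action in chunk[0][:execute_horizon]:
            obs, done = env.step(action)
            steps += 1
            if done:
                return True  # task success signalled by the environment
    return False
```

Executing fewer actions per chunk (smaller `execute_horizon`) replans more often at the cost of more policy queries; the LIBERO evaluation scripts in the StarVLA repo fix this trade-off for you.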
## 🧪 Model Configuration

Key settings (see `config.yaml` for the full spec):

```yaml
framework:
  name: CosmoPredict2GR00T
  world_model:
    base_wm: nvidia/Cosmos-Predict2-2B-Video2World
  action_model:
    action_model_type: DiT-B        # 16-layer DiT
    hidden_size: 1024
    action_dim: 7                   # (dx, dy, dz, droll, dpitch, dyaw, gripper)
    state_dim: 7
    future_action_window_size: 7    # predicts 8 actions per step
    action_horizon: 8
    repeated_diffusion_steps: 8
    num_inference_timesteps: 4
  enable_video_loss: false
trainer:
  max_train_steps: 80000
  num_warmup_steps: 3000
  learning_rate:
    base: 1.0e-05                   # backbone LR (frozen text/vae modules)
  lr_scheduler_type: cosine_with_min_lr
  freeze_modules: backbone.text_encoder, backbone.vae
```
- Frozen modules: T5 text encoder and Cosmos VAE; only the DiT transformer and action head receive gradients.
- Optimizer: AdamW, β = (0.9, 0.95), weight decay 1e-8, grad clip 1.0.
- Schedule: cosine-with-min-lr, 3k warmup steps.
- Precision: bf16 mixed precision with gradient checkpointing.
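The `num_inference_timesteps: 4` setting means the flow-matching action head samples a chunk by integrating its learned velocity field in just four Euler steps, from Gaussian noise at t=0 to an action chunk at t=1. A toy sketch of that sampler, where `velocity_fn` stands in for the DiT (`sample_actions` is an illustrative name, not StarVLA API):

```python
import numpy as np

def sample_actions(velocity_fn, action_shape=(8, 7), num_steps=4, seed=0):
    """Few-step flow-matching sampling sketch: integrate the learned
    velocity field from noise (t=0) toward data (t=1) with Euler steps.
    velocity_fn(x, t) stands in for the conditioned DiT action head."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(action_shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)     # one Euler integration step
    return x
```

With only four steps, Euler integration is cheap enough to run inside a real-time control loop, which is why few-step samplers are popular for action heads.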
`dataset_statistics.json` contains the per-dimension action/state mean/std/min/max
computed on the LIBERO Franka mix. These are required at inference time for
action normalization (`unnorm_key=franka`).
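The statistics are used to map the model's normalized outputs back into raw action units. A minimal sketch of that un-normalization step; the JSON schema assumed here (`{key: {"action": {"mean", "std"}}}`) is a guess, so check the actual file shipped with the checkpoint before relying on it:

```python
import json
import numpy as np

def unnormalize_actions(norm_actions, stats_path, unnorm_key="franka"):
    """Map normalized model outputs back to raw action units using the
    per-dimension statistics in dataset_statistics.json.
    NOTE: the schema {key: {"action": {"mean", "std"}}} is an assumption;
    inspect the file shipped with the checkpoint to confirm."""
    with open(stats_path) as f:
        stats = json.load(f)[unnorm_key]["action"]
    mean = np.asarray(stats["mean"])  # shape [7]
    std = np.asarray(stats["std"])    # shape [7]
    return np.asarray(norm_actions) * std + mean
```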
## 📊 LIBERO-Goal Results
Evaluated with the standard StarVLA LIBERO pipeline: 20 rollouts per task,
10 tasks in the `libero_goal` suite (200 rollouts total). The policy server
runs at bf16, 4 inference timesteps, action chunk of 8.
**Overall success rate: 92.0% (184 / 200)**
| Task | Success | Rate |
|---|---|---|
| push_the_plate_to_the_front_of_the_stove | 20 / 20 | 100.0% |
| put_the_bowl_on_the_plate | 20 / 20 | 100.0% |
| put_the_wine_bottle_on_top_of_the_cabinet | 20 / 20 | 100.0% |
| turn_on_the_stove | 20 / 20 | 100.0% |
| open_the_middle_drawer_of_the_cabinet | 19 / 20 | 95.0% |
| put_the_bowl_on_top_of_the_cabinet | 19 / 20 | 95.0% |
| put_the_cream_cheese_in_the_bowl | 18 / 20 | 90.0% |
| put_the_bowl_on_the_stove | 17 / 20 | 85.0% |
| put_the_wine_bottle_on_the_rack | 16 / 20 | 80.0% |
| open_the_top_drawer_and_put_the_bowl_inside | 15 / 20 | 75.0% |
Reproduce with:

```bash
python examples/LIBERO/eval_files/eval_libero.py \
  --args.pretrained-path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
  --args.task-suite-name libero_goal \
  --args.num-trials-per-task 20
```
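The headline number follows directly from the per-task counts in the table, which is an easy sanity check when re-running the sweep:

```python
# Per-task success counts from the LIBERO-Goal table above (each out of 20 trials).
successes = [20, 20, 20, 20, 19, 19, 18, 17, 16, 15]
trials_per_task = 20

total = sum(successes)
n_trials = trials_per_task * len(successes)
print(f"{total} / {n_trials} = {total / n_trials:.1%}")  # 184 / 200 = 92.0%
```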
Evaluation on the other LIBERO suites (libero_spatial, libero_object,
libero_10) is ongoing and will be appended here once the full sweep finishes.
## 📚 Training Data
Trained on the four LIBERO task suites in a balanced mixture, loaded through the StarVLA LeRobot data pipeline:

- `libero_spatial_no_noops_1.0.0_lerobot`
- `libero_object_no_noops_1.0.0_lerobot`
- `libero_goal_no_noops_1.0.0_lerobot`
- `libero_10_no_noops_1.0.0_lerobot`
All four are derived from the original LIBERO benchmark (see LIBERO) and wrapped into LeRobot format (see openvla/modified_libero_rlds for the upstream RLDS version).
- Input: single RGB view at 224 × 224, language instruction, 7-D robot state.
- Output: chunk of 8 future actions (delta_qpos + gripper).
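The input format above can be packaged into the observation dict used in the Quick Start. This is an illustrative helper, not StarVLA API, and the plain resize to 224 × 224 is an assumption; the actual data pipeline may crop or letterbox instead:

```python
import numpy as np
from PIL import Image

def make_observation(frame, instruction, state):
    """Package one camera frame (HxWx3 uint8 array), a language instruction,
    and the 7-D proprioceptive state into the observation dict shape used in
    the Quick Start. Hypothetical helper; resize mode is an assumption."""
    img = Image.fromarray(np.asarray(frame, dtype=np.uint8)).resize((224, 224))
    return {
        "image": [img],
        "lang": instruction,
        "state": np.asarray(state, dtype=np.float32),
    }
```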
## 📄 License
Released under the Apache 2.0 license.
This checkpoint is built on top of nvidia/Cosmos-Predict2-2B-Video2World; please also comply with the upstream Cosmos model license when using or redistributing these weights.
## 📝 Citation
If you use this checkpoint, please cite the StarVLA project and the Cosmos-Predict2 world model:
```bibtex
@misc{starvla2026,
  title  = {StarVLA: A Unified Vision-Language-Action Framework},
  author = {StarVLA Contributors},
  year   = {2026},
  url    = {https://github.com/starVLA/starVLA}
}

@misc{cosmospredict2,
  title  = {Cosmos-Predict2: A Video World Model for Robotics and Simulation},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World}
}
```
## 🔗 Links
- Framework: https://github.com/starVLA/starVLA
- Backbone: https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World
- Benchmark: https://github.com/Lifelong-Robot-Learning/LIBERO
- Issues / Questions: please open an issue in the [StarVLA repo](https://github.com/starVLA/starVLA) and tag it with `model/StarVLA_WM4A`.