MIMIC: Motion Imitation from Massive Internet Clips

A 4.0B-parameter vision-language-action model for full-body humanoid control, trained entirely from internet-scale human video.

Model Details

  • Architecture: Qwen3-VL-4B (early exit at layer 18) + 24L/1536D DiT action head
  • Parameters: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
  • Action space: 22-DoF joint angles at 10Hz
  • Action horizon: 16 steps (1.6s)
  • Training data: MoveNet-332 (~332K clips sourced from Kinetics-700; ~4.7M training samples)
  • Training compute: 4x RTX Pro Blackwell GPUs (~576 GPU-hours)
  • Checkpoint step: 107250
  • Best validation loss: ≈0.1084
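The parameter breakdown and action-chunk timing above can be sanity-checked with a few lines of arithmetic (all figures are taken from the details list; nothing here is measured from the checkpoint itself):

```python
# Sum the per-component parameter counts quoted in the model details.
components = {
    "truncated LLM backbone": 2.2e9,
    "vision encoder": 415e6,
    "DiT action head": 1.28e9,
    "LoRA adapters": 132e6,
}

total = sum(components.values())
print(f"total parameters: {total / 1e9:.2f}B")  # ≈ 4.03B, reported as ~4.0B

# 16 action steps emitted at 10 Hz span a 1.6 s horizon.
horizon_s = 16 / 10
print(f"action horizon: {horizon_s:.1f}s")
```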

Usage

from training.vla_model import QwenVLAModel
import torch, yaml

# Build the model from the training config, so the architecture
# matches the checkpoint exactly
config = yaml.safe_load(open("training/config_kinetics.yaml"))
model = QwenVLAModel(**config["model_config"])

# weights_only=False: the checkpoint stores pickled objects beyond raw tensors
ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval().cuda()

See the GitHub repo for full inference and training code.
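Since the model predicts 16-step action chunks (1.6 s at 10 Hz), a typical controller executes only the first few steps of each chunk before re-querying the policy. The sketch below illustrates that receding-horizon pattern with a placeholder `policy` function; the real inference call and its signature live in the repo, not here:

```python
import numpy as np

def policy(observation):
    # Placeholder: a real call would run the VLA model on camera frames
    # and a language instruction. Shape (horizon, dof) = (16, 22).
    return np.zeros((16, 22), dtype=np.float32)

def run_episode(n_steps=40, execute_k=8):
    """Execute the first `execute_k` steps of each predicted chunk, then replan."""
    executed = []
    obs = None  # stand-in for the robot's current observation
    while len(executed) < n_steps:
        chunk = policy(obs)
        executed.extend(chunk[:execute_k])
    return np.stack(executed[:n_steps])

traj = run_episode()
print(traj.shape)  # (40, 22)
```

Executing fewer steps than the full horizon trades compute for responsiveness; `execute_k` is a tunable assumption, not a value from this model card.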

Training

Trained with a flow-matching loss on the MoveNet-332 dataset. The SigLIP vision encoder is frozen throughout, the LLM backbone is adapted with LoRA (rank 128), and the DiT action head is trained from scratch.
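For intuition, a minimal flow-matching objective (rectified-flow style) looks like the sketch below. It assumes the DiT head predicts a velocity field over noised action chunks; the function name, `dit` signature, and interpolation schedule are illustrative, and the repo's training code is the authoritative version:

```python
import torch

def flow_matching_loss(dit, actions, context):
    """actions: (B, 16, 22) ground-truth chunks; context: conditioning tokens."""
    b = actions.shape[0]
    noise = torch.randn_like(actions)                # x_0 ~ N(0, I)
    t = torch.rand(b, 1, 1, device=actions.device)   # flow time, uniform in [0, 1]
    x_t = (1 - t) * noise + t * actions              # linear interpolation path
    target_v = actions - noise                       # constant velocity target
    pred_v = dit(x_t, t.squeeze(), context)          # hypothetical DiT signature
    return torch.nn.functional.mse_loss(pred_v, target_v)
```

At inference time, actions are recovered by integrating the learned velocity field from noise toward data over a few solver steps.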

Citation

Paper forthcoming.

License

Apache 2.0
