DynaCLIP: Physics-Grounded Visual Representations via Dynamics Contrastive Learning

Model Description

DynaCLIP is a self-supervised visual encoder that embeds implicit physical dynamics into visual representations through contrastive pre-training with analytically computed physics priors. Built on DINOv2-ViT-B/14, it fine-tunes the entire backbone with a Soft InfoNCE loss using category-grounded material properties (mass, friction, restitution) derived from 10 physics archetypes.

DynaCLIP ranks first on 13 of 14 headline metrics across 14 experiments spanning physics understanding, robotic manipulation, material reasoning, and world modeling — all without any task-specific fine-tuning of the representation.

Architecture

  • Backbone: DINOv2-ViT-B/14 (86M params, fully fine-tuned)
  • Feature extraction: CLS token ∥ mean-pooled patches → 1536-dim
  • Projection head: Linear(1536,768) → LayerNorm → GELU → Linear(768,512) → L2-norm (discarded after pre-training)
  • Total parameters: 88.2M (training) / 86M (inference)
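The feature-extraction step above (CLS token concatenated with mean-pooled patch tokens) can be sketched as follows; the function name and tensor shapes are illustrative assumptions, not the repository's actual API:

```python
import torch

def extract_features(cls_token: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate the CLS token with mean-pooled patch tokens.

    cls_token:    [B, 768]     (DINOv2-ViT-B/14 embedding dim)
    patch_tokens: [B, N, 768]  (N = 256 patches for a 224x224 input at patch size 14)
    returns:      [B, 1536]
    """
    pooled = patch_tokens.mean(dim=1)               # [B, 768]
    return torch.cat([cls_token, pooled], dim=-1)   # [B, 1536]
```

This is why the backbone's 768-dim ViT-B embedding yields 1536-dim features downstream.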

Files

File                   Size     Description
dynaclip_backbone.pt   346 MB   Recommended. Backbone weights only (DINOv2-ViT-B/14 fine-tuned). Use for downstream tasks.
dynaclip_final.pt      1.01 GB  Full training checkpoint (model + optimizer + scheduler + loss). Use to resume training.

Usage

import torch
from dynaclip.models import DynaCLIPEncoder

# Load backbone for downstream feature extraction
encoder = DynaCLIPEncoder(checkpoint_path="dynaclip_backbone.pt")
encoder.eval().cuda()

images = torch.randn(4, 3, 224, 224).cuda()
with torch.no_grad():
    features = encoder(images)  # [4, 1536]

Or download the weights programmatically:

from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="zhengtaoyao/DynaCLIP", filename="dynaclip_backbone.pt")

Training Details

  • Data: DomainNet (345 categories) with analytical physics trajectories
  • Physics priors: 10 material archetypes (metal, wood, fabric, glass/ceramic, rubber/plastic, food/organic, paper, animal, stone, default)
  • Loss: Soft InfoNCE with learnable temperature (init τ=0.07)
  • Optimizer: AdamW (backbone lr=1e-5, head lr=1e-3, weight decay=0.05)
  • Schedule: Cosine with 500-step warmup, 20K total steps
  • Batch size: 1280 effective (DDP across 8 GPUs)
  • Precision: bf16
  • Hardware: 8× NVIDIA RTX PRO 6000 Blackwell (97 GB each)
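The Soft InfoNCE objective listed above can be sketched as a cross-entropy against soft targets over in-batch similarities. This is a minimal sketch, not the repository's implementation: the soft-target matrix (assumed here to be precomputed from the physics priors, with each row a distribution over the batch) is taken as an input, and the learnable temperature is stored as log(τ) so it stays positive during optimization:

```python
import torch
import torch.nn.functional as F

class SoftInfoNCE(torch.nn.Module):
    """Sketch of Soft InfoNCE: cross-entropy with soft targets over similarities."""

    def __init__(self, init_tau: float = 0.07):
        super().__init__()
        # Learnable temperature, parameterized in log space (init tau = 0.07)
        self.log_tau = torch.nn.Parameter(torch.tensor(init_tau).log())

    def forward(self, z: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
        # z:            [B, D] embeddings (normalized here for cosine similarity)
        # soft_targets: [B, B] physics-derived target distributions, rows sum to 1
        z = F.normalize(z, dim=-1)
        logits = z @ z.t() / self.log_tau.exp()   # [B, B] similarity logits
        log_probs = F.log_softmax(logits, dim=1)
        return -(soft_targets * log_probs).sum(dim=1).mean()
```

Because the targets are soft rather than one-hot, samples from materials with similar mass/friction/restitution can share probability mass instead of being treated as pure negatives.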

Key Results (v2)

LIBERO-10 Robotic Manipulation (200 epochs, action chunking)

Backbone           Success Rate
VC-1               31.4%
MCR                32.9%
MVP                32.9%
R3M                42.0%
Voltron            43.6%
CLIP ViT-B/16      46.9%
Theia              49.0%
SigLIP ViT-B/16    49.2%
DINOv2 ViT-B/14    51.5%
DynaCLIP (Ours)    60.4%

Invisible-Object Physics Prediction (Linear Probing, R²)

Backbone           Mass    Friction  Restitution
CLIP ViT-B/16      0.196   0.065     0.041
SigLIP ViT-B/16    0.207   0.048     0.082
DINOv2 ViT-B/14    0.338   0.220     0.276
DynaCLIP (Ours)    0.553   0.378     0.652
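The linear-probing protocol behind the R² numbers above can be sketched as a closed-form ridge regression on frozen features. Everything below is illustrative: the data is synthetic, and the feature dimension is reduced from 1536 to keep the sketch small:

```python
import torch

torch.manual_seed(0)
d = 32                                    # reduced from 1536 for the sketch
X_train, X_test = torch.randn(500, d), torch.randn(100, d)
w_true = torch.randn(d)                   # synthetic "ground-truth" property map
y_train, y_test = X_train @ w_true, X_test @ w_true

# Closed-form ridge probe: w = (X^T X + alpha * I)^{-1} X^T y
alpha = 1.0
w = torch.linalg.solve(X_train.T @ X_train + alpha * torch.eye(d),
                       X_train.T @ y_train)

# R^2 = 1 - SS_res / SS_tot on the held-out split
pred = X_test @ w
r2 = 1.0 - ((y_test - pred) ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()
```

Because the probe is linear and the backbone stays frozen, the R² scores measure how linearly decodable mass, friction, and restitution are from the representation itself.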

Material Clustering (NMI / ARI)

Backbone           NMI     ARI
CLIP ViT-B/16      0.253   0.076
DINOv2 ViT-B/14    0.381   0.181
DynaCLIP (Ours)    0.424   0.208

CALVIN ABC→D Feature-Based Control

Backbone           Avg Len (↑)
DINOv2 ViT-B/14    1.23
CLIP ViT-B/16      2.18
DynaCLIP (Ours)    2.64

DROID-100 Offline Prediction

Backbone           CosSim (↑)  GripAcc (↑)
DINOv2 ViT-B/14    0.740       81.8%
CLIP ViT-B/16      0.748       80.3%
DynaCLIP (Ours)    0.787       84.7%

ManiSkill3 Generalization

Backbone           Gen Ratio (↑)
VC-1               0.85×
DINOv2 ViT-B/14    0.90×
CLIP ViT-B/16      0.95×
SigLIP ViT-B/16    0.98×
DynaCLIP (Ours)    1.02×

Summary

DynaCLIP ranks first on 13 of 14 headline metrics spanning physics understanding, robotic manipulation, material reasoning, and visual world modeling, demonstrating that physics-grounded contrastive learning produces representations that systematically improve downstream embodied AI tasks.

Citation

@inproceedings{yao2026dynaclip,
  title={DynaCLIP: Physics-Grounded Visual Representations via Dynamics Contrastive Learning},
  author={Yao, Zhengtao},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2026}
}
