DynaCLIP: Physics-Grounded Visual Representations via Dynamics Contrastive Learning
Model Description
DynaCLIP is a self-supervised visual encoder that embeds implicit physical dynamics into visual representations through contrastive pre-training with analytically computed physics priors. Built on DINOv2-ViT-B/14, it fine-tunes the entire backbone with a Soft InfoNCE loss using category-grounded material properties (mass, friction, restitution) derived from 10 physics archetypes.
DynaCLIP achieves #1 ranking on 13 out of 14 headline metrics across 14 experiments, spanning physics understanding, robotic manipulation, material reasoning, and world modeling — without any task-specific fine-tuning of the representation.
Architecture
- Backbone: DINOv2-ViT-B/14 (86M params, fully fine-tuned)
- Feature extraction: CLS token ∥ mean-pooled patches → 1536-dim
- Projection head: Linear(1536,768) → LayerNorm → GELU → Linear(768,512) → L2-norm (discarded after pre-training)
- Total parameters: 88.2M (training) / 86M (inference)
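The feature extraction and projection head described above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction from the spec (CLS ∥ mean-pooled patches → 1536-dim, then Linear → LayerNorm → GELU → Linear → L2-norm), not the released implementation; module and variable names are made up. A ViT-B/14 on 224×224 inputs yields 16×16 = 256 patch tokens of dimension 768.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Projection head per the spec above; discarded after pre-training."""
    def __init__(self, in_dim=1536, hidden=768, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.LayerNorm(hidden),
            nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        # L2-normalize so embeddings live on the unit sphere
        return nn.functional.normalize(self.net(x), dim=-1)

def extract_features(cls_token, patch_tokens):
    """Concatenate the CLS token with mean-pooled patch tokens -> 1536-dim."""
    pooled = patch_tokens.mean(dim=1)               # (B, 768)
    return torch.cat([cls_token, pooled], dim=-1)   # (B, 1536)

# Dummy ViT-B/14 outputs: 768-dim tokens, 256 patches for a 224x224 image
cls_tok = torch.randn(4, 768)
patches = torch.randn(4, 256, 768)
feats = extract_features(cls_tok, patches)          # (4, 1536)
emb = ProjectionHead()(feats)                       # (4, 512), unit-norm rows
```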
Files
| File | Size | Description |
|---|---|---|
| `dynaclip_backbone.pt` | 346 MB | Recommended. Backbone weights only (DINOv2-ViT-B/14 fine-tuned). Use for downstream tasks. |
| `dynaclip_final.pt` | 1.01 GB | Full training checkpoint (model + optimizer + scheduler + loss state). Use to resume training. |
Usage
```python
import torch
from dynaclip.models import DynaCLIPEncoder

encoder = DynaCLIPEncoder(checkpoint_path="dynaclip_backbone.pt")
encoder.eval().cuda()

images = torch.randn(4, 3, 224, 224).cuda()
with torch.no_grad():
    features = encoder(images)
```
Or download the checkpoint from Python:
```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="zhengtaoyao/DynaCLIP", filename="dynaclip_backbone.pt")
```
Training Details
- Data: DomainNet (345 categories) with analytical physics trajectories
- Physics priors: 10 material archetypes (metal, wood, fabric, glass/ceramic, rubber/plastic, food/organic, paper, animal, stone, default)
- Loss: Soft InfoNCE with learnable temperature (init τ=0.07)
- Optimizer: AdamW (backbone lr=1e-5, head lr=1e-3, weight decay=0.05)
- Schedule: Cosine with 500-step warmup, 20K total steps
- Batch size: 1280 effective (DDP across 8 GPUs)
- Precision: bf16
- Hardware: 8× NVIDIA RTX PRO 6000 Blackwell (97 GB each)
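A Soft InfoNCE loss with learnable temperature can be sketched as follows. Note this is a generic sketch, not DynaCLIP's exact objective: the card does not specify how the soft targets are built from the physics priors, so here `targets` is assumed to be a given row-stochastic matrix of positive weights (e.g. derived from similarity of material properties).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftInfoNCE(nn.Module):
    """InfoNCE with soft (non-one-hot) targets and a learnable temperature.

    `targets` is a row-stochastic (B, B) matrix of soft positive weights;
    how DynaCLIP derives it from physics priors is not specified here.
    """
    def __init__(self, init_tau=0.07):
        super().__init__()
        # Parameterize log-temperature so tau stays positive during training
        self.log_tau = nn.Parameter(torch.log(torch.tensor(init_tau)))

    def forward(self, emb_a, emb_b, targets):
        tau = self.log_tau.exp()
        logits = emb_a @ emb_b.t() / tau        # (B, B) scaled cosine logits
        log_probs = F.log_softmax(logits, dim=-1)
        # Cross-entropy against soft targets, averaged over the batch
        return -(targets * log_probs).sum(dim=-1).mean()

# Dummy batch: L2-normalized embeddings and soft labels
B, D = 8, 512
a = F.normalize(torch.randn(B, D), dim=-1)
b = F.normalize(torch.randn(B, D), dim=-1)
t = F.softmax(torch.randn(B, B), dim=-1)        # row-stochastic soft targets
loss = SoftInfoNCE()(a, b, t)
```

With one-hot `targets` this reduces to standard InfoNCE; soft targets let physically similar categories act as partial positives.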
Key Results (v2)
LIBERO-10 Robotic Manipulation (200 epochs, action chunking)
| Backbone | Success Rate |
|---|---|
| VC-1 | 31.4% |
| MCR | 32.9% |
| MVP | 32.9% |
| R3M | 42.0% |
| Voltron | 43.6% |
| CLIP ViT-B/16 | 46.9% |
| Theia | 49.0% |
| SigLIP ViT-B/16 | 49.2% |
| DINOv2 ViT-B/14 | 51.5% |
| DynaCLIP (Ours) | 60.4% |
Invisible-Object Physics Prediction (Linear Probing, R²)
| Backbone | Mass | Friction | Restitution |
|---|---|---|---|
| CLIP ViT-B/16 | 0.196 | 0.065 | 0.041 |
| SigLIP ViT-B/16 | 0.207 | 0.048 | 0.082 |
| DINOv2 ViT-B/14 | 0.338 | 0.220 | 0.276 |
| DynaCLIP (Ours) | 0.553 | 0.378 | 0.652 |
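The linear-probing protocol behind these R² numbers can be illustrated with a NumPy-only sketch: fit a closed-form linear regressor on frozen features and score R² on a held-out split. The data here is synthetic (a target linearly predictable from random features), so the exact split and regularizer used in the paper are assumptions.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))            # stand-in for frozen backbone features
w = rng.normal(size=64)
y = X @ w + 0.1 * rng.normal(size=500)    # synthetic "mass" target

# 80/20 train/test split
X_tr, y_tr = X[:400], y[:400]
X_te, y_te = X[400:], y[400:]

# Closed-form linear probe (least squares) on the frozen features
coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
r2 = r2_score(y_te, X_te @ coef)          # high, since y is linear in X
```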
Material Clustering (NMI / ARI)
| Backbone | NMI | ARI |
|---|---|---|
| CLIP ViT-B/16 | 0.253 | 0.076 |
| DINOv2 ViT-B/14 | 0.381 | 0.181 |
| DynaCLIP (Ours) | 0.424 | 0.208 |
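The NMI/ARI protocol can be reproduced with scikit-learn: cluster the frozen embeddings (k-means is an assumption; the clustering algorithm is not stated here) and compare cluster assignments against material labels. The toy data below uses three well-separated clusters so both scores come out near 1.0.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# Toy embeddings: three well-separated "material" clusters in 2-D
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
labels_true = np.repeat([0, 1, 2], 50)
X = centers[labels_true] + 0.5 * rng.normal(size=(150, 2))

labels_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
nmi = normalized_mutual_info_score(labels_true, labels_pred)  # label-permutation invariant
ari = adjusted_rand_score(labels_true, labels_pred)           # chance-corrected
```

Both metrics are invariant to cluster relabeling, which is why they suit unsupervised material grouping.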
CALVIN ABC→D Feature-Based Control
| Backbone | Avg Len (↑) |
|---|---|
| DINOv2 ViT-B/14 | 1.23 |
| CLIP ViT-B/16 | 2.18 |
| DynaCLIP (Ours) | 2.64 |
DROID-100 Offline Prediction
| Backbone | CosSim (↑) | GripAcc (↑) |
|---|---|---|
| DINOv2 ViT-B/14 | 0.740 | 81.8% |
| CLIP ViT-B/16 | 0.748 | 80.3% |
| DynaCLIP (Ours) | 0.787 | 84.7% |
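Assuming the usual definitions for these two columns (mean per-step cosine similarity between predicted and ground-truth continuous action vectors, and binary gripper open/close accuracy; the card does not define them), the metrics can be computed as:

```python
import numpy as np

def action_cosine_sim(pred, gt):
    """Mean per-step cosine similarity between action vectors."""
    num = (pred * gt).sum(axis=-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(gt, axis=-1)
    return float((num / np.maximum(den, 1e-8)).mean())

def gripper_accuracy(pred_logits, gt_open):
    """Binary open/close accuracy from gripper logits (>0 means open)."""
    return float(((pred_logits > 0) == gt_open).mean())

rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 6))               # ground-truth 6-DoF actions
pred = gt + 0.1 * rng.normal(size=(100, 6))  # slightly noisy predictions
cos = action_cosine_sim(pred, gt)            # close to 1.0 for small noise

gt_open = rng.random(100) > 0.5
logits = np.where(gt_open, 1.0, -1.0)        # logits that match the labels
acc = gripper_accuracy(logits, gt_open)      # 1.0
```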
ManiSkill3 Generalization
| Backbone | Gen Ratio (↑) |
|---|---|
| VC-1 | 0.85× |
| DINOv2 ViT-B/14 | 0.90× |
| CLIP ViT-B/16 | 0.95× |
| SigLIP ViT-B/16 | 0.98× |
| DynaCLIP (Ours) | 1.02× |
Summary
DynaCLIP ranks #1 on 13 of 14 headline metrics spanning physics understanding, robotic manipulation, material reasoning, and visual world modeling, demonstrating that physics-grounded contrastive learning produces representations that systematically improve downstream embodied AI tasks.
Citation
```bibtex
@inproceedings{yao2026dynaclip,
  title     = {DynaCLIP: Physics-Grounded Visual Representations via Dynamics Contrastive Learning},
  author    = {Yao, Zhengtao},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2026}
}
```
Links