Upload README.md with huggingface_hub

README.md (changed)
Converted from [alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera](https://huggingface.co/alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera) (VideoX-Fun format) to HuggingFace diffusers format.

Self-contained repo with all weights and custom model code. The transformer uses a custom `WanCameraControlTransformer3DModel` class (included in this repo) that extends diffusers' `WanTransformer3DModel` with a camera control adapter.

## Quick Start

```python
import importlib.util
import torch
from huggingface_hub import hf_hub_download

REPO = "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers"

# Download the custom model class and import it
spec = importlib.util.spec_from_file_location(
    "modeling_wan_camera",
    hf_hub_download(REPO, "modeling_wan_camera.py"))
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Load transformer with camera adapter
transformer = mod.WanCameraControlTransformer3DModel.from_pretrained(
    REPO, subfolder="transformer", torch_dtype=torch.bfloat16)

# Load other pipeline components
from diffusers import AutoencoderKLWan, UniPCMultistepScheduler
from transformers import CLIPVisionModel, UMT5EncoderModel, AutoTokenizer

vae = AutoencoderKLWan.from_pretrained(REPO, subfolder="vae", torch_dtype=torch.bfloat16)
text_encoder = UMT5EncoderModel.from_pretrained(REPO, subfolder="text_encoder", torch_dtype=torch.bfloat16)
image_encoder = CLIPVisionModel.from_pretrained(REPO, subfolder="image_encoder", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(REPO, subfolder="tokenizer")
scheduler = UniPCMultistepScheduler.from_pretrained(REPO, subfolder="scheduler")
```
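The `importlib` pattern used in the Quick Start can be exercised without downloading anything; a minimal, self-contained sketch of the same mechanism (the stub module written to a temp file here stands in for the downloaded `modeling_wan_camera.py`):

```python
import importlib.util
import os, tempfile

# Write a stand-in module to disk (in real use this is the file path
# returned by hf_hub_download).
path = os.path.join(tempfile.mkdtemp(), "modeling_stub.py")
with open(path, "w") as f:
    f.write("class WanCameraControlTransformer3DModel:\n    name = 'stub'\n")

# Same three steps as the Quick Start: locate, instantiate, execute.
spec = importlib.util.spec_from_file_location("modeling_stub", path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

print(mod.WanCameraControlTransformer3DModel.name)  # stub
```

This sidesteps `sys.path` manipulation entirely: the class becomes available as an attribute of `mod` regardless of where the file lives.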

Or, if you've cloned or downloaded the repo locally:

```python
import sys, torch

# Make the bundled modeling_wan_camera.py importable
sys.path.insert(0, "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers")
from modeling_wan_camera import WanCameraControlTransformer3DModel

transformer = WanCameraControlTransformer3DModel.from_pretrained(
    "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers/transformer",
    torch_dtype=torch.bfloat16)
```

## Model Details

- **Architecture**: WanTransformer3DModel + CameraControlAdapter
- **Parameters**: 1.616B total (1.564B base + 51.9M camera adapter)
- **Precision**: bfloat16
- **in_channels**: 32 (16 noise + 16 image latents; camera enters via adapter, not channel concat)
- **Camera conditioning**: 24-channel Plucker ray embeddings (6ch x 4 temporal packing) at pixel resolution

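The figures in the list above are internally consistent; a quick arithmetic sanity check (numbers taken directly from the bullets):

```python
base_params = 1.564e9       # base WanTransformer3DModel
adapter_params = 51.9e6     # camera control adapter
total = base_params + adapter_params
print(round(total / 1e9, 3))  # ~1.616 billion, matching the stated total

# 24 camera channels = 6 Plucker channels x 4 temporally packed frames
assert 6 * 4 == 24
```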
## Camera Control Architecture

Unlike the regular Control model (which concatenates control signals as extra input channels), this variant feeds the camera signal through a dedicated adapter. The adapter output is **added** to the patch-embedded latents before the transformer blocks.

## Transformer Forward Pass

The transformer accepts `control_camera_video` as a `[B, 24, F, H_pixel, W_pixel]` tensor of temporally-packed Plucker ray embeddings:
```python
output = transformer(
    hidden_states=latents,                 # [B, 32, F, H, W] noise + image latents
    timestep=timestep,                     # [B] diffusion timestep
    encoder_hidden_states=text_emb,        # [B, 512, 4096] text embeddings
    encoder_hidden_states_image=clip_emb,  # [B, 257, 1280] CLIP image tokens
    control_camera_video=camera_emb,       # [B, 24, F, H*8, W*8] Plucker rays at pixel res
    return_dict=False,
)[0]
```

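To keep the tensor shapes straight: the camera tensor lives at pixel resolution, 8x the latent spatial size, with the same batch and frame counts. A small hypothetical helper (`camera_shape` is not part of this repo) that derives the expected shape:

```python
def camera_shape(latent_shape, spatial_scale=8, camera_channels=24):
    """Expected control_camera_video shape for a given latent shape.

    latent_shape: (B, C, F, H, W) of the packed noise+image latents.
    The camera tensor keeps B and F, swaps C for 24 Plucker channels,
    and scales H, W up by the VAE's 8x spatial factor.
    """
    b, _, f, h, w = latent_shape
    return (b, camera_channels, f, h * spatial_scale, w * spatial_scale)

print(camera_shape((1, 32, 21, 60, 104)))  # (1, 24, 21, 480, 832)
```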
Camera trajectories (pan, zoom, rotate) are converted to Plucker embeddings using VideoX-Fun's `process_pose_file()` or `ray_condition()` utilities from camera extrinsic matrices.

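For intuition, the 6 Plucker channels per pixel are a ray direction plus its moment about the world origin; stacking 4 consecutive frames' 6-channel maps along the channel axis gives the 24-channel input. A standalone NumPy sketch (this is an illustration, not VideoX-Fun's actual `ray_condition`; the intrinsics and channel ordering here are assumptions):

```python
import numpy as np

def plucker_rays(K_inv, c2w, h, w):
    """Per-pixel Plucker embedding [6, h, w] for one camera.

    K_inv: 3x3 inverse intrinsics; c2w: 4x4 camera-to-world extrinsics.
    Each pixel gets (moment, direction) = (o x d, d), 6 channels total.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs)], axis=0).reshape(3, -1)
    dirs = c2w[:3, :3] @ (K_inv @ pix)          # pixel rays rotated into world space
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    origin = c2w[:3, 3:4]                       # camera center in world coords
    moment = np.cross(origin.T, dirs.T).T       # o x d, per pixel
    return np.concatenate([moment, dirs], axis=0).reshape(6, h, w)

emb = plucker_rays(np.eye(3), np.eye(4), 16, 16)
print(emb.shape)  # (6, 16, 16)
```

With the camera at the world origin (identity extrinsics), the moment channels are identically zero, which is a handy sanity check.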
## Conversion Verification

Forward-pass comparison against the original VideoX-Fun model in fp32:

- Max absolute diff: **1.67e-6** (attention backend numerical noise)
- allclose(atol=1e-2, rtol=1e-2): **True**
- Parameter count: identical (1,616,313,152)

Verified both from local weights and from this HuggingFace repo.

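The comparison follows the standard pattern; a minimal sketch of the check itself on stand-in arrays (the real verification runs both transformers on identical inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 32, 4, 8, 8)).astype(np.float32)  # original model output
# Converted model: same output up to tiny attention-backend noise
test = ref + 1e-6 * rng.standard_normal(ref.shape).astype(np.float32)

max_abs_diff = np.abs(ref - test).max()
ok = np.allclose(ref, test, atol=1e-2, rtol=1e-2)
print(max_abs_diff < 1e-4, ok)  # True True
```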
+
## Repo Contents

| File / Directory | Description | Size |
|---|---|---|
| `modeling_wan_camera.py` | Custom model class (also in `transformer/`) | 6 KB |
| `transformer/` | Converted transformer weights + config | 3.0 GB |
| `text_encoder/` | UMT5-XXL text encoder | 21 GB |
| `image_encoder/` | CLIP ViT-H image encoder | 1.2 GB |
| `vae/` | Wan2.1 VAE | 485 MB |
| `tokenizer/` | UMT5 tokenizer | 21 MB |
| `scheduler/` | UniPCMultistepScheduler config | 1 KB |
| `image_processor/` | CLIPImageProcessor config | 1 KB |
| `model_index.json` | Pipeline component index | 1 KB |

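Note the download footprint: the text encoder dominates. A quick tally of the sizes listed in the table (KB-scale configs omitted):

```python
sizes_gb = {
    "transformer": 3.0,
    "text_encoder": 21.0,
    "image_encoder": 1.2,
    "vae": 0.485,
    "tokenizer": 0.021,
}
total = sum(sizes_gb.values())
print(round(total, 2))  # ~25.71 GB for the full repo
```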
## License