Upload README.md with huggingface_hub

README.md (changed)
Converted from [alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera](https://huggingface.co/alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera) (VideoX-Fun format) to HuggingFace diffusers format.

Self-contained repo with all weights and custom model code. The transformer uses a custom `WanCameraControlTransformer3DModel` class (included in this repo) that extends diffusers' `WanTransformer3DModel` with a camera control adapter.

## Quick Start

```python
import importlib.util
import torch
from huggingface_hub import hf_hub_download

REPO = "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers"

# Download the custom model class and import it
spec = importlib.util.spec_from_file_location(
    "modeling_wan_camera",
    hf_hub_download(REPO, "modeling_wan_camera.py"))
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Load transformer with camera adapter
transformer = mod.WanCameraControlTransformer3DModel.from_pretrained(
    REPO, subfolder="transformer", torch_dtype=torch.bfloat16)

# Load other pipeline components
from diffusers import AutoencoderKLWan, UniPCMultistepScheduler
from transformers import CLIPVisionModel, UMT5EncoderModel, AutoTokenizer

vae = AutoencoderKLWan.from_pretrained(REPO, subfolder="vae", torch_dtype=torch.bfloat16)
text_encoder = UMT5EncoderModel.from_pretrained(REPO, subfolder="text_encoder", torch_dtype=torch.bfloat16)
image_encoder = CLIPVisionModel.from_pretrained(REPO, subfolder="image_encoder", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(REPO, subfolder="tokenizer")
scheduler = UniPCMultistepScheduler.from_pretrained(REPO, subfolder="scheduler")
```
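The `importlib` pattern used in the Quick Start can be exercised without downloading anything; a minimal, self-contained sketch of the same mechanism (the stub module written to a temp file here stands in for the downloaded `modeling_wan_camera.py`):

```python
import importlib.util
import os, tempfile

# Write a stand-in module to disk (in real use this is the file path
# returned by hf_hub_download).
path = os.path.join(tempfile.mkdtemp(), "modeling_stub.py")
with open(path, "w") as f:
    f.write("class WanCameraControlTransformer3DModel:\n    name = 'stub'\n")

# Same three steps as the Quick Start: locate, instantiate, execute.
spec = importlib.util.spec_from_file_location("modeling_stub", path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

print(mod.WanCameraControlTransformer3DModel.name)  # stub
```

This sidesteps `sys.path` manipulation entirely: the class becomes available as an attribute of `mod` regardless of where the file lives.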

Or, if you've cloned or downloaded the repo locally:

```python
import sys, torch

# Make the bundled modeling_wan_camera.py importable
sys.path.insert(0, "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers")
from modeling_wan_camera import WanCameraControlTransformer3DModel

transformer = WanCameraControlTransformer3DModel.from_pretrained(
    "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers/transformer",
    torch_dtype=torch.bfloat16)
```

## Model Details

- **Architecture**: WanTransformer3DModel + CameraControlAdapter
- **Parameters**: 1.616B total (1.564B base + 51.9M camera adapter)
- **Precision**: bfloat16
- **in_channels**: 32 (16 noise + 16 image latents; camera enters via adapter, not channel concat)
- **Camera conditioning**: 24-channel Plucker ray embeddings (6ch x 4 temporal packing) at pixel resolution

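The figures in the list above are internally consistent; a quick arithmetic sanity check (numbers taken directly from the bullets):

```python
base_params = 1.564e9       # base WanTransformer3DModel
adapter_params = 51.9e6     # camera control adapter
total = base_params + adapter_params
print(round(total / 1e9, 3))  # ~1.616 billion, matching the stated total

# 24 camera channels = 6 Plucker channels x 4 temporally packed frames
assert 6 * 4 == 24
```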
## Camera Control Architecture

Unlike the regular Control model (which concatenates control signals as extra input channels), this variant feeds the camera signal through a dedicated adapter. The adapter output is **added** to the patch-embedded latents before the transformer blocks.

## Transformer Forward Pass

The transformer accepts `control_camera_video` as a `[B, 24, F, H_pixel, W_pixel]` tensor of temporally-packed Plucker ray embeddings:
```python
output = transformer(
    hidden_states=latents,                 # [B, 32, F, H, W] noise + image latents
    timestep=timestep,                     # [B] diffusion timestep
    encoder_hidden_states=text_emb,        # [B, 512, 4096] text embeddings
    encoder_hidden_states_image=clip_emb,  # [B, 257, 1280] CLIP image tokens
    control_camera_video=camera_emb,       # [B, 24, F, H*8, W*8] Plucker rays at pixel res
    return_dict=False,
)[0]
```

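To keep the tensor shapes straight: the camera tensor lives at pixel resolution, 8x the latent spatial size, with the same batch and frame counts. A small hypothetical helper (`camera_shape` is not part of this repo) that derives the expected shape:

```python
def camera_shape(latent_shape, spatial_scale=8, camera_channels=24):
    """Expected control_camera_video shape for a given latent shape.

    latent_shape: (B, C, F, H, W) of the packed noise+image latents.
    The camera tensor keeps B and F, swaps C for 24 Plucker channels,
    and scales H, W up by the VAE's 8x spatial factor.
    """
    b, _, f, h, w = latent_shape
    return (b, camera_channels, f, h * spatial_scale, w * spatial_scale)

print(camera_shape((1, 32, 21, 60, 104)))  # (1, 24, 21, 480, 832)
```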
Camera trajectories (pan, zoom, rotate) are converted to Plucker embeddings using VideoX-Fun's `process_pose_file()` or `ray_condition()` utilities from camera extrinsic matrices.

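For intuition, the 6 Plucker channels per pixel are a ray direction plus its moment about the world origin; stacking 4 consecutive frames' 6-channel maps along the channel axis gives the 24-channel input. A standalone NumPy sketch (this is an illustration, not VideoX-Fun's actual `ray_condition`; the intrinsics and channel ordering here are assumptions):

```python
import numpy as np

def plucker_rays(K_inv, c2w, h, w):
    """Per-pixel Plucker embedding [6, h, w] for one camera.

    K_inv: 3x3 inverse intrinsics; c2w: 4x4 camera-to-world extrinsics.
    Each pixel gets (moment, direction) = (o x d, d), 6 channels total.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs)], axis=0).reshape(3, -1)
    dirs = c2w[:3, :3] @ (K_inv @ pix)          # pixel rays rotated into world space
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    origin = c2w[:3, 3:4]                       # camera center in world coords
    moment = np.cross(origin.T, dirs.T).T       # o x d, per pixel
    return np.concatenate([moment, dirs], axis=0).reshape(6, h, w)

emb = plucker_rays(np.eye(3), np.eye(4), 16, 16)
print(emb.shape)  # (6, 16, 16)
```

With the camera at the world origin (identity extrinsics), the moment channels are identically zero, which is a handy sanity check.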
## Conversion Verification

Forward-pass comparison against the original VideoX-Fun model in fp32:

- Max absolute diff: **1.67e-6** (attention backend numerical noise)
- allclose(atol=1e-2, rtol=1e-2): **True**
- Parameter count: identical (1,616,313,152)

Verified both from local weights and from this HuggingFace repo.

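The comparison follows the standard pattern; a minimal sketch of the check itself on stand-in arrays (the real verification runs both transformers on identical inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 32, 4, 8, 8)).astype(np.float32)  # original model output
# Converted model: same output up to tiny attention-backend noise
test = ref + 1e-6 * rng.standard_normal(ref.shape).astype(np.float32)

max_abs_diff = np.abs(ref - test).max()
ok = np.allclose(ref, test, atol=1e-2, rtol=1e-2)
print(max_abs_diff < 1e-4, ok)  # True True
```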
+
## Repo Contents

| File / Directory | Description | Size |
|---|---|---|
| `modeling_wan_camera.py` | Custom model class (also in `transformer/`) | 6 KB |
| `transformer/` | Converted transformer weights + config | 3.0 GB |
| `text_encoder/` | UMT5-XXL text encoder | 21 GB |
| `image_encoder/` | CLIP ViT-H image encoder | 1.2 GB |
| `vae/` | Wan2.1 VAE | 485 MB |
| `tokenizer/` | UMT5 tokenizer | 21 MB |
| `scheduler/` | UniPCMultistepScheduler config | 1 KB |
| `image_processor/` | CLIPImageProcessor config | 1 KB |
| `model_index.json` | Pipeline component index | 1 KB |

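Note the download footprint: the text encoder dominates. A quick tally of the sizes listed in the table (KB-scale configs omitted):

```python
sizes_gb = {
    "transformer": 3.0,
    "text_encoder": 21.0,
    "image_encoder": 1.2,
    "vae": 0.485,
    "tokenizer": 0.021,
}
total = sum(sizes_gb.values())
print(round(total, 2))  # ~25.71 GB for the full repo
```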
## License