the-sweater-cat committed on
Commit e91969e · verified · 1 Parent(s): 795c1a4

Upload README.md with huggingface_hub

Files changed (1): README.md +72 -59
README.md CHANGED
@@ -14,12 +14,56 @@ base_model: alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera
 
  Converted from [alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera](https://huggingface.co/alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera) (VideoX-Fun format) to HuggingFace diffusers format.
 
  ## Model Details
 
- - **Architecture**: WanTransformer3DModel + CameraControlAdapter (SimpleAdapter)
  - **Parameters**: 1.616B total (1.564B base + 51.9M camera adapter)
  - **Precision**: bfloat16
- - **in_channels**: 32 (16 noise + 16 image latents; camera control enters via separate adapter, not channel concat)
  - **Camera conditioning**: 24-channel Plucker ray embeddings (6ch x 4 temporal packing) at pixel resolution
 
  ## Camera Control Architecture
@@ -32,74 +76,43 @@ Unlike the regular Control model (which concatenates control signals as extra in
 
  The adapter output is **added** to patch-embedded latents before the transformer blocks.
 
- ## Conversion Verification
-
- Forward-pass comparison against the original VideoX-Fun model in fp32:
- - Max absolute diff: **1.67e-6** (attention backend numerical noise)
- - allclose(atol=1e-2, rtol=1e-2): **True**
- - Parameter count: identical (1,616,313,152)
-
- ## Usage
-
- ```python
- import sys, torch
- sys.path.insert(0, "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers/transformer")
- from modeling_wan_camera import WanCameraControlTransformer3DModel
-
- # Load transformer
- transformer = WanCameraControlTransformer3DModel.from_pretrained(
-     "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers",
-     subfolder="transformer",
-     torch_dtype=torch.bfloat16,
- )
-
- # Load other components
- from diffusers import AutoencoderKLWan, UniPCMultistepScheduler
- from transformers import CLIPVisionModel, CLIPImageProcessor, UMT5EncoderModel, AutoTokenizer
-
- vae = AutoencoderKLWan.from_pretrained(
-     "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers",
-     subfolder="vae", torch_dtype=torch.bfloat16)
- text_encoder = UMT5EncoderModel.from_pretrained(
-     "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers",
-     subfolder="text_encoder", torch_dtype=torch.bfloat16)
- image_encoder = CLIPVisionModel.from_pretrained(
-     "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers",
-     subfolder="image_encoder", torch_dtype=torch.bfloat16)
- tokenizer = AutoTokenizer.from_pretrained(
-     "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers",
-     subfolder="tokenizer")
- scheduler = UniPCMultistepScheduler.from_pretrained(
-     "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers",
-     subfolder="scheduler")
- ```
-
- ## Camera Conditioning Input
-
- The transformer accepts `control_camera_video` as a `[B, 24, F, H_pixel, W_pixel]` tensor of temporally-packed Plucker ray embeddings:
 
  ```python
  output = transformer(
      hidden_states=latents,                 # [B, 32, F, H, W] noise + image latents
-     timestep=timestep,
-     encoder_hidden_states=text_emb,        # [B, 512, 4096]
-     encoder_hidden_states_image=clip_emb,  # [B, 257, 1280]
-     control_camera_video=camera_emb,       # [B, 24, F, H*8, W*8] Plucker rays
      return_dict=False,
  )[0]
  ```
 
- Camera trajectories (pan, zoom, rotate) are converted to Plucker embeddings using the VideoX-Fun `process_pose_file()` utility from camera extrinsic matrices.
 
- ## Components
-
- | Component | Source | Size |
  |---|---|---|
- | transformer | Converted from VideoX-Fun | 3.0 GB |
- | text_encoder | Wan-AI/Wan2.1-T2V-1.3B-Diffusers | 21 GB |
- | image_encoder | Wan-AI/Wan2.1-I2V-14B-480P-Diffusers | 1.2 GB |
- | vae | Wan-AI/Wan2.1-T2V-1.3B-Diffusers | 485 MB |
- | tokenizer | Wan-AI/Wan2.1-T2V-1.3B-Diffusers | 21 MB |
 
  ## License
 
 
  Converted from [alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera](https://huggingface.co/alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control-Camera) (VideoX-Fun format) to HuggingFace diffusers format.
 
+ This is a self-contained repo with all weights and custom model code. The transformer uses a custom `WanCameraControlTransformer3DModel` class (included in this repo) that extends diffusers' `WanTransformer3DModel` with a camera control adapter.
+
+ ## Quick Start
+
+ ```python
+ import torch
+ from huggingface_hub import hf_hub_download
+
+ REPO = "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers"
+
+ # Download the custom model class and import it
+ import importlib.util, sys
+ spec = importlib.util.spec_from_file_location(
+     "modeling_wan_camera",
+     hf_hub_download(REPO, "modeling_wan_camera.py"))
+ mod = importlib.util.module_from_spec(spec)
+ sys.modules[spec.name] = mod
+ spec.loader.exec_module(mod)
+
+ # Load transformer with camera adapter
+ transformer = mod.WanCameraControlTransformer3DModel.from_pretrained(
+     REPO, subfolder="transformer", torch_dtype=torch.bfloat16)
+
+ # Load other pipeline components
+ from diffusers import AutoencoderKLWan
+ from transformers import CLIPVisionModel, UMT5EncoderModel, AutoTokenizer
+
+ vae = AutoencoderKLWan.from_pretrained(REPO, subfolder="vae", torch_dtype=torch.bfloat16)
+ text_encoder = UMT5EncoderModel.from_pretrained(REPO, subfolder="text_encoder", torch_dtype=torch.bfloat16)
+ image_encoder = CLIPVisionModel.from_pretrained(REPO, subfolder="image_encoder", torch_dtype=torch.bfloat16)
+ tokenizer = AutoTokenizer.from_pretrained(REPO, subfolder="tokenizer")
+ ```
+
+ Or if you've cloned/downloaded the repo locally:
+
+ ```python
+ import sys, torch
+ sys.path.insert(0, "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers")
+ from modeling_wan_camera import WanCameraControlTransformer3DModel
+
+ transformer = WanCameraControlTransformer3DModel.from_pretrained(
+     "Wan2.1-Fun-V1.1-1.3B-Control-Camera-Diffusers/transformer",
+     torch_dtype=torch.bfloat16)
+ ```
+
  ## Model Details
 
+ - **Architecture**: WanTransformer3DModel + CameraControlAdapter
  - **Parameters**: 1.616B total (1.564B base + 51.9M camera adapter)
  - **Precision**: bfloat16
+ - **in_channels**: 32 (16 noise + 16 image latents; camera enters via adapter, not channel concat)
  - **Camera conditioning**: 24-channel Plucker ray embeddings (6ch x 4 temporal packing) at pixel resolution
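As a sketch of the 6ch x 4 temporal packing above (shapes and grouping order here are illustrative assumptions, not necessarily the exact VideoX-Fun implementation), per-frame 6-channel Plucker rays can be stacked channel-wise in groups of 4 consecutive frames:

```python
import torch

# Hypothetical sketch: 6-channel Plucker rays per pixel frame, packed along the
# channel dim in groups of 4 consecutive frames -> 24 channels per packed frame.
B, F, H, W = 1, 16, 64, 64
plucker = torch.randn(B, F, 6, H, W)              # per-frame 6-channel ray embeddings

# Stack the channels of every 4 consecutive frames: F frames -> F // 4 packed frames
packed = plucker.reshape(B, F // 4, 4 * 6, H, W)  # [B, F/4, 24, H, W]
camera_emb = packed.permute(0, 2, 1, 3, 4)        # [B, 24, F/4, H, W]
print(tuple(camera_emb.shape))  # (1, 24, 4, 64, 64)
```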
 
  ## Camera Control Architecture
 
 
  The adapter output is **added** to patch-embedded latents before the transformer blocks.
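A toy sketch of that additive injection (dimensions and layer names are placeholders, not the repo's actual modules, which use a 3D patch embedding and the bundled CameraControlAdapter):

```python
import torch
import torch.nn as nn

# Placeholder modules: a linear layer stands in for the patch embedding and
# another for the camera adapter; both map into the transformer's token space.
dim = 64
patch_embed = nn.Linear(32, dim)      # stand-in for patch-embedding the latents
camera_adapter = nn.Linear(24, dim)   # stand-in for the camera control adapter

latents = torch.randn(1, 128, 32)     # [B, N tokens, latent channels]
camera = torch.randn(1, 128, 24)      # [B, N tokens, packed Plucker channels]

# The adapter output is summed with the patch-embedded latents, not concatenated
hidden = patch_embed(latents) + camera_adapter(camera)
print(tuple(hidden.shape))  # (1, 128, 64)
```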
 
+ ## Transformer Forward Pass
 
  ```python
  output = transformer(
      hidden_states=latents,                 # [B, 32, F, H, W] noise + image latents
+     timestep=timestep,                     # [B] diffusion timestep
+     encoder_hidden_states=text_emb,        # [B, 512, 4096] text embeddings
+     encoder_hidden_states_image=clip_emb,  # [B, 257, 1280] CLIP image tokens
+     control_camera_video=camera_emb,       # [B, 24, F, H*8, W*8] Plucker rays at pixel res
      return_dict=False,
  )[0]
  ```
 
+ Camera trajectories (pan, zoom, rotate) are converted to Plucker embeddings using VideoX-Fun's `process_pose_file()` or `ray_condition()` utilities from camera extrinsic matrices.
+
+ ## Conversion Verification
+
+ Forward-pass comparison against the original VideoX-Fun model in fp32:
+ - Max absolute diff: **1.67e-6** (attention backend numerical noise)
+ - allclose(atol=1e-2, rtol=1e-2): **True**
+ - Parameter count: identical (1,616,313,152)
+
+ Verified both from local weights and from this HuggingFace repo.
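The check above amounts to the following kind of comparison (a generic sketch with placeholder tensors standing in for the two models' fp32 outputs on identical inputs):

```python
import torch

# Placeholder outputs: the real check compared full forward passes of the
# original VideoX-Fun model and the converted diffusers model.
def compare_outputs(a: torch.Tensor, b: torch.Tensor):
    max_abs_diff = (a - b).abs().max().item()     # worst-case elementwise error
    close = torch.allclose(a, b, atol=1e-2, rtol=1e-2)
    return max_abs_diff, close

out_ref = torch.randn(1, 16, 4, 8, 8)                   # "original" output
out_conv = out_ref + 1e-6 * torch.randn_like(out_ref)   # "converted" output + tiny noise
diff, ok = compare_outputs(out_ref, out_conv)
print(ok)  # True
```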
 
+ ## Repo Contents
+
+ | File / Directory | Description | Size |
  |---|---|---|
+ | `modeling_wan_camera.py` | Custom model class (also in `transformer/`) | 6 KB |
+ | `transformer/` | Converted transformer weights + config | 3.0 GB |
+ | `text_encoder/` | UMT5-XXL text encoder | 21 GB |
+ | `image_encoder/` | CLIP ViT-H image encoder | 1.2 GB |
+ | `vae/` | Wan2.1 VAE | 485 MB |
+ | `tokenizer/` | UMT5 tokenizer | 21 MB |
+ | `scheduler/` | UniPCMultistepScheduler config | 1 KB |
+ | `image_processor/` | CLIPImageProcessor config | 1 KB |
+ | `model_index.json` | Pipeline component index | 1 KB |
 
  ## License