Reddit Claim - training bug in Qwen3.5 35B A3B
#70
by jpsequeira - opened
I'm pasting the model-weights problem claimed in this Reddit thread.
What I found:
Two tensors, in blocks 36 and 37: ssm_conv1d.weight.
Their scale was ~60% higher than normal (σ=0.102 vs. a median of 0.063). Because of how AdamW works, rarely-activated experts in the last layers get a huge effective learning rate, so their weights drift.
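The scale check I ran amounts to a simple outlier scan over per-tensor standard deviations. Here's a minimal sketch of that logic with synthetic values; the tensor naming scheme and the 1.5× threshold are my assumptions, not taken from the actual checkpoint:

```python
import numpy as np

def flag_scale_outliers(stds, threshold=1.5):
    """Flag tensors whose std exceeds the median std by more than `threshold`x."""
    median = float(np.median(list(stds.values())))
    return {name: s for name, s in stds.items() if s > threshold * median}

# Synthetic per-tensor stds: most blocks near the 0.063 median,
# blocks 36 and 37 near the anomalous 0.102 (~60% above median).
stds = {f"blk.{i}.ssm_conv1d.weight": 0.063 for i in range(36)}
stds["blk.36.ssm_conv1d.weight"] = 0.102
stds["blk.37.ssm_conv1d.weight"] = 0.102

outliers = flag_scale_outliers(stds)
print(sorted(outliers))  # only blocks 36 and 37 are flagged
```

On a real checkpoint you'd compute each tensor's std after loading it (e.g. from safetensors) instead of hardcoding values, but the flagging step is the same.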
In a recurrent architecture like DeltaNet, this corrupts the hidden state: the model forgets context after a few tokens.
Surprisingly, I didn't find any issues in Gemma 4 26B A4B - all scales in that model were normal.