Reddit Claim - training bug in Qwen3.5 35B A3B

#70
by jpsequeira - opened

Any truth to this post?

Below I paste the model-weight problem claimed in this Reddit thread.

What I found:

Two tensors are affected: ssm_conv1d.weight in blocks 36 and 37.

Their scale was ~60% higher than normal (σ=0.102 vs. a median of 0.063). Because of how AdamW works, rarely activated experts in the last layers get a very large effective learning rate, so their weights drift.
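For anyone who wants to check the scale claim themselves, here is a minimal sketch of the comparison, assuming you have already computed one standard deviation per tensor. The function name `flag_scale_outliers`, the 1.5x threshold, and the `blk.N.ssm_conv1d.weight` key layout are my own assumptions for illustration; only the values 0.102 and 0.063 come from the Reddit post, and the remaining per-block values are placeholders.

```python
from statistics import median

def flag_scale_outliers(scales, ratio_threshold=1.5):
    """Return {name: ratio} for tensors whose scale exceeds
    ratio_threshold times the median scale of all tensors given."""
    med = median(scales.values())
    return {
        name: s / med
        for name, s in scales.items()
        if s / med > ratio_threshold
    }

# Placeholder data: 36 blocks at the reported median scale (0.063),
# plus the two blocks the post says are off (0.102).
# The tensor names are hypothetical, not read from a real checkpoint.
scales = {f"blk.{i}.ssm_conv1d.weight": 0.063 for i in range(36)}
scales["blk.36.ssm_conv1d.weight"] = 0.102
scales["blk.37.ssm_conv1d.weight"] = 0.102

outliers = flag_scale_outliers(scales)
for name, ratio in sorted(outliers.items()):
    print(f"{name}: {ratio:.2f}x median")
```

With the numbers from the post, the ratio is 0.102 / 0.063, which is about 1.62, consistent with the "~60% higher" figure. This only tests whether the scales look unusual; it says nothing about whether the AdamW explanation is the actual cause.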

In a recurrent architecture like DeltaNet, this corrupts the hidden state, so the model forgets context after a few tokens.

Surprisingly, I didn't find any issues in Gemma 4 26B A4B: all scales in that model were correct.
