Reddit Claim - training bug in Qwen3.5 35B A3B
#70
by jpsequeira - opened
I'm pasting the model-weights problem claimed in this Reddit thread.
What I found:
Two tensors, in blocks 36 and 37: ssm_conv1d.weight.
Their scale was ~60% higher than normal (σ=0.102 vs. a median of 0.063). Because of how AdamW works, rarely-activated experts in the last layers get a huge effective learning rate, so their weights drift.
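The scale check I ran amounts to a simple outlier scan over per-tensor standard deviations. Here's a minimal sketch of that logic with synthetic values; the tensor naming scheme and the 1.5× threshold are my assumptions, not taken from the actual checkpoint:

```python
import numpy as np

def flag_scale_outliers(stds, threshold=1.5):
    """Flag tensors whose std exceeds the median std by more than `threshold`x."""
    median = float(np.median(list(stds.values())))
    return {name: s for name, s in stds.items() if s > threshold * median}

# Synthetic per-tensor stds: most blocks near the 0.063 median,
# blocks 36 and 37 near the anomalous 0.102 (~60% above median).
stds = {f"blk.{i}.ssm_conv1d.weight": 0.063 for i in range(36)}
stds["blk.36.ssm_conv1d.weight"] = 0.102
stds["blk.37.ssm_conv1d.weight"] = 0.102

outliers = flag_scale_outliers(stds)
print(sorted(outliers))  # only blocks 36 and 37 are flagged
```

On a real checkpoint you'd compute each tensor's std after loading it (e.g. from safetensors) instead of hardcoding values, but the flagging step is the same.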
In a recurrent architecture like DeltaNet, this corrupts the hidden state: the model forgets context after a few tokens.
Surprisingly, I didn't find any issues in Gemma 4 26B A4B - all scales in that model were normal.