Toddler-LLM-GRPO (finetuned from SmolLM2 with GRPO)

Overview

  • Model name: Toddler-LLM-GRPO
  • Type: Decoder-only small LM
  • Base model: SmolLM2 (350M parameters used in primary evaluations)
  • Status: GRPO-finetuned to produce coherent, short, toddler-like responses
  • Primary language: English
  • Target behavior: Coherent, short, child-like responses (approx. 2–3 years old)
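A minimal inference sketch, assuming the `transformers` library and the checkpoint ID that appears in the examples below (`enochlev/llm-grpo-toddler-small-11`). The single-sentence post-processing helper is an assumption for illustration, not part of the released pipeline:

```python
import re

MODEL_ID = "enochlev/llm-grpo-toddler-small-11"  # hypothetical hub ID from the examples below

def first_sentence(text: str) -> str:
    """Keep only the first sentence of a generation; the length reward (RM-3)
    favors single short sentences, so trailing continuations are usually noise."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return parts[0] if parts else ""

if __name__ == "__main__":
    # Heavy imports kept inside the guard so the helper above stays importable.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")

    prompt = "What did you have for dessert for lunch?"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=32, do_sample=True, top_p=0.9)
    reply = tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    print(first_sentence(reply))
```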

Training data

  • Parent–child utterances from CHILDES
  • Caregiver prompts filtered by RM-4 (top 10% clarity)
  • Child utterance coherence scored by RM-2
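The RM-4 prompt filter can be sketched as a simple top-decile cut over clarity scores. The scoring model itself is not shown; `filter_top_fraction` is a hypothetical helper that assumes one clarity score per prompt has already been computed:

```python
def filter_top_fraction(prompts, scores, fraction=0.10):
    """Keep the highest-scoring `fraction` of prompts (at least one).

    `scores` are assumed to be clarity scores from RM-4, one per prompt.
    """
    keep = max(1, int(len(prompts) * fraction))
    # Rank prompts by score, highest first, and take the top slice.
    ranked = sorted(zip(scores, prompts), reverse=True)
    return [p for _, p in ranked[:keep]]
```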

Finetuning procedure (GRPO)

  • GRPO parameters (Step 1):
    • num_of_generations: 8
    • batch_size: 200
    • warmup_ratio: 0.1
    • max_prompt_length: 96
    • max_completion_length: 96
    • dtype: bfloat16
    • steps: 2000 (checkpoints every 250)
  • LoRA (PEFT):
    • rank: 64
    • lora_alpha: 64
    • target_modules: [q,k,v,o,gate,up,down]
  • Reward weights:
    • RM-1 (childish): 1.0
    • RM-2 (coherence): 0.2
    • RM-3 (length PMF): 0.5
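The hyperparameters above can be expressed as a configuration sketch using `trl`'s `GRPOConfig` and `peft`'s `LoraConfig`; argument names follow current library versions and may differ slightly from the original training script:

```python
from peft import LoraConfig
from trl import GRPOConfig

# GRPO training arguments mirroring the Step 1 parameters listed above.
training_args = GRPOConfig(
    num_generations=8,
    per_device_train_batch_size=200,
    warmup_ratio=0.1,
    max_prompt_length=96,
    max_completion_length=96,
    bf16=True,
    max_steps=2000,
    save_steps=250,  # checkpoints every 250 steps
)

# LoRA adapter settings; the shorthand [q,k,v,o,gate,up,down] above is
# expanded to the usual projection-module names.
peft_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```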

Reward models and filters

  • RM-1 (Toddler-BERT): encourages child-like style; available at enochlev/childish_behavior_model
    • Note: used alone, it can drive generations toward incoherent/nonsensical local maxima
  • RM-2 (Coherence-BERT): ensures coherence; trained with soft labels; available at enochlev/child_coherence_model
  • RM-3 (Length PMF): favors single, length-appropriate sentences using CHILDES-based PMF and punctuation penalty
  • RM-4 (Caregiver clarity): used solely to filter training prompts (top 10%)
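A sketch of how the three training-time reward signals combine under the weights listed above. The individual scorers (Toddler-BERT, Coherence-BERT, and the length PMF) are represented only by their scores here; the weighted-sum form is an assumption consistent with the stated weights:

```python
# Reward weights from the "Reward weights" list above.
WEIGHTS = {"rm1_childish": 1.0, "rm2_coherence": 0.2, "rm3_length": 0.5}

def combined_reward(rm1_score, rm2_score, rm3_score):
    """Weighted sum of per-completion reward scores, each assumed in [0, 1]."""
    return (WEIGHTS["rm1_childish"] * rm1_score
            + WEIGHTS["rm2_coherence"] * rm2_score
            + WEIGHTS["rm3_length"] * rm3_score)
```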

Observations during training

  • RM-1 and RM-2 must be balanced; over-emphasizing RM-1 harms coherence; over-emphasizing RM-2 can reduce “child-like” behavior
  • RM-3 converges quickly (near-100%); primarily stabilizes output length
  • Minor data augmentation was added to discourage spurious “baby words”
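An RM-3-style length reward can be sketched as the probability of a completion's word count under a length PMF, normalized to [0, 1], minus a penalty for extra sentence-ending punctuation (favoring a single sentence). The PMF values and penalty weight below are illustrative placeholders, not the actual CHILDES statistics:

```python
# Illustrative word-count PMF; the real reward uses CHILDES-derived statistics.
LENGTH_PMF = {1: 0.10, 2: 0.20, 3: 0.30, 4: 0.20, 5: 0.10, 6: 0.05}

def length_reward(text, punct_penalty=0.5):
    """Score by normalized PMF of the word count, penalizing multi-sentence output."""
    n_words = len(text.split())
    base = LENGTH_PMF.get(n_words, 0.0) / max(LENGTH_PMF.values())
    # One sentence-ending mark is free; each additional one is penalized.
    extra_stops = max(0, sum(text.count(c) for c in ".!?") - 1)
    return base - punct_penalty * extra_stops
```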

Example generations (from evaluation set)

  • Caregiver Utterance: What did you have for dessert for lunch?
    • Gold Child Response: a ice cream sandwich
    • llm-grpo-toddler-small-11: want ice cream
    • llm-toddler-30: i had some spaghettis.
  • Caregiver Utterance: What did we do with the toy?
    • Gold Child Response: we put it in his cage
    • llm-grpo-toddler-small-11: put toy in toy box
    • llm-toddler-30: it broke.
  • Caregiver Utterance: Where is your bag?
    • Gold Child Response: I didn't bring it. It's in the car
    • llm-grpo-toddler-small-11: me carry bag
    • llm-toddler-30: i got it
  • Caregiver Utterance: What did you find?
    • Gold Child Response: look. Elmo
    • llm-grpo-toddler-small-11: me found toy ball
    • llm-toddler-30: i found luke's lightsaber.
  • Caregiver Utterance: What did you put on the pizza?
    • Gold Child Response: cheese and cheese
    • llm-grpo-toddler-small-11: put pizza on pizza
    • llm-toddler-30: i put sause on it

Evaluation (highlights)

  • Automatic (1000 caregiver–child Q/A pairs, coherence judged by GPT-4.1):
    • SmolLM‑350M (GRPO) achieved 0.733 accuracy, versus 0.740 for Toddler-LLM
  • Human evaluation (19 participants; Cozmo and Misty II platforms):
    • Perceived age: around Pre‑K/K; Cozmo generally perceived younger and more lifelike/likeable
    • Age of acquisition (AoA) and vocabulary: model output remained child-like; occasional adult-level responses emerged from the base model's pretraining knowledge

Intended use

  • Rapid prototyping of toddler-like dialogue on small, efficient base models
  • Human-robot interaction research with adjustable RM weighting and LoRA

Out-of-scope and limitations

  • Not for clinical/diagnostic use or child assessment
  • English-only; inherits base model knowledge which may surface adult concepts
  • Sensitive to RM weight balance; may produce random “baby words” or off-context responses if misweighted
  • Can occasionally respond in other languages depending on base model capabilities