Toddler-LLM-GRPO (finetuned from SmolLM2 with GRPO)
Overview
- Model name: Toddler-LLM-GRPO
- Type: Decoder-only small LM
- Base model: SmolLM2 (360M parameters used in primary evaluations)
- Status: GRPO-finetuned to produce coherent, short, toddler-like responses
- Primary language: English
- Target behavior: Coherent, short, child-like responses (approx. 2–3 years old)
Training data
- Parent–child utterances from CHILDES
- Caregiver prompts filtered by RM-4 (top 10% clarity)
- Child utterance coherence scored by RM-2
Finetuning procedure (GRPO)
- GRPO parameters (Step 1):
  - num_of_generations: 8
  - batch_size: 200
  - warmup_ratio: 0.1
  - max_prompt_length: 96
  - max_completion_length: 96
  - dtype: bfloat16
  - steps: 2000 (checkpoints every 250)
- LoRA (PEFT):
  - rank: 64
  - lora_alpha: 64
  - target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
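Assuming training used TRL's `GRPOTrainer` together with PEFT (the library choice and the `output_dir` name are assumptions, not stated in the card), the hyperparameters above map to a config sketch like:

```python
# Sketch only: mirrors the hyperparameters listed above under the
# assumption that TRL's GRPOConfig and peft's LoraConfig were used.
from trl import GRPOConfig
from peft import LoraConfig

grpo_config = GRPOConfig(
    output_dir="toddler-llm-grpo",   # hypothetical name
    num_generations=8,               # "num_of_generations" above
    per_device_train_batch_size=200,
    warmup_ratio=0.1,
    max_prompt_length=96,
    max_completion_length=96,
    bf16=True,                       # dtype: bfloat16
    max_steps=2000,
    save_steps=250,                  # checkpoints every 250 steps
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```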
- Reward weights:
  - RM-1 (childish): 1.0
  - RM-2 (coherence): 0.2
  - RM-3 (length PMF): 0.5
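The weighting above amounts to a simple linear mix of the three reward-model scores. A minimal sketch (the function name is hypothetical; per-sample scores are assumed to lie in [0, 1]):

```python
# Hypothetical sketch of the reward weighting described above.
# rm1, rm2, rm3 are per-sample scores from RM-1 (childish),
# RM-2 (coherence), and RM-3 (length PMF), assumed in [0, 1].

REWARD_WEIGHTS = {"rm1": 1.0, "rm2": 0.2, "rm3": 0.5}

def combined_reward(rm1: float, rm2: float, rm3: float) -> float:
    """Weighted sum of the three reward-model scores."""
    return (REWARD_WEIGHTS["rm1"] * rm1
            + REWARD_WEIGHTS["rm2"] * rm2
            + REWARD_WEIGHTS["rm3"] * rm3)
```

With these weights the maximum achievable reward is 1.7, and RM-1 dominates the signal, which is why the coherence and length terms below are needed as counterweights.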
Reward models and filters
- RM-1 (Toddler-BERT): encourages child-like style; available at enochlev/childish_behavior_model
  - Note: used alone, it can drive the policy toward incoherent/nonsensical local maxima
- RM-2 (Coherence-BERT): ensures coherence; trained with soft labels; available at enochlev/child_coherence_model
- RM-3 (Length PMF): favors single, length-appropriate sentences using a CHILDES-based length PMF plus a punctuation penalty
- RM-4 (Caregiver clarity): used solely to filter training prompts (top 10%)
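RM-3's behavior can be illustrated with a toy sketch. The PMF values and the penalty constant here are illustrative placeholders, not the actual CHILDES-derived numbers:

```python
# Illustrative sketch of RM-3: reward = PMF(token-length) minus a penalty
# for extra sentence terminators (toy numbers, not the real CHILDES PMF).

TOY_LENGTH_PMF = {1: 0.05, 2: 0.20, 3: 0.30, 4: 0.25, 5: 0.15, 6: 0.05}

def length_reward(text: str, punct_penalty: float = 0.5) -> float:
    """Score a completion by its length plausibility under the toy PMF."""
    n_tokens = len(text.split())
    pmf = TOY_LENGTH_PMF.get(n_tokens, 0.0)
    # Penalize more than one sentence terminator (favor a single sentence).
    terminators = sum(text.count(c) for c in ".!?")
    penalty = punct_penalty * max(0, terminators - 1)
    return max(0.0, pmf - penalty)
```

For example, a three-token reply with no trailing punctuation scores the PMF peak, while a multi-sentence reply of the same token count is pushed toward zero.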
Observations during training
- RM-1 and RM-2 must be balanced; over-emphasizing RM-1 harms coherence; over-emphasizing RM-2 can reduce “child-like” behavior
- RM-3 converges quickly (near-100%); primarily stabilizes output length
- Minor data augmentation was added to discourage spurious “baby words”
Example generations (from evaluation set)
- Caregiver Utterance: What did you have for dessert for lunch?
  - Gold Child Response: a ice cream sandwich
  - llm-grpo-toddler-small-11: want ice cream
  - llm-toddler-30: i had some spaghettis.
- Caregiver Utterance: What did we do with the toy?
  - Gold Child Response: we put it in his cage
  - llm-grpo-toddler-small-11: put toy in toy box
  - llm-toddler-30: it broke.
- Caregiver Utterance: Where is your bag?
  - Gold Child Response: I didn't bring it. It's in the car
  - llm-grpo-toddler-small-11: me carry bag
  - llm-toddler-30: i got it
- Caregiver Utterance: What did you find?
  - Gold Child Response: look. Elmo
  - llm-grpo-toddler-small-11: me found toy ball
  - llm-toddler-30: i found luke's lightsaber.
- Caregiver Utterance: What did you put on the pizza?
  - Gold Child Response: cheese and cheese
  - llm-grpo-toddler-small-11: put pizza on pizza
  - llm-toddler-30: i put sause on it
Evaluation (highlights)
- Automatic (1,000 caregiver–child Q/A pairs; coherence judged by ChatGPT 4.1):
  - SmolLM2‑360M (GRPO) achieved 0.733 accuracy (Toddler-LLM: 0.740)
- Human evaluation (19 participants; Cozmo and Misty II robot platforms):
  - Perceived age: around Pre‑K/Kindergarten; Cozmo was generally perceived as younger and more lifelike/likeable
  - Age of acquisition (AoA) and vocabulary: model output remained child-like; occasional adult-level responses surfaced from the base model's pretraining knowledge
Intended use
- Rapid prototyping of toddler-like dialogue on small, efficient base models
- Human-robot interaction research with adjustable RM weighting and LoRA
Out-of-scope and limitations
- Not for clinical/diagnostic use or child assessment
- English-only; inherits base model knowledge which may surface adult concepts
- Sensitive to RM weight balance; may produce random “baby words” or off-context responses if misweighted
- Can occasionally respond in other languages depending on base model capabilities
Model tree for enochlev/llm-grpo-toddler-small-11
- Base model: HuggingFaceTB/SmolLM2-360M
- HuggingFaceTB/SmolLM2-360M-Instruct