Toddler-LLM-GRPO (finetuned from SmolLM2 with GRPO)

Overview

  • Model name: Toddler-LLM-GRPO
  • Type: Decoder-only small LM
  • Base model: SmolLM2 (350M parameters used in primary evaluations)
  • Status: GRPO-finetuned to produce coherent, short, toddler-like responses
  • Primary language: English
  • Target behavior: Coherent, short, child-like responses (approx. 2–3 years old)
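A minimal inference sketch, assuming the `transformers` library and the checkpoint ID that appears in the examples below (`enochlev/llm-grpo-toddler-small-11`). The single-sentence post-processing helper is an assumption for illustration, not part of the released pipeline:

```python
import re

MODEL_ID = "enochlev/llm-grpo-toddler-small-11"  # hypothetical hub ID from the examples below

def first_sentence(text: str) -> str:
    """Keep only the first sentence of a generation; the length reward (RM-3)
    favors single short sentences, so trailing continuations are usually noise."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return parts[0] if parts else ""

if __name__ == "__main__":
    # Heavy imports kept inside the guard so the helper above stays importable.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")

    prompt = "What did you have for dessert for lunch?"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=32, do_sample=True, top_p=0.9)
    reply = tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    print(first_sentence(reply))
```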

Training data

  • Parent–child utterances from CHILDES
  • Caregiver prompts filtered by RM-4 (top 10% clarity)
  • Child utterance coherence scored by RM-2
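The RM-4 prompt filter can be sketched as a simple top-decile cut over clarity scores. The scoring model itself is not shown; `filter_top_fraction` is a hypothetical helper that assumes one clarity score per prompt has already been computed:

```python
def filter_top_fraction(prompts, scores, fraction=0.10):
    """Keep the highest-scoring `fraction` of prompts (at least one).

    `scores` are assumed to be clarity scores from RM-4, one per prompt.
    """
    keep = max(1, int(len(prompts) * fraction))
    # Rank prompts by score, highest first, and take the top slice.
    ranked = sorted(zip(scores, prompts), reverse=True)
    return [p for _, p in ranked[:keep]]
```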

Finetuning procedure (GRPO)

  • GRPO parameters (Step 1):
    • num_of_generations: 8
    • batch_size: 200
    • warmup_ratio: 0.1
    • max_prompt_length: 96
    • max_completion_length: 96
    • dtype: bfloat16
    • steps: 2000 (checkpoints every 250)
  • LoRA (PEFT):
    • rank: 64
    • lora_alpha: 64
    • target_modules: [q,k,v,o,gate,up,down]
  • Reward weights:
    • RM-1 (childish): 1.0
    • RM-2 (coherence): 0.2
    • RM-3 (length PMF): 0.5
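The hyperparameters above can be expressed as a configuration sketch using `trl`'s `GRPOConfig` and `peft`'s `LoraConfig`; argument names follow current library versions and may differ slightly from the original training script:

```python
from peft import LoraConfig
from trl import GRPOConfig

# GRPO training arguments mirroring the Step 1 parameters listed above.
training_args = GRPOConfig(
    num_generations=8,
    per_device_train_batch_size=200,
    warmup_ratio=0.1,
    max_prompt_length=96,
    max_completion_length=96,
    bf16=True,
    max_steps=2000,
    save_steps=250,  # checkpoints every 250 steps
)

# LoRA adapter settings; the shorthand [q,k,v,o,gate,up,down] above is
# expanded to the usual projection-module names.
peft_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```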

Reward models and filters

  • RM-1 (Toddler-BERT): encourages child-like style; available at enochlev/childish_behavior_model
    • Note: used alone, it can drive generations toward incoherent/nonsensical local maxima
  • RM-2 (Coherence-BERT): ensures coherence; trained with soft labels; available at enochlev/child_coherence_model
  • RM-3 (Length PMF): favors single, length-appropriate sentences using CHILDES-based PMF and punctuation penalty
  • RM-4 (Caregiver clarity): used solely to filter training prompts (top 10%)
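A sketch of how the three training-time reward signals combine under the weights listed above. The individual scorers (Toddler-BERT, Coherence-BERT, and the length PMF) are represented only by their scores here; the weighted-sum form is an assumption consistent with the stated weights:

```python
# Reward weights from the "Reward weights" list above.
WEIGHTS = {"rm1_childish": 1.0, "rm2_coherence": 0.2, "rm3_length": 0.5}

def combined_reward(rm1_score, rm2_score, rm3_score):
    """Weighted sum of per-completion reward scores, each assumed in [0, 1]."""
    return (WEIGHTS["rm1_childish"] * rm1_score
            + WEIGHTS["rm2_coherence"] * rm2_score
            + WEIGHTS["rm3_length"] * rm3_score)
```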

Observations during training

  • RM-1 and RM-2 must be balanced; over-emphasizing RM-1 harms coherence; over-emphasizing RM-2 can reduce “child-like” behavior
  • RM-3 converges quickly (near-100%); primarily stabilizes output length
  • Minor data augmentation was added to discourage spurious “baby words”
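An RM-3-style length reward can be sketched as the probability of a completion's word count under a length PMF, normalized to [0, 1], minus a penalty for extra sentence-ending punctuation (favoring a single sentence). The PMF values and penalty weight below are illustrative placeholders, not the actual CHILDES statistics:

```python
# Illustrative word-count PMF; the real reward uses CHILDES-derived statistics.
LENGTH_PMF = {1: 0.10, 2: 0.20, 3: 0.30, 4: 0.20, 5: 0.10, 6: 0.05}

def length_reward(text, punct_penalty=0.5):
    """Score by normalized PMF of the word count, penalizing multi-sentence output."""
    n_words = len(text.split())
    base = LENGTH_PMF.get(n_words, 0.0) / max(LENGTH_PMF.values())
    # One sentence-ending mark is free; each additional one is penalized.
    extra_stops = max(0, sum(text.count(c) for c in ".!?") - 1)
    return base - punct_penalty * extra_stops
```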

Example generations (from evaluation set)

  • Caregiver Utterance: What did you have for dessert for lunch?
    • Gold Child Response: a ice cream sandwich
    • llm-grpo-toddler-small-11: want ice cream
    • llm-toddler-30: i had some spaghettis.
  • Caregiver Utterance: What did we do with the toy?
    • Gold Child Response: we put it in his cage
    • llm-grpo-toddler-small-11: put toy in toy box
    • llm-toddler-30: it broke.
  • Caregiver Utterance: Where is your bag?
    • Gold Child Response: I didn't bring it. It's in the car
    • llm-grpo-toddler-small-11: me carry bag
    • llm-toddler-30: i got it
  • Caregiver Utterance: What did you find?
    • Gold Child Response: look. Elmo
    • llm-grpo-toddler-small-11: me found toy ball
    • llm-toddler-30: i found luke's lightsaber.
  • Caregiver Utterance: What did you put on the pizza?
    • Gold Child Response: cheese and cheese
    • llm-grpo-toddler-small-11: put pizza on pizza
    • llm-toddler-30: i put sause on it

Evaluation (highlights)

  • Automatic (1000 caregiver–child Q/A pairs, coherence judged by GPT-4.1):
    • SmolLM‑350M (GRPO) achieved 0.733 accuracy, versus 0.740 for Toddler-LLM
  • Human evaluation (19 participants; Cozmo and Misty II platforms):
    • Perceived age: around Pre‑K/K; Cozmo generally perceived younger and more lifelike/likeable
    • Age of acquisition (AoA) and vocabulary: model output remained child-like; occasional adult-level responses emerged from the base model's pretraining knowledge

Intended use

  • Rapid prototyping of toddler-like dialogue on small, efficient base models
  • Human-robot interaction research with adjustable RM weighting and LoRA

Out-of-scope and limitations

  • Not for clinical/diagnostic use or child assessment
  • English-only; inherits base model knowledge which may surface adult concepts
  • Sensitive to RM weight balance; may produce random “baby words” or off-context responses if misweighted
  • Can occasionally respond in other languages depending on base model capabilities