# SLM-RL-Agents: Model Checkpoints
**Paper:** Towards Robust Reinforcement Learning for Small-Scale Language Model Agents
**Authors:** Md Rezwanul Haque, Md. Milon Islam, Fakhri Karray
**Code:** [github.com/rezwanh001/slm-rl-agents](https://github.com/rezwanh001/slm-rl-agents)
This repository hosts 30 trained checkpoints (15 SFT + 15 PPO) from the SLM-RL-Agents framework, a stabilised RLHF pipeline for training small language model agents in the 70M-410M parameter regime.
## Models
| Family | Model | Params | Layers |
|---|---|---|---|
| Pythia | Pythia-70M-deduped | 70M | 6 |
| Pythia | Pythia-160M-deduped | 162M | 12 |
| Pythia | Pythia-410M-deduped | 405M | 24 |
| SmolLM2 | SmolLM2-135M | 135M | 30 |
| SmolLM2 | SmolLM2-360M | 361M | 32 |
## Corpora

- TinyStories – simple narrative fiction
- CNN/DailyMail – news articles
- Wikitext-103 – encyclopaedic text
## Repository Layout

```
SLM-RL-Agents/
├── sft/                      # 15 LoRA adapters
│   ├── pythia-70m/{tinystories, cnn_dailymail, wikitext}/
│   ├── pythia-160m/...
│   ├── pythia-410m/...
│   ├── smollm2-135m/...
│   └── smollm2-360m/...
└── ppo/                      # 15 fully merged models
    ├── pythia-70m/{tinystories, cnn_dailymail, wikitext}/
    └── ...
```
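Because the layout is fully regular, every checkpoint subfolder can be enumerated programmatically. A minimal sketch (folder names taken from the tree above):

```python
from itertools import product

STAGES = ["sft", "ppo"]
MODELS = ["pythia-70m", "pythia-160m", "pythia-410m",
          "smollm2-135m", "smollm2-360m"]
CORPORA = ["tinystories", "cnn_dailymail", "wikitext"]

def checkpoint_paths():
    """Enumerate all 30 checkpoint subfolders: 2 stages x 5 models x 3 corpora."""
    return [f"{stage}/{model}/{corpus}"
            for stage, model, corpus in product(STAGES, MODELS, CORPORA)]
```

Each returned path, suffixed with `/**`, can be passed as an `allow_patterns` entry to `snapshot_download` to fetch a single checkpoint.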
## Key Results
| Configuration | Reward Delta | Win Rate |
|---|---|---|
| Pythia-410M / TinyStories | +1.36 | 59.9% |
| SmolLM2-360M / TinyStories | +0.72 | 59.7% |
| SmolLM2-360M / Wikitext-103 | +0.27 | 56.5% |
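For reference, the two metrics can be read as follows (a hedged sketch; the paper's exact evaluation protocol may differ): reward delta as the mean reward improvement of the PPO policy over its SFT baseline, and win rate as the fraction of prompts where the PPO response scores higher.

```python
def reward_delta(ppo_rewards, sft_rewards):
    """Mean reward of the PPO policy minus mean reward of the SFT baseline."""
    return sum(ppo_rewards) / len(ppo_rewards) - sum(sft_rewards) / len(sft_rewards)

def win_rate(ppo_rewards, sft_rewards):
    """Fraction of prompts on which the PPO response out-scores the SFT response."""
    wins = sum(p > s for p, s in zip(ppo_rewards, sft_rewards))
    return wins / len(ppo_rewards)
```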
## Quick Start

```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download only the PPO checkpoint for SmolLM2-360M / TinyStories
root = snapshot_download(
    repo_id="mr3haque/SLM-RL-Agents",
    allow_patterns="ppo/smollm2-360m/tinystories/**",
)

path = f"{root}/ppo/smollm2-360m/tinystories"
model = AutoModelForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)
```
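The PPO checkpoints are fully merged and load directly as above. The SFT checkpoints are LoRA adapters, so they must be attached to their base model first; a sketch using `peft`, where the base-model IDs are assumptions inferred from the Models table:

```python
# Base-model IDs assumed from the Models table above.
BASE_IDS = {
    "pythia-70m": "EleutherAI/pythia-70m-deduped",
    "pythia-160m": "EleutherAI/pythia-160m-deduped",
    "pythia-410m": "EleutherAI/pythia-410m-deduped",
    "smollm2-135m": "HuggingFaceTB/SmolLM2-135M",
    "smollm2-360m": "HuggingFaceTB/SmolLM2-360M",
}

def load_sft(model_key, corpus):
    """Download one SFT LoRA adapter and attach it to its base model."""
    # Heavy dependencies imported lazily, only when a checkpoint is loaded.
    from huggingface_hub import snapshot_download
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    subdir = f"sft/{model_key}/{corpus}"
    root = snapshot_download(repo_id="mr3haque/SLM-RL-Agents",
                             allow_patterns=f"{subdir}/**")
    base = AutoModelForCausalLM.from_pretrained(BASE_IDS[model_key])
    model = PeftModel.from_pretrained(base, f"{root}/{subdir}")
    tokenizer = AutoTokenizer.from_pretrained(BASE_IDS[model_key])
    return model, tokenizer
```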
## Citation

```bibtex
@misc{haque2026slmrlagents,
  title        = {Towards Robust Reinforcement Learning for Small-Scale Language Model Agents},
  author       = {Haque, Md Rezwanul and Islam, Md. Milon and Karray, Fakhri},
  year         = {2026},
  howpublished = {\url{https://github.com/rezwanh001/slm-rl-agents}}
}
```