---
license: apache-2.0
tags:
- trl
- ppo
- lora
- alignment
- reward-modeling
- ultrafeedback
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
---

# Aligned TinyLlama on UltraFeedback (fixed-1k prompt pool)

This model was aligned with **TRL PPO** using a reward model:

- **payelb/UltraFeedback_openbmb_deberta_1k_fixed_baseline** (tag: `baseline`)

Key settings:

- Prompt pool: restricted to the same fixed/selected 1k subset used for RM training (loaded from CSV)
- PPO updates: 200
- Batch size: 4
- Learning rate: 1e-05
- LoRA: r=16, alpha=32, dropout=0.05