---
license: apache-2.0
tags:
- trl
- ppo
- lora
- alignment
- reward-modeling
- ultrafeedback
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
---

# Aligned TinyLlama on UltraFeedback (fixed-1k prompt pool)

This model was aligned with **TRL PPO** using a reward model:

- **payelb/UltraFeedback_openbmb_deberta_1k_fixed_baseline** (tag: `baseline`)

Key settings:

- Prompt pool: restricted to the same fixed/selected 1k subset used for RM training (loaded from CSV)
- PPO updates: 200
- Batch size: 4
- Learning rate: 1e-05
- LoRA: r=16, alpha=32, dropout=0.05