---
license: apache-2.0
tags:
  - trl
  - ppo
  - lora
  - alignment
  - reward-modeling
  - ultrafeedback
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
---

# Aligned TinyLlama on UltraFeedback (fixed-1k prompt pool)

This model was aligned with TRL PPO against the following reward model:

- payelb/UltraFeedback_openbmb_deberta_1k_fixed_baseline (tag: baseline)
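TRL-style PPO alignment typically maximizes the reward-model score minus a KL penalty that keeps the policy close to the reference model. A minimal sketch of that reward shaping (the `beta` value and the per-token form are assumptions, not stated in this card):

```python
def shaped_reward(rm_score: float, logprob_policy: float, logprob_ref: float, beta: float = 0.1) -> float:
    """KL-penalized reward in the style used by TRL's PPO trainer.

    rm_score:       scalar score from the reward model for the sampled response
    logprob_policy: log-probability of the response under the current policy
    logprob_ref:    log-probability under the frozen reference model
    beta:           KL-penalty coefficient (hypothetical value, not from this card)
    """
    # Approximate KL term: log-ratio between policy and reference.
    kl = logprob_policy - logprob_ref
    # The PPO objective then optimizes the RM score minus the KL penalty.
    return rm_score - beta * kl
```

With `beta = 0`, the policy would chase the reward model unconstrained; the penalty trades reward against drift from the base model.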

Key settings:

- Prompt pool: restricted to the same fixed 1k subset used for reward-model training (loaded from CSV)
- PPO updates: 200
- Batch size: 4
- Learning rate: 1e-05
- LoRA: r=16, alpha=32, dropout=0.05
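The settings above roughly correspond to a TRL/PEFT configuration like the following. This is a hedged sketch, not the actual run script: TRL's PPO API differs across versions, and everything beyond the hyperparameters listed in this card (task type, target modules, seed handling) is an assumption.

```python
# Hypothetical reconstruction of the training configuration; only the
# numeric hyperparameters come from this model card.
from peft import LoraConfig
from trl import PPOConfig

lora_config = LoraConfig(
    r=16,                 # from card
    lora_alpha=32,        # from card
    lora_dropout=0.05,    # from card
    task_type="CAUSAL_LM",  # assumption: standard causal-LM adapter setup
)

ppo_config = PPOConfig(
    learning_rate=1e-5,   # from card
    batch_size=4,         # from card
)
# Training then runs for 200 PPO updates over the fixed 1k prompt pool.
```

The fixed 1k prompt pool means PPO only ever samples prompts the reward model was trained on, which avoids out-of-distribution RM scores at the cost of prompt diversity.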