---
license: apache-2.0
tags:
  - trl
  - ppo
  - lora
  - alignment
  - reward-modeling
  - ultrafeedback
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
---

# Aligned TinyLlama on UltraFeedback (fixed-1k prompt pool)

This model was aligned with TRL PPO against the following reward model:

- payelb/UltraFeedback_openbmb_deberta_1k_fixed_baseline (tag: baseline)
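TRL-style PPO alignment typically maximizes the reward-model score minus a KL penalty that keeps the policy close to the reference model. A minimal sketch of that reward shaping (the `beta` value and the per-token form are assumptions, not stated in this card):

```python
def shaped_reward(rm_score: float, logprob_policy: float, logprob_ref: float, beta: float = 0.1) -> float:
    """KL-penalized reward in the style used by TRL's PPO trainer.

    rm_score:       scalar score from the reward model for the sampled response
    logprob_policy: log-probability of the response under the current policy
    logprob_ref:    log-probability under the frozen reference model
    beta:           KL-penalty coefficient (hypothetical value, not from this card)
    """
    # Approximate KL term: log-ratio between policy and reference.
    kl = logprob_policy - logprob_ref
    # The PPO objective then optimizes the RM score minus the KL penalty.
    return rm_score - beta * kl
```

With `beta = 0`, the policy would chase the reward model unconstrained; the penalty trades reward against drift from the base model.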

Key settings:

- Prompt pool: restricted to the same fixed 1k subset used for reward-model training (loaded from CSV)
- PPO updates: 200
- Batch size: 4
- Learning rate: 1e-05
- LoRA: r=16, alpha=32, dropout=0.05
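The settings above roughly correspond to a TRL/PEFT configuration like the following. This is a hedged sketch, not the actual run script: TRL's PPO API differs across versions, and everything beyond the hyperparameters listed in this card (task type, target modules, seed handling) is an assumption.

```python
# Hypothetical reconstruction of the training configuration; only the
# numeric hyperparameters come from this model card.
from peft import LoraConfig
from trl import PPOConfig

lora_config = LoraConfig(
    r=16,                 # from card
    lora_alpha=32,        # from card
    lora_dropout=0.05,    # from card
    task_type="CAUSAL_LM",  # assumption: standard causal-LM adapter setup
)

ppo_config = PPOConfig(
    learning_rate=1e-5,   # from card
    batch_size=4,         # from card
)
# Training then runs for 200 PPO updates over the fixed 1k prompt pool.
```

The fixed 1k prompt pool means PPO only ever samples prompts the reward model was trained on, which avoids out-of-distribution RM scores at the cost of prompt diversity.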