Direct Preference Optimization: Your Language Model is Secretly a Reward Model

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF).

admin_sagi | November 9, 2023
