LLM Alignment & Exploration
Overview
Large Language Models (LLMs) are powerful, but aligning them with human preferences and encouraging them to explore novel solutions remains difficult. We bring techniques from control theory and exploration research to LLMs.
Active Projects
1. Bayesian Optimization from Human Feedback
Goal: Optimize LLM outputs with minimal human labelling. Details: We frame alignment as a Bayesian optimization problem: by querying human preferences efficiently, we aim to find near-optimal prompts or model weights under theoretical regret bounds, minimizing the cost of human annotation.
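The loop above can be sketched in a few lines. The toy below is a minimal Bayesian optimization sketch, not our actual pipeline: it tunes a single scalar (a hypothetical sampling temperature) with a numpy Gaussian-process surrogate and a UCB acquisition rule, and the `human_rating` function is a stand-in for a real human rater, invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a noisy human rater; the true optimum (0.7)
# is unknown to the optimizer. NOT part of any real annotation system.
def human_rating(t):
    return np.exp(-(t - 0.7) ** 2 / 0.05) + 0.05 * rng.standard_normal()

candidates = np.linspace(0.0, 1.5, 61)   # temperatures under consideration

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel between two 1-D point sets.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

X = [0.1, 1.4]                           # two seed queries
y = [human_rating(0.1), human_rating(1.4)]
for _ in range(15):                      # budget: 15 further human queries
    Xa, ya = np.array(X), np.array(y)
    K = rbf(Xa, Xa) + 1e-3 * np.eye(len(Xa))   # GP prior + observation noise
    Ks = rbf(candidates, Xa)
    mu = Ks @ np.linalg.solve(K, ya)           # posterior mean
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    ucb = mu + 2.0 * np.sqrt(np.clip(var, 0.0, None))  # UCB acquisition
    x_next = candidates[np.argmax(ucb)]        # most promising query
    X.append(float(x_next))
    y.append(human_rating(x_next))

best = X[int(np.argmax(y))]
print(f"best temperature found: {best:.2f}")
```

UCB is just one convenient acquisition rule; expected improvement or Thompson sampling would slot into the same loop, and the regret guarantees mentioned above depend on which rule is used.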
2. Post-Training Exploration
Goal: Encourage LLMs to think “outside the box.” Details: Standard RLHF can collapse onto a few high-reward modes, producing repetitive answers. We are investigating intrinsic rewards that push the model to explore diverse reasoning paths and discover creative solutions during fine-tuning.
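One common form of intrinsic reward is a novelty bonus. The toy below is a minimal sketch, not an actual RLHF run: "responses" are random embedding vectors, `task_reward` is a hypothetical stand-in for a preference model, and the intrinsic term is the distance to the nearest previously generated response, in the spirit of novelty-search methods.

```python
import numpy as np

rng = np.random.default_rng(1)

BETA = 0.5                                # weight on the intrinsic bonus

def task_reward(z):
    # Hypothetical stand-in for a learned preference/reward model.
    return -float(np.linalg.norm(z - 1.0))

def novelty(z, archive):
    # Intrinsic bonus: distance to the nearest earlier response embedding.
    if not archive:
        return 0.0
    return min(float(np.linalg.norm(z - a)) for a in archive)

archive = []                              # embeddings of past responses
shaped_rewards = []
for _ in range(20):
    z = rng.standard_normal(4)            # a sampled "response" embedding
    r = task_reward(z) + BETA * novelty(z, archive)
    archive.append(z)
    shaped_rewards.append(r)

print(f"mean shaped reward over 20 samples: {np.mean(shaped_rewards):.2f}")
```

The design choice is the combined objective `task_reward + BETA * novelty`: with `BETA = 0`, samples that repeat a known high-reward mode are indistinguishable from novel ones, which is exactly the collapse described above.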