OpenAI at QCon AI NYC: Fine-Tuning the Enterprise
Briefly

"At QCon AI NYC 2025, Will Hang from OpenAI presented an overview of Agent RFT, a reinforcement fine-tuning approach intended to improve the performance of tool-using agents. Hang described a pragmatic improvement path that starts with prompt and task optimization before changing model weights. Examples included simplifying requirements, adding guardrails to prevent tool misuse, improving tool descriptions, and improving tool outputs so the agent can make better downstream decisions."
"He positioned fine-tuning options as a spectrum. Supervised fine-tuning was described as effective when there is a predictable mapping from input to output and the goal is to imitate a consistent style or structure. Preference optimization was described as a method for shifting outputs toward preferred responses using paired comparisons, and OpenAI's Direct Preference Optimization guide describes it as fine-tuning by comparing model outputs and notes it is currently limited to text inputs and outputs."
"Beware of reward hacking! Resolve any edge cases in your grader. Continuous rewards work better than binary rewards. - Will Hang, OpenAI Agent RFT was presented as reinforcement fine-tuning adapted to tool-using agents, where the model explores different strategies during training rollouts and receives a learning signal from a grader. OpenAI's documentation describes the loop as sampling candidate responses, scoring them with a grader you define, and updating the model based on those scores."
Agent RFT adapts reinforcement fine-tuning to tool-using agents by sampling candidate responses during training rollouts, scoring them with a grader, and updating the model based on those scores. Prompt and task optimization (simplifying requirements, adding guardrails, improving tool descriptions, and refining tool outputs) often provides high leverage before changing model weights, but can plateau on tasks that require consistent multi-step reasoning across tool interactions. Fine-tuning options form a spectrum: supervised fine-tuning for predictable input-output mappings, preference optimization for shifting outputs via paired comparisons, and reinforcement fine-tuning for discovering strategies over longer trajectories. Graders should use continuous rewards, handle edge cases to avoid reward hacking, and enable credit assignment across full trajectories so that earlier tool-selection decisions receive learning signals.
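
As a hypothetical illustration of those grader recommendations, the sketch below returns a continuous score in the 0-to-1 range rather than a pass/fail flag, blends correctness with a small efficiency term, and guards the edge cases (unparseable output, missing answers, nonsense tool-call counts) that a reward-hacking policy could otherwise exploit. The `answer` and `tool_calls` fields are assumptions for the example, not OpenAI's grader schema.

```python
import json

def grade(expected_answer: str, rollout_json: str, max_tool_calls: int = 10) -> float:
    """Continuous grader sketch: returns a score in [0, 1].

    - Malformed or empty output gets 0.0 rather than raising, so the trainer
      never sees an undefined reward (an easy edge case to miss).
    - Correctness contributes most of the score; a small efficiency bonus
      rewards finishing in fewer tool calls instead of gaming a binary check.
    """
    try:
        rollout = json.loads(rollout_json)
    except (TypeError, ValueError):
        return 0.0  # edge case: output is not valid JSON
    if not isinstance(rollout, dict):
        return 0.0  # edge case: JSON is valid but not the expected object

    answer = str(rollout.get("answer", "")).strip().lower()
    if not answer:
        return 0.0  # edge case: no answer at all

    correctness = 1.0 if answer == expected_answer.strip().lower() else 0.0

    tool_calls = rollout.get("tool_calls", max_tool_calls)
    if not isinstance(tool_calls, int) or tool_calls < 0:
        tool_calls = max_tool_calls  # edge case: nonsense tool-call count
    efficiency = max(0.0, 1.0 - min(tool_calls, max_tool_calls) / max_tool_calls)

    return 0.8 * correctness + 0.2 * efficiency
```

Weighting correctness far above efficiency keeps the continuous signal informative without letting the model trade away accuracy for fewer tool calls.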
Read at InfoQ