DeepSeek-GRM: Introducing an Enhanced AI Reasoning Technique
Briefly

DeepSeek and Tsinghua University have unveiled a technique aimed at improving reasoning in large language models (LLMs), a vital capability in the competitive generative AI landscape. The approach combines generative reward modeling (GRM) with self-principled critique tuning (SPCT) to refine LLM responses. While China is still catching up with the U.S. in developing frontier models, it leads in AI patents and academic papers. The new method purportedly improves alignment with user preferences, yielding faster and more accurate answers, with promising results across reward modeling (RM) benchmarks.
In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e., the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods.
Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling.
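The core idea of inference-time scaling for a reward model is that spending more compute at inference, by sampling several independent judgments and aggregating them, yields a lower-variance reward signal. The sketch below is purely illustrative, not the authors' implementation: `sample_grm_judgment` is a hypothetical stand-in (here a seeded random draw) for one critique-and-score pass of a generative reward model.

```python
import statistics
import random

def sample_grm_judgment(query: str, response: str, seed: int) -> int:
    """Hypothetical stand-in for one sampled GRM pass: in the real system,
    a generative reward model writes a principle-guided critique and then
    extracts an integer score. Here we simulate the score with a seeded
    random draw so the example is self-contained and deterministic."""
    rng = random.Random(seed)
    return rng.randint(1, 10)  # score on an assumed 1-10 scale

def inference_time_scaled_reward(query: str, response: str, k: int = 8) -> float:
    """Sample k independent judgments and aggregate by averaging.
    Increasing k (i.e., spending more inference compute) reduces the
    variance of the final reward estimate."""
    scores = [sample_grm_judgment(query, response, seed=i) for i in range(k)]
    return statistics.mean(scores)

reward = inference_time_scaled_reward("What is 2+2?", "4", k=8)
print(round(reward, 2))
```

The averaging step is one simple aggregation choice; voting over sampled scores is another common option, and the trade-off is between robustness to outlier judgments (voting) and using the full score distribution (averaging).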
Read at TechRepublic