The problems with running human evals
Briefly

Running evaluations (evals) is critical in AI product development: evals ensure models deliver real value, remain safe, and align with user expectations. The process begins with defining product goals, which are then translated into rating templates for human raters. Raters assess model outputs against specific instructions, and the results are analyzed to refine the product. However, challenges such as ambiguous results and low Inter-Rater Reliability (IRR) can undermine conclusions, so robust metrics and clear quality thresholds are needed, particularly in sensitive areas.
Result ambiguity can come in different forms. The most common is disagreement among raters, which is quantified with IRR metrics.
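
To make the IRR check concrete, here is a minimal sketch for the two-rater case using Cohen's kappa from scikit-learn. The rating data and the `IRR_THRESHOLD` value are hypothetical, chosen only for illustration; real projects set thresholds based on their own risk tolerance, and studies with more than two raters typically use a metric such as Krippendorff's alpha instead.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two human raters scoring the same
# ten model responses on a 1-5 quality scale.
rater_a = [5, 4, 4, 2, 5, 3, 4, 1, 5, 4]
rater_b = [5, 4, 3, 2, 4, 3, 4, 2, 5, 4]

# Cohen's kappa measures agreement corrected for chance:
# 1.0 is perfect agreement, 0.0 is chance-level.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Hypothetical project threshold. By the common Landis & Koch (1977)
# benchmarks, kappa below ~0.4 indicates only slight-to-fair agreement.
IRR_THRESHOLD = 0.4
if kappa < IRR_THRESHOLD:
    print("Low IRR: clarify the rating template before trusting results.")
```

A low kappa here usually points at the rating template rather than the raters: ambiguous instructions or overlapping rating categories are the first things to revisit.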