In few-shot learning settings, evaluation metrics must account for the data imbalance commonly observed: most classes have very few samples, which can skew aggregate results, so specialized metrics are vital.
Experiment Design and Metrics for Mutation Testing with LLMs | HackerNoon
In evaluating LLM-generated mutations, we designed metrics that encompass cost, usability, and behavior, recognizing that a higher mutation score does not guarantee higher-quality mutations.
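
To illustrate why a mutation score alone can mislead, here is a minimal sketch. It assumes the common definition of mutation score (killed mutants divided by mutants that could actually be run) and pairs it with one hypothetical usability metric, the fraction of mutants that compile. The `Mutant` class, the function names, and the sample data are illustrative assumptions, not the metrics defined in the article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mutant:
    """Hypothetical record for one generated mutant (illustrative only)."""
    mutant_id: str
    compiles: bool  # whether the mutated program builds
    killed: bool    # whether at least one test fails on the mutant


def mutation_score(mutants: List[Mutant]) -> Optional[float]:
    """Killed mutants / compilable mutants, or None if nothing compiles."""
    runnable = [m for m in mutants if m.compiles]
    if not runnable:
        return None
    return sum(m.killed for m in runnable) / len(runnable)


def compilable_rate(mutants: List[Mutant]) -> Optional[float]:
    """Fraction of generated mutants that compile (assumed usability metric)."""
    if not mutants:
        return None
    return sum(m.compiles for m in mutants) / len(mutants)


# Made-up data: set B scores higher, yet half of its mutants are unusable.
set_a = [
    Mutant("a1", compiles=True, killed=True),
    Mutant("a2", compiles=True, killed=True),
    Mutant("a3", compiles=True, killed=False),
    Mutant("a4", compiles=True, killed=False),
]
set_b = [
    Mutant("b1", compiles=True, killed=True),
    Mutant("b2", compiles=True, killed=True),
    Mutant("b3", compiles=False, killed=False),
    Mutant("b4", compiles=False, killed=False),
]

for name, mutants in (("A", set_a), ("B", set_b)):
    print(name, mutation_score(mutants), compilable_rate(mutants))
```

In this made-up data, set B reaches a higher mutation score than set A only because its non-compiling mutants drop out of the denominator, which is the kind of effect that cost and usability metrics are meant to surface.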