In this paper, we introduce a unified evaluation method in which both single-choice and open-ended questions are treated as generation tasks, aligning better with actual LALM usage.
We adopt GPT-4 as a reference-based evaluator for assessing the generation quality of LALMs in the audio domain, owing to its closer alignment with human preferences.
Traditional metrics such as WER and ROUGE showed low correlation with human judgments, prompting us to explore LLM-based evaluation for improved accuracy.
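To make the protocol concrete, the following is a minimal sketch of reference-based judging with GPT-4 through the OpenAI chat API; the prompt template, the 1-5 scale, and the `judge` helper are illustrative assumptions rather than the exact prompt used in our experiments.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical grading prompt; the actual template and scale may differ.
JUDGE_PROMPT = """You are grading an answer to an audio-related question.
Question: {question}
Reference answer: {reference}
Model answer: {hypothesis}
Rate the model answer from 1 (wrong) to 5 (fully correct and relevant).
Reply with the integer score only."""


def judge(question: str, reference: str, hypothesis: str) -> int:
    """Ask GPT-4 to score one generated answer against the reference."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, hypothesis=hypothesis
            ),
        }],
    )
    return int(resp.choices[0].message.content.strip())
```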
This study addresses the challenges of automated evaluation for open-ended generation tasks, where models must produce hypotheses directly rather than merely being scored by perplexity.
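As a sketch of this distinction, the snippet below contrasts perplexity-style scoring of each answer option with the generation-first protocol adopted here; it uses a text-only GPT-2 backbone from Hugging Face as a stand-in (an actual LALM would also condition on the audio input), and the helper function names are hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder text backbone
lm = AutoModelForCausalLM.from_pretrained("gpt2")


def choice_logprob(prompt: str, choice: str) -> float:
    """Perplexity-style scoring: sum the log-probabilities of the choice tokens."""
    ids = tok(prompt + choice, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = lm(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # predictions for tokens 1..L-1
    targets = ids[0, 1:]
    rows = torch.arange(prompt_len - 1, ids.shape[1] - 1)  # positions predicting the choice
    return logprobs[rows, targets[prompt_len - 1:]].sum().item()


def generate_hypothesis(prompt: str) -> str:
    """Generation-first protocol: produce a free-form answer to be graded by the judge."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=64, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```

Under the generation-first protocol, even single-choice questions are answered with free-form text, which is then graded by the reference-based judge rather than by comparing per-option log-likelihoods.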
Collection