The article investigates how effectively sequence labeling, performed with large language models such as GPT-3.5, generates feedback for tutor training. It finds that traditional performance metrics, such as the F1 score, fail to capture the nuances of tokens labeled as praise. The study stresses the importance of distinguishing between feedback types and proposes new metrics that credit additional commentary capable of deepening tutors' understanding of student performance, with the ultimate aim of improving the quality of the educational support they provide.
In sequence labeling tasks, traditional metrics like the F1 score are insufficient because exact token-level matching treats any extra or missing word as an error, even when the prediction still identifies the praise. Our study introduces a modified approach to better assess model performance in identifying praise.
True Positives are tokens correctly labeled as praise, while False Positives are extra words erroneously swept into a predicted praise span. This distinction highlights a common challenge in generating precise feedback.
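To make the token-level distinction concrete, here is a minimal sketch of how such counts might be tallied. The BIO-free labeling scheme, the `token_prf` helper, and the example sequences are illustrative assumptions, not the paper's actual metric, which is only described at a high level in this excerpt.

```python
def token_prf(gold_labels, pred_labels, target="PRAISE"):
    """Compute token-level precision, recall, and F1 for one label.

    A True Positive is a token labeled `target` in both sequences;
    a False Positive is a predicted `target` token the gold lacks
    (e.g., extra words swept into a praise span); a False Negative
    is a gold `target` token the model missed.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_labels, pred_labels):
        if pred == target and gold == target:
            tp += 1
        elif pred == target:
            fp += 1
        elif gold == target:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: the model labels one extra token as praise, so exact
# token matching penalizes precision even though the praise was found.
gold = ["O", "PRAISE", "PRAISE", "O", "O"]
pred = ["O", "PRAISE", "PRAISE", "PRAISE", "O"]
print(token_prf(gold, pred))  # (0.666..., 1.0, 0.8)
```

Under this standard scheme, the extra token drags precision to 0.67 despite a fully recovered praise span, which is the kind of penalty the study's proposed metrics are meant to soften.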