
"These experiments led to two key discoveries, according to the paper. Tuning only the self-attention projection layers (SA Proj), the part of the model that helps it decide which input elements to focus on, allowed the models to learn new tasks with little or no measurable forgetting. Also, what initially appeared as forgotten knowledge often resurfaced when the model was later trained on another specialized task."
"We thus hypothesize that perhaps what looks like forgetting or interference after fine-tuning on a narrow target task is actually bias in the output distribution due to the task distribution shift," the researchers added. "Through in-depth analysis when tuning the counting task, we confirm this hypothesis: tuning the MLP increases target accuracy but also increases the likelihood of outputting numeric tokens and a highly correlated drop in held-out task accuracy, while tuning the self-attention achieves the target learning without much bias toward numeric tokens and without losing held-out accuracy."
A controlled evaluation trained selected large multimodal models on five target tasks — fine-grained bird classification, counting, medical visual question answering, OCR reading, and time reading — and measured performance drops across eight held-out benchmarks. Tuning only self-attention projection layers (SA Proj) enabled the models to learn new tasks with little or no measurable forgetting. Tuning MLP layers increased target accuracy but biased outputs toward numeric tokens, producing correlated drops in held-out task accuracy. Apparent forgetting often resurfaced when the model was later trained on another specialized task. The results implicate output-distribution bias from task distribution shift as a source of apparent forgetting.
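For readers who want to see what "tuning only the self-attention projection layers" looks like in practice, here is a minimal sketch. It is not the authors' training code: it assumes a PyTorch/Hugging Face-style model whose attention projections are named q_proj, k_proj, v_proj, and o_proj under a self_attn module, as in many LLaMA-style architectures, and the model name is a placeholder.

```python
# Sketch: fine-tune only the self-attention projection layers (SA Proj),
# keeping every other weight frozen. Module names are assumptions based on
# common LLaMA-style architectures, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-base-model")  # hypothetical

SA_PROJ_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")

for name, param in model.named_parameters():
    # Train a parameter only if it is a projection inside a self-attention block.
    param.requires_grad = ("self_attn" in name) and any(
        key in name for key in SA_PROJ_KEYS
    )

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,}")

# The remaining trainable parameters can be passed to any standard optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```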