Our central finding, that models do not need to maintain attention over the entire context across every layer, has direct implications for the inference efficiency of transformers, with estimated cost savings of up to 45% for the Llama model with 5 examples.
To study this, we introduced causal masking of self-attention over the context from layer ℓ onwards. The findings generalise across models of different sizes and hold for both non-instruction-tuned and instruction-tuned models.
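To make the masking procedure concrete, the sketch below builds per-layer attention masks in which query positions after the in-context examples can no longer attend to the context from layer ℓ onwards. This is an illustrative reconstruction, not the authors' implementation: the function name build_layer_masks and the parameters context_len and ell are our own, and applying such per-layer masks to a real model would additionally require patching its attention modules.

```python
# Illustrative sketch (assumed names, not the paper's released code):
# ordinary causal masks for layers < ell, and masks that additionally hide
# the context tokens from later query positions for layers >= ell.
import torch

def build_layer_masks(seq_len: int, context_len: int, num_layers: int, ell: int):
    """Return one boolean attention mask per layer (True = may attend)."""
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # From layer ell onwards, queries after the context ignore the context columns.
    no_context = causal.clone()
    no_context[context_len:, :context_len] = False
    return [causal if layer < ell else no_context for layer in range(num_layers)]

masks = build_layer_masks(seq_len=8, context_len=3, num_layers=6, ell=2)
print(masks[0].int())  # layers 0-1: standard causal attention
print(masks[2].int())  # layers 2-5: context columns masked for later queries
```

Intuitively, once the context is masked from layer ℓ onwards, its key–value entries no longer need to be computed or cached in those layers, which is where the potential inference savings noted above come from.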
We further identify certain layers as task-critical, and show that this corresponds to the model's task-recognition point and is not influenced by increasing the number of examples.
In future work, we hope to extend this analysis to other sequence or classification tasks, as well as to genuinely novel tasks.