Exploring Alternative Architectures for Multi-Token LLM Prediction | HackerNoon
Briefly

The described architecture is effective but not the only feasible option. Alternative designs, such as replicated unembeddings and linear heads, were also explored. Replicating the unembedding matrix is a straightforward route to multi-token prediction, but the resulting matrices are large and become costly during large-scale training. A simpler approach uses a linear layer for each head, which amounts to a linear probe of the model's representations. Architectures with more complex, multi-layer heads are possible but were not investigated further; the studies instead focused on the efficacy of the established design in experiments on real and synthetic data.
The architecture described in Section 2 is not the only sensible option, but it proved technically viable and well-performing in our experiments.
In this section we describe and compare alternative architectures, including replicated unembeddings and linear heads.
Replicating the unembedding matrix multiple times is a straightforward way of implementing multi-token prediction, though each copy is a full hidden-dimension-by-vocabulary matrix, which makes this approach resource-intensive at scale.
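As a rough sketch of this variant (not the paper's actual code; all names and sizes here are illustrative), each of the n future-token heads gets its own copy of the unembedding matrix applied to the shared trunk output:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_future = 64, 1000, 4  # illustrative sizes

# One replicated unembedding matrix per predicted future token.
# Each is (d_model x vocab), so the memory cost grows as n_future * d_model * vocab.
unembeddings = [rng.standard_normal((d_model, vocab)) * 0.02
                for _ in range(n_future)]

def predict_multi_token(hidden):
    """hidden: (seq_len, d_model) shared trunk output.
    Returns logits of shape (n_future, seq_len, vocab),
    one vocabulary distribution per future-token offset."""
    return np.stack([hidden @ W for W in unembeddings])

hidden = rng.standard_normal((8, d_model))
logits = predict_multi_token(hidden)
print(logits.shape)  # one logit matrix per future position
```

The sketch makes the cost concrete: with a realistic vocabulary (tens of thousands of tokens) and hidden size, each extra head adds an entire unembedding-sized parameter matrix.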
Architectures with more than one layer per head present additional possibilities, but we did not focus on experimentation in this area.
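The linear-head variant mentioned above can be sketched similarly (again a hypothetical illustration, not the authors' implementation): each head is a small square linear map into the shared unembedding, which is why it acts as a linear probe of the trunk's representation:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab, n_future = 64, 1000, 4  # illustrative sizes

# A single shared unembedding matrix, plus one small (d_model x d_model)
# linear map per future-token head.
shared_unembedding = rng.standard_normal((d_model, vocab)) * 0.02
head_maps = [rng.standard_normal((d_model, d_model)) * 0.02
             for _ in range(n_future)]

def predict_with_linear_heads(hidden):
    """hidden: (seq_len, d_model) shared trunk output.
    Returns logits of shape (n_future, seq_len, vocab)."""
    return np.stack([(hidden @ H) @ shared_unembedding for H in head_maps])

hidden = rng.standard_normal((8, d_model))
logits = predict_with_linear_heads(hidden)

# Parameter comparison: n_future * d_model**2 extra parameters here,
# versus n_future * d_model * vocab for replicated unembeddings.
extra_linear = n_future * d_model ** 2
extra_replicated = n_future * d_model * vocab
print(extra_linear, extra_replicated)
```

Since the vocabulary is typically much larger than the hidden dimension, the linear-head variant adds far fewer parameters than replicating the unembedding.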