Small language models are gaining traction as a viable alternative to large language models, especially in resource-constrained environments. They are particularly useful for applications that need low-latency responses, such as on smartphones or in vehicles. This shift raises questions about their future utility and potential in the AI landscape. Distillation is a key concept here: training a smaller model on the outputs of a much larger one so it approaches that performance at a fraction of the size. Even so, replicating the capability of a hyperscale model yourself is hard to do and rarely cost-effective.
Microsoft gave the example of organizations with very limited resources using them for very specific needs at the edge. So, for example, if you put small language models on smartphones or in a vehicle, they could answer user prompts very quickly without needing to send any data to the cloud or wait for a remote service to process the requests.
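As a rough illustration of that on-device pattern, here is a minimal sketch using the Hugging Face transformers library; the model (Microsoft's Phi-3 mini) and the prompt are chosen purely as examples and are not taken from the article.

```python
# Minimal sketch of on-device inference with a small language model:
# the model is loaded locally once, and prompts never leave the device.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",  # example small open model
    device_map="auto",  # CPU, GPU, or on-device accelerator if available
)

# Answer a user prompt with no round trip to the cloud.
reply = generator(
    "What does the tire-pressure warning light mean?",
    max_new_tokens=128,
)
print(reply[0]["generated_text"])
```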
Distillation, in short, means taking a small model, typically an open-source, lightweight language model, and training it on the outputs of a much larger model.
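A minimal sketch of that idea, assuming a Hugging Face-style setup: the large "teacher" model answers a set of prompts, and those prompt/answer pairs become supervised fine-tuning data for the small "student" model. The teacher model name, prompts, and file name below are hypothetical placeholders, not a specific vendor's API.

```python
# Sketch of sequence-level distillation: collect a large teacher model's
# answers, then fine-tune a small student model to imitate them.
import json
from transformers import pipeline

# Hypothetical large teacher model (in practice often reached via an API).
teacher = pipeline("text-generation", model="big-lab/large-teacher-model")

prompts = [
    "Summarise the warranty terms in one sentence.",
    "Explain what a battery health report shows.",
]

# Step 1: collect the teacher's outputs for each prompt.
with open("distillation_data.jsonl", "w") as f:
    for prompt in prompts:
        answer = teacher(prompt, max_new_tokens=128)[0]["generated_text"]
        f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")

# Step 2: fine-tune the small open-source student on this file with any
# standard supervised fine-tuning recipe, so it learns to reproduce the
# teacher's behaviour without ever seeing the teacher's weights.
```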
It feels very much like you're pulling something apart only to put it back together again in a possibly more expensive or less convenient way. The reality is, if you need the capability of one of the hyperscalers, you probably don't have the resources to replicate it yourself, or it wouldn't make any economic sense to do so.