Caching ensures that you don't pay for the model to do repetitive work and reprocess the same (or substantially similar) queries over and over again. According to AWS, this can reduce costs by up to 90%; a byproduct is that latency drops significantly as well (by up to 85%, AWS says). Adobe, which tested prompt caching for some of its generative AI applications on Bedrock, saw a 72% reduction in response time.
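As a rough sketch of how this might look from the caller's side, the example below uses boto3's Converse API and assumes a `cachePoint` content block marking the end of the cacheable prefix; the model ID, region, file name, and document contents are placeholders, not part of the announcement:

```python
import boto3

# Placeholder region and model ID; assumes a Bedrock model that supports prompt caching.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical shared document that many users will ask questions about.
document_text = open("annual_report.txt").read()

def ask(question: str) -> str:
    response = client.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        system=[
            # The large, repeated part of the prompt: the document itself.
            {"text": f"Answer questions using this document:\n{document_text}"},
            # Assumed cache marker: everything before this point can be cached,
            # so later calls reuse the processed prefix instead of paying to
            # reprocess the document.
            {"cachePoint": {"type": "default"}},
        ],
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# Only the first call processes the full document; follow-up questions that
# share the same prefix should hit the cache, cutting cost and latency.
print(ask("What was revenue last quarter?"))
print(ask("Summarize the main risk factors."))
```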
With intelligent prompt routing, Bedrock can automatically route prompts to different models in the same model family to help businesses strike the right balance between performance and cost. The system uses a small language model to predict how each model will perform for a given query, then routes the request accordingly.
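A minimal sketch of what calling such a router might look like, assuming it is addressed through a router resource that stands in for a single model ID in the Converse API; the ARN, account ID, and region below are placeholders:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder ARN: the assumption here is that an intelligent prompt router is
# invoked the same way as a model, just with the router's identifier.
PROMPT_ROUTER_ARN = (
    "arn:aws:bedrock:us-east-1:111122223333:default-prompt-router/example-router"
)

response = client.converse(
    modelId=PROMPT_ROUTER_ARN,
    messages=[{"role": "user", "content": [{"text": "What's the capital of France?"}]}],
)

# For a simple query like this, the router's small predictor model should send
# the request to a cheaper model in the family rather than the largest one.
print(response["output"]["message"]["content"][0]["text"])
```

The caller's code stays the same regardless of which model in the family ultimately answers; the cost/performance trade-off is handled by the routing layer.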
Say there is a document, and multiple people are asking questions about the same document. Without caching, you pay to reprocess it every single time, and these context windows are only getting longer.