LLM APIs are a Synchronization Problem
Briefly

"At its core, a large language model takes text, tokenizes it into numbers, and feeds those tokens through a stack of matrix multiplications and attention layers on the GPU. Using a large set of fixed weights, it produces activations and predicts the next token. If it weren't for temperature (randomization), you could think of it having the potential of being a much more deterministic system, at least in principle."
"As far as the core model is concerned, there's no magical distinction between "user text" and "assistant text"-everything is just tokens. The only difference comes from special tokens and formatting that encode roles (system, user, assistant, tool), injected into the stream via the prompt template. You can look at the system prompt templates on Ollama for the different models to get an idea."
"If I were to have my LLM run locally on the same machine, there is still state to be maintained, but that state is very local to me. You'd maintain the conversation history as tokens in RAM, and the model would keep a derived "working state" on the GPU-mainly the attention key/value cache built from those tokens."
Large language models convert text into tokens and process them through matrix multiplications and attention layers on GPUs to produce activations and next-token predictions. The model weights remain fixed while activations and an attention key/value (KV) cache evolve per step. Role distinctions such as system, user, assistant, and tool are encoded as special tokens and formatting within the token stream. A locally run model keeps the conversation history as tokens in RAM and a derived working state, mainly the KV cache, on the GPU. Prefix caching stores the KV state for already-seen prefix tokens to avoid recomputation, creating a distributed state synchronization challenge across compute and API boundaries.
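One way to picture that synchronization challenge is a server-side prefix cache keyed by the token prefix: the client resends the conversation, and the server recomputes only the part whose KV state it does not already hold. The names and the cache policy below are illustrative assumptions, not any particular provider's API.

```python
import hashlib
from typing import Dict, List

# prefix hash -> number of tokens whose KV state is held server-side
kv_store: Dict[str, int] = {}

def prefix_key(tokens: List[int]) -> str:
    """Identify a token prefix by a stable hash."""
    return hashlib.sha256(",".join(map(str, tokens)).encode()).hexdigest()

def serve_request(tokens: List[int]) -> int:
    """Return how many tokens actually need recomputation for this request."""
    # Walk down from the full prompt to find the longest cached prefix.
    for cut in range(len(tokens), 0, -1):
        if prefix_key(tokens[:cut]) in kv_store:
            kv_store[prefix_key(tokens)] = len(tokens)  # remember the extended prefix too
            return len(tokens) - cut
    kv_store[prefix_key(tokens)] = len(tokens)
    return len(tokens)

conversation = [101, 7, 42, 55, 9]
print(serve_request(conversation))             # 5: nothing cached yet, full recompute
print(serve_request(conversation + [13, 27]))  # 2: only the new turn is recomputed
```

The state lives on both sides of the API boundary: the client holds the canonical conversation, the server holds derived KV state, and keeping the two consistent is the synchronization problem the title refers to.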