LLM Chat Sequence Diagram

This diagram illustrates the interactions between a user and a Large Language Model (LLM) in a typical chat application.

Diagram: LLM Chat Interaction Sequence (participants: User, Browser, Middleware Server, LLM)

  [01] Enter prompt
  [02] HTTP POST /v1/chat/completions with { ‘messages’: [ { ‘role’: ‘user’, ‘content’: ‘##user-provided prompt##’ } ], ‘model’: ‘my-llama-0.1’ }
  [03] Forward prompt
  Cold start (load LLM model):
  [04] Load and initialize model “my-llama-0.1”
  Prefill Phase (Prompt Processing):
  [05] Tokenize input text
  [06] Perform token embedding lookup
  [07] Populate QKV cache for all input tokens
  Decoding Phase (First Token):
  [08] Process new state via Transformer blocks
  [09] Sample next token from output probabilities
  [10] De-tokenize the token ID to text
  [11] Start generating response from first token
  [12] Establish SSE connection: send HTTP response header “Content-Type: text/event-stream”
  [13] Display text
  Token Generation Loop, Decoding Phase (More Tokens):
  [14] Update QKV cache with previous token’s state
  [15] Process new state via Transformer blocks
  [16] Sample next token from output probabilities
  [17] De-tokenize the token ID to text
  [18] Send token as text
  [19] Push new text data via SSE stream
  [20] Display additional text
  [21] Send new LLM output via SSE
  [22] Display new LLM output
  [23] Generate special End-Of-Sequence (EOS) token
  [24] Response complete
  [25] Send completion event
  [26] Display complete LLM output
  If the conversation continues:
  [27] Enter next prompt
  [28] HTTP POST /v1/chat/completions with complete chat session history; the process repeats from [02]
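
For illustration, the request in step [02] can be issued like this. This is a minimal sketch, assuming an OpenAI-compatible middleware endpoint at http://localhost:8000 and the model name “my-llama-0.1” from the diagram; the “stream” flag is an assumption typical of such APIs and is not shown in the diagram.

```python
import requests  # any HTTP client; the browser would send the same request via fetch/XHR

# Request body as shown in step [02].
payload = {
    "model": "my-llama-0.1",  # model name taken from the diagram
    "messages": [
        {"role": "user", "content": "##user-provided prompt##"},
    ],
    "stream": True,  # assumption: ask an OpenAI-compatible server to stream tokens via SSE
}

# Assumed endpoint of the middleware server.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json=payload,
    stream=True,  # keep the HTTP connection open so tokens can arrive incrementally
)
```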

Components

  • User: The person interacting with the chat interface.
  • Browser: The application running in the user’s web browser.
  • Middleware Server: The backend server that handles user prompts and manages communication with the LLM.
  • LLM: The Large Language Model that processes queries and generates responses.

Sequence

  1. Initialization: The user opens the chat interface, and the browser establishes a connection with the middleware server.

  2. Chat Interaction: The user enters a prompt, which is sent to the middleware server via an HTTP POST request. The server forwards this prompt to the LLM for processing.

  3. Incremental Response with SSE (Server-Sent Events): With the first generated token, the server opens an SSE response to the browser. As the LLM generates further tokens, they are streamed to the browser in real time, and the browser displays them incrementally, creating the effect of the LLM “typing” its response. Once the response is complete, a completion event is sent.

  4. Follow-up: The user can continue the conversation with follow-up prompts, which include the complete conversation history, because the LLM itself keeps no state between chat requests (see the sketch after this list).
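
Because the LLM is stateless, each follow-up request (step [28]) must carry the complete conversation so far. A minimal sketch, again assuming an OpenAI-compatible middleware endpoint and a non-streaming, OpenAI-style response shape:

```python
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed middleware endpoint

# The client keeps the full chat history; the LLM retains nothing between requests.
history = [
    {"role": "user", "content": "##first prompt##"},
    {"role": "assistant", "content": "##first LLM response##"},
]

def ask(prompt: str) -> str:
    """Append the next user prompt, resend the whole history, and store the answer."""
    history.append({"role": "user", "content": prompt})
    resp = requests.post(API_URL, json={"model": "my-llama-0.1", "messages": history})
    answer = resp.json()["choices"][0]["message"]["content"]  # OpenAI-style shape (assumed)
    history.append({"role": "assistant", "content": answer})
    return answer
```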

Technical Implementation

The diagram highlights the use of Server-Sent Events (SSE) for streaming the LLM’s response. This approach offers several advantages:

  • Real-time feedback: Users see the response as it’s being generated
  • Improved user experience: No waiting for the entire response to be completed
  • Efficient resource usage: The connection remains open only as long as needed
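
To make this concrete, here is a minimal sketch of the middleware side, using Flask to stream SSE events. The token source generate_tokens() is a placeholder rather than a real LLM call, and the “[DONE]” completion marker is an assumption borrowed from OpenAI-style streaming APIs.

```python
from flask import Flask, Response, request

app = Flask(__name__)

def generate_tokens(messages):
    """Placeholder: a real server would stream tokens from the LLM here."""
    for token in ["Hello", ",", " ", "world", "!"]:
        yield token

@app.route("/v1/chat/completions", methods=["POST"])
def chat():
    payload = request.get_json(silent=True) or {}
    messages = payload.get("messages", [])

    def event_stream():
        # Each generated token is pushed to the browser as one SSE "data:" event.
        for token in generate_tokens(messages):
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"  # completion event sent after the EOS token

    # The "text/event-stream" content type establishes the SSE connection.
    return Response(event_stream(), mimetype="text/event-stream")
```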

What the LLM Does to Generate the Response

If the model is not yet ready when the first request arrives (“cold start”), it has to be loaded and initialized first. This can take a very long time, which is why models are usually loaded upfront.
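
A common way to avoid this cold start is to load the model once when the server process starts, before any request arrives. A minimal sketch with the Hugging Face transformers library; “my-llama-0.1” is the hypothetical model name from the diagram:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "my-llama-0.1"  # hypothetical model id taken from the diagram

# Loaded once at process start-up, so the first chat request does not pay
# the cold-start cost of step [04].
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()
```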

In any case, prompt processing begins with the prefill phase, in which the input text is split into a sequence of tokens. For each token in the prompt, its embedding is looked up and the so-called QKV cache is populated. This cache is what allows the model to establish the semantic relationships within the prompt (“Attention”).
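
As one possible illustration with the Hugging Face transformers API (the model name is again hypothetical), the prefill phase corresponds to a single forward pass over all prompt tokens that fills the cache:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-llama-0.1")   # hypothetical model id
model = AutoModelForCausalLM.from_pretrained("my-llama-0.1")
model.eval()

prompt = "##user-provided prompt##"

# [05] Tokenize the input text into a sequence of token IDs.
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # [06]/[07] The forward pass looks up the token embeddings and, with
    # use_cache=True, populates the key/value cache for all prompt tokens at once.
    outputs = model(**inputs, use_cache=True)

past_key_values = outputs.past_key_values      # the populated cache (the article's "QKV cache")
first_token_logits = outputs.logits[:, -1, :]  # scores used to pick the first output token
```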

In the decoding phase, the prepared input is processed through the transformer blocks. At the end, this yields a probability for every token in the vocabulary, indicating how well it fits as the next token. The LLM samples the next token from this probability distribution, converts it back into text, and emits it.

To generate another token, the previous token must first be added to the QKV cache. Then the decoding phase runs again. These two steps repeat, and the output grows by one token on each iteration.

The output ends when the LLM selects a special token, usually the “End-of-Sequence” (EOS) token. Generation is then complete, and the LLM releases all data structures for the next prompt request.
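
Putting the decoding steps together, the whole generation loop might look like the following minimal sketch (again using transformers purely as an illustration; production servers use optimized inference engines, and the model name is hypothetical):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-llama-0.1")   # hypothetical model id
model = AutoModelForCausalLM.from_pretrained("my-llama-0.1")
model.eval()

inputs = tokenizer("##user-provided prompt##", return_tensors="pt")

with torch.no_grad():
    # Prefill: one forward pass over the whole prompt populates the cache.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values

    generated = []
    for _ in range(256):  # safety limit on the output length
        # Turn the raw scores into a probability distribution and sample the next token.
        probs = torch.softmax(out.logits[:, -1, :], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)

        # Stop when the model selects the End-of-Sequence token.
        if next_id.item() == tokenizer.eos_token_id:
            break

        # De-tokenize the token ID to text; a chat server would push this piece
        # to the browser via SSE at this point.
        generated.append(tokenizer.decode(next_id[0]))

        # Feed only the new token; the cache (the article's "QKV cache") carries
        # the state of all earlier tokens and grows by one entry per step.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values

print("".join(generated))
```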
