LLM Chat Sequence Diagram

This diagram illustrates the interactions between a user and a Large Language Model (LLM) in a typical chat application.

Diagram: LLM Chat Interaction Sequence (participants: User, Browser, Middleware Server, LLM)

  [01] Enter prompt
  [02] HTTP POST /v1/chat/completions with { ‘messages’: [ { ‘role’: ‘user’, ‘content’: ‘##user-provided prompt##’ } ], ‘model’: ‘my-llama-0.1’ }
  [03] Forward prompt
  Cold start (load LLM model):
  [04] Load and initialize model “my-llama-0.1”
  Prefill Phase (Prompt Processing):
  [05] Tokenize input text
  [06] Perform token embedding lookup
  [07] Populate QKV cache for all input tokens
  Decoding Phase (First Token):
  [08] Process new state via Transformer blocks
  [09] Sample next token from output probabilities
  [10] De-tokenize the token ID to text
  [11] Start generating response from first token
  [12] Establish SSE connection: send HTTP response header “Content-Type: text/event-stream”
  [13] Display text
  Token Generation Loop, Decoding Phase (More Tokens):
  [14] Update QKV cache with previous token’s state
  [15] Process new state via Transformer blocks
  [16] Sample next token from output probabilities
  [17] De-tokenize the token ID to text
  [18] Send token as text
  [19] Push new text data via SSE stream
  [20] Display additional text
  [21] Send new LLM output via SSE
  [22] Display new LLM output
  [23] Generate special End-Of-Sequence (EOS) token
  [24] Response complete
  [25] Send completion event
  [26] Display complete LLM output
  If the conversation continues:
  [27] Enter next prompt
  [28] HTTP POST /v1/chat/completions with complete chat session history; the process repeats from [02]
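
For illustration, the request in step [02] can be issued like this. This is a minimal sketch, assuming an OpenAI-compatible middleware endpoint at http://localhost:8000 and the model name “my-llama-0.1” from the diagram; the “stream” flag is an assumption typical of such APIs and is not shown in the diagram.

```python
import requests  # any HTTP client; the browser would send the same request via fetch/XHR

# Request body as shown in step [02].
payload = {
    "model": "my-llama-0.1",  # model name taken from the diagram
    "messages": [
        {"role": "user", "content": "##user-provided prompt##"},
    ],
    "stream": True,  # assumption: ask an OpenAI-compatible server to stream tokens via SSE
}

# Assumed endpoint of the middleware server.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json=payload,
    stream=True,  # keep the HTTP connection open so tokens can arrive incrementally
)
```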

Components

  • User: The person interacting with the chat interface.
  • Browser: The application running in the user’s web browser.
  • Middleware Server: The backend server that handles user prompts and manages communication with the LLM.
  • LLM: The Large Language Model that processes queries and generates responses.

Sequence

  1. Initialization: The user opens the chat interface, and the browser establishes a connection with the middleware server.

  2. Chat Interaction: The user enters a prompt, which is sent to the middleware server via an HTTP POST request. The server forwards this prompt to the LLM for processing.

  3. Incremental Response with SSE (Server-Sent Events): With the first generated token, the server opens an SSE response to the browser. As the LLM generates further tokens, they are streamed to the browser in real time, and the browser displays them incrementally, creating the effect of the LLM “typing” its response. Once the response is complete, a completion event is sent.

  4. Follow-up: The user can continue the conversation with follow-up prompts, which include the complete conversation history, because the LLM itself keeps no state between chat requests (see the sketch after this list).
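
Because the LLM is stateless, each follow-up request (step [28]) must carry the complete conversation so far. A minimal sketch, again assuming an OpenAI-compatible middleware endpoint and a non-streaming, OpenAI-style response shape:

```python
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed middleware endpoint

# The client keeps the full chat history; the LLM retains nothing between requests.
history = [
    {"role": "user", "content": "##first prompt##"},
    {"role": "assistant", "content": "##first LLM response##"},
]

def ask(prompt: str) -> str:
    """Append the next user prompt, resend the whole history, and store the answer."""
    history.append({"role": "user", "content": prompt})
    resp = requests.post(API_URL, json={"model": "my-llama-0.1", "messages": history})
    answer = resp.json()["choices"][0]["message"]["content"]  # OpenAI-style shape (assumed)
    history.append({"role": "assistant", "content": answer})
    return answer
```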

Technical Implementation

The diagram highlights the use of Server-Sent Events (SSE) for streaming the LLM’s response. This approach offers several advantages:

  • Real-time feedback: Users see the response as it’s being generated
  • Improved user experience: No waiting for the entire response to be completed
  • Efficient resource usage: The connection remains open only as long as needed
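
To make this concrete, here is a minimal sketch of the middleware side, using Flask to stream SSE events. The token source generate_tokens() is a placeholder rather than a real LLM call, and the “[DONE]” completion marker is an assumption borrowed from OpenAI-style streaming APIs.

```python
from flask import Flask, Response, request

app = Flask(__name__)

def generate_tokens(messages):
    """Placeholder: a real server would stream tokens from the LLM here."""
    for token in ["Hello", ",", " ", "world", "!"]:
        yield token

@app.route("/v1/chat/completions", methods=["POST"])
def chat():
    payload = request.get_json(silent=True) or {}
    messages = payload.get("messages", [])

    def event_stream():
        # Each generated token is pushed to the browser as one SSE "data:" event.
        for token in generate_tokens(messages):
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"  # completion event sent after the EOS token

    # The "text/event-stream" content type establishes the SSE connection.
    return Response(event_stream(), mimetype="text/event-stream")
```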

What the LLM Does to Generate the Response

If the model is not yet ready when the first request arrives (“cold start”), it has to be loaded and initialized first. This can take a very long time, which is why models are usually loaded upfront.
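
A common way to avoid this cold start is to load the model once when the server process starts, before any request arrives. A minimal sketch with the Hugging Face transformers library; “my-llama-0.1” is the hypothetical model name from the diagram:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "my-llama-0.1"  # hypothetical model id taken from the diagram

# Loaded once at process start-up, so the first chat request does not pay
# the cold-start cost of step [04].
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()
```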

In any case, prompt processing begins with the prefill phase, in which the input text is split into a sequence of tokens. For each token in the prompt, its embedding is looked up and the so-called QKV cache is populated. This cache is what allows the model to establish the semantic relationships within the prompt (“Attention”).
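
As one possible illustration with the Hugging Face transformers API (the model name is again hypothetical), the prefill phase corresponds to a single forward pass over all prompt tokens that fills the cache:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-llama-0.1")   # hypothetical model id
model = AutoModelForCausalLM.from_pretrained("my-llama-0.1")
model.eval()

prompt = "##user-provided prompt##"

# [05] Tokenize the input text into a sequence of token IDs.
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # [06]/[07] The forward pass looks up the token embeddings and, with
    # use_cache=True, populates the key/value cache for all prompt tokens at once.
    outputs = model(**inputs, use_cache=True)

past_key_values = outputs.past_key_values      # the populated cache (the article's "QKV cache")
first_token_logits = outputs.logits[:, -1, :]  # scores used to pick the first output token
```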

In the decoding phase, the prepared input is processed through the transformer blocks. At the end, this yields a probability for every token in the vocabulary, indicating how well it fits as the next token. The LLM samples the next token from this probability distribution, converts it back into text, and emits it.

To generate another token, the previous token must first be added to the QKV cache. Then the decoding phase runs again. These two steps repeat, and the output grows by one token on each iteration.

The output ends when the LLM selects a special token, usually the “End-of-Sequence” (EOS) token. Generation is then complete, and the LLM releases all data structures for the next prompt request.
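
Putting the decoding steps together, the whole generation loop might look like the following minimal sketch (again using transformers purely as an illustration; production servers use optimized inference engines, and the model name is hypothetical):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-llama-0.1")   # hypothetical model id
model = AutoModelForCausalLM.from_pretrained("my-llama-0.1")
model.eval()

inputs = tokenizer("##user-provided prompt##", return_tensors="pt")

with torch.no_grad():
    # Prefill: one forward pass over the whole prompt populates the cache.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values

    generated = []
    for _ in range(256):  # safety limit on the output length
        # Turn the raw scores into a probability distribution and sample the next token.
        probs = torch.softmax(out.logits[:, -1, :], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)

        # Stop when the model selects the End-of-Sequence token.
        if next_id.item() == tokenizer.eos_token_id:
            break

        # De-tokenize the token ID to text; a chat server would push this piece
        # to the browser via SSE at this point.
        generated.append(tokenizer.decode(next_id[0]))

        # Feed only the new token; the cache (the article's "QKV cache") carries
        # the state of all earlier tokens and grows by one entry per step.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values

print("".join(generated))
```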
