Complete LLM Inference Tutorial: KV Cache and DeepSeek V4’s Caching Revolution

Chain News ABMedia

When you type a sentence into ChatGPT, Claude, or DeepSeek, the model starts replying token by token within a few hundred milliseconds. The process feels simple, but it is one of the most finely engineered parts of modern computing. This article walks through the complete inference pipeline, drawing on an analysis by AI engineer Akshay Pachaar: tokenization, embedding, and attention; the two-stage prefill/decode process; the KV cache; quantization; and why DeepSeek V4 reduces cache size to just 10% of its predecessor's.

Core mental model: an LLM only “guesses the next token,” then repeats

At its core, a large language model does one thing: it predicts the next token. It takes your input token sequence, computes a probability distribution over the next token, samples a token from that distribution, appends it to the end of the input, and then predicts the next one, repeating until the model outputs a stop token or hits the length limit.
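A minimal sketch of that loop, assuming a hypothetical model object that returns next-token logits and a tokenizer with encode/decode methods (this naive version re-runs the whole sequence every step; the KV cache discussed later removes that waste):

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=200, temperature=0.8):
    """Minimal autoregressive loop: predict, sample, append, repeat."""
    ids = tokenizer.encode(prompt)                      # text -> token IDs
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))[0, -1]      # scores for the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()  # sample one token
        if next_id == tokenizer.eos_token_id:           # stop token ends generation
            break
        ids.append(next_id)                             # append and go around again
    return tokenizer.decode(ids)
```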

The key question in the entire inference pipeline isn’t “how does it predict,” but “why is the second token much faster than the first?” The answer leads to the two most important concepts in modern LLM serving: the prefill and decode two-stage process, and KV cache.

Step 1: Tokenization turns text into numbers

Neural networks don’t read text—they read vectors. So your prompt first goes through tokenization, gets split into chunks called tokens, and each token maps to an integer ID. Most modern LLMs use BPE (Byte Pair Encoding): starting from raw characters, repeatedly merging the most frequently co-occurring character pairs, and finally producing a vocabulary of around 50 thousand common tokens.

This step matters more than most people think. Languages that have lower weight in the tokenizer's training data get split into more tokens, which raises inference cost and slows generation. In many English-oriented tokenizers, a single Chinese character (Simplified or Traditional) is often split into 2 to 3 tokens, one of the root reasons inference costs are relatively high for Chinese-language users.
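As a rough illustration, here is a small check using the open-source tiktoken library (one particular BPE tokenizer; exact counts vary by model):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a BPE tokenizer with a ~100K vocabulary

english = "The quick brown fox jumps over the lazy dog."
chinese = "敏捷的棕色狐狸跳過了懶狗。"

print(len(english), "chars ->", len(enc.encode(english)), "tokens")
print(len(chinese), "chars ->", len(enc.encode(chinese)), "tokens")
# Languages underrepresented in the tokenizer's training data typically need
# more tokens per character, so the same sentence costs more to serve.
```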

Step 2: Embedding turns integers into vectors, then injects position information

Each token’s integer ID looks up a huge “embedding table.” If the model vocabulary is 50K and the hidden dimension is 4096, the shape of this table is [50k, 4096]. Each token retrieves one row vector, which is its 4096-dimensional representation.

These vectors aren’t random. During training, the model pushes semantically similar tokens into nearby regions of the embedding space: king and queen are near each other along some direction, python (a language) and javascript are near each other along another, and python and snake are near each other along a third.

Position information is injected here too, because the attention mechanism itself doesn’t inherently know which token comes first or last. Most current mainstream models use RoPE (Rotary Position Embedding): rotating vectors according to token position, embedding ordering information implicitly in the vectors.
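A simplified sketch of both ideas, using the article's 50K x 4096 numbers (note that real models apply RoPE to the query and key vectors inside each attention layer rather than to the raw embeddings, and implementations differ in how they pair dimensions):

```python
import torch

vocab_size, d_model = 50_000, 4096
embedding = torch.nn.Embedding(vocab_size, d_model)    # the [50K, 4096] lookup table

token_ids = torch.tensor([101, 2057, 4099])            # hypothetical IDs from the tokenizer
x = embedding(token_ids)                               # -> [3, 4096], one row per token

def rope(x, base=10000.0):
    """Rotate pairs of dimensions by a position-dependent angle so that token
    order is encoded implicitly in the vectors (simplified RoPE variant)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (torch.arange(half) / half))         # rotation speed per pair
    angles = torch.arange(seq_len)[:, None] * freqs[None, :]    # [seq, dim/2]
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

x = rope(x)   # in real models, RoPE is applied to Q and K inside each attention layer
```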

Step 3: Self-Attention is the core of Transformer

The vector sequence then enters 32 transformer layers (or more). Each layer does two things: uses self-attention to mix information across tokens, and then uses a feed-forward network to mix information within each token.

Self-attention works like this: each token passes through three learned weight matrices Wq, Wk, and Wv to produce three vectors: query (Q), key (K), and value (V). Each token takes the dot product of its own query with the keys of all other tokens; those scores are scaled and passed through a softmax to become weights for how much information this token should pull from each position, and the weights are then used to form a weighted mix of the value vectors.

This is the magic of attention: each token decides by itself which positions in the context to look at, pulling useful information into its vector. With 32 layers stacked, the model can track references across thousands of tokens. The feed-forward network that follows attention carries most of the model’s “knowledge”—attention transports information, while feed-forward processes that information.
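A single-head sketch of the attention math described above (real models use many heads, a causal mask so tokens cannot look ahead, and RoPE applied to Q and K):

```python
import math
import torch

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a [seq, d] sequence (no masking, for clarity)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv              # query/key/value vectors per token
    scores = Q @ K.T / math.sqrt(K.shape[-1])     # how relevant is each token to each other
    weights = torch.softmax(scores, dim=-1)       # normalize scores into attention weights
    return weights @ V                            # weighted mix of value vectors

seq, d = 8, 64
x = torch.randn(seq, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)               # [8, 64]
```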

Prefill vs Decode: same GPU, two completely different bottlenecks

This is the most important split in this article. Generating a 200-word response is actually two tasks with completely different properties running on the same GPU.

Prefill stage: when you submit the prompt, the model must first run all input tokens through the network once before it can generate the first token. This step processes all input tokens in parallel: Q, K, and V for every token are computed simultaneously, and attention becomes one large matrix-matrix multiplication. GPUs are built for exactly this kind of workload; the compute units (Tensor Cores) stay busy, and the bottleneck is compute. The latency metric for this stage is TTFT (Time to First Token).

Decode stage: after the first token comes out, the model switches modes. When producing the 51st token, it only needs to compute Q, K, and V for the new token; the K and V for the previous 50 tokens have already been computed and don't need to be redone. The problem is that even though each step's computation is tiny, the GPU still has to load all of the model weights and the entire KV history from GPU memory, perform that small computation, and write the result back. The bottleneck flips from compute to memory bandwidth. The latency metric here is ITL (Inter-Token Latency), which determines whether the model feels like it is "typing" quickly or slowly.

So prefill is compute-bound, decode is memory-bound—same model, same hardware, but completely different performance characteristics.
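A back-of-the-envelope calculation makes the decode-side ceiling concrete. The numbers below are illustrative assumptions (a 13B model in FP16, roughly 2 TB/s of memory bandwidth), not measurements:

```python
# Decode is memory-bound: every decode step must stream the model weights
# (plus the KV cache) out of GPU memory, so bandwidth sets the token rate.
params = 13e9
bytes_per_param = 2                       # FP16
weight_bytes = params * bytes_per_param   # ~26 GB read per decode step

bandwidth = 2e12                          # ~2 TB/s, illustrative GPU figure
max_tokens_per_s = bandwidth / weight_bytes
print(f"decode upper bound: ~{max_tokens_per_s:.0f} tokens/s per sequence")
# ~77 tokens/s: the ceiling comes from moving bytes, not from arithmetic.
# Prefill avoids this because one pass over the weights serves all prompt
# tokens at once, so the same bytes are amortized over many tokens.
```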

KV Cache: the key optimization that makes LLM inference feasible

The decode stage avoids re-computing past tokens’ K and V, and that’s exactly what KV cache is for. Each transformer layer maintains two tensors that store all historical tokens’ K and V. When new tokens are computed, their K and V are appended; during attention, the model directly reads the entire history.
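A minimal single-head, single-layer sketch of that bookkeeping (assumed shapes: one row per token, projection weights as plain matrices):

```python
import math
import torch

class LayerKVCache:
    """Per-layer cache: keep every past token's K and V, append the new ones."""
    def __init__(self):
        self.K = None   # [tokens_so_far, d]
        self.V = None

    def append(self, k_new, v_new):
        self.K = k_new if self.K is None else torch.cat([self.K, k_new], dim=0)
        self.V = v_new if self.V is None else torch.cat([self.V, v_new], dim=0)

def decode_step_attention(x_new, Wq, Wk, Wv, cache):
    """Attention for ONE new token: only its Q/K/V are computed; history comes from cache."""
    q = x_new @ Wq                       # [1, d]
    k, v = x_new @ Wk, x_new @ Wv        # [1, d] each
    cache.append(k, v)                   # grow the cache by one token
    scores = q @ cache.K.T / math.sqrt(cache.K.shape[-1])
    weights = torch.softmax(scores, dim=-1)
    return weights @ cache.V             # [1, d]
```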

Without KV cache, generating a 1,000-token response would re-compute the entire growing sequence at every step, and complexity would explode quadratically. With KV cache, long generation can be accelerated by more than 5x. But the cost is: the cache lives in GPU VRAM; each additional generated token increases the cache by another chunk. For a 13B model, each token takes about 1MB; a 4K context burns about 4GB of VRAM just to store this cache.
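The per-token figure is easy to sanity-check. Assuming typical 13B-class dimensions (40 layers, hidden size 5120, FP16), the arithmetic lands in the same ballpark as the numbers above:

```python
# Rough KV-cache size for a 13B-class model (assumed dimensions, FP16 = 2 bytes).
layers, hidden, bytes_per_val = 40, 5120, 2
per_token = 2 * layers * hidden * bytes_per_val        # 2 = one K and one V per layer
print(per_token / 1e6, "MB per token")                 # ~0.8 MB, close to the ~1 MB figure
print(4096 * per_token / 1e9, "GB for a 4K context")   # ~3.4 GB of VRAM just for the cache
```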

This is the real reason “long context is slow and expensive”—it’s not that the model “can’t think ahead,” but that the cache eats up memory, reducing the number of concurrent users a single GPU can serve. Common optimization methods include: quantizing the cache into INT8 or INT4, using a sliding window to discard too-old tokens, using grouped-query attention (GQA) so multiple attention heads share K and V, or using paged attention (like vLLM’s) to manage cache via a paged structure (similar to how an operating system manages memory).
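As one example of those mitigations, here is what GQA's saving looks like in tensor shapes (hypothetical head counts; the point is that only the smaller K and V tensors need to be cached):

```python
import torch

# Grouped-query attention (GQA) sketch: 32 query heads share 8 K/V heads,
# so the cache stores only 8 heads' worth of K and V (4x smaller here).
n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 1024
group = n_q_heads // n_kv_heads

q = torch.randn(n_q_heads, seq, head_dim)
k = torch.randn(n_kv_heads, seq, head_dim)      # this (and V) is what gets cached
k_expanded = k.repeat_interleave(group, dim=0)  # each K/V head serves 4 query heads
scores = q @ k_expanded.transpose(-1, -2)       # [32, seq, seq], same math as usual
```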

DeepSeek V4’s cache breakthrough: cut to 10% under a 1M context

Quantization and paging treat the KV cache as a given cost to be compressed or managed after the fact. The DeepSeek V4 series, previewed at the end of 2025, takes a more aggressive route: it redesigns the attention mechanism itself so that the cache is small from the start.

V4 uses a hybrid mechanism that combines two compressed attention variants, one sparse and one dense, both operating on highly compressed KV streams. At million-token contexts, V4-Pro reportedly keeps the KV cache at only about 10% of its predecessor's size, with per-token compute cost at about 27%. The significance isn't just that DeepSeek got cheaper again. The KV cache has become a bottleneck across the entire LLM field, and when the attention mechanism itself is redesigned to shrink it, the binding constraint for the whole field shifts.

For readers in Taiwan, the more practical takeaway is this: DeepSeek V4-Flash is already available on Ollama Cloud and on U.S. hosting (see the ABMedia 4/24 report), and Claude Code and OpenClaw can connect with one click, so you can test the advantages of the new-generation attention on long-context workloads without building your own setup.

Quantization: trade precision for speed and VRAM

Training needs high precision; inference doesn't. Most production deployments switch from FP32 to FP16 or BF16, immediately halving memory use and roughly doubling throughput. More aggressive approaches quantize weights to INT8, or even INT4.

Some quick arithmetic: a 7B-parameter model needs 28GB in FP32, 14GB in FP16, 7GB in INT8, and only 3.5GB in INT4. That's why an ordinary laptop GPU can run a 7B model. Methods like GPTQ and AWQ choose scaling factors per channel to minimize the quality loss from this lossy compression; well-designed INT4 often lands within 1 percentage point of the original model on most benchmarks.
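A simplified per-channel INT8 scheme shows the basic idea (symmetric rounding with one scale per output channel; GPTQ and AWQ add much smarter calibration on top of this):

```python
import torch

def quantize_int8_per_channel(W):
    """Symmetric per-output-channel INT8 quantization (simplified sketch)."""
    scale = W.abs().amax(dim=1, keepdim=True) / 127.0        # one scale per output row
    W_q = torch.clamp((W / scale).round(), -127, 127).to(torch.int8)
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.float() * scale

W = torch.randn(4096, 4096)
W_q, scale = quantize_int8_per_channel(W)
err = (dequantize(W_q, scale) - W).abs().mean()
print(f"stored at 1/4 the size of FP32, mean abs error {err:.4f}")
```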

Putting all the steps together: the complete journey of one prompt

Putting everything above in sequence, the full inference path is: (1) Tokenize—turn text into integer IDs. (2) Embed—turn IDs into vectors and inject position information. (3) Prefill—run all layers on all input tokens in parallel; it’s compute-bound, KV cache is created, and the first output token is generated. (4) Decode loop—each time, project only the new token’s Q, do attention over K and V in cache, run the feed-forward network, sample an output token, write the new K and V back into cache; it’s memory-bound. (5) Detokenize—convert token IDs back into characters and stream the output to the screen.

Serving frameworks like vLLM, TensorRT-LLM, and Text Generation Inference add layers around this loop: continuous batching (interleaving tokens from different users within the same GPU step), speculative decoding (a small model drafts, the large model verifies), and fine-grained memory management—this is how a single GPU can serve dozens of users.
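As a usage sketch, serving through vLLM looks roughly like this (assuming a recent vLLM release and a model you have access to; the model name below is just an example):

```python
# vLLM handles continuous batching and paged KV-cache management
# behind this simple offline-inference API.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # example model, swap in your own
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

outputs = llm.generate(["Explain what a KV cache is in two sentences."], params)
print(outputs[0].outputs[0].text)
```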

Practical takeaway for developers: should you care about TTFT or ITL?

Once you understand the end-to-end inference pipeline, a few practical judgments naturally follow:

Long prompts amplify TTFT; long outputs amplify ITL. These pressures come from different places, so don't pour optimization effort into the wrong metric. Context isn't free: doubling the context doesn't just double the computation, it also shrinks how many concurrent requests fit on the GPU, because the KV cache eats into memory. Quantization is currently the highest-leverage knob: moving from FP16 to INT8 often cuts latency roughly in half with very small quality loss. GPU utilization is often a misleading indicator: prefill may drive it to near 100% while decode sits at 30%, and the fix isn't more compute but faster memory or a smaller cache.

The Transformer architecture attracts the most attention, but inference performance actually lives in “boring details”: memory configuration, cache management, and bit width. When someone says “this model is slow,” the next question shouldn’t be “swap GPUs,” but “is it slow at ‘starting’ or slow at ‘streaming’?” The answer determines the entire optimization path.

This article, Complete LLM Inference Tutorial: KV Cache and DeepSeek V4's Caching Revolution, first appeared on Chain News ABMedia.
