AI Catchup Weekly

From Prompt to Prediction: Understanding Prefill, Decode, and the KV Cache in LLMs

April 6, 2026 · 3:19 · Episode 0

Host A: Welcome back to AI Catchup Weekly, I'm your host, and today we're diving into something that sits right at the heart of how large language models actually work — the generation pipeline.

Host B: And I have to say, this one genuinely changed how I think about what's happening under the hood every time I type a prompt and hit enter.

Host A: So let's set the scene. When you send a message to an LLM, the model doesn't just magically spit out a response — it actually operates in two very distinct phases: prefill and decode.

Host B: Right, and the prefill phase is kind of the unsung hero here. Can you walk us through what's actually happening in that first phase?

Host A: Sure. During prefill, the model takes your entire prompt — every single token — and processes them all at once in a single parallel pass. So if your prompt is ten tokens long, all ten tokens are being analyzed simultaneously, not one at a time.

Host B: Which is huge for speed, right? Because imagine if you had a hundred thousand token prompt and had to process each word sequentially — you'd be waiting forever just to get the first word of a response.

Host A: Exactly. And the mechanism doing all that heavy lifting is called scaled dot-product attention. Basically, every token gets to look at every token that comes before it and figure out what's relevant, building up its understanding of the context.
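For anyone who wants to see that concretely, here is a minimal NumPy sketch of causal scaled dot-product attention. The shapes, variable names, and toy prompt length are illustrative assumptions, not taken from any particular model or framework.

```python
# Minimal sketch of causal scaled dot-product attention (NumPy).
# Shapes and names are illustrative, not tied to any particular model.
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q, k, v: arrays of shape (seq_len, d_head)."""
    seq_len, d_head = q.shape
    # Raw similarity between every query position and every key position.
    scores = q @ k.T / np.sqrt(d_head)                      # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax turns scores into attention weights that sum to 1 per row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the value vectors it attends to.
    return weights @ v                                      # (seq_len, d_head)

# In a real layer, q, k, v are separate learned projections of the token activations.
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 64))                           # a 10-token prompt, toy sizes
Wq, Wk, Wv = (rng.standard_normal((64, 64)) for _ in range(3))
context = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
```

Notice that the whole (seq_len × seq_len) weight matrix is computed in one shot, which is exactly the parallelism the prefill phase exploits.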

Host B: I love the example they use to illustrate this — "Today's weather is so…" — because as humans we instinctively know the next word should be an adjective about weather. Something like "nice" or "warm," not "delicious."

Host A: And transformers arrive at that same conclusion through attention weights. Words like "weather" carry more semantic weight than "is" or "so," and the attention mechanism reflects that by assigning higher weights to the meaningful tokens.

Host B: Okay so after prefill, we've got this thing called a context vector. What happens next — how do we actually get from that to the words being generated?

Host A: So the context vector is essentially a compressed summary of your entire prompt, specifically the hidden state at the final position. It gets projected through the output vocabulary matrix to produce a logit score for every token in the vocabulary, and the highest-scoring tokens become the most likely candidates for what comes next. That's your decode phase: generating one token at a time.
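Here is a toy sketch of that final projection step: a last-position hidden state multiplied by a vocabulary matrix to get logits, followed by a greedy pick of the top token. The tiny vocabulary, the name W_vocab, and all the sizes are made up purely for illustration.

```python
# Toy sketch: turning the last token's hidden state into next-token logits.
# The sizes, the names (hidden, W_vocab), and the tiny vocabulary are invented.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 64, 5
vocab = ["nice", "warm", "delicious", "is", "so"]       # illustrative only

hidden = rng.standard_normal(d_model)                   # last position's context vector
W_vocab = rng.standard_normal((d_model, vocab_size))    # output / LM-head matrix

logits = hidden @ W_vocab                               # one score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                    # softmax over the vocabulary

next_token = vocab[int(np.argmax(probs))]               # greedy decoding picks the top score
print(next_token)
```

In a real model the chosen token is fed back in, and the loop repeats until an end-of-sequence token or a length limit is hit.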

Host B: And here's where the KV cache comes in, which is honestly one of those elegant engineering solutions that sounds simple but saves enormous amounts of computation.

Host A: It really does. During decoding, every new token you generate needs to attend to all previous tokens. Without the KV cache, you'd be recomputing the keys and values for every single previous token every single time — which is massively redundant.

Host B: So instead, you just store those computed keys and values, cache them, and look them up as needed. It's like having notes from a meeting rather than re-attending the meeting every time someone asks a follow-up question.

Host A: That's a perfect analogy. And this is why LLMs can generate long responses efficiently at scale — the KV cache is absolutely essential once you're talking about production systems handling thousands of requests.
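To make the caching idea concrete, here is a rough sketch of a single-head decode loop with a KV cache: prefill computes and stores keys and values for the whole prompt once, and each decode step only computes its own query, key, and value before appending to the cache. The projection matrices W_q, W_k, W_v and all the shapes are assumptions for illustration, not any specific framework's API.

```python
# Rough sketch of single-head decoding with a KV cache (NumPy).
# W_q, W_k, W_v and the shapes are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def attend(q, K, V):
    """One query vector against all cached keys/values."""
    scores = (K @ q) / np.sqrt(d_model)          # (cached_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                           # (d_model,)

# Prefill: compute and cache K/V for every prompt position once.
prompt_hidden = rng.standard_normal((10, d_model))   # stand-in for prompt activations
K_cache = prompt_hidden @ W_k
V_cache = prompt_hidden @ W_v

# Decode: each new token computes only its own q/k/v and appends to the cache,
# instead of recomputing keys and values for the whole history at every step.
x = prompt_hidden[-1]
for _ in range(5):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    x = attend(q, K_cache, V_cache)              # feeds the next step (toy loop)
```

Without the cache, every step would recompute keys and values for all earlier positions, so that part of the work would grow quadratically with response length instead of staying roughly constant per step.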

Host B: The deeper you go into how these systems are built, the more you appreciate just how much clever engineering is packed into what feels like a simple chat interface.

Host A: Couldn't agree more. Alright, that's a wrap on today's deep dive into LLM inference mechanics — prefill, decode, and the KV cache.

Host B: If this got your gears turning, stick with us — we'll keep unpacking the fascinating machinery behind modern AI right here on AI Catchup Weekly. See you next time!
