Because forcing an $80,000 GPU cluster to predict the word "the" one agonizing token at a time is frankly embarrassing. It's time to stop paying PhDs to do data entry.
Let's just get this out of the way: LLM text generation isn't bound by how fast your GPU can do math. It's bound by how fast it can read its own weights out of memory.
Standard Large Language Models generate text like a bureaucrat typing with two fingers. Your massively parallel GPU spends the vast majority of each decoding step twiddling its thumbs, waiting for VRAM to stream gigabytes of weights just to spit out the next comma. You bought a Ferrari, but you're driving it exclusively in a school zone.
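Here's the back-of-envelope version of that bottleneck. Decoding one token means streaming essentially every weight from VRAM once, so memory bandwidth, not FLOPs, sets the ceiling. The numbers below are illustrative (a hypothetical 70B fp16 model on ~2 TB/s of bandwidth), not a benchmark:

```python
# Rough ceiling on autoregressive decoding speed: you cannot emit tokens
# faster than you can read the whole model from VRAM, once per token.
def max_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s):
    model_gb = params_billion * bytes_per_param  # total weight bytes to stream
    return bandwidth_gb_s / model_gb

# Hypothetical: 70B params, fp16 (2 bytes each), ~2 TB/s VRAM bandwidth.
ceiling = max_tokens_per_second(70, 2, 2000)
print(round(ceiling, 1))  # ~14.3 tokens/s, regardless of how many TFLOPs you own
```

That ceiling is per *sequential* pass, which is exactly why batching several tokens into one pass is the escape hatch.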
Look at the video below. This isn't a simulation; this is the raw architectural difference between a bottlenecked system and a liberated one.
The top animation. The massive "boss" model computes exactly one token, passes it back, waits for the massive weights to reload from memory, and repeats. It's excruciating.
The bottom animation. Watch the bursts. A tiny, cheap "draft" model throws 5-8 guesses at the wall. The big boss model verifies all of them in parallel during a single memory load.
Parallel verification fixes the waiting game. Checking 5 drafted tokens takes almost the exact same wall-clock time as generating 1 token from scratch, because the heavy memory load only has to happen once.
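The whole trick fits in a dozen lines. Below is a minimal sketch of the greedy draft-and-verify step; `draft_next` and `target_argmax` are hypothetical stand-ins for real model calls, and a real system gets all the verification scores from one batched forward pass rather than a Python loop:

```python
# Minimal sketch of one speculative decoding step (greedy variant).
def speculative_step(prefix, draft_next, target_argmax, k=5):
    # 1. The cheap draft model proposes k tokens autoregressively.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    # 2. The big model scores all k+1 positions. In a real implementation
    #    this is ONE parallel forward pass -- one weight load instead of k.
    targets = [target_argmax(list(prefix) + drafted[:i]) for i in range(k + 1)]
    # 3. Accept the longest matching run, then take the boss's own token.
    accepted = []
    for i, t in enumerate(drafted):
        if t != targets[i]:
            break
        accepted.append(t)
    accepted.append(targets[len(accepted)])  # always emit >= 1 verified token
    return accepted

# Toy demo: deterministic "models" that agree on the first two drafts.
target = lambda ctx: len(ctx) % 10
draft = lambda ctx: len(ctx) % 10 if len(ctx) < 5 else 0
print(speculative_step([1, 2, 3], draft, target, k=4))  # [3, 4, 5]
```

Note the worst case: even if every guess is rejected, step 3 still emits one correct token, so you never go slower than one verified token per big-model pass (minus the draft model's overhead).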
"If the tiny draft model guesses right, you just got a 4x speedup for free. Math is beautiful."
There's no free lunch in enterprise architecture. Before you go demanding your IT team implement this yesterday, you need to understand the structural tradeoffs.
This isn't lossy compression. If the intern guesses wrong, the boss throws it out and computes the correct token anyway. The final text is mathematically identical to running the big model alone. You don't lose quality.
To pull this off, you have to host a second, smaller "Draft Model" in memory alongside your big one. Say goodbye to that extra 2-4GB of VRAM you were hoarding. You trade memory capacity for speed.
Your intern needs to know the boss. If you pair a coding intern with a poetry boss, every guess gets rejected, and you actually lose performance. They must be trained on similar data distributions.
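You can put a number on how badly a mismatched intern hurts. If each drafted token is accepted independently with probability `alpha`, the expected tokens emitted per expensive big-model pass is a geometric series (a standard simplification that ignores the draft model's own cost):

```python
# Expected tokens emitted per big-model pass, assuming each draft token
# is accepted independently with probability alpha.
def expected_tokens_per_pass(alpha, k):
    # 1 + alpha + alpha^2 + ... + alpha^k  (the leading 1 is the boss's
    # own guaranteed token at the end of every step)
    return (1 - alpha ** (k + 1)) / (1 - alpha) if alpha < 1 else k + 1

print(round(expected_tokens_per_pass(0.8, 5), 2))  # well-matched intern: ~3.69
print(round(expected_tokens_per_pass(0.1, 5), 2))  # mismatched intern: ~1.11
```

At 10% acceptance you're paying for a second model, extra VRAM, and draft-model latency to go barely faster than plain decoding, which in practice means slower.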
"If you enjoy waiting for your AI to generate text, by all means, stick to standard generation."
But if you actually value the time of the people waiting on these prompts, get a draft model.