What this tool shows
- Text becomes tokens. The model predicts chunks of text, which may be whole words or pieces of words.
- Next-token prediction is probabilistic. Many continuations are possible at each step.
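- Temperature changes risk. Low temperature concentrates probability on the top choice. High temperature flattens the distribution, so unlikely tokens get picked more often.
- Greedy ≠ sampling. Greedy always takes the top token. Sampling rolls a weighted die, so the same model can produce different text.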
- Confidence is not truth. High probability means "fits the learned pattern," not "is correct or fair."
- Human judgment matters. The model continues a pattern. People decide goal, context, ethics, meaning.
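The temperature point above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the candidate tokens and their scores (`tokens`, `logits`) are invented for the example, and `softmax_with_temperature` is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities. Temperature must be > 0."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate next tokens (assumed values).
tokens = ["light", "energy", "dances", "forgets"]
logits = [2.0, 1.5, 0.2, -0.5]

for temperature in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    row = ", ".join(f"{tok}={p:.3f}" for tok, p in zip(tokens, probs))
    print(f"T={temperature}: {row}")
```

The ranking of the tokens never changes with temperature. Only the gaps between them do, which is why the tallest bar stays the tallest while the shorter bars grow as the slider moves up.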
If you did BookBot — this is BookBot at scale.
In BookBot, you built a bigram table from a children's book: which word follows which, and how often. You rolled a die to pick the next word. Heavily loaded dice meant almost always picking the top word (low temperature). Loosely loaded dice meant picking from the full distribution (high temperature).
This tool runs the same process for every token, across millions of learned patterns, thousands of times per second. The bar chart is the probability table. The temperature slider sets how heavily the die is loaded. The difference is scale and training data, not the mechanism.
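For readers who skipped BookBot, here is a rough sketch of a bigram table and the die roll. The sample sentence and the names `build_bigram_table` and `next_word` are assumptions made for this example, not BookBot's actual materials.

```python
import random
from collections import Counter, defaultdict

def build_bigram_table(text):
    """Count, for each word, which words follow it and how often."""
    words = text.lower().split()
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table

def next_word(table, word, rng, greedy=False):
    """Pick a follower of `word`: the top one, or a weighted die roll."""
    counts = table.get(word)
    if not counts:
        return None  # this word was never followed by anything
    if greedy:
        return counts.most_common(1)[0][0]
    choices = list(counts)
    weights = [counts[w] for w in choices]
    return rng.choices(choices, weights=weights, k=1)[0]

# Stand-in for the children's book (assumed text).
text = "the cat sat on the mat and the cat ran to the door"
table = build_bigram_table(text)
rng = random.Random(0)

print(dict(table["the"]))                        # follower counts for "the"
print(next_word(table, "the", rng, greedy=True))  # always the top follower
print(next_word(table, "the", rng))               # weighted die roll
```

The counts for "the" play the role of the bar chart, and `rng.choices` is the die. A language model replaces the counted table with learned probabilities over tokens, but the pick-the-next-item step has the same shape.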
Try this before clicking Next Token: Look at the current prompt and the generated text so far. What word do you predict will have the tallest bar? Write it down — then click and check.
The key line: A likely continuation is not the same as a true one. The model predicts what fits the learned pattern — not what is accurate, fair, or meaningful. Open the full Confidence Is Not Truth bridge.
Human · Machine · System Reflection
Human: What prompt, mode, and temperature did you choose? Why? What would you change?
Machine: Which token became most likely? Which unlikely token still had a real chance?
System: What assumptions or defaults shaped this continuation? Where did training data show up?
Workshop sequence (10 min):
- Tokens first. Switch to Explore Tokens. Type a simple sentence, then an unusual word. Ask: where did the model draw the boundary? Why might "unhappiness" split differently than "the"?
- Greedy at low temperature. Switch to Generate. Set temperature to 0.1, mode to Greedy. Run Auto. Ask: does this feel predictable? Why?
- Raise the temperature. Reset, move to 1.5, run again. Ask: what changed? Which tokens appeared that wouldn't have at 0.1?
- Greedy vs. Sample. At the same temperature, toggle between Greedy and Sample and reset between runs. Ask: same model, same temperature — why different text?
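To try the Greedy vs. Sample comparison away from the tool, the sketch below repeats both modes many times on a toy model. Everything in it is assumed for illustration: `TOY_MODEL` is a hand-written probability table, not anything the tool exposes, and `generate` is a hypothetical helper.

```python
import random

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "<start>": {"plants": 0.6, "stars": 0.4},
    "plants": {"make": 0.7, "need": 0.3},
    "stars": {"make": 0.5, "need": 0.5},
    "make": {"light": 0.55, "energy": 0.45},
    "need": {"light": 0.5, "energy": 0.5},
}

def generate(mode, rng, steps=3):
    """Produce `steps` tokens by greedy choice or by weighted sampling."""
    token, output = "<start>", []
    for _ in range(steps):
        dist = TOY_MODEL[token]
        if mode == "greedy":
            token = max(dist, key=dist.get)  # always the tallest bar
        else:
            token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        output.append(token)
    return " ".join(output)

rng = random.Random(0)
greedy_runs = {generate("greedy", rng) for _ in range(100)}
sample_runs = {generate("sample", rng) for _ in range(100)}

print("distinct greedy outputs:", len(greedy_runs))
print("distinct sample outputs:", len(sample_runs))
```

Greedy never consults the random number generator, so its set of outputs has exactly one member. Sample mode draws from the same table each time and still lands on several different texts.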
Debrief questions:
- The top token had 38% probability. Does that mean it's 38% likely to be true?
- At temperature 2.0, "dances" and "forgets" became plausible continuations for a science explanation. What does that tell you about what the model learned?
- If you ran this 100 times in Greedy mode at temperature 0.1, would you always get the same text? What about Sample mode?
- The model never "read" the sentence for meaning — it predicted the next token. What would it mean for a model to actually understand what it's continuing?
- Where does human judgment enter the loop in a real writing tool built on this mechanism?