Interactive tokenizer and temperature workspace

Teaching model. This visualizer uses curated examples, not a live LLM. Distributions are hand-authored so the mechanism stays visible, reliable, and discussable in a classroom.
Language models don't read words. They read tokens — subword pieces that may or may not align with what we think of as a "word." Common words are usually one token. Uncommon or compound words get split. Punctuation, spaces, and numbers each become their own tokens. The model never sees text — it sees a sequence of token numbers.
Type or paste any text
Tokens
Why this matters: When a model's "context window" is 128,000 tokens, that's not 128,000 words. English averages about 1.3 tokens per word, so 128K tokens ≈ 100K words. Non-English text, code, and unusual words use more tokens per word — they consume the context window faster.

This is a simplified approximation. Real tokenizers use Byte-Pair Encoding (BPE) trained on billions of text samples. The boundaries here capture the key insight: text → subword pieces → numbers. Hover any token chip to see its ID.

What this tool shows

  • Text becomes tokens. The model predicts chunks, not whole words.
  • Next-token prediction is probabilistic. Many continuations are possible at each step.
  • Temperature changes risk. Low sharpens the top choice. High flattens the distribution.
  • Greedy ≠ sampling. Sampling rolls a weighted die — the same model, different text.
  • Confidence is not truth. High probability means "fits the learned pattern," not "is correct or fair."
  • Human judgment matters. The model continues a pattern. People decide goal, context, ethics, meaning.
If you did BookBot — this is BookBot at scale.

In BookBot, you built a bigram table from a children's book: which word follows which, and how often. You rolled a die to pick the next word. Low-loaded dice meant always picking the top word (low temperature). Loose dice meant picking from the full distribution (high temperature).

This tool runs the same process — for every token, across millions of learned patterns, thousands of times per second. The bar chart is the probability table. The temperature slider is the die. The difference is scale and training data, not the mechanism.

Try this before clicking Next Token: Look at the current prompt and the generated text so far. What word do you predict will have the tallest bar? Write it down — then click and check.

The key line: A likely continuation is not the same as a true one. The model predicts what fits the learned pattern — not what is accurate, fair, or meaningful. Open the full Confidence Is Not Truth bridge.

HumanWhat prompt, mode, and temperature did you choose? Why? What would you change?
MachineWhich token became most likely? Which unlikely token still had a real chance?
SystemWhat assumptions or defaults shaped this continuation? Where did training data show up?
Workshop sequence (10 min):
  1. Tokens first. Switch to Explore Tokens. Type a simple sentence, then an unusual word. Ask: where did the model draw the boundary? Why might "unhappiness" split differently than "the"?
  2. Greedy at low temperature. Switch to Generate. Set temperature to 0.1, mode to Greedy. Run Auto. Ask: does this feel predictable? Why?
  3. Raise the temperature. Reset, move to 1.5, run again. Ask: what changed? Which tokens appeared that wouldn't have at 0.1?
  4. Greedy vs. Sample. At the same temperature, toggle between Greedy and Sample and reset between runs. Ask: same model, same temperature — why different text?

Debrief questions:

  • The top token had 38% probability. Does that mean it's 38% likely to be true?
  • At temperature 2.0, "dances" and "forgets" became plausible continuations for a science explanation. What does that tell you about what the model learned?
  • If you ran this 100 times in Greedy mode at temperature 0.1, would you always get the same text? What about Sample mode?
  • The model never "read" the sentence for meaning — it predicted the next token. What would it mean for a model to actually understand what it's continuing?
  • Where does human judgment enter the loop in a real writing tool built on this mechanism?