Next-Token Prediction Game

§ A · What it makes visible

what the screen literally shows

Fig. 01

Human vs. model distribution

When the room’s next-word guesses are pasted in, their distribution shows how context, genre, and expectation shape human prediction. The model’s probabilities then show where the two diverge, which opens the question of why.

Fig. 02

Genre, context, register

The same sentence stem in a news article, a recipe, or a text message produces different guesses. Register shapes what feels obvious — and the model’s training data carries the same shaping forces.

Fig. 03

Prediction as probability

Every next-token prediction is a probability distribution over candidate tokens, not a single answer. The game makes that distribution visible and discussable before the model commits to one token from it, whether by taking the top candidate or by sampling.
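For facilitators who want to show the mechanics, here is a minimal sketch of how raw scores become a ranked distribution and how one token is then chosen. The scores in `toy_logits` are hypothetical and do not come from any real model; the function names are illustrative only.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_k(probs, k):
    """Return the k most probable tokens, highest first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

def pick_greedy(probs):
    """Greedy decoding: always take the single most probable token."""
    return max(probs, key=probs.get)

def pick_sampled(probs, rng):
    """Sampling: draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the stem "The museum was completely ___".
toy_logits = {"empty": 3.0, "silent": 1.5, "deserted": 1.0, "packed": 0.2}

probs = softmax(toy_logits)
ranked = top_k(probs, 3)

if __name__ == "__main__":
    for token, p in ranked:
        print(f"{token:10s} {p:.2f}")
    print("greedy :", pick_greedy(probs))
    print("sampled:", pick_sampled(probs, random.Random()))
```

Running it a few times makes the point of this card concrete: the greedy pick never changes, while the sampled pick occasionally lands on a lower-ranked word.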

§ B · How to investigate it

run it like an experiment, not a toy

Collect the room’s guesses before revealing the model’s probabilities. The comparison is the experiment, not the answer.

01 · Collect first

Before seeing the model

Write down your own next-word guess, then collect several from the room via Zoom chat. Keep them hidden until the model’s probabilities are revealed.

stem: “The museum was completely ___”

02 · Paste and compare

Room vs. model

Enter the guesses, then reveal the model’s top-k. Where does the room cluster? Where does the model diverge? Where do they agree?

room: “empty” · model: 71% “empty”
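If you would rather tally the comparison by hand than rely on the game screen, a rough sketch follows. It assumes guesses arrive as a plain list of strings copied from chat and that the model's top-k is available as a word-to-probability mapping. The guess list is hypothetical; the only model figure used is the 71% for “empty” from the example above.

```python
from collections import Counter

def room_distribution(guesses):
    """Normalize pasted guesses (trim, lowercase) and convert counts to shares."""
    cleaned = [g.strip().lower() for g in guesses if g.strip()]
    counts = Counter(cleaned)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def compare(room, model):
    """List (word, room share, model probability, model minus room),
    with the largest disagreements first."""
    rows = []
    for word in sorted(set(room) | set(model)):
        r = room.get(word, 0.0)
        m = model.get(word, 0.0)
        rows.append((word, r, m, m - r))
    return sorted(rows, key=lambda row: abs(row[3]), reverse=True)

# Hypothetical chat paste for the stem "The museum was completely ___".
room_guesses = ["empty", "Empty", "silent", "eerie", "empty "]

# Only the 71% figure comes from the handout; add further top-k entries
# from whatever the model actually reports.
model_top_k = {"empty": 0.71}

room = room_distribution(room_guesses)
rows = compare(room, model_top_k)

if __name__ == "__main__":
    for word, r, m, gap in rows:
        print(f"{word:8s} room {r:.2f}  model {m:.2f}  gap {gap:+.2f}")
```

Sorting by the size of the gap puts the words worth discussing at the top, whether the room or the model favored them.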

03 · Change the context

More or less preceding text

Run the same stem with more or less preceding context. How does adding one sentence before the stem shift the distribution?

short stem vs. full paragraph · same last phrase
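One way to put a number on the shift is total variation distance: half the sum of absolute probability differences between the two runs. The sketch below assumes you have copied the top-k list from each run into a mapping. Both lists are hypothetical apart from the 71% for “empty”, and because top-k lists are truncated, the result is only an approximation of the true distance.

```python
def total_variation(p, q):
    """Half the summed absolute difference; 0 means identical lists."""
    words = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)

def biggest_movers(before, after, n=3):
    """Words whose probability changed most once context was added."""
    words = set(before) | set(after)
    shifts = {w: after.get(w, 0.0) - before.get(w, 0.0) for w in words}
    return sorted(shifts.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]

# Hypothetical top-k lists for the same last phrase.
# Only the 0.71 for "empty" appears in the handout.
short_stem = {"empty": 0.71, "silent": 0.10, "dark": 0.05}
full_paragraph = {"empty": 0.30, "packed": 0.40, "silent": 0.05}

shift = total_variation(short_stem, full_paragraph)
movers = biggest_movers(short_stem, full_paragraph)

if __name__ == "__main__":
    print(f"shift: {shift:.3f}")
    for word, change in movers:
        print(f"{word:8s} {change:+.2f}")
```

The single number is less interesting than the movers list, which names the words that the added sentence pushed up or down.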

04 · Name the gap

Not “the model was wrong”

Name what the model weighted that the room didn’t. Genre expectation? Register? A statistical pattern in a specific training domain?

room spread: “silent/empty/eerie” · model top-1: “empty” at 71% — genre defaults

§ C · Debrief questions

after the investigation

What made your next-word guess feel obvious?

Where did the room cluster, and where did it spread out?

What context clue had the most effect on predictions?

When does model probability match human intuition, and when does it diverge?