Count the Next Token

§ A · What it makes visible

what the screen literally shows

Fig. 01

Counting as the foundation

Before any prediction, there is counting: how many times did each word follow this context word in the training data? The bigram table makes those counts visible before any probability appears.
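The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: the toy `corpus` string, the whitespace tokenization, and the helper name `build_bigram_counts` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Toy corpus: an assumption for illustration, not the tool's training data.
corpus = "the cat sat on the mat . the cat ate . the dog sat on the rug ."


def build_bigram_counts(text):
    """Map each context word to a Counter of the words that followed it."""
    tokens = text.split()  # naive whitespace tokenization (assumption)
    table = defaultdict(Counter)
    # Pair every token with the token that comes right after it.
    for context, following in zip(tokens, tokens[1:]):
        table[context][following] += 1
    return table


counts = build_bigram_counts(corpus)

# One row of the bigram table: raw counts only, no probabilities yet.
for word, n in counts["the"].most_common():
    print(f"the -> {word}: {n}")
```

Nothing here is a probability yet. The row for “the” is just a tally, which is the point of looking at it first.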

Fig. 02

Division as prediction

The model doesn’t “know” the next word. It divides the count of one continuation by the total count of all continuations seen after the same context word. Probability here is relative frequency, made explicit.
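The division itself is small enough to write out. The sketch below assumes one row of counts is already available as a plain dictionary; the function name `to_probabilities` and the numbers in `example_counts` are hypothetical.

```python
def to_probabilities(counts):
    """Turn one row of continuation counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no continuations were counted for this context")
    # Each probability is count / total, nothing more.
    return {word: count / total for word, count in counts.items()}


# Hypothetical counts for a single context word.
example_counts = {"cat": 2, "mat": 1, "dog": 1, "rug": 1}
probs = to_probabilities(example_counts)

for word, p in probs.items():
    print(f"{word}: {example_counts[word]} of {sum(example_counts.values())} -> {p:.2f}")
```

Because every value is divided by the same total, the probabilities in a row always sum to 1.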

Fig. 03

The full arithmetic chain

Count → divide → probability → sample. Every token prediction in the bigram model follows this chain. Count the Next Token exposes each step so the mechanism is inspectable, not just described.
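The whole chain fits in one function. This is a sketch under stated assumptions: `predict_next`, the toy token list, and the seeded random generator are illustrative choices, and sampling uses the standard library's `random.Random.choices`, which may differ from how the tool draws its samples.

```python
import random
from collections import Counter


def predict_next(tokens, context, rng):
    """Run count -> divide -> probability -> sample for one context word."""
    # 1. Count: tally every word that directly follows the context word.
    counts = Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == context)
    if not counts:
        raise ValueError(f"{context!r} never appears with a following word")

    # 2. Divide: each count over the total for this context.
    total = sum(counts.values())
    probs = {word: count / total for word, count in counts.items()}

    # 3. Sample: draw one continuation, weighted by its probability.
    words = list(probs)
    choice = rng.choices(words, weights=[probs[w] for w in words], k=1)[0]
    return counts, probs, choice


# Toy token list (assumption), with a fixed seed so runs are repeatable.
tokens = "the cat sat on the mat the cat ran".split()
counts, probs, choice = predict_next(tokens, "the", random.Random(0))

print("counts:", dict(counts))
print("probabilities:", probs)
print("sampled:", choice)
```

Returning the counts and probabilities alongside the sampled word keeps every intermediate step available for inspection.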

§ B · How to investigate it

run it like an experiment, not a toy

Follow the arithmetic. Don’t skip to the probability — trace the count that produced it.

01 · Pick a context word

Watch the table build

Choose a common word and watch the bigram table build. How many distinct continuations appear? How evenly are the counts spread across them?

context: “the” → many continuations
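Both questions in this step can be answered with two numbers: how many distinct continuations there are, and how large a share the most frequent one takes. The sketch below assumes a row of counts is already in hand; `describe_continuations` and the counts in `row` are hypothetical.

```python
from collections import Counter


def describe_continuations(counts):
    """Summarize one row: how many continuations, and how dominant the top one is."""
    total = sum(counts.values())
    top_word, top_count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),         # how many different words followed
        "total": total,                  # how many times the context was followed at all
        "top_word": top_word,
        "top_share": top_count / total,  # a small share means a flat row
    }


# Hypothetical row for a common context word.
row = Counter({"cat": 3, "dog": 2, "mat": 2, "rug": 2, "sun": 2, "end": 1})
summary = describe_continuations(row)
print(summary)
```

A common word tends to show many distinct continuations and a small top share; that combination is what "spread evenly" looks like in the table.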

02 · Follow the arithmetic

Count ÷ total = probability

Identify the most frequent continuation. Divide its count by the total count of all continuations. Does the probability match what you would predict?

“cat”: 3 of 12 total → 0.25
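Written as a formula, the same arithmetic looks like this. The symbols are assumptions introduced for the example: c is the context word, w a continuation, and count(c, w) the number of times w followed c.

```latex
\[
P(w \mid c) \;=\; \frac{\mathrm{count}(c, w)}{\sum_{w'} \mathrm{count}(c, w')}
\qquad\Longrightarrow\qquad
P(\text{cat} \mid c) \;=\; \frac{3}{12} \;=\; 0.25
\]
```

The denominator sums over every continuation in the row, so the other nine occurrences, whatever words they are, share the remaining 0.75.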

03 · Change the context

Specific vs. common words

Run the same setup with a more specific context word. What happens to the distribution — does it sharpen or spread?

“the” (flat) vs. “Thursday” (peaked)
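One way to put a number on "sharpen or spread" is Shannon entropy, measured in bits: higher means flatter. Entropy is a swapped-in measure for this sketch, not something the tool is known to display, and both rows of counts below are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy of one row of counts, in bits (higher = flatter)."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total)
        for c in counts.values()
        if c > 0  # zero counts contribute nothing
    )


# Hypothetical rows: a common word with even counts, a specific word with one favorite.
flat_row = {"cat": 3, "dog": 3, "mat": 3, "rug": 3}
peaked_row = {"night": 17, "morning": 4, "afternoon": 4}

print(f"flat row:   {entropy_bits(flat_row):.2f} bits")
print(f"peaked row: {entropy_bits(peaked_row):.2f} bits")
```

Four equally likely continuations give exactly 2 bits; the peaked row scores lower because one continuation absorbs most of the count.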

04 · Name what changes

Not “the probabilities shifted”

Name the specific mechanism: does a rare word become likely because of a particular pattern in the training data? Name that pattern.

“Thursday” → “night” at 68% — sports broadcast schedules in training data
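Naming the pattern means going back from a count to the text that produced it. A minimal sketch, assuming the training text is available as a list of sentences; `find_bigram_sources` and the three sample sentences are hypothetical and are not drawn from the tool's data.

```python
def find_bigram_sources(sentences, context, following):
    """Return the sentences in which `following` comes directly after `context`."""
    hits = []
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenization (assumption)
        if any(a == context and b == following for a, b in zip(tokens, tokens[1:])):
            hits.append(sentence)
    return hits


# Hypothetical training sentences.
sentences = [
    "Thursday night football kicks off at eight",
    "The game airs Thursday night on cable",
    "We meet Thursday morning",
]

sources = find_bigram_sources(sentences, "thursday", "night")
for sentence in sources:
    print(sentence)
```

Reading the matching sentences is what turns "the count is high" into a named cause, such as a run of broadcast listings.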

§ C · Debrief questions

after the investigation

What did you notice when you could see the counts instead of just the probabilities?

Where did the arithmetic produce a result that surprised you?

What would a very different training corpus do to this bigram table?

When does frequency become a kind of knowledge, and when does it mislead?