When you look at a photograph, you see a person, a place, a mood, a story. A model processes numbers. It extracts patterns from those numbers. It compresses those patterns into a vector. None of these steps involve seeing in any sense that resembles human vision. Three lenses show you what's actually happening at each stage.
Choose a lens
You see
A face. Someone familiar or unfamiliar. An expression — tired, happy, uncertain. A person with a story.
The model processes
A grid of numbers. Each pixel: three values (red, green, blue), each between 0 and 255. For a 512×512 image: 786,432 numbers. No face. No person. No story.
Sample: 8×8 patch of pixel values (one value per pixel shown)
210  195  220  185  200  215  190  205
178  192  208  170  185  200  175  188
215  198  182  210  192  178  205  188
170  200  215  183  196  178  210  192
185  205  170  195  212  188  200  178
192  215  182  198  172  205  188  210
195  178  200  185  215  192  170  205
188  200  178  212  195  182  210  188
What this is: A tiny 8×8 region of a skin-tone patch — roughly what the model starts with before any processing. No face is visible at this scale. The full image is thousands of patches like this. At this stage, "seeing" means reading numbers arranged in a grid.
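The arithmetic behind those counts can be checked with a minimal Python sketch. It assumes a plain nested-list representation of an image (real pipelines typically use array libraries), and the names `count_values`, `count_patches`, and `tiny_image` are illustrative, not part of any tool mentioned here.

```python
# Illustrative only: an image as a grid of numbers.

def count_values(width, height, channels=3):
    """Total numbers needed to store an image of the given shape."""
    return width * height * channels

def count_patches(width, height, patch=8):
    """Number of non-overlapping patch x patch tiles that fit in the image."""
    return (width // patch) * (height // patch)

# A hypothetical 2x2 RGB image: each pixel is (red, green, blue), 0-255.
tiny_image = [
    [(210, 160, 140), (195, 150, 130)],
    [(178, 140, 120), (192, 148, 128)],
]

# Flattening shows what the model receives: just a list of integers.
flat = [value for row in tiny_image for pixel in row for value in pixel]

print(count_values(512, 512))   # 786432
print(count_patches(512, 512))  # 4096
print(len(flat))                # 12
```

Nothing in `flat` marks where a face begins or ends; any such structure has to be computed from the numbers in later stages.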
You see
Eyes, a nose, a jawline, skin, light and shadow — assembled instantly into a face. Recognition is immediate, automatic, and connected to a lifetime of learned faces.
The model processes
Edges, gradients, regions of contrast. Patterns that — across millions of training examples — have co-occurred with labels like "eye," "face," "person." Still no face. Just correlations between pixel patterns and labels.
Feature activations — early layer of a vision model
H-edge: 0.87
V-edge: 0.42
Curve: 0.91
Contrast: 0.65
Texture: 0.38
Symmetry: 0.79
Skin-tone: 0.94
Oval-shape: 0.83
What this is: The model has detected patterns that strongly co-occur with faces in training data: curves, symmetry, skin-tone ranges, oval outlines. These are not a face. They are activations in a mathematical function, and the names shown here are simplified labels for illustration; a trained model's features do not come with names. The "face" label emerges from the combination of these activations, not from any single one. This is called feature extraction, and it is what the early layers of vision models do.
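As a rough illustration of what an "H-edge" or "V-edge" activation could be, the sketch below slides two hand-written filters over a tiny grayscale patch. This is an assumption-laden simplification: real vision models learn their filters from data rather than using fixed ones, and the names `apply_filter`, `mean_abs`, and the sample `patch` are hypothetical.

```python
# Illustrative only: hand-written edge filters on a tiny grayscale grid.
# Real vision models learn their filters; these are fixed by hand.

def apply_filter(image, kernel):
    """Slide a kernel over the image (no padding) and return the responses.

    This is cross-correlation, the operation most deep learning
    libraries call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def mean_abs(grid):
    """Average absolute response: a crude stand-in for activation strength."""
    values = [abs(v) for row in grid for v in row]
    return sum(values) / len(values)

# Hypothetical 4x4 patch: dark top half, bright bottom half.
patch = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
]

h_edge_kernel = [[-1, -1], [1, 1]]  # responds to change from top to bottom
v_edge_kernel = [[-1, 1], [-1, 1]]  # responds to change from left to right

h_strength = mean_abs(apply_filter(patch, h_edge_kernel))
v_strength = mean_abs(apply_filter(patch, v_edge_kernel))
print(h_strength, v_strength)
```

On this patch the horizontal-edge filter responds strongly and the vertical-edge filter returns zero, because brightness changes only from top to bottom. The output is still only numbers; calling one of them an "edge" is our interpretation.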
You see
A complete image — objects, relationships, mood, context. You can compare it to other images by meaning: "this reminds me of..." or "this is similar to..."
The model processes
A vector: a list of hundreds or thousands of numbers that compresses the image's visual content into a point in a high-dimensional mathematical space. Similarity is distance. Meaning is position.
What this is: The entire image compressed into a list of decimal numbers. This vector is what the model actually compares when it asks "is this similar to that?" Two images are "similar" if their vectors are close in mathematical space — not if they mean the same thing to a human viewer. This compression is powerful, but it's not vision. It's geometry.
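One common way to measure how close two vectors are is cosine similarity, sketched below. This is a minimal example under stated assumptions: the text does not say which distance measure any particular model uses, and the four-number vectors `photo_a`, `photo_b`, and `photo_c` are made up for illustration (real embeddings are far longer).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings; the values are invented.
photo_a = [0.9, 0.1, 0.3, 0.7]
photo_b = [0.8, 0.2, 0.4, 0.6]
photo_c = [-0.5, 0.9, -0.2, 0.1]

sim_ab = cosine_similarity(photo_a, photo_b)
sim_ac = cosine_similarity(photo_a, photo_c)

# The comparison is pure arithmetic: a higher score means "more similar".
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Here `photo_a` and `photo_b` score close to 1, while `photo_a` and `photo_c` score below zero. The function never looks at what the images depict, only at where their vectors point.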
What human vision actually involves — and the model doesn't have
Memory: That face belongs to someone. Last time I saw them they looked different. Something changed.
Context: That setting means something specific. I know what a hospital looks like because I've been in one.
Emotion: This image makes me uncomfortable. This one is beautiful. This one makes me sad for a reason I can name.
Stakes: I care what this image means. What I do with it matters to me personally.
Key line
"The model processes numbers. What those numbers represent, and whether they mean anything, depends on how we design and interpret the system — not on the model itself."
This gap creates both failures and surprising capabilities. Failures: the model has no way to know it classified a skin rash incorrectly, or that the "face" it detected was a drawing, or that the person in the image would not consent to being described this way. Capabilities: vector similarity can find images that look alike across huge datasets faster than any human could. Understanding the mechanism explains both.
Now open the tools
In the Feature Extraction & Pixel Resolution tool, adjust resolution and watch which features survive — and which disappear. In the Diffusion Step-Through Viewer, watch the model convert noise into structure one denoising step at a time, using these same underlying operations.