Bridge 4 Before Session 2: Images

What Does the Machine See?

When you look at a photograph, you see a person, a place, a mood, a story. A model processes numbers. It extracts patterns from those numbers. It compresses those patterns into a vector. None of these steps involve seeing in any sense that resembles human vision. Three lenses show you what's actually happening at each stage.

Choose a lens

You see

A face. Someone familiar or unfamiliar. An expression — tired, happy, uncertain. A person with a story.

The model processes

A grid of numbers. Each pixel: three values (red, green, blue), each between 0 and 255. For a 512×512 image: 786,432 numbers. No face. No person. No story.
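The arithmetic behind that count can be checked with a minimal sketch. The helper names `make_image` and `count_values` and the sample color are illustrative assumptions, not part of any particular vision library.

```python
def make_image(width, height, color=(210, 180, 160)):
    """Return a height x width grid where every pixel is the same RGB triple.

    The default color is an arbitrary placeholder chosen for illustration.
    """
    return [[color for _ in range(width)] for _ in range(height)]


def count_values(image):
    """Count every individual channel value stored in the image."""
    return sum(len(pixel) for row in image for pixel in row)


# A tiny 4x2 image: 4 * 2 pixels, 3 values each.
small = make_image(4, 2)
small_count = count_values(small)

# The same arithmetic for the 512x512 image described in the text.
full_count = 512 * 512 * 3

print(small_count, full_count)
```

Nothing in this structure marks any group of values as a face or an object. The image is only rows of pixels, and each pixel is only three integers.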

Sample — 8×8 patch of pixel values

210 195 220 185 200 215 190 205
178 192 208 170 185 200 175 188
215 198 182 210 192 178 205 188
170 200 215 183 196 178 210 192
185 205 170 195 212 188 200 178
192 215 182 198 172 205 188 210
195 178 200 185 215 192 170 205
188 200 178 212 195 182 210 188

What this is: A tiny 8×8 region of a skin-tone patch — roughly what the model starts with before any processing. No face is visible at this scale. The full image is thousands of patches like this. At this stage, "seeing" means reading numbers arranged in a grid.

You see

Eyes, a nose, a jawline, skin, light and shadow — assembled instantly into a face. Recognition is immediate, automatic, and connected to a lifetime of learned faces.

The model processes

Edges, gradients, regions of contrast. Patterns that — across millions of training examples — have co-occurred with labels like "eye," "face," "person." Still no face. Just correlations between pixel patterns and labels.

Feature activations — early layer of a vision model

H-edge      0.87
V-edge      0.42
Curve       0.91
Contrast    0.65
Texture     0.38
Symmetry    0.79
Skin-tone   0.94
Oval-shape  0.83

What this is: The model has detected patterns that strongly co-occur with faces in training data — curves, symmetry, skin-tone ranges, oval outlines. These are not a face. They are activations in a mathematical function. The "face" label emerges from the combination of these activations, not from any single one. This is called feature extraction — and it's how early layers of vision models work.
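As a rough illustration of what an edge activation is, the sketch below slides two hand-written 2×2 filters over a small grayscale patch. The patch values, the filter weights, and the function name `cross_correlate` are assumptions made for this example. A trained vision model learns its own filter weights from data rather than using fixed ones like these.

```python
def cross_correlate(image, kernel):
    """Slide the kernel over the image (no padding) and return the responses.

    This is the unflipped sliding-window sum that deep learning frameworks
    usually call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


# Hypothetical 4x4 grayscale patch: bright top half, dark bottom half.
patch = [
    [200, 200, 200, 200],
    [200, 200, 200, 200],
    [50, 50, 50, 50],
    [50, 50, 50, 50],
]

# Hand-written filters, for illustration only.
H_EDGE = [[1, 1], [-1, -1]]   # responds to a top-to-bottom brightness change
V_EDGE = [[1, -1], [1, -1]]   # responds to a left-to-right brightness change

h_response = cross_correlate(patch, H_EDGE)
v_response = cross_correlate(patch, V_EDGE)

print(h_response)  # large values only on the row where bright meets dark
print(v_response)  # all zeros: nothing changes from left to right
```

The horizontal-edge filter responds strongly where the bright rows meet the dark rows and stays at zero elsewhere, while the vertical-edge filter stays at zero everywhere. Each response is a weighted sum of pixel values, which is all an "activation" is at this level.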

You see

A complete image — objects, relationships, mood, context. You can compare it to other images by meaning: "this reminds me of..." or "this is similar to..."

The model processes

A vector: a list of hundreds or thousands of numbers that compress the image's visual content into a point in a high-dimensional mathematical space. Similarity is distance. Meaning is position.

Image embedding — 768-dimensional vector (first 48 values shown)

[0.832, -0.214, 0.671, 0.118, -0.443, 0.887, -0.062, 0.341,
0.594, -0.778, 0.229, 0.463, -0.115, 0.712, -0.388, 0.045,
0.923, -0.167, 0.534, -0.298, 0.751, 0.083, -0.612, 0.445,
-0.334, 0.877, 0.192, -0.541, 0.368, -0.724, 0.055, 0.889,
0.217, -0.493, 0.661, -0.138, 0.802, 0.324, -0.579, 0.147,
-0.865, 0.411, 0.076, -0.723, 0.558, 0.239, -0.184, 0.693, ...]

What this is: The entire image compressed into a list of decimal numbers. This vector is what the model actually compares when it asks "is this similar to that?" Two images are "similar" if their vectors are close in mathematical space — not if they mean the same thing to a human viewer. This compression is powerful, but it's not vision. It's geometry.
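One common way to measure how close two vectors are is cosine similarity. The text does not say which measure a given model uses, so treat the sketch below as one option among several. The three 4-dimensional vectors are made-up stand-ins for real embeddings, which would have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for illustration.
portrait_a = [0.8, -0.2, 0.6, 0.1]
portrait_b = [0.7, -0.1, 0.7, 0.2]
landscape = [-0.5, 0.9, -0.1, 0.3]

sim_portraits = cosine_similarity(portrait_a, portrait_b)
sim_mixed = cosine_similarity(portrait_a, landscape)

# The two "portrait" vectors point in nearly the same direction,
# so their similarity is much higher than the portrait/landscape pair.
print(round(sim_portraits, 3), round(sim_mixed, 3))
```

The comparison never touches the images themselves. "Similar" here means only that two lists of numbers point in nearly the same direction.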

What human vision actually involves — and the model doesn't have

Memory: That face belongs to someone. Last time I saw them they looked different. Something changed.

Context: That setting means something specific. I know what a hospital looks like because I've been in one.

Emotion: This image makes me uncomfortable. This one is beautiful. This one makes me sad for a reason I can name.

Stakes: I care what this image means. What I do with it matters to me personally.

Key line: "The model processes numbers. What those numbers represent, and whether they mean anything, depends on how we design and interpret the system, not on the model itself."

This gap creates both failures and surprising capabilities. Failures: the model has no way to know it classified a skin rash incorrectly, or that the "face" it detected was a drawing, or that the person in the image would not consent to being described this way. Capabilities: vector similarity can find images that look alike across huge datasets faster than any human could. Understanding the mechanism explains both.

Now open the tools

In the Feature Extraction & Pixel Resolution tool, adjust resolution and watch which features survive — and which disappear. In the Diffusion Step-Through Viewer, watch the model convert noise into structure one denoising step at a time, using these same underlying operations.
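To get a feel for why some features survive a drop in resolution and others vanish, here is a minimal sketch using 2×2 average pooling. This is an assumed downsampling method chosen for illustration and the tool itself may work differently. The two test patterns and the helper names are likewise hypothetical.

```python
def downsample_2x(image):
    """Halve the resolution by averaging each non-overlapping 2x2 block."""
    out = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[0]) - 1, 2):
            block = (image[r][c] + image[r][c + 1]
                     + image[r + 1][c] + image[r + 1][c + 1])
            row.append(block / 4)
        out.append(row)
    return out


def max_horizontal_step(image):
    """Largest brightness jump between horizontally adjacent values."""
    return max(abs(row[c + 1] - row[c])
               for row in image for c in range(len(row) - 1))


# Fine detail: alternating one-pixel-wide dark and bright columns.
fine_stripes = [[0, 255, 0, 255, 0, 255, 0, 255] for _ in range(8)]
# Coarse detail: left half dark, right half bright.
coarse_edge = [[0, 0, 0, 0, 255, 255, 255, 255] for _ in range(8)]

fine_small = downsample_2x(fine_stripes)
coarse_small = downsample_2x(coarse_edge)

print(max_horizontal_step(fine_stripes), max_horizontal_step(fine_small))
print(max_horizontal_step(coarse_edge), max_horizontal_step(coarse_small))
```

After pooling, the one-pixel stripes average out to a flat gray and their edges are gone, while the single coarse boundary is still present at full strength. Detail finer than the new pixel size cannot be recovered by any later processing step.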
