Latent Differentiation

Measuring the Allogenetic

Philippe Boisnard — Université Paris 8 — Paragraphe-CITU

From Pixels to Latent Spaces

In 2009, Lev Manovich introduced ImagePlot as part of the Cultural Analytics initiative — a pioneering effort to apply computational methods to visual culture at scale. By measuring low-level features such as brightness, saturation, and hue across thousands of artworks, ImagePlot revealed macroscopic patterns invisible to the unaided eye.

Latent Art History (2026) extended this lineage by projecting images and language into a shared latent space: the question was no longer "how bright is this painting?" but "how sacred is it?" Latent Differentiation pursues this trajectory in a more critical register: rather than comparing modes of seeing, it measures the distance between them. What does language add to, or subtract from, vision? Where does morphological similarity diverge from semantic proximity? The answer is not a method but a politics.

Three Levels, Three Differentials

The benchmark compares three regimes of representation:

  • Level 1: Physicality. Fourteen interpretable low-level features computed directly from the image (luminosity, saturation, entropy, frequency bands, edges). The image as physical surface: only what pixels objectively contain, with no learning involved.
  • Level 2: Morphology. DINOv2 ViT-B/14 embeddings (768 dimensions): a vision model that learned to recognize visual structures (compositions, shapes, textures, patterns) from millions of images, but without ever being exposed to text. The image as configuration of structural regularities: what the eye sees beyond pixels, but before language.
  • Level 3: Allogenetic. Three vision-language models (SigLIP, OpenAI CLIP, OpenCLIP LAION) that learned by aligning images with their textual descriptions on web-scale corpora. The image as semantic location in a space shaped by language, by what was photographed, captioned, indexed, and made searchable. Different VLMs trained on different corpora produce different readings, hence three of them, to measure their disagreements.
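To make Level 1 concrete, here is a minimal sketch of three features of the kind listed above: mean luminosity, mean saturation, and the Shannon entropy of the grayscale histogram. The exact fourteen features and their formulas are not specified here, so the Rec. 601 luma weights, the HSV definition of saturation, the 256-bin histogram, and the function name physical_features are all assumptions made for illustration.

```python
import colorsys
import math
from collections import Counter


def physical_features(pixels):
    """Illustrative Level 1 features (hypothetical helper, not the benchmark's code).

    pixels: list of (r, g, b) tuples with channels in 0..255.
    """
    n = len(pixels)
    # Luma per pixel with Rec. 601 weights (an assumption), scaled to 0..1.
    lumas = [(0.299 * r + 0.587 * g + 0.114 * b) / 255.0 for r, g, b in pixels]
    # HSV saturation per pixel, via the standard library.
    sats = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)[1]
            for r, g, b in pixels]
    # Shannon entropy, in bits, of the 256-bin grayscale histogram.
    hist = Counter(min(255, int(l * 255)) for l in lumas)
    entropy = -sum((c / n) * math.log2(c / n) for c in hist.values())
    return {
        "luminosity": sum(lumas) / n,
        "saturation": sum(sats) / n,
        "entropy": entropy,
    }
```

Nothing in this function is learned: a uniformly gray image has zero saturation and zero entropy whatever it depicts, which is what "physical surface" means at this level.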

Three differentials emerge: Δ₁₂ isolates what morphology adds to physicality; Δ₁₃ measures the gap between physical surface and semantic meaning; Δ₂₃ — the most decisive — isolates the work of language itself. Where two images are morphologically close but semantically distant, language is doing something.
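One possible way to operationalize such a differential is sketched below, under stated assumptions. The text does not give a formula for Δ, so the choice made here is hypothetical: compute the cosine distance of every image pair in each of two spaces, convert the distances to ranks so that the two spaces become comparable, and take the per-pair difference. The function names and the rank normalization are illustrative, not the benchmark's definition.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def rank_normalize(values):
    """Map values to ranks in [0, 1]; ties are broken by position (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = rank / (len(values) - 1) if len(values) > 1 else 0.0
    return ranks


def pairwise_differential(space_a, space_b):
    """Hypothetical per-pair differential between two embedding spaces.

    Returns {(i, j): rank_b - rank_a}. A large positive value means the
    pair is close in space_a but distant in space_b.
    """
    pairs = list(combinations(range(len(space_a)), 2))
    ra = rank_normalize([cosine_distance(space_a[i], space_a[j]) for i, j in pairs])
    rb = rank_normalize([cosine_distance(space_b[i], space_b[j]) for i, j in pairs])
    return {p: b - a for p, a, b in zip(pairs, ra, rb)}
```

Called with Level 2 embeddings as space_a and Level 3 embeddings as space_b, the pairs with the highest values are those that are morphologically close but semantically distant.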

Allogenetic Memory

Vision-language models do not see images. They navigate latent spaces structured by their training corpora — by what was photographed, annotated, indexed, made available. Their organization of the visual world is irreducible to perception: it is an allogenetic memory, generated by something other than subjective experience.

Empirically, the differential signal Δ₂₃ peaks on photography and vernacular images — not because these are richer, but because they dominate web-scale text-image corpora. On classical painting, where text density is low, Δ₂₃ collapses: morphology and semantics converge. The benchmark thus measures not cultural complexity but the differential politics of the training data itself.

How does the Probe work?

Type a prompt — colonial, sacred, beautiful — and the cloud recolors. But how exactly?

For each of the three vision-language models (SigLIP, OpenAI CLIP, OpenCLIP LAION), the prompt is encoded by the model's text encoder into a vector in the same space as the image embeddings. We then compute the cosine similarity between this prompt vector and the embedding of every image in the corpus. Each image receives a score, which sets its color: blue = far from the prompt, orange = close to it.
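The scoring step can be sketched as follows, assuming the prompt vector and the image embeddings have already been produced by one model's text and image encoders (the encoding itself is not shown). Plain Python lists stand in for embedding arrays. The min-max rescaling onto the blue-to-orange scale is an assumption, since the text does not say how scores are mapped to colors, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def probe_scores(prompt_vec, image_vecs):
    """Cosine similarity between one prompt vector and every image embedding."""
    return [cosine_similarity(prompt_vec, v) for v in image_vecs]


def to_color_scale(scores):
    """Rescale scores to [0, 1]: 0 = blue (far), 1 = orange (close).

    Min-max rescaling is an assumption made for this sketch.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


# Toy example with 2-dimensional stand-ins for real embeddings.
scores = probe_scores([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
colors = to_color_scale(scores)
```

Because the rescaling is relative to the corpus, the colors say which images are closest to the prompt within this corpus, not how close they are in absolute terms.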

Switching between models reveals how each one "reads" the same prompt differently against the same corpus. The "VLM variance" mode highlights, in orange, the images on which the three models disagree most — making visible the differential politics of training corpora.
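A minimal sketch of how such a disagreement score might be computed is given below. It assumes that each model's scores are first standardized across the corpus, because raw cosine similarities from different models are not on a common scale, and that the population variance is then taken per image across the models. Both choices, and the function names, are assumptions; the text does not specify the computation.

```python
import statistics


def zscores(scores):
    """Standardize one model's scores across the corpus."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mean) / sd if sd > 0 else 0.0 for s in scores]


def vlm_variance(scores_by_model):
    """Per-image variance of standardized scores across models.

    scores_by_model: dict mapping a model name to its list of per-image
    scores, with images in the same order for every model.
    """
    standardized = [zscores(s) for s in scores_by_model.values()]
    return [statistics.pvariance(per_image) for per_image in zip(*standardized)]
```

An image that every model ranks the same way gets a variance near zero, while an image that one model places near the prompt and another places far from it gets a high value.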

For levels 1 (physicality) and 2 (DINOv2 morphology), which have no native text encoder, we use proxies: a manual feature mapping for level 1 (where applicable — light, dark, colorful…), and a DINOv2 centroid of the top-30 SigLIP matches for level 2. When level 1 cannot be mapped (e.g. colonial, sacred), the score is explicitly null — which is itself the point: those concepts are purely allogenetic, with no signature in physicality.
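Both proxies can be sketched in a few lines, under stated assumptions. For level 1, the mapping table below is hypothetical: the text names light, dark, and colorful as examples but does not say which features they map to. For level 2, the sketch averages the DINOv2 embeddings of the top-k SigLIP matches and scores every image by cosine similarity to that centroid; the unweighted mean and the cosine scoring are assumptions.

```python
import math

# Hypothetical mapping from prompts to (feature name, sign).
LEVEL1_MAP = {
    "light": ("luminosity", 1.0),
    "dark": ("luminosity", -1.0),
    "colorful": ("saturation", 1.0),
}


def level1_score(prompt, features):
    """Return a signed feature value, or None when the prompt has no physical proxy."""
    if prompt not in LEVEL1_MAP:
        return None  # e.g. "colonial", "sacred": explicitly null
    name, sign = LEVEL1_MAP[prompt]
    return sign * features[name]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def level2_proxy_scores(siglip_scores, dino_vecs, k=30):
    """Score every image against the DINOv2 centroid of the top-k SigLIP matches."""
    top = sorted(range(len(siglip_scores)),
                 key=lambda i: siglip_scores[i], reverse=True)[:k]
    dim = len(dino_vecs[0])
    centroid = [sum(dino_vecs[i][d] for i in top) / len(top) for d in range(dim)]
    return [cosine_similarity(centroid, v) for v in dino_vecs]
```

The level 2 proxy inherits its notion of the prompt from SigLIP: it shows which images look like SigLIP's best matches, not what DINOv2 would make of the word itself.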