Representation Overlap Lab

01

The Idea

When an AI safety system draws a line, it is drawing that line in a geometry it did not design. Sentence embeddings turn text into points in high-dimensional space, and those points cluster by statistical patterns in training data — not by intent, not by context, not by who is asking.

The result is that concepts from very different categories often end up as neighbors in that space. A question about medication dosage lands near a question about overdose. A security research description shares vocabulary with an exploitation technique. A policy analysis of extremism occupies the same neighborhood as the thing it analyzes.

This is not a flaw in any particular model. It is a consequence of how language works. Safety systems that use embedding similarity to make decisions inherit the geometry — and all the ambiguity that comes with it.

Why should you care?

If you are building a content filter or reward model, this is a picture of where your rule will fire on benign content — and where it will miss harmful content that learned to sound clinical. If you are doing evals, this is a vocabulary for what “near-miss” means beyond a binary score. If you are new to alignment, this is an entry point that does not require reading papers first.

113 concept descriptions across 10 domains

99 high-overlap examples sitting at band boundaries

10 domains, biology to law to AI

02

The Geometry

Every text snippet can be turned into a list of numbers — a vector — by a sentence embedding model. The model is trained to place similar texts near each other in this high-dimensional space. “Similar” here means statistically similar: same vocabulary, same sentence structure, same conceptual patterns in the training corpus.

Cosine similarity measures the angle between two vectors and ranges from -1 to 1. Two texts with similarity 1.0 point in exactly the same direction. Two texts with similarity 0.0 are orthogonal: they share no common direction in the model's learned space, which in practice reads as unrelated.
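As a minimal sketch of the measure described above, the snippet below computes cosine similarity with NumPy. The vectors are toy 3-dimensional stand-ins chosen for illustration, not real embeddings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for 384-dimensional embeddings.
# Same direction, different length: similarity is 1.0.
same_direction = cosine_similarity(np.array([1.0, 2.0, 3.0]),
                                   np.array([2.0, 4.0, 6.0]))

# Perpendicular vectors: similarity is 0.0.
orthogonal = cosine_similarity(np.array([1.0, 0.0, 0.0]),
                               np.array([0.0, 1.0, 0.0]))
```

Because the measure depends only on direction, scaling a vector does not change its similarity to anything else.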

Simplified 2D projection of embedding space

When safety categories overlap in embedding space, no single threshold can cleanly separate them.

The 2D map in this lab is a UMAP projection from 384 dimensions down to 2. UMAP attempts to preserve local neighborhood structure, so points that appear close tend to have higher cosine similarity in the original embedding space. Global distances are distorted. Two clusters that look far apart may be more similar than the map suggests, and two clusters that look close may be less similar. The similarity scores shown in the detail panel and neighbor lists are computed directly in the original 384-dimensional space, not from the 2D layout.

03

The Map

Filter: High-overlap only

2D UMAP projection of 384-dimensional embeddings (n_neighbors=15, min_dist=0.1, cosine metric). Visual distance is a teaching aid. Neighbor lists and similarity scores in the panel are computed in the original embedding space.

Legend (safety bands): Benign, Capability-Building, Ambiguous, Policy-Relevant, Abstract Placeholder

Click any point to explore its safety band, domain, framing, and nearest cross-band neighbors.

113 of 113 concepts shown · 99 high-overlap (≥0.6)

04

Boundary Blur

The boundary blur score measures how evenly an example sits near multiple safety band centroids at once. A score near 1.0 means the concept does not anchor firmly within any single cluster. A score near 0.0 means it sits clearly within one band.

High blur is not a danger signal. It is a geometric property. Many high-blur concepts are straightforwardly benign — they simply use language that crosses categorical boundaries.

Scores in this dataset cluster in the 0.99-1.00 range. This is expected: all-MiniLM-L6-v2 produces cosine similarities in a compressed range across general-topic text, and normalized Shannon entropy over three centroids saturates quickly. The ranking is meaningful even when absolute values are close. A concept scoring 0.994 is meaningfully less anchored than one scoring 1.000. This is an exploration heuristic, not a calibrated risk score.

Top 30 by boundary blur

Select a concept to see its centroid similarity breakdown.

05

Compare

Select any two examples to see their cosine similarity in full embedding space. Similarity here is a measurement of how the model represents them — not a judgment about whether they should be treated the same way.

Example A: How CRISPR-Cas9 gene editing works
Benign · Biology · Educational · Overlap 0.70 · Blur 1.00
Explanation of CRISPR-Cas9 mechanism and applications in medicine and research.

Example B: Antibiotic resistance mechanisms in bacteria
Capability-Building · Biology · Technical · Overlap 0.80 · Blur 1.00
Mechanisms of antibiotic resistance and relevance to drug design and stewardship.

Cosine similarity: 0.3170

Lower similarity. The embedding model distinguishes these reasonably well, assuming similar phrasing.

06

Why This Matters for Safety

Safety categories rarely fail in neat boxes. When a classifier draws a line in embedding space, the line is straight. The actual distribution of harmful and benign content is not.

This creates two structural problems. First, false positives: legitimate content that lives near restricted content gets caught by the same rule. A clinical description of a medical risk, a security researcher's explanation, a policy analysis of extremism — each can land near the content it discusses. Second, false negatives: harmful content that successfully adopts clinical or educational language can slip through a threshold that would have caught its more blunt predecessors.

Neither problem is fully fixable by improving the threshold. Both are consequences of how the underlying representation space is shaped.
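The trade-off can be made concrete with a small threshold sweep. This is a sketch on synthetic numbers invented purely for illustration: the scores below are pretend cosine similarities to a hypothetical "restricted content" reference vector, not values from this lab's dataset.

```python
import numpy as np

# Synthetic, illustrative scores only (not measured data).
benign_scores = np.array([0.20, 0.35, 0.48, 0.55, 0.62])   # e.g. clinical or research framing
harmful_scores = np.array([0.45, 0.58, 0.66, 0.74, 0.81])  # e.g. some phrased to sound clinical

def errors_at(threshold: float) -> tuple:
    """Return (false_positives, false_negatives) when flagging scores >= threshold."""
    false_positives = int(np.sum(benign_scores >= threshold))
    false_negatives = int(np.sum(harmful_scores < threshold))
    return false_positives, false_negatives

# Sweep thresholds from 0.0 to 1.0 in steps of 0.05.
sweep = {round(float(t), 2): errors_at(t) for t in np.arange(0.0, 1.01, 0.05)}

# Smallest combined error count any single threshold achieves.
best_total_errors = min(fp + fn for fp, fn in sweep.values())
```

Because the two toy distributions overlap, every threshold in the sweep makes at least two errors: moving the line only trades false positives for false negatives.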

Intended uses

→ Researchers and students studying embedding-based classification
→ Safety practitioners building intuition for where thresholds will fail
→ Policy analysts explaining the representation overlap problem
→ Educators building curricula around AI safety and dual-use technology

Not intended for

→ Making moderation decisions: this is not a classifier
→ Building training datasets: do not use this to label data
→ Justifying restrictions or permissions based on proximity
→ Claiming any safety system is correct or incorrect

07

Methods

1

Hand-curated dataset

113 concept descriptions across 10 domains and 5 safety bands, written and reviewed individually. No entry provides actionable guidance for causing harm. Abstract risk placeholders name types of restricted content without reproducing them. Source: data/safe_examples_seed.csv

2

Sentence embeddings

Embeddings computed using all-MiniLM-L6-v2 from sentence-transformers, a general-purpose semantic similarity model producing 384-dimensional vectors. The model was not trained for safety classification. Its neighborhoods reflect general language statistics.

3

Cosine similarity matrix

Pairwise cosine similarities computed for all 113 examples, producing a 113×113 matrix. The overlap score for each example is the fraction of its 10 nearest neighbors that belong to a different safety band.
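The overlap score described above can be sketched in a few lines of NumPy. The function name and the toy inputs are illustrative assumptions; the actual pipeline in scripts/ may be organized differently.

```python
import numpy as np

def overlap_scores(embeddings: np.ndarray, bands: list, k: int = 10) -> np.ndarray:
    """Fraction of each example's k nearest neighbors (by cosine similarity)
    that belong to a different safety band. Assumes more than k examples."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T                       # n x n cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)            # an example is not its own neighbor
    nearest = np.argsort(-sims, axis=1)[:, :k] # indices of the k most similar rows
    labels = np.asarray(bands)
    differs = labels[nearest] != labels[:, None]
    return differs.mean(axis=1)
```

With 113 embeddings of shape (113, 384) and k=10, a score of 0.6 means six of an example's ten nearest neighbors carry a different band label.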

4

UMAP projection

384-dimensional embeddings projected to 2D using UMAP (n_neighbors=15, min_dist=0.1, cosine metric, random_state=42). The 2D layout preserves local neighborhood structure but distorts global distances. If UMAP is unavailable, PCA is used as a fallback.
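One way the projection step with its fallback might look is sketched below. It assumes "unavailable" means the umap-learn package cannot be imported, and it implements the PCA fallback with a plain SVD; both function names are hypothetical.

```python
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project rows onto their top two principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def project_2d(embeddings: np.ndarray) -> np.ndarray:
    """Use UMAP with the parameters listed above, or PCA if umap-learn is missing."""
    try:
        import umap  # provided by the umap-learn package
    except ImportError:
        return pca_2d(embeddings)
    reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                        metric="cosine", random_state=42)
    return reducer.fit_transform(embeddings)
```

The two methods are not interchangeable in what they preserve: PCA keeps the directions of greatest global variance, while UMAP favors local neighborhoods, so a fallback map can look quite different.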

5

Boundary blur score

Normalized Shannon entropy of each example's cosine similarities to three reference band centroids (benign, ambiguous, policy-relevant). High entropy means the example sits roughly equidistant from all three centroids. This is an exploration heuristic — not a safety signal.
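A minimal sketch of a normalized-entropy score follows. The text does not say how similarities are turned into a probability distribution, so this version assumes they are clipped at zero and divided by their sum; a softmax or other normalization would give different absolute values. The input numbers are made up for illustration.

```python
import numpy as np

def boundary_blur(centroid_sims: np.ndarray) -> float:
    """Normalized Shannon entropy of similarities to the band centroids.

    Assumption: similarities are clipped at zero and divided by their sum
    to form a probability distribution.
    """
    weights = np.clip(np.asarray(centroid_sims, dtype=float), 0.0, None)
    probs = weights / weights.sum()
    nonzero = probs[probs > 0]                   # treat 0 * log(0) as 0
    entropy = -np.sum(nonzero * np.log(nonzero))
    return float(entropy / np.log(len(probs)))   # divide by the maximum, log(3) here

# Illustrative similarities to the benign, ambiguous, and policy-relevant centroids.
evenly_spread = boundary_blur(np.array([0.40, 0.40, 0.40]))  # equal similarity to all three
anchored = boundary_blur(np.array([0.90, 0.00, 0.00]))       # similar to one centroid only
typical = boundary_blur(np.array([0.30, 0.25, 0.35]))        # modest differences
```

Under this normalization the third case still scores above 0.99, which illustrates how quickly entropy over three similar values saturates.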

6

Static export

All artifacts are precomputed offline using the Python pipeline in scripts/ and exported to public/data/. The web app loads JSON at build time. No model inference happens at runtime.
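The export step could be sketched roughly as follows. The file name points.json, the metadata key names, and the function name are hypothetical; only the metadata.json file name, the model name, and the UMAP parameters come from this page.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def export_artifacts(points: list, out_dir: Path) -> Path:
    """Write precomputed records and run metadata as static JSON files."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical file name for the per-concept records.
    (out_dir / "points.json").write_text(json.dumps(points, indent=2))

    # Hypothetical key names; values are the settings stated on this page.
    metadata = {
        "embedding_model": "all-MiniLM-L6-v2",
        "umap": {"n_neighbors": 15, "min_dist": 0.1,
                 "metric": "cosine", "random_state": 42},
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    metadata_path = out_dir / "metadata.json"
    metadata_path.write_text(json.dumps(metadata, indent=2))
    return metadata_path
```

Keeping inference offline means the site only ever serves fixed files, so what a visitor sees cannot drift from what was reviewed.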

Limitations

Projection distortion. The 2D map loses information. Points that appear close may be farther apart in 384-dimensional space, and points that appear distant may be closer.

Model bias. all-MiniLM-L6-v2 encodes the statistical patterns of its training corpus, including biases about language, framing, and domain. The neighborhoods reflect those biases.

Non-representative dataset. The 113 examples are designed to illustrate overlap; they are not a sample of any real query distribution. Do not use as a benchmark.

Editorial categories. The five safety bands are editorial judgments, not ground truth validated against any benchmark.

Model scope. all-MiniLM-L6-v2 is a small general-purpose semantic similarity model, not a frontier language model. Its neighborhoods reflect statistical patterns in general text, not the internal representations of deployed safety classifiers or frontier chat models. Results should not be generalized to those systems.

Single metric. Cosine similarity in this embedding space is one view of conceptual proximity. Different embedding models, tokenizers, or distance metrics can produce substantially different neighborhoods. The patterns shown here are specific to all-MiniLM-L6-v2 and cosine similarity.

Heuristic scores. The overlap score and boundary blur score are geometric exploration heuristics. They are not calibrated against any safety benchmark and should not be used to draw conclusions about real-world safety system performance or content risk.

Intended scope. This project is intended to build intuition about embedding geometry, not to settle any technical question about AI safety systems. Treat it as a starting point for thinking, not a research result.

Full artifact metadata (embedding model, UMAP parameters, generation timestamp) is available at /data/metadata.json.

08

Start Here

If you want to go deeper on representation, safety, or the ideas behind this lab, here are honest starting points.

Representation Engineering (Zou et al., 2023)

Introduces a framework for understanding AI behavior through linear representations. Directly relevant to why embedding geometry matters for safety.

Towards Monosemanticity (Anthropic, 2023)

Shows that model internals are more entangled than clean feature boundaries suggest — the representational version of the overlap problem.

Sentence-BERT (Reimers & Gurevych, 2019)

The technical foundation for all-MiniLM-L6-v2. Understanding how the model was trained clarifies what its neighborhoods actually reflect.

UMAP: Uniform Manifold Approximation and Projection (McInnes et al., 2018)

The algorithm used for 2D projection. Knowing its assumptions — especially that it distorts global distances — is important for reading the map correctly.

On the Dangers of Stochastic Parrots (Bender et al., 2021)

Raises questions about what large language models encode and reproduce. Relevant to understanding what semantic embedding spaces actually contain.