Latent Space Lab — An Interactive Essay
The hard cases in AI safety rarely live in clean boxes.
Representation Overlap
Why safety boundaries are not always cleanly separable.
Plain language. Technical terms explained inline.
~8 min read · includes interactive map and comparisons
01
The Idea
When an AI safety system draws a line, it is drawing that line in a geometry it did not design. Sentence embeddings turn text into points in high-dimensional space, and those points cluster by statistical patterns in training data — not by intent, not by context, not by who is asking.
The result is that concepts from very different categories often end up as neighbors in that space. A question about medication dosage lands near a question about overdose. A security research description shares vocabulary with an exploitation technique. A policy analysis of extremism occupies the same neighborhood as the thing it analyzes.
This is not a flaw in any particular model. It is a consequence of how language works. Safety systems that use embedding similarity to make decisions inherit the geometry — and all the ambiguity that comes with it.
Why should you care?
If you are building a content filter or reward model, this is a picture of where your rule will fire on benign content — and where it will miss harmful content that learned to sound clinical. If you are doing evals, this is a vocabulary for what “near-miss” means beyond a binary score. If you are new to alignment, this is an entry point that does not require reading papers first.
113 concept descriptions across 10 domains
99 high-overlap examples sitting at band boundaries
10 domains, biology to law to AI
02
The Geometry
Every text snippet can be turned into a list of numbers — a vector — by a sentence embedding model. The model is trained to place similar texts near each other in this high-dimensional space. “Similar” here means statistically similar: same vocabulary, same sentence structure, same conceptual patterns in the training corpus.
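Cosine similarity measures the angle between two vectors. Two texts with similarity 1.0 point in exactly the same direction. Two texts with similarity 0.0 are orthogonal: they share no direction in the model's learned space. The scale runs down to -1.0 for vectors that point in opposite directions, so 0.0 marks unrelatedness rather than the far end of the range.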
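As a minimal sketch of that measurement, the function below computes cosine similarity in plain Python. The three-dimensional vectors are toy stand-ins chosen for illustration, not real embeddings, and the function name is an assumption of this sketch rather than anything from the lab's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (the embeddings in this lab have 384 dimensions).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
```

Note that the second vector in the first pair is twice as long as the first, yet the similarity is still at the top of the scale: cosine similarity ignores magnitude and compares direction only.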
Simplified 2D projection of embedding space
When safety categories overlap in embedding space, no single threshold can cleanly separate them.
The 2D map in this lab is a UMAP projection from 384 dimensions down to 2. UMAP attempts to preserve local neighborhood structure, so points that appear close tend to have higher cosine similarity in the original embedding space. Global distances are distorted. Two clusters that look far apart may be more similar than the map suggests, and two clusters that look close may be less similar. The similarity scores shown in the detail panel and neighbor lists are computed directly in the original 384-dimensional space, not from the 2D layout.
03
The Map
2D UMAP projection of 384-dimensional embeddings (n_neighbors=15, min_dist=0.1, cosine metric). Visual distance is a teaching aid. Neighbor lists and similarity scores in the panel are computed in the original embedding space.
Click any point to explore its safety band, domain, framing, and nearest cross-band neighbors.
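One way to picture a cross-band neighbor list is as a ranking by cosine similarity in the original space, restricted to items from a different band. The sketch below assumes a simple dictionary layout; the band labels, item ids, and three-dimensional vectors are hypothetical and exist only to make the example runnable.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cross_band_neighbors(query_id, items, k=3):
    """Return the k most similar items whose band differs from the query's.

    `items` maps an id to {"band": str, "vector": list of floats}.
    """
    query = items[query_id]
    scored = [
        (other_id, cosine_similarity(query["vector"], other["vector"]))
        for other_id, other in items.items()
        if other_id != query_id and other["band"] != query["band"]
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy data: 3-dimensional stand-ins for 384-dimensional embeddings.
items = {
    "a": {"band": "band_1", "vector": [1.0, 0.0, 0.0]},
    "b": {"band": "band_1", "vector": [0.9, 0.1, 0.0]},
    "c": {"band": "band_2", "vector": [0.8, 0.6, 0.0]},
    "d": {"band": "band_3", "vector": [0.0, 0.0, 1.0]},
}
neighbors = cross_band_neighbors("a", items, k=2)
```

Item "b" is the closest point to "a" overall, but it shares a band with "a" and is therefore excluded. The ranking uses the full vectors and never consults a 2D layout.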
04
Boundary Blur
The boundary blur score measures how evenly an example sits near multiple safety band centroids at once. A score near 1.0 means the concept does not anchor firmly within any single cluster. A score near 0.0 means it sits clearly within one band.
High blur is not a danger signal. It is a geometric property. Many high-blur concepts are straightforwardly benign — they simply use language that crosses categorical boundaries.
Scores in this dataset cluster in the 0.99-1.00 range. This is expected: all-MiniLM-L6-v2 produces cosine similarities in a compressed range across general-topic text, and normalized Shannon entropy over three centroids saturates quickly. The ranking is meaningful even when absolute values are close. A concept scoring 0.994 is meaningfully less anchored than one scoring 1.000. This is an exploration heuristic, not a calibrated risk score.
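The saturation is easier to see with numbers. The sketch below computes a normalized Shannon entropy over three centroid similarities. It assumes the similarities are positive and are rescaled to sum to 1 before the entropy is taken; the lab's exact normalization is not specified here, so treat that step, the function name, and the sample values as illustrative assumptions.

```python
import math

def boundary_blur(centroid_similarities):
    """Normalized Shannon entropy of similarities to each band centroid.

    Assumption: similarities are strictly positive and are rescaled to
    sum to 1 so they can be treated as a probability distribution.
    """
    if len(centroid_similarities) < 2:
        raise ValueError("need at least two centroids")
    if any(s <= 0 for s in centroid_similarities):
        raise ValueError("this sketch assumes strictly positive similarities")
    total = sum(centroid_similarities)
    probs = [s / total for s in centroid_similarities]
    entropy = -sum(p * math.log(p) for p in probs)
    # Dividing by log(n) maps the result into the range 0 to 1.
    return entropy / math.log(len(probs))

# Invented similarity triples, chosen to show how the score behaves.
evenly_spread = boundary_blur([0.30, 0.30, 0.30])
slightly_anchored = boundary_blur([0.36, 0.30, 0.27])
strongly_anchored = boundary_blur([0.90, 0.05, 0.05])
```

Under these assumptions, a visible preference for one centroid (0.36 against 0.30 and 0.27) still scores above 0.99, which is why the ranking carries more information than the absolute value. Only a lopsided triple like the third pulls the score well away from 1.0.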
Top 30 by boundary blur. Select a concept to see its centroid similarity breakdown.
05
Compare
Select any two examples to see their cosine similarity in full embedding space. Similarity here is a measurement of how the model represents them — not a judgment about whether they should be treated the same way.
Cosine similarity: 0.3170
Lower similarity. The embedding model distinguishes these reasonably well, assuming similar phrasing.
How CRISPR-Cas9 gene editing works
Biology · Educational
Explanation of CRISPR-Cas9 mechanism and applications in medicine and research.
Overlap 0.70 · Blur 1.00
Antibiotic resistance mechanisms in bacteria
Biology · Technical
Mechanisms of antibiotic resistance and relevance to drug design and stewardship.
Overlap 0.80 · Blur 1.00
06
Why This Matters for Safety
Safety categories rarely fail in neat boxes. When a classifier draws a line in embedding space, that line is geometrically simple: a similarity threshold or a linear boundary. The actual distribution of harmful and benign content is not.
This creates two structural problems. First, false positives: legitimate content that lives near restricted content gets caught by the same rule. A clinical description of a medical risk, a security researcher's explanation, a policy analysis of extremism — each can land near the content it discusses. Second, false negatives: harmful content that successfully adopts clinical or educational language can slip through a threshold that would have caught its more blunt predecessors.
Neither problem is fully fixable by improving the threshold. Both are consequences of how the underlying representation space is shaped.
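A toy threshold sweep makes the trade-off concrete. The scores below are invented for illustration and do not come from the lab's dataset; they stand for similarity to a hypothetical restricted-content centroid, with one benign item that scores high and one harmful item that scores low.

```python
def count_errors(benign_scores, harmful_scores, threshold):
    """Flag anything scoring at or above `threshold`.

    Returns (false_positives, false_negatives).
    """
    false_positives = sum(1 for s in benign_scores if s >= threshold)
    false_negatives = sum(1 for s in harmful_scores if s < threshold)
    return false_positives, false_negatives

# Hypothetical scores, invented for this sketch.
benign = [0.20, 0.35, 0.48, 0.62]   # 0.62: e.g. a clinical description
harmful = [0.41, 0.58, 0.70, 0.83]  # 0.41: e.g. harmful content in educational phrasing

# Try every threshold from 0.00 to 1.00 in steps of 0.05.
sweep = {t / 100: count_errors(benign, harmful, t / 100) for t in range(0, 101, 5)}
best_total = min(fp + fn for fp, fn in sweep.values())
```

Because the two score ranges interleave, every threshold in the sweep leaves at least two errors. Moving the threshold only trades false positives for false negatives; it does not remove the overlap that causes both.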
Intended uses
- Researchers and students studying embedding-based classification
- Safety practitioners building intuition for where thresholds will fail
- Policy analysts explaining the representation overlap problem
- Educators building curricula around AI safety and dual-use technology
Not intended for
- Making moderation decisions: this is not a classifier
- Building training datasets: do not use this to label data
- Justifying restrictions or permissions based on proximity
- Claiming any safety system is correct or incorrect
07
Methods
Hand-curated dataset
data/safe_examples_seed.csv
Sentence embeddings
Cosine similarity matrix
UMAP projection
Boundary blur score
Static export
scripts/ and exported to public/data/. The web app loads JSON at build time. No model inference happens at runtime.
Limitations
Projection distortion. The 2D map loses information. Points that appear close may be further apart in 384-dimensional space.
Model bias. all-MiniLM-L6-v2 encodes the statistical patterns of its training corpus, including biases about language, framing, and domain. The neighborhoods reflect those biases.
Non-representative dataset. The 113 examples were designed to illustrate overlap and are not a sample of any real query distribution. Do not use as a benchmark.
Editorial categories. The five safety bands are editorial judgments, not ground truth validated against any benchmark.
Model scope. all-MiniLM-L6-v2 is a small general-purpose semantic similarity model, not a frontier language model. Its neighborhoods reflect statistical patterns in general text, not the internal representations of deployed safety classifiers or frontier chat models. Results should not be generalized to those systems.
Single metric. Cosine similarity in this embedding space is one view of conceptual proximity. Different embedding models, tokenizers, or distance metrics can produce substantially different neighborhoods. The patterns shown here are specific to all-MiniLM-L6-v2 and cosine similarity.
Heuristic scores. The overlap score and boundary blur score are geometric exploration heuristics. They are not calibrated against any safety benchmark and should not be used to draw conclusions about real-world safety system performance or content risk.
Intended scope. This project is intended to build intuition about embedding geometry, not to settle any technical question about AI safety systems. Treat it as a starting point for thinking, not a research result.
Full artifact metadata (embedding model, UMAP parameters, generation timestamp) is available at /data/metadata.json.
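For readers who want a feel for the similarity-matrix and static-export steps named under Methods, here is a rough sketch in plain Python. The JSON layout, function names, and rounding are assumptions made for this example and are not the lab's actual export format; a real pipeline would more likely use NumPy and write the result to a file.

```python
import json
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def similarity_matrix(vectors):
    """Pairwise cosine similarities: dot products of unit-length vectors."""
    unit = [normalize(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def export_payload(ids, vectors, decimals=4):
    """Serialize ids and rounded similarities to a JSON string.

    The {"ids": ..., "similarity": ...} schema is hypothetical.
    """
    matrix = similarity_matrix(vectors)
    rounded = [[round(value, decimals) for value in row] for row in matrix]
    return json.dumps({"ids": ids, "similarity": rounded})

# Toy 2-dimensional vectors standing in for real embeddings.
payload = export_payload(["a", "b"], [[1.0, 0.0], [3.0, 4.0]])
```

Normalizing once and then taking dot products gives the same values as computing cosine similarity pair by pair. Precomputing everything into a static payload is what lets a front end show similarity scores without running a model.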
08
Start Here
If you want to go deeper on representation, safety, or the ideas behind this lab, here are honest starting points.
Representation Engineering (Zou et al., 2023)
Introduces a framework for understanding AI behavior through linear representations. Directly relevant to why embedding geometry matters for safety.
Towards Monosemanticity (Anthropic, 2023)
Shows that model internals are more entangled than clean feature boundaries suggest — the representational version of the overlap problem.
Sentence-BERT (Reimers & Gurevych, 2019)
The technical foundation for all-MiniLM-L6-v2. Understanding how the model was trained clarifies what its neighborhoods actually reflect.
UMAP: Uniform Manifold Approximation and Projection (McInnes et al., 2018)
The algorithm used for 2D projection. Knowing its assumptions — especially that it distorts global distances — is important for reading the map correctly.
On the Dangers of Stochastic Parrots (Bender et al., 2021)
Raises questions about what large language models encode and reproduce. Relevant to understanding what semantic embedding spaces actually contain.