"oy" is ѹ: 571,753 multi-character confusable pairs nobody is checking for

I tested 190 million two-letter combinations across 245 fonts. 571,753 look like a single character.

Single-character confusables are well-studied. Cyrillic а looks like Latin a; Unicode Technical Report #39 (TR39) covers that. But “rn” looking like “m”? “oy” being geometrically identical to Cyrillic ѹ? No standard covers multi-character confusables systematically.

I tested 190 million bigram-to-character comparisons across 245 fonts. 676 Latin lowercase bigrams against 133,000 single-character targets from every writing system available on macOS. Three-stage geometric filter cascade, no pixel rendering, comparison directly on font vector outlines. 571,753 unique confusable pairs discovered.

This is the first empirical multi-character confusable dataset.

The problem

Unicode Technical Report #39 maintains confusables.txt, the canonical source for character-level visual confusion data. It maps single characters to single targets: Cyrillic о to Latin o, Greek Α to Latin A. It also contains 2,103 multi-character mappings, but those were hand-curated. There is no systematic discovery methodology behind them.

The canonical example everyone cites is “rn” looking like “m” in sans-serif fonts. It appears in every Unicode security talk. Nobody had measured it across fonts until now.

Multi-character confusables are harder to detect than single-character ones. Every detection system today checks individual characters against a lookup table. None check whether a sequence of characters is visually identical to a different single character. The security implications extend beyond domains to every context where Unicode identifiers are displayed: usernames, package names, repository names, file paths. In these contexts, confusable detection either does not exist or only checks single characters.

What changed: geometric comparison on vector outlines

Previous confusable-vision work used pixel rendering + SSIM. That works for single characters but is too expensive for multi-character search. Even after accounting for per-font coverage, the search space is 190 million comparisons. SSIM on rendered pixels is not tractable at that scale.

The new approach compares directly on the font’s vector outlines (Bezier curves), with no rendering step. Two geometric metrics:

Raycasting signatures. Cast parallel rays through glyph outlines from 36 angles, count intersections. This produces a topological fingerprint (1,800 values per glyph) that captures stroke structure, counters, and holes. Precomputed for all 133,000 targets in a signature bank.
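
To make the raycasting idea concrete, here is a minimal sketch, not the confusable-vision implementation. It assumes the Bezier outlines have already been flattened to closed polygons, assumes 50 rays per angle (36 × 50 gives the 1,800 values mentioned above, but the post does not state the actual split), and spreads the angles over 180 degrees. The function name and parameters are hypothetical.

```python
import math

def raycast_signature(contours, n_angles=36, n_rays=50):
    """Count outline crossings for parallel lines at several angles.

    contours: list of closed polygons, each a list of (x, y) points.
    Returns a flat list of n_angles * n_rays crossing counts.
    """
    points = [p for contour in contours for p in contour]
    signature = []
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        # Unit normal to the ray direction; each ray is the line p . n = offset.
        nx, ny = -math.sin(theta), math.cos(theta)
        projections = [x * nx + y * ny for x, y in points]
        lo, hi = min(projections), max(projections)
        for r in range(n_rays):
            # Sample at bin centres so rays avoid the extreme vertices.
            offset = lo + (hi - lo) * (r + 0.5) / n_rays
            crossings = 0
            for contour in contours:
                for i in range(len(contour)):
                    ax, ay = contour[i]
                    bx, by = contour[(i + 1) % len(contour)]
                    side_a = ax * nx + ay * ny - offset
                    side_b = bx * nx + by * ny - offset
                    # Half-open test: a vertex lying on the ray is counted once.
                    if (side_a > 0) != (side_b > 0):
                        crossings += 1
            signature.append(crossings)
    return signature

# A solid square is crossed twice by every ray; a square with a hole
# (a counter, like the inside of "o") is crossed four times by some rays.
square = [[(0, 0), (1, 0), (1, 1), (0, 1)]]
ring = [[(0, 0), (4, 0), (4, 4), (0, 4)], [(1, 1), (3, 1), (3, 3), (1, 3)]]
print(set(raycast_signature(square)), max(raycast_signature(ring)))
```

The crossing counts change when a glyph gains or loses a counter or a stroke, which is why a signature like this can reject topological mismatches cheaply before any SDF is computed.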

Signed Distance Fields (SDF). For pairs that survive raycasting, compute a 128x128 grid where each cell stores the signed distance from that point to the nearest outline edge. Negative values are inside the glyph, positive values are outside. Two SDF grids can then be compared numerically.
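
A brute-force sketch of an SDF grid, under several assumptions: outlines are flattened polygons, inside/outside uses the even-odd rule (real fonts may use the non-zero winding rule), and every cell is checked against every edge with no spatial index. The function names and the `bounds` parameter are hypothetical, not taken from the pipeline.

```python
import math

def _dist_to_segment(px, py, ax, ay, bx, by):
    """Shortest distance from point (px, py) to segment a-b."""
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0
    else:
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def _inside(px, py, contours):
    """Even-odd point-in-polygon test across all contours."""
    inside = False
    for contour in contours:
        for i in range(len(contour)):
            ax, ay = contour[i]
            bx, by = contour[(i + 1) % len(contour)]
            if (ay > py) != (by > py):
                x_cross = ax + (py - ay) * (bx - ax) / (by - ay)
                if x_cross > px:
                    inside = not inside
    return inside

def signed_distance_field(contours, size=128, bounds=(0.0, 0.0, 1.0, 1.0)):
    """Return a size x size grid; negative inside the glyph, positive outside.

    Row 0 is the lowest y value, following font coordinates (y up).
    """
    x0, y0, x1, y1 = bounds
    grid = []
    for row in range(size):
        py = y0 + (y1 - y0) * (row + 0.5) / size
        line = []
        for col in range(size):
            px = x0 + (x1 - x0) * (col + 0.5) / size
            d = min(
                _dist_to_segment(px, py, *contour[i], *contour[(i + 1) % len(contour)])
                for contour in contours
                for i in range(len(contour))
            )
            line.append(-d if _inside(px, py, contours) else d)
        grid.append(line)
    return grid

# Small grid for a quick look; the post uses 128x128.
square = [[(0.25, 0.25), (0.75, 0.25), (0.75, 0.75), (0.25, 0.75)]]
sdf = signed_distance_field(square, size=8)
print(sdf[4][4], sdf[0][0])
```

Because every cell carries a distance rather than an on/off pixel, small outline differences show up as small numeric differences, which is what makes the grids comparable with L2 and NCC.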

L2 distance measures how different two SDF grids are. It is the root-mean-square of per-cell differences: take the difference at every grid cell, square them, average, and take the square root. Lower means more similar. L2 = 0.000 means the outlines are geometrically identical.

NCC (normalised cross-correlation) measures whether two SDF grids have the same shape, ignoring differences in scale and offset. It correlates the two grids after subtracting their means and normalising by their standard deviations. NCC = 1.0 means identical shape; lower means less similar. NCC runs in the opposite direction to L2: L2 falls as similarity increases, while NCC rises.
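
The two metrics as described above can be sketched in a few lines. This is an illustration of the definitions, not the pipeline's code: the function names are hypothetical, and returning 0.0 for a completely flat grid is my own convention. The absolute L2 values in this post depend on the SDF's distance units, which this sketch does not try to reproduce.

```python
import math

def l2_rms(grid_a, grid_b):
    """Root-mean-square of per-cell differences between two equal-sized grids."""
    diffs = [a - b for row_a, row_b in zip(grid_a, grid_b) for a, b in zip(row_a, row_b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def ncc(grid_a, grid_b):
    """Normalised cross-correlation: 1.0 means identical shape up to scale and offset."""
    xs = [v for row in grid_a for v in row]
    ys = [v for row in grid_b for v in row]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # flat grid: correlation undefined (assumed convention)
    return cov / math.sqrt(var_x * var_y)

# b is a scaled and shifted copy of a (b = 2a + 5):
# NCC still reports a perfect match, while L2 reports a large distance.
a = [[0.0, 1.0], [2.0, 3.0]]
b = [[5.0, 7.0], [9.0, 11.0]]
print(l2_rms(a, b), ncc(a, b))
```

The toy grids show why the two numbers are reported together: NCC alone would accept a match that differs in scale or offset, and L2 alone says nothing about whether the remaining difference is structured.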

Both metrics appear in the tables throughout this post. L2 is the primary ranking metric; NCC confirms the match quality.

The comparison runs as a three-stage filter cascade. The first two stages eliminate 99.07% of comparisons before SDF scoring:

| Stage | Pairs | Remaining |
| --- | --- | --- |
| Total search space | 190,091,200 | 100% |
| After width filter (advance width within 15%) | 69,823,708 | 36.7% |
| After raycasting filter (topological match, threshold 2.0) | 1,768,587 | 0.93% |
| Discoveries (SDF L2 < 15.0) | 1,543,900 | 0.81% |
| Unique bigram-to-target pairs | 571,753 | - |

The width filter alone eliminates 63.3%. Most single characters have advance widths nothing like a two-character sequence. Raycasting catches topological mismatches (different number of counters, strokes, or holes). SDF scoring handles the geometrically close survivors.
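
The cascade's value is ordering: cheap checks run first so the expensive SDF comparison only sees survivors. A minimal sketch of that control flow follows. The thresholds (15%, 2.0, 15.0) come from the table above; everything else is an assumption, including measuring the 15% relative to the target's advance width, the dictionary record shape, and the `raycast_distance` and `sdf_l2` callables, which stand in for the real stages.

```python
def width_ok(bigram_advance, target_advance, tolerance=0.15):
    """Stage 1: keep pairs whose advance widths differ by at most 15%.

    Assumption: the tolerance is measured relative to the target's width.
    """
    return abs(bigram_advance - target_advance) <= tolerance * target_advance

def run_cascade(candidates, raycast_distance, sdf_l2,
                raycast_threshold=2.0, l2_threshold=15.0):
    """Yield (bigram, target, score) for pairs that survive all three stages."""
    for bigram, target in candidates:
        if not width_ok(bigram["advance"], target["advance"]):
            continue  # stage 1: cheapest test, rejects most pairs
        if raycast_distance(bigram, target) > raycast_threshold:
            continue  # stage 2: topological mismatch
        score = sdf_l2(bigram, target)  # stage 3: expensive, survivors only
        if score < l2_threshold:
            yield bigram["name"], target["name"], score

# In a monospace font a bigram is twice as wide as any single glyph,
# so stage 1 rejects it immediately.
print(width_ok(1200, 600))
```

With this shape, the width filter never touches outline data at all, which is consistent with it being the stage that removes the largest share of the search space.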

This is font-specific by design. The same bigram may be confusable in Helvetica (harmonised metrics) but not in Papyrus (decorative letterforms). The data captures per-font variation.

91 minutes total runtime. 13 worker threads with spatially-indexed SDF computation. Monospace fonts correctly produce zero discoveries (bigrams always have double width). Single-script Noto fonts correctly produce zero cross-script matches.

Headline findings

“oy” is ѹ

L2 = 0.000 in Helvetica. Geometrically identical outlines. Confirmed across 16 fonts.

[Figure: Latin "oy" vs Cyrillic ѹ rendered side by side in 5 fonts]

| Font | L2 | NCC |
| --- | --- | --- |
| Helvetica | 0.000 | 1.0000 |
| Arial Unicode MS | 0.033 | 1.0000 |
| Geneva | 0.093 | 1.0000 |
| DIN Condensed | 1.060 | 0.9981 |
| Iowan Old Style | 1.162 | 0.9939 |
| Charter | 1.173 | 0.9949 |
| Helvetica Neue | 1.778 | 0.9886 |
| Superclarendon | 1.801 | 0.9877 |
| Tahoma | 1.888 | 0.9846 |
| Seravek | 1.899 | 0.9842 |
| Microsoft Sans Serif | 2.511 | 0.9744 |
| Times New Roman | 3.004 | 0.9751 |
| Arial | 3.499 | 0.9488 |
| Lucida Grande | 4.091 | 0.9330 |
| Copperplate | 4.479 | 0.9359 |
| Snell Roundhand | 5.397 | 0.9441 |

U+0479 (ѹ) is Cyrillic “uk”, a historical digraph from Old Church Slavonic. Font designers draw it as o+y because that is what it is. The outlines are not merely similar; in Helvetica, they are the same curves.

This is the multi-character analogue of Cyrillic о matching Latin o. But no detection system checks for it.

The IDN attack is tempting to reach for here, but the reality is more nuanced. Three layers of protection exist, each independent:

ICANN Root Zone LGR blocks domain registration. The Cyrillic LGR contains only 86 code points focused on modern living languages. U+0479 is classified as “obsolete (Old Church Slavonic)” and excluded. The Russian and Ukrainian reference LGRs (33 code points each) do not include it either. A domain containing ѹ is not registrable in practice.

TR39 IdentifierStatus blocks well-implemented identifier systems. U+0479 is not listed in IdentifierStatus.txt and defaults to Restricted. Its IdentifierType is explicitly Obsolete (range 0460..0481, 34 Old Church Slavonic characters). But that restriction is based on script classification, not visual similarity. No one restricted it because it looks like “oy”; it was restricted because Old Church Slavonic characters are not in modern use.

confusables.txt does not cover it at all. U+0479 does not appear anywhere in confusables.txt. Not as a source, not as a target, not mapped to “oy” or to anything else. The geometric identity with “oy” has never been catalogued.

The gap is real for any platform that accepts broad Unicode without implementing TR39 restrictions. Usernames, display names, package names, repository names: these contexts typically have no confusable detection, let alone multi-character confusable detection. A username “compoy” vs “compѹ” on a platform with permissive Unicode policies renders identically in Helvetica. The geometric identity is real. The gap it reveals in multi-character confusable detection is real. And if oy/ѹ exists undiscovered between an obsolete Cyrillic character and Latin, what multi-character confusables exist between characters that are in the modern IDN repertoire? That is the argument for extending to Cyrillic and Greek bigrams in the next phase.
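
To see why per-character checks miss this pair, it helps to look at the two strings as code points. The sketch below uses only Python's standard `unicodedata` module. The usernames are the ones from the paragraph above, and the variable names are mine.

```python
import unicodedata

latin = "compoy"       # six Latin letters
spoof = "comp\u0479"   # four Latin letters plus U+0479

for label, text in (("latin", latin), ("spoof", spoof)):
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(label, len(text), codepoints)

# The final character is a single code point, not a sequence.
print(unicodedata.name("\u0479"))

# The strings differ, and compatibility normalisation does not
# fold one into the other.
print(latin == spoof)
print(unicodedata.normalize("NFKC", spoof) == latin)
```

The strings have different lengths (six code points versus five), so a check that walks both strings character by character never lines "oy" up against ѹ. That is the structural reason a single-character lookup table cannot express this confusable.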

The “oy” to ѹ confusable also extends to related bigrams. “ey”, “cy”, “ay”, and “uy” all match ѹ across 15-16 fonts, with the first character’s shape being close enough to the о component.

IPA ligatures: designed to be confusable

These are literally designed as ligatures of their component letters. The pipeline discovers them because the geometric comparison sees no difference between the bigram and the single codepoint.

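| Bigram | Target | Codepoint | Best L2 | Fonts |
| --- | --- | --- | --- | --- |
| ts | ʦ | U+02A6 | 0.805 | 15 |
| ls | ʪ | U+02AA | 0.940 | 3+ |
| dz | ʣ | U+02A3 | 1.226 | 13 |
| lz | ʫ | U+02AB | 1.114 | 3 |
| ds | ʥ | U+02A5 | 3.132 | 9 |
| ts | ʨ | U+02A8 | 1.933 | 15 |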

These validate the methodology. If the pipeline did not find IPA ligatures, something would be wrong.

“rn” is “m”

The textbook multi-character confusable, confirmed empirically at L2 = 1.519 in Shree Devanagari 714. Detected across 95 fonts total (the threshold is generous at L2 < 15.0).

The top 10 fonts for “rn” to m:

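| Font | L2 | NCC |
| --- | --- | --- |
| Shree Devanagari 714 | 1.519 | 0.9912 |
| InaiMathi | 1.884 | 0.9923 |
| Khmer Sangam MN | 1.996 | 0.9911 |
| System Font | 2.104 | 0.9878 |
| Khmer MN | 2.929 | 0.9871 |
| Arial Narrow | 3.144 | 0.9855 |
| Myanmar Sangam MN | 3.295 | 0.9793 |
| Arial | 3.345 | 0.9608 |
| Bangla Sangam MN | 3.375 | 0.9651 |
| Optima | 3.385 | 0.9613 |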

Arial scores L2 = 3.345, not the tightest match. The tightest is Shree Devanagari 714, where the Latin letterforms have particularly uniform stroke widths and minimal kerning between r and n.

The broader “m-like” pattern shows up at the generous L2 < 15.0 threshold, but the quality varies sharply:

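| Bigram | Fonts | Best L2 | Mean L2 | Genuine? |
| --- | --- | --- | --- | --- |
| rn | 95 | 1.519 | 4.097 | Yes, arches merge |
| ra | 102 | 1.907 | 4.097 | Marginal |
| rs | 105 | 2.386 | 4.411 | Threshold noise |
| re | 108 | 3.142 | 4.721 | Threshold noise |
| rc | 107 | 3.130 | 4.807 | Threshold noise |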

“re”, “rc”, and “rs” appear in more fonts than “rn” because at high L2 values (7-15), many bigrams have vaguely m-shaped silhouettes. The r’s terminal combined with the next character’s vertical stroke creates two humps. These are not confusable to a human. The genuine signal is in the best-L2 column: “rn” at 1.519 is a real match. “re” at 3.142 is geometrically detectable but visually distinct.

Cross-script: Latin bigrams matching non-Latin singletons

The novel security-relevant findings are bigrams matching characters from different scripts:

“uc”/“as”/“ue”/“oc” match Cyrillic small letter zhe (ж, U+0436) in Savoye LET. A decorative script font where the cursive letterforms converge. L2 ranges from 1.428 to 1.702. Font-specific; would not generalise to sans-serif.

“ri” matches Latin n in Telugu Sangam MN and Myanmar Sangam MN. The Indic font’s Latin letterforms make r+i collapse visually to n. L2 = 1.405 and 1.458 respectively.

“ll” matches ǁ (lateral click, U+01C1) in Papyrus (L2 = 0.394) and Arial Unicode MS (L2 = 1.180). In fonts where l has no serifs and consistent spacing, two l’s are geometrically identical to ǁ.

High-confidence cross-font pairs

322 unique bigram-to-target pairs appear in 3 or more fonts with mean L2 below 5.0. The top 20:

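| Bigram | Target | Codepoint | Fonts | Best L2 | Mean L2 |
| --- | --- | --- | --- | --- | --- |
| oy | ѹ | U+0479 | 16 | 0.000 | 2.117 |
| ey | ѹ | U+0479 | 16 | 1.189 | 3.020 |
| lb | ℔ | U+2114 | 5 | 1.985 | 3.128 |
| cy | ѹ | U+0479 | 15 | 1.116 | 3.304 |
| ib | ℔ | U+2114 | 5 | 2.111 | 3.331 |
| sx | ꚅ | U+A685 | 7 | 2.782 | 3.449 |
| ts | ʨ | U+02A8 | 15 | 1.933 | 3.521 |
| ds | ʥ | U+02A5 | 9 | 3.132 | 3.575 |
| rp | ȹ | U+0239 | 8 | 2.565 | 3.627 |
| lz | ʫ | U+02AB | 3 | 1.114 | 3.641 |
| da | ʣ | U+02A3 | 6 | 2.776 | 3.693 |
| rb | ȸ | U+0238 | 8 | 2.624 | 3.775 |
| dc | ʣ | U+02A3 | 12 | 3.201 | 3.782 |
| dz | ʣ | U+02A3 | 13 | 1.226 | 3.797 |
| za | ѭ | U+046D | 6 | 3.247 | 3.831 |
| ay | ѹ | U+0479 | 16 | 1.587 | 3.878 |
| re | ᵫ | U+1D6B | 13 | 2.190 | 3.990 |
| is | ʦ | U+02A6 | 16 | 2.254 | 4.036 |
| sa | ꚅ | U+A685 | 7 | 2.551 | 3.959 |
| uy | ѹ | U+0479 | 16 | 2.130 | 4.075 |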

Cross-font consistency is the strongest signal for genuine confusables. A pair appearing in 16 fonts with mean L2 under 3 is not noise.

The noise: what the threshold catches that it shouldn’t

Before looking at the score distribution, context on what the generous L2 < 15.0 threshold lets through.

DIN Condensed dominates the discovery list with 30,056 entries. Its top bigrams (“tj”, “lj”, “ij”, “ti”) each match 400+ single-character targets. 18.3% of DIN Condensed discoveries are accented Latin characters (Latin Extended + Greek Extended ranges).

[Figure: DIN Condensed "ti" vs accented single characters]

The mechanism: DIN Condensed is a narrow typeface. The thinned-down t+i combination falls within L2 threshold of characters like u with macron, u with breve, a with tilde. Geometrically close in a condensed font. A human would not confuse them. The diacritics are visible, just small.

The median DIN Condensed L2 is 8.021, and only 46 of its 30,056 discoveries have L2 below 2.0. This is a threshold calibration issue, not a pipeline failure. Two improvements would address it: tighter per-font-class thresholds (condensed fonts need stricter L2), or a secondary filter that detects diacritic-sized features in the SDF difference field.

The raw dataset includes all discoveries with their scores. Consumers can filter by L2 < 5.0 for high-confidence pairs, or require cross-font consistency of 3+ fonts.

L2 score distribution

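| L2 range | Discoveries | Cumulative % |
| --- | --- | --- |
| 0.0 - 0.5 | 4 | 0.0% |
| 0.5 - 1.0 | 2 | 0.0% |
| 1.0 - 1.5 | 24 | 0.0% |
| 1.5 - 2.0 | 96 | 0.0% |
| 2.0 - 3.0 | 1,029 | 0.1% |
| 3.0 - 4.0 | 5,308 | 0.4% |
| 4.0 - 5.0 | 17,706 | 1.6% |
| 5.0 - 7.0 | 114,500 | 9.0% |
| 7.0 - 10.0 | 497,910 | 41.2% |
| 10.0 - 15.0 | 907,321 | 100.0% |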

The distribution is heavily weighted toward high L2 values. 126 discoveries have L2 below 2.0: near-identical outlines. 24,169 have L2 below 5.0: geometrically close enough that a human might confuse them. The bulk of the discoveries, from 7.0 to 15.0, is mostly noise that a tighter threshold would eliminate.

Font coverage

The top 15 fonts by discovery count:

| Font | Discoveries |
| --- | --- |
| Arial Unicode MS | 141,477 |
| Heiti SC | 103,676 |
| Apple SD Gothic Neo | 91,505 |
| Heiti TC | 78,254 |
| Hiragino Sans | 63,917 |
| Apple Symbols | 58,127 |
| Hiragino Maru Gothic Pro | 55,760 |
| Hiragino Sans GB | 52,713 |
| Hiragino Kaku Gothic Std | 43,877 |
| STIX Two Math | 41,773 |
| Hiragino Kaku Gothic ProN | 40,109 |
| Geneva | 34,922 |
| DIN Condensed | 30,056 |
| Euphemia UCAS | 29,140 |
| Tahoma | 28,003 |

CJK fonts dominate because they contain the most glyphs. Arial Unicode MS has 35,841 targets in the signature bank, so it has the most opportunities for bigram matches. Fonts with zero discoveries: Andale Mono, Courier New, Menlo, Monaco, PT Mono (all monospace), Ayuthaya, Chalkduster, and all 103 Noto single-script fonts.

The monospace result is a built-in validation: bigrams in monospace fonts are always exactly double the width of any single character, so the width filter eliminates 100% of comparisons.

Overlap with TR39

Of the 2,103 multi-character mappings in confusables.txt, 1,848 were scored in the earlier Phase 4 analysis. 15 of those appear as discoveries in Phase 5:

| Mapping | Phase 4 | Phase 5 |
| --- | --- | --- |
| rn to m | Scored | Rediscovered |
| dz to ʣ | Scored | Rediscovered |
| ts to ʦ | Scored | Rediscovered |
| ls to ʪ | Scored | Rediscovered |
| lz to ʫ | Scored | Rediscovered |
| ae to æ | Scored | Rediscovered |
| ae to ӕ | Scored | Rediscovered |
| oe to œ | Scored | Rediscovered |
| ll to ǁ | Scored | Rediscovered |
| ll to װ | Scored | Rediscovered |
| ll to ‖ | Scored | Rediscovered |
| ll to ∥ | Scored | Rediscovered |
| oo to ꚙ | Scored | Rediscovered |
| oo to ∞ | Scored | Rediscovered |
| lt to ₶ | Scored | Rediscovered |

The low overlap (15/1,848) is expected. Phase 4 scored known TR39 multi-character mappings where the source is a single character that decomposes to a multi-character target (e.g. ʣ to “dz”). Phase 5 searches in the opposite direction: Latin bigrams that match single characters (e.g. “dz” to ʣ). The confusables.txt multi-character mappings are mostly decompositions (single char to multi-char sequence), not compositions (multi-char to single char). Only 15 TR39 entries have a Latin-lowercase bigram as their target sequence and a single character as their source that also appears in the Phase 5 signature bank. These are complementary datasets.

The remaining 571,738 Phase 5 pairs have no corresponding entry in TR39’s multi-character mappings. These are novel discoveries.

What this doesn’t cover yet

Latin lowercase only. 676 bigrams from a-z. Mixed case, digits, and cross-script source bigrams (Cyrillic, Greek, Devanagari) are the next phase. The architecture supports this without changes; just wider input loops.

Static kern table only. The pipeline applies kerning from the font’s kern table but does not run a full OpenType shaping engine (GSUB/GPOS). For Latin bigrams this is sufficient. For Arabic, Devanagari, and other complex scripts with contextual shaping, the pipeline would need HarfBuzz integration.

No trigrams. “rn” is a bigram. Longer sequences expand the search space exponentially. The filter cascade can handle it, but the combinatorics need a smarter enumeration strategy than exhaustive search.

macOS system fonts only. 245 fonts from the macOS font library. Windows and Linux font sets differ. The architecture is portable; the font registry is not.

L2 threshold is generous. 15.0 catches genuine confusables and noise. The curated dataset (L2 < 5.0 or cross-font consistency >= 3 fonts) is the recommended consumption format.
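
On the trigram point above, a back-of-envelope calculation shows how quickly the search space grows. It rests on one assumption: that the 190,091,200 total is the number of bigrams multiplied by the per-font target counts summed over all fonts. The total does divide exactly by 676, which is consistent with that reading, but the post does not state it. The constant names are mine.

```python
ALPHABET = 26
BIGRAMS = ALPHABET ** 2             # 676 Latin lowercase bigrams
TOTAL_COMPARISONS = 190_091_200     # from the filter cascade table

# Assumption: total = bigrams x (per-font targets summed over all fonts).
assert TOTAL_COMPARISONS % BIGRAMS == 0
TARGETS_SUMMED = TOTAL_COMPARISONS // BIGRAMS

TRIGRAMS = ALPHABET ** 3
trigram_comparisons = TRIGRAMS * TARGETS_SUMMED

print(f"targets summed over fonts: {TARGETS_SUMMED:,}")
print(f"trigrams: {TRIGRAMS:,}")
print(f"trigram comparisons: {trigram_comparisons:,}")
print(f"growth factor: {trigram_comparisons // TOTAL_COMPARISONS}x")
```

Each extra letter multiplies the exhaustive search by 26, so under this assumption trigrams alone would be roughly 4.9 billion comparisons before any filtering.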

What this means for detection systems

No detection system currently checks multi-character confusables at this granularity. TR39’s skeleton algorithm maps sequences to sequences, but the confusable data it draws from has only 2,103 multi-character entries, hand-curated, not discovered empirically.

The dataset enables a new class of check: does this string contain a substring that is geometrically identical to a character from a different script in the user’s font? That is not a lookup-table problem. It is a substring scanning problem, requiring different architecture from current confusable detection.
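
A minimal sketch of what substring scanning could look like, as opposed to a per-character lookup. The table below is a tiny hand-picked sample of pairs reported in this post, not the dataset, and it ignores fonts and scores entirely. The function names are hypothetical and this is not namespace-guard's API.

```python
# Bigram -> single characters it can be mistaken for (sample entries only).
MULTICHAR_CONFUSABLES = {
    "oy": ["\u0479"],   # Cyrillic uk
    "rn": ["m"],
    "ll": ["\u01c1"],   # lateral click
    "ts": ["\u02a6"],   # ts digraph
}

def scan_bigrams(identifier, table=MULTICHAR_CONFUSABLES):
    """Return (index, bigram, look-alike) for every confusable bigram found."""
    hits = []
    for i in range(len(identifier) - 1):
        bigram = identifier[i:i + 2]
        for target in table.get(bigram, ()):
            hits.append((i, bigram, target))
    return hits

def collapsed_variants(identifier, table=MULTICHAR_CONFUSABLES):
    """Strings that look like `identifier` with one bigram swapped for a single character."""
    return [
        identifier[:i] + target + identifier[i + 2:]
        for i, _bigram, target in scan_bigrams(identifier, table)
    ]

print(scan_bigrams("compoy"))
print(collapsed_variants("modern"))
```

A registry could use the variants in either direction: generate them for an existing name and reserve them, or collapse a newly requested name and check for collisions. A real implementation would also need the per-font scores to decide which hits matter.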

The practical attack surface is not IDN domains (where ICANN’s Root Zone LGR restricts the repertoire to modern characters) but rather the broader space of Unicode identifiers: usernames on social platforms, package names on registries like npm and PyPI, GitHub repository names, and display names in messaging apps. These contexts typically accept any PVALID character without restricting to a curated repertoire. Multi-character confusable detection does not exist in any of them.

The scored dataset will feed the same weight-based detection as single-character confusables in a future namespace-guard release. Multi-character confusable distance will be available per-font, same as single-character pairs.

How to reproduce

git clone https://github.com/paultendo/confusable-vision
cd confusable-vision
npm install

# Build signature bank (one-time, ~45 min)
npm run build-signature-bank

# Run multi-character discovery (~91 min)
npm run discover-multichar-sdf

# Output: data/output/multichar-discoveries-sdf.jsonl

The JSONL output contains one line per discovery with bigram, target character, target codepoint, font, SDF L2 distance, SDF NCC, and raycasting distance. Filter with jq 'select(.sdfL2 < 5.0)' for high-confidence pairs.
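
For the cross-font view, a jq one-liner is not enough because records have to be grouped. Here is a hedged Python sketch that applies the "3 or more fonts with mean L2 below 5.0" rule. Only the `sdfL2` key is confirmed by the jq filter above; the `bigram`, `target`, and `font` keys are guesses at the record layout and may need renaming to match the real file.

```python
import json
from collections import defaultdict

def high_confidence_pairs(lines, max_mean_l2=5.0, min_fonts=3):
    """Group JSONL discovery records by (bigram, target) and keep consistent pairs.

    Assumed keys: "bigram", "target", "font" (hypothetical) and "sdfL2".
    """
    per_pair = defaultdict(dict)  # (bigram, target) -> {font: best L2 in that font}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        key = (record["bigram"], record["target"])
        font, l2 = record["font"], record["sdfL2"]
        per_pair[key][font] = min(l2, per_pair[key].get(font, l2))

    results = []
    for (bigram, target), by_font in per_pair.items():
        scores = list(by_font.values())
        mean_l2 = sum(scores) / len(scores)
        if len(scores) >= min_fonts and mean_l2 < max_mean_l2:
            results.append({
                "bigram": bigram,
                "target": target,
                "fonts": len(scores),
                "bestL2": min(scores),
                "meanL2": mean_l2,
            })
    results.sort(key=lambda row: row["meanL2"])
    return results
```

The function accepts any iterable of lines, so an open file handle on data/output/multichar-discoveries-sdf.jsonl can be passed to it directly.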

Series context

This post is part of the confusable-vision series: