793 Unicode characters look like Latin letters but aren't (yet) in confusables.txt

I scanned 23,317 identifier-safe codepoints across 230 fonts. Nearly half the discoveries are vertical strokes from obscure scripts.

In the previous post, I validated every TR39 confusable pair visually: render both characters, measure SSIM, put a number on it. That covered 1,418 pairs already in confusables.txt. But confusables.txt only maps characters that the Unicode Consortium has reviewed. What about the rest of Unicode?

The natural next question: are there characters outside confusables.txt that look like Latin letters? Characters that a human would confuse, but that no confusable detection system currently flags?

The answer is yes. 793 of them, from 96 different scripts.

The search space

Milestone 2 of confusable-vision scans every Unicode character that meets three criteria: identifier-safe (General Category Letter or Number), not already in confusables.txt, and not CJK/Hangul (excluded for this pass because logographic characters are structurally very different from Latin, though a future M2b scan should verify that assumption). That gives 23,317 candidates.
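The three criteria can be sketched as a single predicate. This is a minimal illustration, not the code in build-candidates.ts: the function name `isCandidate`, the `knownConfusables` set, and the choice of `Script=Han` plus `Script=Hangul` as the "CJK/Hangul" exclusion are all assumptions.

```typescript
// Sketch only: the script exclusions below are an assumption, not the repo's exact rule.
// RegExp objects are built from strings so the \p{...} escapes are parsed at runtime.
const LETTER_OR_NUMBER = new RegExp("^[\\p{L}\\p{N}]$", "u");
const HAN_OR_HANGUL = new RegExp("^[\\p{Script=Han}\\p{Script=Hangul}]$", "u");

/** Returns true if a codepoint passes all three M2 criteria. */
function isCandidate(codePoint: number, inConfusables: Set<number>): boolean {
  const ch = String.fromCodePoint(codePoint);
  return (
    LETTER_OR_NUMBER.test(ch) &&      // General Category Letter or Number
    !inConfusables.has(codePoint) &&  // not already mapped in confusables.txt
    !HAN_OR_HANGUL.test(ch)           // CJK/Hangul excluded for this pass
  );
}

// Hypothetical stand-in for the parsed source column of confusables.txt.
const knownConfusables = new Set<number>([0x0430]); // CYRILLIC SMALL LETTER A
```

With this stand-in set, a Gothic letter such as U+10347 passes, while U+0430 (already in confusables.txt), U+4E2D (Han), and U+0021 (punctuation) are rejected.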

Not every candidate has font coverage. Fontconfig found at least one system font for 12,555 characters (53.8%). The remaining 10,762 have no installed font that contains them, so they cannot be rendered or scored. The covered candidates average 7.1 fonts each.

Each candidate is compared against 36 Latin targets: a-z and 0-9. The pipeline:

flowchart TD
A["23,317 candidates"]:::muted --> B["fontconfig<br/>per-character query"]:::logic
B --> C["12,555 with coverage<br/>(53.8%)"]:::muted
C --> D["89,478 targeted<br/>render jobs"]:::logic
D --> E["pHash prefilter<br/>+ SSIM scoring"]:::accent
E --> F["793 novel<br/>high-risk pairs"]:::muted

Scaling the scoring

The combinatorial challenge is real. 12,555 candidates times 36 targets times multiple fonts per character gives hundreds of millions of naive comparisons. Two optimisations made this tractable:

Same-font pHash prefilter. Before computing SSIM (expensive), each candidate/target pair in the same font is compared using perceptual hash distance. Pairs with pHash similarity below 0.3 are skipped. This eliminates the vast majority of structurally dissimilar pairings at near-zero cost.
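A rough sketch of the prefilter decision, assuming perceptual hashes are stored as equal-length hex strings and that "pHash similarity" means one minus the normalised Hamming distance. The post does not state how confusable-vision defines the similarity, so `hashSimilarity` and `shouldScore` are hypothetical names and the definition is an assumption; only the 0.3 threshold comes from the text.

```typescript
const PHASH_THRESHOLD = 0.3; // threshold quoted in the post

/** Similarity in [0, 1]: 1 minus the fraction of differing bits (assumed definition). */
function hashSimilarity(a: string, b: string): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("hashes must be non-empty and of equal length");
  }
  let differingBits = 0;
  for (let i = 0; i < a.length; i++) {
    // XOR the two hex digits, then count the set bits in the 4-bit result.
    let x = parseInt(a.charAt(i), 16) ^ parseInt(b.charAt(i), 16);
    while (x) {
      differingBits += x & 1;
      x >>= 1;
    }
  }
  return 1 - differingBits / (a.length * 4);
}

/** True if the pair is similar enough to be worth the expensive SSIM pass. */
function shouldScore(candidateHash: string, targetHash: string): boolean {
  return hashSimilarity(candidateHash, targetHash) >= PHASH_THRESHOLD;
}
```

The point of the design is that the hash comparison costs a few integer operations, so it can run on every pair, while SSIM runs only on the survivors.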

Top-1-by-pHash cross-font selection. For candidates in non-standard fonts (Noto Sans variants, CJK fonts, etc.), instead of comparing against all 74 target renders for each Latin letter, the pipeline finds the single best target render by pHash similarity and computes SSIM only for that pair. This cuts cross-font scoring from 74 SSIM computations to one for each source render and target letter.
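The selection step can be sketched generically: rank the target renders with the cheap measure, then run the expensive measure once on the winner. The names `pickBestTarget` and `scoreCrossFont` are illustrative, and both scoring functions are passed in as parameters rather than taken from the repo.

```typescript
/** Returns the target with the highest cheap similarity, or undefined if there are none. */
function pickBestTarget<T>(
  source: T,
  targets: T[],
  cheapSimilarity: (a: T, b: T) => number
): T | undefined {
  let best: T | undefined;
  let bestScore = -Infinity;
  for (const target of targets) {
    const score = cheapSimilarity(source, target);
    if (score > bestScore) {
      bestScore = score;
      best = target;
    }
  }
  return best;
}

/** Runs the expensive comparison exactly once, against the top-1 target only. */
function scoreCrossFont<T>(
  source: T,
  targets: T[],
  cheapSimilarity: (a: T, b: T) => number,
  expensiveSimilarity: (a: T, b: T) => number
): number | undefined {
  const best = pickBestTarget(source, targets, cheapSimilarity);
  return best === undefined ? undefined : expensiveSimilarity(source, best);
}
```

The trade-off is that the pair with the best SSIM is not guaranteed to be the pair with the best hash similarity, so this can under-report a score in exchange for the speedup.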

Result: 2,904,376 SSIM comparisons in 928 seconds (15.5 minutes), approximately 3,130 SSIM/second. The output is a 572 MB JSON file (candidate-scores.json), written as a stream from Node rather than built as one string, which avoids the V8 string length limit. extract-discoveries.ts then stream-parses it into a 1.5 MB candidate-discoveries.json containing the 793 high-scoring pairs.
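To illustrate why streaming sidesteps the string limit, here is a minimal sketch that serialises one record at a time, so no single string ever holds the whole file. It is not the writer used in score-candidates.ts; `writeJsonArray` and the `write` callback are hypothetical.

```typescript
/**
 * Emits a JSON array chunk by chunk. Each call to `write` receives at most
 * one serialised record, so memory use is bounded by the largest record.
 */
function writeJsonArray<T>(records: Iterable<T>, write: (chunk: string) => void): void {
  write("[");
  let first = true;
  for (const record of records) {
    write((first ? "" : ",") + JSON.stringify(record));
    first = false;
  }
  write("]");
}

// Usage with an in-memory sink; the records are invented for illustration.
const chunks: string[] = [];
writeJsonArray(
  [{ source: "U+A9D0", target: "o" }, { source: "U+07D5", target: "b" }],
  (chunk) => { chunks.push(chunk); }
);
```

In Node, `write` could forward to the `write` method of a stream returned by `fs.createWriteStream`; a real implementation would also need to respect backpressure, which this sketch ignores.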

The headline: 793 novel confusables

| Band | Count | % | Description |
| --- | --- | --- | --- |
| High (>= 0.7) | 793 | 0.2% | Novel confusables not in TR39 |
| Medium (0.3-0.7) | 34,522 | 8.1% | Somewhat similar |
| Low (< 0.3) | 391,194 | 91.7% | Not visually confusable |
| Total | 426,509 | | |

The hit rate is lower than M1b (0.2% vs 3.5%). That is expected: M1b scored a curated list that the Unicode Consortium already flagged as confusable. M2 searches the full remaining space, where most characters look nothing like Latin.
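The banding in the table reduces to two thresholds. A small sketch, with `bandFor` as an illustrative name; treating a score of exactly 0.3 as Medium follows from the table's "Low (< 0.3)" row.

```typescript
type Band = "high" | "medium" | "low";

/** Maps an SSIM score to the bands used in the table above. */
function bandFor(ssim: number): Band {
  if (ssim >= 0.7) return "high";   // novel confusable
  if (ssim >= 0.3) return "medium"; // somewhat similar
  return "low";                     // not visually confusable
}

// Share of scored pairs in the High band, from the table's counts.
const highShare = 793 / 426509;
```

`highShare` works out to roughly 0.0019, which the table rounds to 0.2%.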

Within the 793 discoveries, the distribution by SSIM range:

| SSIM range | Count | % of discoveries |
| --- | --- | --- |
| >= 0.95 | 21 | 2.6% |
| 0.90 - 0.95 | 55 | 6.9% |
| 0.80 - 0.90 | 191 | 24.1% |
| 0.70 - 0.80 | 526 | 66.3% |

The majority (66%) fall in the 0.70-0.80 range: visually confusable but not pixel-identical. The 21 pairs scoring above 0.95 are the most dangerous: near-indistinguishable from their Latin counterparts.

Nearly half are vertical strokes

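| Shape category | Targets | Count | % |
| --- | --- | --- | --- |
| Vertical stroke | l, i, j | 377 | 47.5% |
| Round | o, c, e, d, b, n, p, q | 153 | 19.3% |
| Other letters | t, s, f, r, h, m, u, y, a | 147 | 18.5% |
| Numeral | 0-9 | 60 | 7.6% |
| Angular | x, v, w, z, k | 56 | 7.1% |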

Nearly half of all novel confusables target “l”, “i”, or “j”, the simplest Latin glyphs. A vertical stroke is the most common glyph shape across all writing systems: tally marks, vowel carriers, numeral ones, and vertical punctuation all reduce to a single line at 48x48 resolution.
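The grouping in the table is a lookup from target character to shape category. A sketch of that lookup, using exactly the target lists shown above; `shapeCategory` is an illustrative name, and any target the table does not list (for example "g") returns undefined rather than being guessed.

```typescript
// Category name paired with its target characters, copied from the table.
const SHAPE_CATEGORIES: Array<[string, string]> = [
  ["Vertical stroke", "lij"],
  ["Round", "ocedbnpq"],
  ["Other letters", "tsfrhmuya"],
  ["Numeral", "0123456789"],
  ["Angular", "xvwzk"],
];

/** Returns the shape category for a single Latin target, or undefined if unlisted. */
function shapeCategory(target: string): string | undefined {
  if (target.length !== 1) return undefined;
  for (const [category, targets] of SHAPE_CATEGORIES) {
    if (targets.indexOf(target) !== -1) return category;
  }
  return undefined;
}
```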

The “l” problem is the clearest illustration. Lowercase L in a sans-serif font is just a vertical stroke. confusable-vision found 143 novel confusables for the letter L alone, more than all 49 high-risk TR39 pairs combined. The per-target breakdown:

| Target | Discoveries | Notes |
| --- | --- | --- |
| l | 143 | Vertical stroke, universal across scripts |
| i | 125 | Vertical stroke (with or without dot) |
| j | 109 | Vertical stroke with descender |
| o | 65 | Circle, common numeral/vowel shape |
| t | 62 | Cross shape |
| x | 22 | Diagonal cross |
| c | 22 | Open curve |
| n | 20 | Arch |
| 8 | 17 | Double circle |
| u | 17 | Open arch |
| v | 16 | Angular open |
| b | 14 | Vertical + circle |
| m | 13 | Double arch |
| Other | 148 | Remaining 21 targets |

The first three targets (l, i, j) are all vertical-stroke-dominant letters, accounting for 377 of 793 discoveries. This is a structural property of Latin letterforms, not a quirk of the dataset.

The top 20

Ranked by mean SSIM. None of these are in confusables.txt.

| Rank | Codepoint | Name | Target | SSIM | Font |
| --- | --- | --- | --- | --- | --- |
| 1 | U+A7FE | LATIN EPIGRAPHIC LETTER I LONGA | l | 0.998 | Geneva (same-font) |
| 2 | U+16B50 | PAHAWH HMONG DIGIT ZERO | l | 0.986 | Noto Sans Pahawh Hmong vs Skia |
| 3 | U+10889 | NABATAEAN LETTER KAPH | l | 0.986 | Noto Sans Nabataean vs Skia |
| 4 | U+A781 | LATIN SMALL LETTER TURNED L | l | 0.986 | Geneva (same-font) |
| 5 | U+A771 | LATIN SMALL LETTER DUM | d | 0.985 | Geneva (same-font) |
| 6 | U+1BC07 | DUPLOYAN LETTER I | l | 0.981 | Noto Sans Duployan vs Skia |
| 7 | U+10D31 | HANIFI ROHINGYA VOWEL A | l | 0.978 | Noto Sans Hanifi Rohingya vs Skia |
| 8 | U+1E822 | MENDE KIKAKUI DIGIT ONE | l | 0.978 | Noto Sans Mende Kikakui vs Skia |
| 9 | U+16A59 | MRO DIGIT NINE | l | 0.978 | Noto Sans Mro vs Skia |
| 10 | U+109C0 | MEROITIC CURSIVE NUMBER ONE | l | 0.978 | Noto Sans Meroitic vs Skia |
| 11 | U+108ED | HATRAN NUMBER ONE | l | 0.976 | Noto Sans Hatran vs Skia |
| 12 | U+108FB | HATRAN LOW NUMERAL SIGN | l | 0.976 | Noto Sans Hatran vs Skia |
| 13 | U+1E951 | ADLAM SMALL LETTER I | l | 0.973 | Noto Sans Adlam vs Skia |
| 14 | U+10A9D | OLD NORTH ARABIAN NUMBER ONE | l | 0.972 | Noto Sans Old North Arabian vs Skia |
| 15 | U+0C79 | TELUGU DIGIT THREE | l | 0.969 | Telugu MN vs Skia |
| 16 | U+A621 | VAI DIGIT ONE | l | 0.963 | Noto Sans Vai vs Skia |
| 17 | U+11AE5 | PAU CIN HAU LETTER PA | l | 0.960 | Noto Sans Pau Cin Hau vs Skia |
| 18 | U+A76F | LATIN SMALL LETTER CON | 9 | 0.958 | Geneva (same-font) |
| 19 | U+A9D0 | JAVANESE DIGIT ZERO | o | 0.958 | Noto Sans Javanese vs Avenir |
| 20 | U+10CA5 | OLD HUNGARIAN SMALL LETTER ECS | l | 0.956 | Noto Sans Old Hungarian vs Skia |

17 of the top 20 target “l”, from 15 different scripts: Latin Extended (ranks 1, 4 in Geneva), Pahawh Hmong, Nabataean, Duployan, Hanifi Rohingya, Mende Kikakui, Mro, Meroitic, Hatran, Adlam, Old North Arabian, Telugu, Vai, Pau Cin Hau, and Old Hungarian. All render as a simple vertical bar that is near-identical to Latin lowercase L. The recurrence of this shape across unrelated writing systems is the single strongest pattern in the data.

Pair #5 (U+A771, Latin Small Letter Dum) is notable: a Latin Extended character that looks identical to “d” in Geneva, a within-Latin confusable that TR39 missed. Pair #18 (U+A76F, Latin Small Letter Con vs “9”) is another within-Latin surprise. And pair #19 (Javanese digit zero vs “o”) is a cross-script find where a digit from an unrelated numeral system converges with a Latin letter.

The non-obvious finds

The vertical-stroke “l” lookalikes dominate the top of the list. But the more interesting security findings are characters that mimic structurally complex Latin letters. These are harder to catch with simple heuristics:

| Codepoint | Name | Target | SSIM | Font |
| --- | --- | --- | --- | --- |
| U+A9D0 | JAVANESE DIGIT ZERO | o | 0.958 | Noto Sans Javanese vs Avenir |
| U+10347 | GOTHIC LETTER GIBA | x | 0.941 | Noto Sans Gothic vs Menlo |
| U+2CAD | COPTIC SMALL LETTER CRYPTOGRAMMIC NI | x | 0.925 | Noto Sans Coptic vs Menlo |
| U+17F4 | KHMER SYMBOL BUON KOET | v | 0.928 | Khmer MN vs Tahoma |
| U+07D5 | NKO LETTER BA | b | 0.922 | Noto Sans NKo vs Futura |
| U+07CE | NKO LETTER YA | u | 0.916 | Noto Sans NKo vs Arial |
| U+2C91 | COPTIC SMALL LETTER EI | e | 0.897 | Noto Sans Coptic vs Arial |
| U+10336 | GOTHIC LETTER KUSMA | z | 0.884 | Noto Sans Gothic vs Menlo |
| U+10CC2 | OLD HUNGARIAN SMALL LETTER EC | x | 0.883 | Noto Sans Old Hungarian vs Arial |
| U+1D5C6 | MATHEMATICAL SANS-SERIF SMALL M | m | 0.878 | STIX Two Math vs Avenir |

[Figure: six confusable pairs rendered side by side at 48x48 greyscale. Gothic giba next to Latin x, Javanese zero next to Latin o, Coptic ni next to Latin x, Khmer buon koet next to Latin v, NKo ba next to Latin b, and NKo ya next to Latin u. Each pair is visually near-identical despite coming from completely unrelated scripts.]

Actual 48x48 greyscale renders from confusable-vision. Left character in each pair is the novel confusable source; right is the Latin target. None of these pairs are in confusables.txt.

Gothic giba looks like x. Javanese zero looks like o. NKo ba looks like b. These are not cases of vertical-stroke degeneracy; they are genuine glyph-level convergence between unrelated writing systems. A rule that says “flag vertical strokes” would miss all of them. Only pixel comparison catches them.
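For readers unfamiliar with SSIM, here is a simplified sketch of the kind of pixel comparison involved. It computes a single global SSIM value over two greyscale images given as flat arrays of 0-255 values, using the standard constants for 8-bit data. Typical SSIM implementations average the score over sliding windows instead, and the post does not say which variant confusable-vision uses, so treat `globalSsim` as an illustration of the formula rather than the pipeline's scorer.

```typescript
/** Single-window SSIM over two equal-length greyscale pixel arrays (values 0-255). */
function globalSsim(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length === 0) {
    throw new Error("images must be the same non-zero size");
  }
  const n = x.length;

  // Mean luminance of each image.
  let meanX = 0;
  let meanY = 0;
  for (let i = 0; i < n; i++) {
    meanX += x[i];
    meanY += y[i];
  }
  meanX /= n;
  meanY /= n;

  // Variances and covariance (population form).
  let varX = 0;
  let varY = 0;
  let cov = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    varX += dx * dx;
    varY += dy * dy;
    cov += dx * dy;
  }
  varX /= n;
  varY /= n;
  cov /= n;

  // Stabilising constants for a dynamic range of 255.
  const c1 = Math.pow(0.01 * 255, 2);
  const c2 = Math.pow(0.03 * 255, 2);

  return (
    ((2 * meanX * meanY + c1) * (2 * cov + c2)) /
    ((meanX * meanX + meanY * meanY + c1) * (varX + varY + c2))
  );
}
```

Identical inputs score 1, and an image compared with its inverse scores below zero, which is why a 0.7 cut-off separates lookalikes from merely similar shapes.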

96 scripts, one long tail

The 793 discoveries come from 96 distinct scripts/fonts. The report groups by rendering font rather than Unicode script property, because what matters is which font file supplies the glyph. The top 20 account for 463 of 793 pairs (58%), while 76 scripts contribute 5 or fewer each.

| Script/Font | Novel pairs | Avg SSIM | Notes |
| --- | --- | --- | --- |
| Shared Latin fonts (Arial, etc.) | 107 | 0.765 | Latin Extended, IPA, modifier letters |
| Geneva | 44 | 0.800 | Latin Extended-D, Cherokee Supplement |
| Old Hungarian | 20 | 0.831 | Historical Turkic script |
| Duployan | 20 | 0.810 | 19th-century shorthand system |
| Euphemia UCAS | 19 | 0.777 | Unified Canadian Aboriginal Syllabics |
| Mende Kikakui | 18 | 0.823 | West African script |
| Vai | 18 | 0.781 | West African syllabary |
| Tifinagh | 17 | 0.793 | Berber script |
| Mro | 15 | 0.782 | Chin Hills script (Myanmar/Bangladesh) |
| Pau Cin Hau | 15 | 0.790 | Another Chin script |
| Gothic | 14 | 0.793 | 4th-century Germanic script |
| NKo | 14 | 0.792 | West African script for Manding languages |
| Tamil Sangam MN | 14 | 0.792 | Tamil script |
| Coptic | 13 | 0.791 | Egyptian Christian script |
| Nabataean | 12 | 0.827 | Ancient Aramaic-derived script |
| Hatran | 12 | 0.861 | Ancient Mesopotamian script |
| Pahawh Hmong | 11 | 0.804 | Southeast Asian script |
| Ugaritic | 11 | 0.772 | Ancient cuneiform alphabetic |
| Adlam | 10 | 0.810 | Modern West African script |
| Lydian | 10 | 0.808 | Ancient Anatolian script |

The geographic spread tells the story. West African scripts (Vai, Mende Kikakui, NKo, Adlam, Bamum) contribute around 70 pairs collectively. Ancient Near Eastern scripts (Nabataean, Hatran, Ugaritic, Lydian, and others) contribute another ~70. Historical European scripts (Gothic, Old Hungarian, Coptic, Old Permic, Old Italic) add ~70 more. Southeast Asian scripts (Pahawh Hmong, Mro, Pau Cin Hau, Javanese, Khmer, and others) contribute around 65. Canadian Aboriginal Syllabics alone contributes 19.

The gap is clear. TR39 confusables.txt focuses heavily on Cyrillic, Greek, and Armenian, the scripts most commonly used in IDN homograph attacks. The Unicode Consortium’s curation has concentrated on the scripts most encountered in modern computing, leaving the smaller scripts among these 96 largely unexamined for visual similarity to Latin.

What this means for confusable detection

| Metric | Confusables.txt (M1b) | Novel discoveries (M2) |
| --- | --- | --- |
| Input pairs/candidates | 1,418 | 23,317 |
| Characters with font coverage | 1,341 (94.6%) | 12,555 (53.8%) |
| SSIM comparisons | 235,625 | 2,904,376 |
| High-risk pairs (>= 0.7) | 49 (3.5%) | 793 (0.2% of scored) |
| Computation time | 65s | 928s |

The 793 novel discoveries represent a 16x increase over TR39’s 49 high-risk pairs from the M1b analysis. confusables.txt is not wrong; it is incomplete. Its scope is NFKC-based, derived from compatibility decomposition relationships, not visual rendering.

The biggest gaps:

  1. Vertical strokes from obscure scripts. 377 characters that render as near-identical vertical bars. Pahawh Hmong, Nabataean, Duployan, Hatran, Mende Kikakui, and others. Easy to flag with a shape rule, but currently unflagged.
  2. Numeral lookalikes. 60 pairs target the Latin digits 0-9, and digits from other scripts also render like Latin letters: Javanese zero to Latin o (0.958), Pahawh Hmong zero to Latin l (0.986), Telugu three to Latin l (0.969).
  3. Historical and minority scripts. Coptic, Gothic, Old Hungarian, and Lydian have small user populations but valid identifier characters and active Noto Sans font coverage.

The practical risk depends on context, and to quantify that I annotated every discovery against three Unicode property databases.

How dangerous are these, exactly?

Each of the 793 novel discoveries was checked against UAX #31 (XID_Start, XID_Continue), IDNA 2008 (PVALID status), and TR39 (Identifier_Status = Allowed). The results:

| Property | Count | % of 793 |
| --- | --- | --- |
| XID_Continue (valid in identifiers) | 715 | 90.2% |
| IDNA PVALID (valid in domain names) | 657 | 82.8% |
| XID_Continue AND IDNA PVALID | 591 | 74.5% |
| TR39 Identifier_Status = Allowed | 60 | 7.6% |

74.5% of these 793 discoveries are valid in both JavaScript identifiers and internationalized domain names. An attacker can use 591 of these characters in variable names, function names, and domain labels. They are not blocked by IDNA 2008, not blocked by UAX #31, and not flagged by confusables.txt.

The TR39 Allowed number is low (7.6%) because most discoveries come from scripts that Unicode classifies as Restricted (Gothic, Old Hungarian, Pahawh Hmong, etc.). But Restricted status does not prevent exploitation in all contexts. JavaScript engines accept any XID_Continue character in identifiers regardless of TR39 restriction. And 657 of these characters pass IDNA 2008 validation, meaning they could appear in internationalized domain names.

The context-specific breakdown:

  • JavaScript identifiers: 715 of 793 (90.2%) are XID_Continue. An attacker can define a variable using Gothic giba that looks like x, or NKo ba that looks like b. JS engines will accept it.
  • Domain names: 657 of 793 (82.8%) are IDNA PVALID. The BMP characters (Latin Extended, Coptic, NKo, Tifinagh) are directly usable in IDN homograph attacks. Most SMP historical scripts are also PVALID.
  • Package names: npm, PyPI, and crates.io restrict package names to ASCII, so the exposure here is in registries and namespaces that accept wider Unicode ranges. The 60 TR39 Allowed characters are the ones most likely to pass such filters.
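The UAX #31 part of this check can be reproduced with JavaScript's own Unicode property escapes, which expose `XID_Start` and `XID_Continue` as binary properties. The sketch below covers only that property; IDNA PVALID and TR39 Identifier_Status need their own data files and are not shown. The function names are illustrative, not taken from the repo.

```typescript
// Built from strings so the \p{...} escapes are parsed at runtime.
const XID_START = new RegExp("^\\p{XID_Start}$", "u");
const XID_CONTINUE = new RegExp("^\\p{XID_Continue}$", "u");

/** True if the codepoint may appear after the first position of an identifier. */
function isXidContinue(codePoint: number): boolean {
  return XID_CONTINUE.test(String.fromCodePoint(codePoint));
}

/** True if every codepoint fits the UAX #31 default identifier pattern. */
function isXidIdentifier(name: string): boolean {
  const chars = Array.from(name); // splits by codepoint, not UTF-16 unit
  if (chars.length === 0) return false;
  return chars.every((ch, i) => (i === 0 ? XID_START.test(ch) : XID_CONTINUE.test(ch)));
}
```

For example, NKo letter ba (U+07D5) and Gothic U+10347 both pass `isXidContinue`, so a name mixing them with ASCII letters satisfies `isXidIdentifier`.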

Pixel-identical but not glyph-identical

A separate question: when two characters produce the same pixels, do they share the same glyph in the font file? Or does the font contain two separate outlines that happen to look identical?

To answer this, I used fontkit to check the cmap glyph IDs for every same-font pair with SSIM >= 0.999 across both the TR39 pairs (M1b) and novel discoveries (M2). If two codepoints map to the same glyph ID, that is intentional glyph reuse by the font designer. If they map to different glyph IDs, the pixel identity is raster coincidence: separate outlines that happen to produce the same pixels at rendering size.

The result: zero glyph reuse across all 85 pixel-identical pairs. Every pixel-identical confusable (78 TR39 pairs, 7 novel pairs) uses separate glyph IDs in every font tested, including Arial, Helvetica, Geneva, Times New Roman, and dozens more. The fonts deliberately maintain separate glyph entries for Cyrillic, Greek, Roman numeral, and other confusable characters, even when the glyph outlines are visually identical to their Latin counterparts.

This means the pixel identity is not a font aliasing shortcut. It is a rendering outcome from outline-identical but structurally separate glyphs. Fonts could differentiate these characters visually without restructuring their cmap tables, since the table already separates them. They just happen to draw the same shape.
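The classification itself is a comparison of two glyph IDs. In the sketch below the cmap lookup is injected as a function so the logic stands alone; with fontkit, that lookup could presumably be built from `font.glyphForCodePoint(cp).id`, but that wiring is an assumption about how the check was done rather than code from the repo. The glyph IDs in the example map are made up.

```typescript
type GlyphLookup = (codePoint: number) => number | undefined;
type PairKind = "glyph-reuse" | "separate-glyphs" | "missing";

/** Classifies a codepoint pair by whether one font maps both to the same glyph ID. */
function classifyPair(glyphIdFor: GlyphLookup, a: number, b: number): PairKind {
  const glyphA = glyphIdFor(a);
  const glyphB = glyphIdFor(b);
  if (glyphA === undefined || glyphB === undefined) return "missing";
  return glyphA === glyphB ? "glyph-reuse" : "separate-glyphs";
}

// Made-up cmap for illustration only; these IDs are not from any real font.
const exampleCmap = new Map<number, number>([
  [0x0061, 68],  // LATIN SMALL LETTER A
  [0x0430, 251], // CYRILLIC SMALL LETTER A, separate glyph entry
  [0x0041, 36],  // LATIN CAPITAL LETTER A
  [0x0391, 36],  // GREEK CAPITAL LETTER ALPHA, hypothetically sharing a glyph
]);
const exampleLookup: GlyphLookup = (cp) => exampleCmap.get(cp);
```

Under the post's finding, every one of the 85 pixel-identical pairs would fall in the "separate-glyphs" branch.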

How to reproduce

git clone https://github.com/paultendo/confusable-vision
cd confusable-vision
npm install

# Build candidate list (23,317 identifier-safe chars not in confusables.txt)
npx tsx scripts/build-candidates.ts

# Render candidates with font discovery (~40 min, 89,478 PNGs)
npx tsx scripts/build-index.ts --candidates

# Score all candidate/target pairs (~15.5 min, 2.9M comparisons)
npx tsx scripts/score-candidates.ts

# Extract high-risk discoveries
npx tsx scripts/extract-discoveries.ts

The full results are at candidate-discoveries.json in the repo (CC-BY-4.0). The technical report is at REPORT.md. Every number in this post is drawn from the report or the scored output.

Series context

This is the sixth post in a series on Unicode identifier security:

  1. confusables.txt and NFKC disagree on 31 characters
  2. Unicode ships one confusable map. You need two.
  3. A threat model for Unicode identifier spoofing
  4. Making Unicode risk measurable
  5. I rendered 1,418 Unicode confusable pairs across 230 fonts
  6. This post: 793 novel confusables discovered outside TR39

Next: cross-script validation and CJK/Hangul verification (M2b).

Submitted to Unicode

These 793 discoveries have been submitted to the Unicode Consortium for review, as a follow-up to the initial submission reporting 31 NFKC/confusables.txt conflicts. The submission includes the full scored dataset, per-pair SSIM scores, font coverage, and same-font/cross-font breakdowns. Whether any of these pairs warrant inclusion in a future revision of confusables.txt is up to the Unicode working group.


confusable-vision is MIT-licensed. The discovery data is CC-BY-4.0. The full candidate analysis, 96-script breakdown, and all render artifacts are in the repo. namespace-guard (v0.16.0, zero dependencies, MIT) integrates these discoveries via optional visual weights in confusableDistance().

Thanks to Akshat Gupta and colleagues for the GlyphNet paper, whose empirical findings on greyscale rendering informed confusable-vision’s methodology.