793 Unicode characters look like Latin letters but aren't (yet) in confusables.txt
I scanned 23,317 identifier-safe codepoints across 230 fonts. Nearly half the discoveries are vertical strokes from obscure scripts.
In the previous post, I validated every TR39 confusable pair visually: render both characters, measure SSIM, put a number on it. That covered 1,418 pairs already in confusables.txt. But confusables.txt only maps characters that the Unicode Consortium has reviewed. What about the rest of Unicode?
The natural next question: are there characters outside confusables.txt that look like Latin letters? Characters that a human would confuse, but that no confusable detection system currently flags?
The answer is yes. 793 of them, from 96 different scripts.
The search space
Milestone 2 of confusable-vision scans every Unicode character that meets three criteria: identifier-safe (General Category Letter or Number), not already in confusables.txt, and not CJK/Hangul (excluded for this pass because logographic characters are structurally very different from Latin, though a future M2b scan should verify that assumption). That gives 23,317 candidates.
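The three criteria can be sketched with Unicode property escapes. This is a minimal illustration, not the code in build-candidates.ts: `isCandidate`, the toy `known` set, and the choice of Han, Hiragana, Katakana, and Hangul as the excluded scripts are all assumptions made for the example.

```typescript
// Hypothetical helper; names are illustrative, not taken from the repo.
// Built with the RegExp constructor so the property escapes are resolved at runtime.
const LETTER_OR_NUMBER = new RegExp("^[\\p{L}\\p{N}]$", "u");

// Assumption: "CJK/Hangul" is approximated by these four Script values.
const CJK_OR_HANGUL = new RegExp(
  "^[\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Hangul}]$",
  "u",
);

function isCandidate(codePoint: number, inConfusables: ReadonlySet<number>): boolean {
  const ch = String.fromCodePoint(codePoint);
  return (
    LETTER_OR_NUMBER.test(ch) && // identifier-safe: General Category L* or N*
    !inConfusables.has(codePoint) && // not already listed in confusables.txt
    !CJK_OR_HANGUL.test(ch) // excluded for this pass
  );
}

// Toy stand-in for the set of code points parsed from confusables.txt.
const known = new Set<number>([0x0430]);

console.log(isCandidate(0x10347, known)); // a Gothic letter, not in the toy set
console.log(isCandidate(0x0430, known)); // already in the toy set
console.log(isCandidate(0x4e2d, known)); // a Han ideograph
```

Running the same predicate over every assigned code point, with the real confusables.txt source column loaded into the set, is one way a candidate list of this kind could be built.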
Not every candidate has font coverage. Fontconfig found at least one system font for 12,555 characters (53.8%). The remaining 10,762 have no installed font that contains them, so they cannot be rendered or scored. The covered candidates average 7.1 fonts each.
Each candidate is compared against 36 Latin targets: a-z and 0-9. The pipeline:
23,317 candidates → fontconfig per-character query → 12,555 with coverage (53.8%) → 89,478 targeted render jobs → pHash prefilter + SSIM scoring → 793 novel high-risk pairs
Scaling the scoring
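The combinatorial challenge is real. 12,555 candidates times 36 targets times multiple fonts per character gives hundreds of millions of naive comparisons: taking the 74 target renders per Latin letter used in cross-font scoring, 89,478 source renders × 36 targets × 74 renders is roughly 238 million. Two optimisations made this tractable: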
Same-font pHash prefilter. Before computing SSIM (expensive), each candidate/target pair in the same font is compared using perceptual hash distance. Pairs with pHash similarity below 0.3 are skipped. This eliminates the vast majority of structurally dissimilar pairings at near-zero cost.
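A minimal sketch of such a prefilter, assuming 64-bit hashes stored as 8 bytes and assuming similarity is defined as 1 minus the normalised Hamming distance. The post does not spell out the exact definition, so `phashSimilarity` and `passesPrefilter` are hypothetical names and the formula is an assumption; only the 0.3 threshold comes from the text.

```typescript
// 64-bit perceptual hash stored as 8 bytes (assumed layout).
type Hash64 = Uint8Array;

// Count the set bits in a single byte.
function popcount8(byte: number): number {
  let n = 0;
  for (let b = byte; b !== 0; b >>= 1) n += b & 1;
  return n;
}

// Assumption: similarity = 1 - (Hamming distance / 64).
function phashSimilarity(a: Hash64, b: Hash64): number {
  if (a.length !== 8 || b.length !== 8) throw new Error("expected 8-byte hashes");
  let distance = 0;
  for (let i = 0; i < 8; i++) distance += popcount8(a[i] ^ b[i]);
  return 1 - distance / 64;
}

const PHASH_MIN_SIMILARITY = 0.3; // threshold quoted in the text

// Pairs below the threshold are skipped; everything else goes on to SSIM.
function passesPrefilter(a: Hash64, b: Hash64): boolean {
  return phashSimilarity(a, b) >= PHASH_MIN_SIMILARITY;
}
```

The point of the design is cost: a Hamming distance is a handful of XORs and bit counts, while SSIM has to touch every pixel of both renders.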
Top-1-by-pHash cross-font selection. For candidates in non-standard fonts (Noto Sans variants, CJK fonts, etc.), instead of comparing against all 74 target renders for each Latin letter, the pipeline finds the single best target render by pHash similarity and computes SSIM only for that pair. This reduces cross-font SSIM from O(74) to O(1) per source render.
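The selection step can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: `Render`, `similarity`, and `scoreCrossFont` are hypothetical, hashes are shortened to four bits for readability, and the expensive SSIM call is injected so the example can count how often it runs.

```typescript
// Hypothetical render record; `hash` holds hash bits as 0/1 values.
interface Render {
  font: string;
  hash: number[];
}

// Cheap similarity: fraction of matching hash bits.
function similarity(a: number[], b: number[]): number {
  let same = 0;
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) same++;
  return same / a.length;
}

// Pick the single closest target render by hash, then run SSIM once.
function scoreCrossFont(
  source: Render,
  targets: Render[],
  ssim: (a: Render, b: Render) => number, // expensive call, injected
): { target: Render; ssim: number } | undefined {
  let best: Render | undefined;
  let bestSim = -1;
  for (const t of targets) {
    const s = similarity(source.hash, t.hash);
    if (s > bestSim) {
      bestSim = s;
      best = t;
    }
  }
  return best ? { target: best, ssim: ssim(source, best) } : undefined;
}

// Toy data: three target renders of one Latin letter.
let ssimCalls = 0;
const fakeSsim = (a: Render, b: Render): number => {
  ssimCalls++;
  return similarity(a.hash, b.hash); // stand-in for a real SSIM implementation
};
const source: Render = { font: "Noto Sans Gothic", hash: [1, 0, 1, 1] };
const targets: Render[] = [
  { font: "Arial", hash: [0, 1, 0, 0] },
  { font: "Menlo", hash: [1, 0, 1, 0] },
  { font: "Skia", hash: [0, 0, 1, 0] },
];
const result = scoreCrossFont(source, targets, fakeSsim);
console.log(result?.target.font, ssimCalls);
```

The trade-off is that the hash can occasionally pick a target render that is not the true SSIM maximum, so scores obtained this way are best read as a lower bound on the best cross-font match.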
Result: 2,904,376 SSIM comparisons in 928 seconds (15.5 minutes), approximately 3,130 SSIM/second. The output is a 572 MB JSON file (candidate-scores.json), written incrementally through a streaming JSON writer in Node to avoid the V8 string length limit. extract-discoveries.ts then stream-parses it into a 1.5 MB candidate-discoveries.json containing the 793 high-scoring pairs.
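The idea behind streaming the output can be shown in a few lines: serialise one record at a time so the whole document never exists as a single string. This is a sketch under assumptions. `writeJsonArray`, the `write` sink, and the record fields are hypothetical and do not reflect the real schema of candidate-scores.json.

```typescript
// Emit a JSON array element by element through a caller-supplied sink.
function writeJsonArray<T>(items: Iterable<T>, write: (chunk: string) => void): void {
  write("[");
  let first = true;
  for (const item of items) {
    if (!first) write(",");
    write(JSON.stringify(item)); // only one record is stringified at a time
    first = false;
  }
  write("]");
}

// In-memory sink for demonstration. In Node the sink could instead forward
// each chunk to a writable file stream.
const chunks: string[] = [];
writeJsonArray(
  [
    { codepoint: "U+A7FE", target: "l", ssim: 0.998 },
    { codepoint: "U+A9D0", target: "o", ssim: 0.958 },
  ],
  (chunk) => chunks.push(chunk),
);
console.log(chunks.join(""));
```

A production writer would also need to respect backpressure from the file stream, which this sketch ignores.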
The headline: 793 novel confusables
| Band | Count | % | Description |
|---|---|---|---|
| High (>= 0.7) | 793 | 0.2% | Novel confusables not in TR39 |
| Medium (0.3-0.7) | 34,522 | 8.1% | Somewhat similar |
| Low (< 0.3) | 391,194 | 91.7% | Not visually confusable |
| Total | 426,509 | 100% | |
The hit rate is lower than M1b (0.2% vs 3.5%). That is expected: M1b scored a curated list that the Unicode Consortium already flagged as confusable. M2 searches the full remaining space, where most characters look nothing like Latin.
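The banding in the table reduces to two thresholds. A minimal sketch, with `bandFor` and `tally` as hypothetical names and the sample scores made up for illustration:

```typescript
type Band = "high" | "medium" | "low";

// Thresholds from the table: high >= 0.7, medium 0.3 to 0.7, low < 0.3.
function bandFor(meanSsim: number): Band {
  if (meanSsim >= 0.7) return "high";
  if (meanSsim >= 0.3) return "medium";
  return "low";
}

// Count how many scores fall in each band.
function tally(scores: number[]): Record<Band, number> {
  const counts: Record<Band, number> = { high: 0, medium: 0, low: 0 };
  for (const s of scores) counts[bandFor(s)]++;
  return counts;
}

// Illustrative scores only, not taken from the dataset.
console.log(tally([0.998, 0.7, 0.69, 0.3, 0.29, 0.0]));
```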
Within the 793 discoveries, the distribution by SSIM range:
| SSIM range | Count | % of discoveries |
|---|---|---|
| >= 0.95 | 21 | 2.6% |
| 0.90 - 0.95 | 55 | 6.9% |
| 0.80 - 0.90 | 191 | 24.1% |
| 0.70 - 0.80 | 526 | 66.3% |
The majority (66%) fall in the 0.70-0.80 range: visually confusable but not pixel-identical. The 21 pairs scoring above 0.95 are the most dangerous: near-indistinguishable from their Latin counterparts.
Nearly half are vertical strokes
| Shape category | Targets | Count | % |
|---|---|---|---|
| Vertical stroke | l, i, j | 377 | 47.5% |
| Round | o, c, e, d, b, n, p, q | 153 | 19.3% |
| Other letters | t, s, f, r, h, m, u, y, a | 147 | 18.5% |
| Numeral | 0-9 | 60 | 7.6% |
| Angular | x, v, w, z, k | 56 | 7.1% |
Nearly half of all novel confusables target “l”, “i”, or “j”, the simplest Latin glyphs. A vertical stroke is the most common glyph shape across all writing systems: tally marks, vowel carriers, numeral ones, and vertical punctuation all reduce to a single line at 48x48 resolution.
The “l” problem is the clearest illustration. Lowercase L in a sans-serif font is just a vertical stroke. confusable-vision found 143 novel confusables for the letter L alone, more than all 49 high-risk TR39 pairs combined. The per-target breakdown:
| Target | Discoveries | Notes |
|---|---|---|
| l | 143 | Vertical stroke, universal across scripts |
| i | 125 | Vertical stroke (with or without dot) |
| j | 109 | Vertical stroke with descender |
| o | 65 | Circle, common numeral/vowel shape |
| t | 62 | Cross shape |
| x | 22 | Diagonal cross |
| c | 22 | Open curve |
| n | 20 | Arch |
| 8 | 17 | Double circle |
| u | 17 | Open arch |
| v | 16 | Angular open |
| b | 14 | Vertical + circle |
| m | 13 | Double arch |
| Other | 148 | Remaining 21 targets |
The first three targets (l, i, j) are all vertical-stroke-dominant letters, accounting for 377 of 793 discoveries. This is a structural property of Latin letterforms, not a quirk of the dataset.
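The 377 figure and its 47.5% share can be checked directly from the per-target table. The counts below are copied from that table; `groupCount` and `share` are hypothetical helper names.

```typescript
// Per-target discovery counts copied from the table above.
const discoveries: Record<string, number> = {
  l: 143, i: 125, j: 109, o: 65, t: 62, x: 22, c: 22,
  n: 20, "8": 17, u: 17, v: 16, b: 14, m: 13,
};
const TOTAL = 793;

// Sum the discoveries for a group of targets.
function groupCount(targets: string[]): number {
  return targets.reduce((sum, t) => sum + (discoveries[t] ?? 0), 0);
}

// Express a count as a percentage of all discoveries, to one decimal place.
function share(count: number): string {
  return ((count / TOTAL) * 100).toFixed(1) + "%";
}

const vertical = groupCount(["l", "i", "j"]);
const listed = groupCount(Object.keys(discoveries));
console.log(vertical, share(vertical)); // vertical-stroke targets
console.log(TOTAL - listed); // what is left for the "Other" row
```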
The top 20
Ranked by mean SSIM. None of these are in confusables.txt.
| Rank | Codepoint | Name | Target | SSIM | Font |
|---|---|---|---|---|---|
| 1 | U+A7FE | LATIN EPIGRAPHIC LETTER I LONGA | l | 0.998 | Geneva (same-font) |
| 2 | U+16B50 | PAHAWH HMONG DIGIT ZERO | l | 0.986 | Noto Sans Pahawh Hmong vs Skia |
| 3 | U+10889 | NABATAEAN LETTER KAPH | l | 0.986 | Noto Sans Nabataean vs Skia |
| 4 | U+A781 | LATIN SMALL LETTER TURNED L | l | 0.986 | Geneva (same-font) |
| 5 | U+A771 | LATIN SMALL LETTER DUM | d | 0.985 | Geneva (same-font) |
| 6 | U+1BC07 | DUPLOYAN LETTER I | l | 0.981 | Noto Sans Duployan vs Skia |
| 7 | U+10D31 | HANIFI ROHINGYA VOWEL A | l | 0.978 | Noto Sans Hanifi Rohingya vs Skia |
| 8 | U+1E822 | MENDE KIKAKUI DIGIT ONE | l | 0.978 | Noto Sans Mende Kikakui vs Skia |
| 9 | U+16A59 | MRO DIGIT NINE | l | 0.978 | Noto Sans Mro vs Skia |
| 10 | U+109C0 | MEROITIC CURSIVE NUMBER ONE | l | 0.978 | Noto Sans Meroitic vs Skia |
| 11 | U+108ED | HATRAN NUMBER ONE | l | 0.976 | Noto Sans Hatran vs Skia |
| 12 | U+108FB | HATRAN LOW NUMERAL SIGN | l | 0.976 | Noto Sans Hatran vs Skia |
| 13 | U+1E951 | ADLAM SMALL LETTER I | l | 0.973 | Noto Sans Adlam vs Skia |
| 14 | U+10A9D | OLD NORTH ARABIAN NUMBER ONE | l | 0.972 | Noto Sans Old North Arabian vs Skia |
| 15 | U+0C79 | TELUGU DIGIT THREE | l | 0.969 | Telugu MN vs Skia |
| 16 | U+A621 | VAI DIGIT ONE | l | 0.963 | Noto Sans Vai vs Skia |
| 17 | U+11AE5 | PAU CIN HAU LETTER PA | l | 0.960 | Noto Sans Pau Cin Hau vs Skia |
| 18 | U+A76F | LATIN SMALL LETTER CON | 9 | 0.958 | Geneva (same-font) |
| 19 | U+A9D0 | JAVANESE DIGIT ZERO | o | 0.958 | Noto Sans Javanese vs Avenir |
| 20 | U+10CA5 | OLD HUNGARIAN SMALL LETTER ECS | l | 0.956 | Noto Sans Old Hungarian vs Skia |
17 of the top 20 target “l”, from 15 different scripts: Latin Extended (ranks 1, 4 in Geneva), Pahawh Hmong, Nabataean, Duployan, Hanifi Rohingya, Mende Kikakui, Mro, Meroitic, Hatran, Adlam, Old North Arabian, Telugu, Vai, Pau Cin Hau, and Old Hungarian. All render as a simple vertical bar that is near-identical to Latin lowercase L. The recurrence of this shape across unrelated writing systems is the single strongest pattern in the data.
Pair #5 (U+A771, Latin Small Letter Dum) is notable: a Latin Extended character that looks identical to “d” in Geneva, a within-Latin confusable that TR39 missed. Pair #18 (U+A76F, Latin Small Letter Con vs “9”) is another within-Latin surprise. And pair #19 (Javanese digit zero vs “o”) is a cross-script find where a digit from an unrelated numeral system converges with a Latin letter.
The non-obvious finds
The vertical-stroke “l” lookalikes dominate the top of the list. But the more interesting security findings are characters that mimic structurally complex Latin letters. These are harder to catch with simple heuristics:
| Codepoint | Name | Target | SSIM | Font |
|---|---|---|---|---|
| U+A9D0 | JAVANESE DIGIT ZERO | o | 0.958 | Noto Sans Javanese vs Avenir |
| U+10347 | GOTHIC LETTER GIBA | x | 0.941 | Noto Sans Gothic vs Menlo |
| U+2CAD | COPTIC SMALL LETTER CRYPTOGRAMMIC NI | x | 0.925 | Noto Sans Coptic vs Menlo |
| U+17F4 | KHMER SYMBOL BUON KOET | v | 0.928 | Khmer MN vs Tahoma |
| U+07D5 | NKO LETTER BA | b | 0.922 | Noto Sans NKo vs Futura |
| U+07CE | NKO LETTER YA | u | 0.916 | Noto Sans NKo vs Arial |
| U+2C91 | COPTIC SMALL LETTER EI | e | 0.897 | Noto Sans Coptic vs Arial |
| U+10336 | GOTHIC LETTER KUSMA | z | 0.884 | Noto Sans Gothic vs Menlo |
| U+10CC2 | OLD HUNGARIAN SMALL LETTER EC | x | 0.883 | Noto Sans Old Hungarian vs Arial |
| U+1D5C6 | MATHEMATICAL SANS-SERIF SMALL M | m | 0.878 | STIX Two Math vs Avenir |

Gothic giba looks like x. Javanese zero looks like o. NKo ba looks like b. These are not vertical stroke degeneracy. They are genuine glyph-level convergence between unrelated writing systems. A rule that says “flag vertical strokes” would miss all of them. Only pixel comparison catches them.
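To make "pixel comparison" concrete, here is a simplified SSIM sketch. It computes a single global SSIM over two greyscale images of equal size, using the standard constants for 8-bit data. This is an assumption-laden simplification: SSIM is usually averaged over local windows, and the post does not say which variant or library confusable-vision uses. `globalSsim` and `verticalBar` are hypothetical names.

```typescript
// Global (single-window) SSIM over two greyscale images with values 0..255.
function globalSsim(x: ArrayLike<number>, y: ArrayLike<number>): number {
  if (x.length !== y.length || x.length === 0) {
    throw new Error("images must have the same non-zero size");
  }
  const n = x.length;
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += x[i];
    sumY += y[i];
  }
  const muX = sumX / n;
  const muY = sumY / n;
  let varX = 0;
  let varY = 0;
  let cov = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - muX;
    const dy = y[i] - muY;
    varX += dx * dx;
    varY += dy * dy;
    cov += dx * dy;
  }
  varX /= n;
  varY /= n;
  cov /= n;
  const c1 = 0.01 * 255 * (0.01 * 255); // stabiliser for the luminance term
  const c2 = 0.03 * 255 * (0.03 * 255); // stabiliser for the contrast/structure term
  return (
    ((2 * muX * muY + c1) * (2 * cov + c2)) /
    ((muX * muX + muY * muY + c1) * (varX + varY + c2))
  );
}

// Toy 48x48 render: white background with a 4-pixel-wide black vertical bar.
function verticalBar(startCol: number, size = 48): number[] {
  const img = new Array<number>(size * size).fill(255);
  for (let row = 0; row < size; row++) {
    for (let col = startCol; col < startCol + 4; col++) img[row * size + col] = 0;
  }
  return img;
}

const bar = verticalBar(22);
const shifted = verticalBar(23);
const inverted = bar.map((v) => 255 - v);
console.log(globalSsim(bar, bar), globalSsim(bar, shifted), globalSsim(bar, inverted));
```

Two identical renders score 1, a one-pixel shift of the same bar scores lower, and an inverted image scores lowest. A shape rule has nothing comparable to offer for glyphs like the x, o, and b lookalikes above.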
96 scripts, one long tail
The 793 discoveries come from 96 distinct scripts/fonts. The report groups by rendering font rather than Unicode script property, because what matters is which font file supplies the glyph. The top 20 account for 463 of 793 pairs (58%), while 76 scripts contribute 5 or fewer each.
| Script/Font | Novel pairs | Avg SSIM | Notes |
|---|---|---|---|
| Shared Latin fonts (Arial, etc.) | 107 | 0.765 | Latin Extended, IPA, modifier letters |
| Geneva | 44 | 0.800 | Latin Extended-D, Cherokee Supplement |
| Old Hungarian | 20 | 0.831 | Historical Turkic script |
| Duployan | 20 | 0.810 | 19th-century shorthand system |
| Euphemia UCAS | 19 | 0.777 | Unified Canadian Aboriginal Syllabics |
| Mende Kikakui | 18 | 0.823 | West African script |
| Vai | 18 | 0.781 | West African syllabary |
| Tifinagh | 17 | 0.793 | Berber script |
| Mro | 15 | 0.782 | Chin Hills script (Myanmar/Bangladesh) |
| Pau Cin Hau | 15 | 0.790 | Another Chin script |
| Gothic | 14 | 0.793 | 4th-century Germanic script |
| NKo | 14 | 0.792 | West African script for Manding languages |
| Tamil Sangam MN | 14 | 0.792 | Tamil script |
| Coptic | 13 | 0.791 | Egyptian Christian script |
| Nabataean | 12 | 0.827 | Ancient Aramaic-derived script |
| Hatran | 12 | 0.861 | Ancient Mesopotamian script |
| Pahawh Hmong | 11 | 0.804 | Southeast Asian script |
| Ugaritic | 11 | 0.772 | Ancient cuneiform alphabetic |
| Adlam | 10 | 0.810 | Modern West African script |
| Lydian | 10 | 0.808 | Ancient Anatolian script |
The geographic spread tells the story. West African scripts (Vai, Mende Kikakui, NKo, Adlam, Bamum) contribute around 70 pairs collectively. Ancient Near Eastern scripts (Nabataean, Hatran, Ugaritic, Lydian, and others) contribute another ~70. Historical European scripts (Gothic, Old Hungarian, Coptic, Old Permic, Old Italic) add ~70 more. Southeast Asian scripts (Pahawh Hmong, Mro, Pau Cin Hau, Javanese, Khmer, and others) contribute around 65. Canadian Aboriginal Syllabics alone contributes 19.
The gap is clear. TR39 confusables.txt focuses heavily on Cyrillic, Greek, and Armenian, the scripts most commonly used in IDN homograph attacks. The Unicode Consortium’s curation has concentrated on the scripts most encountered in modern computing, so most of these 96 scripts remain largely unexamined for visual similarity to Latin.
What this means for confusable detection
| Metric | Confusables.txt (M1b) | Novel discoveries (M2) |
|---|---|---|
| Input pairs/candidates | 1,418 | 23,317 |
| Characters with font coverage | 1,341 (94.6%) | 12,555 (53.8%) |
| SSIM comparisons | 235,625 | 2,904,376 |
| High-risk pairs (>= 0.7) | 49 (3.5%) | 793 (0.2% of scored) |
| Computation time | 65s | 928s |
The 793 novel discoveries are roughly 16 times the 49 high-risk pairs that the M1b analysis found in TR39. confusables.txt is not wrong; it is incomplete. Its scope is NFKC-based, derived from compatibility decomposition relationships, not visual rendering.
The biggest gaps:
- Vertical strokes from obscure scripts. 377 characters that render as near-identical vertical bars. Pahawh Hmong, Nabataean, Duployan, Hatran, Mende Kikakui, and others. Easy to flag with a shape rule, but currently unflagged.
- Numeral lookalikes. 60 pairs where digits from other scripts render identically to Latin letters or digits. Javanese zero to Latin o (0.958), Pahawh Hmong zero to Latin l (0.986), Telugu three to Latin l (0.969).
- Historical and minority scripts. Coptic, Gothic, Old Hungarian, and Lydian have small user populations but valid identifier characters and active Noto Sans font coverage.
The practical risk depends on context, and to quantify that I annotated every discovery against three Unicode property databases.
How dangerous are these, exactly?
Each of the 793 novel discoveries was checked against UAX #31 (XID_Start, XID_Continue), IDNA 2008 (PVALID status), and TR39 (Identifier_Status = Allowed). The results:
| Property | Count | % of 793 |
|---|---|---|
| XID_Continue (valid in identifiers) | 715 | 90.2% |
| IDNA PVALID (valid in domain names) | 657 | 82.8% |
| XID_Continue AND IDNA PVALID | 591 | 74.5% |
| TR39 Identifier_Status = Allowed | 60 | 7.6% |
74.5% of these 793 discoveries are valid in both JavaScript identifiers and internationalized domain names. An attacker can use 591 of these characters in variable names, function names, and domain labels. They are not blocked by IDNA 2008, not blocked by UAX #31, and not flagged by confusables.txt.
The TR39 Allowed number is low (7.6%) because most discoveries come from scripts that Unicode classifies as Restricted (Gothic, Old Hungarian, Pahawh Hmong, etc.). But Restricted status does not prevent exploitation in all contexts. JavaScript engines accept any XID_Continue character in identifiers regardless of TR39 restriction. And 657 of these characters pass IDNA 2008 validation, meaning they could appear in internationalized domain names.
The context-specific breakdown:
- JavaScript identifiers: 715 of 793 (90.2%) are XID_Continue. An attacker can define a variable using Gothic giba that looks like “x”, or NKo ba that looks like “b”. JS engines will accept it.
- Domain names: 657 of 793 (82.8%) are IDNA PVALID. The BMP characters (Latin Extended, Coptic, NKo, Tifinagh) are directly usable in IDN homograph attacks. Most SMP historical scripts are also PVALID.
- Package names: npm, PyPI, and crates.io all restrict package names to ASCII, so the exposure here is in registries and naming systems that accept wider Unicode ranges. The 60 TR39 Allowed characters are the ones most likely to pass filters in such systems.
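The identifier case is easy to check in any modern JavaScript engine, because the regular expression syntax exposes XID_Start and XID_Continue as Unicode property escapes. The sketch below uses two code points from the tables above, U+10347 (paired with “x”) and U+07D5 (paired with “b”). `isXidContinue` is a hypothetical helper, and the `new Function` call is only there to show that the engine parses such an identifier.

```typescript
// Built with the RegExp constructor so the property escape is resolved at runtime.
const XID_CONTINUE = new RegExp("^\\p{XID_Continue}$", "u");

function isXidContinue(codePoint: number): boolean {
  return XID_CONTINUE.test(String.fromCodePoint(codePoint));
}

console.log(isXidContinue(0x10347)); // Gothic letter scored against "x"
console.log(isXidContinue(0x07d5)); // NKo letter scored against "b"
console.log(isXidContinue(0x2d)); // ASCII hyphen, for contrast

// Declare and read a variable whose entire name is U+10347.
const value = new Function("var \u{10347} = 42; return \u{10347};")();
console.log(value);
```

IDNA PVALID status is not exposed by the language itself, so checking the domain-name column needs the IDNA derived property data or a library built on it.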
Pixel-identical but not glyph-identical
A separate question: when two characters produce the same pixels, do they share the same glyph in the font file? Or does the font contain two separate outlines that happen to look identical?
To answer this, I used fontkit to check the cmap glyph IDs for every same-font pair with SSIM >= 0.999 across both the TR39 pairs (M1b) and novel discoveries (M2). If two codepoints map to the same glyph ID, that is intentional glyph reuse by the font designer. If they map to different glyph IDs, the pixel identity is raster coincidence: separate outlines that happen to produce the same pixels at rendering size.
The result: zero glyph reuse across all 85 pixel-identical pairs. Every pixel-identical confusable (78 TR39 pairs, 7 novel pairs) uses separate glyph IDs in every font tested, including Arial, Helvetica, Geneva, Times New Roman, and dozens more. The fonts deliberately maintain separate glyph entries for Cyrillic, Greek, Roman numeral, and other confusable characters, even when the glyph outlines are visually identical to their Latin counterparts.
This means the pixel identity is not a font aliasing shortcut. It is a rendering outcome from outline-identical but structurally separate glyphs. Fonts could differentiate these characters visually without restructuring their cmap tables, since the table already separates them. They just happen to draw the same shape.
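The classification logic behind that check is small. The sketch below keeps the font library out of the picture by taking a lookup function, so it runs with a stub cmap. `classifyPair`, `GlyphIdLookup`, and the glyph IDs in the stub are hypothetical. With fontkit, the lookup would presumably wrap the font's code-point-to-glyph call and return the glyph's ID, but that wiring is an assumption and is not shown here.

```typescript
// Maps a code point to a glyph ID, or undefined if the font has no mapping.
type GlyphIdLookup = (codePoint: number) => number | undefined;

type PairKind = "glyph-reuse" | "separate-glyphs" | "missing";

function classifyPair(a: number, b: number, glyphIdFor: GlyphIdLookup): PairKind {
  const ga = glyphIdFor(a);
  const gb = glyphIdFor(b);
  // Glyph 0 is conventionally .notdef, meaning the font does not cover the character.
  if (ga === undefined || gb === undefined || ga === 0 || gb === 0) return "missing";
  // Same ID: the designer reused one outline. Different IDs: separate outlines
  // that may still rasterise to identical pixels.
  return ga === gb ? "glyph-reuse" : "separate-glyphs";
}

// Stub cmap with made-up glyph IDs, for illustration only.
const stubCmap = new Map<number, number>([
  [0x006c, 79], // "l"
  [0xa7fe, 2210], // U+A7FE, rank 1 in the table above
  [0x0031, 20], // "1"
  [0x217c, 79], // pretend alias of "l" to exercise the reuse branch
]);
const lookup: GlyphIdLookup = (cp) => stubCmap.get(cp);

console.log(classifyPair(0xa7fe, 0x006c, lookup));
console.log(classifyPair(0x217c, 0x006c, lookup));
console.log(classifyPair(0x10347, 0x0078, lookup));
```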
How to reproduce
git clone https://github.com/paultendo/confusable-vision
cd confusable-vision
npm install
# Build candidate list (23,317 identifier-safe chars not in confusables.txt)
npx tsx scripts/build-candidates.ts
# Render candidates with font discovery (~40 min, 89,478 PNGs)
npx tsx scripts/build-index.ts --candidates
# Score all candidate/target pairs (~15.5 min, 2.9M comparisons)
npx tsx scripts/score-candidates.ts
# Extract high-risk discoveries
npx tsx scripts/extract-discoveries.ts
The full results are at candidate-discoveries.json in the repo (CC-BY-4.0). The technical report is at REPORT.md. Every number in this post is drawn from the report or the scored output.
Series context
This is the sixth post in a series on Unicode identifier security:
- confusables.txt and NFKC disagree on 31 characters
- Unicode ships one confusable map. You need two.
- A threat model for Unicode identifier spoofing
- Making Unicode risk measurable
- I rendered 1,418 Unicode confusable pairs across 230 fonts
- This post: 793 novel confusables discovered outside TR39
Next: cross-script validation and CJK/Hangul verification (M2b).
Submitted to Unicode
These 793 discoveries have been submitted to the Unicode Consortium for review, as a follow-up to the initial submission reporting 31 NFKC/confusables.txt conflicts. The submission includes the full scored dataset, per-pair SSIM scores, font coverage, and same-font/cross-font breakdowns. Whether any of these pairs warrant inclusion in a future revision of confusables.txt is up to the Unicode working group.
confusable-vision is MIT-licensed. The discovery data is CC-BY-4.0. The full candidate analysis, 96-script breakdown, and all render artifacts are in the repo. namespace-guard (v0.16.0, zero dependencies, MIT) integrates these discoveries via optional visual weights in confusableDistance().
Thanks to Akshat Gupta and colleagues for the GlyphNet paper, whose empirical findings on greyscale rendering informed confusable-vision’s methodology.