confusables.txt and NFKC disagree on 31 characters
and why your homoglyph detection needs to know about both
If you’ve ever built a login system, you’ve probably dealt with homoglyph attacks: someone registers аdmin with a Cyrillic “а” (U+0430) instead of Latin “a” (U+0061). The characters are visually identical, the slug passes your regex, and now you have an impersonation vector.
The Unicode Consortium maintains an official defence against this: confusables.txt, part of Unicode Technical Standard #39 (Security Mechanisms). It’s a flat file mapping ~6,565 characters to their visual equivalents. Cyrillic а → a, Greek ο → o, Cherokee Ꭺ → A, and thousands more.
It’s worth noting that confusables.txt is designed for detection, not normalization. TR39 itself says skeleton mappings are “not suitable for display to users” and “should definitely not be used as a normalization of identifiers.” The correct use is to check whether a submitted identifier contains characters that visually mimic Latin letters, and if so, reject it — not to silently remap those characters and let it through.
Here’s the problem. If your application also runs NFKC normalization (which it should — ENS, GitHub, and Unicode IDNA all require it), then 31 entries in confusables.txt conflict with what NFKC produces. These entries encode the wrong visual mapping for any system that normalizes input before checking it.
What NFKC normalization does
NFKC (Normalization Form Compatibility Composition) is Unicode’s way of collapsing “compatibility variants” to their canonical form. Fullwidth letters → ASCII, superscripts → normal digits, ligatures → component letters, mathematical styled characters → plain characters:
Ｈｅｌｌｏ → Hello (fullwidth → ASCII)
ﬁnance → finance (ﬁ ligature → fi)
𝐇ello → Hello (mathematical bold → plain)
This is the right first step for slug/handle validation. You want Ｈｅｌｌｏ to become hello, not to be rejected as containing confusable characters. NFKC handles hundreds of these compatibility forms automatically.
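The three examples above can be checked directly with String.prototype.normalize, which implements NFKC in any modern JavaScript runtime:

```typescript
// Fullwidth letters, the fi ligature, and mathematical bold all collapse under NFKC:
const samples: [string, string][] = [
  ["\uFF28\uFF45\uFF4C\uFF4C\uFF4F", "Hello"], // fullwidth Ｈｅｌｌｏ
  ["\uFB01nance", "finance"],                  // ﬁ ligature (U+FB01)
  ["\u{1D407}ello", "Hello"],                  // mathematical bold 𝐇 (U+1D407)
];
for (const [input, expected] of samples) {
  console.log(input.normalize("NFKC") === expected); // true for each
}
```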
NFKC and confusables serve different purposes. NFKC is for normalization: producing a canonical form for storage and comparison. Confusables detection is for security: flagging characters that could fool a human reader. They answer different questions about the same input, and in a well-designed system they’re applied separately rather than chained together to produce a single output.
The conflict
Here’s what nobody seems to talk about: confusables.txt and NFKC sometimes map the same character to different Latin letters.
The classic example is the Long S (ſ, U+017F). This is the archaic letterform you see in 18th-century printing, where “Congress” was printed as “Congreſs.”
- confusables.txt maps ſ → f (visually, ſ does look like f)
- NFKC normalization maps ſ → s (linguistically, ſ is s)
Both are defensible mappings, but they answer different questions. TR39 asks “what does this look like?” NFKC asks “what does this mean?”
Why does this matter? If you normalize with NFKC first (converting ſ to s), then check the confusable map, the ſ→f entry never fires - NFKC already handled the character. That’s fine. But if you build a confusable map from the raw TR39 data and use it without NFKC in front of it, or if someone later reorders the steps, those entries silently produce wrong results: teſt would be flagged as containing an f lookalike instead of being recognized as test.
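The NFKC side of this is easy to verify in any JavaScript REPL:

```typescript
// NFKC resolves the long s linguistically; TR39's visual mapping says "f".
console.log("\u017F".normalize("NFKC"));    // "s"
console.log("te\u017Ft".normalize("NFKC")); // "test"
// A raw TR39 lookup (ſ → f) with no NFKC in front would report an f-lookalike here.
```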
The full list: 31 entries
This isn’t a single edge case. I found 31 characters where confusables.txt and NFKC disagree:
The Long S
| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|---|---|---|---|---|
| ſ | Latin Small Letter Long S | U+017F | f | s |
TR39 sees the visual resemblance to f. But linguistically (and in NFKC), ſ is an archaic form of s. The NFKC mapping is unambiguously correct for any application that cares about meaning rather than just shape.
Capital I → l (16 variants)
confusables.txt maps capital I (and all its styled variants) to lowercase L. This is the classic Il1 ambiguity: in many sans-serif fonts, uppercase I, lowercase l, and digit 1 are nearly indistinguishable.
But NFKC normalizes styled variants of I back to plain I, and lowercasing gives you “i”, not “l”:
| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|---|---|---|---|---|
| ℐ | Script Capital I | U+2110 | l | i |
| ℑ | Fraktur Capital I | U+2111 | l | i |
| Ⅰ | Roman Numeral One | U+2160 | l | i |
| Ｉ | Fullwidth Latin Capital I | U+FF29 | l | i |
| 𝐈 | Mathematical Bold Capital I | U+1D408 | l | i |
| 𝐼 | Mathematical Italic Capital I | U+1D43C | l | i |
| 𝑰 | Mathematical Bold Italic Capital I | U+1D470 | l | i |
| 𝓘 | Mathematical Script Capital I (Bold) | U+1D4D8 | l | i |
| 𝕀 | Mathematical Double-Struck Capital I | U+1D540 | l | i |
| 𝕴 | Mathematical Fraktur Capital I (Bold) | U+1D574 | l | i |
| 𝖨 | Mathematical Sans-Serif Capital I | U+1D5A8 | l | i |
| 𝗜 | Mathematical Sans-Serif Bold Capital I | U+1D5DC | l | i |
| 𝘐 | Mathematical Sans-Serif Italic Capital I | U+1D610 | l | i |
| 𝙄 | Mathematical Sans-Serif Bold Italic Capital I | U+1D644 | l | i |
| 𝙸 | Mathematical Monospace Capital I | U+1D678 | l | i |
| | Outlined Latin Capital Letter I | U+1CCDE | l | i |
TR39 says all of these look like “l”. It’s right: they often do, in many fonts. But NFKC normalizes them all to plain “I”, which lowercases to “i”. If your system runs NFKC before confusable detection, the confusable map entry for these characters is unreachable - the character has already become “i” by the time you check it.
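A quick spot-check of a few rows from the table, using String.prototype.normalize:

```typescript
// Script, fraktur, Roman numeral, fullwidth, and mathematical bold capital I
// all collapse to plain "I" under NFKC, which lowercases to "i" — not "l".
const styledIs = ["\u2110", "\u2111", "\u2160", "\uFF29", "\u{1D408}"];
for (const ch of styledIs) {
  console.log(ch.normalize("NFKC").toLowerCase()); // "i" each time
}
```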
Digit 0 → o (7 variants)
Same pattern with digit zero. confusables.txt maps styled zeros to the letter “o” (visually similar), but NFKC collapses them to the digit “0”:
| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|---|---|---|---|---|
| 𝟎 | Mathematical Bold Digit Zero | U+1D7CE | o | 0 |
| 𝟘 | Mathematical Double-Struck Digit Zero | U+1D7D8 | o | 0 |
| 𝟢 | Mathematical Sans-Serif Digit Zero | U+1D7E2 | o | 0 |
| 𝟬 | Mathematical Sans-Serif Bold Digit Zero | U+1D7EC | o | 0 |
| 𝟶 | Mathematical Monospace Digit Zero | U+1D7F6 | o | 0 |
| 🯰 | Segmented Digit Zero | U+1FBF0 | o | 0 |
| | Outlined Digit Zero | U+1CCF0 | o | 0 |
In a slug context, 0 and o aren’t interchangeable. Your slug regex accepts both, but they mean different things. NFKC correctly preserves the digit.
Digit 1 → l (7 variants)
And the same again with digit one, where confusables.txt sees “l” but NFKC correctly maps to “1”:
| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|---|---|---|---|---|
| 𝟏 | Mathematical Bold Digit One | U+1D7CF | l | 1 |
| 𝟙 | Mathematical Double-Struck Digit One | U+1D7D9 | l | 1 |
| 𝟣 | Mathematical Sans-Serif Digit One | U+1D7E3 | l | 1 |
| 𝟭 | Mathematical Sans-Serif Bold Digit One | U+1D7ED | l | 1 |
| 𝟷 | Mathematical Monospace Digit One | U+1D7F7 | l | 1 |
| 🯱 | Segmented Digit One | U+1FBF1 | l | 1 |
| | Outlined Digit One | U+1CCF1 | l | 1 |
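Spot-checking a row from each digit table shows the same pattern:

```typescript
// Styled zeros and ones normalize back to the ASCII digits, not to letters:
console.log("\u{1D7CE}".normalize("NFKC")); // "0" (Mathematical Bold Digit Zero)
console.log("\u{1D7D9}".normalize("NFKC")); // "1" (Mathematical Double-Struck Digit One)
```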
Why this happens
This isn’t a bug in either standard. TR39 and NFKC have different purposes, and they were designed independently:
confusables.txt answers: “What does this character visually resemble?” It’s designed for the skeleton algorithm, which compares two strings for visual similarity. Mathematical Bold I (𝐈) looks like lowercase l in most fonts. That’s a legitimate visual observation.
NFKC normalization answers: “What is the canonical form of this character?” Mathematical Bold I is semantically the letter I rendered in a bold mathematical style. NFKC strips the styling, yielding plain I.
These are orthogonal concerns. Confusability is about what humans see. NFKC is about what machines should store. The problem arises when you use both (which you should) without reconciling them. UTS #39 doesn’t specify how confusable detection should interact with NFKC normalization, even though most modern systems apply both.
The practical impact
If you build a confusable detection system and also run NFKC normalization, you need to know about these 31 entries:
If you run NFKC first, then check confusables: The 31 entries are unreachable. NFKC has already transformed the character before your confusable check sees it. This is harmless - they’re dead code in your detection map, not a security hole.
If you check confusables without NFKC: Those 31 entries produce incorrect detection results. Your system would flag ſ as an f-lookalike (it’s actually s), flag mathematical zeros as o-lookalikes (they’re actually 0), and flag mathematical ones as l-lookalikes (they’re actually 1). The detection is wrong, even if you’re correctly rejecting rather than remapping.
If you use confusables for remapping (don’t do this): The problems compound. teſt becomes teft instead of test. account10 with a mathematical 1 becomes accountlo. As TR39 states, confusable mappings should not be used as normalization.
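To make the compounding failure concrete, here is a hypothetical naive remapper built straight from raw TR39 pairs (the tr39 map and remap helper are illustrative, not real library code):

```typescript
// Three raw TR39 visual mappings, used (wrongly) as a normalization table:
const tr39: Record<string, string> = {
  "\u017F": "f",    // ſ → f
  "\u{1D7CF}": "l", // 𝟏 → l
  "\u{1D7CE}": "o", // 𝟎 → o
};
const remap = (s: string) => [...s].map((c) => tr39[c] ?? c).join("");

console.log(remap("te\u017Ft"));                  // "teft"     — should be "test"
console.log(remap("account\u{1D7CF}\u{1D7CE}")); // "accountlo" — should be "account10"
```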
What to do about it
The approach depends on how you use confusables:
If you use confusables for detection and rejection (recommended)
Filter your confusable map to exclude any character that NFKC already handles. This keeps your map clean and ensures every entry represents a character your system will actually encounter:
```typescript
// Assumes rawTr39Entries holds { source, target } pairs parsed from confusables.txt
for (const { source: sourceCp, target: confusableTarget } of rawTr39Entries) {
  const sourceChar = String.fromCodePoint(sourceCp);
  const nfkcResult = sourceChar.normalize("NFKC").toLowerCase();

  // NFKC already maps to a single Latin letter/digit - skip this entry
  // (either same target = redundant, or different target = conflict)
  if (/^[a-z0-9]$/.test(nfkcResult)) continue;

  // NFKC produces a valid slug fragment - skip (already handled)
  if (/^[a-z0-9-]+$/.test(nfkcResult)) continue;

  // NFKC doesn't resolve to ASCII - keep this confusable entry
  entries.push({ source: sourceCp, target: confusableTarget });
}
```
This takes you from ~6,565 raw TR39 entries to ~613 that are meaningful after NFKC. Every remaining entry is a character that survives NFKC unchanged and visually mimics a Latin letter.
In namespace-guard, this is how it works in practice: NFKC is applied first during normalization when storing and comparing slugs. The confusable map then runs on the normalized input as a completely separate validation step — a blocklist. If any character in the normalized slug matches the map, the slug is rejected. No remapping, no skeleton, no merged output. Just: “does this string contain a character that looks like a Latin letter but isn’t one? If yes, reject.”
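A minimal sketch of that two-step flow (validateSlug and the confusable set are illustrative names, not the actual namespace-guard API):

```typescript
// A tiny stand-in for the filtered confusable map: code points that survive
// NFKC and visually mimic Latin letters, e.g. Cyrillic а and Greek ο.
const confusable = new Set<number>([0x0430, 0x03bf]);

function validateSlug(raw: string): { ok: boolean; reason?: string } {
  const slug = raw.normalize("NFKC").toLowerCase(); // step 1: normalize
  for (const ch of slug) {                          // step 2: separate blocklist check
    const cp = ch.codePointAt(0)!;
    if (confusable.has(cp)) {
      return { ok: false, reason: `confusable U+${cp.toString(16).toUpperCase()}` };
    }
  }
  return { ok: true };
}

console.log(validateSlug("\u0430dmin").ok); // false — Cyrillic а survives NFKC, gets rejected
console.log(validateSlug("\uFF28ello").ok); // true  — fullwidth Ｈ already became plain "h"
```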
If you need to run confusables without NFKC
Be aware that 31 entries in confusables.txt encode visual mappings that disagree with the linguistic identity of the character. Whether that matters depends on your use case. For the skeleton algorithm (comparing two strings for visual similarity), these entries are correct. For anything that cares about what a character means, they’ll give you the wrong answer.
Making it reproducible
Rather than hand-curating a confusable map (which becomes stale when Unicode ships new versions), I wrote a generator script that:
- Downloads confusables.txt from unicode.org
- Extracts all single-character → Latin letter/digit mappings
- Filters out NFKC-redundant and NFKC-conflicting entries
- Adds supplemental mappings for known gaps (e.g., Latin small capitals that confusables.txt misses)
- Outputs a TypeScript object literal, grouped by Unicode block
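The extraction step depends on the confusables.txt line format: source code points, target code points, and a type field, separated by semicolons, with # starting a comment. A sketch of parsing a single line, assuming that layout (parseLine is an illustrative helper, not the actual generator code):

```typescript
// Parse one confusables.txt line into source/target code-point arrays.
function parseLine(line: string): { source: number[]; target: number[] } | null {
  const stripped = line.split("#")[0].trim();
  if (!stripped) return null; // blank or comment-only line

  const [src, dst] = stripped.split(";").map((f) => f.trim());
  const toCps = (field: string) => field.split(/\s+/).map((h) => parseInt(h, 16));
  return { source: toCps(src), target: toCps(dst) };
}

console.log(parseLine("0430 ;\t0061 ;\tMA\t# ( \u0430 \u2192 a )"));
// source is [0x430] (Cyrillic а), target is [0x61] (Latin a)
```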
The script prints stats to stderr so you can verify the filtering:
```text
Filtered to 605 entries from TR39
Skipped 31 NFKC-conflict entries (NFKC maps to different Latin char)
Skipped 766 NFKC-handled entries (NFKC produces valid slug fragment)
Added 8 supplemental entries (Latin small capitals)
Total: 613 entries
```
When a new Unicode version ships, re-run the script and you get an updated map automatically filtered against the current runtime’s NFKC implementation. The exact counts depend on two things: the version of confusables.txt you download, and your runtime’s Unicode data tables (what String.prototype.normalize uses). The numbers in this post are from the current Unicode 16.0 data.
The broader lesson
Unicode is not one monolithic standard. It’s a collection of semi-independent specifications maintained by different working groups. UAX #15 (normalization) and UTS #39 (security) were designed for different use cases and don’t explicitly account for each other.
The 31 conflicting entries aren’t a bug in Unicode. They’re a gap in the documentation. confusables.txt mappings are visual judgments. NFKC mappings are semantic equivalences. Both are correct in their own context. The gap is that UTS #39 doesn’t document how confusable detection should interact with NFKC normalization, even though most modern applications use both.
I’ve reported these 31 conflicts to the Unicode Consortium as a documentation gap in UTS #39.
The NFKC-aware confusable map (613 entries, ~2.5 KB gzipped) ships as part of namespace-guard, a zero-dependency TypeScript library for slug/handle validation. The generator script is at scripts/generate-confusables.ts.
Thanks to ficiek, v4ss42, nemec, LousyBeggar, carrottread, medforddad, and Herb_Derb on r/programming for feedback that shaped this revision.