confusables.txt and NFKC disagree on 31 characters

and why your homoglyph detection is probably wrong

If you’ve ever built a login system, you’ve probably dealt with homoglyph attacks: someone registers аdmin with a Cyrillic “а” (U+0430) instead of Latin “a” (U+0061). The characters are visually identical, the slug passes your regex, and now you have an impersonation vector.

The Unicode Consortium maintains an official defence against this: confusables.txt, part of Unicode Technical Standard #39 (Security Mechanisms). It’s a flat file mapping ~6,565 characters to their visual equivalents. Cyrillic а → a, Greek ο → o, Cherokee Ꭺ → A, and thousands more.

The standard approach is straightforward: build a lookup map from confusables.txt, run every incoming character through it, done. That’s what most libraries do.
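As a rough sketch of that naive approach, the snippet below parses lines in the confusables.txt layout (source ; target ; type # comment) into a map and substitutes every character that has an entry. The names parseConfusables and naiveSkeleton are made up for illustration, and the two sample lines are abbreviated stand-ins for the real file.

```typescript
// Hypothetical helper: parse "SOURCE ; TARGET ; TYPE # comment" lines into a
// source -> target map. Code points are space-separated hex values.
function parseConfusables(text: string): Map<string, string> {
  const decode = (hexList: string): string =>
    hexList
      .split(/\s+/)
      .filter((hex) => hex !== "")
      .map((hex) => String.fromCodePoint(parseInt(hex, 16)))
      .join("");

  const map = new Map<string, string>();
  for (const rawLine of text.split("\n")) {
    // Drop the trailing comment, then surrounding whitespace.
    const hashAt = rawLine.indexOf("#");
    const line = (hashAt === -1 ? rawLine : rawLine.slice(0, hashAt)).trim();
    if (line === "") continue;

    const [src, dst] = line.split(";").map((field) => field.trim());
    if (src === undefined || dst === undefined || src === "" || dst === "") continue;
    map.set(decode(src), decode(dst));
  }
  return map;
}

// Naive approach: replace every character that has a confusable entry.
function naiveSkeleton(input: string, map: Map<string, string>): string {
  return Array.from(input, (ch) => map.get(ch) ?? ch).join("");
}

// Two sample lines in the file's layout (comments abbreviated).
const sampleLines = [
  "0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
  "03BF ;\t006F ;\tMA\t# GREEK SMALL LETTER OMICRON -> LATIN SMALL LETTER O",
].join("\n");

const naiveMap = parseConfusables(sampleLines);
console.log(naiveSkeleton("\u0430dmin", naiveMap)); // "admin" (Cyrillic а replaced)
```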

There’s a problem. If your application also runs NFKC normalization (which it should, and which ENS, GitHub, and Unicode IDNA all mandate), then 31 entries in confusables.txt are not just redundant, they’re wrong.

What NFKC normalization does

NFKC (Normalization Form Compatibility Composition) is Unicode’s way of collapsing “compatibility variants” to their canonical form. Fullwidth letters → ASCII, superscripts → normal digits, ligatures → component letters, mathematical styled characters → plain characters:

Ｈｅｌｌｏ   → Hello     (fullwidth → ASCII)
ﬁnance    → finance   (ﬁ ligature → fi)
𝐇ello     → Hello     (mathematical bold → plain)

This is the right first step for slug/handle validation. You want Ｈｅｌｌｏ to become hello, not to be rejected as containing confusable characters. NFKC handles hundreds of these compatibility forms automatically.
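In JavaScript and TypeScript this step needs no library, since String.prototype.normalize is built in. A minimal sketch, where toSlugForm is a hypothetical helper name and the inputs are written as escapes so the source stays unambiguous:

```typescript
// Hypothetical helper: NFKC-normalize, then lowercase.
function toSlugForm(input: string): string {
  return input.normalize("NFKC").toLowerCase();
}

const nfkcExamples: Array<[string, string]> = [
  ["fullwidth Hello", "\uFF28\uFF45\uFF4C\uFF4C\uFF4F"],
  ["fi ligature + nance", "\uFB01nance"],
  ["mathematical bold H + ello", String.fromCodePoint(0x1d407) + "ello"],
];

for (const [label, input] of nfkcExamples) {
  console.log(`${label}: ${input.normalize("NFKC")} -> ${toSlugForm(input)}`);
}
// fullwidth Hello: Hello -> hello
// fi ligature + nance: finance -> finance
// mathematical bold H + ello: Hello -> hello
```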

The conflict

Here’s what nobody seems to talk about: confusables.txt and NFKC sometimes map the same character to different Latin letters.

The classic example is the Long S (ſ, U+017F). This is the archaic letterform you see in 18th-century printing, where “Congress” was printed as “Congreſs.”

  • confusables.txt maps ſ → f (visually, ſ does look like f)
  • NFKC normalization maps ſ → s (linguistically, ſ is s)

Both are defensible mappings, but they answer different questions. TR39 asks “what does this look like?” NFKC asks “what does this mean?”

If you build a pipeline that runs NFKC first (as you should), then applies your confusable map, the confusable entry for ſ is dead code. NFKC already converted it to “s” before your map ever sees it. And if you somehow applied the confusable map first, you’d get the wrong answer: teſt would become teft instead of test.
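The ordering effect is easy to reproduce. The sketch below uses a one-entry map holding only the TR39 mapping for ſ (names such as visualMap and applyVisualMap are illustrative) and runs the two stages in both orders:

```typescript
// One-entry stand-in for a raw confusables map: long s -> f (the TR39 mapping).
const visualMap = new Map<string, string>([["\u017F", "f"]]);

const applyVisualMap = (s: string): string =>
  Array.from(s, (ch) => visualMap.get(ch) ?? ch).join("");

const longSInput = "te\u017Ft"; // "teſt"

// NFKC first: ſ is already "s" when the map runs, so the entry never fires.
const nfkcFirst = applyVisualMap(longSInput.normalize("NFKC"));

// Map first: ſ becomes "f", and NFKC has nothing left to correct.
const mapFirst = applyVisualMap(longSInput).normalize("NFKC");

console.log(nfkcFirst); // "test"
console.log(mapFirst); // "teft"
```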

The full list: 31 entries

This isn’t a single edge case. I found 31 characters where confusables.txt and NFKC disagree:

The Long S

Char | Name | Codepoint | TR39 maps to | NFKC maps to
ſ | Latin Small Letter Long S | U+017F | f | s

TR39 sees the visual resemblance to f. But linguistically (and in NFKC), ſ is an archaic form of s. The NFKC mapping is unambiguously correct for any application that cares about meaning rather than just shape.

Capital I → l (16 variants)

confusables.txt maps capital I (and all its styled variants) to lowercase L. This is the classic Il1 ambiguity: in many sans-serif fonts, uppercase I, lowercase l, and digit 1 are nearly indistinguishable.

But NFKC normalizes styled variants of I back to plain I, and lowercasing gives you “i”, not “l”:

Char | Name | Codepoint | TR39 maps to | NFKC maps to
ℐ | Script Capital I | U+2110 | l | i
ℑ | Fraktur Capital I | U+2111 | l | i
Ⅰ | Roman Numeral One | U+2160 | l | i
Ｉ | Fullwidth Latin Capital I | U+FF29 | l | i
𝐈 | Mathematical Bold Capital I | U+1D408 | l | i
𝐼 | Mathematical Italic Capital I | U+1D43C | l | i
𝑰 | Mathematical Bold Italic Capital I | U+1D470 | l | i
𝓘 | Mathematical Script Capital I (Bold) | U+1D4D8 | l | i
𝕀 | Mathematical Double-Struck Capital I | U+1D540 | l | i
𝕴 | Mathematical Fraktur Capital I (Bold) | U+1D574 | l | i
𝖨 | Mathematical Sans-Serif Capital I | U+1D5A8 | l | i
𝗜 | Mathematical Sans-Serif Bold Capital I | U+1D5DC | l | i
𝘐 | Mathematical Sans-Serif Italic Capital I | U+1D610 | l | i
𝙄 | Mathematical Sans-Serif Bold Italic Capital I | U+1D644 | l | i
𝙸 | Mathematical Monospace Capital I | U+1D678 | l | i
𜳞 | Outlined Latin Capital Letter I | U+1CCDE | l | i

TR39 says all of these look like “l”. It’s right: they often do, in many fonts. But NFKC normalizes them all to plain “I”, which lowercases to “i”. In an NFKC-first pipeline, the confusable map entry is never reached. The character has already become “i” by the time you check it.

Digit 0 → o (7 variants)

Same pattern with digit zero. confusables.txt maps styled zeros to the letter “o” (visually similar), but NFKC collapses them to the digit “0”:

Char | Name | Codepoint | TR39 maps to | NFKC maps to
𝟎 | Mathematical Bold Digit Zero | U+1D7CE | o | 0
𝟘 | Mathematical Double-Struck Digit Zero | U+1D7D8 | o | 0
𝟢 | Mathematical Sans-Serif Digit Zero | U+1D7E2 | o | 0
𝟬 | Mathematical Sans-Serif Bold Digit Zero | U+1D7EC | o | 0
𝟶 | Mathematical Monospace Digit Zero | U+1D7F6 | o | 0
🯰 | Segmented Digit Zero | U+1FBF0 | o | 0
𜳰 | Outlined Digit Zero | U+1CCF0 | o | 0

In a slug context, 0 and o aren’t interchangeable. Your slug regex accepts both, but they mean different things. An NFKC-first pipeline correctly preserves the digit.

Digit 1 → l (7 variants)

And the same again with digit one, where confusables.txt sees “l” but NFKC correctly maps to “1”:

Char | Name | Codepoint | TR39 maps to | NFKC maps to
𝟏 | Mathematical Bold Digit One | U+1D7CF | l | 1
𝟙 | Mathematical Double-Struck Digit One | U+1D7D9 | l | 1
𝟣 | Mathematical Sans-Serif Digit One | U+1D7E3 | l | 1
𝟭 | Mathematical Sans-Serif Bold Digit One | U+1D7ED | l | 1
𝟷 | Mathematical Monospace Digit One | U+1D7F7 | l | 1
🯱 | Segmented Digit One | U+1FBF1 | l | 1
𜳱 | Outlined Digit One | U+1CCF1 | l | 1
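You can spot-check rows from these tables against your own runtime. The sketch below takes a handful of the code points listed above, paired with the TR39 target from the tables, and reports where NFKC plus lowercasing lands somewhere else. The sample list is hand-picked rather than read from confusables.txt, and it sticks to older code points so it does not depend on how recent the runtime's Unicode data is.

```typescript
interface Sample {
  cp: number; // source code point
  tr39: string; // target listed in the tables above
}

const samples: Sample[] = [
  { cp: 0x017f, tr39: "f" }, // long s
  { cp: 0x2160, tr39: "l" }, // Roman numeral one
  { cp: 0xff29, tr39: "l" }, // fullwidth capital I
  { cp: 0x1d408, tr39: "l" }, // mathematical bold capital I
  { cp: 0x1d7ce, tr39: "o" }, // mathematical bold digit zero
  { cp: 0x1d7cf, tr39: "l" }, // mathematical bold digit one
];

function nfkcLower(cp: number): string {
  return String.fromCodePoint(cp).normalize("NFKC").toLowerCase();
}

// Format as U+XXXX, left-padding to at least four hex digits.
function formatCp(cp: number): string {
  const hex = cp.toString(16).toUpperCase();
  return "U+" + (hex.length < 4 ? "0000".slice(hex.length) + hex : hex);
}

const disagreements = samples
  .map((s) => ({ cp: s.cp, tr39: s.tr39, nfkc: nfkcLower(s.cp) }))
  .filter((s) => s.nfkc !== s.tr39);

for (const d of disagreements) {
  console.log(`${formatCp(d.cp)}: TR39 -> ${d.tr39}, NFKC -> ${d.nfkc}`);
}
```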

Why this happens

This isn’t a bug in either standard. TR39 and NFKC have different purposes:

confusables.txt answers: “What does this character visually resemble?” It’s designed for the skeleton algorithm, which compares two strings for visual similarity. Mathematical Bold I (𝐈) looks like lowercase l in most fonts. That’s a legitimate visual observation.

NFKC normalization answers: “What is the canonical form of this character?” Mathematical Bold I is semantically the letter I rendered in a bold mathematical style. NFKC strips the styling, yielding plain I.

The problem arises when you combine both in a pipeline (which you should). The two standards don’t know about each other. No one at Unicode has published guidance on what happens when you run NFKC before a confusable check, even though that’s the only correct order for applications that normalize input.

The practical impact

If you naively dump all of confusables.txt into a lookup map and also run NFKC normalization, you have two problems:

1. Dead code. 31 entries in your map will never trigger. NFKC transforms the source character before it reaches your map. These entries consume memory and slow down audits without providing any security value.

2. Wrong answers if stage order is reversed. If someone tries to “optimise” by running the confusable map first (before NFKC), those 31 entries produce incorrect results. teſt becomes teft. 𝐈nbox becomes lnbox. Mathematical Bold 0 (𝟎) in a numeric identifier gets silently converted to the letter o.

Neither outcome is catastrophic on its own, but it’s the kind of subtle incorrectness that erodes trust in security code.

The fix

The fix is simple: make your confusable map NFKC-aware. Before including an entry from confusables.txt, check whether NFKC normalization already handles that character, and if so, whether the two mappings agree:

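// Runs once per confusables.txt entry, inside the generator's loop.
// sourceCp is the entry's source code point; confusableTarget is its TR39
// target, compared here against a lowercased NFKC result.
const sourceChar = String.fromCodePoint(sourceCp);
const nfkcResult = sourceChar.normalize("NFKC").toLowerCase();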

// Case 1: NFKC already maps to the same target → skip (redundant)
if (nfkcResult === confusableTarget) continue;

// Case 2: NFKC maps to a DIFFERENT Latin letter/digit → skip (conflict)
if (/^[a-z0-9]$/.test(nfkcResult) && nfkcResult !== confusableTarget) {
  nfkcConflicts++;
  continue;
}

// Case 3: NFKC produces a valid slug fragment → skip (already handled)
if (/^[a-z0-9-]+$/.test(nfkcResult)) continue;

// Case 4: NFKC doesn't resolve to ASCII → keep the confusable mapping
entries.push({ source: sourceCp, target: confusableTarget });

Case 1 handles the ~766 Mathematical Alphanumeric entries where NFKC and TR39 agree (mathematical bold a → a, etc.). Case 2 catches the 31 conflict entries. Case 3 catches multi-character NFKC expansions (ligatures). Case 4 keeps everything else: the characters NFKC doesn’t touch, which is where your confusable map actually earns its keep.
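For readers who want to run the four cases end to end, here is a self-contained sketch of the same filter wrapped in a function. The RawEntry and FilterResult types, the function name, and the three demo entries are assumptions made for illustration; targets are assumed to be lowercase Latin letters or digits already.

```typescript
interface RawEntry {
  source: number; // source code point
  target: string; // TR39 target, assumed already lowercased
}

interface FilterResult {
  entries: RawEntry[]; // Case 4: kept
  redundant: number; // Case 1: NFKC agrees with TR39
  conflicts: number; // Case 2: NFKC yields a different letter/digit
  expanded: number; // Case 3: NFKC yields a multi-character slug fragment
}

function filterForNfkcFirst(raw: RawEntry[]): FilterResult {
  const result: FilterResult = { entries: [], redundant: 0, conflicts: 0, expanded: 0 };
  for (const { source, target } of raw) {
    const nfkc = String.fromCodePoint(source).normalize("NFKC").toLowerCase();
    if (nfkc === target) {
      result.redundant++;
    } else if (/^[a-z0-9]$/.test(nfkc)) {
      result.conflicts++;
    } else if (/^[a-z0-9-]+$/.test(nfkc)) {
      result.expanded++;
    } else {
      result.entries.push({ source, target });
    }
  }
  return result;
}

const demoResult = filterForNfkcFirst([
  { source: 0x1d41a, target: "a" }, // mathematical bold small a: NFKC agrees
  { source: 0x017f, target: "f" }, // long s: NFKC says "s"
  { source: 0x0430, target: "a" }, // Cyrillic а: NFKC leaves it unchanged
]);

console.log(demoResult.redundant, demoResult.conflicts, demoResult.entries.length); // 1 1 1
```

Only the Cyrillic entry survives, which is the behaviour you want: it is the one character of the three that NFKC cannot fix on its own.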

After this filtering, you go from ~6,565 raw TR39 entries to ~613 entries that are meaningful in an NFKC-first pipeline. Every entry in the filtered map represents a character that:

  • Visually resembles a Latin letter or digit
  • Survives NFKC normalization unchanged
  • Would otherwise slip past your slug validation

The pipeline

The correct architecture is a three-stage pipeline where each stage is aware of the others:

Input  →  NFKC normalize  →  Confusable map (613 entries)  →  Mixed-script reject

Stage 1 (NFKC) handles compatibility forms: fullwidth, mathematical, superscripts, ligatures. Cheap, well-understood, and mandated by most standards.

Stage 2 (Confusable map) handles characters that survive NFKC but visually mimic Latin: Cyrillic а→a, Greek ο→o, Armenian հ→h, Cherokee Ꭺ→A, IPA ɑ→a, and 600+ others across 20+ scripts.

Stage 3 (Mixed-script rejection) catches characters from non-Latin scripts that aren’t in the confusable map. Even if we don’t know which Latin letter Georgian ყ resembles, an identifier mixing Georgian and Latin characters is suspicious enough to reject.

Each stage handles a different class of threat, and the confusable map is explicitly filtered to exclude anything the NFKC stage already handles. No overlap, no conflicts, no dead code.
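A compressed sketch of the three stages follows. It is not namespace-guard's implementation: the two-entry map stands in for the filtered map, checkHandle and Verdict are invented names, and stage 3 is simplified to "reject anything still outside a-z, 0-9 and hyphen" rather than real script detection. Returning the mapped string as a skeleton for comparison against reserved or existing names is also an assumption.

```typescript
type Verdict = { ok: true; skeleton: string } | { ok: false; reason: string };

// Tiny stand-in for the filtered confusable map.
const confusableMap = new Map<string, string>([
  ["\u0430", "a"], // Cyrillic а
  ["\u03BF", "o"], // Greek ο
]);

function checkHandle(input: string): Verdict {
  // Stage 1: NFKC collapses compatibility forms; then lowercase.
  const normalized = input.normalize("NFKC").toLowerCase();

  // Stage 2: map look-alikes that survived NFKC.
  const mapped = Array.from(normalized, (ch) => confusableMap.get(ch) ?? ch).join("");

  // Stage 3 (simplified stand-in for mixed-script rejection):
  // anything still outside the slug alphabet is refused.
  for (const ch of mapped) {
    if (!/^[a-z0-9-]$/.test(ch)) {
      const cp = ch.codePointAt(0) ?? 0;
      return { ok: false, reason: "unexpected character U+" + cp.toString(16).toUpperCase() };
    }
  }
  return { ok: true, skeleton: mapped };
}

console.log(checkHandle("\u0430dmin")); // ok, skeleton "admin"
console.log(checkHandle("\uFF28ello")); // ok, skeleton "hello"
console.log(checkHandle("\u10E7abc")); // rejected: Georgian ყ is not in the map
```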

Making it reproducible

Rather than hand-curating a confusable map (which becomes stale when Unicode ships new versions), I wrote a generator script that:

  1. Downloads confusables.txt from unicode.org
  2. Extracts all single-character → Latin letter/digit mappings
  3. Runs the NFKC deconfliction filter described above
  4. Adds supplemental mappings for known gaps (e.g., Latin small capitals that confusables.txt misses)
  5. Outputs a TypeScript object literal, grouped by Unicode block
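Steps 2 and 5 might look roughly like the sketch below. It is an illustration, not the actual generator: ParsedLine, the function names, and the CONFUSABLES export name are invented, targets are lowercased as an assumption, and grouping by Unicode block is left out.

```typescript
interface ParsedLine {
  source: string; // decoded source sequence
  target: string; // decoded target sequence
}

// Step 2: keep only single-code-point sources whose target is one Latin letter or digit.
function toSingleCharLatin(lines: ParsedLine[]): Array<[number, string]> {
  const out: Array<[number, string]> = [];
  for (const { source, target } of lines) {
    const lowered = target.toLowerCase();
    if (Array.from(source).length !== 1 || !/^[a-z0-9]$/.test(lowered)) continue;
    const cp = source.codePointAt(0);
    if (cp === undefined) continue;
    out.push([cp, lowered]);
  }
  return out;
}

// Step 5: render the pairs as a TypeScript object literal (no block grouping here).
function renderObjectLiteral(pairs: Array<[number, string]>): string {
  const body = pairs
    .map(([cp, target]) => `  "\\u{${cp.toString(16).toUpperCase()}}": "${target}",`)
    .join("\n");
  return `export const CONFUSABLES: Record<string, string> = {\n${body}\n};`;
}

const pairs = toSingleCharLatin([
  { source: "\u0430", target: "a" }, // Cyrillic а -> a: kept
  { source: "\u13AA", target: "A" }, // Cherokee Ꭺ -> A: kept, target lowercased
  { source: "rn", target: "m" }, // multi-character source: dropped
  { source: "\u0133", target: "ij" }, // multi-character target: dropped
]);

console.log(renderObjectLiteral(pairs));
```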

The script prints stats to stderr so you can verify the filtering:

Filtered to 605 entries from TR39
  Skipped 31 NFKC-conflict entries (NFKC maps to different Latin char)
  Skipped 766 NFKC-handled entries (NFKC produces valid slug fragment)
Added 8 supplemental entries (Latin small capitals)
Total: 613 entries

When a new Unicode version ships new confusable mappings, you re-run the script and get an updated map that’s automatically filtered against the current runtime’s NFKC implementation. No manual review required for the standard cases.

Who does this affect?

Any application that:

  1. Validates usernames, handles, or slugs
  2. Runs NFKC normalization (ENS, IDNA, GitHub handle rules, or just good practice)
  3. Uses confusables.txt for homoglyph detection

If you’re doing all three (and you should be), you need to reconcile the two standards. Most libraries I surveyed ship the raw TR39 data without checking for NFKC conflicts. The entries are harmless (dead code) in a correctly ordered pipeline, but they represent a latent bug if anyone reorders the stages, and they make the codebase harder to audit.

The broader lesson

Unicode is not one monolithic standard. It’s a collection of semi-independent specifications maintained by different working groups. UAX #15 (normalization) and UTS #39 (security) were designed for different use cases and don’t explicitly account for each other. When you combine them (which modern applications routinely do), you inherit the responsibility of reconciling their overlaps.

The 31 conflicting entries aren’t a bug in Unicode. They’re a gap in the documentation. The confusables.txt mappings are visual judgments. NFKC mappings are semantic equivalences. Both are correct in their own context. The bug is in assuming they compose cleanly, when in fact they need an explicit reconciliation step that, as far as I can tell, no one has written up before.


The NFKC-aware confusable map (613 entries, ~2.5 KB gzipped) ships as part of namespace-guard, a zero-dependency TypeScript library for slug/handle validation. The generator script is at scripts/generate-confusables.ts.

Update: I’ve reported these 31 conflicts to the Unicode Consortium as a documentation gap in UTS #39.