28 CJK and Hangul characters look like Latin letters

I scanned 122,862 codepoints from logographic scripts across 230 fonts. Most look nothing like Latin. But 28 do.

In the previous post, I scanned 23,317 identifier-safe Unicode characters against Latin a-z and found 793 novel confusables. That scan deliberately excluded CJK ideographs, Hangul syllables, Cuneiform, Egyptian Hieroglyphs, and other logographic scripts (122,862 codepoints in total) because structurally dense 2D characters were expected to look nothing like thin Latin strokes.

That expectation was mostly right. But not entirely.

The scan

Milestone 2b of confusable-vision runs the same pipeline on every excluded codepoint. Of 122,862 candidates:

  • 49,859 (40.6%) have font coverage in at least one of 230 macOS system fonts
  • 73,003 (59.4%) have no system font and cannot be rendered

The coverage rate is lower than M2’s 53.8% because CJK Extensions B through I (69,932 codepoints) have minimal system font support. Characters with coverage average only 1.9 fonts each, compared to 7.1 in M2.

Each covered candidate was rendered in its native font(s), normalised to 48x48 greyscale, and compared against all 36 Latin targets (a-z, 0-9) using pHash prefiltering and SSIM scoring. Total: 8,036,479 SSIM comparisons in 46 minutes.
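The prefilter-then-score step can be sketched roughly as below. This is an illustration under stated assumptions, not confusable-vision's actual code: images are flat greyscale arrays (0-255) of equal length, a simple average hash stands in for pHash, and SSIM is computed once over the whole image rather than over sliding windows. The function names and the `maxHashDistance` cutoff are hypothetical.

```typescript
// Sketch only: reject obviously different pairs with a cheap hash,
// then score the survivors with SSIM.
type Grey = number[]; // flat greyscale pixels, 0-255, e.g. 48 * 48 values

// Average hash: one bit per pixel, set when the pixel is brighter than the mean.
// (A stand-in for pHash, which hashes low-frequency DCT coefficients instead.)
function averageHash(img: Grey): boolean[] {
  const mean = img.reduce((s, v) => s + v, 0) / img.length;
  return img.map((v) => v > mean);
}

function hammingDistance(a: boolean[], b: boolean[]): number {
  let d = 0;
  for (let i = 0; i < a.length; i++) if (a[i] !== b[i]) d++;
  return d;
}

// Single-window SSIM over the whole image, using the usual constants for 8-bit data.
function globalSsim(x: Grey, y: Grey): number {
  const n = x.length;
  const mx = x.reduce((s, v) => s + v, 0) / n;
  const my = y.reduce((s, v) => s + v, 0) / n;
  let vx = 0;
  let vy = 0;
  let cov = 0;
  for (let i = 0; i < n; i++) {
    vx += (x[i] - mx) ** 2;
    vy += (y[i] - my) ** 2;
    cov += (x[i] - mx) * (y[i] - my);
  }
  vx /= n - 1;
  vy /= n - 1;
  cov /= n - 1;
  const c1 = (0.01 * 255) ** 2;
  const c2 = (0.03 * 255) ** 2;
  return ((2 * mx * my + c1) * (2 * cov + c2)) /
    ((mx * mx + my * my + c1) * (vx + vy + c2));
}

// Returns null when the prefilter rejects the pair, otherwise the SSIM score.
function scorePair(src: Grey, target: Grey, maxHashDistance: number): number | null {
  if (hammingDistance(averageHash(src), averageHash(target)) > maxHashDistance) {
    return null;
  }
  return globalSsim(src, target);
}
```

The point of the prefilter is cost: a hash comparison is far cheaper than SSIM, so pairs that are clearly different never reach the expensive step.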

28 characters, 69 pairs

Band | Count | % | Description
High (>= 0.7) | 69 | 0.004% | Genuinely confusable
Medium (0.3-0.7) | 6,564 | 0.4% | Somewhat similar
Low (< 0.3) | 1,688,064 | 99.6% | Not visually confusable
Total | 1,694,697
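The banding itself is just two thresholds. A minimal sketch, assuming the boundaries are inclusive at the lower edge (0.7 counts as High, 0.3 as Medium), which is how the table reads; the type and function names are hypothetical:

```typescript
type Band = "high" | "medium" | "low";

// Thresholds from the table: High >= 0.7, Medium 0.3-0.7, Low < 0.3.
function bandFor(ssim: number): Band {
  if (ssim >= 0.7) return "high";
  if (ssim >= 0.3) return "medium";
  return "low";
}

// Share of each band as a percentage of all scored pairs.
function bandPercentages(counts: Record<Band, number>): Record<Band, number> {
  const total = counts.high + counts.medium + counts.low;
  return {
    high: (counts.high / total) * 100,
    medium: (counts.medium / total) * 100,
    low: (counts.low / total) * 100,
  };
}

// Counts taken from the table above.
const m2bShares = bandPercentages({ high: 69, medium: 6564, low: 1688064 });
```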

69 pairs from 28 distinct source characters cross the 0.7 SSIM threshold. All 28 are simple geometric primitives (vertical strokes, circles, and basic cross shapes) that happen to live in ranges otherwise dominated by complex ideographs.

No complex CJK ideograph is confusable with Latin. The 76,891 characters in CJK Extensions A through I produced only 1 high-scoring pair. Hangul Syllables (11,172 characters) produced zero. Dense logographic glyphs are structurally incompatible with Latin letterforms, exactly as expected.

Where the discoveries are

The 69 pairs concentrate in five ranges, which together account for 58 of them:

Range | High-scoring pairs | What they are
Egyptian Hieroglyphs | 19 | Simple geometric hieroglyphs (strokes, circles)
Cuneiform | 13 | Wedge-mark numerals that are thin vertical strokes
CJK Unified Ideographs | 10 | Only the simplest stroke characters (丨, 丄, 丅)
CJK Symbols/Bopomofo | 9 | Hangzhou numerals (〡, 〸) and Bopomofo letters
Hangul Jamo | 7 | Isolated vowel jamo (ᅵ, ㅣ) rendered as vertical strokes
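A per-range tally like this one can be produced by bucketing each source codepoint into its Unicode block. The sketch below assumes the standard block boundaries for the ranges named here and covers only those blocks; the names `blockOf` and `tallyByBlock` are hypothetical. The table's row labels may group neighbouring blocks (ㅣ U+3163, for instance, sits in Hangul Compatibility Jamo), so counts from this sketch need not match it row for row.

```typescript
interface BlockRange {
  name: string;
  start: number;
  end: number;
}

// Standard Unicode block boundaries for the ranges discussed in this post.
const BLOCKS: BlockRange[] = [
  { name: "Hangul Jamo", start: 0x1100, end: 0x11ff },
  { name: "CJK Symbols and Punctuation", start: 0x3000, end: 0x303f },
  { name: "Hangul Compatibility Jamo", start: 0x3130, end: 0x318f },
  { name: "CJK Unified Ideographs", start: 0x4e00, end: 0x9fff },
  { name: "Cuneiform", start: 0x12000, end: 0x123ff },
  { name: "Egyptian Hieroglyphs", start: 0x13000, end: 0x1342f },
];

function blockOf(codePoint: number): string {
  const hit = BLOCKS.find((b) => codePoint >= b.start && codePoint <= b.end);
  return hit ? hit.name : "Other";
}

// Counts how many source codepoints fall in each block.
function tallyByBlock(sourceCodePoints: number[]): Map<string, number> {
  const tally = new Map<string, number>();
  for (const cp of sourceCodePoints) {
    const name = blockOf(cp);
    tally.set(name, (tally.get(name) ?? 0) + 1);
  }
  return tally;
}
```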

The 42,720 characters in CJK Extension B: 1 pair. The 11,172 Hangul Syllables: zero. The 6,592 in CJK Extension A: zero. The dense ideographs are safe.

The vertical strokes

18 of the 28 characters are vertical strokes targeting “l”, “i”, or “j”. This is the same pattern that dominates M2’s discoveries: a single vertical bar is the minimal glyph form shared across writing systems.

Source | Name | Target | SSIM | Fonts
〡 U+3021 | Hangzhou Numeral One | l | 0.928 | 6 CJK fonts
丨 U+4E28 | CJK Vertical Stroke | l | 0.879 | 10 CJK fonts
ᅵ U+1175 | Hangul Jungseong I | l | 0.847 | 2 fonts
ㅣ U+3163 | Hangul Letter I | l | 0.847 | 2 fonts
ￜ U+FFDC | Halfwidth Hangul I | l | 0.836 | 1 font
𓏺 U+133FA | Egyptian Hieroglyph | j | 0.831 | 1 font
ᆝ U+119D | Hangul Jungseong I-Araea | l | 0.825 | 1 font
𒁹 U+12079 | Cuneiform Numeral 1 | l | 0.821 | 1 font

Hangzhou Numeral One (〡) is the top discovery at 0.928 SSIM against Latin “l”. It is present in 6 CJK system fonts (Heiti SC, Hiragino Sans, Apple SD Gothic Neo, and others), making it one of the most broadly available M2b finds, alongside 丨 (U+4E28), which appears in 10.

The circles

Three characters render as circles that resemble Latin “o”:

Source | Name | Target | SSIM | Fonts
𓃉 U+130C9 | Egyptian Hieroglyph | o | 0.790 | 1 font
ㅇ U+3147 | Hangul Letter Ieung | o | 0.738 | 2 fonts
ᄋ U+110B | Hangul Choseong Ieung | o | 0.737 | 2 fonts

The Korean letter ieung (ㅇ) is a circle used as a null consonant placeholder. An Egyptian hieroglyph independently renders as a circle. Both score above 0.7 against Latin “o”.

Figure: six confusable pairs side by side, as actual 48x48 greyscale renders from confusable-vision. Left character in each pair is the source from a CJK/Hangul/logographic range; right is the Latin target. Top row: three vertical strokes vs “l”. Bottom row: three circles vs “o”. Each pair is visually near-identical.

Practical risk

Not all 28 characters pose the same threat. The key factor is font availability: an attacker needs the character to render on the victim’s machine.

Higher risk (common CJK fonts):

  • 丨 U+4E28, 〡 U+3021: present in 6-10 CJK system fonts. Any machine with CJK language support (standard on macOS, common on Windows) will render these. They target “l” and “i”, which are already high-value spoofing targets.
  • ㅣ U+3163, ㅇ U+3147: present in Hangul fonts. Available on any system with Korean language support.

Lower risk (specialised fonts):

  • Egyptian hieroglyphs (19 pairs): require Noto Sans Egyptian Hieroglyphs, not a standard system font.
  • Cuneiform numerals (13 pairs): require Noto Sans Cuneiform.

52% of M2b discoveries appear in only one font. The CJK stroke characters (丨, 丄, 丅, 〡) are the exception: available in 6-10 fonts, making them the most broadly exploitable finds.
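For the higher-risk characters, a defensive check might look something like the sketch below. It is not namespace-guard's API: the map holds only a handful of source/target pairs copied from the tables in this post, and `findLookalikes` and `latinSkeleton` are hypothetical names. A real deployment would load the full discovery data instead.

```typescript
// Source codepoint -> Latin target, taken from the tables in this post.
const M2B_LOOKALIKES = new Map<number, string>([
  [0x3021, "l"], // Hangzhou Numeral One
  [0x4e28, "l"], // CJK vertical stroke
  [0x1175, "l"], // Hangul Jungseong I
  [0x3163, "l"], // Hangul Letter I
  [0x3147, "o"], // Hangul Letter Ieung
  [0x110b, "o"], // Hangul Choseong Ieung
]);

interface Finding {
  index: number; // position counted in code points, not UTF-16 units
  codePoint: string; // e.g. "U+4E28"
  looksLike: string; // Latin target
}

// Lists every character in the identifier that appears in the lookalike map.
function findLookalikes(identifier: string): Finding[] {
  const findings: Finding[] = [];
  let index = 0;
  for (const ch of identifier) {
    // for...of iterates by code point, so astral characters stay whole
    const cp = ch.codePointAt(0)!;
    const target = M2B_LOOKALIKES.get(cp);
    if (target !== undefined) {
      const hex = cp.toString(16).toUpperCase().padStart(4, "0");
      findings.push({ index, codePoint: "U+" + hex, looksLike: target });
    }
    index++;
  }
  return findings;
}

// Replaces each known lookalike with its Latin target so that visually
// identical identifiers compare equal.
function latinSkeleton(identifier: string): string {
  let out = "";
  for (const ch of identifier) {
    out += M2B_LOOKALIKES.get(ch.codePointAt(0)!) ?? ch;
  }
  return out;
}
```

With this map, `latinSkeleton("paypa\u4E28")` yields "paypal", so a registration attempt using 丨 in place of the final “l” would collide with the existing Latin name.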

Combined picture

Metric | M2 (identifier-safe) | M2b (CJK/Hangul/logographic)
Candidates | 23,317 | 122,862
With font coverage | 12,555 (53.8%) | 49,859 (40.6%)
SSIM comparisons | 2,904,376 | 8,036,479
High-risk pairs (>= 0.7) | 793 (0.2%) | 69 (0.004%)
Distinct source chars | ~500 | 28
Computation time | 15.5 min | 46 min

M2b adds 28 characters to the discovery set (3.5% of M2’s count). The hit rate is 50x lower than M2 (0.004% vs 0.2%), confirming that logographic scripts are overwhelmingly incompatible with Latin. The findings are concentrated in edge cases: stroke components, counting marks, and vowel carriers that reduce to the simplest possible geometric forms.

Together, M2 and M2b have now scanned every identifier-safe Unicode character with font coverage against Latin a-z/0-9. The union of their discoveries (793 + 28 = 821 novel confusable characters) represents the complete set of visually similar characters findable by this pipeline on macOS.

How to reproduce

git clone https://github.com/paultendo/confusable-vision
cd confusable-vision
npm install

npx tsx scripts/build-candidates-m2b.ts        # 122,862 candidates (~10 min)
npx tsx scripts/build-index-m2b.ts             # 236,840 renders (~3 hours)
npx tsx scripts/score-candidates-m2b.ts        # 8M SSIM comparisons (~46 min)
npx tsx scripts/extract-m2b.ts                 # Verification report + discoveries

All scripts support crash recovery via progress.jsonl and auto-resume on restart. Use --fresh to force a clean start.
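The resume behaviour can be pictured with a sketch like the one below. It shows the general pattern only (append one JSON line per finished item, skip finished items on restart) and is an assumption about how such a progress file might work, not the repository's code. The record shape, the function names, and the `fresh` parameter are hypothetical.

```typescript
import { appendFileSync, existsSync, readFileSync, rmSync } from "node:fs";

interface ProgressRecord {
  id: string;
  result: number;
}

// Reads previously completed items from a JSONL progress file.
function loadDone(progressPath: string): Map<string, number> {
  const done = new Map<string, number>();
  if (!existsSync(progressPath)) return done;
  for (const line of readFileSync(progressPath, "utf8").split("\n")) {
    if (line.trim() === "") continue;
    try {
      const rec = JSON.parse(line) as ProgressRecord;
      done.set(rec.id, rec.result);
    } catch {
      // A crash can leave an unparseable line; skip it so that item is redone.
    }
  }
  return done;
}

// Processes ids in order, skipping any already recorded in the progress file.
function runResumable(
  ids: string[],
  work: (id: string) => number,
  progressPath: string,
  fresh = false,
): Map<string, number> {
  if (fresh) rmSync(progressPath, { force: true }); // clean start
  const done = loadDone(progressPath);
  for (const id of ids) {
    if (done.has(id)) continue;
    const result = work(id);
    // Append immediately so a crash loses at most the item in flight.
    appendFileSync(progressPath, JSON.stringify({ id, result }) + "\n");
    done.set(id, result);
  }
  return done;
}
```

Appending one line per item keeps each write small and independent, which is why JSONL suits multi-hour jobs like the render step better than rewriting a single JSON document.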

The full results are at m2b-discoveries.json and m2b-verification-report.json in the repo (CC-BY-4.0). The technical report is at REPORT.md.

Series context

This is the seventh post in a series on Unicode identifier security:

  1. confusables.txt and NFKC disagree on 31 characters
  2. Unicode ships one confusable map. You need two.
  3. A threat model for Unicode identifier spoofing
  4. Making Unicode risk measurable
  5. I rendered 1,418 Unicode confusable pairs across 230 fonts
  6. 793 novel confusables discovered outside TR39
  7. This post: 28 CJK/Hangul characters that look like Latin

confusable-vision is MIT-licensed. The discovery data is CC-BY-4.0. namespace-guard (MIT, zero dependencies) integrates these discoveries for anti-spoofing in multi-tenant apps.