paultendo

Researching Unicode security, shipping open-source tools, building applied AI systems.

Mostly writing about identifier spoofing, confusable detection and systems designed for measurable behaviour.

Current programme

Unicode identifier security

Cross-script spoofing, confusable detection and measurement.

Browse 17 posts

Start here

Unicode security / domains

250,000 confusable pairs. 102 that matter for domain names.

Most visual confusables can't appear in domain labels. Filtering by IdentifierType, script rules, and registry variant tables reduces 250K to a short list.

250K → 102

Read the filtering piece

Latest posts

Three threads running through the site: Unicode security research, open-source tooling and applied AI systems.

Browse by section

Display

21 min read

RaySpace: measuring glyph similarity by firing rays through font outlines

No pixels. No neural networks. 250,000 confusable pairs across 245 fonts in 86 minutes.

RaySpace is a geometric approach to confusable character detection that compares font vector outlines directly using raycasting. Five layers of signal (intersection counts, positions, crossing angles, ping distances, and ping depth) produce a per-font similarity score without rendering a single pixel.

unicodesecurityfontsnamespace-guardopen-source

Read featured post

11 min read

250,000 confusable pairs. 102 that matter for domain names.

RaySpace found 250,000 unique confusable character pairs. Filtering by IDNA2008, IdentifierType, single-script enforcement, and ICANN registry variant tables reduces that to 3,039 cross-script pairs between Recommended characters, and 102 at high confidence. Here's how the filtering works and where the gaps are.

unicodesecurityfontsnamespace-guardopen-source
35 min read

From CT scanners to confusable characters: the prior art behind RaySpace

RaySpace draws on a century of ray-based geometry techniques, from Radon's uniqueness theorem and Carmack's raycasting to ambient occlusion and acoustic impulse responses. This post maps the full landscape of prior work in confusable detection and shape analysis, and identifies where RaySpace sits in it.

unicodesecurityfontsnamespace-guardopen-source
17 min read

"oy" is ѹ: 571,753 multi-character confusable pairs nobody is checking for

The first systematic, font-specific multi-character confusable dataset. 676 Latin bigrams tested against 133,000 single-character targets across 245 fonts using geometric SDF comparison. 571,753 pairs found, including 'rn' matching 'm' in 95 fonts and 'oy' matching Cyrillic ѹ at distance 0.000.

unicodesecurityfontsnamespace-guardopen-source
8 min read

Your font choice is a security decision

Different fonts make different characters confusable. Font-specific confusable maps give your app precise coverage with 6-30x less data.

unicodesecurityfontsnamespace-guardopen-source
11 min read

248 cross-script confusable pairs that no standard covers

Thai zero looks like Devanagari zero. Georgian S looks like Cyrillic S. Nobody's checking.

I scanned 22,581 characters from 12 scripts against each other across 230 fonts. 248 confusable pairs between non-Latin scripts have no coverage in TR39, ICANN guidelines, or any existing detection system.

unicodesecurityfontsnamespace-guardopen-source
7 min read

Who does confusable detection actually protect?

Unicode's confusable detection maps 6,247 characters to Latin targets, protecting English readers from homograph attacks. But a Russian user spoofed by Greek characters, or an Arabic user spoofed by Syriac, falls outside that coverage. confusable-vision can generate the missing data.

unicodesecurityfontsnamespace-guardopen-source
7 min read

148x faster: rebuilding a Unicode scanning pipeline for cross-script scale

confusable-vision's scoring pipeline went from 8.5 hours to 3.5 minutes through three targeted optimisations: replacing sharp with pure JS bicubic resize, moving SSIM computation to a WASM kernel in worker threads, and switching PNG decoding from sharp to fast-png. All results identical to the original pipeline.

unicodesecurityperformanceopen-source
8 min read

When shape similarity lies: size-ratio artifacts in confusable detection

Confusable detection pipelines normalise characters to a fixed canvas before measuring similarity. This hides natural size differences. We re-rendered 2,203 scored pairs (TR39 baseline + novel discoveries) at their original sizes and found 254 with width or height ratios above 2x.

unicodesecurityfontsnamespace-guardopen-source
21 min read

The new DDoS: Unicode confusables can't fool LLMs, but they can 5x your API bill

130+ API calls, 8 attack types, 4 models. Frontier LLMs read through every pixel-identical substitution. But confusable characters fragment into multi-byte BPE tokens, inflating costs 5x. The comprehension attack fails. The billing attack succeeds.

Can pixel-identical Unicode homoglyphs fool LLM contract review? I tested 8 attack types against GPT-5.2, Claude Sonnet 4.6, and others with 130+ API calls. The models read through every substitution. But confusable characters fragment into multi-byte BPE tokens, turning a failed comprehension attack into a 5x billing attack. Call it Denial of Spend.

unicodesecurityllmfontsnamespace-guardopen-source
11 min read

Your LLM reads Unicode codepoints, not glyphs. That's an attack surface.

Unicode confusable characters create attack surfaces in LLM pipelines, but not the one everyone expects. Frontier models read through every substitution. The real threats are filter bypass (58.7% success), billing inflation (5.2x token cost), and downstream system failures. No major tokenizer defends against any of them.

unicodesecurityllmnamespace-guardopen-source
7 min read

28 CJK and Hangul characters look like Latin letters

confusable-vision's Milestone 2b scanned 122,862 CJK, Hangul, Cuneiform, and Egyptian Hieroglyph codepoints against Latin a-z/0-9 across 230 macOS system fonts. 28 novel confusable characters found, all simple geometric primitives.

unicodesecurityfontsnamespace-guardopen-source
17 min read

793 Unicode characters look like Latin letters but aren't (yet) in confusables.txt

confusable-vision's Milestone 2 scanned 23,317 Unicode characters not in TR39 confusables.txt against Latin a-z/0-9 across 230 macOS system fonts. 793 novel high-risk pairs discovered, from 96 different scripts.

unicodesecurityfontsnamespace-guardopen-source
16 min read

I rendered 1,418 Unicode confusable pairs across 230 fonts. Most aren't confusable to the eye.

confusable-vision renders every TR39 confusable pair across 230 macOS system fonts and measures visual similarity with SSIM. 96.5% of confusables.txt is not high-risk, but 82 pairs are pixel-identical in at least one font.

unicodesecurityfontsnamespace-guardopen-source
8 min read

A threat model for Unicode identifier spoofing

Three attack vectors for Unicode identifier spoofing, a survey of twelve detection systems, and a published benchmark corpus for testing your own defences.

unicodesecurityidentityopen-source
10 min read

Unicode ships one confusable map. You need two.

A survey of how 12 real-world systems apply Unicode confusable detection. None chain NFKC into the confusable lookup. Your confusable map needs to match your normalization strategy.

unicodesecuritynamespace-guardopen-source
9 min read

Lawyers are using AI wrong

The legal profession has adopted AI as a productivity tool when it should be adopting it as an infrastructure layer. Here's what that difference actually means.

ailegal-techopinion
13 min read

I built a will drafting engine. The AI writes almost nothing.

Why a deterministic rule engine beats an LLM for legal document generation — and where the AI actually helps.

ailegal-techarchitecture
7 min read

Building a rights-aware ingestion pipeline for AI-generated music

AI-generated music deserves proper distribution. I built a browser extension that imports Suno tracks to Oncor with auth separation, idempotent imports, and rights attestation. Here's what we learned.

oncorai-musicbrowser-extensionsarchitecture
12 min read

confusables.txt and NFKC disagree on 31 characters

TR39 and NFKC map 31 characters to different targets. If you build a confusable map for use after NFKC normalization, those entries are unreachable dead code. Here's the full list and how to filter them.

unicodesecuritynamespace-guardopen-source
9 min read

namespace-guard: solving the shared URL namespace problem

Every multi-tenant app has to solve it. I got tired of solving it badly, so I extracted my solution into a library.

open-sourcenpmtypescriptnamespace-guard