compromise NLP misclassifies tech compounds and code fragments as person names
48d572ba-4dbb-4862-a3fe-98da96ab6b8d
Using the compromise NLP library's .people() extractor for PII detection on tech-heavy prose produces a ~80% false-positive rate. The library returns multi-word strings as "person names" that are actually:
- Tech compound names: "Claude Code", "Claude Desktop", "Claude Code Ollama", "OpenAI Platform"
- Punctuation-contaminated extractions: "(Claude Code,", "Claude Code-", "Claude Haiku)"
- CamelCase code identifiers misread as proper names: "Skip PageRank", "PageRank"
- Verb-plus-noun fragments: "mark Pattern", "grant row"
- Code fragments with operators: "len +", "len ="
Single-token allowlists like tech_terms.has(name.toLowerCase()) don't help because compromise returns the multi-token string verbatim, not split into tokens. So "Claude Code" is lowercased to "claude code" and looked up as a single key. The lookup misses (only "claude" is in the allowlist), and the phrase gets flagged as a person.
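The mismatch can be reproduced without compromise itself. The following is a minimal sketch: `TECH_TERMS`, `naiveIsTechTerm`, and the hard-coded `extracted` array are hypothetical stand-ins (the array plays the role of the phrases returned by `.people().out('array')`), not the production code.

```typescript
// Hypothetical allowlist of lowercase single-word tech terms.
const TECH_TERMS = new Set<string>(["claude", "openai"]);

// Stand-in for the extractor output: whole phrases, not tokens.
const extracted: string[] = ["Claude Code", "OpenAI Platform", "Claude"];

// Naive check: the whole phrase is looked up as one key.
function naiveIsTechTerm(name: string): boolean {
  return TECH_TERMS.has(name.toLowerCase());
}

// Only the single-word input "Claude" matches the allowlist;
// both compounds miss and would be reported as person names.
const flagged = extracted.filter((name) => !naiveIsTechTerm(name));
console.log(flagged);
```

Here "claude code" and "openai platform" are never keys in the set, so adding more single-word terms to the allowlist cannot fix the problem; the lookup has to work per token.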
In a 621-row prod dataset of engineering knowledge reports, this produced 215 medium-severity findings, of which ~120 were these false positives.
Single-token allowlist insufficient: compromise.people().out('array') returns the WHOLE matched phrase ("Claude Code") rather than separate tokens. An allowlist of single tech terms ("claude", "openai", etc.) only matches exact single-word inputs.
No proper-name shape check: compromise is permissive about what counts as a "person". It labels CamelCase identifiers, lowercase verb phrases, and punctuation-bracketed fragments as people. A proper-name shape check (every token starts with a capital, followed by lowercase letters/apostrophe/hyphen) catches these.
No code-fragment rejection: NER ran on prose AFTER code-fence stripping, but inline code references without backticks ("len +", "len =") survived as prose and got matched.
// Strip leading non-letters and trailing characters other than letters
// or apostrophes, so "(Claude Code," becomes "Claude Code".
function stripWrappingPunct(name: string): string {
  return name.trim().replace(/^[^A-Za-z]+/, '').replace(/[^A-Za-z']+$/, '')
}

// (1) Multi-token tech-term check: any token in the allowlist marks
// the whole phrase as a tech compound.
// TECH_TERM_ALLOWLIST is a set of lowercase tech terms defined elsewhere.
function isTechTerm(name: string): boolean {
  const cleaned = stripWrappingPunct(name).toLowerCase()
  if (TECH_TERM_ALLOWLIST.has(cleaned)) return true
  const tokens = cleaned.split(/[\s\-]+/).filter(Boolean)
  return tokens.length > 1 && tokens.some((t) => TECH_TERM_ALLOWLIST.has(t))
}

// (2) Code-fragment rejection: programming punctuation or digits.
function looksLikeCodeOrFragment(name: string): boolean {
  const t = name.trim()
  if (/[=+(){}[\]<>;:|*&^%$#@!?\\\/`~"]/.test(t)) return true
  if (/\d/.test(t)) return true
  return false
}

// (3) Proper-name shape: Latin name tokens only.
function looksLikeProperName(name: string): boolean {
  const cleaned = stripWrappingPunct(name)
  if (cleaned.length < 2) return false
  // One capital, then lowercase letters/apostrophes, with at most one
  // hyphenated part ("Mary-Anne"). CamelCase ("PageRank") does not match.
  const TOKEN = /^[A-Z](?:[a-z']*)(?:-[A-Z]?[a-z']+)?$/
  return cleaned.split(/\s+/).every((t) => TOKEN.test(t))
}

// In scanNer():
for (const name of dedupe(people)) {
  if (isTechTerm(name)) continue
  if (looksLikeCodeOrFragment(name)) continue
  if (!looksLikeProperName(name)) continue
  // ... emit finding ...
}
Real names ("John Smith", "Maria Garcia", "Mary-Anne Lopez") still flag. Tech compounds ("OpenAI Platform", "Google Cloud") and code fragments ("len +", "mark Pattern") no longer flag.
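The behavior described above can be checked with a small self-contained harness. This is a sketch under stated assumptions: the allowlist contents ("claude", "openai", "google") and the `shouldFlag` wrapper are hypothetical, and the three filters are restated only so the block runs on its own.

```typescript
// Hypothetical allowlist contents; the production set is not shown in this report.
const TECH_TERM_ALLOWLIST = new Set<string>(["claude", "openai", "google"]);

function stripWrappingPunct(name: string): string {
  return name.trim().replace(/^[^A-Za-z]+/, "").replace(/[^A-Za-z']+$/, "");
}

function isTechTerm(name: string): boolean {
  const cleaned = stripWrappingPunct(name).toLowerCase();
  if (TECH_TERM_ALLOWLIST.has(cleaned)) return true;
  const tokens = cleaned.split(/[\s\-]+/).filter(Boolean);
  return tokens.length > 1 && tokens.some((t) => TECH_TERM_ALLOWLIST.has(t));
}

function looksLikeCodeOrFragment(name: string): boolean {
  const t = name.trim();
  return /[=+(){}[\]<>;:|*&^%$#@!?\\\/`~"]/.test(t) || /\d/.test(t);
}

function looksLikeProperName(name: string): boolean {
  const cleaned = stripWrappingPunct(name);
  if (cleaned.length < 2) return false;
  const TOKEN = /^[A-Z](?:[a-z']*)(?:-[A-Z]?[a-z']+)?$/;
  return cleaned.split(/\s+/).every((t) => TOKEN.test(t));
}

// Hypothetical wrapper: true when a candidate survives all three filters.
function shouldFlag(name: string): boolean {
  return !isTechTerm(name) && !looksLikeCodeOrFragment(name) && looksLikeProperName(name);
}

// Candidates taken from the examples in this report.
const samples: string[] = [
  "John Smith", "Maria Garcia", "Mary-Anne Lopez",
  "OpenAI Platform", "Google Cloud", "(Claude Code,", "Claude Haiku)",
  "Skip PageRank", "mark Pattern", "grant row", "len +", "len =",
];

const stillFlagged = samples.filter(shouldFlag);
console.log(stillFlagged);
```

With this assumed allowlist, only the three real names survive: the tech compounds are dropped by the per-token lookup, "len +" and "len =" by the operator check, and "Skip PageRank", "mark Pattern", and "grant row" by the shape check.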