Report

compromise NLP misclassifies tech compounds and code fragments as person names

48d572ba-4dbb-4862-a3fe-98da96ab6b8d

Using the compromise NLP library's .people() extractor for PII detection on tech-heavy prose produces a ~80% false-positive rate. Many of the strings the library returns as "person names" are actually:

  • Tech compound names: "Claude Code", "Claude Desktop", "Claude Code Ollama", "OpenAI Platform"
  • Punctuation-contaminated extractions: "(Claude Code,", "Claude Code-", "Claude Haiku)"
  • CamelCase code identifiers misread as proper names: "Skip PageRank", "PageRank"
  • Verb-plus-noun fragments: "mark Pattern", "grant row"
  • Code fragments with operators: "len +", "len ="

Single-token allowlists like tech_terms.has(name.toLowerCase()) don't help because compromise returns the multi-token string verbatim, not split into tokens. So "Claude Code" is checked against the allowlist as a single key, doesn't match (only "claude" is in the allowlist), and gets flagged as a person.
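A minimal sketch of that failure mode, assuming an allowlist that contains only "claude" and "openai" (the real allowlist contents are not shown in this report, and the helper names matchesWholePhrase and matchesAnyToken are hypothetical):

```typescript
// Assumed allowlist contents, for illustration only.
const techTerms = new Set<string>(['claude', 'openai'])

// Whole-phrase lookup: the phrase is used as a single key.
function matchesWholePhrase(name: string): boolean {
  return techTerms.has(name.toLowerCase())
}

// Token-wise lookup: any allowlisted token marks the whole phrase.
function matchesAnyToken(name: string): boolean {
  return name
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .some((t) => techTerms.has(t))
}

console.log(matchesWholePhrase('Claude Code')) // false: "claude code" is not a key
console.log(matchesAnyToken('Claude Code')) // true: "claude" is a key
console.log(matchesWholePhrase('Claude')) // true: single-word input matches
```

The whole-phrase lookup only succeeds when the extractor happens to return a single allowlisted word, which is why the multi-word compounds slipped through.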

In a 621-row prod dataset of engineering knowledge reports, this produced 215 medium-severity findings, of which ~120 were these false positives. Three compounding issues in the NER scanner's filter logic were responsible:

  1. Single-token allowlist insufficient: compromise .people().out('array') returns the WHOLE matched phrase ("Claude Code") rather than separate tokens. An allowlist of single tech terms ("claude", "openai", etc.) only matches exact single-word inputs.

  2. No proper-name shape check: compromise is permissive about what counts as a "person" — it labels CamelCase identifiers, lowercase verb phrases, and punctuation-bracketed fragments as people. A proper-name shape check (every token starts with capital, rest lowercase letters/apostrophe/hyphen) catches these.

  3. No code-fragment rejection: NER ran on prose AFTER code-fence stripping, but inline backtick-less code references ("len +", "len =") survived as prose and got matched.

The fix is three layered filters, applied to every NER candidate (people, organizations, places):

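// Strip leading non-letters and trailing characters other than letters
// or apostrophes, e.g. "(Claude Code," -> "Claude Code".
function stripWrappingPunct(name: string): string {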
  return name.trim().replace(/^[^A-Za-z]+/, '').replace(/[^A-Za-z']+$/, '')
}

// (1) Multi-token tech-term check — any token in allowlist marks
// the whole phrase as a tech compound.
function isTechTerm(name: string): boolean {
  const cleaned = stripWrappingPunct(name).toLowerCase()
  if (TECH_TERM_ALLOWLIST.has(cleaned)) return true
  const tokens = cleaned.split(/[\s\-]+/).filter(Boolean)
  return tokens.length > 1 && tokens.some((t) => TECH_TERM_ALLOWLIST.has(t))
}

// (2) Code-fragment rejection — programming punctuation or digits.
function looksLikeCodeOrFragment(name: string): boolean {
  const t = name.trim()
  if (/[=+(){}[\]<>;:|*&^%$#@!?\\\/`~"]/.test(t)) return true
  if (/\d/.test(t)) return true
  return false
}

// (3) Proper-name shape — Latin name tokens only.
// Each token: one capital, then lowercase letters/apostrophes, with at
// most one hyphenated part. Names with internal capitals ("O'Brien",
// "McDonald") do not match this pattern.
function looksLikeProperName(name: string): boolean {
  const cleaned = stripWrappingPunct(name)
  if (cleaned.length < 2) return false
  const TOKEN = /^[A-Z](?:[a-z']*)(?:-[A-Z]?[a-z']+)?$/
  return cleaned.split(/\s+/).every((t) => TOKEN.test(t))
}

// In scanNer():
for (const name of dedupe(people)) {
  if (isTechTerm(name)) continue
  if (looksLikeCodeOrFragment(name)) continue
  if (!looksLikeProperName(name)) continue
  // ... emit finding ...
}

Real names ("John Smith", "Maria Garcia", "Mary-Anne Lopez") are still flagged. Tech compounds ("OpenAI Platform", "Google Cloud"), code fragments ("len +"), and verb-plus-noun fragments ("mark Pattern") are no longer flagged.
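One way to check this behaviour is a small self-contained harness that reproduces the three filters and runs them over the sample strings from this report. This is a sketch under assumptions: the allowlist entries below are guesses ("google" in particular is assumed so that "Google Cloud" is caught), and keepAsPerson is a hypothetical wrapper, not part of the scanner.

```typescript
// Assumed allowlist entries; the production list is not shown in this report.
const TECH_TERM_ALLOWLIST = new Set<string>(['claude', 'openai', 'google'])

function stripWrappingPunct(name: string): string {
  return name.trim().replace(/^[^A-Za-z]+/, '').replace(/[^A-Za-z']+$/, '')
}

function isTechTerm(name: string): boolean {
  const cleaned = stripWrappingPunct(name).toLowerCase()
  if (TECH_TERM_ALLOWLIST.has(cleaned)) return true
  const tokens = cleaned.split(/[\s\-]+/).filter(Boolean)
  return tokens.length > 1 && tokens.some((t) => TECH_TERM_ALLOWLIST.has(t))
}

function looksLikeCodeOrFragment(name: string): boolean {
  const t = name.trim()
  return /[=+(){}[\]<>;:|*&^%$#@!?\\\/`~"]/.test(t) || /\d/.test(t)
}

function looksLikeProperName(name: string): boolean {
  const cleaned = stripWrappingPunct(name)
  if (cleaned.length < 2) return false
  const TOKEN = /^[A-Z](?:[a-z']*)(?:-[A-Z]?[a-z']+)?$/
  return cleaned.split(/\s+/).every((t) => TOKEN.test(t))
}

// Hypothetical wrapper: true when a candidate survives all three filters.
function keepAsPerson(name: string): boolean {
  return !isTechTerm(name) && !looksLikeCodeOrFragment(name) && looksLikeProperName(name)
}

const candidates = [
  'John Smith', 'Maria Garcia', 'Mary-Anne Lopez',
  'Claude Code', '(Claude Code,', 'Claude Haiku)', 'OpenAI Platform', 'Google Cloud',
  'Skip PageRank', 'PageRank', 'mark Pattern', 'grant row', 'len +', 'len =',
]

const kept = candidates.filter(keepAsPerson)
console.log(kept) // [ 'John Smith', 'Maria Garcia', 'Mary-Anne Lopez' ]
```

"Skip PageRank" and "PageRank" are deliberately left off the assumed allowlist here, so their rejection comes from the shape check alone.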