Question

Neo4j: deduplicating versioned context nodes (Language, Package, OS) by name@version slug

Problem

When extracting a knowledge graph from Q&A pairs, context entities like programming languages, packages, and operating systems need deduplication across sessions. Simple description-based deduplication doesn't work well because:

  • "Python 3.11", "python 3.11", "Python3.11" all refer to the same thing
  • "express" and "express@4.18" are distinct — one is version-agnostic, one is specific
  • Vector similarity dedup (cosine threshold) is too fuzzy for enumerable named entities
  • Two different LLM extraction calls may phrase the same package name slightly differently

Solution

Use a name@version slug as the node ID, normalized to lowercase. Deduplicate by MERGE on {id} rather than vector similarity.

Slug format: "name@version" when version is known, bare "name" when unknown.
Examples: "python@3.11", "express@4.18", "ubuntu@22.04", "typescript", "docker"

LLM extraction prompt guidance:

Language/Package/OperatingSystem descriptions must be lowercase "name@version" slugs.
Include version only when explicitly stated in the Q&A; use bare name when unknown.
Examples: "python@3.11", "typescript@5.0", "rust", "express@4.18", "numpy", "ubuntu@22.04"
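
Prompt guidance alone may not guarantee the format, so a defensive normalizer on the extractor output can help. The following is a minimal sketch, not part of the original pipeline: normalizeToSlug is a hypothetical helper, and it assumes a version is a trailing dotted number (optionally prefixed with "v"). Names that merely end in a digit, such as "python3" or "log4j", are deliberately left alone.

```typescript
// Hypothetical helper: coerce loosely formatted extractor output into a slug.
// Assumption: a version is a trailing dotted number such as "3.11" or "v18.2".
function normalizeToSlug(raw: string): string {
  const text = raw.trim().toLowerCase()

  // Already slug-shaped: tidy whitespace around the last "@".
  // A leading "@" (npm scope) is treated as part of the name.
  const at = text.lastIndexOf('@')
  if (at > 0) {
    const name = text.slice(0, at).trim()
    const version = text.slice(at + 1).trim()
    return version ? `${name}@${version}` : name
  }

  // "python 3.11" or "python3.11": split the trailing dotted version off the name.
  const match = text.match(/^(.*?[a-z+#])[\s-]*v?(\d+(?:\.\d+)+)$/)
  if (match) return `${match[1].trim()}@${match[2]}`

  // No recognizable version: keep the bare lowercase name.
  return text
}

console.log(normalizeToSlug('Python 3.11'))
console.log(normalizeToSlug('Python3.11'))
console.log(normalizeToSlug('express'))
```

Under these assumptions the three spellings from the Problem section collapse to "python@3.11", while a bare "express" stays version-agnostic.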

Neo4j MERGE pattern:

MERGE (n:Language {id: $id})
ON CREATE SET n.name = $name, n.version = $version,
             n.trustTier = $trustTier, n.createdAt = $createdAt

Here $id is "python@3.11" or the bare "python". It must never be null, because MERGE rejects a null property value in its pattern.

TypeScript slug parser:

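// Split a slug into name and version. The last "@" is the separator, and a
// leading "@" is treated as part of the name, so scoped npm packages such as
// "@types/node@20" parse correctly.
function parseSlug(slug: string): { name: string; version: string | null } {
  const at = slug.lastIndexOf('@')
  if (at <= 0) return { name: slug.trim(), version: null }
  return { name: slug.slice(0, at).trim(), version: slug.slice(at + 1).trim() || null }
}

// Normalize a slug and return the id used as the MERGE key.
function mergeContextNode(type: string, slug: string): string {
  const { name, version } = parseSlug(slug.toLowerCase())
  const id = version ? `${name}@${version}` : name
  // MERGE node by id...
  return id
}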
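
The "MERGE node by id" step could be filled in roughly as follows. This is a sketch under stated assumptions: buildMergeStatement and CONTEXT_LABELS are hypothetical names, and trustTier and createdAt are assumed to be strings, which the original does not specify. Cypher does not accept a node label as a query parameter, so the label is interpolated only after an allowlist check.

```typescript
// Assumed set of context labels, taken from the entity types named in the question.
const CONTEXT_LABELS: readonly string[] = ['Language', 'Package', 'OperatingSystem']

interface MergeStatement {
  query: string
  params: { id: string; name: string; version: string | null; trustTier: string; createdAt: string }
}

// Hypothetical helper: build the MERGE query and its parameters for one slug.
function buildMergeStatement(
  label: string,
  slug: string,
  trustTier: string, // assumed type
  createdAt: string, // assumed to be an ISO timestamp string
): MergeStatement {
  // Labels cannot be parameterized, so reject anything outside the allowlist.
  if (CONTEXT_LABELS.indexOf(label) === -1) {
    throw new Error(`Unsupported context label: ${label}`)
  }

  // Same parsing rule as the slug parser: split on the last "@", keep a leading "@".
  const clean = slug.trim().toLowerCase()
  const at = clean.lastIndexOf('@')
  const name = at > 0 ? clean.slice(0, at).trim() : clean
  const version = at > 0 ? clean.slice(at + 1).trim() || null : null
  const id = version ? `${name}@${version}` : name
  if (!id) throw new Error('Slug must not be empty')

  const query =
    `MERGE (n:${label} {id: $id}) ` +
    'ON CREATE SET n.name = $name, n.version = $version, ' +
    'n.trustTier = $trustTier, n.createdAt = $createdAt ' +
    'RETURN n.id AS id'

  return { query, params: { id, name, version, trustTier, createdAt } }
}
```

With the official neo4j-driver package, the result would typically be passed to tx.run(query, params) inside a write transaction. Because the properties are set only ON CREATE, a second extraction of the same slug leaves the existing node untouched.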

Constraint:

CREATE CONSTRAINT language_id IF NOT EXISTS FOR (n:Language) REQUIRE n.id IS UNIQUE

Name index (find all versions of a given name, shown here for Language):

CREATE INDEX language_name IF NOT EXISTS FOR (n:Language) ON (n.name)
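
A lookup that benefits from this index might look like the sketch below. buildVersionsQuery is a hypothetical helper. The name is lowercased on the way in because the stored names are lowercase.

```typescript
// Hypothetical helper: query every stored version of one language by name.
function buildVersionsQuery(name: string): { query: string; params: { name: string } } {
  return {
    query:
      'MATCH (n:Language {name: $name}) ' +
      'RETURN n.id AS id, n.version AS version ' +
      'ORDER BY n.version',
    // Stored names are lowercase, so normalize the lookup key the same way.
    params: { name: name.trim().toLowerCase() },
  }
}

const lookup = buildVersionsQuery(' Python ')
console.log(lookup.params.name)
```

ORDER BY n.version sorts versions as strings, so "3.9" sorts after "3.11". A semantic ordering would need to be applied by the caller.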

Why not vector dedup for these?

Vector similarity works well for semantic nodes (Problem, Solution, RootCause) where two differently-phrased descriptions can mean the same thing. For enumerable named entities, exact slug matching is more reliable: "express@4.17" and "express@4.18" are intentionally distinct and should NOT be merged even though their embeddings would be very similar.
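
To make the contrast concrete, here is a small in-memory stand-in for MERGE on id. It is only an illustration and dedupeBySlug is a hypothetical helper: casing and whitespace variants collapse into one entry, while adjacent versions and the bare name stay separate.

```typescript
type ContextNode = { name: string; version: string | null }

// In-memory stand-in for MERGE: one entry per exact id, first writer wins.
function dedupeBySlug(slugs: string[]): Map<string, ContextNode> {
  const nodes = new Map<string, ContextNode>()
  for (const raw of slugs) {
    const slug = raw.trim().toLowerCase()
    const at = slug.lastIndexOf('@')
    const name = at > 0 ? slug.slice(0, at).trim() : slug
    const version = at > 0 ? slug.slice(at + 1).trim() || null : null
    const id = version ? `${name}@${version}` : name
    // Mirrors ON CREATE SET: properties are written only for a new id.
    if (!nodes.has(id)) nodes.set(id, { name, version })
  }
  return nodes
}

const deduped = dedupeBySlug(['Express@4.18', 'express@4.18 ', 'express@4.17', 'express'])
console.log(Array.from(deduped.keys()))
// prints [ 'express@4.18', 'express@4.17', 'express' ]
```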