Neo4j: deduplicating versioned context nodes (Language, Package, OS) by name@version slug
094417cc-e8fe-428a-9d11-8a48b01f0df3
Problem
When extracting a knowledge graph from Q&A pairs, context entities like programming languages, packages, and operating systems need deduplication across sessions. Simple description-based deduplication doesn't work well because:
- "Python 3.11", "python 3.11", and "Python3.11" all refer to the same thing
- "express" and "express@4.18" are distinct: one is version-agnostic, the other is specific
- Vector similarity dedup (cosine threshold) is too fuzzy for enumerable named entities
- Two different LLM extraction calls may phrase the same package name slightly differently
Solution
Use a name@version slug as the node ID, normalized to lowercase. Deduplicate by MERGE on {id} rather than vector similarity.
Slug format: "name@version" when version is known, bare "name" when unknown.
Examples: "python@3.11", "express@4.18", "ubuntu@22.04", "typescript", "docker"
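The slug format above can be checked mechanically before anything is written to the graph. The following is a minimal sketch, not part of the original pipeline: SLUG_PATTERN and isValidSlug are hypothetical names, and the accepted character set is an assumption that may need widening for a given ecosystem.

```typescript
// Hypothetical validator for extractor output. The allowed characters are an
// assumption: lowercase letters, digits, and a few common separators.
// A leading "@" (as in scoped npm names) is rejected, because a parser that
// splits on the first "@" would produce an empty name for such input.
const SLUG_PATTERN = /^[a-z0-9][a-z0-9._\/+-]*(@[a-z0-9][a-z0-9._+-]*)?$/

function isValidSlug(value: string): boolean {
  return SLUG_PATTERN.test(value)
}

// Values that fail can be sent back for re-extraction or normalized by hand.
const candidates = ['python@3.11', 'typescript', 'Python 3.11', 'python@']
const rejected = candidates.filter((c) => !isValidSlug(c))
// rejected is ['Python 3.11', 'python@']
```

Rejecting malformed values early keeps near-duplicate ids such as "Python 3.11" from ever reaching the MERGE step.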
LLM extraction prompt guidance:
Language/Package/OperatingSystem descriptions must be lowercase "name@version" slugs.
Include version only when explicitly stated in the Q&A; use bare name when unknown.
Examples: "python@3.11", "typescript@5.0", "rust", "express@4.18", "numpy", "ubuntu@22.04"
Neo4j MERGE pattern:
MERGE (n:Language {id: $id})
ON CREATE SET n.name = $name, n.version = $version,
  n.trustTier = $trustTier, n.createdAt = $createdAt
Here $id is the slug, for example "python@3.11" or "python" (never null).
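To show how the parameters of the MERGE statement might be assembled from a slug, here is a minimal sketch that builds the query text and parameter object without calling any database driver. buildMerge, ContextLabel, and MergeParams are hypothetical names; the Package and OperatingSystem labels, the string type of trustTier, and the ISO-8601 createdAt are assumptions, not details confirmed by this note.

```typescript
// Assumed label set; only Language appears in the MERGE statement above.
type ContextLabel = 'Language' | 'Package' | 'OperatingSystem'

interface MergeParams {
  id: string
  name: string
  version: string | null
  trustTier: string // assumed to be a string
  createdAt: string // assumed ISO-8601 timestamp
}

function buildMerge(
  label: ContextLabel,
  slug: string,
  trustTier: string,
  now: Date = new Date()
): { query: string; params: MergeParams } {
  const normalized = slug.trim().toLowerCase()
  const at = normalized.indexOf('@')
  const name = (at === -1 ? normalized : normalized.slice(0, at)).trim()
  const version = at === -1 ? null : normalized.slice(at + 1).trim() || null
  if (!name) throw new Error(`slug has no name: "${slug}"`)
  const id = version ? `${name}@${version}` : name
  // The label is interpolated into the query text rather than passed as a
  // parameter; the ContextLabel union keeps it to a fixed, known set.
  const query =
    `MERGE (n:${label} {id: $id}) ` +
    'ON CREATE SET n.name = $name, n.version = $version, ' +
    'n.trustTier = $trustTier, n.createdAt = $createdAt'
  return { query, params: { id, name, version, trustTier, createdAt: now.toISOString() } }
}
```

The returned query and params can then be handed to whichever driver call the project uses to run Cypher. Because only ON CREATE SET is used, a second extraction of the same slug leaves the existing node's properties untouched.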
TypeScript slug parser:
function parseSlug(slug: string): { name: string; version: string | null } {
  const at = slug.indexOf('@')
  if (at === -1) return { name: slug.trim(), version: null }
  return { name: slug.slice(0, at).trim(), version: slug.slice(at + 1).trim() || null }
}
function mergeContextNode(type: string, slug: string): string {
  const { name, version } = parseSlug(slug.toLowerCase())
  const id = version ? `${name}@${version}` : name
  // MERGE node by id...
  return id
}
Constraint:
CREATE CONSTRAINT language_id IF NOT EXISTS FOR (n:Language) REQUIRE n.id IS UNIQUE
Name index (find all versions of a package):
CREATE INDEX language_name IF NOT EXISTS FOR (n:Language) ON (n.name)
Why not vector dedup for these?
Vector similarity works well for semantic nodes (Problem, Solution, RootCause) where two differently-phrased descriptions can mean the same thing. For enumerable named entities, exact slug matching is more reliable: "express@4.17" and "express@4.18" are intentionally distinct and should NOT be merged even though their embeddings would be very similar.
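The difference can be seen in a small in-memory simulation. This is only a sketch: mergeAll and ContextNode are hypothetical names, and a Map keyed by id stands in for MERGE on {id} in Neo4j.

```typescript
interface ContextNode {
  id: string
  name: string
  version: string | null
}

// In-memory stand-in for MERGE on {id}: create on first sight, reuse afterwards.
function mergeAll(slugs: string[]): Map<string, ContextNode> {
  const nodes = new Map<string, ContextNode>()
  for (const raw of slugs) {
    const slug = raw.trim().toLowerCase()
    const at = slug.indexOf('@')
    const name = at === -1 ? slug : slug.slice(0, at).trim()
    const version = at === -1 ? null : slug.slice(at + 1).trim() || null
    const id = version ? `${name}@${version}` : name
    if (!nodes.has(id)) nodes.set(id, { id, name, version })
  }
  return nodes
}

const merged = mergeAll(['express@4.17', 'Express@4.18', 'express@4.18', 'express'])
const ids = Array.from(merged.keys())
// ids is ['express@4.17', 'express@4.18', 'express']

// Equivalent of a lookup through the name index: every version of one package.
const expressIds = Array.from(merged.values())
  .filter((n) => n.name === 'express')
  .map((n) => n.id)
```

The two casings of express@4.18 collapse into one node, while 4.17, 4.18, and the version-agnostic "express" remain separate, which is the behavior a cosine threshold cannot guarantee.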