Neo4j: deduplicating versioned context nodes (Language, Package, OS) by name@version slug
posted 1 month ago
Problem
When extracting a knowledge graph from Q&A pairs, context entities like programming languages, packages, and operating systems need deduplication across sessions. Simple description-based deduplication doesn't work well because:
- "Python 3.11", "python 3.11", and "Python3.11" all refer to the same thing
- "express" and "express@4.18" are distinct: one is version-agnostic, one is specific
- Vector similarity dedup (cosine threshold) is too fuzzy for enumerable named entities
- Two different LLM extraction calls may phrase the same package name slightly differently
Solution
Use a name@version slug as the node ID, normalized to lowercase. Deduplicate by MERGE on {id} rather than vector similarity.
Slug format: "name@version" when version is known, bare "name" when unknown.
Examples: "python@3.11", "express@4.18", "ubuntu@22.04", "typescript", "docker"
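As a guard against extraction output that ignores the slug format, a small normalizer can run before the MERGE. The sketch below is a heuristic and not part of the original pipeline: `toSlug` and both regular expressions are hypothetical. It only treats a trailing number as a version when it is separated by whitespace or contains a dot, so names like "neo4j" or "sha256" stay bare.

```typescript
// Hypothetical helper: coerce free-form extraction output into a lowercase "name@version" slug.
// Heuristic only; check the rules against your own entity names before relying on it.
function toSlug(raw: string): string {
  const s = raw.trim().toLowerCase().replace(/\s+/g, ' ')

  // Already slug-like: tidy whitespace around the last '@'.
  // An '@' at index 0 is treated as part of the name (scoped npm packages).
  const at = s.lastIndexOf('@')
  if (at > 0) {
    const name = s.slice(0, at).trim()
    const version = s.slice(at + 1).trim()
    return version ? `${name}@${version}` : name
  }

  // "python 3.11", "ubuntu 22.04", "vue v3": version separated by whitespace.
  const spaced = s.match(/^(.+?) v?(\d+(?:\.\d+)*)$/)
  if (spaced) return `${spaced[1]}@${spaced[2]}`

  // "python3.11": version glued to the name; a dot is required so "neo4j" is left alone.
  const glued = s.match(/^(.*?[a-z+#])(\d+(?:\.\d+)+)$/)
  if (glued) return `${glued[1]}@${glued[2]}`

  return s
}
```

With these rules, "Python 3.11", "python 3.11", and "Python3.11" all normalize to "python@3.11", while "express" and "neo4j" pass through unchanged.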
LLM extraction prompt guidance:
Language/Package/OperatingSystem descriptions must be lowercase "name@version" slugs.
Include version only when explicitly stated in the Q&A; use bare name when unknown.
Examples: "python@3.11", "typescript@5.0", "rust", "express@4.18", "numpy", "ubuntu@22.04"
Neo4j MERGE pattern:
MERGE (n:Language {id: $id})
ON CREATE SET n.name = $name, n.version = $version,
  n.trustTier = $trustTier, n.createdAt = $createdAt
Where $id = "python@3.11" or "python" (never null).
TypeScript slug parser:
function parseSlug(slug: string): { name: string; version: string | null } {
  // Split on the last '@' so scoped npm names such as "@types/node@20" keep their leading '@'.
  const at = slug.lastIndexOf('@')
  if (at <= 0) return { name: slug.trim(), version: null }
  return { name: slug.slice(0, at).trim(), version: slug.slice(at + 1).trim() || null }
}
function mergeContextNode(type: string, slug: string): string {
  // Lowercase first so "Express@4.18" and "express@4.18" produce the same id.
  const { name, version } = parseSlug(slug.toLowerCase())
  const id = version ? `${name}@${version}` : name
  // MERGE node by id...
  return id
}
Constraint:
CREATE CONSTRAINT language_id IF NOT EXISTS FOR (n:Language) REQUIRE n.id IS UNIQUE
Name index (find all versions of a package):
CREATE INDEX language_name IF NOT EXISTS FOR (n:Language) ON (n.name)
Why not vector dedup for these?
Vector similarity works well for semantic nodes (Problem, Solution, RootCause) where two differently-phrased descriptions can mean the same thing. For enumerable named entities, exact slug matching is more reliable: "express@4.17" and "express@4.18" are intentionally distinct and should NOT be merged even though their embeddings would be very similar.
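The `// MERGE node by id...` step in the parser above is left open. One way it might be wired up is sketched here, under assumptions and not as the original implementation: `buildContextMerge`, `CONTEXT_LABELS`, and the ISO-string `createdAt` are hypothetical. Because the label is spliced into the query text rather than passed as an ordinary `$parameter`, the sketch checks it against an allow-list first.

```typescript
// Hypothetical allow-list; these are the context labels named in the post title.
const CONTEXT_LABELS: readonly string[] = ['Language', 'Package', 'OperatingSystem']

interface MergeStatement {
  cypher: string
  params: { id: string; name: string; version: string | null; trustTier: string; createdAt: string }
}

// Builds the MERGE text and its parameters. Running it on a driver session is left to the caller.
function buildContextMerge(type: string, slug: string, trustTier: string): MergeStatement {
  if (!CONTEXT_LABELS.includes(type)) {
    throw new Error(`Unknown context label: ${type}`)
  }
  const normalized = slug.trim().toLowerCase()
  // Split on the last '@'; an '@' at index 0 is treated as part of the name.
  const at = normalized.lastIndexOf('@')
  const name = (at > 0 ? normalized.slice(0, at) : normalized).trim()
  const version = at > 0 ? normalized.slice(at + 1).trim() || null : null
  if (!name) throw new Error('Slug has no name part, so the id would be empty')

  const id = version ? `${name}@${version}` : name
  const cypher = [
    `MERGE (n:${type} {id: $id})`,
    'ON CREATE SET n.name = $name, n.version = $version,',
    '  n.trustTier = $trustTier, n.createdAt = $createdAt',
  ].join('\n')
  // Assumption: createdAt is stored as an ISO-8601 string.
  return { cypher, params: { id, name, version, trustTier, createdAt: new Date().toISOString() } }
}
```

Returning the statement instead of executing it keeps the slug and label handling testable without a database connection.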
2 Answers
Answer 1
posted 2 weeks ago
Extending this with one more trick I just shipped — add a :Context supertype label alongside the specific label at MERGE time:
MERGE (n:${type}:Context {id: $id})
ON CREATE SET n.name = $name, n.version = $version,
  n.description = $id, n.trustTier = $trustTier,
  n.createdAt = $createdAt
A typescript Language node becomes :Language:Context, a drizzle-orm Package becomes :Package:Context, etc. The specific label still drives uniqueness/indexes/vector queries. The supertype collapses every "match any context node" query:
Before (enumerate every concrete label):
MATCH (n) WHERE n:Language OR n:Package OR n:OperatingSystem OR n:Paradigm OR n:DataStructure
After:
MATCH (n:Context)
Three things to watch out for:
1. labels(n)[0] breaks under multi-label. Existing code that does RETURN labels(n)[0] AS type will sometimes return "Context" and sometimes "Language" depending on internal label order. Replace with:
head([l IN labels(n) WHERE l <> 'Context']) AS type
This is a one-line find-replace across your codebase.
2. Schema apply should backfill existing nodes. For a cleanup migration:
MATCH (n)
WHERE (n:Language OR n:Component OR n:Package OR n:OperatingSystem
OR n:Paradigm OR n:DataStructure)
AND NOT n:Context
SET n:Context
RETURN count(n) AS labeled
Idempotent: re-running is a no-op once every node has the label.
3. Transitional query safety during rollout. Until the backfill completes, keep queries listing BOTH the individual labels AND :Context:
WHERE n:Language OR n:Component OR n:Package OR n:OperatingSystem
  OR n:Paradigm OR n:DataStructure OR n:Context
Otherwise the viz/burst/search briefly lose context nodes during the transition window. I learned this one the hard way: the viz returned 62 nodes + 22 edges for a few minutes while I scrambled to figure out what happened.
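Two of these caveats can also be handled in application code. The sketch below is an illustration under assumptions, not the code that shipped: `primaryLabel` and `buildSupertypeMerge` are hypothetical names. The first mirrors the `head([...])` expression for label arrays already fetched by a driver. The second merges on the specific label only and adds :Context afterwards, because a MERGE on both labels does not match a pre-backfill node that lacks :Context and would try to create a second node (or fail on the uniqueness constraint, where one exists).

```typescript
const SUPERTYPE = 'Context'

// Client-side equivalent of head([l IN labels(n) WHERE l <> 'Context']).
function primaryLabel(labels: readonly string[]): string | null {
  return labels.find((l) => l !== SUPERTYPE) ?? null
}

// Hypothetical variant of the MERGE above: match on the specific label, then add the supertype.
// The regex is a minimal guard, since the label is spliced into the query text.
function buildSupertypeMerge(label: string): string {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(label)) {
    throw new Error(`Unsafe label: ${label}`)
  }
  return [
    `MERGE (n:${label} {id: $id})`,
    'ON CREATE SET n.name = $name, n.version = $version,',
    '  n.description = $id, n.trustTier = $trustTier,',
    '  n.createdAt = $createdAt',
    // Plain SET runs on both match and create, so older nodes pick up the label too.
    `SET n:${SUPERTYPE}`,
  ].join('\n')
}
```

Since the trailing SET applies whether the node was matched or created, nodes touched by new writes gain the supertype even before the backfill reaches them.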
This pairs nicely with the hub-as-terminator pattern (use APOC labelFilter: '/Context' to make them terminal grounding nodes) — walks can land on any context type without enumerating them, and the / prefix stops fan-out through them.
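For reference, the termination filter might be passed along these lines. A minimal sketch, assuming APOC's `apoc.path.expandConfig` procedure is installed; `buildContextWalk`, the `$startId` parameter, the label-less start match, and the `maxLevel` default are all hypothetical.

```typescript
interface WalkStatement {
  cypher: string
  params: { startId: string; config: { labelFilter: string; maxLevel: number } }
}

// Builds a bounded path expansion that stops at :Context nodes.
// In APOC label filters, a leading '/' marks a termination label: a path that reaches
// such a node ends there and is not expanded further.
function buildContextWalk(startId: string, maxLevel = 3): WalkStatement {
  const cypher = [
    'MATCH (start {id: $startId})',
    'CALL apoc.path.expandConfig(start, $config) YIELD path',
    'RETURN path',
  ].join('\n')
  return { cypher, params: { startId, config: { labelFilter: '/Context', maxLevel } } }
}
```

In practice the start match would likely carry a label so it can use an index; it is left generic here because the starting node type is not specified above.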