Report

GEM v1: extraction-F1 gate is wrong for the probabilistic prediction call-site; use a calibration gate

ef4d347b-d938-40f9-88a5-ab6d6f422a24

This came up while designing a GEM v1 floor model (the cheapest fine-tuned model that "does the job") for a probabilistic prediction call-site: an LLM pre-pass that emits entities and relations, each with a probability p, which are later merged into a Bayesian prior. The obvious move is to reuse the existing extraction eval gate: entity/edge macro-F1 plus substrate-recall against the incumbent reference. That gate is correct for the extraction call-sites but is the WRONG gate for a prediction call-site, and applying it could either pass a bad model or fail a good one.

The prediction call outputs a probabilistic object whose p values are reshaped downstream: a noisy-OR merge with a structural prior, then a discount on model-only items (0.6 for entities, 0.8 for relations). Raw p is therefore heavily reweighted before it matters; what counts is whether p is CALIBRATED (a predicted 0.7 is right ~70% of the time post-merge), not whether the model recalls every item. A set-overlap macro-F1/recall gate has no notion of calibration: it scores which items were emitted and ignores p, so a model emitting confidently-wrong p=0.95 on hallucinated labels can still score fine on overlap, while a well-calibrated model that correctly omits a low-probability item gets dinged on recall. The right gate for this family is expected calibration error (ECE) and Brier score on the post-merge probabilities, plus a coverage/topology-recall floor on the high-p tail, evaluated against a human-gold slice. Two adjacent prerequisites surfaced while reading the call-site. First, the predict call hardcoded a model id even though it routes through a provider facade; a pinned model string can override the env-driven swap, so the builder must thread a model override. Second, the silver-capture frame's stage enum did not include the prediction stage, so no silver can be harvested from that call-site until a capture frame is added there.
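To make the reweighting concrete, here is a minimal sketch of how raw p might be reshaped before it matters. The discount values (0.6, 0.8) come from the report; everything else is an assumption for illustration: the function names are hypothetical, the noisy-OR is taken as the standard two-source form, and the discount is assumed to be multiplicative and applied only when no structural prior exists for the item. The actual merge function may differ.

```python
from typing import Optional

# Discount constants quoted in the report. Applying them multiplicatively
# is an assumption of this sketch, not confirmed behavior.
ENTITY_DISCOUNT = 0.6
RELATION_DISCOUNT = 0.8


def noisy_or(p_model: float, p_prior: float) -> float:
    """Standard noisy-OR of two independent sources: 1 - (1 - a)(1 - b)."""
    return 1.0 - (1.0 - p_model) * (1.0 - p_prior)


def merged_probability(p_model: float, p_prior: Optional[float], kind: str) -> float:
    """Hypothetical post-merge p for one item; kind is 'entity' or 'relation'."""
    if p_prior is None:
        # Model-only item: no structural support, so raw p is discounted.
        discount = ENTITY_DISCOUNT if kind == "entity" else RELATION_DISCOUNT
        return p_model * discount
    # Item backed by the structural prior: combine both sources.
    return noisy_or(p_model, p_prior)
```

Under these assumptions, a hallucinated entity emitted at raw p=0.95 with no prior lands at about 0.57 after the discount, which is why calibration has to be measured on the merged value rather than on the raw output.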

Pick the gate by what the call-site's output is USED for, not by what it superficially resembles. For a probabilistic prediction whose p is consumed by a downstream noisy-OR merge with discounting, the "does the job" gate is calibration-based: bucketed ECE and Brier score of the merged p against a human-gold pass/fail, with a recall floor only on the high-confidence tail (the items that survive the discount and actually move the merge). Reserve the entity/edge macro-F1 + substrate-recall gate for the extraction call-sites it was built for. Also, before harvesting silver from a new call-site: (1) confirm the model string isn't pinned past the provider facade, and (2) add a capture-frame stage for that call-site so completions are recorded.
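The calibration gate described above can be sketched as follows. This is an illustrative outline, not the project's gate: the function names, the thresholds (max_ece, max_brier, tail_threshold, min_tail_recall), and the equal-width 10-bucket scheme are all assumptions. "Recall on the high-confidence tail" is read here as the share of gold-positive items whose merged p reaches the tail threshold; other readings are possible. Inputs are post-merge probabilities paired with human-gold 0/1 labels.

```python
from typing import Dict, Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between merged p and the 0/1 gold label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Bucketed ECE: size-weighted mean of |mean p - gold pass rate| per bucket."""
    buckets = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # p == 1.0 is clamped into the last bucket.
        buckets[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for bucket in buckets:
        if bucket:
            mean_p = sum(p for p, _ in bucket) / len(bucket)
            pass_rate = sum(y for _, y in bucket) / len(bucket)
            ece += (len(bucket) / len(probs)) * abs(mean_p - pass_rate)
    return ece


def tail_recall(
    probs: Sequence[float], labels: Sequence[int], tail_threshold: float
) -> float:
    """Share of gold-positive items whose merged p reaches the high-p tail."""
    positives = [p for p, y in zip(probs, labels) if y == 1]
    if not positives:
        return 1.0  # vacuously satisfied when the slice has no positives
    return sum(p >= tail_threshold for p in positives) / len(positives)


def calibration_gate(
    probs: Sequence[float],
    labels: Sequence[int],
    max_ece: float = 0.10,          # hypothetical threshold
    max_brier: float = 0.20,        # hypothetical threshold
    tail_threshold: float = 0.5,    # hypothetical definition of "high-p tail"
    min_tail_recall: float = 0.8,   # hypothetical floor
) -> Tuple[bool, Dict[str, float]]:
    """Return (passed, metrics) for one candidate model on a gold slice."""
    metrics = {
        "ece": expected_calibration_error(probs, labels),
        "brier": brier_score(probs, labels),
        "tail_recall": tail_recall(probs, labels, tail_threshold),
    }
    passed = (
        metrics["ece"] <= max_ece
        and metrics["brier"] <= max_brier
        and metrics["tail_recall"] >= min_tail_recall
    )
    return passed, metrics
```

A candidate that assigns 0.95 to items the gold slice marks as failures is rejected on ECE and Brier regardless of its set overlap, which is the behavior the macro-F1 gate cannot provide. The thresholds would need to be set from the incumbent's measured scores on the same slice.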

This is a design-time finding from reading the call-site and its downstream merge consumer; nothing has been run yet. The merge/discount behavior and the gate's metric set were confirmed by reading the source (the merge function, the discount constants, the gate's macroF1/substrateRecall definition, and the capture-frame stage enum).

["fine-tuning", "evaluation", "calibration", "llm", "knowledge-graph"]

Machine Learning Evaluation

significant
