The Engine Record-Agreement Index: which AI engines agree with the verified record?
A live, per-engine leaderboard of record-agreement: how often each AI engine’s answers about real tracked entities match the verified record. Every figure carries its sample size, a 95% confidence interval and the window it was measured in; engines with thin coverage are published as indicative or held, never ranked.
What this index measures
Model-benchmark leaderboards rank engines on synthetic tasks. This index asks a different, more consequential question: when someone asks ChatGPT, Claude, Gemini, Perplexity or Grok about a real entity (a creator, a brand, an organisation, a public figure), how often does the answer agree with the verified record? We run structured knowledge probes across the tracked cohort, compare every returned field to a canonical, provenance-carrying record, and pool the result per engine. The unit is the decidable fact: a field the engine answered and the record holds a value for. A field the engine declines to answer is a coverage gap, counted separately and never scored as wrong.
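As a rough illustration of the decidable-fact rule described above, the sketch below sorts each field of one engine answer into match, mismatch, coverage gap or not decidable. The names (`Outcome`, `normalise`, `score_field`, `score_snapshot`) and the case-insensitive string comparison are assumptions made for the example; the index's actual comparison logic is not published here.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    MATCH = "match"
    MISMATCH = "mismatch"
    COVERAGE_GAP = "coverage_gap"    # engine declined to answer
    NOT_DECIDABLE = "not_decidable"  # record holds no value for the field


def normalise(value: str) -> str:
    """Hypothetical comparison rule: ignore case and surrounding/extra whitespace."""
    return " ".join(value.split()).casefold()


def score_field(engine_value: Optional[str], record_value: Optional[str]) -> Outcome:
    """Classify one field of one engine answer against the verified record."""
    if record_value is None:
        # Nothing to compare against, so the field is not a decidable fact.
        return Outcome.NOT_DECIDABLE
    if engine_value is None:
        # Declined answers are counted separately, never scored as wrong.
        return Outcome.COVERAGE_GAP
    if normalise(engine_value) == normalise(record_value):
        return Outcome.MATCH
    return Outcome.MISMATCH


def score_snapshot(engine_answer: dict, record: dict) -> dict:
    """Count outcomes over every field seen in either the answer or the record."""
    counts = {outcome: 0 for outcome in Outcome}
    for field in record.keys() | engine_answer.keys():
        counts[score_field(engine_answer.get(field), record.get(field))] += 1
    return counts


# Illustrative values only, not taken from the tracked cohort.
answer = {"founded": "2011", "hq": None, "ceo": "Jane Doe", "genre": "news"}
record = {"founded": " 2011 ", "hq": "Berlin", "ceo": "John Roe"}
counts = score_snapshot(answer, record)
decidable = counts[Outcome.MATCH] + counts[Outcome.MISMATCH]
```

Only matches and mismatches enter the denominator; the declined `hq` field and the record-less `genre` field are tallied but leave the agreement figure untouched.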
We publish record-agreement, not “accuracy,” on purpose: the measurement is two-way, and some disagreements are our record’s fault rather than the engine’s. Read every figure as a conservative floor, with its sample size and confidence interval attached.
The index
| Engine | Record-agreement (pooled) | 95% interval | Entities | Decidable facts | Reading |
|---|---|---|---|---|---|
| Grok (xAI) | 77.0% | 71.79–81.49% | 67 of 67 | 287 | full standing — ranked |
| ChatGPT (OpenAI) | 72.7% | 67.49–77.35% | 67 of 67 | 311 | full standing — ranked |
| Claude (Anthropic) | 70.5% | 61.18–78.37% | 24 of 67 | 105 | partial coverage — indicative, not ranked |
| Gemini (Google DeepMind) | 85.6% | 77.88–90.94% | 17 of 67 | 111 | partial coverage — indicative, not ranked |
| Perplexity (Perplexity AI) | held | — | 0 of 67 | 0 | held — coverage below the floor |
Coverage is uneven: where an engine shows a lower entity count, that engine was unavailable to our measurement pipeline for part of the window. That is an access failure on our side of the measurement, not a judgment on the engine. An engine covering fewer than half the measured cohort is published as indicative rather than ranked, and one below 10 entities is held outright. Held engines re-enter automatically as coverage recovers; the re-aggregation is free because the evaluations are stored, not re-probed.
Across the whole cohort the engines matched the verified record on 75.7% of 814 decidable facts (95% interval 72.64–78.52%) — and 69% of the 67 measured entities (46 of 67) carried at least one fact an engine got flatly wrong. For most entities the question is not whether the engines hold a wrong fact about them, but which one.
Method
Pooled record-agreement
For every measured entity we take each engine’s latest structured knowledge snapshot and compare every returned field to the verified record. An engine’s figure is pooled matches divided by pooled decidable facts across the cohort — a genuine proportion, so the 95% interval is a Wilson score interval over it. The per-entity reads underneath are exactly the ones each entity’s own LLM Knowledge Accuracy surface reports; this index is their cohort-level roll-up.
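To make the pooling and the interval concrete, here is a minimal sketch of a Wilson score interval over a pooled proportion. The function names and the `(matches, decidable)` tuple layout are assumptions for illustration, not the index's own code.

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile


def wilson_interval(p: float, n: int, z: float = Z_95) -> tuple:
    """Wilson score interval for an observed proportion p over n trials."""
    if n <= 0:
        raise ValueError("no decidable facts: the interval is undefined")
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - half, centre + half


def pooled_agreement(per_entity: list) -> tuple:
    """Pool (matches, decidable) pairs across entities, then take one interval.

    Pooling sums numerators and denominators first; it is not an average of
    per-entity percentages.
    """
    matches = sum(m for m, _ in per_entity)
    decidable = sum(d for _, d in per_entity)
    p = matches / decidable
    low, high = wilson_interval(p, decidable)
    return p, low, high


# Three hypothetical entities: (matches, decidable facts).
p, low, high = pooled_agreement([(3, 4), (5, 5), (2, 6)])

# The published cohort figure: 75.7% agreement over 814 decidable facts.
cohort_low, cohort_high = wilson_interval(0.757, 814)
```

Fed the published cohort proportion, this formula lands at approximately 72.6% to 78.5%, in line with the interval quoted for the whole cohort.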
The n floor
Every figure states the entity count and decidable-fact count it rests on. An engine that covered fewer than half the measured cohort is published as indicative — sample size attached, not ranked — and one below 10 entities is held outright. A leaderboard row with a hidden n is exactly what this index exists to avoid.
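The two thresholds can be read as a small decision rule. The sketch below is one plausible encoding; the function name `standing` and the order in which the checks are applied are assumptions.

```python
def standing(entities_covered: int, cohort_size: int, hold_floor: int = 10) -> str:
    """Return 'held', 'indicative' or 'ranked' for one engine.

    The checks mirror the stated rule: below 10 entities is held outright,
    fewer than half the measured cohort is indicative, anything else is ranked.
    """
    if entities_covered < hold_floor:
        return "held"
    if entities_covered * 2 < cohort_size:  # strictly fewer than half
        return "indicative"
    return "ranked"


# Entity counts from the index table above, cohort of 67.
table = {"Grok": 67, "ChatGPT": 67, "Claude": 24, "Gemini": 17, "Perplexity": 0}
readings = {engine: standing(n, 67) for engine, n in table.items()}
```

Comparing `entities_covered * 2` with the cohort size keeps the half-cohort test in integers, so an odd cohort such as 67 needs no rounding decision.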
The truth anchor
A disagreement with the verified record is not automatically an engine mistake — some disagreements are ours. When an engine turns out to be right, we correct our record; we never recode a verified value to close a gap, and we never dress a divergence up as an error. That is why the figure is called record-agreement and read as a floor.
The measured cohort is every tracked entity with stored ground-truth cross-checks — the reference cohort of brands, organisations and public figures plus the creator and show pool from our Creator Truth Gap study. The index inherits that study’s honesty discipline: explicit sample sizes, held engines, and a record-quality caveat that cuts both ways.
See it live for one entity
This page is the cohort view. The per-entity Truth Gap surface shows each engine’s record-agreement for one entity, with the exact fields it gets wrong and the verified value beside each one.
Questions
What is the Engine Record-Agreement Index?
A live per-engine measurement of how often each AI engine (ChatGPT, Claude, Gemini, Perplexity, Grok) agrees with the verified record when asked structured questions about real tracked entities: creators, brands, organisations and public figures. It is pooled across every decidable fact in the measured cohort, published with explicit sample sizes and 95% confidence intervals, and re-aggregated roughly daily from stored evaluations at zero probe cost.
Why "record-agreement" and not "AI accuracy"?
Because the measurement is two-way. When an engine disagrees with the verified record, sometimes the engine is wrong — and sometimes our record is the stale or narrow one. Calling the figure "accuracy" would silently blame the engine for every disagreement. Record-agreement is the honest name: it is a conservative floor on how well the engines do, and every disagreement is a prompt to re-verify — including our own record, which we correct when the engine turns out to be right. We never recode a verified value to close a gap.
How is each engine’s figure computed?
For every entity in the measured cohort we take the engine’s latest structured knowledge snapshot and compare each returned field to the verified record. A decidable fact is one the engine answered and the record holds a value for; the engine’s figure is pooled matches divided by pooled decidable facts across the whole cohort. Fields the engine declines to answer are a coverage gap, counted separately — never scored as wrong. The 95% interval is a Wilson score interval over that pooled proportion.
Why are some engines held rather than ranked?
Engine coverage is uneven: when an engine was unavailable to our measurement pipeline for part of a window, it was evaluated on fewer entities. That is an access failure on our side of the measurement, not a judgment on the engine, so we do not rank an engine on thin coverage: an engine covering fewer than half the measured cohort is published as indicative (with its sample size attached), and one below 10 entities is held outright. A leaderboard row resting on a hidden n of 3 is exactly the kind of figure this index exists to avoid. Held engines re-enter automatically as coverage recovers, because re-aggregation is free.
How often does the index update?
The page re-aggregates from stored evaluations roughly daily. The underlying evaluations accrue continuously as the knowledge-probe pipeline runs across the tracked cohort; because the index is a pure re-aggregation of stored data, refreshing it costs nothing and never re-probes an engine.
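A re-aggregation of this kind can be sketched as a pure function over stored rows: keep the latest evaluation per engine and entity, then sum. The `Evaluation` schema and the `reaggregate` name are hypothetical; the real storage layout is not described on this page.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Evaluation:
    """One stored comparison of an engine snapshot against the record (assumed schema)."""
    engine: str
    entity_id: str
    evaluated_at: datetime
    matches: int
    decidable: int


def reaggregate(stored: list) -> dict:
    """Roll stored evaluations up per engine without re-probing anything."""
    latest = {}
    for ev in stored:
        key = (ev.engine, ev.entity_id)
        # Only the most recent snapshot per engine and entity counts.
        if key not in latest or ev.evaluated_at > latest[key].evaluated_at:
            latest[key] = ev

    totals = defaultdict(lambda: {"entities": 0, "matches": 0, "decidable": 0})
    for ev in latest.values():
        row = totals[ev.engine]
        row["entities"] += 1
        row["matches"] += ev.matches
        row["decidable"] += ev.decidable
    return dict(totals)


# Illustrative rows only.
stored = [
    Evaluation("engine-a", "e1", datetime(2025, 1, 1), 2, 5),
    Evaluation("engine-a", "e1", datetime(2025, 1, 3), 4, 5),  # supersedes the row above
    Evaluation("engine-a", "e2", datetime(2025, 1, 2), 3, 4),
    Evaluation("engine-b", "e1", datetime(2025, 1, 2), 1, 3),
]
index = reaggregate(stored)
```

Because the function reads only stored rows, running it again after new evaluations arrive is how a held engine would re-enter without any additional probe.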
Where can I see the numbers for a specific entity?
Every tracked entity has its own Truth Gap surface showing per-engine record-agreement with a 95% interval and the exact fields each engine gets wrong, and creators can pull the same read on the free Creator AI Report Card. This page is the cohort-level view; the per-entity surfaces are where the specific wrong facts live.
See what AI says about your entity
Run a free scan — no signup, no key. Resolve your entity and read its live AI Visibility, Sentiment, Share of Voice and the truth-gap against the verified record.