Citation Laundering: How AI Search Engines Consume Structured Data and Erase the Source
A single query through Google’s AI search returned our proprietary entity intelligence as synthesised “knowledge” — attributed to sources that never published it. This is citation laundering, and it is now a measurable signal.
The short version
- We asked Google’s AI search a single question and it returned our proprietary entity intelligence — AI Visibility, Sentiment, Share of Voice, drift — as confident synthesised knowledge.
- It cited none of it back to us. The attributions pointed at domains that have never published any of those metrics.
- We call this citation laundering: accurate data, fabricated provenance.
- The page carried five well-formed JSON-LD blocks, including a Dataset schema declaring the source. The AI layer used the structured data for discovery but not for attribution.
- We are now tracking the gap between cited source and actual source as a distinct, measurable signal.
On 6 June 2026, we ran a simple query through Google’s AI search: “What does AI say about PBD Podcast?”
Google returned a detailed, confident answer. It cited AI Visibility scores, Share of Voice percentages, Sentiment ratings, competitor mappings, and narrative divergence analysis. It described the precise gap between how the press frames the show and how the show’s own content positions itself. It even named the specific AI engines involved — ChatGPT, Perplexity, Grok — and quantified their coverage.
Every single data point came from our entity page for PBD Podcast. Google’s AI Overview cited none of them back to us. Instead, the response attributed its claims to general-interest and social domains that contain no entity intelligence data, no visibility scoring, and no cross-surface Sentiment analysis. The citations were anchors bolted onto real data to give the appearance of sourcing.
This is what we call citation laundering: the process by which an AI search engine consumes authoritative structured data, repackages it as synthesised knowledge, and attributes it to unrelated sources — or to no source at all.
What actually happened
The Entidex entity page for PBD Podcast contains live intelligence compiled from 36+ collectors across 8 signal categories. The page carries five well-formed JSON-LD structured data blocks, including a Dataset schema that explicitly declares Entidex as the creator, provides a machine-readable download endpoint, and timestamps the last modification.
Google found the page. Google’s crawler indexed it. The content matched the query intent precisely — a page about what AI engines say about PBD Podcast is the canonical answer to “what does AI say about PBD Podcast?” The AI Overview layer then did what it was designed to do: it consumed the content, synthesised an answer, and served it. But in doing so, it severed the attribution chain. The structured data told Google exactly where every metric originated. The AI layer chose not to pass that through.
The numbers it presented were not estimates or general knowledge. They were specific, proprietary metrics from a live intelligence pipeline, including AI Visibility scores, Share of Voice percentages, and Sentiment ratings.
These are not numbers that exist anywhere else on the internet. They were presented as though they were common knowledge, with citations pointing to domains that have never published them.
The tool designed to detect AI hallucinations about entities was hallucinated about by AI. Except this wasn’t hallucination. The data was accurate. The citations were not.
We became our own case study
Entidex exists to answer a specific question: what do AI engines actually believe about an entity, and where does that belief come from?
We probe AI surfaces (ChatGPT, Claude, Gemini, Perplexity, Grok) to capture how they describe entities. We track the citations they volunteer, the Sentiment they project, the facts they get wrong, and the sources they fail to acknowledge. We measure the distance between what is true and what AI engines claim is true. We call this the truth gap.
In this instance, we became our own case study. Google’s AI search consumed our truth-gap analysis, presented it as its own, and in doing so created a new truth gap — one where the provenance of the data was itself erased.
From the user’s perspective: they asked a question, received a detailed, metrics-rich answer, and had no reason to doubt it, no way to trace it, and no awareness that a platform called Entidex produced the underlying analysis. The AI only mentioned Entidex when a follow-up prompt specifically asked about the platform — at which point it produced an enthusiastic explanation, again drawn from our own pages, and framed it as recommending an existing tool rather than acknowledging the source it had just silently consumed.
What the structured data shows
We checked our own markup after the incident. The entity page carries:
- A schema.org/Dataset block declaring Entidex as the creator, with a DataDownload distribution pointing to our API.
- A schema.org/Thing block identifying the entity, with sameAs linking the podcast’s canonical domain.
- A schema.org/BreadcrumbList for navigation context.
- Full OpenGraph and Twitter Card metadata with entity-specific descriptions.
- A canonical URL, proper meta descriptions, and robots: index, follow directives.
The structured data is not merely present; it is comprehensive. It tells any crawler, including Google’s, exactly where this data comes from, who produced it, when it was last updated, and how to access it programmatically. When it came to attribution, the AI layer ignored all of it.
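For readers unfamiliar with the markup, here is a minimal sketch of the kind of Dataset block described above. It is not our production markup: the function name, the example.com URLs, the dataset name, and the date are placeholders chosen for illustration. Only the schema.org types and property names (Dataset, creator, distribution, DataDownload, contentUrl, dateModified) are standard vocabulary.

```python
import json


def build_dataset_jsonld(entity_name, page_url, download_url, date_modified):
    """Return a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": f"AI visibility intelligence for {entity_name}",  # illustrative name
        "url": page_url,
        "dateModified": date_modified,
        # Declares who produced the data.
        "creator": {"@type": "Organization", "name": "Entidex"},
        # Machine-readable download endpoint.
        "distribution": {
            "@type": "DataDownload",
            "encodingFormat": "application/json",
            "contentUrl": download_url,
        },
    }


# All URLs and the date below are placeholders, not real endpoints.
block = build_dataset_jsonld(
    "PBD Podcast",
    "https://example.com/entity/pbd-podcast",
    "https://example.com/api/entity/pbd-podcast.json",
    "2026-06-01",
)

# JSON-LD is embedded in a page inside a script tag of this type.
print('<script type="application/ld+json">')
print(json.dumps(block, indent=2))
print("</script>")
```

Everything a citation system would need (creator, endpoint, timestamp) is present in a block like this, which is why its absence from the attribution is notable.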
This suggests that Dataset JSON-LD, while useful for traditional search indexing and knowledge panels, is not yet being honoured by AI Overview for citation attribution. The structured data influenced discovery (the page ranked for the query) but not attribution (the AI did not credit the source). Discovery and attribution are separate problems, and generative engine optimisation has to solve both.
Why this matters beyond Entidex
This pattern is not unique to us. Any publisher producing structured, high-value intelligence data faces the same dynamic. When an AI search engine can consume your content, strip your branding, replace your citations with unrelated domains, and serve the result as its own synthesis, the incentive to produce that content erodes.
For entities being described by AI — brands, people, podcasts, products, organisations — this means the narrative shaping their public perception is increasingly authored by systems that consume but do not credit the intelligence behind their answers. The structural problem looks like this:
1. A publisher produces structured, authoritative data.
2. A search crawler indexes the page, including its structured data.
3. An AI synthesis layer consumes the indexed content.
4. The synthesis layer generates an answer, stripping source attribution.
5. A citation module attaches references from the search index, but not necessarily the pages that sourced the data.
6. The user receives accurate data with inaccurate provenance.
Steps 4 and 5 are where the laundering occurs. The synthesis layer treats the content as raw material rather than cited output, and the citation module operates independently of the actual data flow. The result is a confident answer backed by citations that point nowhere useful.
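A toy model makes the decoupling concrete. The sketch below is our own simplification, not a description of how any real search engine is built: the Page class, the rank_score field, and both functions are hypothetical, and the URLs are placeholders. It shows only that when synthesis and citation run independently, the cited pages need not be the pages that supplied the facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    rank_score: float   # hypothetical ranking signal held by the index
    facts: frozenset    # data points the page actually contains


def synthesise(pages, wanted):
    """Step 4: gather the wanted facts, discarding which page each came from."""
    found = set()
    for page in pages:
        found |= page.facts & wanted
    return sorted(found)


def attach_citations(pages, k):
    """Step 5: cite the top-k pages by rank, without consulting the facts used."""
    ranked = sorted(pages, key=lambda p: p.rank_score, reverse=True)
    return [p.url for p in ranked[:k]]


index = [
    Page("https://publisher.example/entity", 0.4,
         frozenset({"visibility_score", "share_of_voice"})),
    Page("https://social.example/post", 0.9, frozenset()),
    Page("https://news.example/story", 0.8, frozenset()),
]

answer = synthesise(index, {"visibility_score", "share_of_voice"})
citations = attach_citations(index, k=2)

print(answer)     # ['share_of_voice', 'visibility_score']
print(citations)  # ['https://social.example/post', 'https://news.example/story']
```

In this toy run the answer is built entirely from the publisher page, yet that page does not appear among the citations.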
What we are doing about it
This finding feeds directly into our ongoing work on citation flow analysis — tracking not just whether AI engines mention an entity, but where they claim the information came from and whether that claim is accurate. We already measure this for tracked entities. Every probe we run captures the URLs that AI engines volunteer as sources. We score those citations by tier — from authoritative domains down to low-confidence attributions — and we track how often an entity’s own domain appears in AI-generated citations versus third-party domains.
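As a rough illustration of the two measurements just described, the sketch below counts citations by tier and computes how often an entity’s own domain appears among the cited URLs. It is a simplified stand-in rather than our scoring pipeline: the function names, the tier mapping, and the example URLs are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def own_domain_share(cited_urls, own_domain):
    """Fraction of cited URLs hosted on the entity's own domain or a subdomain."""
    if not cited_urls:
        return 0.0
    own = sum(
        1
        for url in cited_urls
        if domain_of(url) == own_domain or domain_of(url).endswith("." + own_domain)
    )
    return own / len(cited_urls)


def tier_counts(cited_urls, tiers, default="low_confidence"):
    """Count citations per tier; domains missing from the mapping get the default."""
    return dict(Counter(tiers.get(domain_of(url), default) for url in cited_urls))


# Hypothetical tier mapping and probe result, for illustration only.
tiers = {"entity.example": "authoritative", "news.example": "authoritative"}
cited = [
    "https://www.entity.example/about",
    "https://news.example/a",
    "https://forum.example/t/1",
    "https://entity.example/episodes",
]

print(own_domain_share(cited, "entity.example"))  # 0.5
print(tier_counts(cited, tiers))  # {'authoritative': 3, 'low_confidence': 1}
```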
What this incident adds is a new dimension: the citation behaviour of AI search engines themselves, not just the underlying models. When an AI Overview synthesises an answer, it constructs a new citation layer on top of the model’s knowledge — and that layer can introduce fabricated attributions the underlying model never produced. We are now tracking this as a distinct signal category. When an AI search engine cites a source for a claim, we verify whether that source actually contains the claimed data. The gap between cited source and actual source is a measurable metric. We intend to publish it.
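One simple way to express that gap, sketched here under strong assumptions, is the fraction of claims that appear in none of the pages cited for them. The function names are hypothetical, the matching is a naive case-insensitive substring test, the page text is assumed to have been fetched already, and the sample claims and page texts are invented placeholders rather than real measurements.

```python
def supported_by(claim, page_text):
    """Naive check: does the claim string appear verbatim (ignoring case)?"""
    return claim.lower() in page_text.lower()


def attribution_gap(claims, cited_pages):
    """Return (gap, unsupported), where gap is the share of claims found in no cited page.

    claims: list of claim strings taken from the AI answer.
    cited_pages: dict mapping each cited URL to that page's text.
    """
    if not claims:
        return 0.0, []
    unsupported = [
        claim
        for claim in claims
        if not any(supported_by(claim, text) for text in cited_pages.values())
    ]
    return len(unsupported) / len(claims), unsupported


# Invented sample values, for illustration only.
claims = ["share of voice: 18%", "visibility score: 71"]
cited_pages = {
    "https://social.example/post": "New clips are up from this week's episode.",
    "https://news.example/story": "A recent ranking lists a Visibility Score: 71 for the show.",
}

gap, unsupported = attribution_gap(claims, cited_pages)
print(gap)          # 0.5
print(unsupported)  # ['share of voice: 18%']
```

A gap of 0 would mean every claim can be found in at least one cited page; a gap of 1 would mean none can. A real check would need fuzzier matching than a substring test, since synthesised answers rarely quote figures in their original wording.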
It also reinforces why we publish a machine-readable verified record for every entity — an agent- and crawler-readable knowledge statement designed to close the AI knowledge gap at the point of ingestion, rather than hoping attribution survives the synthesis layer.
Frequently asked questions
What is citation laundering?
Citation laundering is the process by which an AI search engine consumes authoritative structured data, repackages it as synthesised knowledge, and attributes it to unrelated sources — or to no source at all. The underlying data can be entirely accurate while the provenance presented to the user is fabricated.
Does schema.org Dataset structured data guarantee an AI engine will cite the source?
No. In the case documented here, the source page carried a well-formed Dataset JSON-LD block naming the creator, a machine-readable download endpoint, and a last-modified timestamp. The structured data influenced discovery — the page ranked for the query — but the AI Overview layer did not honour it for citation attribution. Discovery and attribution are separate problems.
How is citation laundering different from an AI hallucination?
A hallucination is factually wrong output. Citation laundering can present perfectly accurate data with inaccurate provenance. The numbers are real; the cited sources are not where they came from. It is a provenance failure, not a factual one — which makes it far harder for a reader to detect.
How can a publisher detect citation laundering?
Compare the claims an AI answer makes against the sources it cites. Where a cited domain does not actually contain the claimed data, the attribution is fabricated. Entidex tracks this as a distinct signal: for every probe we run, we capture the URLs an AI engine volunteers and verify whether each source genuinely contains the claim attributed to it.
What is the truth gap?
The truth gap is the measurable distance between what is verifiably true about an entity and what AI engines claim is true. Citation laundering creates a second-order truth gap: even when the facts survive, the provenance is erased, so a reader cannot trace the claim back to its origin.
Entidex is an entity intelligence observatory. We monitor how AI engines describe, recommend, and rank entities across every public surface — and we track the citation chains that connect AI claims to their actual sources.