Why AI Search Analytics Breaks When AI Is on Both Sides of the Measurement

The Double-Probabilistic Problem
Your conversational analytics stack just told you why your AI citations dropped 15% last week. It named three causes, weighted them roughly, and suggested two actions. The response took eleven seconds. It was probably wrong, and nothing in the text will tell you so.
This is not a story about a bad tool. Every piece of that stack is doing what it was built to do:
The citation tracker pulled real numbers from real prompt runs
The MCP layer handed those numbers to an LLM with full fidelity
The LLM generated a fluent causal explanation in the register it was trained to produce
The output arrived on time and looked decision-ready
The problem is what nobody in the conversation is paying attention to. The numbers the tracker produced are samples from a probability distribution, not readings from a gauge. The LLM interpreting them has a systematic overconfidence bias documented in peer-reviewed literature across multiple task domains. Neither layer signals its uncertainty in the format the user sees. A dashboard shows a number and forces the human to supply the causal story. A conversational answer supplies the story pre-assembled, in prose, with no markers of the variance underneath it.
Two probabilistic systems, stacked, and a clean-looking answer that reflects neither.
The argument in six bullets
AI search outputs (what ChatGPT recommends, what Perplexity cites, what AI Overviews surface) are generated by a probabilistic system with massive per-run variance.
The LLM you use to interpret those outputs has a peer-reviewed, cross-domain overconfidence bias.
Stacking probabilistic signal with probabilistic interpretation compounds error in ways the reader cannot see.
Prose preempts skepticism in ways dashboards do not. The fluency of the output is itself the problem.
The standard safety net, Twyman's Law, misfires in both directions on AEO data.
The fix is a six-principle framework we call AEO-Safe Measurement, all of it implementable through system-prompt configuration in an afternoon.
The rest of this piece is the evidence and the implementation.
The Signal Is Not What SEO Trained You to Expect
Two decades of SEO trained marketers to trust the baseline. Rank third for a keyword yesterday, you probably rank third today. Noise at the margins, stability at the core. A 15% week-over-week traffic drop is a real event worth a real investigation.
This instinct is correct for SEO. It breaks completely for AI search, because the variance in AI search data is first-order, not marginal.
The evidence is now overwhelming
Per-run variance (SparkToro and Gumshoe, January 2026):
600 volunteers, 12 brand-recommendation prompts, 60 to 100 runs per platform, 2,961 total runs across ChatGPT, Claude, and Google AI
Probability of receiving the same brand list twice from the same prompt: under 1 in 100
Probability of receiving the same list in the same order: under 1 in 1,000
Every run varied on three dimensions at once: which brands appeared, in what order, how many appeared
Underlying intent was stable (top brands surfaced 55 to 77 percent of the time) but specific rankings were effectively random
Regeneration and cross-platform variance (Ahrefs, November to December 2025):
AI Overview content regenerates for the same query roughly 70% of the time
When it regenerates, 45.5% of cited sources turn over
Within Google's own product family, AI Mode and AI Overviews cite different URLs for the same query 87% of the time
Between ChatGPT and Perplexity, only about 11% of cited domains overlap
Monthly churn (Profound, 680 million citations, August 2024 to June 2025):
40 to 60% of cited domains change month over month
70 to 90% are completely different over six-month windows
Individual brand-level citation counts are more volatile than these aggregates
Sudden structural shifts (Semrush, 13-week tracking study):
September 2025: ChatGPT's citation share for Reddit collapsed from ~60% of responses to ~10% in a few weeks
Wikipedia on ChatGPT dropped from ~55% to under 20% in the same window
Neither Perplexity nor Google AI Mode showed equivalent drops
Most likely explanation (per Semrush's research lead): an intentional behavioral adjustment by OpenAI, not an external event any brand could have predicted
Citation-claim coupling (Wu et al., Nature Communications, April 2025):
SourceCheckup framework evaluated 7 LLMs on 800 medical questions, 58,000 statement-source pairs
50 to 90% of LLM responses were not fully supported by the sources they cited
GPT-4o with web access had ~30% of individual statements unsupported by listed sources
Over 90% of the same answers were presented with high-confidence framing
What this data means
This is the signal layer. It is:
High-variance run to run
High-churn month to month
Subject to sudden platform-level shifts no brand can predict
Internally loose in the coupling between claim and citation
A brand whose AEO strategy depended on Reddit visibility lost most of its ChatGPT citations in September 2025 for reasons uncorrelated with anything that brand had done. Every measurement instinct a seasoned SEO or performance marketer brings to this data will over-read what the data is saying. This is not a problem better tools solve. It is a property of the underlying systems.
The Interpretation Layer Has an Understanding Problem
Here is the part the industry has not talked about out loud. When your team asks Claude, ChatGPT, or Gemini to interpret AI search data, you are not getting a statistical analysis. You are getting a plausible narrative generated by a system whose overconfidence has been characterized across multiple peer-reviewed studies.
The convergent findings on LLM overconfidence
Stanford FermiEval (Epstein et al., October 2025):
Benchmark of Fermi-style numerical estimation questions
When asked for a nominal 99% confidence interval, modern LLMs produced intervals that covered the true answer only 65% of the time
Authors' explanation: "perception-tunnel theory." Under uncertainty, LLMs sample from a truncated region of their inferred distribution and neglect the tails
Princeton reasoning-calibration study (Mei et al., 2025):
Tested whether reasoning models improve calibration when they think longer
Result: reasoning models are overconfident at baseline and become more overconfident as reasoning depth increases
The opposite of the expected direction
Confidence elicitation across domains (Lin et al., 2023):
Evaluated verbalized, consistency-based, and hybrid confidence methods
Across 5 dataset types and 4 model families: verbalized confidence systematically exceeds actual accuracy
Prompting strategies (chain-of-thought, top-K, multi-step) help at the margins but do not solve the problem
LLM-as-Judge overconfidence literature:
Same pattern in pairwise preference tasks
Calibration degrades under distribution shift
Domain-specific tasks are worse, not better
A necessary caveat
FermiEval tests numerical estimation, not causal analysis of marketing data. The 65% coverage number does not transfer one-for-one to the task of explaining citation movements. What transfers is the pattern: modern LLMs produce expressions of confidence that systematically exceed the empirical accuracy of the claims underneath, across task types and model generations. This is not something scale, chain-of-thought, or additional reasoning depth has been shown to fix.
What this looks like in your stack, mechanically
Step by step, when a user asks "why did our citations drop 15% last week?":
The model receives the prompt plus a small tabular view of the data
It has no knowledge of the historical variance of this particular metric
It has no internal representation of the natural noise floor for a 20-prompt citation basket
It has no prior over how often platform-level shifts invalidate week-over-week comparisons
What it has is training on billions of tokens where "why" questions get answered with fluent causal explanations
So it generates one
The output lists three plausible causes: a competitor's recent content, a schema change, a Reddit thread. Each is internally coherent. Each may contain true facts. The aggregate answer is expressed with the narrative confidence the model would use to tell you what year the French Revolution started.
The reader reads this as analysis. It is not. It is pattern-completion over a small sample from a high-variance substrate, wrapped in the grammatical register of expertise.
Why format matters more than content
A dashboard showing "minus 15%" puts the burden of causal interpretation on the human reader
The human brings their own skepticism, their own knowledge of recent activity, their own sense of what a meaningful movement looks like
A conversational answer supplies the causal interpretation pre-assembled
The reader's skepticism is preempted by the fluency of the output
Prose does not carry error bars, and asking an overconfident narrator to supply them produces narrators who claim error bars they have not computed
The instrument has a known overconfidence bias. The medium actively hides the uncertainty a more primitive format would have preserved. That is the second layer of the problem.
What Happens When You Stack Your Data and Analysis
The stacking is the new part. Every major AEO tracker now ships an MCP integration. Every growth team is being pitched a conversational analytics upgrade. The productivity argument is compelling when demoed.
Five specific failure modes result when a probabilistic signal layer is stacked with a probabilistic interpretation layer and no structural guardrails are installed between them.
Every team we have audited in the last six months has at least two of these running in production.
Failure Mode One: Noise gets promoted to signal
The classic safety net for interpreting unusual numbers is Twyman's Law: any figure that looks interesting or different is usually wrong. Flag anything moving more than 50% period-over-period and have a human review before acting. On GA4 events or paid-search conversions, this catches most easy errors: logging bugs, bot traffic, duplicated events.
On AEO data, Twyman's trigger misfires in both directions:
Under-fires on the 10 to 40 percent movements that feel meaningful but sit inside natural noise for low-sample citation metrics, so teams chase phantom trends
Over-fires on the 50%+ movements that are real platform-level shifts, so teams dismiss the most important signals as probable errors
The September 2025 Reddit collapse on ChatGPT was a 50-percentage-point move in weeks. Twyman's rule, applied mechanically, would have flagged it as "probably an instrumentation problem" at the exact moment it was the single most important real signal of the quarter.
The deeper issue: Twyman's rule assumes a stable baseline. AEO has none. The safety net is calibrated for the wrong failure mode.
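To put a rough number on "natural noise," here is a minimal sketch under a deliberately simple assumption: every prompt run independently cites the brand with the same probability in both weeks, so nothing real has changed. The function name, the 30% baseline rate, and the run counts are illustrative assumptions, not figures from any tracker.

```python
import math

def relative_change_noise(p: float, n: int) -> float:
    """One-sigma week-over-week relative change when nothing real changed.

    Assumption: each of n independent runs cites the brand with
    probability p in both weeks (a simplified binomial model).
    """
    se_one_week = math.sqrt(p * (1 - p) / n)     # standard error of one week's rate
    se_difference = math.sqrt(2) * se_one_week   # difference of two independent weeks
    return se_difference / p                     # expressed relative to the baseline rate

for runs in (20, 60, 600):
    noise = relative_change_noise(p=0.30, n=runs)
    print(f"{runs:>4} runs: one-sigma relative swing ≈ {noise:.0%}")
```

Under these assumptions a 20-run basket swings by roughly 48% of its baseline at one standard deviation, so a 15% drop sits at about 0.3σ. Real baskets are unlikely to satisfy the independence assumption, which may widen the noise further.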
Failure Mode Two: Composite scores hide the distinction that matters
Every AEO tracker produces something shaped like "AI Visibility Score: 42%." Every conversational layer, asked "how are we doing in AI search," retrieves something like this by default. The number is a blend of three fundamentally different outcomes:
| Metric | Mechanism | Variance | Team owner |
|---|---|---|---|
| Citations (URL linked as a source) | Real-time retrieval | Highest | Content / SEO |
| Mentions (brand named, no link) | Parametric memory + retrieval | Medium | Brand / PR |
| Recommendations (brand actively suggested) | Parametric memory + 3rd-party corroboration | Lowest | Earned media / Brand |
Collapsing these into one composite is not information loss. It is information distortion.
Seer Interactive's research on "ghost citations" across 541,213 LLM responses documented cases of a brand being cited more than 100 times in 25 days with zero brand mentions. The content was functioning as reference material for AI answers that recommended competitors.
A composite score would have called this a win
A team optimizing against a composite score would produce more of the same content
The distinction between citations, mentions, and recommendations is the single highest-leverage unbundling available in AEO measurement
The interpretation layer erases it by default
Failure Mode Three: One platform becomes the whole world
Most MCP-backed stacks connect to one or two citation trackers. Most trackers are stronger on some platforms than others. When the CMO asks how the brand is doing in AI search, the interpretation layer reasons over whatever sample it can see.
The sample almost always has coverage gaps:
Strong on ChatGPT and Perplexity (most trackers)
Weak on Gemini and AI Overviews (most trackers)
Usually absent on Copilot and AI Mode
Given 11% cross-platform domain overlap, a single-platform view is not a noisy view of the whole. It is a view of a different thing.
A brand strong on ChatGPT and weak on Perplexity faces a very different strategic situation than a brand strong on both
The interpretation layer, unaware of its own coverage gaps, produces a confident "your AI visibility is good" sentence
The sentence is correct about ChatGPT and silent about everything else
The fluency of the answer prevents the silence from being noticed
Failure Mode Four: Period-over-period comparisons assume a stationary baseline
"Your citations are up 22% month-over-month, which suggests the content strategy is working."
This is the kind of sentence an LLM produces without being asked to, because month-over-month framing dominates its training data and matches the user's question structure.
The framing assumes a stationary baseline. The evidence says otherwise:
Profound: 40 to 60% monthly cited-domain churn
SparkToro: per-run variance large enough that single measurements are samples, not points
Semrush: 50-percentage-point platform shifts in weeks
The same number, reported two ways:
"Your citations are up 22% month-over-month, suggesting the content strategy is working." (Supports aggressive reinvestment)
"Your citations are up 0.7 standard deviations from the 30-day rolling baseline, which is inside natural noise for this metric." (Supports continued observation)
Both come from the same underlying number. Only the second is defensible given the variance structure.
Failure Mode One vs Failure Mode Four: They sound similar but are structurally distinct. Mode One is about reading noise in a single measurement. Mode Four is about assuming the baseline you compare against is stable. They compound each other, and every conversational analytics deployment we have audited has both.
Failure Mode Five: The stack reasons over outputs when the signal lives upstream
When a team asks "why did our citations drop?", the MCP layer reasons over whatever it is connected to, which is usually a citation tracker. But citation movements are downstream outputs of a causal chain whose inputs live elsewhere.
The AEO causal chain (roughly):
1. Brand authority and earned media movements
2. Third-party signal density (Reddit threads, YouTube mentions, review velocity)
3. AI crawler retrieval decisions
4. Citation and mention patterns
5. AI referral traffic
6. Conversions
Most MCP stacks are wired into steps 4 and 5. The signals that actually explain most movements live in steps 1 to 3.
Evidence that upstream signals matter more than downstream citations:
Kevin Indig's Growth Memo analysis: brand search volume has the highest correlation with LLM mentions (0.334)
Seer Interactive: branded search is the second-strongest non-organic ranking signal
Upstream signals have lower natural variance (not filtered through any LLM's probabilistic response generation)
Upstream signals have higher predictive validity for where citations will be in 3 to 6 months
What this means for your stack: an MCP stack connected only to downstream trackers is reasoning over the noisiest available data and building causal stories on top of it. The interpretation layer cannot see the upstream signals that would actually explain most movements. So it constructs downstream-to-downstream explanations that sound coherent but get the causal direction wrong. Teams act on the stories. The actions fail to move the real levers. Six months later, nobody can explain why the AEO investment is not compounding.
The Framework: AEO-Safe Measurement in Six Principles
The principles below are the structural response to the structural problems above. Each addresses a failure mode. Each is implementable today in your existing stack, primarily through system-prompt configuration and tool-selection choices in your MCP layer. None require a new category-defining product.
Principle 1: Require 60+ runs per query per platform before any causal claim
The threshold: Our methodology establishes this as the statistical floor. Below 60 runs, visibility-percentage metrics do not have confidence intervals narrow enough to distinguish signal from noise.
The operational instruction: Before answering any "why," "what changed," or comparative question about AI citation data, verify that the underlying prompt basket has been run at least 60 times per query per platform in the relevant window. If it has not, respond with a specific refusal that names the sample-size gap.
What to expect: Most teams find this gate fires on a large fraction of questions they are currently asking. That is the point. Those questions should not have been answered in the first place.
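One way to enforce the gate outside the prompt is a small pre-check that runs before the interpretation layer is allowed to answer. This is a sketch, not a tracker API: `RunCount` and `sample_size_gate` are hypothetical names, and the run counts are assumed to come from whatever your tracker exposes.

```python
from dataclasses import dataclass
from typing import Optional

MIN_RUNS = 60  # the floor named in Principle 1

@dataclass
class RunCount:
    query: str
    platform: str
    runs: int  # runs in the relevant window

def sample_size_gate(counts: list[RunCount], min_runs: int = MIN_RUNS) -> Optional[str]:
    """Return a refusal naming every sample-size gap, or None if the gate passes."""
    gaps = [c for c in counts if c.runs < min_runs]
    if not gaps:
        return None
    details = "; ".join(
        f"{c.query!r} on {c.platform}: {c.runs}/{min_runs} runs" for c in gaps
    )
    return f"Cannot support a causal claim yet. Under-sampled: {details}."

counts = [
    RunCount("best crm for startups", "ChatGPT", 72),
    RunCount("best crm for startups", "Perplexity", 18),
]
print(sample_size_gate(counts))
```

Returning the refusal text from code, rather than asking the model to decide whether the sample is large enough, keeps the gate deterministic.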
Principle 2: Require cross-platform confirmation on ≥2 of 4 surfaces
The four surfaces that cover most of the AI search audience:
ChatGPT
Perplexity
Gemini (including AI Overviews and AI Mode)
Copilot
The operational instruction: For any citation movement, check whether the same directional movement appears on at least 2 of the 4 major surfaces in the same time window. Movements visible on only one surface should be reported as likely platform-specific, with causal attribution about brand-level factors explicitly withheld.
Why this works: Given 11% cross-platform domain overlap and documented precedent of platform-specific shifts, a movement visible on only one surface is most likely platform-specific behavior, not a brand-level event.
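The confirmation rule can be sketched as a small classifier. The assumptions here are ours: movements arrive as signed values in standard-deviation units per surface, and `min_move` (the threshold for counting a surface as having moved) is a hypothetical parameter you would tune.

```python
SURFACES = ("ChatGPT", "Perplexity", "Gemini", "Copilot")

def classify_movement(deltas: dict[str, float],
                      min_move: float = 1.0,
                      min_confirming: int = 2) -> str:
    """Classify a movement as cross-platform, platform-specific, or neither.

    deltas maps surface name -> signed movement; unmeasured surfaces are omitted.
    """
    measured = {s: d for s, d in deltas.items() if s in SURFACES}
    up = [s for s, d in measured.items() if d >= min_move]
    down = [s for s, d in measured.items() if d <= -min_move]

    confirmed = [(label, movers) for label, movers in (("up", up), ("down", down))
                 if len(movers) >= min_confirming]
    if len(confirmed) == 1:
        label, movers = confirmed[0]
        return f"cross-platform {label} movement on {', '.join(movers)}"

    movers = up + down
    if len(movers) == 1:
        return f"likely platform-specific ({movers[0]}); brand-level attribution withheld"
    if movers:
        return "surfaces disagree on direction; brand-level attribution withheld"
    return "no movement beyond threshold on any measured surface"

print(classify_movement({"ChatGPT": -2.3, "Perplexity": 0.1, "Gemini": -0.2}))
```

A surface missing from `deltas` is treated as unmeasured rather than unchanged, so coverage gaps cannot silently count as confirmation.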
Principle 3: Separate citations, mentions, and recommendations; never return a composite
The separation rule: Any open question about AI visibility must return three distinct numbers, not one. Ghost-citation patterns (citations up while recommendations flat) must be surfaced automatically, not only on specific request.
If your stack cannot compute these three separately:
The answer is not to ask the interpretation layer to construct a composite that hides the distinction
The answer is to build toward separating them
Until then, every composite metric should be flagged as a known distortion
The high-leverage detection: Ghost citations (cited ≥50 times in a month with zero brand mentions in those responses) are an early-warning pattern that your content is being weaponized on behalf of competitors. This needs to be a first-class alert in your measurement stack, not a report you run when you remember to.
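A minimal sketch of the separation rule and the ghost-citation alert follows. It assumes each tracked response can be reduced to three sets (cited domains, mentioned brands, recommended brands); the `Response` shape and function names are hypothetical, and the 50-citation threshold is the one stated above.

```python
from dataclasses import dataclass

@dataclass
class Response:
    cited_domains: set[str]
    mentioned_brands: set[str]
    recommended_brands: set[str]

def visibility_breakdown(responses: list[Response], domain: str, brand: str) -> dict[str, int]:
    """Return the three metrics separately, never as a composite."""
    citing = [r for r in responses if domain in r.cited_domains]
    return {
        "citations": len(citing),
        "mentions": sum(brand in r.mentioned_brands for r in responses),
        "recommendations": sum(brand in r.recommended_brands for r in responses),
        # brand mentions inside the responses that cite the domain
        "mentions_in_citing_responses": sum(brand in r.mentioned_brands for r in citing),
    }

def ghost_citation_alert(breakdown: dict[str, int], min_citations: int = 50) -> bool:
    """True when the domain is heavily cited but the brand is never named alongside it."""
    return (breakdown["citations"] >= min_citations
            and breakdown["mentions_in_citing_responses"] == 0)

month = [Response({"acme.com"}, {"Rival"}, {"Rival"}) for _ in range(50)]
month += [Response(set(), {"Acme"}, set()) for _ in range(5)]
report = visibility_breakdown(month, domain="acme.com", brand="Acme")
print(report, ghost_citation_alert(report))
```

In this toy month the brand has 50 citations, 5 mentions, and 0 recommendations, and the alert fires because none of the citing responses names the brand.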
Principle 4: Use 30-day rolling baselines; report movements in standard deviations, not %
The configuration:
30-day rolling baselines computed with historical standard deviation
Movements reported in units of standard deviations
Raw percent change shown alongside but never as the leading framing
Two framings of the same fact:
Leading with σ: "Your Reddit citation share is +0.4σ from the 30-day rolling baseline" → defensible, a disciplined analyst can act on this
Leading with %: "Your Reddit citation share is up 22% month-over-month" → looks actionable, imports a stationarity assumption the data does not support
Why this principle also addresses overconfidence: Forcing the output into a format that explicitly references variance interrupts the default narrative-completion pattern that produces LLM overconfidence in the first place. This is the most direct mechanical response to the FermiEval problem available at the prompt-engineering layer.
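The σ-first framing is straightforward to compute before the numbers ever reach the model. The sketch below assumes one observation per day and a nonzero baseline mean; `movement_in_sigma` and `describe` are illustrative names, not part of any tracker.

```python
import statistics

def movement_in_sigma(history: list[float], current: float, window: int = 30) -> float:
    """Express `current` as standard deviations from the rolling baseline."""
    baseline = history[-window:]
    if len(baseline) < 2:
        raise ValueError("need at least two baseline observations")
    mean = statistics.fmean(baseline)
    sd = statistics.stdev(baseline)  # sample standard deviation of the window
    if sd == 0:
        raise ValueError("baseline has zero variance; sigma units are undefined")
    return (current - mean) / sd

def describe(metric: str, history: list[float], current: float, window: int = 30) -> str:
    """Lead with sigma; show raw percent change second (assumes a nonzero mean)."""
    z = movement_in_sigma(history, current, window)
    mean = statistics.fmean(history[-window:])
    pct = (current - mean) / mean
    return (f"{metric} is {z:+.1f}σ from the {window}-day rolling baseline "
            f"(raw change vs. baseline mean: {pct:+.0%})")

daily_citations = [10, 12] * 15  # 30 days of toy data
print(describe("Citation count", daily_citations, current=12))
```

Passing the pre-computed sentence to the interpretation layer, instead of the raw series, means the model starts from a variance-aware statement rather than constructing its own percent-change framing.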
Principle 5: Check upstream signals before speculating about downstream causes
The sources to connect, in order:
Google Trends → branded search volume
Google Search Console → branded-click trends
PR monitoring feed (Meltwater, Muck Rack, or equivalent) → earned media velocity
Review-platform APIs (G2, Capterra, Trustpilot) → review signal density
Server log files → AI crawler activity (GPTBot, PerplexityBot, etc.)
The reasoning order: When a citation movement is flagged, the interpretation layer must first ask: did an upstream signal move first?
If yes → that is your probable cause
If no → only then should the interpretation layer speculate about downstream factors
The valuable side effect: Upstream signals have lower natural variance. Decisions grounded in them are more durable than decisions grounded in citation fluctuations. The stack gets slower and quieter, which is correct.
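The reasoning order can be made explicit in code so the interpretation layer receives upstream findings before it is asked anything about downstream causes. This sketch assumes each upstream source has already been reduced to a movement in σ units and a lead time; `UpstreamSignal`, `min_sigma`, and the example values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class UpstreamSignal:
    name: str        # e.g. "branded search volume"
    sigma: float     # movement against the signal's own rolling baseline
    lead_days: int   # days it moved before the citation change; <= 0 means it did not lead

def upstream_leads(signals: list[UpstreamSignal], min_sigma: float = 1.0) -> list[UpstreamSignal]:
    """Signals that moved first and moved meaningfully, largest movement first."""
    leads = [s for s in signals if s.lead_days > 0 and abs(s.sigma) >= min_sigma]
    return sorted(leads, key=lambda s: abs(s.sigma), reverse=True)

def reasoning_order(signals: list[UpstreamSignal]) -> str:
    leads = upstream_leads(signals)
    if leads:
        names = ", ".join(f"{s.name} ({s.sigma:+.1f}σ, {s.lead_days}d earlier)" for s in leads)
        return f"Probable upstream cause: {names}"
    return ("No upstream signal moved first; downstream explanations "
            "are speculative and should be labeled as such")

signals = [
    UpstreamSignal("branded search volume", -1.8, 12),
    UpstreamSignal("review velocity", 0.2, 5),
]
print(reasoning_order(signals))
```

An upstream signal moving first makes it a probable cause, not a proven one, so the output wording stays hedged in both branches.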
Principle 6: Force ranked hypotheses with weights, not single narratives
This is the meta-principle and the hardest to implement, because it runs against the training grain of every major LLM. Asked "why," the model defaults to producing one fluent causal story. Asked to produce ranked hypotheses with weights, it will comply, but only if the system prompt requires the format rather than suggesting it.
The required output schema:
List 3 to 5 hypotheses consistent with the data
Assign each a rough weight reflecting how much of the movement it could plausibly explain
Include one option labeled "insufficient data to distinguish among the above"
Note which additional data would most change the weighting if collected
Why this is the keystone principle:
Imports the epistemic discipline of a confidence interval into natural-language output
Slower than a narrative answer (the slowness is the feature)
Most direct compensation for the overconfidence problem from Part Two
Most frequently skipped in real deployments because it feels like friction
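Requiring the format is easier when a validator rejects answers that skip it. The sketch below assumes the model is instructed to return a JSON-like structure with `hypotheses` and `data_that_would_change_weights` keys; those key names, and the choice to require each weight to fall between 0 and 1, are our assumptions rather than a standard.

```python
INSUFFICIENT = "insufficient data to distinguish among the above"

def validate_ranked_hypotheses(answer: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the answer is acceptable."""
    problems = []
    hyps = answer.get("hypotheses", [])
    named = [h for h in hyps if h.get("label") != INSUFFICIENT]

    if not 3 <= len(named) <= 5:
        problems.append(f"expected 3 to 5 hypotheses, got {len(named)}")
    if len(named) == len(hyps):
        problems.append(f"missing the option labeled {INSUFFICIENT!r}")
    if not all(isinstance(h.get("weight"), (int, float)) and 0 <= h["weight"] <= 1
               for h in hyps):
        problems.append("every option needs a rough weight between 0 and 1")
    if not answer.get("data_that_would_change_weights"):
        problems.append("must name the additional data that would most change the weighting")
    return problems

answer = {
    "hypotheses": [
        {"label": "competitor content", "weight": 0.3},
        {"label": "platform-level shift", "weight": 0.3},
        {"label": "sampling noise", "weight": 0.2},
        {"label": INSUFFICIENT, "weight": 0.2},
    ],
    "data_that_would_change_weights": ["60+ runs per query on Perplexity"],
}
print(validate_ranked_hypotheses(answer))
```

An answer that fails validation can be sent back to the model with the problem list attached, which turns the schema from a suggestion into a requirement.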
The Framework at a Glance
| # | Principle | Enforces | Prevents | Addresses |
|---|---|---|---|---|
| 1 | Minimum run-count gate | ≥60 runs per query per platform before causal claims | Reading noise as signal on small samples | Failure Mode 1 |
| 2 | Cross-platform confirmation | Movement visible on ≥2 of 4 surfaces | Mistaking platform-specific behavior for brand events | Failure Mode 3 |
| 3 | Metric separation | Citations, mentions, recommendations reported distinctly | Ghost citations and composite-score distortions | Failure Mode 2 |
| 4 | Standard-deviation reporting | Rolling-window baselines, movements in σ units | Stationarity-assumption failures in period comparisons | Failure Mode 4 |
| 5 | Upstream-first reasoning | Brand, earned media, and crawler signals checked before downstream | Causally reasoning over the noisiest available data | Failure Mode 5 |
| 6 | Ranked hypotheses with weights | Structured output with an "insufficient data" option | LLM overconfidence collapsing uncertainty into narrative | The overconfidence problem from Part 2 |
Limitations and Disclosures
No original data was collected for this piece. The load-bearing empirical claims have been cross-verified against primary publications or independent reports:
SparkToro reproducibility findings
Profound's 680M-citation dataset
Semrush's 13-week tracking study
Ahrefs' AI Overview regeneration work
Wu et al.'s SourceCheckup (Nature Communications)
Stanford's FermiEval
Princeton's reasoning-calibration work
On conflict of interest: Passionfruit sells AEO, GEO, and AI visibility services. We argue here that AEO measurement requires sophisticated, multi-layered approaches rather than single-tool dashboards, and that implementation benefits from expert configuration. That is a conclusion in our financial interest. The counter-position, that simpler tools and faster iteration outperform elaborate frameworks, is defensible and is the position most vendor content currently occupies. Readers should test the failure modes against their own stack and reach their own conclusion.
On vendor references: Several vendors referenced (Profound, Peec AI, Semrush, Ahrefs) sell products overlapping our capabilities. We have described their published research in the terms their authors used and separated descriptive claims about their work from evaluative claims about their products.
On the FermiEval generalization: FermiEval tests numerical estimation, not causal marketing analysis. The 65% coverage number does not transfer one-for-one to interpreting citation movements. What transfers is the pattern: cross-domain evidence of systematic LLM overconfidence that does not meaningfully diminish with scale, chain-of-thought, or reasoning depth. We have tried to phrase the argument to reflect this.
On shelf life: The field moves fast. Specific numbers in Part One will change. The structural argument in Parts Two and Three is about how probabilistic systems behave when stacked, which is architecture-level and should remain true across the next several model generations.
Key Sources
Fishkin, R. & O'Donnell, P. (Jan 2026). New research: AIs are highly inconsistent when recommending brands or products. SparkToro & Gumshoe.ai.
Epstein, E., Winnicki, J., Sornwanee, T., & Dwaraknath, R. (Oct 2025). LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval. Stanford University. arXiv:2510.26995.
Mei, Z. et al. (2025). Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know? Princeton University.
Wu, K., Wu, E., Wei, K., et al. (Apr 2025). An automated framework for assessing how well LLMs cite relevant medical references (SourceCheckup). Nature Communications.
Profound (2025). AI Platform Citation Patterns: How ChatGPT, Google AI Overviews, and Perplexity Source Information. 680M citations, August 2024 to June 2025.
Semrush (Nov 2025). The Most-Cited Domains in AI: A 3-Month Study.
Ahrefs (Nov to Dec 2025). AI Overview Frequency, CTR, and Citation Overlap.
Seer Interactive (Mar 2026). LLM Ghost Citations: Why Your Content Is Working and Your Brand Isn't.
Indig, K. (Mar 2026). The Science of How AI Picks Its Sources. Growth Memo / Gauge.
Lin, S., Hilton, J., & Evans, O. (2023). Teaching Models to Express Their Uncertainty in Words.
Twyman's Law: Ehrenberg (1975); formalized in Kohavi, Tang, & Xu (2020), Trustworthy Online Controlled Experiments.
Passionfruit Labs, April 2026.
Written by the Passionfruit Labs research team. If the failure modes in this piece describe your current stack, or if you think we have the argument wrong somewhere, reach us at getpassionfruit.com/contact-us. We read every response.