The AI Consistency Problem: Why Your Brand Appears in Less Than 1% of Identical Queries
Ask ChatGPT "what CRM should I use?" right now, then ask it again in 30 seconds. You will likely get a completely different set of recommended brands. A SparkToro study found less than 1% overlap in brand recommendations between identical queries. This changes everything about how AI visibility should be measured — and built.
TLDR
AI engines are fundamentally non-deterministic — the same query returns different brands on different runs. SparkToro measured less than 1% overlap in brand recommendations between identical queries run minutes apart. This means there is no "AI ranking" — only citation probability. Brands optimizing for a stable #1 position are chasing something that doesn't exist. The right metric is citation rate across many runs. The right strategy is raising that probability through consistent signal building.
The experiment that changed how we think about AI visibility
In 2025, SparkToro ran a large-scale study on AI brand recommendation consistency. The methodology was straightforward: take hundreds of commercial queries ("best project management software," "top running shoe brands," "which CRM for small business"), run each query multiple times across ChatGPT, Perplexity, and Google AI, and measure brand list overlap between runs.
The finding was striking: less than 1% of brand recommendations were consistent across multiple runs of identical queries. Run the same query twice and you get almost completely different results. This isn't a bug — it's a fundamental property of how large language models work.
For brands that had been celebrating their "AI rank #1" for a specific query, this was a wake-up call. The position they measured was one sample from a distribution — it would be different the next time a potential customer ran the same search.
Why AI engines are non-deterministic
Unlike Google's search index, which is largely deterministic for a given query at a given moment, AI language models introduce randomness at multiple levels. The most significant is temperature: a parameter that controls how much randomness the model injects into its token selection. Most production AI engines run at a temperature above zero, meaning the model deliberately varies its outputs.
This randomness is intentional and valuable for conversational AI — it prevents robotic, repetitive responses and allows for creative variation. But for brand recommendations, it means that even if your brand is the "best" answer statistically, it won't appear in every response. It will appear with some probability, and that probability is what you should be measuring and optimizing.
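To make the temperature effect concrete, here is a minimal sketch of temperature-scaled softmax sampling. The logits and brand names are invented for illustration, not any engine's real values.

```python
# Illustration: how temperature reshapes a next-token distribution.
# Logits and brand names are hypothetical.
import numpy as np

rng = np.random.default_rng()  # unseeded: each run differs, like production engines

logits = np.array([2.0, 1.6, 1.2, 0.4])  # assumed raw scores for four brands
brands = ["BrandA", "BrandB", "BrandC", "BrandD"]

def sample_brand(temperature: float) -> str:
    """Draw one brand from softmax(logits / temperature)."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(brands, p=probs)

for t in (0.1, 1.0):
    draws = [sample_brand(t) for _ in range(1000)]
    share = {b: draws.count(b) / 1000 for b in brands}
    print(f"temperature={t}: {share}")
```

At a temperature near zero the top-scoring brand wins almost every draw; at 1.0 the runners-up surface in roughly half the responses. That is the same kind of run-to-run variation the SparkToro study measured.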
For AI engines with web search grounding (Perplexity, Google AI Mode, ChatGPT with web search), there's an additional source of variation: the retrieved web results change over time as new content is published and indexed. Two runs of the same query minutes apart may pull from different web sources, leading to different brand mentions.
Variation by engine type
| Engine | Primary source of variation | Consistency level |
|---|---|---|
| ChatGPT (no web search) | Model temperature — pure LLM sampling | Low — varies with each token generation |
| ChatGPT (web search) | Temperature + real-time web retrieval variation | Very low — both model and sources vary |
| Perplexity | Real-time web retrieval + model sampling | Very low — heavily influenced by what's freshly indexed |
| Google AI Mode | Search index + model sampling | Moderate — search index is more stable than web crawl |
| Claude | Model temperature (answers drawn largely from training data rather than live retrieval) | Low to moderate |
What this means for measurement
The practical implication is that any single measurement of AI visibility is statistically meaningless. If you run one query, get one result, and conclude "we appear on ChatGPT" — you've measured one sample from a distribution. That sample tells you almost nothing about your actual citation probability.
The right approach is to treat AI citation like a probability problem. For any given query, you have a citation probability — the fraction of the time your brand appears when that query is run. A brand with a 70% citation probability will appear in roughly 70 out of 100 runs of the same query. That's the number you want to measure and improve.
How to measure citation probability correctly
The method follows from the problem: run each tracked query multiple times per engine, count the runs in which your brand appears, and report the fraction rather than any single result. Because each run is one sample, small samples carry wide uncertainty; re-measure on a schedule and compare rates over time instead of reacting to individual snapshots.
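Below is a minimal sketch of that loop. It assumes a placeholder query_engine(engine, query) function returning response text; the 20-run default, the naive substring brand match, and the 95% Wilson interval are illustrative choices, not a prescribed methodology.

```python
# Sketch of probabilistic citation measurement.
# `query_engine` is a placeholder to implement against each engine's API.
import math

def citation_rate(query_engine, engine: str, query: str,
                  brand: str, runs: int = 20):
    """Run one query `runs` times; return (rate, 95% Wilson interval)."""
    hits = sum(
        brand.lower() in query_engine(engine, query).lower()
        for _ in range(runs)
    )
    p = hits / runs
    z = 1.96  # 95% confidence
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    margin = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return p, (max(0.0, center - margin), min(1.0, center + margin))
```

With 20 runs, a brand that appears 6 times measures a 30% citation rate with roughly a 15–52% confidence interval. That width is exactly why a one-off query tells you almost nothing.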
What this means for strategy
If AI citations are probabilistic, then the strategic goal is clear: increase the probability that your brand appears. You're not trying to "rank #1" — you're trying to raise your citation rate from 15% to 50% to 80%.
This reframes the entire optimization problem. Instead of optimizing a single page for a single query (the SEO mental model), you're building signals that increase your brand's overall probability of appearing across a broad query space. The signals that work are cumulative and compounding:
Referring domain breadth raises your floor
Each unique domain that mentions your brand increases the probability that your brand appears in AI training patterns and real-time retrieval. A brand mentioned across 50,000 unique domains has a meaningfully higher citation-rate floor than one mentioned across 5,000.
Content structure reduces extraction failure
When AI engines retrieve your content, poorly structured pages fail to yield clean brand + expertise signals. Well-structured content (H2/H3/tables/FAQ schema) reliably produces extractable signals on every retrieval — reducing variance.
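As one concrete example of those signals, FAQ schema is expressed as schema.org FAQPage JSON-LD embedded in the page. A minimal sketch with placeholder question and answer text:

```python
# Sketch: emit schema.org FAQPage JSON-LD for embedding in a page via
# <script type="application/ld+json">. Q&A text is placeholder content.
import json

faq_jsonld = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "What is citation probability?",
        "acceptedAnswer": {
            "@type": "Answer",
            "text": "The fraction of runs of a query in which a brand appears.",
        },
    }],
}

print(json.dumps(faq_jsonld, indent=2))
```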
Content freshness keeps you in the retrieval pool
For engines with recency weighting (Perplexity, Google AI Mode), outdated content is filtered out of retrieval. Publishing regularly keeps your content eligible for citation — a prerequisite for any citation probability above zero.
Brand search volume is a consistency signal
Brands that people actively search for are consistently recognized by AI models across different sampling runs. Brand awareness campaigns that increase brand search volume directly contribute to AI citation consistency.
The competitive opportunity
The AI consistency problem creates an asymmetric opportunity for brands willing to measure probabilistically. Most companies — if they track AI visibility at all — run a few manual queries, celebrate when they appear, and have no systematic understanding of their actual citation probability.
A brand that methodically tracks citation probability across 50 queries and 5 engines has a fundamentally different understanding of where it stands and what to improve. That brand can identify specific queries where probability is low and allocate effort precisely — rather than guessing.
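A minimal sketch of that analysis, assuming each measurement run has been logged as a (query, engine, cited) record; the names and structure are illustrative:

```python
# Sketch: aggregate logged runs into per-query, per-engine citation
# rates and surface the weakest queries, where effort pays off most.
from collections import defaultdict

def weakest_queries(records, top_n=10):
    hits, totals = defaultdict(int), defaultdict(int)
    for query, engine, cited in records:
        totals[(query, engine)] += 1
        hits[(query, engine)] += bool(cited)
    rates = {key: hits[key] / totals[key] for key in totals}
    return sorted(rates.items(), key=lambda kv: kv[1])[:top_n]  # lowest first
```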
The brands that will dominate AI search in the next 2–3 years are those that recognize today that the measurement model matters as much as the optimization model. If you're measuring wrong, you can't optimize right.
Key takeaways
- AI engines are non-deterministic: identical queries return different brand lists on different runs, so there is no stable "AI ranking" to hold.
- The meaningful metric is citation probability: the fraction of runs in which your brand appears, measured across many runs, queries, and engines.
- Raising that probability is a matter of cumulative signals: referring domain breadth, extractable content structure, content freshness, and brand search volume.
- A single query result is one sample from a distribution; never draw conclusions from it.
Measure your actual AI citation probability
Pheme runs each query multiple times across all engines and gives you a statistically reliable citation rate — not a one-off snapshot.
Join the waitlist