Methodology
How we measure AI visibility, and how much you can trust it
Most GEO (Generative Engine Optimization) tools flash a "Visibility Score: 73" and move on. You have the right to know how that number is built and how reliable it is: not all metrics are equal, and pretending otherwise is dishonest.
It's a method, not a magic score
Measuring visibility in AI engines is still an open scientific problem. Two inconvenient facts:
LLMs are non-deterministic: the exact same prompt can yield different answers. So any measure obtained by querying them is a statistical estimate, not an exact value.
Almost no market metric is validated in the literature: the only peer-reviewed GEO visibility measure comes from Aggarwal et al. (KDD 2024). The rest of the vocabulary is industry convention.
We start from there: every number carries a reliability label, and every estimate comes with its margin of error, never a bare value passed off as truth.
Three reliability levels
🟢 Measured
Observable, documented, reproducible. E.g. citations read from official APIs, prominence computed on the actual answer text.
🟡 Estimated
Obtained by querying the models, which are non-deterministic, so each value is an estimate with a margin of error. E.g. overall visibility, Share of Voice.
⚪ Directional
A trend indicator, never an absolute value. E.g. query volumes, AI referral traffic.
If a tool gives you the same certainty for all three, it's selling you hot air.
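To make "an estimate with a margin" concrete, here is a minimal Python sketch that turns repeated runs into a visibility rate with a 95% Wilson score interval. It is one common way to attach a margin to a proportion, not necessarily the exact procedure behind our numbers. The function name and the 37-of-50 figure are illustrative assumptions, and the interval treats every run as independent, which ignores any correlation between runs of the same question.

```python
import math


def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = hits / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical data: the brand was mentioned in 37 of 50 repeated runs.
low, high = wilson_interval(37, 50)
print(f"visibility = {37 / 50:.0%}, 95% interval = [{low:.0%}, {high:.0%}]")
```

With only 50 runs the interval is wide, which is exactly why a bare "74%" would overstate what we know.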
The metrics, in short
🟢 Citations and sources
which pages the engines use as sources and how many of those citations are yours. Read from the providers' official API fields, not fragile scraping.
🟢 Brand prominence
not "you're cited 2nd", but how much space you occupy and how early you appear, via the Position-Adjusted Word Count of Aggarwal et al., the only GEO visibility measure validated in the literature.
🟢 Mentions and sentiment
whether a mention really refers to your brand (not a homonym) and in what tone it is discussed. LLM judges make both calls, and we calibrate them and measure their agreement with humans.
🟡 Visibility and Share of Voice
how many answers mention you and how much of the conversation you own. These are estimates: they vary with model non-determinism and with the set of questions asked.
⚪ Volumes and traffic
how often a topic is asked and how many visits come from AI. Public data is scarce and a share of the traffic is structurally untrackable. Trends only.
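For readers who want to see what brand prominence looks like in practice, here is a minimal Python sketch of a Position-Adjusted Word Count, following our reading of the definition in Aggarwal et al.: each cited sentence contributes its word count, discounted exponentially by how late it appears, and the total is normalised by the length of the answer. The input shape, the 0-based position index, the equal split between co-cited sources, and the example domains are assumptions made for illustration.

```python
import math


def position_adjusted_word_count(
    sentences: list[tuple[str, set[str]]],
) -> dict[str, float]:
    """Score each source from an ordered list of (sentence, cited_sources) pairs."""
    n_sentences = len(sentences)
    total_words = sum(len(text.split()) for text, _ in sentences)
    if total_words == 0:
        return {}
    scores: dict[str, float] = {}
    for pos, (text, sources) in enumerate(sentences):
        if not sources:
            continue  # uncited sentences add to the denominator only
        # Earlier sentences weigh more: the discount is exp(-pos / n_sentences).
        weight = len(text.split()) * math.exp(-pos / n_sentences)
        for src in sources:
            # Assumption: co-cited sources share the sentence equally.
            scores[src] = scores.get(src, 0.0) + weight / len(sources)
    return {src: value / total_words for src, value in scores.items()}


# Hypothetical answer with three sentences and two cited domains.
answer = [
    ("Acme leads the market for widgets.", {"acme.example"}),
    ("Other vendors exist too.", {"rival.example"}),
    ("Both offer free trials.", {"acme.example", "rival.example"}),
]
for source, score in position_adjusted_word_count(answer).items():
    print(f"{source}: {score:.3f}")
```

In this toy answer the first-cited source scores higher than the other, both because it is cited earlier and because its sentence is longer.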
Keeping the LLM judges honest
Models used as judges have documented biases. We don't pretend they don't exist: we keep a golden set of hand-labeled examples (many verticals, multiple languages) and measure the judge's agreement with humans using Cohen's κ. We re-run the calibration on every prompt or model change: we keep only what improves the numbers, not what sounds like a good idea.
κ ≈ 1.0
brand / homonym disambiguation
κ ≈ 0.90
sentiment, with the calibrated method
κ ≥ 0.81 = "almost perfect agreement" on the commonly used Landis and Koch scale.
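As a reference for how such a κ is computed, here is a minimal Python sketch of Cohen's κ between human labels and judge labels: observed agreement corrected for the agreement expected by chance. The function name and the ten-item label lists are made up for illustration; they are not our golden set.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Share of items on which the two raters agree.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels: the judge disagrees with the human on one item.
human = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neu", "neu", "neu"]
judge = ["pos", "pos", "pos", "neu", "neg", "neg", "neg", "neu", "neu", "neu"]
print(f"kappa = {cohens_kappa(human, judge):.2f}")
```

Raw agreement here is 90%, but κ comes out lower because some of that agreement would happen by chance, which is why we report κ rather than plain accuracy.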
What we do NOT claim
We don't pass off estimated volumes as hard data.
We don't claim to capture 100% of AI traffic: some of it leaves no trace, and we say so.
We don't call a score "predictive" if it was never validated against real outcomes.
We don't give you a magic number without telling you how stable it is.
In one line
We measure what is measurable, honestly estimate the rest with its margin, and label what is merely directional. We'd rather give you a smaller true number than a big made-up one.
Sources
- Aggarwal et al. — GEO: Generative Engine Optimization, KDD 2024
- Atil et al. — Non-Determinism of "Deterministic" LLM Settings, 2024
- Miller — Adding Error Bars to Evals, 2024
- Wallat et al. — Correctness is not Faithfulness in RAG Attributions, 2024
- Jovančević et al. — Nature Scientific Reports 15:11477, 2025
- Chatterji et al. — How People Use ChatGPT, NBER WP 34255, 2025
- Thinking Machines Lab — Defeating Nondeterminism in LLM Inference, 2025