Methodology

How we measure AI visibility — and how much you can trust it

Most GEO tools flash a "Visibility Score: 73" and move on. You have the right to know how that number is built and how reliable it is: not all metrics are equal, and pretending otherwise is dishonest.

It's a method, not a magic score

Measuring visibility in AI engines is still an open scientific problem. Two inconvenient facts:

LLMs are non-deterministic: the exact same prompt can yield different answers from one run to the next. So any measure obtained by querying them is a statistical estimate, not an exact value.

Almost none of the metrics sold on the market have been validated in the literature: the only peer-reviewed GEO visibility measure comes from Aggarwal et al. (KDD 2024). The rest of the vocabulary is industry convention.

We start here: every number carries a reliability label, and every estimate comes with its margin. We never pass off a bare value as truth.
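To make "estimate plus margin" concrete, here is a minimal sketch of one common way to attach a margin to a mention rate measured over repeated runs of the same prompt: the Wilson score interval for a proportion. The function name, the 95% level, and the counts (14 mentions in 20 runs) are illustrative assumptions, not a description of the production pipeline.

```python
import math


def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= hits <= runs:
        raise ValueError("hits must be between 0 and runs")
    p = hits / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical data: the brand appeared in 14 of 20 repeated runs.
low, high = wilson_interval(hits=14, runs=20)
print(f"mention rate: {14 / 20:.0%} (95% interval {low:.0%} to {high:.0%})")
```

With only 20 runs the interval spans roughly 48% to 85%, which is why a bare "70%" would overstate what is actually known. More runs narrow the margin.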

Three reliability levels

🟢 Measured

Observable, documented, reproducible. E.g. citations read from official APIs, prominence computed on the actual answer text.

🟡 Estimated

Obtained by querying the models, which are non-deterministic, so always reported as an estimate plus a margin. E.g. overall visibility, Share of Voice.

⚪ Directional

A trend indicator, never an absolute value. E.g. query volumes, AI referral traffic.

If a tool gives you the same certainty for all three, it's selling you smoke.

The metrics, in short

🟢 Citations and sources

which pages the engines use as sources and how many of those citations are yours. Read from the providers' official API fields, not from fragile scraping.

🟢 Brand prominence

not "you're cited 2nd", but how much room and how early you appear, via the Position-Adjusted Word Count of Aggarwal et al.: the only GEO visibility measure validated in the literature.

🟢 Mentions and sentiment

whether a mention is really your brand (not a homonym) and how the brand is spoken of. LLM judges decide both, and we calibrate those judges and measure their agreement with human labels.

🟡 Visibility and Share of Voice

how many answers mention you and how much of the conversation you own. Estimates: they depend on non-determinism and on the questions asked.

⚪ Volumes and traffic

how often a topic is asked and how many visits come from AI. Public data is scarce and a share of the traffic is structurally untrackable. Trends only.
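The brand prominence idea above (how much room and how early) can be sketched as a word count weighted by an exponential position decay, in the spirit of the Position-Adjusted Word Count. This is a simplified illustration: the input format, the function name, and the choice to normalize by the total weighted word count (so that shares sum to 1) are assumptions, and the exact formula in Aggarwal et al. may differ in detail.

```python
import math
from typing import Optional


def position_adjusted_shares(
    sentences: list[tuple[str, Optional[str]]],
) -> dict[str, float]:
    """Share of an answer attributed to each source, weighting early sentences more.

    `sentences` holds (text, source) pairs in answer order; source is None
    when a sentence cites nothing.
    """
    n = len(sentences)
    weighted: dict[str, float] = {}
    total = 0.0
    for pos, (text, source) in enumerate(sentences):
        # Word count of the sentence, decayed by its relative position.
        weight = len(text.split()) * math.exp(-pos / n)
        total += weight
        if source is not None:
            weighted[source] = weighted.get(source, 0.0) + weight
    if total == 0:
        return {}
    return {source: w / total for source, w in weighted.items()}


# Hypothetical answer with two cited sources.
answer = [
    ("Acme leads the market for widgets.", "acme.example"),
    ("Other vendors exist too.", "rival.example"),
    ("Acme also offers support.", "acme.example"),
]
for source, share in position_adjusted_shares(answer).items():
    print(f"{source}: {share:.1%}")
```

In this sketch the second and third sentences have the same length, yet the earlier one contributes more, which is the behavior a plain "cited 2nd" rank cannot capture.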

Keeping the LLM judges honest

Models used as judges have documented biases. We don't pretend they don't exist: we keep a golden set of hand-labeled examples (many verticals, multiple languages) and measure the judge's agreement with humans using Cohen's κ. We re-run the calibration on every prompt or model change, and we keep a change only if it improves measured agreement, not because it sounds like a good idea.

κ ≈ 1.0

brand / homonym disambiguation

κ ≈ 0.90

sentiment, with the calibrated method

κ ≥ 0.81 = "almost perfect agreement" on the standard Landis and Koch (1977) agreement scale.
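For readers who want to see what κ measures, here is a minimal sketch of Cohen's κ for two raters: observed agreement corrected for the agreement expected by chance. The labels and the ten-item sample are made up for illustration; they are not the golden set.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Fraction of items on which the two raters agree.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one single, identical label
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels: the judge disagrees with the human on one item.
human = ["pos", "pos", "neg", "neu", "neg", "pos", "neu", "neg", "pos", "neu"]
judge = ["pos", "pos", "neg", "neg", "neg", "pos", "neu", "neg", "pos", "neu"]
print(f"kappa = {cohens_kappa(human, judge):.2f}")
```

Raw agreement here is 90%, but κ comes out lower (about 0.85) because part of that agreement would occur by chance. That correction is the reason to report κ rather than plain accuracy.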

What we do NOT claim

We don't pass off estimated volumes as hard data.

We don't guarantee that we track 100% of AI traffic: some of it leaves no trace, and we say so.

We don't call a score "predictive" if it was never validated against real outcomes.

We don't give you a magic number without telling you how stable it is.

In one line

We measure what is measurable, honestly estimate the rest with its margin, and label what is merely directional. We'd rather give you a smaller true number than a big made-up one.

Sources

Aggarwal et al. — GEO: Generative Engine Optimization, KDD 2024
Atil et al. — Non-Determinism of "Deterministic" LLM Settings, 2024
Miller — Adding Error Bars to Evals, 2024
Wallat et al. — Correctness is not Faithfulness in RAG Attributions, 2024
Jovančević et al. — Nature Scientific Reports 15:11477, 2025
Chatterji et al. — How People Use ChatGPT, NBER WP 34255, 2025
Thinking Machines Lab — Defeating Nondeterminism in LLM Inference, 2025