Testing and Experiment Design for GEO and AEO
How growth teams run valid experiments on AI search visibility and turn best practices into repeatable tactics.
By Eric Schaefer · April 8, 2026 · 10 min read
Best practices are a starting point. Experiments tell you what holds for your category, your site, and the AI engines your buyers actually use.
Two research anchors make this non-negotiable: GEO tactics increase visibility but results vary by domain (Aggarwal et al., 2023, arXiv:2311.09735), and AI search systems differ significantly in freshness bias, phrasing sensitivity, and source-type preference (arXiv:2509.08919). You need systematic tests, not one-size playbooks.
Why GEO Needs Experiments, Not Only Best Practices
Three kinds of variance break one-size GEO playbooks, and knowing which one is hurting you requires testing.
Engine differences. AI search engines differ in domain diversity, freshness behavior, cross-language stability, and phrasing sensitivity. A tactic that increases citations in Google AI Overviews can do nothing in Perplexity or ChatGPT Search. You don't know which until you test per engine.
Source-mix bias. The 2025 large-scale comparative analysis (arXiv:2509.08919) documents a strong bias toward earned media over brand-owned content across several AI search systems. If your GEO plan relies mostly on owned content, testing can quantify the gap and tell you what mix you actually need. Without a test, you're guessing.
Citation noise and accuracy problems. Even when engines cite sources, citations can be inconsistent or wrong. The Tow Center for Digital Journalism's evaluation of eight AI search tools documents citation problems in news contexts — a clear signal that "quality of inclusion" matters as much as "presence." Counting citations without scoring them is incomplete measurement.
GEO experiments reduce uncertainty. They turn black-box behavior into tactics you can repeat.
How to Design a Valid GEO Experiment
You want validity without over-engineering. GEO tests run in messy conditions — algorithm changes, seasonality, content launches, multi-surface journeys. You can still run clean tests by controlling what you own.
1. Start with a narrow, testable hypothesis.
A useful GEO hypothesis names three things: the lever (what changes), the surface (where you expect the change), and the metric (what you'll track).
Strong examples:
- "Adding an answer-first block and FAQ section increases AI-feature inclusion for long-tail 'how' queries in our category hub."
- "Clarifying entity definitions and the internal link graph increases citation frequency for product comparisons across AI assistants."
Weak: "Better content will improve AI visibility." No lever, no surface, no metric. Untestable.
2. Pick an experiment unit you can control.
- Page-set test: pick 20–100 similar pages, apply treatment to half, keep half as control.
- Template test: apply treatment at the template level for a class of pages (docs, product pages, use-case pages).
- Hub test: treat one entity hub (one product line) and keep a comparable hub as control.
Don't mix page types in one test. A product page and a blog post behave differently under generative summaries — mixing them corrupts your signal.
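A minimal sketch of a page-set split, assuming you can export candidate pages with a page type and a baseline traffic column; the file names and column names are illustrative, not a required tool. Pages are paired by traffic within each page type so the two arms stay comparable:

```python
import csv
import random
from collections import defaultdict

# Hypothetical input: one row per candidate page, e.g.
# url,page_type,baseline_clicks
INPUT = "candidate_pages.csv"
OUTPUT = "page_set_assignment.csv"

random.seed(42)  # fixed seed so the split is reproducible

groups = defaultdict(list)
with open(INPUT, newline="") as f:
    for row in csv.DictReader(f):
        groups[row["page_type"]].append(row)

assignments = []
for page_type, pages in groups.items():
    # Within each page type, sort by baseline traffic and assign
    # consecutive pairs so treatment and control have similar profiles.
    pages.sort(key=lambda r: int(r["baseline_clicks"]), reverse=True)
    for i in range(0, len(pages) - 1, 2):
        a, b = pages[i], pages[i + 1]
        if random.random() < 0.5:
            a, b = b, a
        assignments.append({**a, "arm": "treatment"})
        assignments.append({**b, "arm": "control"})
    if len(pages) % 2:
        # Odd page out: park it rather than unbalance the arms.
        assignments.append({**pages[-1], "arm": "excluded"})

with open(OUTPUT, "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["url", "page_type", "baseline_clicks", "arm"])
    writer.writeheader()
    writer.writerows(assignments)
```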
3. Define treatment and control precisely.
Write down exactly what changed so you can repeat it.
Answer-block treatment:
- Add a 2–4 sentence direct answer at the top of the page
- Add 3–5 bullet summary points
- Add 3–6 question-style H2s with concise answers
- Add proof (citations, standards, examples)
Entity clarity treatment:
- Add or tighten definitions for primary entities
- Standardize naming across headings, body, and internal links
- Add structured data where appropriate and compliant
On structured data: Google's FAQPage documentation says valid markup makes a page eligible for FAQ rich results, but eligibility is never a guarantee that the feature will appear. Keep expectations grounded.
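If FAQ markup is part of a treatment, generate it from the same question and answer pairs users can see on the page. A minimal sketch, assuming a templating step that injects JSON-LD; the example question and answer are placeholders:

```python
import json

# Build FAQPage JSON-LD from the visible Q&A pairs on the page. The markup
# must mirror what users actually see, or it risks violating Google's
# structured data policies.
faqs = [
    {
        "question": "How long does implementation take?",
        "answer": "Most teams complete setup in two to four weeks.",
    },
]

faq_schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": item["question"],
            "acceptedAnswer": {"@type": "Answer", "text": item["answer"]},
        }
        for item in faqs
    ],
}

# Emit the script tag your templating layer would place in the page.
print(f'<script type="application/ld+json">{json.dumps(faq_schema, indent=2)}</script>')
```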
4. Match the timeframe to the mechanism.
- Structure and content changes: 4–8 weeks minimum
- Internal link graph and entity consolidation: 8–12+ weeks
- Earned media and citation ecosystem effects: 12–24+ weeks
Run a two-week baseline pre-period where you can. Pre and post windows need to be stable — don't start a test the week before a major content launch.
5. Protect Google Search performance during tests.
Google's website testing documentation outlines how to minimize search risk when testing variations: show crawlers the same content users see (no cloaking), point variant URLs at the original with rel="canonical", use temporary (302) redirects, and run the test only as long as needed. Use those as guardrails when running URL or content variations.
6. Instrument before you implement.
If you can't observe the change, don't implement it. Minimum setup before any treatment goes live:
- Search Console deltas on query sets tied to tested pages
- AI-referred sessions (GA4 channel group)
- On-page engagement and conversion events (proof consumption, CTA clicks)
- A manual or semi-automated AI presence check on a fixed query sample
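For the presence check, the main requirement is a fixed query sample and a consistent record shape you keep for the whole test window. A minimal sketch, with illustrative field names and engine labels; whether the checks are manual or scripted matters less than logging them the same way every week:

```python
import csv
import os
from dataclasses import dataclass, asdict, fields
from datetime import date

LOG = "ai_presence_log.csv"

@dataclass
class PresenceCheck:
    check_date: str
    engine: str          # e.g. "google_ai_overviews", "perplexity"
    query: str
    cited: bool          # did our domain appear as a cited source?
    citation_url: str    # which page was cited (empty if not cited)
    quality_score: int   # 0-3 rubric score for the sampled answer

def append_checks(checks):
    new_file = not os.path.exists(LOG)
    with open(LOG, "a", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=[fld.name for fld in fields(PresenceCheck)])
        if new_file:
            writer.writeheader()
        writer.writerows(asdict(c) for c in checks)

# Example row from a weekly check; the values are illustrative.
append_checks([
    PresenceCheck(str(date.today()), "perplexity",
                  "best data pipeline tools for mid-market teams",
                  True, "https://example.com/guides/data-pipelines", 2),
])
```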
Ethics and Compliance in GEO Testing
GEO testing works under the same constraints as classic search work. Same ethics, same need for clean behavior.
Google's spam policies cover practices that mislead users or manipulate systems. Markup must match visible content — Google's structured data policies restrict eligibility when markup is misleading. If you track copy behavior or snippet use, log metadata only (block type, length bucket), not copied text.
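For example, a copy-tracking handler can drop the copied text before anything is stored. A sketch with a hypothetical event payload; only the block type and a length bucket survive:

```python
# Server-side sketch: accept a copy-tracking event and persist only
# metadata, never the copied text. The payload shape is hypothetical;
# the point is what gets discarded before storage.
def sanitize_copy_event(payload: dict) -> dict:
    text = payload.get("copied_text", "")
    length = len(text)
    if length < 200:
        bucket = "short"
    elif length < 800:
        bucket = "medium"
    else:
        bucket = "long"
    return {
        "page": payload.get("page"),
        "block_type": payload.get("block_type"),  # e.g. "answer_block", "faq"
        "length_bucket": bucket,                  # copied text is dropped here
    }

print(sanitize_copy_event({
    "page": "/guides/data-pipelines",
    "block_type": "answer_block",
    "copied_text": "A direct answer a user copied...",
}))
```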
Don't chase misleading appearances. If an AI answer engine cites your content inconsistently, the answer is accuracy and durable user value — not gaming the signal.
Three Experiment Types Worth Running
Structure, entity and schema clarity, and topic coverage are the three primary GEO levers. Here's how to approach each as a controlled test.
Structured FAQs and Answer Blocks
Pages get impressions but convert poorly; queries are question-shaped
Use this experiment when pages rank or get impressions but conversion is weak, queries are phrased as "how," "what," "best," "vs," or "cost," and users need quick clarity before committing to depth.
Treatment:
- Add an answer-first block at the top
- Add 3–6 question-style H2s with concise answers
- Add FAQ structured data only when the page genuinely contains FAQs
Control: comparable pages with no new answer block and no FAQ section.
Primary metrics: AI feature inclusion for the target query set (presence); accuracy of extracted answers in sampled summaries (quality); conversion rate and proof-page clicks from those landers (outcomes).
Risk to manage: repeating the same FAQ pattern on every page creates redundancy. Keep FAQs tied to real buyer questions, not boilerplate.
Entity and Schema Clarity
Product terminology is ambiguous or inconsistent across the site
Use this experiment when your product terminology is ambiguous, different teams describe the same feature in different ways, or you sell into multiple industries with different vocabulary.
Treatment:
- Standardize entity names and definitions across the site
- Tighten internal links so entities connect consistently
- Add structured data that matches what's visible and true
Control: comparable pages without entity standardization.
Primary metrics: citation frequency for entity-led queries (presence); whether engines name the right product and use cases (quality); opportunity rate from evaluation pages tied to treated entities (outcomes).
AI engine note: expect different systems to cite different source types. The 2025 comparative analysis (arXiv:2509.08919) reports earned-media bias across several AI search systems. Entity clarity on owned content still helps — but earned validation changes inclusion odds. That's worth a separate test.
Topic Coverage: Expand or Consolidate
Duplicated posts, fragmented internal links, or shallow category hubs
Use this experiment when duplicated posts target adjacent queries, internal linking is fragmented, or category hubs lack depth. Generative summaries reward clarity, cohesion, and proof over volume.
Treatment options:
- Consolidate: merge 3–5 thin posts into one authoritative reference page
- Expand: build a hub-and-spoke cluster around one entity with consistent definitions, examples, and proof
Control: comparable topic sets left unchanged.
Primary metrics: inclusion across query paraphrases and long-tail variants (presence); whether summaries reflect your intended framing (quality); conversion efficiency per click on the consolidated set (outcomes).
Measuring GEO Experiment Outcomes
You need more than "did we show up." Use three layers: presence, quality, and outcomes.
Presence metrics — tracked on a fixed query set, weekly or biweekly:
- AI Presence Rate: percentage of queries where your domain appears as a cited source
- Citation Share: your domain's share of citations across sampled answers
- Landing-page alignment: percentage of citations pointing to the intended canonical page
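AI Presence Rate and landing-page alignment fall directly out of a presence log like the one sketched earlier; Citation Share additionally needs a count of every citation in each sampled answer, so it is omitted here. A minimal sketch, where the log file name and the query-to-canonical-page mapping are illustrative:

```python
import csv
from collections import defaultdict

# Which canonical page each sample query is supposed to surface.
INTENDED = {
    "best data pipeline tools for mid-market teams":
        "https://example.com/guides/data-pipelines",
}

with open("ai_presence_log.csv", newline="") as f:
    rows = list(csv.DictReader(f))

by_engine = defaultdict(list)
for row in rows:
    by_engine[row["engine"]].append(row)

for engine, checks in by_engine.items():
    # csv stores the boolean as the string "True"/"False".
    cited = [c for c in checks if c["cited"] == "True"]
    presence_rate = len(cited) / len(checks)
    aligned = [c for c in cited if INTENDED.get(c["query"]) == c["citation_url"]]
    alignment = len(aligned) / len(cited) if cited else 0.0
    print(f"{engine}: presence {presence_rate:.0%}, "
          f"landing-page alignment {alignment:.0%}")
```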
Quality metrics — scored per query on a 0–3 rubric covering accuracy, relevance, and whether the summary reflects your intended framing.
The Tow Center's AI search citation evaluation reinforces why you need a quality rubric alongside presence counting. A citation that misrepresents your product is worse than no citation.
Outcome metrics — tied to what you already trust:
- AI-referred sessions (GA4 channel group)
- Conversion rate from treated pages (lead, demo, signup)
- Proof consumption rate (security page, implementation guide, case study views)
- Opportunity rate and cycle time for leads touched by treated assets
Outcome metrics prevent the most common GEO failure mode: chasing "AI presence" that never turns into revenue.
A GEO Experiment Backlog to Start From
Score each idea on Impact, Confidence, Effort, and Risk. Run the highest-priority two per quarter.
| Priority | Experiment | Lever | Impact | Confidence | Effort | Risk |
|---|---|---|---|---|---|---|
| 1 | Answer-first blocks on top 30 non-brand landers | Structure | High | Med | Low | Low |
| 2 | Entity definition standardization across one product line | Entities | High | Med | Med | Med |
| 3 | Consolidate thin cluster into one reference hub | Coverage | High | Med | High | Med |
| 4 | FAQ + schema on top comparison pages | Structure + schema | Med | Med | Low | Low |
| 5 | Earned-media proof pack + PR placements for category terms | Source mix | High | Low | High | Med |
The source-mix experiment exists because comparative research documents earned-media bias across AI search engines. If you're investing in owned content without testing earned validation, you may be solving the wrong problem.
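If you want the backlog sortable rather than eyeballed, one convention is to score impact times confidence against effort, discounted for risk. A sketch; the level mapping and weights are assumptions to tune, not a standard:

```python
# Map the qualitative levels from the backlog table to numbers and rank.
LEVEL = {"Low": 1, "Med": 2, "High": 3}

backlog = [
    ("Answer-first blocks on top 30 non-brand landers", "High", "Med", "Low", "Low"),
    ("Entity definition standardization across one product line", "High", "Med", "Med", "Med"),
    ("Consolidate thin cluster into one reference hub", "High", "Med", "High", "Med"),
    ("FAQ + schema on top comparison pages", "Med", "Med", "Low", "Low"),
    ("Earned-media proof pack + PR placements for category terms", "High", "Low", "High", "Med"),
]

def score(impact, confidence, effort, risk):
    # Impact x confidence, divided by effort, with a mild risk discount.
    return LEVEL[impact] * LEVEL[confidence] / (LEVEL[effort] * (1 + 0.25 * LEVEL[risk]))

ranked = sorted(backlog, key=lambda row: score(*row[1:]), reverse=True)
for name, *levels in ranked:
    print(f"{score(*levels):.2f}  {name}")
```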
Two case patterns worth testing first:
Answer block + proof links → higher quality scores
Treatment: top-of-page answer block, key takeaways, and proof links (security, implementation, case study). Expected outcome: higher rubric scores for accuracy and relevance, and higher proof consumption rate.
Entity hub + internal links → improved citation alignment
Treatment: one canonical hub per entity, consistent internal linking from spoke pages. Expected outcome: more citations pointing to canonical hubs rather than random supporting blog posts.
GEO-bench results from Aggarwal et al. (2023) show structural tactics can increase visibility in generative responses — and that results vary by domain. That's exactly why you test these patterns rather than copy them.
Your Next Steps
Commit to two controlled GEO tests per quarter. For each one:
- Pick one lever: structure, entities and schema, or coverage
- Define treatment and control at the page-set level
- Lock a fixed query sample and a quality rubric
- Run a 6–8 week window with a stable two-week baseline
- Decide: scale it, iterate, or kill it
That cadence makes AI engine behavior less mysterious. It turns GEO and AEO from reactive tweaks into a consistent system for generating repeatable growth outcomes.
Frequently Asked Questions
How many pages do I need for a valid GEO page-set test?
Twenty pages is a workable floor — enough to surface a directional signal in 6–8 weeks. Fifty to one hundred pages gives more statistical confidence and faster pattern detection. The key constraint is that treated and control pages must be genuinely comparable: same page type, same topic cluster, similar traffic baseline. Don't pad a test with unrelated pages to hit a number.
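If you want a rough sense of what a given page count can detect, a standard two-proportion power calculation is a reasonable planning aid, with the caveat that it treats each page as one independent observation and ignores repeated weekly checks. A sketch using statsmodels; the baseline and target rates are illustrative:

```python
# Rough planning aid: pages per arm needed to detect a given lift in
# page-level presence rate at alpha 0.05 and 80% power.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.20   # share of pages cited at least once before the test
target = 0.45     # the lift you would consider decision-worthy

effect = proportion_effectsize(target, baseline)
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.8, alternative="two-sided")
print(f"~{n_per_arm:.0f} pages per arm to detect {baseline:.0%} -> {target:.0%}")
```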
Should I run GEO tests on all AI engines at once, or one at a time?
Start with the engine where your buyers actually search, then expand. Testing one engine at a time keeps your query sample manageable and lets you attribute results cleanly. The 2025 comparative study (arXiv:2509.08919) documents meaningful differences across AI search systems — a tactic that works in Google AI Overviews may not move the needle in Perplexity or ChatGPT Search. Per-engine testing is the only way to know.
How do I set up a clean control group for GEO tests when pages share internal links?
Internal links create contamination risk in entity and schema experiments — updating one hub affects its spoke pages too. The cleanest approach is to use two separate entity hubs as treatment and control, with minimal internal link overlap between them. For FAQ and answer-block tests, internal link overlap matters less, so page-set splits work fine.
What is a realistic AI Presence Rate baseline before testing?
In Phasewheel's client work, AI presence rates for brand-owned content on commercial queries typically start in the 10–30% range before optimization. Informational queries often start higher. Run a two-week manual sample before your test kicks off — that baseline is your most important pre-period measurement.
How do I handle algorithm changes that happen during a test window?
Log them in your change log and note the date. If a major algorithm update drops during your test window, extend the window by the same number of days the update was active, or restart the post-period after the dust settles. Don't throw out the test — document what happened and treat it as a confounding variable when you interpret results.
When should I stop a GEO experiment early?
Two conditions warrant an early stop: a strong negative signal on outcome metrics, or a Google spam or structured data policy violation discovered mid-test. A flat or noisy presence signal alone is not a reason to stop — noise in the first four weeks is normal. Give structure tests the full 6–8 weeks before drawing conclusions.