llms.txt, robots.txt, and AI crawler rules

For CTOs at B2B services firms who want more citations in 2026.

AI search and chat products now decide what gets quoted, what gets linked, and where traffic goes. Two files are back on the map in 2026: robots.txt and (maybe) llms.txt.

Google is tweaking how AI Mode shows links and source context. (Source 1) (Source 2) The European Commission opened a formal investigation into Google’s use of online content for AI, calling out AI Overviews and AI Mode. (Source) Meanwhile, teams are testing llms.txt to point systems toward knowledge bases. (Source 1) (Source 2)

Handle this well and you get cited on the right web pages while cutting brand and legal headaches. Leave it unattended and you block citations, send bots into dead ends, or leave sensitive material sitting on public URLs.

Set clear rules for robots.txt and llms.txt so your GEO goals don’t collide with privacy, legal, and brand safety requirements.


About robots.txt and llms.txt

robots.txt controls access for bots that choose to follow it. It’s not a lock. If a URL is public, treat it as public. llms.txt is a directory. It can point systems to your best pages, but it doesn’t force anyone to use them.

Not all “AI bots” behave the same. Split them into three buckets:

  1. Training crawlers (collect text to train models)

  2. Index/search crawlers (build a source index to cite in answers)

  3. User-triggered fetchers (grab a page on demand for a specific question)

OpenAI, for example, describes different bots for search vs training (OAI-SearchBot vs GPTBot) and says you can allow one and block the other. (Source)

Decide what gets crawled by starting with your content types.

Marketing pages

  • Examples: services, industries, POV posts

  • Default stance: allow indexing so you can be cited

Product trust pages

  • Examples: security, compliance, terms

  • Default stance: allow indexing; keep scope and dates clear

Docs / knowledge base

  • Examples: FAQs, guides, how-tos

  • Default stance: allow indexing; point bots to canonical hubs

Thought leadership

  • Examples: research, frameworks

  • Default stance: allow indexing; keep author pages stable

Client material

  • Examples: proposals, reports, SOWs, client specifics

  • Default stance: behind auth; no public URLs

Website guardrails:

  1. If it’s sensitive, put it behind authentication. robots.txt is not a protection layer.

  2. If you want to be cited, give AI bots one obvious canonical page on your site per topic.

Training vs citations
If your risk posture is “don’t train on our content,” block training crawlers explicitly while keeping index/search crawlers open where you want citations. (Source)

Canonicals win
Your goal isn’t “block bots.” It’s “citations land on the right page on your website.” If your canonical pages are blocked, bots will cite whatever else they can reach.

Shared ownership
Marketing owns your canonical hubs. Security owns what must be gated. Legal owns your reuse risk. IT publishes the rules for your organization.

Review cadence

  • Monthly: scan logs for surprises and any broken access

  • Quarterly: revisit bot lists and your website structure

  • Release checklist: confirm canonical pages are reachable for key topics before updates



Implementation Patterns

robots.txt

Pattern 1: allow search/indexing, block training

[Code block — robots.txt]
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

Block sensitive paths even for search bots. Keep these rules in the same OAI-SearchBot group so every parser applies them together:

[Code block — robots.txt]
User-agent: OAI-SearchBot
Disallow: /client-portal/
Disallow: /internal/
Disallow: /drafts/
Allow: /

Pattern 2: allow only selected public sections

[Code block — robots.txt]
User-agent: *
Allow: /docs/
Allow: /insights/
Allow: /services/
Allow: /industries/
Disallow: /client-portal/
Disallow: /internal/
Disallow: /admin/
Disallow: /files/private/
Disallow: /

Operational note: user-triggered fetchers don’t always behave like background crawlers. Plan for that with access control, not only robots rules. (Source)
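Before rules go live, it helps to test them. A minimal sketch using Python’s standard-library robots.txt parser, assuming the Pattern 1 rules above and example.com URLs; swap in your own rules, bot names, and paths:

[Code block — Python]
# Sanity-check robots.txt rules against known AI user agents before deploying.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /client-portal/
Disallow: /internal/
Disallow: /drafts/
Allow: /
"""

# User agents and URLs that matter for the policy; adjust to your site.
AGENTS = ["GPTBot", "OAI-SearchBot"]
URLS = [
    "https://example.com/services",
    "https://example.com/docs/faq",
    "https://example.com/client-portal/report",
]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in AGENTS:
    for url in URLS:
        verdict = "allow" if parser.can_fetch(agent, url) else "block"
        print(f"{agent:15s} {verdict:6s} {url}")

With these rules, GPTBot is blocked everywhere and OAI-SearchBot is blocked only on the client-portal URL. Different parsers resolve overlapping Allow and Disallow rules differently, which is one more reason to keep the rules simple and keep sensitive content behind authentication.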

llms.txt

Use llms.txt to point systems at a small set of pages you’re comfortable being quoted from. The proposed format is plain Markdown: a title, a one-line summary, then sections of links.

[Code block — llms.txt]
# [Firm Name] — public knowledge hubs

> Canonical pages we want referenced.

## Start here

- [Services](https://example.com/services)
- [Industries](https://example.com/industries)
- [How we work](https://example.com/how-we-work)

## Trust and scope

- [Security](https://example.com/security)
- [Terms](https://example.com/terms)

## Public docs

- [Guides](https://example.com/docs/guides)
- [FAQ](https://example.com/docs/faq)
- [Troubleshooting](https://example.com/docs/troubleshooting)

## People and writing

- [Experts](https://example.com/experts)
- [Insights](https://example.com/insights)

Keep it short. Point to hubs rather than hundreds of deep links.
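A natural companion to the release checklist: confirm that every URL listed in llms.txt actually resolves in public. A minimal sketch, assuming llms.txt is published at the example.com site root; it extracts the listed URLs and reports anything that fails or doesn’t return a 200:

[Code block — Python]
# Check that every URL listed in llms.txt is publicly reachable.
import re
import urllib.request

LLMS_TXT_URL = "https://example.com/llms.txt"  # adjust to your domain

with urllib.request.urlopen(LLMS_TXT_URL, timeout=10) as response:
    body = response.read().decode("utf-8", errors="replace")

# Match bare URLs as well as Markdown-style [name](url) links.
urls = sorted(set(re.findall(r"https?://[^\s)\]]+", body)))

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=10) as page:
            status = page.status
    except Exception as exc:
        print(f"ERROR      {url}  ({exc})")
        continue
    label = "OK" if status == 200 else "CHECK"
    print(f"{label:5s} {status}  {url}")

Run it before releases and whenever the file changes.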

What to track

Crawler activity

  • Requests by user-agent and path

  • “Blocked but should be open” hits on public pages

  • Spikes on paths that should never be public (see the log-scan sketch below)
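A minimal log-scan sketch for these checks, assuming a combined-format access log saved as access.log; the user-agent list pairs the OpenAI bots above with ChatGPT-User, OpenAI’s user-triggered fetcher, and should be extended to whatever vendors matter to you:

[Code block — Python]
# Count AI crawler requests by user-agent and path, and flag any hits on
# paths that should never be public.
import re
from collections import Counter

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]  # extend as needed
PRIVATE_PREFIXES = ("/client-portal/", "/internal/", "/admin/", "/files/private/")

# Combined log format: ... "GET /path HTTP/1.1" status bytes "referer" "user agent"
LINE_RE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*".*"(?P<ua>[^"]*)"$')

hits = Counter()          # (agent, path) -> request count
private_hits = Counter()  # agent -> requests to paths that should be gated

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.search(line.rstrip())
        if not match:
            continue
        path, ua = match.group("path"), match.group("ua")
        for agent in AI_AGENTS:
            if agent.lower() in ua.lower():
                hits[(agent, path)] += 1
                if path.startswith(PRIVATE_PREFIXES):
                    private_hits[agent] += 1

for (agent, path), count in hits.most_common(20):
    print(f"{count:6d}  {agent:15s} {path}")

for agent, count in private_hits.items():
    print(f"WARNING: {count} requests from {agent} on paths that should be gated")

The same counters feed the monthly review: surprises show up as unfamiliar agents or unexpected paths.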

Citations and traffic

  • Referral traffic from AI products where available (OpenAI notes ChatGPT adds utm_source=chatgpt.com when access is allowed) (Source)

  • For a short list of priority queries, check whether citations land on canonical pages

  • Track wrong answers on Tier 1 topics (security, pricing, compliance) and fix the source page

Change log
Record every change to robots.txt and llms.txt with:

  • Date, change, reason

  • Owner + approver

  • Expected outcome

  • Rollback plan

How to Take Your Next Step in 2026

Bring marketing, legal, security, and IT into one working session, and then run a 30-day sprint:

  1. Classify content and mark what must be gated

  2. Pick canonical “trust hubs” for high-stakes topics

  3. Produce robots.txt rules that separate training from indexing where supported

  4. Publish llms.txt pointing to your canonical hubs

  5. Set monitoring plus a quarterly review rhythm

If you need help, reach out to Phasewheel for a discovery call.


Last updated 01-24-2026
