llms.txt, robots.txt, and AI crawler rules
For CTOs at B2B services firms who want more citations in 2026.
AI search and chat products now decide what gets quoted, linked, and sent traffic. Two files are back on the map in 2026: robots.txt and (maybe) llms.txt.
Google is tweaking how AI Mode shows links and source context. (Source) (Source) The European Commission opened a formal investigation into Google’s use of online content for AI, calling out AI Overviews and AI Mode. (Source) Meanwhile, teams are testing llms.txt to point systems toward knowledge bases. (Source) (Source)
Handle this well and you get cited on the right pages while cutting brand and legal headaches. Leave it unattended and you block citations, send bots into dead ends, or leave sensitive material sitting on public URLs.
Set clear rules for robots.txt and llms.txt so your GEO (generative engine optimization) goals don’t collide with privacy, legal, and brand safety requirements.
About robots.txt and llms.txt
robots.txt controls access for bots that choose to follow it. It’s not a lock. If a URL is public, treat it as public. llms.txt is a directory. It can point systems to your best pages, but it doesn’t force anyone to use them.
Not all “AI bots” behave the same. Split them into three buckets:
Training crawlers (collect text to train models)
Index/search crawlers (build a source index to cite in answers)
User-triggered fetchers (grab a page on demand for a specific question)
OpenAI, for example, describes different bots for search vs training (OAI-SearchBot vs GPTBot) and says you can allow one and block the other. (Source)
Decide what gets crawled by starting with your content types.
Marketing
Examples: services, industries, POV posts
Default stance: allow indexing so you can be cited
Product trust pages
Examples: security, compliance, terms
Default stance: allow indexing; keep scope and dates clear
Docs / knowledge base
Examples: FAQs, guides, how-tos
Default stance: allow indexing; point bots to canonical hubs
Thought leadership
Examples: research, frameworks
Default stance: allow indexing; keep author pages stable
Client material
Examples: proposals, reports, SOWs, client specifics
Default stance: behind auth; no public URLs
Website guardrails:
If it’s sensitive, put it behind authentication. robots.txt is not a protection layer.
If you want to be cited, give AI bots one obvious canonical page on your site per topic.
Training vs citations
If your risk posture is “don’t train on our content,” block training crawlers explicitly while keeping index/search crawlers open where you want citations. (Source)
Canonicals win
Your goal isn’t “block bots.” It’s “citations land on the right page on your website.” If your canonical pages are blocked, bots will cite whatever else they can reach.
Shared ownership
Marketing owns your canonical hubs. Security owns what must be gated. Legal owns your reuse risk. IT publishes the rules for your organization.
Review cadence
Monthly: scan logs for surprises and any broken access
Quarterly: revisit bot lists and your website structure
Release checklist: confirm canonical pages are reachable for key topics before updates
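A minimal sketch of that release check, assuming a short hand-maintained list of canonical URLs and example crawler names (the example.com paths and bot names are placeholders). It uses Python’s standard urllib.robotparser to confirm each page responds and is still allowed by robots.txt for the crawlers you want citations from.
[Code block — Python]
# check_canonicals.py — illustrative release-checklist helper
import urllib.request
import urllib.robotparser

SITE = "https://example.com"   # placeholder domain
CANONICALS = [                 # your canonical hubs for key topics
    f"{SITE}/services",
    f"{SITE}/security",
    f"{SITE}/docs/faq",
]
BOTS = ["OAI-SearchBot", "Googlebot"]  # example crawlers you want citations from

rp = urllib.robotparser.RobotFileParser(f"{SITE}/robots.txt")
rp.read()  # fetch and parse the live robots.txt

for url in CANONICALS:
    # 1) Is the page reachable?
    try:
        status = urllib.request.urlopen(url, timeout=10).status
    except Exception as exc:
        status = f"error: {exc}"
    # 2) Is it still allowed for the bots we care about?
    blocked = [bot for bot in BOTS if not rp.can_fetch(bot, url)]
    print(f"{url} -> HTTP {status}; blocked for: {', '.join(blocked) or 'none'}")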
Implementation Patterns
robots.txt
Pattern 1: allow search/indexing, block training
[Code block — robots.txt]
# Block the training crawler site-wide
User-agent: GPTBot
Disallow: /

# Allow the search/index crawler, but keep sensitive paths out of scope
User-agent: OAI-SearchBot
Disallow: /client-portal/
Disallow: /internal/
Disallow: /drafts/
Allow: /
Pattern 2: allow only selected public sections
[Code block — robots.txt]
User-agent: *
# Open the public hubs you want cited
Allow: /docs/
Allow: /insights/
Allow: /services/
Allow: /industries/
# Keep sensitive areas explicitly closed
Disallow: /client-portal/
Disallow: /internal/
Disallow: /admin/
Disallow: /files/private/
# Block everything else by default
Disallow: /
Operational note: user-triggered fetchers don’t always behave like background crawlers. Plan for that with access control, not only robots rules. (Source)
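One way to back that up: enforce authentication on sensitive path prefixes at the application layer, regardless of user-agent. A minimal sketch, assuming a Flask app with session-based login; the path prefixes and the session check are placeholders for your real access control.
[Code block — Python]
# Illustrative guard: sensitive paths require a logged-in session, whatever the client is.
from flask import Flask, abort, request, session

app = Flask(__name__)
app.secret_key = "replace-me"  # placeholder; use your real secret management

PROTECTED_PREFIXES = ("/client-portal/", "/internal/", "/drafts/")

@app.before_request
def require_auth_for_sensitive_paths():
    # robots.txt only asks politely; this check applies to every request,
    # including user-triggered AI fetchers that may not honor crawl rules.
    if request.path.startswith(PROTECTED_PREFIXES) and "user_id" not in session:
        abort(401)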
llms.txt
Use llms.txt to point systems at a small set of pages you’re comfortable being quoted from.
[Code block — llms.txt]
# [Firm Name] — public knowledge hubs
> Canonical pages we want referenced.

## Start here
- [Services](https://example.com/services)
- [Industries](https://example.com/industries)
- [How we work](https://example.com/how-we-work)

## Trust and scope
- [Security](https://example.com/security)
- [Terms](https://example.com/terms)

## Public docs
- [Guides](https://example.com/docs/guides)
- [FAQ](https://example.com/docs/faq)
- [Troubleshooting](https://example.com/docs/troubleshooting)

## People and writing
- [Experts](https://example.com/experts)
- [Insights](https://example.com/insights)
Keep it short. Point to hubs rather than hundreds of deep links.
What to track
Crawler activity
Requests by user-agent and path
“Blocked but should be open” hits on public pages
Spikes on paths that should never be public
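A minimal sketch of that tally, assuming a combined-format access log at access.log and an example (not exhaustive) list of AI user-agents; it counts hits per bot and path so anomalies stand out.
[Code block — Python]
# Illustrative crawler tally from a combined-format access log.
import re
from collections import Counter

LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')
WATCHED_BOTS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")  # example list; check vendor docs

hits = Counter()
with open("access.log") as log:
    for line in log:
        match = LINE.search(line)
        if not match:
            continue
        bot = next((b for b in WATCHED_BOTS if b in match.group("ua")), None)
        if bot:
            hits[(bot, match.group("path"))] += 1

# Top 20 (bot, path) pairs; scan for blocked public pages and paths that should never be hit.
for (bot, path), count in hits.most_common(20):
    print(f"{count:6d}  {bot:15s}  {path}")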
Citations and traffic
Referral traffic from AI products where available (OpenAI notes ChatGPT adds utm_source=chatgpt.com when access is allowed) (Source)
For a short list of priority queries, check whether citations land on canonical pages
Track wrong answers on Tier 1 topics (security, pricing, compliance) and fix the source page
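Where the utm_source=chatgpt.com tag survives into your server logs, a short sketch like this can count AI-referred landings by page (same combined-log assumption as above); your analytics tooling is usually the better source, so treat this as a fallback.
[Code block — Python]
# Illustrative count of ChatGPT-tagged landing pages from the same access log.
from collections import Counter
from urllib.parse import parse_qs, urlsplit

landings = Counter()
with open("access.log") as log:
    for line in log:
        try:
            target = line.split('"')[1].split(" ")[1]  # e.g. /insights?utm_source=chatgpt.com
        except IndexError:
            continue  # skip malformed lines
        parts = urlsplit(target)
        if parse_qs(parts.query).get("utm_source") == ["chatgpt.com"]:
            landings[parts.path] += 1

print(landings.most_common(10))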
Change log
Record every change to robots.txt or llms.txt with:
Date, change, reason
Owner + approver
Expected outcome
Rollback plan
How to Take Your Next Step in 2026
Bring marketing, legal, security, and IT into one working session, and then run a 30-day sprint:
Classify content and mark what must be gated
Pick canonical “trust hubs” for high-stakes topics
Produce robots.txt rules that separate training from indexing where supported
Publish llms.txt pointing to your canonical hubs
Set monitoring plus a quarterly review rhythm
If you need help, reach out to Phasewheel for a discovery call.
Last updated January 24, 2026