llms.txt, robots.txt, and AI crawler rules
A practical guide for CTOs and marketing leaders who want to earn AI citations — without exposing client data or sensitive content.
By Eric Schaefer · April 16, 2026 · 6 min read
B2B services firms need clear, intentional rules for robots.txt and llms.txt. Configure them right and AI assistants cite your best pages; leave them unattended and bots hit dead ends, citations land on the wrong pages, or content that should never be public surfaces in AI answers.
Two files that lived in the quiet corners of your web infrastructure are back on the map in 2026. Here's how to use them strategically.
Why AI search changed the stakes for robots.txt and llms.txt
AI search and chat products now decide what gets quoted, linked, and sent traffic. That changes what robots.txt and llms.txt actually mean for your business and what it costs to get them wrong.
Google is actively updating how AI Mode surfaces links and source context (Search Engine Land, 2025; The Verge, 2025). The European Commission has opened a formal investigation into Google's use of online content for AI products, calling out AI Overviews and AI Mode specifically (European Commission, 2025). Meanwhile, engineering and content teams are testing llms.txt to point AI systems toward curated knowledge bases (Mintlify, Webflow).
The business case for getting this right is direct: handle it well and you get cited on the pages you want. Leave it unattended and you block citations, send bots into dead ends, or leave sensitive material sitting on public URLs, any of which carries brand and legal risk.
What robots.txt and llms.txt actually do and don't do
robots.txt controls access for bots that choose to follow it. It is not a lock. If a URL is public, treat it as public. Any determined crawler can still reach it; the file is a convention, not a security layer.
llms.txt is a directory. It can point AI systems to your best pages, but it does not force anyone to use them. Think of it as a curated shortlist you publish for AI assistants that want to know where to start.
Not all AI bots behave the same way. Split them into three categories:
- Training crawlers: collect text to train models
- Index/search crawlers: build a source index to cite in answers
- User-triggered fetchers: grab a page on demand for a specific user question
OpenAI, for example, describes separate bots for search versus training (OAI-SearchBot versus GPTBot) and explicitly states you can allow one and block the other (OpenAI Docs). That distinction is the foundation of every configuration decision below.
How to classify your content before you configure anything
Start with your content types, not your robots.txt syntax. Every category needs a default stance, something your marketing, legal, and security teams have agreed on before the file gets written.
| Content type | Examples | Default stance |
|---|---|---|
| Digital marketing | Services, industries, POV posts | Allow indexing so you can be cited |
| Trust pages | Security, compliance, terms | Allow indexing; keep scope and dates clear |
| Docs / knowledge base | FAQs, guides, how-tos | Allow indexing; point bots to canonical hubs |
| Thought leadership | Research, frameworks | Allow indexing; keep author pages stable |
| Client material | Proposals, reports, SOWs | Behind authentication — no public URLs |
Two rules govern this entire table:
- If it's sensitive, put it behind authentication. robots.txt is not a protection layer.
- If you want to be cited, give AI bots one obvious canonical page on your site per topic.
Canonicals win. Your goal isn't "block bots", it's "citations land on the right page on your website." If your canonical pages are blocked, bots will cite whatever else they can reach.
robots.txt patterns that separate training from indexing
Once your content is classified, the configuration follows. Two patterns cover most B2B services firms.
Pattern 1: Allow search and indexing, block training
Use this when your risk posture is "don't train on our content", but you still want AI search citations. Block training crawlers explicitly while keeping index/search crawlers open where you want citations (OpenAI Docs). Crawlers that follow the robots.txt standard (RFC 9309) resolve conflicts between Allow and Disallow by the longest matching path, which is why a broad Allow and narrower Disallow rules can coexist for the same bot.
# Block training crawler
User-agent: GPTBot
Disallow: /

# Allow search/indexing crawler, but keep sensitive paths blocked
User-agent: OAI-SearchBot
Disallow: /client-portal/
Disallow: /internal/
Disallow: /drafts/
Allow: /
Pattern 2: Allow only selected public sections
Use this when you want fine-grained control and only specific sections should be crawlable at all.
# One group for all bots: allow only specific public sections
User-agent: *
Allow: /docs/
Allow: /insights/
Allow: /services/
Allow: /industries/
# Sensitive paths, blocked explicitly
Disallow: /client-portal/
Disallow: /internal/
Disallow: /admin/
Disallow: /files/private/
# Everything else is blocked by default
Disallow: /
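Before publishing either pattern, it's worth a quick local sanity check. Here is a minimal sketch using Python's standard-library urllib.robotparser; the user-agents and paths are the ones from Pattern 1, and the draft keeps the specific Disallow lines ahead of the broad Allow because this parser applies rules in file order rather than by longest match.

```python
# Minimal pre-publish check: parse the draft robots.txt with Python's
# standard library and confirm each bot gets the access you intend.
from urllib.robotparser import RobotFileParser

DRAFT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /client-portal/
Disallow: /internal/
Disallow: /drafts/
Allow: /
"""

parser = RobotFileParser()
parser.parse(DRAFT.splitlines())

checks = [
    ("GPTBot", "https://example.com/insights/post"),                # expect False
    ("OAI-SearchBot", "https://example.com/insights/post"),         # expect True
    ("OAI-SearchBot", "https://example.com/client-portal/report"),  # expect False
]
for agent, url in checks:
    print(f"{agent:15s} allowed: {parser.can_fetch(agent, url)}  {url}")
```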
One important operational note: user-triggered fetchers don't always behave like background crawlers. Plan for that with access control, not only robots rules (Search Engine Roundtable).
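In practice, that means the gate for sensitive paths has to live in the application or web server, where it applies no matter which user-agent shows up. A minimal sketch, assuming a Flask app with session-based login and the gated paths from the patterns above:

```python
# Sketch only: enforce authentication on gated paths at the application
# layer, so a fetcher that ignores robots.txt still gets a 403.
from flask import Flask, abort, request, session

app = Flask(__name__)
app.secret_key = "replace-me"  # placeholder

GATED_PREFIXES = ("/client-portal/", "/internal/", "/drafts/")

@app.before_request
def require_login_for_gated_paths():
    # robots.txt may also disallow these paths, but only this check
    # stops a client that never reads the robots file.
    if request.path.startswith(GATED_PREFIXES) and not session.get("user_id"):
        abort(403)
```

The same rule can sit in the web server or CDN instead; what matters is that access does not depend on the bot honoring robots.txt.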
How to structure your llms.txt file
llms.txt works best when it's short and hub-focused. Point AI systems to a curated set of pages you're comfortable being quoted from. Think canonical entry points, not a site map.
# [Firm Name] — public knowledge hubs
# Canonical pages we want referenced.
## Start here
- https://example.com/services
- https://example.com/industries
- https://example.com/how-we-work
## Trust and scope
- https://example.com/security
- https://example.com/terms
## Public docs
- https://example.com/docs/guides
- https://example.com/docs/faq
- https://example.com/docs/troubleshooting
## People and writing
- https://example.com/experts
- https://example.com/insights
The structure maps directly to the content classification you did in the previous step. If a content type has a default stance of "allow indexing," its hub page belongs here. If it's client material behind authentication, it doesn't.
What to monitor once the rules are live
Configuration without monitoring is a guess. Once robots.txt and llms.txt are published, three areas deserve regular attention.
Crawler activity
- Requests by user-agent and path (see the log-review sketch after this list)
- "Blocked but should be open" hits on public pages
- Spikes on paths that should never be public
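A minimal log-review sketch, assuming a combined-format access log at a typical nginx location; adjust the path and the user-agent list to your stack. It tallies requests per known AI crawler and per URL path, so spikes on pages that should be gated stand out quickly.

```python
# Count requests per AI crawler user-agent and per path from an access
# log in the common "combined" format. Paths and bot names are assumptions.
from collections import Counter

AI_USER_AGENTS = ("GPTBot", "OAI-SearchBot")   # extend with other known bots
LOG_PATH = "/var/log/nginx/access.log"         # assumed log location

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        agent = next((ua for ua in AI_USER_AGENTS if ua in line), None)
        if agent is None:
            continue
        try:
            # Combined format puts the request line in the first quoted
            # field, e.g. "GET /docs/faq HTTP/1.1"; the path is token two.
            path = line.split('"')[1].split()[1]
        except IndexError:
            continue
        hits[(agent, path)] += 1

for (agent, path), count in hits.most_common(20):
    print(f"{count:6d}  {agent:15s}  {path}")
```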
Citations and traffic
- Referral traffic from AI products: OpenAI notes that when ChatGPT accesses content with permission, it appends utm_source=chatgpt.com (OpenAI Help Center); see the sketch after this list
- For a short list of priority queries, check whether citations land on canonical pages
- Track wrong answers on Tier 1 topics (security, pricing, compliance) and fix the source page
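To check where those referrals land, a minimal sketch against a hypothetical analytics export; the file name and the landing_page and utm_source columns are assumptions, so swap in whatever your analytics tool actually exports.

```python
# Tally ChatGPT-referred sessions per landing page from an assumed CSV
# export, to confirm citations are driving traffic to canonical hubs.
import csv
from collections import Counter

EXPORT_PATH = "analytics_sessions.csv"   # hypothetical export file

by_page = Counter()
with open(EXPORT_PATH, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row.get("utm_source") == "chatgpt.com":
            by_page[row.get("landing_page", "")] += 1

for page, sessions in by_page.most_common(10):
    print(f"{sessions:5d}  {page}")
```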
Change log
Record every change to robots.txt or llms.txt with: date, change, and reason; owner and approver; expected outcome; rollback plan.
Shared ownership across teams
Marketing owns your canonical hubs. Security defines what must be gated. Legal manages your reuse risk. IT publishes and maintains the rules. All four need to align before any configuration goes live and revisit the alignment quarterly.
How to run your 30-day implementation sprint
Bring marketing, legal, security, and IT into one working session. Then run a focused sprint:
- Classify all content and mark what must be gated
- Pick canonical "trust hubs" for your highest-stakes topics
- Produce robots.txt rules that separate training from indexing where supported
- Publish llms.txt pointing to your canonical hubs
- Set monitoring and a quarterly review rhythm
Review cadence after launch: monthly, scan logs for surprises and any broken access; quarterly, revisit bot lists and site structure; before every release, confirm canonical pages are reachable for key topics.
Frequently asked questions
What is the difference between robots.txt and llms.txt?
robots.txt controls which bots can access which pages on your website, but it is not a security layer. Any public URL can still be accessed by a determined crawler. llms.txt is a voluntary directory that points AI systems toward your preferred canonical pages. It does not restrict access; it guides systems toward the content you most want cited.
Can I block AI training crawlers while still allowing AI search crawlers?
Yes. OpenAI operates separate bots for training (GPTBot) and search indexing (OAI-SearchBot). You can disallow GPTBot in robots.txt to opt out of training while keeping OAI-SearchBot allowed so your content can be cited in ChatGPT search responses.
What content should I block from AI crawlers?
Sensitive material such as client proposals, reports, SOWs, internal documents, and anything else that belongs behind authentication should never have a public URL. robots.txt is not a substitute for proper access control. Gate sensitive content behind authentication first, then use robots.txt to reinforce those boundaries.
What should I include in my llms.txt file?
Keep llms.txt short and hub-focused. Point AI systems to your canonical services pages, industry pages, trust and compliance pages, public documentation, and expert or insights sections. Avoid linking to hundreds of deep URLs; instead, curate the small set of pages you most want AI assistants quoting and citing.
How do I know if AI crawlers are visiting my site?
Monitor your server logs for user-agent strings from known AI crawlers (GPTBot, OAI-SearchBot, etc.). Track requests by path to catch bots hitting pages that should be gated. OpenAI also notes that when ChatGPT accesses content with permission, it appends utm_source=chatgpt.com, so referral traffic tracking in analytics can confirm citation activity.
Who owns robots.txt and llms.txt at a B2B services firm?
Shared ownership works best. Marketing owns the canonical hub pages and decides which content should be discoverable. Security defines what must be gated. Legal manages reuse risk and content licensing posture. IT publishes and maintains the actual rules files. All four teams should align in a single working session before any configuration goes live.