llms.txt, robots.txt, and AI crawler rules
A practical guide for CTOs and marketing leaders who want to earn AI citations — without exposing client data or sensitive content.
By Eric Schaefer · April 16, 2026 · 6 min read
B2B services firms need clear, intentional rules for robots.txt and llms.txt. Configure them right and AI assistants cite your best pages; leave them unattended and bots hit dead ends, citations land on the wrong pages, or content that should never be public surfaces in AI answers.
Two files that lived in the quiet corners of your web infrastructure are back on the map in 2026. Here's how to use them strategically.
Why AI search changed the stakes for robots.txt and llms.txt
AI search and chat products now decide what gets quoted, linked, and sent traffic. That changes what robots.txt and llms.txt actually mean for your business and what it costs to get them wrong.
Google is actively updating how AI Mode surfaces links and source context (Search Engine Land, 2025; The Verge, 2025). The European Commission has opened a formal investigation into Google's use of online content for AI products, calling out AI Overviews and AI Mode specifically (European Commission, 2025). Meanwhile, engineering and content teams are testing llms.txt to point AI systems toward curated knowledge bases (Mintlify, Webflow).
The business case for getting this right is direct: handle it well and you get cited on the pages you want. Leave it unattended and you block citations, send bots into dead ends, or leave sensitive material sitting on public URLs, any of which carries brand and legal risk.
What robots.txt and llms.txt actually do and don't do
robots.txt controls access for bots that choose to follow it. It is not a lock. If a URL is public, treat it as public. Any determined crawler can still reach it; the file is a convention, not a security layer.
llms.txt is a directory. It can point AI systems to your best pages, but it does not force anyone to use them. Think of it as a curated shortlist you publish for AI assistants that want to know where to start.
Not all AI bots behave the same way. Split them into three categories:
- Training crawlers: collect text to train models
- Index/search crawlers: build a source index to cite in answers
- User-triggered fetchers: grab a page on demand for a specific user question
OpenAI, for example, describes separate bots for search versus training (OAI-SearchBot versus GPTBot) and explicitly states you can allow one and block the other (OpenAI Docs). That distinction is the foundation of every configuration decision below.
How to classify your content before you configure anything
Start with your content types, not your robots.txt syntax. Every category needs a default stance, something your marketing, legal, and security teams have agreed on before the file gets written.
| Content type | Examples | Default stance |
|---|---|---|
| Digital marketing | Services, industries, POV posts | Allow indexing so you can be cited |
| Trust pages | Security, compliance, terms | Allow indexing; keep scope and dates clear |
| Docs / knowledge base | FAQs, guides, how-tos | Allow indexing; point bots to canonical hubs |
| Thought leadership | Research, frameworks | Allow indexing; keep author pages stable |
| Client material | Proposals, reports, SOWs | Behind authentication — no public URLs |
Two rules govern this entire table:
- If it's sensitive, put it behind authentication. robots.txt is not a protection layer.
- If you want to be cited, give AI bots one obvious canonical page on your site per topic.
Canonicals win. Your goal isn't "block bots", it's "citations land on the right page on your website." If your canonical pages are blocked, bots will cite whatever else they can reach.
robots.txt patterns that separate training from indexing
Once your content is classified, the configuration follows. Two patterns cover most B2B services firms.
Pattern 1: Allow search and indexing, block training
Use this when your risk posture is "don't train on our content", but you still want AI search citations. Block training crawlers explicitly while keeping index/search crawlers open where you want citations (OpenAI Docs). Crawlers that follow the robots.txt standard (RFC 9309) resolve conflicts between Allow and Disallow by the longest matching path, which is why a broad Allow and narrower Disallow rules can coexist for the same bot.
# Block training crawler
User-agent: GPTBot
Disallow: /

# Allow search/indexing crawler, but keep sensitive paths blocked
User-agent: OAI-SearchBot
Disallow: /client-portal/
Disallow: /internal/
Disallow: /drafts/
Allow: /
Pattern 2: Allow only selected public sections
Use this when you want fine-grained control and only specific sections should be crawlable at all.
# One group for all bots: allow only specific public sections
User-agent: *
Allow: /docs/
Allow: /insights/
Allow: /services/
Allow: /industries/
# Sensitive paths, blocked explicitly
Disallow: /client-portal/
Disallow: /internal/
Disallow: /admin/
Disallow: /files/private/
# Everything else is blocked by default
Disallow: /
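Before publishing either pattern, it's worth a quick local sanity check. Here is a minimal sketch using Python's standard-library urllib.robotparser; the user-agents and paths are the ones from Pattern 1, and the draft keeps the specific Disallow lines ahead of the broad Allow because this parser applies rules in file order rather than by longest match.

```python
# Minimal pre-publish check: parse the draft robots.txt with Python's
# standard library and confirm each bot gets the access you intend.
from urllib.robotparser import RobotFileParser

DRAFT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /client-portal/
Disallow: /internal/
Disallow: /drafts/
Allow: /
"""

parser = RobotFileParser()
parser.parse(DRAFT.splitlines())

checks = [
    ("GPTBot", "https://example.com/insights/post"),                # expect False
    ("OAI-SearchBot", "https://example.com/insights/post"),         # expect True
    ("OAI-SearchBot", "https://example.com/client-portal/report"),  # expect False
]
for agent, url in checks:
    print(f"{agent:15s} allowed: {parser.can_fetch(agent, url)}  {url}")
```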
One important operational note: user-triggered fetchers don't always behave like background crawlers. Plan for that with access control, not only robots rules (Search Engine Roundtable).
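In practice, that means the gate for sensitive paths has to live in the application or web server, where it applies no matter which user-agent shows up. A minimal sketch, assuming a Flask app with session-based login and the gated paths from the patterns above:

```python
# Sketch only: enforce authentication on gated paths at the application
# layer, so a fetcher that ignores robots.txt still gets a 403.
from flask import Flask, abort, request, session

app = Flask(__name__)
app.secret_key = "replace-me"  # placeholder

GATED_PREFIXES = ("/client-portal/", "/internal/", "/drafts/")

@app.before_request
def require_login_for_gated_paths():
    # robots.txt may also disallow these paths, but only this check
    # stops a client that never reads the robots file.
    if request.path.startswith(GATED_PREFIXES) and not session.get("user_id"):
        abort(403)
```

The same rule can sit in the web server or CDN instead; what matters is that access does not depend on the bot honoring robots.txt.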
How to structure your llms.txt file
llms.txt works best when it's short and hub-focused. Point AI systems to a curated set of pages you're comfortable being quoted from. Think canonical entry points, not a site map.
# [Firm Name] — public knowledge hubs
# Canonical pages we want referenced.
## Start here
- https://example.com/services
- https://example.com/industries
- https://example.com/how-we-work
## Trust and scope
- https://example.com/security
- https://example.com/terms
## Public docs
- https://example.com/docs/guides
- https://example.com/docs/faq
- https://example.com/docs/troubleshooting
## People and writing
- https://example.com/experts
- https://example.com/insights
The structure maps directly to the content classification you did in the previous step. If a content type has a default stance of "allow indexing," its hub page belongs here. If it's client material behind authentication, it doesn't.
What to monitor once the rules are live
Configuration without monitoring is a guess. Once robots.txt and llms.txt are published, three areas deserve regular attention.
Crawler activity
- Requests by user-agent and path (see the log-review sketch after this list)
- "Blocked but should be open" hits on public pages
- Spikes on paths that should never be public
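A minimal log-review sketch, assuming a combined-format access log at a typical nginx location; adjust the path and the user-agent list to your stack. It tallies requests per known AI crawler and per URL path, so spikes on pages that should be gated stand out quickly.

```python
# Count requests per AI crawler user-agent and per path from an access
# log in the common "combined" format. Paths and bot names are assumptions.
from collections import Counter

AI_USER_AGENTS = ("GPTBot", "OAI-SearchBot")   # extend with other known bots
LOG_PATH = "/var/log/nginx/access.log"         # assumed log location

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        agent = next((ua for ua in AI_USER_AGENTS if ua in line), None)
        if agent is None:
            continue
        try:
            # Combined format puts the request line in the first quoted
            # field, e.g. "GET /docs/faq HTTP/1.1"; the path is token two.
            path = line.split('"')[1].split()[1]
        except IndexError:
            continue
        hits[(agent, path)] += 1

for (agent, path), count in hits.most_common(20):
    print(f"{count:6d}  {agent:15s}  {path}")
```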
Citations and traffic
- Referral traffic from AI products: OpenAI notes that when ChatGPT accesses content with permission, it appends utm_source=chatgpt.com (OpenAI Help Center); see the sketch after this list
- For a short list of priority queries, check whether citations land on canonical pages
- Track wrong answers on Tier 1 topics (security, pricing, compliance) and fix the source page
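To check where those referrals land, a minimal sketch against a hypothetical analytics export; the file name and the landing_page and utm_source columns are assumptions, so swap in whatever your analytics tool actually exports.

```python
# Tally ChatGPT-referred sessions per landing page from an assumed CSV
# export, to confirm citations are driving traffic to canonical hubs.
import csv
from collections import Counter

EXPORT_PATH = "analytics_sessions.csv"   # hypothetical export file

by_page = Counter()
with open(EXPORT_PATH, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row.get("utm_source") == "chatgpt.com":
            by_page[row.get("landing_page", "")] += 1

for page, sessions in by_page.most_common(10):
    print(f"{sessions:5d}  {page}")
```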
Change log
Record every change to robots.txt or llms.txt with: date, change, and reason; owner and approver; expected outcome; rollback plan.
Shared ownership across teams
Marketing owns your canonical hubs. Security defines what must be gated. Legal manages your reuse risk. IT publishes and maintains the rules. All four need to align before any configuration goes live and revisit the alignment quarterly.
How to run your 30-day implementation sprint
Bring marketing, legal, security, and IT into one working session. Then run a focused sprint:
- Classify all content and mark what must be gated
- Pick canonical "trust hubs" for your highest-stakes topics
- Produce robots.txt rules that separate training from indexing where supported
- Publish llms.txt pointing to your canonical hubs
- Set monitoring and a quarterly review rhythm
Review cadence after launch: monthly, scan logs for surprises and any broken access; quarterly, revisit bot lists and site structure; before every release, confirm canonical pages are reachable for key topics.
Frequently asked questions
What is the difference between robots.txt and llms.txt?
robots.txt controls which bots can access which pages on your website, but it is not a security layer. Any public URL can still be accessed by a determined crawler. llms.txt is a voluntary directory that points AI systems toward your preferred canonical pages. It does not restrict access; it guides systems toward the content you most want cited.
Can I block AI training crawlers while still allowing AI search crawlers?
Yes. OpenAI operates separate bots for training (GPTBot) and search indexing (OAI-SearchBot). You can disallow GPTBot in robots.txt to opt out of training while keeping OAI-SearchBot allowed so your content can be cited in ChatGPT search responses.
What content should I block from AI crawlers?
Sensitive material such as client proposals, reports, SOWs, internal documents, and anything else that belongs behind authentication should never have a public URL. robots.txt is not a substitute for proper access control. Gate sensitive content behind authentication first, then use robots.txt to reinforce those boundaries.
What should I include in my llms.txt file?
Keep llms.txt short and hub-focused. Point AI systems to your canonical services pages, industry pages, trust and compliance pages, public documentation, and expert or insights sections. Avoid linking to hundreds of deep URLs; instead, curate the small set of pages you most want AI assistants quoting and citing.
How do I know if AI crawlers are visiting my site?
Monitor your server logs for user-agent strings from known AI crawlers (GPTBot, OAI-SearchBot, etc.). Track requests by path to catch bots hitting pages that should be gated. OpenAI also notes that when ChatGPT accesses content with permission, it appends utm_source=chatgpt.com, so referral traffic tracking in analytics can confirm citation activity.
Who owns robots.txt and llms.txt at a B2B services firm?
Shared ownership works best. Marketing owns the canonical hub pages and decides which content should be discoverable. Security defines what must be gated. Legal manages reuse risk and content licensing posture. IT publishes and maintains the actual rules files. All four teams should align in a single working session before any configuration goes live.