llms.txt and How to Control AI Crawlers
GPTBot, ClaudeBot, PerplexityBot, Google-Extended — a practical guide to the new llms.txt convention and using robots.txt to decide who trains on and cites your site.
A new class of bot is crawling your site — and it has nothing to do with Google. OpenAI, Anthropic, Perplexity, and a growing list of AI companies send their own crawlers to fetch content for training datasets and live retrieval. You have more control over what they see, and whether they can see it at all, than most site owners realize.
The tools aren't new: robots.txt has handled this for years. What is new is a proposed convention called llms.txt that gives you a separate, cleaner way to signal your intent to AI systems. This post covers both — what they are, how they differ, and how to use them deliberately.
What llms.txt is (and isn't)
llms.txt is a proposed convention, not a ratified standard. The idea is straightforward: a Markdown file at /llms.txt that lists the canonical pages you want AI systems to prioritize — your docs, your best evergreen content, your authoritative pages — as titled, linked entries. Rather than letting a model encounter your site through scattered crawl data, llms.txt points it at the clean signal.
Think of it like a sitemap, but written for language models rather than search engines. A sitemap helps Googlebot discover URLs efficiently. llms.txt helps an AI model understand which parts of your site represent the content you want to be known for. The file follows a simple Markdown structure: an H1 title, an optional blockquote summary, and then Markdown link lists — optionally organized under ## section headings.
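The structure described above is simple enough to generate from a list of pages. The following is a minimal sketch, not a reference implementation: the `Link` class, the `render_llms_txt` function, and the example.com URLs are hypothetical names chosen for illustration, and the output only follows the H1, blockquote, and link-list layout described in this section.

```python
from dataclasses import dataclass


@dataclass
class Link:
    title: str
    url: str
    note: str = ""  # optional short description placed after the link


def render_llms_txt(site_name: str, summary: str, sections: dict[str, list[Link]]) -> str:
    """Render an llms.txt body: H1 title, optional blockquote summary, link lists per section."""
    lines = [f"# {site_name}", ""]
    if summary:
        lines += [f"> {summary}", ""]
    for heading, links in sections.items():
        lines += [f"## {heading}", ""]
        for link in links:
            entry = f"- [{link.title}]({link.url})"
            if link.note:
                entry += f": {link.note}"
            lines.append(entry)
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Hypothetical site and pages, for illustration only.
    print(render_llms_txt(
        "Example Site",
        "Short description of what the site covers.",
        {"Docs": [Link("Getting started", "https://example.com/docs/start", "first steps")]},
    ))
```

Generating the file from the same source that drives your navigation or sitemap is one way to keep it from drifting out of date, though a hand-written file works just as well for a small site.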
A few important caveats:
It isn't enforced. Unlike robots.txt, which well-behaved crawlers respect by default, llms.txt has no equivalent compliance infrastructure. A model or crawler may read it — or may not. You can't rely on it to block anything.
It doesn't replace robots.txt. If you want to prevent an AI crawler from accessing a section of your site, robots.txt is the tool for that. llms.txt is purely additive: it's a way to surface preferred content, not restrict access.
It's emerging. Adoption is growing in the developer and AI tooling communities, and several AI platforms have begun referencing it, but don't treat it as a mature standard with guaranteed behavior. Use it as a best-effort signal alongside the more reliable mechanisms.
The AI user-agents that matter
Before you can write effective robots.txt rules for AI crawlers, you need to know who's knocking. These are the user-agent tokens you're most likely to write rules for. All except Google-Extended also appear in access logs:
GPTBot (OpenAI): OpenAI's crawler for training data collection. Blocking GPTBot via robots.txt is the standard way to opt out of OpenAI training data. ChatGPT's search and browsing features fetch pages under separate user-agents (OAI-SearchBot and ChatGPT-User), so a GPTBot rule alone does not cover them.
ClaudeBot (Anthropic): Anthropic's crawler. Fetches public web content that may be used to train Claude. Anthropic documents separate user-agents (Claude-User and Claude-SearchBot) for user-initiated fetches and search. Respects robots.txt directives.
PerplexityBot: Perplexity's crawler, used to build the index that powers live, cited answers. Blocking this bot means your content won't appear as a Perplexity citation source.
Google-Extended: Google's robots.txt control token for opting out of Gemini training and grounding, separate from the standard Googlebot. It is not a distinct crawler and has no user-agent string of its own: Google fetches pages with its existing user agents and applies the token only as a robots.txt control. You can block Google-Extended without affecting your regular search rankings.
CCBot (Common Crawl): Common Crawl is an open dataset frequently used to train open-source and research models. Its crawls are periodic and large-scale. Blocking CCBot removes your content from one of the most widely used training corpora.
The key distinction to understand is training vs. retrieval. Some bots primarily collect data to train models — content you block today won't appear in future model weights. Others fetch content live to answer a user's specific query and cite you in the response. Blocking a retrieval bot doesn't protect you from training; it removes you from the citation pool that drives AI referral traffic. These goals are often in tension, which is why having a policy — rather than a blanket block or blanket allow — is worth the effort. Being retrievable is a prerequisite for being cited — the foundation under Generative Engine Optimization.
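To see which of these bots actually visit your site, you can tally requests by user-agent substring. This is a rough sketch under stated assumptions: the `count_ai_bot_hits` function and the sample log lines are hypothetical, the matching is a plain case-insensitive substring test, and user-agent strings can be spoofed, so treat the counts as indicative. Google-Extended is left out because it is a robots.txt token and not a request header value.

```python
from collections import Counter

# Crawler tokens named in this post that appear in request user-agent headers.
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")


def count_ai_bot_hits(log_lines):
    """Count log lines that mention each AI crawler token (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each request line to at most one bot
    return counts


if __name__ == "__main__":
    # Hypothetical, simplified log lines for illustration only.
    sample = [
        '203.0.113.5 "GET /docs HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.0)"',
        '203.0.113.9 "GET /blog HTTP/1.1" 200 "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
        '198.51.100.7 "GET / HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64)"',
    ]
    for bot, hits in count_ai_bot_hits(sample).items():
        print(bot, hits)
```

Running a tally like this before writing rules shows whether a policy for a given bot would change anything in practice.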
Allow or block? A decision framework
There's no universal right answer. The right policy depends on what your content is for and who you're trying to reach.
Block if:
- Your content is proprietary, paywalled, or provides competitive advantage when kept private. Allowing AI crawlers to train on it can effectively distribute your expertise for free.
- You publish content that's been licensed exclusively or that has contractual restrictions on redistribution.
- You have legal or compliance reasons to control how your content is used (e.g., personal data, regulated industries).
Allow if:
- Your goal is brand visibility and you want to be cited in AI-generated answers. Being reachable by the crawlers behind ChatGPT, Perplexity, and Gemini is increasingly a meaningful source of traffic and authority, much like being indexed by Google in 2010.
- Your content is already publicly available and you benefit from broad discovery. Documentation, blog posts, guides, and educational content all fall here.
- You're building a brand in an emerging category and need every discovery channel available.
Split policy (most common for real sites):
Allow AI indexing for your public marketing content and documentation, while blocking crawlers from account pages, user-generated content, internal tools, or anything you'd prefer not to be in a training set. robots.txt makes this granular.
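A split policy is easier to keep consistent across bots if the rules are generated from one table. The sketch below assumes a hypothetical `render_robots_txt` helper and hypothetical paths (`/account`, `/internal`); it only emits `User-agent`, `Disallow`, and `Allow` lines, listing disallow rules first in each group.

```python
def render_robots_txt(policy: dict[str, dict[str, list[str]]]) -> str:
    """Render one robots.txt group per user-agent from a policy table."""
    groups = []
    for agent, rules in policy.items():
        lines = [f"User-agent: {agent}"]
        # Disallow rules go first so order-sensitive parsers reach them before "Allow: /".
        lines += [f"Disallow: {path}" for path in rules.get("disallow", [])]
        lines += [f"Allow: {path}" for path in rules.get("allow", [])]
        groups.append("\n".join(lines))
    # Groups are separated by a blank line.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Hypothetical private paths, shared by every bot in the table.
    private = ["/account", "/internal"]
    policy = {
        bot: {"disallow": private, "allow": ["/"]}
        for bot in ("GPTBot", "ClaudeBot", "PerplexityBot")
    }
    print(render_robots_txt(policy))
```

Keeping the private paths in a single list means a new bot inherits the same exclusions by default, which avoids the common mistake of protecting a path for one crawler and forgetting it for the rest.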
For content strategy, the analogy to traditional SEO is useful: opting out of search engine crawling in 2005 would have seemed like a privacy win, but the cost in long-term discoverability was enormous. Opting out of AI crawlers in 2026 carries a similar trade-off. For publicly useful content, the default should probably be to allow — and to invest in making that content excellent enough to be cited.
Implementing it
Here's a practical setup that allows AI retrieval across most of the site while protecting account and private paths.
robots.txt: declare your per-bot policy, then extend the same pattern to the other bots you care about:
# robots.txt — allow AI retrieval, disallow one private path
User-agent: GPTBot
Disallow: /account
Allow: /
User-agent: Google-Extended
Allow: /
# Add analogous blocks for ClaudeBot, PerplexityBot, and CCBot
# with whatever policy fits each; this is just the pattern.
llms.txt: a Markdown file at /llms.txt pointing models at your most canonical pages (see llmstxt.org for the canonical spec):
# CrawlX
> AI-powered technical-SEO crawler that finds and fixes site issues.
## Key pages
- [Getting started](https://crawlx.ai/docs/getting-started): set up your first crawl
- [Technical SEO checklist](https://crawlx.ai/blog/technical-seo-checklist-2026): the full audit checklist
A few notes on this configuration:
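The example above has no User-agent: * group, so crawlers it doesn't name, including Googlebot, are unrestricted. If you add a catch-all group, it governs only crawlers that have no more specific group of their own: a bot that matches a named group follows that group alone and ignores the * rules. Groups don't interfere with each other, but they also don't merge, so anything you want blocked for every bot has to be repeated in each named group.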
For llms.txt, keep it focused. List your highest-value, most canonical pages — not an exhaustive sitemap. The goal is to surface your clearest signal, not to replicate your whole URL structure.
If you want to block training specifically but allow retrieval (for citation purposes), robots.txt can only express that where a vendor publishes separate user-agents for each purpose. OpenAI, for example, documents GPTBot for training alongside OAI-SearchBot for search results, so you can disallow one and allow the other. Where a vendor uses a single user-agent for both, the same rule covers both uses and the choice stays binary for that bot. Check each vendor's current crawler documentation before relying on the split.
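One way to sanity-check a policy like the one above before deploying it is to feed the rules to a parser and ask what each bot may fetch. This sketch uses Python's standard `urllib.robotparser`; the `is_allowed` helper and the example.com URLs are hypothetical. One caveat worth knowing: Python's parser applies the first matching rule in file order, while RFC 9309 specifies the longest match, so the two agree here only because the Disallow line comes before `Allow: /`.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the example robots.txt in this section.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /account
Allow: /

User-agent: Google-Extended
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user-agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("GPTBot", "https://example.com/account/settings"),
        ("GPTBot", "https://example.com/docs/getting-started"),
        ("ClaudeBot", "https://example.com/account/settings"),  # no group for this bot
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The third check is the useful one: a bot with no group of its own and no catch-all group is allowed everywhere, including paths you protected for other bots.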
Auditing what you actually serve
Knowing your intent is one thing; knowing what AI crawlers actually encounter is another. Several gaps can exist between your robots.txt policy and the reality of what gets fetched:
Robots.txt syntax errors can silently fail. A misplaced rule, a typo in the user-agent string, or a wildcard pattern that's more permissive than intended can mean bots access pages you thought were blocked — or are blocked from pages you intended to allow.
Blocked or unrendered resources are a common issue. If your page content is loaded via JavaScript and an AI crawler doesn't execute scripts, it may fetch the HTML shell but miss everything meaningful. This is a crawl quality problem that affects AI retrieval just as much as it affects crawling by Google.
[Crawl budget](/blog/crawl-budget-optimization-guide) allocation matters here too. If AI crawlers are spending their limited fetch capacity on low-value pages — parameter URLs, thin pages, redirect chains — they may never reach your best content, regardless of what llms.txt says.
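The JavaScript gap in particular is easy to spot-check: measure how much readable text is present in the raw HTML a non-rendering crawler would receive. The following is a rough heuristic sketch using Python's standard `html.parser`; the `VisibleTextCounter` class, the `visible_word_count` function, and the sample markup are hypothetical, and what counts as "too few words" is a judgment call for your own pages.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_word_count(html: str) -> int:
    """Count whitespace-separated words in the unrendered HTML, title included."""
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


if __name__ == "__main__":
    # Hypothetical markup: a client-rendered shell versus a server-rendered page.
    shell = '<html><head><title>App</title><script>render("#root")</script></head><body><div id="root"></div></body></html>'
    full = "<html><body><h1>Guide</h1><p>Set up your first crawl today.</p></body></html>"
    print(visible_word_count(shell), visible_word_count(full))
```

A page whose raw HTML yields only a handful of words, while the rendered page shows a full article, is a likely candidate for server-side rendering or prerendering.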
CrawlX surfaces these issues in one pass. It checks which robots directives are in effect, identifies blocked resources that would affect AI crawler visibility, and flags redirect chains and crawlability problems that could prevent your intended content from being fetched at all. If your AI crawler policy is set correctly but your content still isn't being cited, the gap is usually somewhere in the infrastructure — and that's exactly what a technical crawl surfaces.
Your ability to influence AI-generated answers depends entirely on whether AI systems can read your content cleanly in the first place. llms.txt and robots.txt are the policy layer. A technical audit confirms the policy is working as intended.
The bottom line
AI crawlers are here and multiplying. The sites that will show up in AI-generated answers, Perplexity citations, and ChatGPT responses are the ones making deliberate choices today — not just hoping for the best.
The implementation isn't complex. Add specific user-agent rules to your robots.txt, protect anything genuinely private, allow public content that benefits from citation, and publish a focused llms.txt pointing to your most authoritative pages. Then audit that what your server actually serves matches that policy.
The web's relationship with crawlers has always been a negotiation. AI crawlers are the latest party at the table — and they're worth engaging thoughtfully.