How do you use this tool?
- Pick an AI-bot preset (or skip) — `Block all AI bots` sets all 14 current 2026 tokens to `Disallow: /`; `Allow AI search, block training` lets search bots in while blocking training crawlers.
- Optionally tap a common-block preset (admin, shop, search, query-string noise, PDFs, drafts) — paths land in the first `User-agent: *` stack; or add your own user-agent stacks and fill Allow/Disallow line by line.
- Set Crawl-delay only if the server is overloaded (Googlebot ignores it; Bing and Yandex respect it).
- Enter sitemap URLs one per line and check the validator panel: case mismatch, Allow/Disallow conflict, deprecated tokens, blocked CSS/JS, malformed sitemap URLs.
- Copy the output or download it as `robots.txt` and drop it on the domain root at `/robots.txt`.
What does the robots.txt generator do?
It is an editor for the robots.txt file that search-engine crawlers and AI bots read before
crawling a site. You compose any number of User-agent stacks side by side, each with its own Allow
and Disallow rules and an optional Crawl-delay. Alongside the editor sit presets for AI-bot
tokens (current as of 2026), common-block paths (admin, shop, search, PDFs) and a validator that
flags typical mistakes. The output is plain text with LF line endings and no BOM, ready to drop at
/robots.txt on the domain root.
Three pillars drive the tool:
- Multi-stack editor — any number of user-agent stacks, Allow and Disallow editable line by line, Crawl-delay per stack.
- AI-bot presets — five curated splits: block all 14 bots, allow search-bots and block training-bots, plus three-tier splits dedicated to Apple, OpenAI and Anthropic.
- Validator — case mismatch, conflict between Allow and Disallow, blocked CSS/JS, deprecated token names, malformed sitemap URLs, plain http vs https.
Everything runs in the browser. No upload, no account, no cookie banner.
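The stack model described above can be sketched in a few lines. The tool itself runs in the browser, so this Python version is only an illustration; the `Stack` class, its field names, and the `render_robots` and `to_bytes` helpers are hypothetical and do not reflect the generator's actual code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Stack:
    """One user-agent group. All names here are hypothetical."""
    user_agents: list
    allow: list = field(default_factory=list)
    disallow: list = field(default_factory=list)
    crawl_delay: Optional[int] = None


def render_robots(stacks, sitemaps):
    """Emit one block per stack, then the global Sitemap lines."""
    blocks = []
    for stack in stacks:
        lines = [f"User-agent: {ua}" for ua in stack.user_agents]
        lines += [f"Allow: {path}" for path in stack.allow]
        lines += [f"Disallow: {path}" for path in stack.disallow]
        if stack.crawl_delay is not None:
            lines.append(f"Crawl-delay: {stack.crawl_delay}")
        blocks.append("\n".join(lines))
    if sitemaps:
        blocks.append("\n".join(f"Sitemap: {url}" for url in sitemaps))
    # Blocks are separated by a blank line; only "\n" is used, so endings stay LF.
    return "\n\n".join(blocks) + "\n"


def to_bytes(text):
    # Plain "utf-8" (not "utf-8-sig") writes no byte-order mark.
    return text.encode("utf-8")
```

Calling `render_robots([Stack(["*"], disallow=["/admin/"])], ["https://example.com/sitemap.xml"])` yields a `User-agent: *` block followed by a blank line and the Sitemap line.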
Which AI-bot tokens does the generator know (as of 2026)?
The curated list covers fourteen current tokens — sourced from vendor documentation, not from outdated tutorials:
| Vendor | Bot token | Purpose |
|---|---|---|
| OpenAI | GPTBot | training |
| OpenAI | ChatGPT-User | user-initiated fetch |
| OpenAI | OAI-SearchBot | real-time search grounding |
| Anthropic | ClaudeBot | training |
| Anthropic | Claude-User | user-initiated fetch |
| Anthropic | Claude-SearchBot | claude.ai search |
| Perplexity | PerplexityBot | search grounding |
| Perplexity | Perplexity-User | user-initiated fetch |
| Common Crawl | CCBot | training (dataset for many models) |
| ByteDance | Bytespider | training |
| Meta | Meta-ExternalAgent | training |
| Amazon | Amazonbot | mixed |
| Apple | Applebot-Extended | training |
| Google | Google-Extended | training |
Common mistake: 2023-era tutorials list anthropic-ai and Claude-Web. Anthropic retired those
names in 2024, so sites still carrying them in robots.txt block no active crawler and leave the
actual bot unblocked. The validator flags those tokens as deprecated and names the modern
replacement.
How does the allow-search-block-train preset work?
The second AI preset splits bot purpose instead of blanket-blocking everything. Search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) are allowed to fetch the page because each fetch matches a user query — the page lands in the answer as a citation. Training bots (GPTBot, ClaudeBot, CCBot, Bytespider, Meta-ExternalAgent, Applebot-Extended, Google-Extended) are blocked because their crawl only feeds the model snapshot and brings no visibility gain to the site itself.
This separation is missing as a one-click toggle on every other generator we surveyed (metatags.io,
seoptimer.com, websiteseochecker.com). Users had to copy the lists by hand from vendor doc PDFs —
which is precisely where outdated tutorials persist and tokens like anthropic-ai outlive their
relevance.
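As a rough illustration of what this preset amounts to, the sketch below builds two groups from the token lists named above. Putting several `User-agent` lines into one group is valid robots.txt, but it is an assumption here that the tool emits the preset in this shape; the function name is hypothetical.

```python
# Token lists as named in the preset description above.
SEARCH_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "CCBot", "Bytespider",
    "Meta-ExternalAgent", "Applebot-Extended", "Google-Extended",
]


def allow_search_block_training():
    """Return robots.txt text: search bots allowed, training bots blocked."""
    search = [f"User-agent: {bot}" for bot in SEARCH_BOTS] + ["Allow: /"]
    training = [f"User-agent: {bot}" for bot in TRAINING_BOTS] + ["Disallow: /"]
    return "\n".join(search) + "\n\n" + "\n".join(training) + "\n"
```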
What is the validator for?
The validator runs live over the emitted text and reports five foot-gun classes:
- Case mismatch: `/Admin/` vs `/admin/`. `robots.txt` is case-sensitive, so the two paths block different URLs.
- Conflict: Allow and Disallow on the same path. Crawlers behave inconsistently and the intent is ambiguous.
- Deprecated tokens: `anthropic-ai`, `Claude-Web`. No active bot reads these in 2026.
- Blocked CSS/JS: `/css/`, `/assets/`, `/*.js`. Google then renders a broken version of the page.
- Sitemap URL format: non-absolute URLs (`example.com/sitemap.xml` instead of `https://example.com/sitemap.xml`) are silently ignored by crawlers.
The validator is passive — it does not auto-correct, it just tells you where to look. That keeps the file deterministic: same input, same output, no hidden rewrites.
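A passive check of this kind might look like the sketch below, which covers three of the classes listed above (deprecated tokens, case mismatch, Allow/Disallow conflict). It is a simplified illustration, not the tool's implementation: the `validate` function, its warning strings, and the grouping logic are assumptions.

```python
DEPRECATED_TOKENS = {"anthropic-ai", "claude-web"}  # compared in lowercase


def validate(robots_txt):
    """Return a list of warnings. The input text is never modified."""
    warnings = []
    seen_paths = {}            # lowercased path -> first spelling seen in the file
    allow, disallow = set(), set()
    in_rules = False           # True once the current group has a rule line

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:       # a new group starts: reset the per-group sets
                allow, disallow, in_rules = set(), set(), False
            if value.lower() in DEPRECATED_TOKENS:
                warnings.append(f"deprecated token: {value}")
        elif key in ("allow", "disallow"):
            in_rules = True
            if not value:      # an empty Disallow blocks nothing
                continue
            (allow if key == "allow" else disallow).add(value)
            if value in allow and value in disallow:
                warnings.append(f"conflict: Allow and Disallow both name {value}")
            first = seen_paths.setdefault(value.lower(), value)
            if first != value:
                warnings.append(f"case mismatch: {first} vs {value}")
    return warnings
```

Because the function only returns messages, running it twice on the same text gives the same result and leaves the text untouched, which matches the passive design described above.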
Why no Host: directive (except for Yandex)?
Host: is a Yandex extension and is not part of the official robots.txt spec used by Google,
Bing and DuckDuckGo. In site setups with multiple mirror domains, Yandex uses it to name the
canonical variant. If your primary search engines are Google and Bing, you do not need Host:
at all; canonical URLs belong in the <link rel="canonical"> tag of the HTML <head> or in a
Sitemap: entry. The generator offers Host: as an optional field per stack, default empty.
How does the generator handle sitemap entries?
Sitemap URLs are emitted as a separate section at the bottom of the file, one entry per line:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Sitemap lines apply globally — they are not bound to a user-agent stack. Multiple sitemap entries
are allowed; every modern crawler reads all of them. The validator checks URL format (http://
or https:// required) and warns about plain http as a best-practice reminder.
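The format check described here can be approximated with the standard library. This is a minimal sketch under the assumption that "valid" means an absolute URL with an `http` or `https` scheme and a host; the function name and message texts are hypothetical.

```python
from urllib.parse import urlsplit


def check_sitemap_url(url):
    """Return warnings for one sitemap entry (empty list means no findings)."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return ["not an absolute http(s) URL"]
    if parts.scheme == "http":
        return ["plain http; consider https"]
    return []
```

For example, `check_sitemap_url("example.com/sitemap.xml")` reports a non-absolute URL, because without a scheme the whole string is parsed as a path.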
What does the honest-limits banner at the bottom mean?
robots.txt is a voluntary convention. Well-behaved crawlers — Googlebot, Bingbot, DuckDuckBot,
Yahoo Slurp, large SEO crawlers like Ahrefsbot or SemrushBot — respect the file reliably. AI
crawlers however have been caught ignoring robots.txt in multiple audit reports in 2024 and
2025: WIRED tested PerplexityBot and found accesses despite Disallow; 404Media documented similar
findings for Bytespider. When you need a hard block, layer a bot-mitigation service on top:
Cloudflare Bot Fight Mode,
WAF rules per user-agent, an nginx if block or Apache RewriteCond %{HTTP_USER_AGENT}. The
generator names that in the banner explicitly because many tutorials act as though robots.txt
alone is enough.
What other foot-guns are worth knowing?
Three frequently-missed details:
- Paths are prefix matches. `Disallow: /admin` also blocks `/administrator/`, not only `/admin/`. If you want only the exact path blocked, write `Disallow: /admin/$` with an end-anchor (supported by Googlebot and specified in RFC 9309, but not implemented by every crawler; check before relying on it).
- `Disallow:` with no value. That is a valid directive and means "block nothing", functionally identical to "User-agent: X, do not block any path". Some legacy crawlers expect at least one Disallow line per block; the empty form is the convention for that.
- `User-agent: *` does not match every bot. If a specific user-agent block (e.g. `User-agent: GPTBot`) exists, it overrides the `*` rules for that bot completely, and the `*` block's Allow rules are lost too. Bot-specific stacks therefore have to repeat every relevant rule explicitly rather than rely on the `*` block.
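Two of these behaviours, prefix matching and group override, can be observed with Python's standard `urllib.robotparser`. This is only one parser among many, and its matching differs from Googlebot's in other respects (it applies the first matching rule rather than the longest one), so treat the sketch as an illustration, not a reference for how any particular crawler behaves. The bot name `SomeBot` is made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: *
Disallow: /admin

User-agent: GPTBot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# Prefix match: "/admin" also covers "/administrator/".
prefix_blocked = not parser.can_fetch("SomeBot", "/administrator/")  # True

# Group override: GPTBot has its own group, so the "*" rules no longer apply to it.
gptbot_admin = parser.can_fetch("GPTBot", "/admin/")        # True
gptbot_drafts = parser.can_fetch("GPTBot", "/drafts/post")  # False
```

To keep `/admin` blocked for GPTBot as well, the `Disallow: /admin` line would have to be repeated inside the GPTBot group.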
Which related tools exist?
If you ship a robots.txt, you usually also build other crawler / server-header infrastructure.
The set includes:
- .htaccess Generator — Apache server configuration with security headers and redirects.
- nginx Config Generator — modern nginx server blocks with HTTP/3 and security headers.
- OpenGraph Generator — social-media preview tags for six platforms.
- UTM Link Builder — clean tracking parameters on marketing URLs.
Where can I read more?
- Google Robots.txt Specification — Google’s official documentation for the Robots Exclusion Protocol.
- ai.robots.txt (community repository) — maintained list of current AI-bot tokens, the basis for the 14 tokens preselected here.
- Cloudflare AIndependence — example strategy for AI-bot mitigation beyond robots.txt.
- Robots Exclusion Protocol at Wikipedia — protocol background, history since 1994.