
robots.txt Generator — AI Bot Block 2026, Validator

Compose user-agent stacks visually, block AI bots cleanly or allow them as a citation source — with a validator catching the typical foot-guns (case mismatch, deprecated tokens, blocked CSS/JS).

Runs locally in the browser — the generator emits text in memory, nothing is uploaded.

AI bot presets

Current 2026 bot tokens — deprecated names (anthropic-ai, Claude-Web) block nobody and are flagged by the validator.

Search bots may cite the page, training bots are blocked. The recommended strategy for content sites that want visibility without data harvesting.

Common block paths

Adds paths to the first `User-agent: *` stack (creates one if missing).

User-agent stacks

Googlebot ignores Crawl-delay. Bing and Yandex respect it. Set only if the server is overloaded.

Sitemap(s)

One absolute URL per line — should ideally start with `https://`.

Options

AI bots in output: 0
User-agent stacks: 1
Sitemap lines: 1

Validator

No issues found.

Output

Place at `/robots.txt` on your domain root (LF line endings, no BOM).

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

How It Works

  1. Paste text or code: paste your content into the input field or type directly.
  2. Instant processing: the tool processes your content immediately and shows the result.
  3. Copy result: copy the result to your clipboard with one click.

Privacy

All calculations run directly in your browser. No data is sent to any server.

An editor for `robots.txt` with multiple user-agent stacks, curated AI-bot presets (block-all, allow-search-block-train, three-tier splits for Apple, OpenAI and Anthropic) and a semantic validator. You add paths line by line, the generator emits the result live as LF-formatted plain text — ready to drop at `/robots.txt` on your domain root. Pure-client, no upload, no account.

01 — How to Use

How do you use this tool?

  1. Pick an AI-bot preset (or skip) — `Block all AI bots` sets all 14 current 2026 tokens to `Disallow: /`; `Allow AI search, block training` lets search bots in while blocking training crawlers.
  2. Optionally tap a common-block preset (admin, shop, search, query-string noise, PDFs, drafts) — paths land in the first `User-agent: *` stack; or add your own user-agent stacks and fill Allow/Disallow line by line.
  3. Set Crawl-delay only if the server is overloaded (Googlebot ignores it; Bing and Yandex respect it).
  4. Enter sitemap URLs one per line and check the validator panel: case mismatch, Allow/Disallow conflict, deprecated tokens, blocked CSS/JS, malformed sitemap URLs.
  5. Copy the output or download it as `robots.txt` and drop it on the domain root at `/robots.txt`.

What does the robots.txt generator do?

It is an editor for the robots.txt file that search-engine crawlers and AI bots read before indexing. You compose any number of User-agent stacks side by side, each with its own Allow and Disallow rules and an optional Crawl-delay. Alongside the editor sit presets for AI-bot tokens (current as of 2026), common-block paths (admin, shop, search, PDFs) and a validator that flags typical mistakes. The output is plain text with LF line endings and no BOM, ready to drop at /robots.txt on the domain root.

Three pillars drive the tool:

  • Multi-stack editor — any number of user-agent stacks, Allow and Disallow editable line by line, Crawl-delay per stack.
  • AI-bot presets — five curated splits: block all 14 bots, allow search-bots and block training-bots, plus three-tier splits dedicated to Apple, OpenAI and Anthropic.
  • Validator — case mismatch, conflict between Allow and Disallow, blocked CSS/JS, deprecated token names, malformed sitemap URLs, plain http vs https.

Everything in the browser. No upload, no account, no cookie banner.
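
To make the output format concrete, here is a minimal Python sketch of how user-agent stacks could be rendered to LF-terminated text without a BOM. The tool itself runs in the browser and its internals are not shown on this page, so the `Stack` class and the `render` and `to_bytes` functions are hypothetical names used for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Stack:
    """One user-agent group: its tokens plus Allow/Disallow rules (hypothetical model)."""
    user_agents: List[str]
    allow: List[str] = field(default_factory=list)
    disallow: List[str] = field(default_factory=list)
    crawl_delay: Optional[int] = None


def render(stacks: List[Stack], sitemaps: List[str]) -> str:
    """Emit robots.txt text: groups separated by blank lines, sitemap lines last."""
    blocks = []
    for stack in stacks:
        lines = [f"User-agent: {ua}" for ua in stack.user_agents]
        lines += [f"Allow: {path}" for path in stack.allow]
        lines += [f"Disallow: {path}" for path in stack.disallow]
        if stack.crawl_delay is not None:
            lines.append(f"Crawl-delay: {stack.crawl_delay}")
        blocks.append("\n".join(lines))
    if sitemaps:
        blocks.append("\n".join(f"Sitemap: {url}" for url in sitemaps))
    # Only "\n" is ever used as a line ending, so the text is LF-formatted.
    return "\n\n".join(blocks) + "\n"


def to_bytes(text: str) -> bytes:
    """Encode as UTF-8 without a BOM (the "utf-8" codec, not "utf-8-sig")."""
    return text.encode("utf-8")


text = render([Stack(["*"], disallow=["/admin/"])], ["https://example.com/sitemap.xml"])
print(text, end="")
```

When saving from Python, opening the file with `open("robots.txt", "w", encoding="utf-8", newline="\n")` keeps LF endings on every platform.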

Which AI-bot tokens does the generator know (as of 2026)?

The curated list covers fourteen current tokens — sourced from vendor documentation, not from outdated tutorials:

| Vendor | Bot token | Purpose |
|---|---|---|
| OpenAI | GPTBot | training |
| OpenAI | ChatGPT-User | user-initiated fetch |
| OpenAI | OAI-SearchBot | real-time search grounding |
| Anthropic | ClaudeBot | training |
| Anthropic | Claude-User | user-initiated fetch |
| Anthropic | Claude-SearchBot | claude.ai search |
| Perplexity | PerplexityBot | search grounding |
| Perplexity | Perplexity-User | user-initiated fetch |
| Common Crawl | CCBot | training (dataset for many models) |
| ByteDance | Bytespider | training |
| Meta | Meta-ExternalAgent | training |
| Amazon | Amazonbot | mixed |
| Apple | Applebot-Extended | training |
| Google | Google-Extended | training |

Common mistake: 2023-era tutorials list anthropic-ai and Claude-Web. Anthropic retired those names in 2024, so a robots.txt that still carries only them blocks nobody and leaves the actual bot unblocked. The validator flags those tokens as deprecated and names the modern replacement.
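
A deprecated-token check can be sketched in a few lines of Python. This is an illustration rather than the validator's actual code: the function name `find_deprecated_tokens` is hypothetical, the token set contains only the two names mentioned above, and the sketch only flags tokens without suggesting a replacement.

```python
# The two retired names from the text above, lowercased for case-insensitive comparison.
DEPRECATED_TOKENS = {"anthropic-ai", "claude-web"}


def find_deprecated_tokens(robots_txt: str) -> list:
    """Return (line_number, token) for each User-agent line naming a deprecated token."""
    hits = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            token = value.strip()
            if token.lower() in DEPRECATED_TOKENS:
                hits.append((number, token))
    return hits


sample = (
    "User-agent: anthropic-ai\n"
    "User-agent: Claude-Web\n"
    "Disallow: /\n"
    "\n"
    "User-agent: ClaudeBot\n"
    "Disallow: /\n"
)
for number, token in find_deprecated_tokens(sample):
    print(f"line {number}: deprecated token {token}")
```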

How does the allow-search-block-train preset work?

The second AI preset splits bot purpose instead of blanket-blocking everything. Search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) are allowed to fetch the page because each fetch matches a user query — the page lands in the answer as a citation. Training bots (GPTBot, ClaudeBot, CCBot, Bytespider, Meta-ExternalAgent, Applebot-Extended, Google-Extended) are blocked because their crawl only feeds the model snapshot and brings no visibility gain to the site itself.
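
The split can be sketched as two groups, one per purpose. The token lists below are copied from the paragraph above. The helper names `group` and `allow_search_block_train` are hypothetical, and so is the choice to share one rule across several `User-agent` lines, since the page does not show how the generator lays out its output.

```python
SEARCH_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "CCBot", "Bytespider",
    "Meta-ExternalAgent", "Applebot-Extended", "Google-Extended",
]


def group(tokens: list, rule: str) -> str:
    """Build one robots.txt group: several User-agent lines sharing a single rule."""
    return "\n".join([f"User-agent: {token}" for token in tokens] + [rule])


def allow_search_block_train() -> str:
    """Search bots may fetch everything; training bots are blocked from everything."""
    return "\n\n".join([
        group(SEARCH_BOTS, "Allow: /"),
        group(TRAINING_BOTS, "Disallow: /"),
    ]) + "\n"


preset = allow_search_block_train()
print(preset, end="")
```

Grouping several tokens under one rule keeps the file short; emitting one group per token would be equally valid and easier to edit bot by bot.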

This separation is missing as a one-click toggle on every other generator we surveyed (metatags.io, seoptimer.com, websiteseochecker.com). Users had to copy the lists by hand from vendor doc PDFs — which is precisely where outdated tutorials persist and tokens like anthropic-ai outlive their relevance.

What is the validator for?

The validator runs live over the emitted text and reports five foot-gun classes:

  • Case mismatch: /Admin/ vs /admin/. robots.txt paths are case-sensitive, so the two paths block different URLs.
  • Conflict: Allow and Disallow on the same path — crawlers behave inconsistently, the intent is ambiguous.
  • Deprecated tokens: anthropic-ai, Claude-Web — no active bot reads these in 2026.
  • Blocked CSS/JS: /css/, /assets/, /*.js. Googlebot cannot fetch the blocked resources and renders a broken version of the page.
  • Sitemap URL format: non-absolute URLs (example.com/sitemap.xml instead of https://example.com/sitemap.xml) are silently ignored by crawlers.

The validator is passive — it does not auto-correct, it just tells you where to look. That keeps the file deterministic: same input, same output, no hidden rewrites.
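
A passive validator of this kind can be sketched as a pure function from text to a list of findings. The version below covers three of the five classes (case mismatch, Allow/Disallow conflict, blocked CSS/JS). It is a simplified illustration, not the tool's implementation: the names `parse_rules` and `validate` are hypothetical, the asset pattern is an assumption, and rules are compared across the whole file rather than per user-agent group.

```python
import re


def parse_rules(robots_txt: str) -> list:
    """Return (directive, path) pairs for non-empty Allow/Disallow lines."""
    rules = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in ("allow", "disallow") and value.strip():
            rules.append((key, value.strip()))
    return rules


def validate(robots_txt: str) -> list:
    """Report findings as messages; the input text is never modified."""
    rules = parse_rules(robots_txt)
    findings = []

    # Case mismatch: paths that are equal when lowercased but spelled differently.
    by_lower = {}
    for path in sorted({path for _, path in rules}):
        by_lower.setdefault(path.lower(), []).append(path)
    for variants in by_lower.values():
        if len(variants) > 1:
            findings.append("case mismatch: " + " vs ".join(variants))

    # Conflict: the identical path appears under both Allow and Disallow.
    allowed = {path for directive, path in rules if directive == "allow"}
    blocked = {path for directive, path in rules if directive == "disallow"}
    for path in sorted(allowed & blocked):
        findings.append(f"conflict: Allow and Disallow on {path}")

    # Blocked CSS/JS: Disallow paths that look like asset folders or asset files.
    asset = re.compile(r"(^/(css|js|assets)/)|(\.(css|js)\$?$)", re.IGNORECASE)
    for path in sorted(blocked):
        if asset.search(path):
            findings.append(f"blocked CSS/JS: Disallow {path}")

    return findings


sample = (
    "User-agent: *\n"
    "Disallow: /Admin/\n"
    "Disallow: /admin/\n"
    "Allow: /tmp/\n"
    "Disallow: /tmp/\n"
    "Disallow: /*.js\n"
)
for finding in validate(sample):
    print(finding)
```

Because `validate` only returns messages, running it twice on the same text gives the same findings and leaves the file untouched.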

Why no Host: directive (except for Yandex)?

Host: is a Yandex extension and is not part of the official robots.txt spec used by Google, Bing and DuckDuckGo. In setups with multiple mirror domains, Yandex used it to name the canonical variant; Yandex announced in 2018 that it was dropping the directive in favor of 301 redirects, so check its current documentation before relying on it. If your primary search engines are Google and Bing, you do not need Host:. Canonical URLs belong in the <link rel="canonical"> tag of the HTML <head> or in a Sitemap: entry. The generator offers Host: as an optional field per stack, default empty.

How does the generator handle sitemap entries?

Sitemap URLs are emitted as a separate section at the bottom of the file, one entry per line:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

Sitemap lines apply globally — they are not bound to a user-agent stack. Multiple sitemap entries are allowed; every modern crawler reads all of them. The validator checks URL format (http:// or https:// required) and warns about plain http as a best-practice reminder.
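
The URL check described above could look roughly like this in Python, using the standard library's `urllib.parse.urlsplit`. The function name `check_sitemap` and the three result labels are hypothetical; the tool's own messages are not shown on this page.

```python
from urllib.parse import urlsplit


def check_sitemap(url: str) -> str:
    """Classify one sitemap URL as 'ok', 'warn-http', or 'error-not-absolute'."""
    parts = urlsplit(url.strip())
    # An absolute URL needs both an http(s) scheme and a host.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return "error-not-absolute"
    return "warn-http" if parts.scheme == "http" else "ok"


for candidate in (
    "https://example.com/sitemap.xml",
    "http://example.com/sitemap.xml",
    "example.com/sitemap.xml",
):
    print(candidate, "->", check_sitemap(candidate))
```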

What does the honest-limits banner at the bottom mean?

robots.txt is a voluntary convention. Well-behaved crawlers — Googlebot, Bingbot, DuckDuckBot, Yahoo Slurp, large SEO crawlers like Ahrefsbot or SemrushBot — respect the file reliably. AI crawlers however have been caught ignoring robots.txt in multiple audit reports in 2024 and 2025: WIRED tested PerplexityBot and found accesses despite Disallow; 404Media documented similar findings for Bytespider. When you need a hard block, layer a bot-mitigation service on top: Cloudflare Bot Fight Mode, WAF rules per user-agent, an nginx if block or Apache RewriteCond %{HTTP_USER_AGENT}. The generator names that in the banner explicitly because many tutorials act as though robots.txt alone is enough.
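
For sites served by a Python application, the same idea as the nginx or Apache user-agent rules can be sketched as WSGI middleware. This is a swapped-in illustration, not one of the services named above: `block_bots` and `hello_app` are hypothetical names, the blocked tokens are just the two bots mentioned in the paragraph, and a crawler that spoofs its User-Agent header will pass this check.

```python
# Example tokens taken from the paragraph above, lowercased for matching.
BLOCKED_AGENTS = ("perplexitybot", "bytespider")


def block_bots(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app so requests from listed user-agents get a 403 before reaching it."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


guarded = block_bots(hello_app)
```

Unlike a robots.txt rule, this check is enforced by the server, so it does not depend on the crawler choosing to comply.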

What other foot-guns are worth knowing?

Three frequently-missed details:

  1. Paths are prefix matches. Disallow: /admin also blocks /administrator/, not only /admin/. If you want only the exact path blocked, write Disallow: /admin/$ with an end-anchor. The $ anchor began as a search-engine extension and is now part of RFC 9309, but simpler or older parsers may not implement it, so check before relying on it.
  2. Disallow: with no value. That is a valid directive and means “block nothing” — functionally identical to “User-agent: X, do not block any path”. Some legacy crawlers expect at least one Disallow line per block; the empty form is convention for that.
  3. User-agent: * does not match every bot. If a specific user-agent block (e.g. User-agent: GPTBot) exists, it overrides the * rules for that bot completely — the *-block’s Allow rules are lost too. That means special bot stacks have to repeat every relevant rule explicitly rather than rely on the * block.
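
The three details above can be checked with a small matcher. This is a simplified sketch, not a full RFC 9309 implementation: it handles Disallow only (no Allow precedence), compares user-agent tokens by case-insensitive equality, and the names `pattern_to_regex`, `is_blocked` and `select_group` are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern: prefix match, '*' wildcard, trailing '$' anchor."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_blocked(path: str, disallow: list) -> bool:
    """True if any non-empty Disallow pattern matches from the start of the path."""
    return any(bool(p) and pattern_to_regex(p).match(path) is not None for p in disallow)


def select_group(bot: str, groups: dict) -> list:
    """Pick the bot's own group if one exists; use '*' only when there is none."""
    for token, rules in groups.items():
        if token != "*" and token.lower() == bot.lower():
            return rules
    return groups.get("*", [])


# 1. Prefix match versus end-anchor.
print(is_blocked("/administrator/", ["/admin"]))
print(is_blocked("/admin/users", ["/admin/$"]))

# 2. An empty Disallow value blocks nothing.
print(is_blocked("/anything", [""]))

# 3. A specific group replaces the '*' group instead of extending it.
groups = {"*": ["/admin/"], "GPTBot": ["/drafts/"]}
print(is_blocked("/admin/", select_group("GPTBot", groups)))
```

In the last call GPTBot is not blocked from /admin/, because its own group lists only /drafts/ and the `*` rules no longer apply to it.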
