What is a robots.txt file and why do I need one?

A `robots.txt` file lives at the web root of a domain and tells crawlers which paths they should not crawl. It is part of the Robots Exclusion Protocol, which search-engine crawlers have respected since the 1990s: well-behaved bots such as Googlebot, Bingbot and DuckDuckBot read the file before crawling. The file must be reachable at exactly `/robots.txt` (the root of the URL path, not a subfolder), or crawlers ignore it. It controls only what gets crawled, not what ends up in the index; already-indexed URLs disappear only after a `noindex` or a Search Console removal request.

How do I block AI bots like GPTBot and ClaudeBot in robots.txt?

Add one user-agent block per bot, each with `Disallow: /`. As of 2026 the active list is at least 14 tokens: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, CCBot, Bytespider, Meta-ExternalAgent, Amazonbot, Applebot-Extended and Google-Extended. Deprecated names like `anthropic-ai` or `Claude-Web` no longer block anything, because vendor documentation no longer lists them as active tokens. The generator emits the current list on click and flags deprecated entries in the validator.
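As a rough illustration of the "one block per bot" layout, the sketch below builds the text from the token list in the answer above. The function name `block_all` is hypothetical and is not the generator's actual code.

```python
# Tokens copied from the list in the answer above (status 2026, per the text).
AI_BOT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User", "CCBot", "Bytespider",
    "Meta-ExternalAgent", "Amazonbot", "Applebot-Extended", "Google-Extended",
]


def block_all(tokens):
    """Build robots.txt text: one group per token, each disallowing the whole site."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    # Blank line between groups, LF line endings, single trailing newline.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(block_all(AI_BOT_TOKENS), end="")
```

Running the script prints fourteen two-line groups that can be pasted into an existing file below the `User-agent: *` block.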

What is the difference between OAI-SearchBot and GPTBot?

OAI-SearchBot fetches pages for real-time, Perplexity-style search answers in ChatGPT: when a user asks 'what is X?', the bot pulls the page and the model cites it in the response. GPTBot, in contrast, harvests training data for future model updates. If you want visibility in AI answers but not in training corpora, block GPTBot and allow OAI-SearchBot. Comparable splits exist for Anthropic (Claude-SearchBot vs ClaudeBot) and Apple (Applebot vs Applebot-Extended). The generator exposes these splits as named presets.

Are robots.txt rules case-sensitive?

Yes, for paths. `Disallow: /Admin/` and `Disallow: /admin/` block different URLs because the crawler compares paths character by character. Directive names and user-agent tokens, by contrast, are matched case-insensitively by the major crawlers. Common trap: someone tests `/admin/` locally in lower case but deploys to a CMS that serves `/Admin/`; the block then matches nothing. The generator's validator flags case-only differences between Allow and Disallow rules and within a single block. Safer route: block both spellings explicitly or enforce one canonical case at the server level.
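A case-only check of this kind can be sketched in a few lines. This is a minimal illustration, not the validator's real implementation, and the function name `case_only_variants` is made up for the example.

```python
from collections import defaultdict


def case_only_variants(paths):
    """Group paths that differ only by letter case, e.g. /Admin/ and /admin/."""
    by_folded = defaultdict(set)
    for path in paths:
        by_folded[path.casefold()].add(path)
    # Keep only groups with more than one distinct spelling.
    return [sorted(spellings) for spellings in by_folded.values() if len(spellings) > 1]
```

Passing in every Allow and Disallow value of one block returns the groups of spellings worth a second look; an empty list means no case-only differences were found.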

What does Crawl-delay mean and who respects it?

`Crawl-delay: 10` asks the crawler to wait 10 seconds between consecutive requests. Googlebot officially ignores the directive (per developers.google.com); Bingbot and Yandex respect it. In practice, Crawl-delay matters only if the server is suffering under crawler load, which is rare on modern hosting setups. The generator surfaces this caveat inline because many 2010s-era tutorials still recommend Crawl-delay as an SEO best practice, which is no longer accurate in 2026.

Why am I seeing a warning that I am blocking CSS or JS?

If `Disallow: /css/` or `Disallow: /assets/` sits in the `User-agent: *` block, Googlebot cannot fully render the page: the rendered preview in Search Console shows a broken version without styles and scripts. Google does not penalise that directly, but the page's quality signals suffer. The generator flags this case because it shows up often in consulting work and is hard to spot by self-inspection. Fix: allow CSS and JS paths with an explicit `Allow:` rule, or remove the `Disallow:` that catches them.
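One way to catch this early is to test the Disallow values against a handful of asset URLs the site actually loads. The sketch below assumes plain prefix matching only; the sample paths and the function name are hypothetical.

```python
# Hypothetical sample of asset paths a page might load; adjust to the real site.
SAMPLE_ASSETS = ["/css/site.css", "/assets/app.js", "/js/vendor.js"]


def rules_blocking_assets(disallow_rules, asset_paths=SAMPLE_ASSETS):
    """Return the Disallow values that prefix-match at least one asset path.

    Simplification: plain prefix matching only, no * or $ wildcards.
    """
    return [
        rule for rule in disallow_rules
        if rule and any(path.startswith(rule) for path in asset_paths)
    ]
```

Any value returned here is a candidate for removal or for a more specific `Allow:` rule.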

Is robots.txt enough to really block AI bots?

No. The Robots Exclusion Protocol is a voluntary convention: the bot decides whether to honour the file. Googlebot, Bingbot and DuckDuckBot follow it reliably. For AI crawlers, multiple audit reports in 2024 and 2025 (e.g. by WIRED and 404 Media) showed PerplexityBot and Bytespider ignoring robots.txt. When the file alone is not enough, you need a bot-mitigation layer: Cloudflare Bot Fight Mode, WAF rules, or a user-agent block in the server configuration (nginx `if` block, Apache `RewriteCond %{HTTP_USER_AGENT}`). The generator says so explicitly in the honest-limits banner.
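The same user-agent check can also live in the application layer. The following is a minimal WSGI middleware sketch under stated assumptions: the site runs a Python WSGI app, and the token tuple is only an example. It is not a substitute for the nginx or Apache rules named above.

```python
BLOCKED_TOKENS = ("PerplexityBot", "Bytespider")  # example choice, not a recommendation


def block_user_agents(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 when the User-Agent contains a blocked token."""
    lowered = tuple(token.lower() for token in blocked)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application unchanged.
        return app(environ, start_response)

    return middleware
```

Like any user-agent filter, this only stops bots that identify themselves honestly; a crawler that sends a browser user-agent passes straight through.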

Where do I put the robots.txt file on the server?

At the web root of your domain, reachable at `https://your-domain.com/robots.txt`. On Apache shared hosting the folder is typically `public_html/`; on nginx it is often `/var/www/html/`; on Cloudflare Pages, Astro, Hugo or 11ty you drop the file in `public/` or `static/` and the build picks it up automatically. Important: each host needs its own `robots.txt`. Subdomains have their own robots scopes, so a file on the main domain does not apply to a subdomain unless that subdomain serves its own copy. After deploying, no further step is needed: crawlers pick up the new version on their next refetch of the file, which can take from hours to days (Google generally caches robots.txt for up to 24 hours).

robots.txt Generator — AI Bot Block 2026, Validator

What does the robots.txt generator do?

It is an editor for the robots.txt file that search-engine crawlers and AI bots read before crawling. You compose any number of User-agent stacks side by side, each with its own Allow and Disallow rules and an optional Crawl-delay. Alongside the editor sit presets for AI-bot tokens (current as of 2026), common-block paths (admin, shop, search, PDFs) and a validator that flags typical mistakes. The output is plain text with LF line endings and no BOM, ready to drop at /robots.txt on the domain root.
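To make the output format concrete, here is a small sketch of how stacks could be serialised. The `Stack` class, its field names and both functions are assumptions made for illustration; the tool's internal data model is not documented here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Stack:
    """One user-agent group. Field names are illustrative, not the tool's own."""
    user_agent: str
    allow: List[str] = field(default_factory=list)
    disallow: List[str] = field(default_factory=list)
    crawl_delay: Optional[int] = None


def render(stacks):
    """Serialise stacks to robots.txt text with LF line endings."""
    groups = []
    for stack in stacks:
        lines = [f"User-agent: {stack.user_agent}"]
        lines += [f"Allow: {path}" for path in stack.allow]
        lines += [f"Disallow: {path}" for path in stack.disallow]
        if stack.crawl_delay is not None:
            lines.append(f"Crawl-delay: {stack.crawl_delay}")
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"


def to_bytes(text):
    # Plain "utf-8" writes no byte-order mark ("utf-8-sig" would add one).
    return text.encode("utf-8")
```

For example, `render([Stack("*", disallow=["/admin/"]), Stack("GPTBot", disallow=["/"])])` yields two groups separated by a blank line.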

Three pillars drive the tool:

Multi-stack editor — any number of user-agent stacks, Allow and Disallow editable line by line, Crawl-delay per stack.
AI-bot presets — five curated splits: block all 14 bots, allow search-bots and block training-bots, plus three-tier splits dedicated to Apple, OpenAI and Anthropic.
Validator — case mismatch, conflict between Allow and Disallow, blocked CSS/JS, deprecated token names, malformed sitemap URLs, plain http vs https.

Everything runs in the browser. No upload, no account, no cookie banner.

Which AI-bot tokens does the generator know (as of 2026)?

The curated list covers fourteen current tokens — sourced from vendor documentation, not from outdated tutorials:

| Vendor | Bot token | Purpose |
| --- | --- | --- |
| OpenAI | `GPTBot` | training |
| OpenAI | `ChatGPT-User` | user-initiated fetch |
| OpenAI | `OAI-SearchBot` | real-time search grounding |
| Anthropic | `ClaudeBot` | training |
| Anthropic | `Claude-User` | user-initiated fetch |
| Anthropic | `Claude-SearchBot` | claude.ai search |
| Perplexity | `PerplexityBot` | search grounding |
| Perplexity | `Perplexity-User` | user-initiated fetch |
| Common Crawl | `CCBot` | training (dataset for many models) |
| ByteDance | `Bytespider` | training |
| Meta | `Meta-ExternalAgent` | training |
| Amazon | `Amazonbot` | mixed |
| Apple | `Applebot-Extended` | training |
| Google | `Google-Extended` | training |

Common mistake: 2023-era tutorials list anthropic-ai and Claude-Web. Anthropic retired those names in 2024 — sites still carrying them in robots.txt block nobody and at the same time have no block on the actual bot. The validator flags those tokens as deprecated and names the modern replacement.

How does the allow-search-block-train preset work?

The second AI preset splits bot purpose instead of blanket-blocking everything. Search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) are allowed to fetch the page because each fetch matches a user query — the page lands in the answer as a citation. Training bots (GPTBot, ClaudeBot, CCBot, Bytespider, Meta-ExternalAgent, Applebot-Extended, Google-Extended) are blocked because their crawl only feeds the model snapshot and brings no visibility gain to the site itself.
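The split described above can be sketched as follows. The two lists repeat the bots named in the paragraph; the function name and the choice of an explicit `Allow: /` for search bots are assumptions about how such a preset might be emitted.

```python
# Bot lists follow the paragraph above.
SEARCH_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "CCBot", "Bytespider",
    "Meta-ExternalAgent", "Applebot-Extended", "Google-Extended",
]


def allow_search_block_train():
    """Emit one group per bot: search bots get full access, training bots none."""
    groups = [f"User-agent: {bot}\nAllow: /" for bot in SEARCH_BOTS]
    groups += [f"User-agent: {bot}\nDisallow: /" for bot in TRAINING_BOTS]
    return "\n\n".join(groups) + "\n"
```

Giving the search bots their own groups, rather than leaving them to the `User-agent: *` rules, keeps their access independent of whatever the catch-all block disallows.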

This separation is missing as a one-click toggle on every other generator we surveyed (metatags.io, seoptimer.com, websiteseochecker.com). Users had to copy the lists by hand from vendor doc PDFs — which is precisely where outdated tutorials persist and tokens like anthropic-ai outlive their relevance.

What is the validator for?

The validator runs live over the emitted text and reports five foot-gun classes:

Case mismatch: /Admin/ vs /admin/. Paths in robots.txt are case-sensitive, so the two block different URLs.
Conflict: Allow and Disallow on the same path. Crawlers behave inconsistently because the intent is ambiguous.
Deprecated tokens: anthropic-ai, Claude-Web. No active bot reads these in 2026.
Blocked CSS/JS: /css/, /assets/, /*.js. Google then renders a broken version of the page.
Sitemap URL format: non-absolute URLs (example.com/sitemap.xml instead of https://example.com/sitemap.xml) are silently ignored by crawlers.

The validator is passive — it does not auto-correct, it just tells you where to look. That keeps the file deterministic: same input, same output, no hidden rewrites.
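A passive check in this spirit returns warnings and leaves the input untouched. The sketch below covers two of the classes (deprecated tokens and Allow/Disallow conflicts); the function name and message wording are invented for the example.

```python
# Deprecated names as given in the text above, lower-cased for comparison.
DEPRECATED_TOKENS = {"anthropic-ai", "claude-web"}


def check_stack(user_agent, allow, disallow):
    """Return a list of human-readable warnings; never modifies the input."""
    warnings = []
    if user_agent.lower() in DEPRECATED_TOKENS:
        warnings.append(f"deprecated token: {user_agent}")
    # The same path in both lists leaves the intent ambiguous.
    for path in sorted(set(allow) & set(disallow)):
        warnings.append(f"conflict: {path} is both allowed and disallowed")
    return warnings
```

Because the function only reads its arguments, calling it twice on the same stack gives the same warnings, which matches the "same input, same output" property described above.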

Why no `Host:` directive (except for Yandex)?

Host: is a Yandex extension and is not part of the official robots.txt spec used by Google, Bing and DuckDuckGo. In setups with multiple mirror domains, Yandex uses it to name the canonical variant. If your primary search engines are Google and Bing, you do not need Host:. Canonical URLs belong in the <link rel="canonical"> tag of the HTML <head> or in a Sitemap: entry. The generator offers Host: as an optional field per stack, default empty.

How does the generator handle sitemap entries?

Sitemap URLs are emitted as a separate section at the bottom of the file, one entry per line:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

Sitemap lines apply globally — they are not bound to a user-agent stack. Multiple sitemap entries are allowed; every modern crawler reads all of them. The validator checks URL format (http:// or https:// required) and warns about plain http as a best-practice reminder.
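The format check and the emitted section can be sketched like this. The function names and the three result labels are hypothetical; only `urllib.parse.urlsplit` from the standard library is used.

```python
from urllib.parse import urlsplit


def check_sitemap_url(url):
    """Classify a Sitemap value as 'ok', 'warn-http' or 'error-not-absolute'."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return "error-not-absolute"
    return "warn-http" if parts.scheme == "http" else "ok"


def render_sitemaps(urls):
    """Emit one Sitemap line per URL, as a section of its own."""
    return "".join(f"Sitemap: {url}\n" for url in urls)
```

A value such as `example.com/sitemap.xml` has neither scheme nor host after parsing, so it lands in the error class rather than the warning class.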

What does the honest-limits banner at the bottom mean?

robots.txt is a voluntary convention. Well-behaved crawlers (Googlebot, Bingbot, DuckDuckBot, Yahoo Slurp, and large SEO crawlers like AhrefsBot or SemrushBot) respect the file reliably. AI crawlers, however, have been caught ignoring robots.txt in multiple audit reports in 2024 and 2025: WIRED tested PerplexityBot and found accesses despite Disallow; 404 Media documented similar findings for Bytespider. When you need a hard block, layer a bot-mitigation service on top: Cloudflare Bot Fight Mode, WAF rules per user-agent, an nginx if block or Apache RewriteCond %{HTTP_USER_AGENT}. The generator states this explicitly in the banner because many tutorials act as though robots.txt alone is enough.

What other foot-guns are worth knowing?

Three frequently-missed details:

Paths are prefix matches. Disallow: /admin also blocks /administrator/, not only /admin/. If you want only the exact path blocked, write Disallow: /admin/$ with an end-anchor. The $ anchor is defined in RFC 9309 and supported by Googlebot, but parsers built on the original 1994 convention may not understand it, so check before relying on it.
Disallow: with no value. That is a valid directive and means “block nothing” — functionally identical to “User-agent: X, do not block any path”. Some legacy crawlers expect at least one Disallow line per block; the empty form is convention for that.
User-agent: * does not match every bot. If a specific user-agent block (e.g. User-agent: GPTBot) exists, it overrides the * rules for that bot completely — the *-block’s Allow rules are lost too. That means special bot stacks have to repeat every relevant rule explicitly rather than rely on the * block.
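The three details above can be expressed as a simplified matcher. This is a sketch under stated assumptions, not a full RFC 9309 parser: it handles prefix matching and a trailing `$` but not the `*` wildcard, and it compares user-agent tokens by exact, case-insensitive name. Both function names are hypothetical.

```python
def rule_matches(rule, path):
    """Match one Allow/Disallow value against a URL path.

    Simplified: prefix matching plus a trailing $ end-anchor; the * wildcard
    is not handled here.
    """
    if rule == "":
        return False  # empty value matches nothing ("block nothing")
    if rule.endswith("$"):
        return path == rule[:-1]
    return path.startswith(rule)


def rules_for(groups, token):
    """Pick the rules for a bot: a specific group replaces the * group entirely.

    `groups` maps a user-agent token to its list of Disallow values.
    """
    wanted = token.lower()
    for name, rules in groups.items():
        if name.lower() == wanted:
            return rules
    return groups.get("*", [])
```

With `groups = {"*": ["/admin/"], "GPTBot": ["/private/"]}`, the call `rules_for(groups, "GPTBot")` returns only `["/private/"]`, which shows why a bot-specific stack has to repeat any `*` rule it still needs.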

Which related tools exist?

If you ship a robots.txt, you usually also build other crawler and server-header infrastructure. Related tools include:

.htaccess Generator — Apache server configuration with security headers and redirects.
nginx Config Generator — modern nginx server blocks with HTTP/3 and security headers.
OpenGraph Generator — social-media preview tags for six platforms.
UTM Link Builder — clean tracking parameters on marketing URLs.

Where can I read more?

Google Robots.txt Specification — Google’s official documentation for the Robots Exclusion Protocol.
ai.robots.txt (community repository) — maintained list of current AI-bot tokens, the basis for the 14 tokens preselected here.
Cloudflare AIndependence — example strategy for AI-bot mitigation beyond robots.txt.
Robots Exclusion Protocol at Wikipedia — protocol background, history since 1994.
