The Complete Guide to robots.txt: Syntax, Rules and Gotchas
Every few months I see a thread where a developer confidently claims their robots.txt is "blocking crawlers" and then shows a file that does nothing of the sort. Robots.txt is one of those technologies that looks deceptively simple until you sit down and read the actual spec. Then it gets weird. Then you realize Googlebot and Bingbot don't fully agree on how to interpret it. Then you find out that a Disallow rule doesn't actually prevent indexing. Welcome.
This is a proper technical reference. We'll cover the directives, how wildcards work (and don't), precedence rules, crawl-delay behavior, and — critically — the hard boundaries of what robots.txt can and cannot do.
The Basics: Format and Structure
Robots.txt must live at the root of your domain: https://example.com/robots.txt. That's non-negotiable. Subpaths don't work. Subdomains need their own files. A file at example.com has zero authority over docs.example.com.
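To make the "one file per origin" rule concrete, here is a minimal Python sketch that works out which robots.txt governs a given page. The helper name robots_url_for is made up for this article; it relies only on the standard library's urllib.parse.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    Scheme, host and port are kept as they are; the path, query string
    and fragment are replaced by /robots.txt.
    """
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    for url in (
        "https://example.com/blog/post?id=1",
        "https://docs.example.com/guide/",
    ):
        print(url, "->", robots_url_for(url))
```

The two example URLs resolve to different files, which is the whole point: the subdomain never consults the file on the apex domain.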
The file is built from "records" — groups of directives separated by blank lines. Each record starts with one or more User-agent: lines and is followed by the actual instructions.
User-agent: Googlebot
Disallow: /private/
Allow: /private/public-section/

User-agent: *
Disallow: /admin/
Crawl-delay: 2
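Blank lines between records matter, though less than they used to. The original 1994 spec used the blank line as the record separator. RFC 9309 and Google's parser instead start a new group whenever a User-agent: line follows a rule line, so a missing blank line is harmless to them. Keep the blank lines anyway: a parser that follows the older convention can read two run-together records as one, and you may end up applying restrictions to agents you didn't intend.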
Comments start with # and run to end of line. Fields are case-insensitive in the directive name (user-agent and User-agent both work), but the path values in Allow and Disallow are case-sensitive. That's a real gotcha: Disallow: /Admin/ does not protect /admin/.
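Here is a rough Python sketch of how a parser might turn that structure into records. It is not any crawler's actual implementation. The function name parse_records and the dictionary layout are invented for illustration, and it assumes RFC 9309-style grouping: blank lines are tolerated, and a User-agent line that follows a rule starts a new record.

```python
def parse_records(text: str) -> list:
    """Split robots.txt text into records of user agents and rules.

    Each record is a dict: {"agents": [...], "rules": [(field, value), ...]}.
    Field names are lowercased; path values keep their original case.
    """
    records = []
    current = None
    seen_rule = False  # True once the current record has a non-User-agent line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            # Consecutive User-agent lines share one record; a User-agent
            # line after a rule opens a new one.
            if current is None or seen_rule:
                current = {"agents": [], "rules": []}
                records.append(current)
                seen_rule = False
            current["agents"].append(value.lower())
        elif current is not None and field in ("allow", "disallow", "crawl-delay"):
            current["rules"].append((field, value))
            seen_rule = True
    return records
```

Lowercasing the field name but leaving the value untouched mirrors the case rules above: user-agent and User-agent parse identically, while /Admin/ and /admin/ stay distinct paths.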
User-agent Matching: Specificity Wins
The User-agent: * wildcard matches any bot not claimed by a more specific record. "More specific" means: does a record name this exact bot?
If Googlebot hits your site and you have both a User-agent: Googlebot record and a User-agent: * record, Googlebot applies only the Googlebot-specific record — it ignores the wildcard record entirely. The two records don't merge. This is probably the most common misconception I see in production robots.txt files.
There's no partial matching on user-agent strings (in the official spec). User-agent: Google does not match "Googlebot". You need the exact bot name as documented by the crawler's vendor.
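The selection logic is small enough to sketch in Python. The function select_group and its input shape (a dict from agent name to a list of rule strings) are made up for this example, and it assumes bot names compare case-insensitively, as RFC 9309 specifies.

```python
def select_group(groups: dict, bot_name: str) -> list:
    """Return the rules that apply to bot_name.

    A record that names the bot exactly wins outright; otherwise the '*'
    record applies. The two are never merged.
    """
    by_agent = {agent.lower(): rules for agent, rules in groups.items()}
    name = bot_name.lower()
    if name in by_agent:
        return by_agent[name]      # exact match: the '*' record is ignored
    return by_agent.get("*", [])   # fallback; no '*' record means no rules


if __name__ == "__main__":
    groups = {
        "Googlebot": ["Disallow: /private/"],
        "*": ["Disallow: /admin/"],
    }
    print(select_group(groups, "Googlebot"))
    print(select_group(groups, "Bingbot"))
```

With these groups, Googlebot gets only its own /private/ rule and never sees /admin/, which is exactly the surprise described above.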
Allow vs. Disallow: The Precedence Rule
When both Allow and Disallow rules match a URL, the most specific rule wins — measured by the length of the path string. Longer path = higher specificity.
User-agent: *
Disallow: /docs/
Allow: /docs/public/
Here, /docs/public/index.html is allowed, because /docs/public/ (13 characters) beats /docs/ (6 characters). If two rules tie on length, most crawlers give precedence to Allow, the less restrictive rule. This is one of the places where modern behavior departs from the original REP (Robots Exclusion Protocol): Martijn Koster's 1994 spec defined only Disallow. Allow and longest-match resolution arrived later as extensions, which Google documented and which RFC 9309 standardized in 2022, so specificity-then-Allow-wins is now fairly standard among major crawlers.
There is no global precedence of Allow over Disallow or vice versa. Order within the file doesn't matter for rule resolution. Only specificity matters.
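A minimal Python sketch of that resolution rule, restricted to plain prefix rules (no wildcards). The name is_allowed and the (kind, prefix) tuple format are assumptions made for illustration.

```python
def is_allowed(path: str, rules: list) -> bool:
    """Resolve Allow/Disallow prefix rules by longest match.

    rules is a list of (kind, prefix) tuples where kind is "allow" or
    "disallow". On a length tie, Allow wins. File order is irrelevant.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is crawlable
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        longer = len(prefix) > best_len
        tie_goes_to_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_goes_to_allow:
            best_len = len(prefix)
            allowed = kind == "allow"
    return allowed


if __name__ == "__main__":
    rules = [("disallow", "/docs/"), ("allow", "/docs/public/")]
    print(is_allowed("/docs/public/index.html", rules))
    print(is_allowed("/docs/internal/notes.html", rules))
```

Reversing the order of the two rules gives the same answers, because nothing in the loop depends on position.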
Wildcards: * and $
The original robots.txt spec had no wildcard support. Wildcards were added as an extension, and support is not universal — though every major crawler now honors them.
The * wildcard matches any sequence of characters (including none). It works in paths only, not in user-agent strings.
Disallow: /search?*q=
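This blocks any URL that starts with /search? and contains q= somewhere later in the query string. It's useful for blocking faceted search pages that generate near-infinite parameter combinations. Note that it does not touch q= parameters on other paths, such as /products?q=.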
The $ anchor matches end-of-URL. Without it, Disallow: /page matches /page, /page.html, /pages/archive, and everything that starts with that prefix. With $:
Disallow: /page$
This matches only the exact URL /page, not /page.html or /pages/. This is powerful for protecting specific canonical endpoints while keeping related paths open.
You can combine them:
Disallow: /*.pdf$
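This blocks all URLs ending in .pdf, but not /report.pdf?download=1, because the $ requires the URL to end right there. It's a common pattern for keeping crawlers away from internal PDF documents, though again, see the "limits" section below for why this isn't a security measure and doesn't by itself keep those URLs out of the index.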
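One way to reason about these patterns is to translate them into regular expressions. The Python sketch below does that; pattern_to_regex and matches are names invented for this article, and the translation reflects the semantics described above rather than any particular crawler's code.

```python
import re


def pattern_to_regex(pattern: str):
    """Compile a robots.txt path pattern into a regular expression.

    '*' matches any run of characters (including none); a trailing '$'
    anchors the match at the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    # re.match anchors at the start only, which gives prefix semantics.
    return pattern_to_regex(pattern).match(url_path) is not None


if __name__ == "__main__":
    print(matches("/page", "/pages/archive"))            # prefix match
    print(matches("/page$", "/page.html"))               # anchored, no match
    print(matches("/*.pdf$", "/files/report.pdf"))       # wildcard plus anchor
    print(matches("/search?*q=", "/search?lang=en&q=x")) # wildcard mid-pattern
```

Escaping the literal pieces matters: the ? in /search? and the . in .pdf are regex metacharacters, but in robots.txt they are ordinary characters.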
Crawl-delay: Supported Inconsistently
Crawl-delay is not part of the official REP spec, but it's widely supported. The directive tells a crawler to wait N seconds between requests to your site.
User-agent: *
Crawl-delay: 5
Here's the critical gotcha: Google ignores Crawl-delay. Completely. Googlebot uses its own signals (chiefly your server's response times and its own estimate of how much crawling the site warrants) to determine crawl rate. Search Console used to offer a crawl rate limiter, but Google retired it in early 2024. If you need to throttle Googlebot today, the documented lever is your server's behavior: slower responses, or temporary 500, 503, or 429 status codes, signal load and cause Googlebot to back off.
Bing and many other crawlers do respect Crawl-delay. So the directive isn't useless — just don't rely on it for Google.
Also: fractional values. Crawl-delay: 0.5 — some crawlers accept it, others don't, others round it. If you need half-second delays, test against each crawler you care about, or handle rate limiting at the infrastructure level (nginx rate limiting or similar).
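If you are writing a crawler yourself, you have to decide how to read the value. This Python sketch shows one tolerant interpretation and one strict one; the function parse_crawl_delay and its allow_fractions flag are hypothetical, not taken from any real crawler.

```python
import math
from typing import Optional


def parse_crawl_delay(raw: str, allow_fractions: bool = True) -> Optional[float]:
    """Return the delay in seconds, or None if the value is unusable.

    With allow_fractions=False the function mimics a strict crawler that
    only accepts whole seconds and ignores anything else.
    """
    try:
        delay = float(raw.strip())
    except ValueError:
        return None
    if not math.isfinite(delay) or delay < 0:
        return None
    if not allow_fractions and delay != int(delay):
        return None
    return delay


if __name__ == "__main__":
    for value in ("5", "0.5", "fast"):
        print(value, parse_crawl_delay(value), parse_crawl_delay(value, allow_fractions=False))
```

The same Crawl-delay: 0.5 line yields half a second under one policy and no delay at all under the other, which is why testing against each crawler matters.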
Sitemaps in robots.txt
You can (and should) reference your XML sitemap from robots.txt:
Sitemap: https://example.com/sitemap.xml
Multiple Sitemap: lines are valid. This directive is not tied to any user-agent record; it applies globally. Most crawlers accept it wherever it appears, but a Sitemap: line in the middle of a record is easy to misread and may trip stricter parsers. Put it at the top or bottom of the file, outside any record block.
The sitemap URL must be absolute. Relative paths don't work.
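A quick way to catch relative sitemap paths before deploying is to scan the file yourself. This Python sketch is one possible check; the helper name sitemap_urls is invented here, and it simply skips any value that is not an absolute http or https URL.

```python
from urllib.parse import urlsplit


def sitemap_urls(robots_txt: str) -> list:
    """Collect absolute Sitemap: URLs, skipping relative or malformed ones."""
    found = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "sitemap":
            continue
        value = value.strip()
        parts = urlsplit(value)
        if parts.scheme in ("http", "https") and parts.netloc:
            found.append(value)
    return found


if __name__ == "__main__":
    text = "Sitemap: https://example.com/sitemap.xml\nSitemap: /sitemap.xml\n"
    print(sitemap_urls(text))
```

Comparing the number of Sitemap: lines in the file with the length of the returned list tells you whether any of them were relative.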
The Hard Limits: What robots.txt Cannot Do
This is where people get into real trouble by over-trusting robots.txt.
1. Disallowed URLs can still get indexed. If another site links to a URL you've disallowed, Google may index that URL anyway; it just won't crawl it. The URL appears in search results with no title and no snippet, just the URL. To prevent indexing, you need a noindex meta tag or a noindex HTTP header (X-Robots-Tag), and both require the page to be crawlable. If you disallow a URL and also put noindex on it, you've created a contradiction: the crawler can't read your noindex because you've told it not to visit the page.
2. Robots.txt is not access control. It's an advisory protocol. There is zero enforcement mechanism. Malicious bots, scrapers, and vulnerability scanners don't read it — or actively read it to find paths you're trying to protect. If you need something private, put it behind authentication.
3. The file can expose your site structure. A common mistake is listing sensitive-sounding paths in Disallow directives. Disallow: /internal-customer-data/ broadcasts that path to anyone who reads the file. Consider using generic names if you must protect a path, or better — just require authentication and don't mention the path at all.
4. robots.txt is per-protocol and per-port. Your http://example.com/robots.txt doesn't apply to https://example.com (treated as a different origin), example.com:8080, or FTP. In practice, if you redirect HTTP to HTTPS, Googlebot fetches the HTTPS file, but it's worth knowing the spec treats them separately.
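5. Fetch failures are not all treated alike. A 4xx response (including 404) typically means "no restrictions": the crawler behaves as if the file were empty. Server errors are the opposite case. If a crawler can't fetch your robots.txt because of 5xx errors or timeouts, RFC 9309 tells it to assume complete disallow, and Google documents similar behavior: it pauses crawling at first and later falls back to the last copy it cached. So if your robots.txt returns a 500 during a server incident, expect well-behaved crawlers to back off rather than assume open access, and expect crawling to stall if the error persists.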
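For point 1, the header route is the one that works for non-HTML files such as PDFs, where there is no place to put a meta tag. As a minimal sketch, here is a bare WSGI application in Python that serves a crawlable response carrying X-Robots-Tag: noindex. The app name and body are placeholders; in practice you would more likely set this header in your web server or framework configuration.

```python
def noindex_app(environ, start_response):
    """Minimal WSGI app: a crawlable response that asks not to be indexed."""
    body = b"internal report\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Header equivalent of <meta name="robots" content="noindex">.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, you could pass noindex_app to wsgiref.simple_server.make_server from the standard library and inspect the response headers with curl. The URL must not be disallowed in robots.txt, or the crawler never sees the header.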
Validation and Testing
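Google Search Console used to have a robots.txt tester that showed exactly which rule matched a given URL. Google retired it in late 2023; its replacement, the robots.txt report, shows which robots.txt files Google found, when it last fetched them, and any parse warnings or errors. Use it, and use the URL Inspection tool to check whether a specific URL is blocked. For rule-by-rule matching, Google's own parser is open source, so you can run the same logic locally. The parser behavior is subtle enough that manual inspection of a 50-line file will miss things.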
For non-Google crawlers, the reppy Python library implements the REP and lets you test locally. Feed it a robots.txt file and a URL, and it tells you whether a given user-agent would be blocked.
Always validate after edits. A stray blank line or a misplaced Allow directive has silently blocked entire sections of major sites before.
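A cheap safety net is a handful of assertions that run whenever the file changes. The sketch below uses Python's standard-library urllib.robotparser rather than reppy, purely so that it runs without extra installs. Treat it as a smoke test only: as far as I know, that parser does not understand * or $ in paths and applies rules in file order rather than by longest match, so it will disagree with Google on anything subtle. The ROBOTS_TXT content and the ExampleBot name are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Placeholder file content; in a real test you would read your deployed file.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /
"""


def can_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    assert not can_fetch("GPTBot", "https://example.com/")
    assert not can_fetch("ExampleBot", "https://example.com/admin/users")
    assert can_fetch("ExampleBot", "https://example.com/blog/")
    print("robots.txt smoke test passed")
```

Wiring a script like this into CI means a stray edit that blocks /blog/ fails the build instead of quietly removing pages from search.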
A Realistic Production Template
# robots.txt for example.com

# Block AI training crawlers (GPTBot, ClaudeBot, etc.)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Block internal/admin paths from all crawlers
User-agent: *
Disallow: /admin/
Disallow: /api/internal/
Disallow: /checkout/
Disallow: /account/
Allow: /

Sitemap: https://example.com/sitemap.xml
Note the explicit Allow: / at the end of the wildcard block — technically redundant (empty Disallow or no Disallow means full allow), but useful as documentation of intent. It makes clear the file is deliberately open except where stated.
Final Thought
Robots.txt is most powerful when you understand exactly what it controls: crawl access, not visibility, not security, not link equity. Once you internalize that boundary — and learn that Googlebot ignores Crawl-delay, that Disallow doesn't prevent indexing, and that specificity determines rule precedence — the rest follows logically. The spec is small. Read it. Then read Google's extension documentation. Then use Search Console to verify your rules behave the way you think they do. That three-step process will put you ahead of the vast majority of developers who "just wrote something that seemed right."