Does robots.txt actually prevent pages from appearing in Google search results?

Not reliably. Disallowing a URL stops Googlebot from crawling it, but if other websites link to that page, Google may still list it in search results — just without a description snippet. To fully remove a URL from the index, you need a noindex directive (meta tag or HTTP header), which Google can only read if it is allowed to crawl the page. Robots.txt and noindex must be used together carefully.

When two rules match the same URL, which one wins?

Google and Bing both follow the longest-match rule: the rule whose path string is longest wins, regardless of whether it is an Allow or Disallow, and regardless of order in the file. If two rules have the same length, the more permissive one (Allow) takes precedence. Older bots may use first-match logic, so testing against the specific bot you care about matters.

Can I use robots.txt to block AI training bots like GPTBot or anthropic-ai?

Yes — well-behaved AI crawlers respect robots.txt. Add a specific User-agent group for the bot you want to block (e.g., User-agent: GPTBot followed by Disallow: /) and compliant crawlers will honour it. However, not all scrapers are compliant, so this is a courtesy mechanism rather than a technical lock. Check each AI company's documentation for their exact User-agent string.

What is the difference between Disallow: / and Disallow: /*?

In practice, both block all paths on your site. 'Disallow: /' matches every URL because all paths start with '/'. 'Disallow: /*' uses the wildcard to explicitly match any characters after '/', which has the same effect. Either pattern stops a compliant bot from crawling any page on your domain, so avoid both in production unless you intentionally want the whole site uncrawled. Keep in mind that blocking crawling is not the same as de-indexing: URLs linked from other sites can still appear in results without a snippet.

Does a Googlebot-specific rule override the wildcard (*) rule?

Yes. When a bot finds a group with its exact name, it uses only that group's rules and completely ignores the wildcard (*) group. The groups do not combine or inherit from each other. So if you have a Googlebot group with no Disallow for /private/, Googlebot will crawl /private/ even if your wildcard group blocks it.

Why does Crawl-delay not work for Googlebot?

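Google does not honour the Crawl-delay directive. Search Console used to offer a crawl rate limiter under its legacy tools, but Google retired that setting in early 2024; Googlebot now adjusts its crawl rate automatically based on how your server responds. If you need to slow it down urgently, Google's documentation suggests temporarily returning 500, 503, or 429 status codes. Crawl-delay is honoured by Bingbot and several other crawlers (support varies, so check each crawler's documentation), so it is still worth including for those bots. Just do not rely on it to throttle Google.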

Robots.txt Tester & Validator — Icaeztool

🚦 Robots.txt Tester & Validator

Paste your robots.txt, enter a URL, pick a crawler — get the verdict instantly.

Why Your Robots.txt File Is Silently Blocking (or Exposing) the Wrong Pages

Most website owners set up robots.txt once and forget it. It sits in the root directory, doing quiet work — or quiet damage. The problem is that robots.txt syntax has enough edge cases, quirks, and ordering rules that a single misplaced slash can accidentally block Google from crawling your entire site, or just as dangerously, leave private sections wide open to every bot on the internet.

This guide explains how robots.txt actually works under the hood, what the most common mistakes look like, and how a dedicated tester can save you from a costly crawl budget disaster or an embarrassing data exposure.

What Robots.txt Actually Does — and What It Does Not

The Robots Exclusion Protocol is a voluntary standard. When a well-behaved crawler like Googlebot, Bingbot, or GPTBot visits your site, it first fetches https://yourdomain.com/robots.txt and reads the instructions before crawling anything else. If you disallow a path, a compliant bot will skip it. If you allow it, the bot proceeds.

Here is what most people get wrong: robots.txt is not a security mechanism. It is a politeness protocol. Malicious scrapers, bad bots, and anyone running a custom HTTP client will happily ignore every directive you write. If a page contains genuinely sensitive data — admin panels, payment details, user records — it must be protected at the server level with authentication. Relying on robots.txt for privacy is one of the most dangerous misconceptions in SEO.

The second misunderstanding: disallowing a URL does not deindex it. If other sites link to a disallowed page, Google may still show it in search results — just without a snippet, because it could never read the content. To properly remove a URL from Google's index, you need a noindex meta tag or HTTP header, combined (if necessary) with temporarily allowing the bot to crawl it long enough to see the noindex instruction.

How Bots Choose Which Rule to Apply

When a crawler finds multiple rules that could apply to a URL, it does not just take the first one it encounters. Google and Bing both follow the longest-match rule: whichever rule — allow or disallow — has the longest matching path wins, regardless of which appears first in the file.

This means:

User-agent: *
Disallow: /private/
Allow: /private/public.html

The file above correctly allows /private/public.html while blocking everything else under /private/. The Allow path is longer (20 characters) than the Disallow path (9 characters), so it takes precedence for that specific URL. Many webmasters write these rules expecting the first match to win, as older parsers used to work, and are surprised when a modern crawler behaves differently.

If two rules have exactly the same length, the more permissive one (Allow) wins according to Google's implementation.
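The longest-match rule and its tie-break can be sketched in a few lines of Python. This is a minimal illustration, not any crawler's actual code: the `Rule` alias and the `is_allowed` function are hypothetical names, and the sketch assumes literal path prefixes only (no `*` or `$` handling).

```python
from typing import List, Tuple

# (directive, path), e.g. ("disallow", "/private/"). Hypothetical shape for this sketch.
Rule = Tuple[str, str]


def is_allowed(rules: List[Rule], url_path: str) -> bool:
    """Apply longest-match over literal path prefixes; Allow wins a tie."""
    best_len = -1
    allowed = True  # no matching rule means the URL may be crawled
    for directive, path in rules:
        if not path or not url_path.startswith(path):
            continue  # an empty path matches nothing
        is_allow = directive.lower() == "allow"
        if len(path) > best_len or (len(path) == best_len and is_allow):
            best_len = len(path)
            allowed = is_allow
    return allowed


rules = [("disallow", "/private/"), ("allow", "/private/public.html")]
print(is_allowed(rules, "/private/public.html"))  # True: the 20-character Allow wins
print(is_allowed(rules, "/private/secret.html"))  # False: only the Disallow matches
```

Because the comparison uses path length rather than position in the file, swapping the two rules in the list gives the same answers.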

The Over-Broad Disallow Trap

The single most destructive robots.txt mistake is:

User-agent: *
Disallow: /

These two lines tell every compliant bot to crawl nothing. Search engines can no longer read any page, rankings collapse, and organic traffic dries up. It happens more often than you think: developers use this pattern in staging environments to block indexing, then accidentally deploy it to production or forget to remove it before launch.

A slightly less obvious variant:

Disallow: /*

The wildcard * in robots.txt matches any sequence of characters. /* therefore matches every possible path, same effect as /. Similarly, Disallow: /*.pdf$ blocks all PDF files — the $ anchors the match to the end of the URL, while * allows anything in the filename.
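One way to reason about `*` and `$` is to translate a rule path into a regular expression. The sketch below assumes that approach; `pattern_to_regex` and `matches` are hypothetical helper names, and real crawlers may implement matching differently.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored-at-start regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape every literal piece, then rejoin the pieces with "any characters".
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    # A trailing $ pins the match to the end of the URL; otherwise it is a prefix match.
    return re.compile(regex + (r"\Z" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    # re.match only matches from the start of the string, like a path prefix.
    return pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*", "/any/page.html"))                       # True, same as "/"
print(matches("/*.pdf$", "/docs/report.pdf"))                # True
print(matches("/*.pdf$", "/docs/report.pdf?download=1"))     # False: $ requires the URL to end in .pdf
```

Dropping the `$` turns the PDF rule into a prefix match, so `/*.pdf` would also match the URL with the query string.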

Another trap: leaving spaces in paths. Disallow: /my files/ should be written as Disallow: /my%20files/ — unencoded spaces are technically invalid and parsers handle them inconsistently across different bots.
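If rule paths are generated by a script, percent-encoding can be applied automatically. A small sketch using Python's standard `urllib.parse.quote`; the `encode_rule_path` wrapper is a hypothetical name, and it assumes any `%` already in the path belongs to an existing escape sequence.

```python
from urllib.parse import quote


def encode_rule_path(path: str) -> str:
    # Keep "/" and the robots.txt special characters "*" and "$" as they are.
    # "%" is also kept so that already-encoded paths are not encoded twice
    # (assumption: a literal percent sign never appears unencoded in the input).
    return quote(path, safe="/*$%")


print(encode_rule_path("/my files/"))    # /my%20files/
print(encode_rule_path("/my%20files/"))  # /my%20files/ (unchanged)
```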

User-Agent Specificity and Group Inheritance

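Robots.txt supports targeting specific bots by name. A group is defined by one or more User-agent: lines followed by rules, and it ends where the next User-agent: line begins. Blank lines between groups are a readability convention; under the current standard (RFC 9309) a blank line does not by itself close a group: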

User-agent: Googlebot
Allow: /
Disallow: /no-google/

User-agent: Bingbot
Disallow: /no-bing/

User-agent: *
Disallow: /private/

When Googlebot arrives, it looks first for groups that explicitly name it. It finds the Googlebot group and uses only those rules. The wildcard * group is ignored for bots that have a specific group. Bingbot similarly uses only its own group. A bot with no named group falls back to *.

A common mistake is assuming rules cascade or inherit. They do not. If you define a Googlebot group with no Disallow for /private/, Googlebot is allowed to crawl /private/ even if the wildcard group blocks it. Each group is fully independent.
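The "no inheritance" behaviour is easier to see as a lookup. The sketch below is a deliberate simplification: `select_group` is a hypothetical function, groups are stored in a plain dictionary, and names are compared exactly (ignoring case), whereas real crawlers match product tokens more loosely.

```python
from typing import Dict, List


def select_group(groups: Dict[str, List[str]], bot_name: str) -> List[str]:
    """Return the rules of the single group that applies to bot_name."""
    bot = bot_name.lower()
    for agent, rules in groups.items():
        if agent != "*" and agent.lower() == bot:
            return rules  # a named group wins; the "*" rules are not merged in
    return groups.get("*", [])  # otherwise fall back to the wildcard group, if any


groups = {
    "Googlebot": ["Allow: /", "Disallow: /no-google/"],
    "Bingbot": ["Disallow: /no-bing/"],
    "*": ["Disallow: /private/"],
}

print(select_group(groups, "Googlebot"))    # only the Googlebot rules
print(select_group(groups, "DuckDuckBot"))  # falls back to the "*" rules
```

Note that `Disallow: /private/` never appears in the Googlebot result, which is exactly why /private/ stays crawlable for Googlebot in the example above.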

Syntax Errors That Fail Silently

Robots.txt has minimal error feedback — bots simply skip lines they cannot parse, which makes debugging tricky without a tester. Common silent failures include:

Missing colon: Disallow /private/ instead of Disallow: /private/ — the directive is ignored entirely.
Rules before User-agent: Any Allow or Disallow line before the first User-agent directive is invalid and will be skipped.
Non-absolute Sitemap URL: Sitemap: /sitemap.xml should be Sitemap: https://example.com/sitemap.xml — relative paths are not valid per the spec.
Crawl-delay non-numeric value: Crawl-delay: fast is meaningless and will be ignored.
Unknown directives: Fields like NoIndex: are not part of the robots.txt spec. Google does not honour them — use the X-Robots-Tag HTTP header or a <meta name="robots"> tag instead.
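The five failure modes listed above can be caught with a short lint pass. This is a rough sketch rather than a full validator: `lint_robots`, the `KNOWN` set, and the message wording are all hypothetical, and the set of recognised directives is an assumption limited to the fields discussed in this guide.

```python
import re
from typing import List

# Directives this sketch treats as known (an assumption, not an official list).
KNOWN = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots(text: str) -> List[str]:
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing colon")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN:
            problems.append(f"line {number}: unknown directive '{field}'")
        elif field == "user-agent":
            seen_user_agent = True
        elif field in ("allow", "disallow") and not seen_user_agent:
            problems.append(f"line {number}: rule before any User-agent")
        elif field == "sitemap" and not re.match(r"https?://", value, re.I):
            problems.append(f"line {number}: Sitemap URL is not absolute")
        elif field == "crawl-delay" and not re.fullmatch(r"\d+(\.\d+)?", value):
            problems.append(f"line {number}: Crawl-delay is not numeric")
    return problems


sample = """\
Disallow: /early/
User-agent: *
Disallow /private/
Sitemap: /sitemap.xml
Crawl-delay: fast
NoIndex: /old/
"""

problems = lint_robots(sample)
for problem in problems:
    print(problem)
```

Every line of the sample except `User-agent: *` triggers one message, one for each failure mode in the list.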

Using This Tester Effectively

The tool above parses your robots.txt entirely in the browser — no data leaves your machine. Paste your file content, enter the full URL you want to test (including https://), and select the bot you want to simulate. The tester applies the same longest-match algorithm that Google and Bing use, shows you exactly which line won, and flags any syntax errors or over-broad patterns it detects.

A few practical workflows:

Before launching a site: Verify the wildcard group does not accidentally block important sections. Test your homepage, key category pages, and any paths you want crawled.

After a traffic drop: If organic traffic falls suddenly with no obvious ranking change, robots.txt is one of the first things to check. Paste the live file and test your most important URLs against Googlebot.

When blocking AI bots: Many site owners now want to block GPTBot (ChatGPT), anthropic-ai (Claude), or CCBot (Common Crawl) from scraping content for training. Test these agent names specifically — a rule that blocks * does not block named bots that have their own group, and vice versa.
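These workflows can also be scripted as a quick pre-deployment check. The sketch below uses Python's standard `urllib.robotparser`; the `check` helper, the sample file, and the example.com URLs are assumptions for illustration. One caveat: the standard-library parser applies rules in file order and does not implement `*` or `$` wildcards, so for files with overlapping Allow and Disallow rules its verdict can differ from the longest-match behaviour described earlier.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: block GPTBot everywhere, block /private/ for everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def check(robots_txt, bots, urls):
    """Return {(bot, url): allowed} for every bot/URL combination."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(bot, url): parser.can_fetch(bot, url) for bot in bots for url in urls}


results = check(
    ROBOTS_TXT,
    ["GPTBot", "Googlebot"],
    ["https://example.com/", "https://example.com/private/report"],
)
for (bot, url), allowed in results.items():
    print(f"{bot:10} {url:40} {'allowed' if allowed else 'BLOCKED'}")
```

Here GPTBot is blocked on both URLs by its own group, while Googlebot has no named group and falls back to `*`, so only the /private/ URL is blocked for it.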

Robots.txt is small but consequential. Five minutes with a proper tester before every deployment is the cheapest insurance in technical SEO.
