Robots.txt Tester & Validator
Paste your robots.txt, enter a URL, pick a crawler, and get the verdict instantly.
Why Your Robots.txt File Is Silently Blocking (or Exposing) the Wrong Pages
Most website owners set up robots.txt once and forget it. It sits in the root directory, doing quiet work or quiet damage. The problem is that robots.txt syntax has enough edge cases, quirks, and ordering rules that a single misplaced slash can accidentally block Google from crawling your entire site or, just as dangerously, leave private sections wide open to every bot on the internet.
This guide explains how robots.txt actually works under the hood, what the most common mistakes look like, and how a dedicated tester can save you from a costly crawl budget disaster or an embarrassing data exposure.
What Robots.txt Actually Does, and What It Does Not
The Robots Exclusion Protocol is a voluntary standard. When a well-behaved crawler like Googlebot, Bingbot, or GPTBot visits your site, it first fetches https://yourdomain.com/robots.txt and reads the instructions before crawling anything else. If you disallow a path, a compliant bot will skip it. If you allow it, the bot proceeds.
Here is what most people get wrong: robots.txt is not a security mechanism. It is a politeness protocol. Malicious scrapers, bad bots, and anyone running a custom HTTP client will happily ignore every directive you write. If a page contains genuinely sensitive data (admin panels, payment details, user records), it must be protected at the server level with authentication. Relying on robots.txt for privacy is one of the most dangerous misconceptions in SEO.
The second misunderstanding: disallowing a URL does not deindex it. If other sites link to a disallowed page, Google may still show it in search results, just without a snippet, because it could never read the content. To properly remove a URL from Google's index, you need a noindex directive, delivered either as a robots meta tag or as an X-Robots-Tag HTTP header. The URL must also stay crawlable (temporarily lifting the Disallow if necessary) long enough for the bot to see the noindex instruction.
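To make the noindex point concrete, here is a minimal Python sketch that checks whether a response carries a noindex signal in either place. The function name has_noindex and its arguments are hypothetical, and the sketch assumes the header value is a plain comma-separated list (it ignores bot-specific prefixes that a header may also carry).

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers, html):
    """Return True if the X-Robots-Tag header or a robots meta tag says noindex."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":  # header names are case-insensitive
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]

    parser = _RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives
```

A crawler can only run a check like this on pages it is allowed to fetch, which is why a Disallow rule and a noindex directive on the same URL work against each other.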
How Bots Choose Which Rule to Apply
When a crawler finds multiple rules that could apply to a URL, it does not just take the first one it encounters. Google and Bing both follow the longest-match rule: whichever rule (allow or disallow) has the longest matching path wins, regardless of which appears first in the file.
This means:
User-agent: *
Disallow: /private/
Allow: /private/public.html
The file above correctly allows /private/public.html while blocking everything else under /private/. The Allow path is longer (20 characters) than the Disallow path (9 characters), so it takes precedence for that specific URL. Many webmasters write these rules expecting the first match to win, as older parsers used to work, and are surprised when a modern crawler behaves differently.
If two rules have exactly the same length, the more permissive one (Allow) wins according to Google's implementation.
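The precedence logic can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not any crawler's actual code: the function name verdict and the (kind, pattern) tuple layout are hypothetical, and patterns are treated as plain prefixes with no wildcard support.

```python
def verdict(rules, path):
    """Pick the winning rule for a path using longest-match precedence.

    rules is a list of (kind, pattern) tuples where kind is "allow" or
    "disallow". Returns (allowed, winning_rule); winning_rule is None
    when nothing matches, in which case crawling is allowed by default.
    """
    best = None
    for kind, pattern in rules:
        # An empty pattern matches nothing; otherwise use prefix matching.
        if not pattern or not path.startswith(pattern):
            continue
        if best is None:
            best = (kind, pattern)
            continue
        longer = len(pattern) > len(best[1])
        # On an exact tie in length, the more permissive Allow wins.
        tie_for_allow = len(pattern) == len(best[1]) and kind == "allow"
        if longer or tie_for_allow:
            best = (kind, pattern)
    if best is None:
        return True, None
    return best[0] == "allow", best


rules = [("disallow", "/private/"), ("allow", "/private/public.html")]
allowed, winner = verdict(rules, "/private/public.html")
```

Because the comparison looks only at pattern length, swapping the two rules in the list gives the same result, which is exactly the property that surprises people who expect first-match behaviour.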
The Over-Broad Disallow Trap
The single most destructive robots.txt mistake is:
User-agent: *
Disallow: /
This two-line file tells every compliant bot to crawl nothing. No pages get crawled, search traffic dries up, and the entire site goes dark. It happens more often than you think: developers use this pattern in staging environments to block indexing, then accidentally deploy it to production or forget to remove it before launch.
A slightly less obvious variant:
Disallow: /*
The wildcard * in robots.txt matches any sequence of characters. /* therefore matches every possible path, with the same effect as /. Similarly, Disallow: /*.pdf$ blocks all PDF files: the $ anchors the match to the end of the URL, while * allows anything in the filename.
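One way to reason about * and $ is to translate a pattern into a regular expression. The Python sketch below does that; the helper names pattern_to_regex and matches are hypothetical, and it assumes $ is only meaningful as the final character of a pattern.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then join the chunks with '.*' for every '*'.
    body = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start of the path, mirroring prefix matching.
    return pattern_to_regex(pattern).match(path) is not None
```

With this translation, matches("/*.pdf$", "/docs/report.pdf") holds, while the same pattern does not match "/docs/report.pdf?download=1" because the $ anchor requires the URL to end in .pdf.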
Another trap: leaving spaces in paths. Disallow: /my files/ should be written as Disallow: /my%20files/, because unencoded spaces are technically invalid and parsers handle them inconsistently across different bots.
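If rule paths are generated by a script, percent-encoding them up front avoids the problem. A minimal Python sketch follows; the helper name encode_rule_path is hypothetical, and it assumes any % already in the path starts a valid escape sequence, so it is left untouched.

```python
from urllib.parse import quote


def encode_rule_path(path):
    """Percent-encode a rule path for use in an Allow or Disallow line."""
    # Keep '/', the robots.txt wildcards '*' and '$', and existing '%' escapes.
    return quote(path, safe="/*$%")
```

Treating % as safe means an already-encoded path such as /my%20files/ passes through unchanged instead of being encoded twice.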
User-Agent Specificity and Group Inheritance
Robots.txt supports targeting specific bots by name. A group is defined by one or more User-agent: lines followed by rules, and it runs until the next User-agent: line that follows those rules. By convention, groups are separated by a blank line:
User-agent: Googlebot
Allow: /
Disallow: /no-google/
User-agent: Bingbot
Disallow: /no-bing/
User-agent: *
Disallow: /private/
When Googlebot arrives, it looks first for groups that explicitly name it. It finds the Googlebot group and uses only those rules. The wildcard * group is ignored for bots that have a specific group. Bingbot similarly uses only its own group. A bot with no named group falls back to *.
A common mistake is assuming rules cascade or inherit. They do not. If you define a Googlebot group with no Disallow for /private/, Googlebot is allowed to crawl /private/ even if the wildcard group blocks it. Each group is fully independent.
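Group selection can be sketched as follows in Python. The function name select_rules and the data layout are hypothetical, and the sketch assumes an exact, case-insensitive match on the bot name, which is simpler than the token matching real crawlers perform.

```python
def select_rules(groups, bot_name):
    """Return the rules that apply to bot_name.

    groups is a list of (agent_names, rules) pairs. A bot with its own
    named group gets only those rules; every other bot gets the rules
    of the '*' group. Nothing is merged between the two.
    """
    bot = bot_name.lower()
    named, fallback = [], []
    has_named_group = False
    for agents, rules in groups:
        lowered = [agent.lower() for agent in agents]
        if bot in lowered:
            has_named_group = True
            named.extend(rules)
        elif "*" in lowered:
            fallback.extend(rules)
    # A named group wins even when it is empty: no inheritance from '*'.
    return named if has_named_group else fallback


groups = [
    (["Googlebot"], [("allow", "/"), ("disallow", "/no-google/")]),
    (["Bingbot"], [("disallow", "/no-bing/")]),
    (["*"], [("disallow", "/private/")]),
]
googlebot_rules = select_rules(groups, "Googlebot")
```

With the example file above, the rules selected for Googlebot do not include the /private/ block, so Googlebot may crawl /private/ while an unnamed bot may not.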
Syntax Errors That Fail Silently
Robots.txt has minimal error feedback. Bots simply skip lines they cannot parse, which makes debugging tricky without a tester. Common silent failures include:
- Missing colon: Disallow /private/ instead of Disallow: /private/. The directive is ignored entirely.
- Rules before User-agent: Any Allow or Disallow line before the first User-agent directive is invalid and will be skipped.
- Non-absolute Sitemap URL: Sitemap: /sitemap.xml should be Sitemap: https://example.com/sitemap.xml. Relative paths are not valid per the spec.
- Crawl-delay non-numeric value: Crawl-delay: fast is meaningless and will be ignored.
- Unknown directives: Fields like NoIndex: are not part of the robots.txt spec. Google does not honour them; use the X-Robots-Tag HTTP header or a <meta name="robots"> tag instead.
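The checks in this list are simple enough to script. The Python sketch below is a rough linter covering only these five cases; the function name lint, the KNOWN_FIELDS set, and the message strings are all hypothetical and do not describe how any particular tester is implemented.

```python
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint(text):
    """Return a list of (line_number, message) for suspicious lines."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing colon"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, "unknown directive"))
        elif field == "user-agent":
            seen_user_agent = True
        elif field in ("allow", "disallow") and not seen_user_agent:
            problems.append((number, "rule before first User-agent"))
        elif field == "sitemap" and not value.lower().startswith(("http://", "https://")):
            problems.append((number, "Sitemap URL is not absolute"))
        elif field == "crawl-delay":
            try:
                float(value)
            except ValueError:
                problems.append((number, "Crawl-delay is not numeric"))
    return problems
```

Running a check like this in a deployment pipeline surfaces the lines a crawler would otherwise skip without comment.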
Using This Tester Effectively
The tool above parses your robots.txt entirely in the browser, so no data leaves your machine. Paste your file content, enter the full URL you want to test (including https://), and select the bot you want to simulate. The tester applies the same longest-match algorithm that Google and Bing use, shows you exactly which line won, and flags any syntax errors or over-broad patterns it detects.
A few practical workflows:
Before launching a site: Verify the wildcard group does not accidentally block important sections. Test your homepage, key category pages, and any paths you want crawled.
After a traffic drop: If organic traffic falls suddenly with no obvious ranking change, robots.txt is one of the first things to check. Paste the live file and test your most important URLs against Googlebot.
When blocking AI bots: Many site owners now want to block GPTBot (ChatGPT), anthropic-ai (Claude), or CCBot (Common Crawl) from scraping content for training. Test these agent names specifically: a rule that blocks * does not block named bots that have their own group, and vice versa.
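For a quick scripted check of several agent names at once, Python's standard-library urllib.robotparser can help. The helper name blocked_agents and the example file are hypothetical. Note that the standard-library parser evaluates rules in file order rather than by longest match, so treat it as a rough check for simple files, not a simulation of Google or Bing.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt, url, agents):
    """Return the agents from the list that may NOT fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical file: GPTBot has its own group, every other bot uses '*'.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

ai_bots = ["GPTBot", "anthropic-ai", "CCBot"]
blocked = blocked_agents(robots_txt, "https://example.com/articles/post", ai_bots)
```

In this example only GPTBot is blocked from the article URL; the other two agents fall back to the wildcard group, which leaves everything outside /private/ open to them.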
Robots.txt is small but consequential. Five minutes with a proper tester before every deployment is the cheapest insurance in technical SEO.