🔍 Sitemap Validator & Checker
Validate sitemap.xml structure, URL count, lastmod dates, encoding and indexability issues — paste XML directly or enter a URL hint.
A sitemap.xml file is your direct communication channel to search engine crawlers. Done right, it tells Googlebot exactly which pages exist, when they were last updated, and how important they are relative to each other. Done wrong — even with subtle XML encoding issues or a wrong date format — it can silently cause Google to skip pages you spent months building. This checklist walks through every validation point that matters.
Check 1: Does Your XML Actually Parse?
The most basic failure mode is a sitemap that looks fine in a text editor but breaks the moment a crawler's XML parser hits it. Common culprits are stray characters before the XML declaration, mismatched tags, or an unclosed element after the last URL. Always run your sitemap through a strict XML parser — not just a browser, which often silently auto-corrects malformed markup. If your CMS generates the sitemap dynamically, test what's actually being served at the URL, not the template.
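The XML declaration line should look exactly like this: <?xml version="1.0" encoding="UTF-8"?>. Nothing should precede it: no whitespace, no blank line, and ideally no byte-order mark (BOM). Leading whitespace before the declaration is a fatal error for a conforming XML parser. A UTF-8 BOM (the invisible U+FEFF character that some Windows text editors add) is tolerated by conforming parsers, but less careful sitemap consumers and text-processing pipelines can choke on it, so strip it before publishing.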
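As a rough illustration of a strict parse check, the sketch below uses Python's standard-library parser. The function name check_parses and the wording of the messages are made up for this example. It also treats a missing XML declaration as a problem, which is this article's recommendation rather than an XML requirement.

```python
import xml.etree.ElementTree as ET

UTF8_BOM = b"\xef\xbb\xbf"


def check_parses(raw: bytes) -> list[str]:
    """Return a list of problems; an empty list means the bytes parsed cleanly."""
    problems = []
    if raw.startswith(UTF8_BOM):
        problems.append("file starts with a UTF-8 byte-order mark")
    elif not raw.startswith(b"<?xml"):
        problems.append("file does not start with the XML declaration")
    try:
        # A strict parser: mismatched or unclosed tags raise ParseError.
        ET.fromstring(raw)
    except ET.ParseError as exc:
        problems.append(f"XML parse error: {exc}")
    return problems
```

Feed it the bytes actually served at the sitemap URL, not the template on disk, so that anything the CMS or web server adds in front of the declaration is included in the check.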
Check 2: Root Element Must Be <urlset> or <sitemapindex>
A standard sitemap has <urlset> as its root element. A sitemap index — which lists other sitemaps rather than individual pages — uses <sitemapindex>. Neither is optional or interchangeable. Using the wrong root element, or having a root element with a typo (<UrlSet>, <url_set>), means the crawler cannot identify the document as a valid sitemap at all.
Check 3: Namespace Declaration
The sitemap namespace must be declared on the root element: xmlns="http://www.sitemaps.org/schemas/sitemap/0.9". Miss this and the document is ambiguous XML — technically parseable, but not identifiable as a sitemap. Some crawlers will still process it; others won't. Don't rely on the lenient path.
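Checks 2 and 3 can be combined into one small test. The following is a minimal sketch using Python's standard library; check_root is a hypothetical helper name, and it assumes the document already parses (Check 1).

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VALID_ROOTS = {"urlset", "sitemapindex"}


def check_root(xml_text: str) -> list[str]:
    """Report a wrong root element name or a missing/incorrect namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree renders a namespaced tag as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, _, local = root.tag[1:].partition("}")
    else:
        namespace, local = "", root.tag
    problems = []
    if local not in VALID_ROOTS:  # case-sensitive, so "UrlSet" fails
        problems.append(f"unexpected root element: {local}")
    if namespace != SITEMAP_NS:
        problems.append(f"unexpected namespace: {namespace or '(none)'}")
    return problems
```

The comparison is deliberately case-sensitive, because XML element names are case-sensitive and <UrlSet> is a different element from <urlset>.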
Check 4: The 50,000 URL and 50 MB Hard Limits
Google enforces two hard limits on sitemap files: 50,000 URLs per file and 50 MB uncompressed file size. Exceed either and the crawler stops processing at that point, silently. Pages beyond the cutoff simply don't get submitted. If your site has more than 50,000 URLs, you must split them across multiple sitemap files and reference them from a sitemap index. Gzip compression (sitemap.xml.gz) reduces transfer size, and Google handles compressed sitemaps natively; many sites cut their file size by 70–80% this way. Note that the 50 MB limit applies to the uncompressed file, so compression saves bandwidth but does not let you fit more into a single sitemap.
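One way to stay under both limits is to chunk the URL list before writing any files. The sketch below shows only the chunking and the limit test. The helper names are hypothetical, and 50 MB is taken here as 50 × 1024 × 1024 bytes.

```python
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, measured on the uncompressed file


def chunk_urls(urls: list[str], size: int = MAX_URLS) -> list[list[str]]:
    """Split a URL list into groups that each fit in one sitemap file."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def within_limits(url_count: int, uncompressed_bytes: int) -> bool:
    """True if a single sitemap file respects both limits."""
    return url_count <= MAX_URLS and uncompressed_bytes <= MAX_BYTES
```

Each chunk then becomes its own sitemap file, and the sitemap index lists one entry per file. Entries with long URLs can hit the byte limit before the URL count, so check both.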
Check 5: Every URL Must Be Absolute and HTTPS
The <loc> element inside each <url> entry must contain a fully qualified absolute URL including the scheme. /about is invalid. example.com/about is invalid. https://example.com/about is correct. Additionally, if your site serves over HTTPS (which it should), using HTTP URLs in the sitemap is a mismatch — Google may crawl the HTTP version and encounter a redirect, wasting crawl budget.
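The three cases above can be told apart with the standard library's URL splitter. This is a sketch only; classify_loc and its return strings are invented for the example.

```python
from urllib.parse import urlsplit


def classify_loc(loc: str) -> str:
    """Classify a <loc> value as 'ok', 'not absolute', or 'absolute, but not https'."""
    parts = urlsplit(loc)
    # A fully qualified URL has both a scheme and a host part.
    if not parts.scheme or not parts.netloc:
        return "not absolute"
    if parts.scheme != "https":
        return "absolute, but not https"
    return "ok"
```

For example, classify_loc("example.com/about") returns "not absolute" because, without a scheme, the whole string is parsed as a path.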
Check 6: Escaped Special Characters in URLs
XML has reserved characters that must be escaped. The most common mistake is an ampersand in a URL query string. In a plain URL you would write ?color=red&size=large, but in XML that & must be written as &amp;. An unescaped ampersand is a parse error, and a strict parser will reject the whole file rather than just the one entry. Similarly, <, >, " and ' must be written as &lt;, &gt;, &quot; and &apos; wherever they appear in a URL.
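If the sitemap is generated by string templating rather than by an XML library, the escaping has to be done explicitly. A minimal sketch with Python's standard library follows; escape_loc is a hypothetical helper name.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes are added explicitly.
EXTRA_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_loc(url: str) -> str:
    """Escape a raw URL for use as the text of a <loc> element."""
    return escape(url, EXTRA_ENTITIES)
```

Apply this exactly once, to the raw URL. Running it over a value that is already escaped would turn &amp; into &amp;amp;.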
Check 7: lastmod Format Must Be W3C Datetime
The <lastmod> tag accepts only W3C datetime format. The most common valid form is the complete date, YYYY-MM-DD (e.g., 2024-06-15). The full form includes time and timezone offset: 2024-06-15T14:30:00+05:30 or with UTC: 2024-06-15T09:00:00Z. What doesn't work: June 15, 2024, 15/06/2024, 2024-6-15 (no zero padding), or Unix timestamps.
More important than format: the date must be accurate. Google's John Mueller has explicitly said that inaccurate lastmod values (like setting every page to today's date, or never updating them) train Googlebot to ignore the field entirely. Only update lastmod when the page's content actually changes.
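The format rule can be checked mechanically. The sketch below covers only the complete-date and date-plus-time forms discussed above, not every variant the W3C note allows, and is_valid_lastmod is a hypothetical name. It cannot tell whether a date is accurate, only whether it is well-formed.

```python
import re
from datetime import date

# Complete date, optionally followed by a time and a timezone designator.
LASTMOD_RE = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})"
    r"(?:T(?:[01]\d|2[0-3]):[0-5]\d(?::[0-5]\d(?:\.\d+)?)?"
    r"(?:Z|[+-](?:[01]\d|2[0-3]):[0-5]\d))?"
)


def is_valid_lastmod(value: str) -> bool:
    """True for YYYY-MM-DD or YYYY-MM-DDThh:mm[:ss[.s]] plus Z or +hh:mm."""
    match = LASTMOD_RE.fullmatch(value)
    if match is None:
        return False
    year, month, day = (int(g) for g in match.groups())
    try:
        date(year, month, day)  # rejects impossible dates such as 2024-02-30
    except ValueError:
        return False
    return True
```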
Check 8: changefreq and priority Are Hints, Not Commands
If you include <changefreq>, it must be one of exactly seven values: always, hourly, daily, weekly, monthly, yearly, or never. Any other string is invalid. For <priority>, the value must be a decimal between 0.0 and 1.0 inclusive. A value of 1 is valid (interpreted as 1.0), but 2.0 or -1 are not.
Be aware that Google largely ignores both of these fields. They provide hints at best. Crawl frequency is determined by Google's own crawl budget algorithms, not your declared changefreq. Priority only has meaning relative to other URLs on the same domain, and even then its impact is minimal. Don't invest significant effort optimizing these fields.
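If you do keep these tags, validating them is cheap. A minimal sketch with hypothetical helper names is shown below. Note that it uses float() for simplicity, which is slightly more lenient than a strict decimal check (it would accept 1e-1, for instance).

```python
CHANGEFREQ_VALUES = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def is_valid_changefreq(value: str) -> bool:
    """True only for the seven allowed lowercase values."""
    return value in CHANGEFREQ_VALUES


def is_valid_priority(value: str) -> bool:
    """True if the text is a number between 0.0 and 1.0 inclusive."""
    try:
        number = float(value)
    except ValueError:
        return False
    return 0.0 <= number <= 1.0
```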
Check 9: No Duplicate URLs
Every URL in your sitemap should appear exactly once. Duplicate entries waste space, confuse crawlers, and signal poor sitemap generation hygiene. Duplicates commonly appear when sitemaps are generated from database queries without DISTINCT, or when sitemap plugins pull from multiple sources that overlap. If you have canonical URL handling on your site, ensure the sitemap contains only canonical URLs — not the non-canonical variants.
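Finding exact duplicates takes a few lines once the <loc> values have been extracted. In this sketch, find_duplicates is a hypothetical name and the comparison is an exact string match, so near-duplicates such as a trailing-slash variant are a canonicalization question this check will not catch.

```python
from collections import Counter


def find_duplicates(urls: list[str]) -> list[str]:
    """Return, in sorted order, every URL that appears more than once."""
    counts = Counter(urls)
    return sorted(url for url, n in counts.items() if n > 1)
```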
Check 10: URLs Must Match the Sitemap's Domain
A sitemap submitted through Google Search Console for example.com may only contain URLs on example.com. Cross-domain URLs are silently ignored. This catches teams that accidentally include CDN URLs, staging server URLs, or partner site links. It also matters if you recently migrated domains — old URLs from the previous domain won't be processed through the new sitemap.
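A simple host comparison catches stray CDN and staging URLs. The sketch below assumes an exact hostname match is what you want, so www.example.com and example.com count as different hosts. off_host_urls is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def off_host_urls(sitemap_url: str, locs: list[str]) -> list[str]:
    """Return the <loc> values whose host differs from the sitemap's own host."""
    expected = urlsplit(sitemap_url).hostname  # hostname is lowercased
    return [loc for loc in locs if urlsplit(loc).hostname != expected]
```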
Check 11: Sitemap URL Must Be Discoverable
Beyond the file's contents, the sitemap must be discoverable. The standard is referencing it in robots.txt: Sitemap: https://example.com/sitemap.xml. You can also submit it directly in Google Search Console (which provides coverage reports and error details that passive crawling doesn't). Both methods work and aren't mutually exclusive, so do both. The sitemap URL itself should return a 200 HTTP status code directly rather than a redirect; don't count on crawlers following redirects to reach a sitemap.
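To confirm that robots.txt really advertises the sitemap, you can extract its Sitemap lines. This sketch works on robots.txt text you have already fetched, and sitemap_directives is a hypothetical name. It treats the directive name as case-insensitive and ignores anything after a # comment marker.

```python
def sitemap_directives(robots_txt: str) -> list[str]:
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt file."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        # Split on the first colon only, so the URL's own "://" survives.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```

An empty result means crawlers that rely on robots.txt have no pointer to the sitemap, even if it has been submitted in Search Console.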
What to Fix First
If your sitemap has multiple issues, prioritize in this order: XML parse errors first (nothing else matters if the file can't be parsed), then missing or incorrect namespace, then URL format issues (non-absolute, spaces, unescaped characters), then file size and count limits, then lastmod format errors, and finally optional tag validation. Structural issues stop the entire file from being processed; data quality issues only affect individual entries.
Run validation after every sitemap regeneration — especially after CMS updates, URL restructuring, or adding new content types. A valid sitemap isn't a one-time task; it's an ongoing maintenance item that directly affects how quickly new and updated content appears in search results.