In our article “The 5 Biggest XML Sitemap Mistakes to Avoid,” we talked about the top mistakes we see when it comes to XML sitemap files, such as listing non-compliant pages and omitting important pages. Avoiding mistakes like these is important because having an accurate, up-to-date sitemap can ensure Google doesn’t miss any content you want indexed or waste your crawl budget on URLs you don’t care about.
So how do you audit the URLs in your sitemap? You could manually comb through your sitemap files, but that would be impractical — especially on a site with millions of URLs. An easier option would be to conduct a programmatic crawl with a tool like Botify to scan your URLs for errors.
Typically, site crawlers will start a crawl of your site from a single URL — your home page. From the home page, the crawler will follow the links on that page to other pages on your site, and then follow the links on those pages, etc.
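To make the link-following idea concrete, here is a minimal sketch of a breadth-first crawl. It is not how Botify or any particular crawler is implemented; the `crawl` function and the in-memory `SITE` link graph are hypothetical stand-ins for real page fetches.

```python
from collections import deque


def crawl(start_urls, links):
    """Return pages in the order a breadth-first crawl would visit them."""
    queue = deque(dict.fromkeys(start_urls))  # de-duplicate, keep order
    seen = set(queue)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        # Follow each link on the page, skipping pages already queued.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical link graph: page -> pages it links to.
SITE = {
    "https://example.com/": [
        "https://example.com/blog",
        "https://example.com/products",
    ],
    "https://example.com/blog": ["https://example.com/blog/post-1"],
    "https://example.com/products": ["https://example.com/"],
}

print(crawl(["https://example.com/"], SITE))
```

Because `crawl` accepts a list of start URLs, the same loop works whether the crawl begins at the home page or somewhere else.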
This isn’t the only way to crawl your site though. You can also use Botify to customize your crawl to start from:
- An XML sitemap or sitemap index file
- A text file
- Custom start URL(s)
Crawling your site from sitemaps or custom text files is great for a variety of use cases. Let’s explore how each option works, and what you can use them to accomplish.
Let’s dive in!
Sitemap Audit Option 1: Start crawl from sitemap
You can opt to start your crawl from an external file, like a sitemap or sitemap index, so that you can easily find any errors in your XML sitemaps. One important thing to note is that Botify can do more than crawl the URLs listed in your sitemap file: it can also crawl from your sitemap file.
How are these things different?
Starting a crawl from your sitemap means you not only crawl the URLs in your sitemap, but crawl the pages they link to as well.
What’s the benefit? Let’s say, for example, that a URL in your sitemap is noindexed. That’s great to know, because ideally you’d only have compliant pages in your sitemap file. However, what if that non-compliant page linked to a 404 page? If you only checked the URLs listed in your sitemap, you wouldn’t find that 404 URL unless it was also listed there.
Starting a site crawl from your sitemap is a great way to ensure that not only are the URLs in your sitemap error-free, but that the pages they link or redirect to are error-free as well.
It’s easy to take for granted that your sitemap contains only the URLs you want crawled and indexed, but crawling from your sitemap files is a valuable safeguard against wasting Google’s time on URLs you don’t even want crawled.
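The difference between checking the URLs in a sitemap and crawling from it can be sketched in a few lines. This is an illustration only: the sitemap content, the `STATUS` and `LINKS` tables, and the function names are all hypothetical, and a real crawler would fetch each page instead of reading from dictionaries. The noindex tag from the example is not modeled here; only the link to the missing page is.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products</loc></url>
  <url><loc>https://example.com/old-page</loc></url>
</urlset>
"""

# Hypothetical crawl data: status code per URL, outgoing links per page.
STATUS = {
    "https://example.com/products": 200,
    "https://example.com/old-page": 200,
    "https://example.com/deleted": 404,
}
LINKS = {"https://example.com/old-page": ["https://example.com/deleted"]}


def sitemap_urls(xml_text):
    """Extract every <loc> value from a <urlset> sitemap."""
    root = ET.fromstring(xml_text.strip())
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


def errors_in_list(urls):
    """Check only the listed URLs."""
    return [u for u in urls if STATUS[u] >= 400]


def errors_crawling_from(urls):
    """Check the listed URLs and every page reachable from them."""
    seen, stack, errors = set(urls), list(urls), []
    while stack:
        url = stack.pop()
        if STATUS[url] >= 400:
            errors.append(url)
            continue  # an error page has no links to follow
        for target in LINKS.get(url, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return errors


urls = sitemap_urls(SITEMAP_XML)
print(errors_in_list(urls))
print(errors_crawling_from(urls))
```

In this toy data, the list-only check finds nothing, while the crawl that follows links surfaces the 404 that `old-page` points to.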
How can I crawl a sitemap index file?
What if you have a sitemap index file rather than a single XML sitemap file?
Plenty of sites use a sitemap index file. If you’re not familiar, a sitemap index file is a file that lists the locations of multiple sitemap files. Think of it as a table of contents for all the XML sitemap files on your website.
Not every site needs multiple sitemaps, but it can be necessary for many large websites since XML sitemap files cannot exceed 50,000 URLs or 50MB (uncompressed).
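As a quick illustration of the URL limit, the sketch below splits a long URL list into sitemap-sized groups. The function name and the generated URLs are made up for the example, and it only enforces the 50,000-URL cap; a real generator would also need to check the 50MB size limit.

```python
MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit


def split_into_sitemaps(urls, limit=MAX_URLS_PER_SITEMAP):
    """Group URLs into consecutive chunks of at most `limit` entries."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


# Hypothetical site with 120,001 URLs.
urls = [f"https://example.com/page-{n}" for n in range(120_001)]
chunks = split_into_sitemaps(urls)
print([len(chunk) for chunk in chunks])
```

A site of this size would need three sitemap files, which is where a sitemap index file comes in.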
Sometimes, sitemap index files can pose problems for crawlers. Thankfully though, starting the crawl of your sitemap index is as easy in Botify as pasting in a link to it when you set up your crawl. We will follow your sitemap index and download every sitemap referenced in it.
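For readers curious about what following a sitemap index involves, here is a rough sketch using Python’s standard XML parser. It is an assumption-laden illustration, not Botify’s implementation: `expand`, `fetch`, and the in-memory `FILES` dictionary are hypothetical, and `fetch` stands in for an HTTP download.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'

# Hypothetical files, keyed by URL, standing in for HTTP responses.
FILES = {
    "https://example.com/sitemap_index.xml": f"""<sitemapindex {XMLNS}>
  <sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-blog.xml</loc></sitemap>
</sitemapindex>""",
    "https://example.com/sitemap-products.xml": f"""<urlset {XMLNS}>
  <url><loc>https://example.com/products/widget</loc></url>
</urlset>""",
    "https://example.com/sitemap-blog.xml": f"""<urlset {XMLNS}>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>""",
}


def expand(url, fetch):
    """Return page URLs from a sitemap, following index files if present."""
    root = ET.fromstring(fetch(url))
    if root.tag.endswith("}sitemapindex"):
        pages = []
        # An index lists other sitemap files, so expand each one in turn.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            pages.extend(expand(loc.text.strip(), fetch))
        return pages
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


print(expand("https://example.com/sitemap_index.xml", FILES.__getitem__))
```

The same function handles a plain sitemap too, since it only recurses when the root element is a `sitemapindex`.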
Comparing sitemap crawls to “crawl from home page”
Sitemaps are a great way to send Google information about what you want crawled, but they’re not a guarantee that Google will find and crawl all those pages. When it comes to your website, a good way to learn how much you’re relying on Google to “figure it out” on its own is to compare a crawl from your home page to a crawl from your sitemap.
In other words, is there a disparity between what you’re feeding Google in your sitemap and what Google can easily access from your site architecture?
You’ll not only want to make sure your important pages are in your sitemap correctly, but also make sure those pages are accessible by links on your other pages. Making pages accessible in your site architecture is important for helping Google find your important content, and it’s also important for helping your visitors navigate your site!
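Once you have the URL lists from both crawls, the comparison itself is a set operation. The sketch below assumes you have exported each crawl as a plain list of URLs; the function name and sample URLs are hypothetical.

```python
def compare_crawls(sitemap_urls, home_crawl_urls):
    """Bucket URLs by where they were found."""
    in_sitemap = set(sitemap_urls)
    in_crawl = set(home_crawl_urls)
    return {
        # Listed in the sitemap but not reachable by links from the home page.
        "only_in_sitemap": sorted(in_sitemap - in_crawl),
        # Reachable by links but missing from the sitemap.
        "only_in_crawl": sorted(in_crawl - in_sitemap),
        "in_both": sorted(in_sitemap & in_crawl),
    }


report = compare_crawls(
    sitemap_urls=["https://example.com/", "https://example.com/landing"],
    home_crawl_urls=["https://example.com/", "https://example.com/tag/misc"],
)
print(report)
```

URLs that appear only in the sitemap are the ones you are relying on Google to “figure out” without help from your internal links.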
Sitemap Audit Option 2: Crawling a text file
Sometimes, you don’t need to crawl your entire sitemap. There are plenty of instances where you may want to crawl only a specific set of URLs. For that, you can use Botify’s “crawl from text file” option. Simply add all the URLs you want to crawl into a text file and we’ll crawl just those URLs.
There are lots of possible use cases for this, including:
- Ongoing monitoring of paid search URLs: You can create a text file that includes all your paid search URLs to use specifically for error monitoring. Never waste money on clicks to dead pages again!
- Ongoing monitoring of “VIP” URLs: You can create a text file listing your most valuable URLs (e.g. your high-value product pages) for ongoing monitoring to make sure nothing happens to them. You can even use Botify to configure a report that alerts you if any of these pages return error status codes!
- Auditing staging URLs before launch: You can add your staging URLs to a text file to crawl them for quality before releasing them into the wild. This can help you prevent launching pages with errors.
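The monitoring use cases above boil down to comparing status codes between crawls. Here is a minimal sketch of that check, assuming each crawl is available as a dictionary of URL to status code. The function and data are hypothetical and do not represent Botify’s alerting feature.

```python
def new_errors(previous, current):
    """Return (url, status) pairs that now fail but did not fail before."""
    alerts = []
    for url, status in current.items():
        # URLs missing from the previous crawl are treated as previously OK.
        was_ok = previous.get(url, 200) < 400
        if status >= 400 and was_ok:
            alerts.append((url, status))
    return sorted(alerts)


yesterday = {"https://example.com/sale": 200, "https://example.com/gift": 404}
today = {"https://example.com/sale": 404, "https://example.com/gift": 404}
print(new_errors(yesterday, today))
```

Only the newly broken page is reported, so a page that was already failing does not trigger the same alert every day.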
It’s great to be able to audit your sitemap as a whole, but there are plenty of really specific use cases where it’s valuable to be able to crawl a custom list of URLs too.
For example, if you only had the ability to crawl from your home page or from your official XML sitemap file, you wouldn’t be able to crawl pages “outside your website” (e.g. hidden landing pages). Using a text file to crawl specific URLs means you can run a different crawl every day, enabling you to zoom in and drill down on specific sections of your site if needed.
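Before uploading a URL list, it can help to clean it up. The sketch below assumes a text file with one URL per line, which is an assumption about the format and not a documented Botify requirement; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def read_url_list(text):
    """Return unique, well-formed http(s) URLs in their original order."""
    urls = []
    seen = set()
    for line in text.splitlines():
        url = line.strip()
        if not url or url in seen:
            continue  # skip blank lines and duplicates
        parts = urlsplit(url)
        # Keep only absolute URLs with an http or https scheme and a host.
        if parts.scheme in ("http", "https") and parts.netloc:
            seen.add(url)
            urls.append(url)
    return urls


SAMPLE = """
https://example.com/landing/spring-sale
https://example.com/landing/spring-sale

example.com/missing-scheme
https://example.com/products/widget
"""
print(read_url_list(SAMPLE))
```

Entries without a scheme are dropped here; depending on your workflow, you may prefer to log them for review instead.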
Sitemap Audit Option 3: Crawling from specific URLs
Many site crawls start from the home page, but you can also start your crawls from specific URLs on your site.
For example, instead of starting a crawl from mywebsite.com, you could start it from mywebsite.com/blog or mywebsite.com/products.
This option allows you to start the crawl from a specific page or subfolder, but the crawler will still follow the links it discovers there and go on to crawl the rest of the site it can reach. This can reveal how well connected each section of your website is to the rest of your website.
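One way to picture “how well-connected” a section is: count how many clicks it takes to reach each page from your chosen start URL. The sketch below does this over a hypothetical in-memory link graph; `click_depths` and `SITE` are illustrative names, and a real crawl would discover the links by fetching pages.

```python
from collections import deque


def click_depths(start, links):
    """Map each reachable page to its click distance from `start`."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in depths:
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths


# Hypothetical link graph: page -> pages it links to.
SITE = {
    "mywebsite.com": ["mywebsite.com/blog", "mywebsite.com/products"],
    "mywebsite.com/blog": ["mywebsite.com/blog/post-1", "mywebsite.com"],
    "mywebsite.com/blog/post-1": ["mywebsite.com/blog"],
    "mywebsite.com/products": ["mywebsite.com/products/widget"],
    "mywebsite.com/products/widget": [],
}

print(click_depths("mywebsite.com/blog", SITE))
```

In this toy graph, the product page sits three clicks away from the blog, because the only path runs back through the home page. Pages missing from the result are not reachable from the start URL at all.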
Configuring your crawl to match your goals
The crawl option you choose will always depend on what you want to achieve — there is no “best way” to crawl your site. You may be performing a comprehensive sitemap audit, in which case “crawl from sitemap” would be a great option. Or, you may only want to analyze a specific group of URLs from your sitemap, in which case using the “text file” option would be helpful.
Whatever option you choose though, Botify can help you do it fast. Our cloud-based crawler can audit up to 250 URLs per second.
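To put that rate in perspective, here is some back-of-the-envelope arithmetic. It treats 250 URLs per second as a sustained rate, which is a best-case assumption, since “up to” means real crawls may run slower. The function name is hypothetical.

```python
def min_crawl_seconds(url_count, urls_per_second=250):
    """Lower bound on crawl time at a constant crawl rate."""
    return url_count / urls_per_second


seconds = min_crawl_seconds(1_000_000)
print(f"{seconds:.0f} seconds, or about {seconds / 60:.0f} minutes")
```

At that best-case rate, a site with a million URLs could be crawled in roughly an hour.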