In our article “The 5 Biggest XML Sitemap Mistakes to Avoid,” we talked about the top mistakes we see when it comes to XML sitemap files, such as listing non-indexable pages and omitting important pages. Avoiding mistakes like these is important because having an accurate, up-to-date sitemap can ensure Google doesn’t miss any content you want indexed or waste your crawl budget on URLs you don’t care about.
So how do you audit the URLs in your sitemap? You could manually comb through your sitemap files, but that would be impractical — especially on a site with millions of URLs. An easier option would be to conduct a programmatic crawl with a tool like Botify to scan your URLs for errors.
Typically, site crawlers will start a crawl of your site from a single URL — your home page. From the home page, the crawler will follow the links on that page to other pages on your site, and then follow the links on those pages, etc.
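To make the "follow the links" idea concrete, here is a minimal sketch of a breadth-first crawl. It walks a small in-memory link graph rather than fetching real pages, and the `crawl` function and `SITE_LINKS` data are hypothetical names invented for illustration, not how Botify's crawler is implemented.

```python
from collections import deque

def crawl(start_urls, links):
    """Breadth-first crawl: visit the start URLs, then every page reachable by links."""
    queue = deque(dict.fromkeys(start_urls))  # de-duplicate, keep order
    seen = set(queue)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return visited

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/blog", "/products"],
    "/blog": ["/blog/post-1", "/"],
    "/products": ["/products/widget"],
}

print(crawl(["/"], SITE_LINKS))
# ['/', '/blog', '/products', '/blog/post-1', '/products/widget']
```

Because `crawl` accepts a list of start URLs, the same loop works whether the starting point is a home page or some other set of URLs.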
This isn’t the only way to crawl your site though. You can also use Botify to customize your crawl to start from a sitemap or sitemap index, a custom text file of URLs, or a specific URL on your site.
Crawling your site from sitemaps or custom text files is great for a variety of use cases. Let’s explore how each option works, and what you can use them to accomplish.
Let’s dive in!
You can opt to start your crawl from an external file, like a sitemap or sitemap index, so that you can easily find any errors in your XML sitemaps. One important thing to note is that in Botify, you can do more than crawl the URLs listed in your sitemap file: you can also crawl from your sitemap file.
How are these things different?
Starting a crawl from your sitemap means you not only crawl the URLs in your sitemap, but crawl the pages they link to as well.
What’s the benefit? Let’s say, for example, that a URL in your sitemap is noindexed. That’s great to know, because ideally you’d only have indexable pages in your sitemap file. However, what if that non-indexable page linked to a page that returns a 404? If you only crawled the URLs listed in your sitemap, you wouldn’t find that broken URL unless it was also listed there.
Starting a site crawl from your sitemap is a great way to ensure that not only are the URLs in your sitemap error-free, but that the pages they link or redirect to are error-free as well.
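The difference between checking only the listed URLs and crawling from the sitemap can be sketched with a toy example. Everything here is an assumption made for illustration: `PAGES` is made-up crawl data, and `check_listed` and `crawl_from` are hypothetical helpers, not Botify features.

```python
from collections import deque

# Hypothetical crawl data: page -> (HTTP status, outgoing links)
PAGES = {
    "/landing": (200, ["/old-promo"]),
    "/pricing": (200, []),
    "/old-promo": (404, []),
}
SITEMAP_URLS = ["/landing", "/pricing"]

def check_listed(urls):
    """Check only the URLs listed in the sitemap."""
    return {url: PAGES.get(url, (404, []))[0] for url in urls}

def crawl_from(urls):
    """Check the listed URLs, then keep following the links they contain."""
    statuses = {}
    queue = deque(urls)
    while queue:
        url = queue.popleft()
        if url in statuses:
            continue  # already checked
        status, links = PAGES.get(url, (404, []))
        statuses[url] = status
        queue.extend(links)
    return statuses

def errors(statuses):
    """Return the URLs that answered with a 4xx or 5xx status."""
    return sorted(url for url, status in statuses.items() if status >= 400)

print(errors(check_listed(SITEMAP_URLS)))  # []
print(errors(crawl_from(SITEMAP_URLS)))    # ['/old-promo']
```

The listed URLs look clean on their own; the broken page only surfaces once the crawl follows the link out of `/landing`.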
It’s easy to take for granted that the URLs in your sitemap are only the URLs you want crawled and indexed, but crawling from your sitemap files is a valuable safeguard against wasting Google’s time on URLs you don’t even want crawled.
What if you have a sitemap index file rather than a single XML sitemap file?
Plenty of sites use a sitemap index file. If you’re not familiar, a sitemap index file is a file that lists multiple sitemap files. Think of it as a directory that points to all the XML sitemap files on your website.
Not every site needs multiple sitemaps, but they can be necessary for many large websites, since a single XML sitemap file cannot exceed 50,000 URLs or 50 MB (uncompressed).
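As a rough illustration of why large sites end up with several sitemap files, the sketch below splits a URL list into groups that respect the 50,000-URL limit. The `split_into_sitemaps` helper and the example URLs are hypothetical, and the sketch only enforces the URL count; a real generator would also need to keep each file under the 50 MB size limit.

```python
MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit for XML sitemaps

def split_into_sitemaps(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive groups of at most `limit` URLs."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]

# Hypothetical site with 120,000 product pages.
urls = [f"https://www.example.com/products/{n}" for n in range(120_000)]
groups = split_into_sitemaps(urls)

print([len(group) for group in groups])  # [50000, 50000, 20000]
```

Each group would become its own sitemap file, and the sitemap index would then list those three files.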
Sometimes, sitemap index files can pose problems for crawlers. Thankfully though, crawling from your sitemap index in Botify is as easy as pasting in a link to it when you set up your crawl. We will follow your sitemap index and download every sitemap referenced in it.
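For readers curious what "following" a sitemap index involves, here is a minimal sketch that reads the child sitemap locations out of an index using Python's standard library. The XML sample and the `child_sitemaps` function are hypothetical; this shows the general idea under the sitemaps.org format, not Botify's actual implementation.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap index with two child sitemaps.
SITEMAP_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-products.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-blog.xml</loc></sitemap>
</sitemapindex>"""

def child_sitemaps(index_xml):
    """Return the URL of every sitemap referenced in a sitemap index."""
    root = ET.fromstring(index_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", NS)
        if loc.text  # skip empty <loc> elements
    ]

for sitemap_url in child_sitemaps(SITEMAP_INDEX):
    print(sitemap_url)
```

A crawler would then download each of those files in turn and read the page URLs listed inside them.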
Sitemaps are a great way to send Google information about what you want crawled, but they’re not a guarantee that Google will find and crawl all those pages. When it comes to your website, a good way to learn how much you’re relying on Google to “figure it out” on its own is to compare a crawl from your home page to a crawl from your sitemap.
In other words, is there a disparity between what you’re feeding Google in your sitemap and what Google can easily access from your site architecture?
You’ll want to make sure your important pages are listed correctly in your sitemap, and also that those pages are reachable through links on your other pages. Making pages accessible in your site architecture is important for helping Google find your important content, and it’s just as important for helping your visitors navigate your site!
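Once you have both sets of URLs, the comparison itself is simple set arithmetic. The sketch below assumes you have exported two URL lists, one found by crawling from the home page and one listed in the sitemap; the `compare_crawls` helper and the sample URLs are hypothetical.

```python
def compare_crawls(from_home, from_sitemap):
    """Compare URLs reachable from the home page with URLs listed in the sitemap."""
    home, sitemap = set(from_home), set(from_sitemap)
    return {
        # In the sitemap, but not reachable through internal links.
        "only_in_sitemap": sorted(sitemap - home),
        # Reachable through links, but missing from the sitemap.
        "only_in_structure": sorted(home - sitemap),
    }

# Hypothetical exports from two crawls.
found_from_home = ["/", "/blog", "/products"]
listed_in_sitemap = ["/", "/products", "/landing/spring-sale"]

report = compare_crawls(found_from_home, listed_in_sitemap)
print(report["only_in_sitemap"])    # ['/landing/spring-sale']
print(report["only_in_structure"])  # ['/blog']
```

URLs that appear only in the sitemap are the ones you are relying on Google to discover without any help from your internal linking.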
Sometimes, you don’t need to crawl your entire sitemap. There are plenty of instances where you may want to crawl only a specific set of URLs. For that, you can use Botify’s “crawl from text file” option. Simply add all the URLs you want to crawl into a text file and we’ll crawl just those URLs.
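If you build that text file from a spreadsheet or an analytics export, it is worth tidying the list first. The sketch below assumes a plain format of one URL per line, which is an assumption rather than a documented Botify requirement, and `load_url_list` is a hypothetical helper name.

```python
def load_url_list(text):
    """Return the URLs from a one-URL-per-line text, without blanks or duplicates."""
    urls = []
    seen = set()
    for line in text.splitlines():
        url = line.strip()
        if not url or url in seen:
            continue  # skip empty lines and repeated URLs
        seen.add(url)
        urls.append(url)
    return urls

# Hypothetical contents of a URL list file.
raw = """https://www.example.com/landing/spring-sale

  https://www.example.com/landing/newsletter
https://www.example.com/landing/spring-sale
"""

print(load_url_list(raw))
# ['https://www.example.com/landing/spring-sale', 'https://www.example.com/landing/newsletter']
```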
There are lots of possible use cases for this, including:
It’s great to be able to audit your sitemap as a whole, but there are plenty of really specific use cases where it’s valuable to be able to crawl a custom list of URLs too.
For example, if you only had the ability to crawl from your home page or from your official XML sitemap file, you wouldn’t be able to crawl pages “outside your website” (e.g. hidden landing pages that nothing links to). Using a text file to crawl specific URLs means you can run a different crawl every day, enabling you to zoom in and drill down on specific sections of your site if needed.
Many site crawls start from the home page, but you can also start your crawls from specific URLs on your site.
For example, instead of starting a crawl from mywebsite.com, you could start it from mywebsite.com/blog or mywebsite.com/products.
This option allows you to start the crawl from a specific page or subfolder, but it will continue to crawl the entire site from the links discovered on those pages. This can reveal how well-connected different sections of your website are to the rest of your website.
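One way to picture "how well-connected" a section is: compare what a crawl can reach from the section's start page with what it can reach from the home page. The sketch below does this on a made-up link graph; `reachable` and `LINKS` are hypothetical names, and a real crawl would work from fetched pages rather than a hard-coded dictionary.

```python
def reachable(start, links):
    """Return every page that can be reached by following links from `start`."""
    seen = {start}
    stack = [start]
    while stack:
        for target in links.get(stack.pop(), []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

# Hypothetical site where the blog never links back to the rest of the site.
LINKS = {
    "/": ["/blog", "/products"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/blog"],
    "/products": ["/", "/products/widget"],
}

whole_site = reachable("/", LINKS)
from_blog = reachable("/blog", LINKS)

# Pages a crawl starting at /blog never finds.
print(sorted(whole_site - from_blog))  # ['/', '/products', '/products/widget']
```

In this example a crawl that starts at `/blog` stays inside the blog, which suggests the section needs links back to the home page and product pages.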
The crawl option you choose will always depend on what you want to achieve — there is no “best way” to crawl your site. You may be performing a comprehensive sitemap audit, in which case “crawl from sitemap” would be a great option. Or, you may only want to analyze a specific group of URLs from your sitemap, in which case using the “text file” option would be helpful.
Whatever option you choose though, Botify can help you do it fast. Our cloud-based crawler can audit up to 250 URLs per second.