At this point in our ongoing Crawl Budget series (Botify’s response to Google’s confirmation that crawl budget is real and that it can and should be optimized), you’ve learned how to identify your Crawl Ratio and the value of knowing Google’s Crawl Frequency.
Now, we’ll look at how to know whether you’re spending your Crawl Budget wisely and some steps to prevent waste.
Ways to Waste Crawl Budget
In Gary Illyes’ Google Webmaster Central article about crawl budget, he described several factors that can negatively affect crawl budget. Here’s an excerpt of that list:
- Faceted navigation and session identifiers
- On-site duplicate content
- Soft error pages
- Hacked pages
- Infinite spaces and proxies
- Low quality and spam content
We would add to that list cases where tools meant for constructive SEO outcomes are used at such a scale that they become corrosive, causing Google to spend too many resources on URLs that won’t drive traffic of their own. Two examples are:
- High percentages of redirects or 404 errors as a share of crawl and/or in site structure
- High percentage of non-canonical URLs in site structure
How To Identify Potential Crawl Budget Waste
Let’s start with the easiest to identify: the share of response codes other than 200 or 304 in your crawl activity. In Google Search Console you can get a list of errors and a trend line, but not a daily percentage of total crawl. So to get to the facts, you need log file analysis for SEO.
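As a rough illustration of that log analysis, here is a minimal Python sketch that computes the daily share of crawler hits returning something other than 200 or 304. It assumes access logs in the common "combined" format and identifies the crawler by a user-agent substring, which can be spoofed; the function name `daily_bad_share` and the regular expression are illustrative assumptions, not part of any Botify or Google tooling.

```python
import re
from collections import defaultdict

# Assumed log layout: combined log format, e.g.
# 66.249.66.1 - - [30/Jan/2017:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Googlebot/2.1"
LOG_PATTERN = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]+\] '
    r'"(?:\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

# Response codes that do not count as wasted crawl.
GOOD_STATUSES = {"200", "304"}


def daily_bad_share(lines, bot_token="Googlebot"):
    """Return {day: share of bot hits whose status is not 200 or 304}."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for line in lines:
        # Naive crawler detection by user-agent substring (an assumption).
        if bot_token not in line:
            continue
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        day = match.group("day")
        totals[day] += 1
        if match.group("status") not in GOOD_STATUSES:
            bad[day] += 1
    return {day: bad[day] / totals[day] for day in totals}
```

Feeding the function an open log file (for example `daily_bad_share(open("access.log"))`, where the file name is hypothetical) gives one ratio per day, which is the daily percentage that Search Console does not provide.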
Waste Example: Share of Bad HTTP Codes
In the example below we see a small publisher (< 100k pages) that already had a somewhat high share of non-200 response codes in its crawl (40% on average). But one day (Jan. 31, the blank spot in the chart) a bug was introduced that caused previously unseen malformed URLs to be crawled. Those bad URLs eventually consumed 90% of the crawl.
The malformed URL pattern was classified (error, in the chart below) using URL segmentation to make it easier to understand its effect on the rest of the site. The chart below shows that all of this change was attributable to the malformed URL problem and the effect was to marginalize crawl of the publisher’s core, valuable page types.
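URL segmentation of this kind can be approximated with an ordered list of pattern rules. The sketch below is only an illustration: the segment names and regular expressions in `SEGMENT_RULES` are hypothetical, since the publisher's actual malformed pattern is not given here, and this is not how Botify implements segmentation.

```python
import re
from collections import Counter

# Hypothetical segment rules, checked in order; the first match wins.
SEGMENT_RULES = [
    ("error", re.compile(r"%22|//")),    # assumed signature of malformed URLs
    ("article", re.compile(r"^/news/")),  # assumed core page type
    ("tag", re.compile(r"^/tag/")),
]


def classify(path, rules=SEGMENT_RULES):
    """Return the name of the first segment whose pattern matches the path."""
    for name, pattern in rules:
        if pattern.search(path):
            return name
    return "other"


def segment_shares(paths, rules=SEGMENT_RULES):
    """Return {segment: share of crawled paths that fall in that segment}."""
    counts = Counter(classify(path, rules) for path in paths)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()} if total else {}
```

Running `segment_shares` on the paths Google requested each day shows how much of the crawl each page type receives, and makes it visible when one segment starts crowding out the others.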
Since there was a clear pattern in the URLs, this publisher can update its robots.txt file to stop spending crawl budget on URLs that will not help it drive traffic.
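Before deploying a rule like that, it can help to check which URLs it would actually block. The following sketch uses Python's standard `urllib.robotparser`; the `/broken/` prefix is a hypothetical stand-in for the publisher's real pattern. Note that this standard-library parser matches path prefixes only, so rules relying on the `*` and `$` wildcards that Google supports would need a different tester.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "/broken/" stands in for the real malformed pattern.
ROBOTS_TXT = """\
User-agent: *
Disallow: /broken/
"""


def blocked_paths(robots_txt, paths, user_agent="Googlebot"):
    """Return the paths that the given robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]
```

For example, `blocked_paths(ROBOTS_TXT, ["/broken/x", "/news/a"])` lets you confirm that the malformed URLs are excluded while the core page types remain crawlable.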
Waste Example: Share of Non-Canonical URLs
With classification of important SEO attributes and integration with log files, it becomes relatively easy to identify crawl of non-compliant URLs. The e-commerce site below has an extreme case of crawlable non-canonical URLs: they made up 97% of the one million pages crawled by Botify.
Even though the compliant URLs numbered only about 25,000, Google managed to crawl only a little more than half of them in the course of a month. As we can see below, Google’s crawl budget allowed for more than the total number of compliant URLs, but the remainder of the budget was spent on non-compliant URLs.
This is unfortunate, since the site could potentially have achieved a crawl ratio near 100%, making it more likely that more pages would drive traffic. Another possible result of keeping this mass of non-canonical URLs out of the crawl is that more pages could be crawled more frequently. As we saw in the previous crawl budget blog post, more frequently crawled pages tend to produce more visits.
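To make the arithmetic concrete, here is a small sketch that derives both figures from two inputs: the list of compliant URLs from a site crawl and the list of URLs Google requested during the period. It assumes crawl ratio means the share of compliant URLs that Google crawled at least once; the function and key names are illustrative.

```python
def crawl_budget_summary(compliant_urls, google_hits):
    """Summarize how a crawl budget was spent.

    compliant_urls: URLs considered compliant (for example, from a site crawl).
    google_hits: one entry per Google request seen in the logs (repeats allowed).
    """
    compliant = set(compliant_urls)
    # Distinct compliant URLs that Google requested at least once.
    crawled_compliant = {url for url in google_hits if url in compliant}
    # Number of requests (not distinct URLs) that landed on compliant URLs.
    hits_on_compliant = sum(1 for url in google_hits if url in compliant)
    total_hits = len(google_hits)
    return {
        "crawl_ratio": len(crawled_compliant) / len(compliant) if compliant else 0.0,
        "wasted_share": (total_hits - hits_on_compliant) / total_hits if total_hits else 0.0,
    }
```

A low `crawl_ratio` combined with a high `wasted_share` is the situation described above: the budget was large enough to cover the compliant pages, but much of it went elsewhere.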
Waste Example: Faceted Navigation and On-Site Duplicate Content
We can use the sharp increase in crawl of non-compliant URLs in the chart above as a launching point to investigate the change.
In this case, the change appears to have allowed crawlers into the site’s faceted navigation. Google called this problem out as a waste of crawl resources years ago, yet it still exists as a problem to be managed.
We can see that just for the lipstick category, there are nearly 200,000 URLs that are canonical to another page. (The domain and major URL patterns have been obfuscated for customer privacy.)
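A count like this can be reproduced from any crawl export that lists each URL alongside its declared canonical. The sketch below groups non-canonical URLs by their first path segment; treating that segment as the category is an assumption, as are the function name and the input shape.

```python
from collections import Counter
from urllib.parse import urlsplit


def non_canonical_by_category(pages):
    """Count URLs whose canonical points to a different URL, per category.

    pages: iterable of (url, canonical_url) pairs.
    The category is assumed to be the first segment of the URL path.
    """
    counts = Counter()
    for url, canonical in pages:
        if canonical and canonical != url:
            segments = urlsplit(url).path.strip("/").split("/")
            counts[segments[0] or "(root)"] += 1
    return counts
```

Sorting the result with `counts.most_common()` surfaces the categories where faceted navigation generates the most non-canonical URLs.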
An increase in crawl volume exclusively on non-compliant URLs seems counterintuitive. The reason is likely that the site released code that introduced these as new URLs or removed restrictions from crawling them (removal of nofollow in links or meta tags or removal of a robots.txt disallow). Google likes to discover new URLs in case they will add value to its index.
When Best Practices Reach Their Limits
This is an example where the site owner followed search engine recommendations: to use canonical links for faceted navigation or near-duplicate content. Unfortunately, in this case the scale of URLs means that, in order to canonicalize everything correctly, Google would need to crawl it all, taking attention away from the core, strategic pages that can drive traffic.
Since these pages are new and not very differentiated, it seems like a better option would have been to keep them from being crawled at all. That would have allowed crawl budget to be spent on valuable URLs while still letting users have the search refinement functionality.
Spending Crawl Budget Wisely
Like any budget, Crawl Budget is a strategic resource to be managed to achieve your business objectives. Actively monitoring it can prevent you from squandering it on pages that aren’t helping you drive more traffic and revenue. You can decide what is allowed to be crawled and organize your website to influence the Crawl Ratio and Crawl Frequency to produce the most traffic possible with the content you have.
Please share your experience with Crawl Budget waste, feedback, or questions about this post in the comments below! Look for future installments of this series to cover in more detail other factors that negatively affect Crawl Budget.