Crawl Budget Is A Finite Resource: Spend It Wisely

Posted on 9th March 2017 by Jeff

At this point in our ongoing Crawl Budget series (Botify’s response to Google’s confirmation that crawl budget is real and that it can and should be optimized), you’ve learned how to identify your Crawl Ratio and the value of knowing Google’s Crawl Frequency.

Now, we’ll look at how to know whether you’re spending your Crawl Budget wisely and some steps to prevent waste.

Ways to Waste Crawl Budget

In Gary Illyes’ Google Webmaster Central article about crawl budget, he described several factors that can negatively affect crawl budget. Here’s an excerpt of that list:

  • Faceted navigation and session identifiers
  • On-site duplicate content
  • Soft error pages
  • Hacked pages
  • Infinite spaces and proxies
  • Low quality and spam content

We would add to that list cases where techniques that normally serve constructive SEO purposes are used at such a scale that they become corrosive, causing Google to spend too many resources on URLs that won’t drive traffic themselves. Two examples are:

  • High percentages of redirects or 404 errors as a share of crawl and/or in site structure
  • High percentage of non-canonical URLs in site structure

How To Identify Potential Crawl Budget Waste

Let’s start with the easiest waste to identify: the share of response codes other than 200 or 304 in your crawl activity. Google Search Console gives you a list of errors and a trend line, but not the daily percentage of total crawl. To get to the facts, you need log file analysis for SEO.
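As a rough illustration of that measurement, here is a minimal Python sketch that computes the daily share of Googlebot hits answered with something other than 200 or 304. It assumes access logs in the common "combined" format and identifies Googlebot by user-agent string only, without verifying that the requester really is Google. The `hit` helper, the sample paths, and the sample numbers are made up for the example.

```python
import re
from collections import defaultdict

# Assumed layout (combined log format):
# ip - - [day/Mon/year:time zone] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] "[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
GOOD_STATUSES = {"200", "304"}


def daily_bad_status_share(lines):
    """Return {day: share of Googlebot hits whose status is not 200 or 304}."""
    total = defaultdict(int)
    bad = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line)
        # Skip unparseable lines and hits from other user agents.
        if not match or "Googlebot" not in match.group("agent"):
            continue
        day = match.group("day")
        total[day] += 1
        if match.group("status") not in GOOD_STATUSES:
            bad[day] += 1
    return {day: bad[day] / total[day] for day in total}


def hit(day, path, status, agent="Googlebot/2.1 (+http://www.google.com/bot.html)"):
    """Build one made-up log line for the sample below."""
    return (f'66.249.66.1 - - [{day}:10:00:00 +0000] '
            f'"GET {path} HTTP/1.1" {status} 512 "-" "{agent}"')


sample = [
    hit("30/Jan/2017", "/article/1", 200),
    hit("30/Jan/2017", "/old-page", 404),
    hit("31/Jan/2017", "/article/1", 304),
    hit("31/Jan/2017", "/bad/1", 404),
    hit("31/Jan/2017", "/bad/2", 404),
    hit("31/Jan/2017", "/bad/3", 404),
    hit("31/Jan/2017", "/article/2", 200, agent="Mozilla/5.0"),  # not Googlebot
]

shares = daily_bad_status_share(sample)
for day, share in shares.items():
    print(f"{day}: {share:.0%} of Googlebot hits were not 200/304")
```

Counting hits rather than distinct URLs is a deliberate choice here, since crawl budget is spent per request.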

Waste Example: Share of Bad HTTP Codes

In the example below, we see a small publisher (< 100k pages) that already had a somewhat high share of non-200 response codes in its crawl (40% on average). Then one day (Jan. 31, the blank spot in the chart), a bug was introduced that caused previously unseen malformed URLs to be crawled. Those bad URLs eventually consumed 90% of the crawl.

[Figure: Botify Log Analyzer trended share of HTTP status codes in Google crawl]

The malformed URL pattern was classified (labeled “error” in the chart below) using URL segmentation to make it easier to understand its effect on the rest of the site. The chart below shows that all of this change was attributable to the malformed URL problem, and that its effect was to marginalize the crawl of the publisher’s core, valuable page types.

[Figure: Botify Log Analyzer trended share of crawl by pagetype]
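The idea behind URL segmentation can be sketched in a few lines of Python: each crawled path is assigned to the first segment whose pattern matches, and the share of crawl per segment is then counted. The segment names and patterns below are hypothetical, since the post does not disclose the publisher’s actual URL structure or the malformed pattern.

```python
import re
from collections import Counter

# Hypothetical segments; order matters, the first matching pattern wins.
SEGMENTS = [
    ("error", re.compile(r"/undefined/")),   # stand-in for the malformed pattern
    ("article", re.compile(r"^/article/")),
    ("section", re.compile(r"^/section/")),
]


def segment_of(path):
    """Return the name of the first segment matching the path, or 'other'."""
    for name, pattern in SEGMENTS:
        if pattern.search(path):
            return name
    return "other"


def crawl_share_by_segment(paths):
    """Return {segment: share of crawled paths falling in that segment}."""
    counts = Counter(segment_of(path) for path in paths)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Made-up crawl sample: half of the hits land on the malformed pattern.
crawled_paths = [
    "/article/1", "/article/2", "/section/news", "/about",
    "/undefined/a", "/undefined/b",
    "/article/undefined/c", "/section/undefined/d",
]

segment_shares = crawl_share_by_segment(crawled_paths)
print(segment_shares)
```

Placing the error pattern first means a malformed URL is counted as an error even when it also looks like an article or section URL.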

Since there was a clear pattern in the URLs, this publisher can update its robots.txt file to stop spending crawl budget on URLs that will not help it drive traffic.
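Before deploying such a rule, it can help to check that it blocks the bad pattern without blocking valuable pages. The sketch below uses Python’s standard `urllib.robotparser` for that check. The `/error/` prefix and the example.com URLs are placeholders, because the post does not give the real malformed pattern, and the rule is a plain path prefix, the simplest form of a disallow rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "/error/" stands in for the real malformed pattern.
ROBOTS_TXT = """\
User-agent: *
Disallow: /error/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, agent="Googlebot"):
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


blocked = is_crawlable("https://www.example.com/error/abc123")
allowed = is_crawlable("https://www.example.com/news/story")
print(blocked, allowed)
```

Running a list of known good URLs through the same check is a cheap way to catch a rule that is broader than intended.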

Waste Example: Share of Non-Canonical URLs

With classification of important SEO attributes and integration with log files, it becomes relatively easy to identify crawl of non-compliant URLs. The e-commerce site below is an extreme case of crawlable non-canonical URLs: they made up 97% of the one million pages crawled by Botify.

[Figure: Distribution of compliant URLs in e-commerce site]

Even though the compliant URLs numbered only about 25,000, Google managed to crawl only a little more than half of them in the course of a month. As we can see below, Google’s crawl budget allowed for more than the total number of compliant URLs, but the remainder of the budget was spent on non-compliant URLs.

[Figure: Cumulative crawl over one month with distribution by compliant URLs]

This is unfortunate, since the site could potentially have achieved a near 100% crawl ratio, making it more likely that more pages would drive traffic. Another possible result of keeping this mass of non-canonical URLs out of the crawl is that more pages could be crawled more frequently. As we saw in the previous crawl budget blog post, more frequently crawled pages tend to produce more visits.
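The two numbers in this example, how much of the compliant set was crawled and how much of the crawl went elsewhere, can be computed from two URL lists. This is a minimal sketch assuming you already have the set of compliant URLs from a site crawl and the set of URLs Googlebot requested from your logs. The function name, the sample URLs, and the use of distinct URLs rather than hits are choices made for this illustration.

```python
def crawl_budget_summary(compliant_urls, crawled_urls):
    """Summarize how a crawl was spent, counting distinct URLs.

    crawl_ratio: share of compliant URLs that were crawled at least once.
    wasted_share: share of crawled URLs that are not compliant.
    """
    compliant = set(compliant_urls)
    crawled = set(crawled_urls)
    crawled_compliant = compliant & crawled
    return {
        "crawl_ratio": len(crawled_compliant) / len(compliant) if compliant else 0.0,
        "wasted_share": len(crawled - compliant) / len(crawled) if crawled else 0.0,
    }


# Made-up data: four compliant pages, six URLs crawled by Google.
compliant_urls = ["/a", "/b", "/c", "/d"]
crawled_urls = [
    "/a", "/b",
    "/a?color=red", "/a?color=blue",
    "/b?sort=price", "/b?sort=name",
]

summary = crawl_budget_summary(compliant_urls, crawled_urls)
print(summary)
```

In this toy data, Google crawled more URLs than there are compliant pages, yet only half of the compliant pages were reached, which mirrors the pattern described above.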

Waste Example: Faceted Navigation and On-Site Duplicate Content

We can use the sharp increase in crawl of non-compliant URLs in the chart above as a launching point to investigate the change.

In this case, the change appears to have allowed crawlers into the site’s faceted navigation. Google called this problem out as a waste of crawl resources years ago, yet it still exists as a problem to be managed.

We can see that just for the lipstick category, there are nearly 200,000 URLs that declare another page as their canonical. (The domain and major URL patterns have been obfuscated for customer privacy.)

[Figure: URL explorer filtered to page titles starting with “Lipstick”]

An increase in crawl volume exclusively on non-compliant URLs seems counterintuitive. The reason is likely that the site released code that introduced these as new URLs or removed restrictions on crawling them (removal of nofollow in links or meta tags, or removal of a robots.txt disallow). Google likes to discover new URLs in case they will add value to its index.

When Best Practices Reach Their Limits

This is an example where the site owner followed search engine recommendations: use canonical links for faceted navigation or near-duplicate content. Unfortunately, at this scale, Google would need to crawl every one of these URLs in order to canonicalize them correctly, taking attention away from the core, strategic pages that can drive traffic.
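The reason canonical links cost crawl budget is that the hint lives inside the page, so it can only be read after the URL has been fetched. The Python sketch below makes that concrete: it needs the HTML of a page before it can tell whether the page points to a different canonical URL. The class name, function name, and example.com URLs are invented for the illustration, and the comparison is a plain string match with no URL normalization.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def is_canonical(url, html):
    """Return True if the page has no canonical link or points to itself."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical is None or finder.canonical == url


# Made-up page: a faceted URL and its parent share the same canonical tag.
page_html = (
    '<html><head>'
    '<link rel="canonical" href="https://www.example.com/lipstick">'
    '</head><body></body></html>'
)

facet_is_canonical = is_canonical("https://www.example.com/lipstick?color=red", page_html)
parent_is_canonical = is_canonical("https://www.example.com/lipstick", page_html)
print(facet_is_canonical, parent_is_canonical)
```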

Since these pages are new and not very differentiated, it seems like a better option would have been to keep them from being crawled at all. That would have allowed crawl budget to be spent on valuable URLs while still letting users have the search refinement functionality.

Spending Crawl Budget Wisely

Like any budget, Crawl Budget is a strategic resource to be managed to achieve your business objectives. Actively monitoring it can prevent you from squandering it on pages that aren’t helping you drive more traffic and revenue. You can decide what is allowed to be crawled and organize your website to influence the Crawl Ratio and Crawl Frequency to produce the most traffic possible with the content you have.

Please share your experience with Crawl Budget waste, feedback, or questions about this post in the comments below! Look for future installments of this series to cover in more detail other factors that negatively affect Crawl Budget.
