If you publish a page on your website, will Google index and rank it?
Not necessarily. In fact, our data indicates that Google misses about half the pages on large websites.
In order for a page to show up in search results and drive traffic to your site, Google has to crawl it first. In Google’s own words, “Crawling is the entry point for sites into Google’s search results.”
However, Google doesn’t have unlimited time and resources to crawl every page on the web all the time, so not all pages will be crawled.
This is what SEOs refer to as crawl budget, and optimizing it can be key to your enterprise website’s growth.
Read on to learn what crawl budget is and how to optimize it.
Crawl budget is the maximum number of pages a search engine can and wants to crawl on any given website. Google determines crawl budget by weighing crawl rate limit and crawl demand.
As far back as 2009, Google acknowledged it could only find a percentage of the content online and encouraged webmasters to optimize for crawl budget.
“The Internet is a big place; new content is being created all the time. Google has a finite number of resources, so when faced with the nearly-infinite quantity of content that’s available online, Googlebot is only able to find and crawl a percentage of that content. Then, of the content we’ve crawled, we’re only able to index a portion.”
SEOs and webmasters began to talk more and more about crawl budget, which prompted Google in 2017 to publish the post “What crawl budget means for Googlebot.” This post clarified how Google thinks about crawl budget, and how they calculate it.
If you work on smaller websites, crawl budget may not be something you have to worry about.
According to Google, “Crawl budget is not something most publishers have to worry about. If a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.”
However, if you work on large websites, especially those that auto-generate pages based on URL parameters, you may want to prioritize activities that help Google understand what to crawl and when.
Whether you work on a site with one thousand or one million URLs, instead of taking Google’s word for it, you’ll likely want to check for yourself to see whether you have a crawl budget issue.
The best way to check your crawl budget and uncover whether Google is missing some of your pages is to compare the total number of pages in your site architecture with the number of pages crawled by Googlebot.
From your log files, you can see the number of URLs that Google is crawling on your site each month. This is your Google crawl budget.
Combine your log files with a full site crawl to understand how your crawl budget is being spent. Segment that data by pagetype to show which sections of your site are being crawled by search engines and with what frequency.
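To make this concrete, here’s a minimal sketch (not Botify’s actual pipeline) of pulling Googlebot-crawled URLs out of a server access log. It assumes the common Apache/Nginx combined log format; the sample lines and URLs are hypothetical.

```python
import re

# Hypothetical excerpt from an access log in combined format.
LOG_LINES = [
    '66.249.66.1 - - [01/Jan/2024:00:01:02 +0000] "GET /clothing/women/jeans HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [01/Jan/2024:00:02:10 +0000] "GET /clothing/women/jeans?sortBy=PriceLow HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [01/Jan/2024:00:03:00 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

# Captures the requested path from the quoted request line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP')

def googlebot_urls(lines):
    """Return the set of unique URLs requested by Googlebot."""
    urls = set()
    for line in lines:
        if "Googlebot" not in line:
            continue
        match = REQUEST_RE.search(line)
        if match:
            urls.add(match.group(1))
    return urls

crawled = googlebot_urls(LOG_LINES)
print(len(crawled))  # prints 2: the third hit isn't Googlebot
```

In production you’d also want to verify that hits claiming to be Googlebot really come from Google (e.g., via reverse DNS lookup), since the user-agent string can be spoofed.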
How are the most important sections of your site being crawled?
One of the best ways to see, at a high level, the ratio of pages Googlebot is crawling vs. not crawling is the Crawls Venn Diagram.
The two overlapping circles in this Venn diagram show three groups: pages in your site architecture that Google missed (crawled by Botify only), pages outside your site architecture that Google found anyway (crawled by Google only, AKA “orphan pages”), and pages crawled by both Google and Botify.
Pages crawled by Google only represent possible room for improvement when it comes to your crawl budget. If those pages aren’t linked to anywhere on your site, but Google is still finding and crawling them, you may be wasting some of your crawl budget.
Crawl ratio varies dramatically by site. Across industries, for unoptimized sites, an average of only 40% of strategic URLs are crawled by Google each month. That’s 60% of pages on a site that aren’t being regularly crawled and potentially aren’t indexed or being served to searchers.
This offers a strong business case for measuring and optimizing your crawl budget.
Crawl budget optimization is the process of helping Googlebot and other search engines crawl and index more of your important content.
There are three main ways you can do this:

- Reducing crawl waste so Google spends its time on your important pages
- Making your important pages more prominent through site architecture and internal linking
- Keeping your content fresh
Let’s take a look at what exactly that could look like in practice.
Optimizing your crawl budget can be as much about increasing your crawl budget (i.e. getting Google to spend more time on your site) as it is about getting Google to spend the time they’ve already allocated to your site more wisely.
That can include:

- Using canonical tags to consolidate duplicate URLs
- Blocking low-value URLs in your robots.txt file
- Fixing crawl errors and non-200 status codes
- Cleaning up your internal links and XML sitemaps
- Reviewing your crawl rate settings
If you’re not familiar, canonical tags tell Google which version of a page is the preferred, primary version.
For example, say you have a product category page for “women’s jeans” located at /clothing/women/jeans, and that page allows visitors to sort by price: low to high (i.e. faceted navigation).
This might change the URL to /clothing/women/jeans?sortBy=PriceLow. Putting the jeans in a different order didn’t change the content of the page, so you wouldn’t want /clothing/women/jeans?sortBy=PriceLow and /clothing/women/jeans both to be indexed.
You’d likely add a canonical tag on /clothing/women/jeans?sortBy=PriceLow, indicating that /clothing/women/jeans is the primary version of that page and the other version is a duplicate. The same thing is true for URL parameters appended as session identifiers.
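For reference, a canonical tag is a single `<link>` element in the page’s `<head>`. A sketch for the hypothetical jeans example (the domain is illustrative):

```html
<!-- In the <head> of https://www.example.com/clothing/women/jeans?sortBy=PriceLow -->
<link rel="canonical" href="https://www.example.com/clothing/women/jeans" />
```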
You can easily identify when Google is spending time crawling non-canonical pages with Botify’s non-compliant indicator. Take this e-commerce site, for example, which has an extreme case of crawlable non-canonical URLs. In fact, non-canonical URLs represented 97% of the one million pages crawled by Botify.
Even though the compliant URLs numbered only about 25,000, Google managed to crawl only a little more than half of them in the course of a month. As we can see below, Google’s crawl budget allowed for more than the total number of compliant URLs, but the remainder of the budget was spent on non-compliant URLs.
This is unfortunate, since the site could potentially have achieved a near-100% crawl ratio, making it more likely that more pages would drive traffic. Keeping this mass of non-canonical URLs from being crawled could also free Google to crawl the remaining pages more frequently, and we find that more frequently crawled pages tend to produce more visits.
Google called this problem out as a waste of crawl budget years ago, yet it still exists as a major problem for SEO.
Wasting server resources on these types of pages will drain crawl activity from pages that do actually have value, which may prevent or delay Google from discovering your great content.
By using your site’s robots.txt file, you can tell search engine bots what to crawl and what to ignore. If you’re unfamiliar, robots.txt files live at the root of websites and look like this:
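As an illustrative sketch (the paths and domain are hypothetical):

```
User-agent: *
Disallow: /admin/
Disallow: /search/

Sitemap: https://www.example.com/sitemap.xml
```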
Visit Google’s documentation for more information on creating robots.txt files.
So how do these files help preserve your crawl budget?
To use the same example of a large e-commerce site with a faceted navigation that lets you sort the content without changing it (e.g. sorting by price, lowest to highest), you could use your robots.txt to disallow search engines from crawling those sort pages because they’re duplicates of the original page. You don’t want search engines wasting time on them since you don’t want them in the index anyway.
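Continuing the hypothetical jeans example, a rule like this would keep compliant crawlers away from the sort URLs. The parameter name is illustrative; Google’s robots.txt parser supports the `*` wildcard for matching URL patterns.

```
User-agent: *
Disallow: /*?sortBy=
```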
This reminds us of a story that Ryan Ricketts, Technical SEO Manager at REI, shared at our Crawl2Convert conference: his team cut their website down from 34 million URLs to 300,000 and saw drastic crawl budget improvements. HubSpot’s Aja Frost similarly cut down the number of pages Google had access to and increased traffic as a result.
Your robots.txt file can be an important step to take in directing search engines away from your unimportant content and towards your critical content. If you’re a Botify customer, know that our crawler will follow the rules defined for Google in your website’s robots.txt file. However, you can also set up a virtual robots.txt file to override those rules.
It’s important to note that disallowing search engines from certain sections or pages on your site does not guarantee that search engines won’t index those pages. If there are links to those pages elsewhere, such as in your content or sitemap, search engines may still find and index them. See step #3 for more on that.
Remember Google’s crawl budget formula? One of the criteria that helps Google determine how much time to spend on your site is “Is the crawler running into errors?”
If Googlebot runs into a lot of errors while crawling your site, such as 500 server errors, that could lower your crawl rate limit, and consequently, your crawl budget. If you’re noticing a high volume of 5xx errors, you may want to look into improving your server capabilities.
But non-200 status codes can also simply constitute waste. Why spend Google’s time crawling pages you’ve deleted and/or redirected when you could direct their time toward only your live, current URLs?
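One quick way to gauge this kind of waste is to tally the status codes behind search engine hits in your logs. A minimal sketch, assuming you’ve already extracted hypothetical (URL, status) pairs for Googlebot:

```python
from collections import Counter

# Hypothetical (URL, status) pairs pulled from server logs for Googlebot hits.
GOOGLEBOT_HITS = [
    ("/clothing/women/jeans", 200),
    ("/old-jeans", 301),
    ("/discontinued-product", 404),
    ("/clothing/women/jeans", 200),
    ("/checkout", 500),
]

def status_breakdown(hits):
    """Return each status code's share of total crawl hits."""
    counts = Counter(status for _, status in hits)
    total = len(hits)
    return {status: count / total for status, count in counts.items()}

breakdown = status_breakdown(GOOGLEBOT_HITS)
non_200_share = sum(share for status, share in breakdown.items() if status != 200)
print(f"{non_200_share:.0%} of Googlebot's hits were non-200")  # prints 60%
```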
In the example below, we see a small publisher (< 100k pages) that already had a somewhat high share of non-200 response codes in its crawl (40% on average). But one day (Jan. 31, the blank spot in the chart), a bug was introduced that caused previously unseen malformed URLs to be crawled. Those bad URLs eventually consumed 90% of the crawl.
The malformed URL pattern was identified and labeled as “error” in yellow below using URL segmentation. This made it easier to understand the issue’s effect on the rest of the site, which was that Google was spending all its time on the error URLs and missing the publisher’s real, valuable pages.
In addition to blocking search engine bots from crawling bad URLs, it’s also a good idea to avoid linking to pages with non-200 status codes.
To avoid wasting your crawl budget, make sure you’re linking to the live, preferred version of your URLs throughout your content. As a general rule, you should avoid linking to URLs if they’re not the final destination for your content.
For example, you should avoid linking to:

- URLs that redirect (3xx) to another location
- Pages that have been deleted (4xx)
- Non-canonical versions of a page
Don’t waste your crawl budget by sending search engine bots through multiple middlemen (a.k.a. chains and loops) to find your content. Instead, link to the ultimate destination.
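The difference between linking through a chain and linking to the ultimate destination can be sketched like this, using a hypothetical redirect map:

```python
# Hypothetical redirect map: old URL -> destination it 301s to.
REDIRECTS = {
    "/old-jeans": "/clothing/women/jeans-2019",
    "/clothing/women/jeans-2019": "/clothing/women/jeans",
}

def final_destination(url, redirects, max_hops=10):
    """Follow a chain of redirects and return (final_url, hops taken)."""
    hops = 0
    while url in redirects and hops < max_hops:
        url = redirects[url]
        hops += 1
    return url, hops

# Linking to /old-jeans sends crawlers through two hops before they
# reach the real page; link straight to the final URL instead.
print(final_destination("/old-jeans", REDIRECTS))  # ('/clothing/women/jeans', 2)
```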
Also, avoid common XML sitemap mistakes such as:

- Including non-200 URLs (redirected or deleted pages)
- Including non-canonical URLs
- Omitting important pages you want crawled and indexed
Including only live, preferred URLs and making sure you’re not leaving out key pages that you want search engines to crawl and index is critical. Have old product pages? Make sure to expire them and remove them from your sitemap.
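At its simplest, a sitemap audit boils down to checking each listed URL’s response. A minimal sketch with hypothetical URLs and statuses (Botify’s actual audit is more thorough):

```python
# Hypothetical sitemap entries and the status each URL currently returns.
SITEMAP_URLS = [
    "/clothing/women/jeans",
    "/old-jeans",             # redirects (301) elsewhere
    "/discontinued-product",  # deleted (404)
]

STATUSES = {
    "/clothing/women/jeans": 200,
    "/old-jeans": 301,
    "/discontinued-product": 404,
}

def sitemap_problems(urls, statuses):
    """Return sitemap URLs that don't respond with 200."""
    return [url for url in urls if statuses.get(url) != 200]

print(sitemap_problems(SITEMAP_URLS, STATUSES))
# ['/old-jeans', '/discontinued-product']
```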
You can use Botify to audit your sitemap for errors to reduce your crawl waste.
Google gives you the option to change Googlebot’s crawl rate on your site. This tool can affect your crawl rate limit, which is part of how Google determines your site’s crawl budget, so it’s an important one to understand.
While you don’t have to use this function, it is available to anyone who wants to modify what Google’s algorithms have determined is the appropriate crawl rate for your site.
If the crawl rate is too high, Googlebot’s crawl may put too much strain on your server, which is why Google gives webmasters the option of limiting the crawl rate. However, this could result in Google finding less of your important content, so use it with caution.
To adjust your crawl rate, go to the crawl rate settings page for the property you want to adjust. You’ll see two options: “Let Google optimize” and “Limit Google’s maximum crawl rate.”
If you want to increase your crawl rate, it’s a good idea to check and see if “Limit Google’s maximum crawl rate” has been selected accidentally.
URLs that are more popular on the internet tend to be crawled more often by Google.
One way that Google might judge popularity, or at least the relative importance of a page, is by looking at its depth. Page depth (or “click depth”) is the number of clicks it takes to get to a page from the home page.
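Click depth is just a shortest-path calculation from the home page over your internal link graph. A minimal sketch with a hypothetical five-page site:

```python
from collections import deque

# Hypothetical internal link graph: page -> pages it links to.
LINKS = {
    "/": ["/clothing", "/blog"],
    "/clothing": ["/clothing/women"],
    "/clothing/women": ["/clothing/women/jeans"],
    "/clothing/women/jeans": [],
    "/blog": [],
}

def page_depths(links, home="/"):
    """Breadth-first search from the home page; depth = minimum clicks."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

print(page_depths(LINKS)["/clothing/women/jeans"])  # prints 3
```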
In Botify, you can use the page depth reports to get a better idea of how page depth impacts how Google crawls your site. In the example here, you can see that the number of URLs being crawled by Google starts to drop significantly at approximately three levels deep.
Another signal of popularity on your site is internal linking. If a page is linked to several times, it implies that page is popular. Botify offers several charts that provide insight into internal linking.
In the example below, you can see that pages that Google crawls have many more internal links pointing to them than pages that Google doesn’t crawl.
To help Google better understand how important and popular your pages are, it’s a good idea to move your important pages closer to the home page and link to them more often.
While you can’t link to every page from your home page, be strategic about your internal linking and site architecture. If a page is buried on your site and/or not linked to very often, there’s a good chance Google will view it as less popular and crawl it less often.
Google may no longer be crawling a page because it’s stale, meaning it has not changed the last few times Googlebot crawled your site. Google also wants to prevent pages from becoming stale in its index.
One way to identify whether you have stale content on your site is to isolate posts that were published before a certain date. For example, if you have a pretty aggressive publishing cadence (i.e. multiple posts every day), you may want to filter and view posts older than three months. For sites that publish less often, you may choose to view posts older than three years. It just depends on your cadence.
Additionally, you could pair this filter with the “active/not active” filter. This would allow you to see all older posts that aren’t getting organic search traffic.
These may be good candidates to improve.
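The filter described above amounts to combining a publish-date cutoff with an activity check. A minimal sketch with hypothetical post data:

```python
from datetime import date, timedelta

# Hypothetical post records: (url, publish_date, organic_visits_last_30_days)
POSTS = [
    ("/blog/new-post", date.today() - timedelta(days=10), 120),
    ("/blog/old-active", date.today() - timedelta(days=400), 80),
    ("/blog/old-inactive", date.today() - timedelta(days=400), 0),
]

def stale_candidates(posts, max_age_days=90):
    """Old posts with no organic traffic: good candidates to refresh."""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [url for url, published, visits in posts
            if published < cutoff and visits == 0]

print(stale_candidates(POSTS))  # ['/blog/old-inactive']
```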
There are lots of ways you can refresh your stale content, such as:

- Updating outdated facts, statistics, and screenshots
- Adding new sections that cover what’s changed since publication
- Consolidating overlapping posts into one stronger page
- Adding internal links to the post from newer, related content
You’ll find many more ideas in the article How To Identify & Refresh Stale Evergreen Content: The Underdogs that Drive Long-Term Traffic.
Now that you’re familiar with crawl budget and how to optimize it, you may be eager to get going on some projects on your own. However, your boss may be asking you to make the business case for a project like this before they let you invest the time and resources.
We’ve got you covered.
A large online auto marketplace website came to Botify with a huge problem — 99% of the pages on their site were invisible to Google. Talk about the need for crawl budget optimization.
Here are the exact steps they took to increase their crawl by 19x:
💡 You can download the full case study, which provides a bit more detail on how they performed each step, here: The Invisible Site: How an Online Auto Marketplace Increased Their Google Crawl by 19x.
Applying these optimizations on a site with millions of pages can open up a wealth of opportunity — not only for your crawl budget, but your site’s traffic and revenue, too!
That’s because of the SEO funnel principle: improvements at the crawl phase have downstream benefits for the ranking, traffic, and revenue phases as well. Your stakeholders will definitely be happy about that.
Crawl budget isn’t just a technical thing. It’s a revenue thing. So bring the bots – and visitors – only to the good stuff! Botify was built for solving these kinds of problems, so if you’d like to learn more or see it in action, get in touch! We’d love to show you around.