If you publish a page on your website, will Google index and rank it?
Not necessarily!
In fact, our data indicates that Google misses about half the pages on large websites.
In order for a page to show up in search results and drive traffic to your site, Google has to crawl it first. In Google's own words, "Crawling is the entry point for sites into Google's search results."
However, Google doesn't have unlimited time and resources to crawl every page on the web all the time, so it allocates a limited amount of crawling to each site, and not every page makes the cut.
That allocation is what SEOs refer to as crawl budget, and optimizing it can be key to your enterprise website's growth.
Read on to learn what crawl budget is, or jump to another section:
- What is crawl budget?
- How do I check my crawl budget?
- What is crawl budget optimization?
- How do I optimize my crawl budget?
- How one site increased crawl by 19x to double organic traffic
What is crawl budget?
Crawl budget is the maximum number of pages a search engine can and wants to crawl on any given website. Google determines crawl budget by weighing crawl rate limit and crawl demand.
- Crawl rate limit: The speed of your pages, crawl errors, and the crawl limit set in Google Search Console (website owners have the option of reducing Googlebot's crawl of their site) can all impact your crawl rate limit.
- Crawl demand: The popularity of your pages as well as how fresh or stale they are can impact your crawl demand.
The history of crawl budget
As far back as 2009, Google acknowledged it could only find a percentage of the content online and encouraged webmasters to optimize for crawl budget.
"The Internet is a big place; new content is being created all the time. Google has a finite number of resources, so when faced with the nearly-infinite quantity of content that's available online, Googlebot is only able to find and crawl a percentage of that content. Then, of the content we've crawled, we're only able to index a portion."
SEOs and webmasters began to talk more and more about crawl budget, which prompted Google in 2017 to publish the post "What crawl budget means for Googlebot." This post clarified how Google thinks about crawl budget, and how they calculate it.
Do I need to worry about crawl budget?
If you work on smaller websites, crawl budget may not be something you have to worry about.
According to Google, "Crawl budget is not something most publishers have to worry about. If a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently."
However, if you work on large websites, especially those that auto-generate pages based on URL parameters, you may want to prioritize activities that help Google understand what to crawl and when.
How do I check my crawl budget?
Whether your site has one thousand or one million URLs, rather than simply taking Google's word for it, you'll likely want to check for yourself whether you have a crawl budget issue.
The best way to check your crawl budget and uncover whether Google is missing some of your pages is to compare the total number of pages in your site architecture with the number of pages crawled by Googlebot.
This requires a site crawler as well as a log file analyzer.
Use log analysis with URL segmentation
From your log files, you can see the number of URLs that Google is crawling on your site each month. This is your Google crawl budget.
Combine your log files with a full site crawl to understand how your crawl budget is being spent. Segment that data by pagetype to show which sections of your site are being crawled by search engines and with what frequency.
How are the most important sections of your site being crawled?
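If you want a rough, do-it-yourself version of this analysis, a short script can tally the unique URLs Googlebot requested in your access logs and group them by pagetype. The sketch below is just that, a sketch: it assumes a common combined log format and made-up URL patterns for your pagetypes, and it skips the reverse DNS verification you'd want in production to confirm requests really came from Googlebot.

```python
import re
from collections import defaultdict

# Illustrative pagetype patterns -- replace these with your own site's URL structure.
PAGETYPE_PATTERNS = {
    "product": re.compile(r"^/clothing/.+/.+"),
    "category": re.compile(r"^/clothing/[^/]+/?$"),
    "editorial": re.compile(r"^/blog/"),
}

# Matches a common combined log format: ... "GET /path HTTP/1.1" 200 ... "user-agent"
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]+" \d{3} .*"(?P<agent>[^"]*)"$')

def classify(path: str) -> str:
    """Map a URL path to a pagetype label."""
    for pagetype, pattern in PAGETYPE_PATTERNS.items():
        if pattern.match(path):
            return pagetype
    return "other"

def crawled_urls_by_pagetype(log_path: str) -> dict:
    """Return the set of unique URLs Googlebot requested, grouped by pagetype."""
    crawled = defaultdict(set)
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            match = LOG_LINE.search(line)
            if not match or "Googlebot" not in match.group("agent"):
                continue
            path = match.group("path").split("?")[0]  # ignore query strings for grouping
            crawled[classify(path)].add(path)
    return crawled

if __name__ == "__main__":
    for pagetype, urls in crawled_urls_by_pagetype("access.log").items():
        print(f"{pagetype}: {len(urls)} unique URLs crawled by Googlebot")
```

Compare those counts against the total number of pages per pagetype from your site crawl and you have a rough crawl ratio per section.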
Use the Crawls Venn Diagram
One of the best ways to see, at a high level, the ratio of pages Googlebot is crawling vs. not crawling is the Crawls Venn Diagram.
The two circles in this Venn diagram represent the pages in your site architecture (crawled by Botify) and the pages crawled by Google. The overlap shows pages crawled by both; pages crawled by Botify only are in your site architecture but being missed by Google; and pages crawled by Google only sit outside your site architecture (AKA "orphan pages").
Pages crawled by Google only represent possible room for improvement when it comes to your crawl budget. If those pages aren't linked anywhere on your site, but Google is still finding and crawling them, you may be wasting some of your crawl budget.
Crawl ratio varies dramatically by site. Across industries, unoptimized sites see an average of only 40% of their strategic URLs crawled by Google each month. That leaves 60% of a site's strategic pages that aren't being regularly crawled and potentially aren't indexed or being served to searchers.
This offers a strong business case for measuring and optimizing your crawl budget.
What is crawl budget optimization?
Crawl budget optimization is the process of helping Googlebot and other search engines crawl and index more of your important content.
There are three main ways you can do this:
- Keeping Google and other search engines away from the pages you don't want indexed
- Helping them find your important content faster
- Improving the popularity and freshness of your important pages
Let's take a look at what exactly that could look like in practice.
How do I optimize my crawl budget?
Optimizing your crawl budget can be as much about increasing your crawl budget (i.e. getting Google to spend more time on your site) as it is about getting Google to spend the time it has already allocated to your site more wisely.
That can include:
- Preventing Google from crawling your non-canonical URLs
- Improving page load times by optimizing your JavaScript
- Minimizing crawl errors & non-200 status codes
- Checking your crawl rate limit in Google Search Console
- Increasing the popularity of your pages
- Refreshing stale content
1. Preventing Google from crawling your non-canonical URLs
If you're not familiar, canonical tags tell Google which version of a page is the preferred, primary version.
For example, say you have a product category page for "women's jeans" located at /clothing/women/jeans, and that page allows visitors to sort by price: low to high (i.e. faceted navigation).
This might change the URL to /clothing/women/jeans?sortBy=PriceLow. Putting the jeans in a different order didn't change the content of the page, so you wouldn't want /clothing/women/jeans?sortBy=PriceLow and /clothing/women/jeans both to be indexed.
You'd likely add a canonical tag on /clothing/women/jeans?sortBy=PriceLow, indicating that /clothing/women/jeans is the primary version of that page and the other version is a duplicate. The same thing is true for URL parameters appended as session identifiers.
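For reference, a canonical tag is simply a link element in the page's <head>. A minimal illustration, using the hypothetical jeans URLs from above and a placeholder domain, might look like this:

```html
<!-- In the <head> of /clothing/women/jeans?sortBy=PriceLow -->
<link rel="canonical" href="https://www.example.com/clothing/women/jeans" />
```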
You can easily identify when Google is spending time crawling non-canonical pages with Botify's non-indexable indicator. For example, one of our eCommerce clients had an extreme case of crawlable non-canonical URLs. In fact, non-canonical URLs represented 97% of the one million pages crawled by Botify.
Even though the indexable URLs numbered only about 25,000, Google only managed to crawl a little more than half in the course of a month. Google's crawl budget allowed for more than the total number of indexable URLs, but the remainder of the budget was spent on non-indexable URLs.
This is unfortunate, since the site could potentially have achieved a near-100% crawl ratio, making it more likely that more pages would drive traffic. Keeping this mass of non-canonical URLs out of the crawl would also allow important pages to be crawled more frequently, and we find that more frequently crawled pages tend to produce more visits.
Google called this problem out as a waste of crawl budget years ago, yet it still exists as a major problem for SEO.
💡 The solution? Use your robots.txt file to tell search engines what not to crawl
Wasting server resources on these types of pages will drain crawl activity from pages that do actually have value, which may prevent or delay Google from discovering your great content.
By using your site's robots.txt file, you can tell search engine bots what to crawl and what to ignore. If you're unfamiliar, robots.txt files live at the root of websites and look like this:
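(This is a simplified, illustrative example; the disallowed paths are hypothetical.)

```
# robots.txt lives at https://www.example.com/robots.txt
User-agent: *
Disallow: /checkout/
Disallow: /internal-search/

Sitemap: https://www.example.com/sitemap.xml
```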
Visit Google's documentation for more information on creating robots.txt files.
So how do these files help preserve your crawl budget?
To use the same example of a large eCommerce site with a faceted navigation that lets you sort the content without changing it (e.g. sorting by price, lowest to highest), you could use your robots.txt to disallow search engines from crawling those sort pages because they're duplicates of the original page. You don't want search engines wasting time on them since you don't want them in the index anyway.
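As a rough illustration, continuing with the hypothetical sortBy parameter from the example above, the relevant rules could look something like this (Google supports the * wildcard in Disallow paths):

```
User-agent: *
# Keep crawlers away from sorted duplicates of category pages
Disallow: /*?sortBy=
Disallow: /*&sortBy=
```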
This reminds us of a story that Ryan Ricketts, Technical SEO Manager at REI, shared at our Crawl2Convert conference: his team cut their website down from 34 million URLs to 300,000 and saw drastic crawl budget improvements. It's also how HubSpot's Aja Frost increased traffic by cutting down the number of pages Google had access to.
Your robots.txt file can be an important step to take in directing search engines away from your unimportant content and toward your critical content. If you're a Botify customer, know that our crawler will follow the rules defined for Google in your website's robots.txt file. However, you can also set up a virtual robots.txt file to override those rules.
It's important to note that disallowing search engines from certain sections or pages on your site does not guarantee that search engines won't index those pages. If there are links to those pages elsewhere, such as in your content or sitemap, search engines may still find and index them. See step #3 for more on that.
2. Improving page load times by optimizing your JavaScript
If your website makes heavy use of JavaScript, you may be wasting your crawl budget on JavaScript files and API calls.
Consider this example.
A customer with a large enterprise website switched from client-side rendering to server-side rendering (SSR). Almost immediately, we could see from log file analysis that Google was spending more time on the website's critical content. Because Google was receiving the fully-loaded page from the server, there was no need for it to spend time on JavaScript files and API calls.
While JavaScript isn't the only thing that can lead to slow page load times, it often adds seconds of load time to a page. Because "How fast/slow are the pages loading?" is one of the criteria Google uses to determine crawl budget, your use of JavaScript may very well be a big contributing factor to Google missing your important content.
💡 The solution? Take the burden of rendering JavaScript off search engines
Switching to SSR or a dynamic rendering solution like SpeedWorkers can free up search engine bots to spend more time on your important pages because they no longer have to spend time rendering JavaScript when they visit your pages.
Page speed is a user experience and ranking factor, but remember, it's also a crawl budget factor. If you work on a large site that uses JavaScript, particularly if the content frequently changes, then you may want to consider prerendering your content for search engine bots.
3. Minimizing crawl errors & non-200 status codes
One of the criteria that helps Google determine how much time to spend on your site is "Is the crawler running into errors?"
If Googlebot runs into a lot of errors while crawling your site, such as 500 server errors, that could lower your crawl rate limit, and consequently, your crawl budget. If you're noticing a high volume of 5xx errors, you may want to look into improving your server capabilities.
But non-200 status codes can also simply constitute waste. Why spend Google's time crawling pages you've deleted and/or redirected when you could direct their time toward only your live, current URLs?
Consider this example.
A small publisher (fewer than 100k pages) already had a somewhat high share of non-200 response codes in its crawl (40% on average). Then one day, a bug was introduced that caused previously unseen malformed URLs to be crawled. Instead of crawling the publisher's real, valuable pages, Google spent most of its time on the error URLs, which eventually consumed 90% of the crawl.
💡 The solution? Clean up your internal linking and make sure your XML sitemap is up-to-date
In addition to blocking search engine bots from crawling bad URLs, it's also a good idea to avoid linking to pages with non-200 status codes.
To avoid wasting your crawl budget, make sure you're linking to the live, preferred version of your URLs throughout your content. As a general rule, you should avoid linking to URLs if they're not the final destination for your content.
For example, you should avoid linking to:
- Redirected URLs
- The non-canonical version of a page
- URLs returning a 404 status code
Don't waste your crawl budget by sending search engine bots through multiple middlemen (a.k.a. chains and loops) to find your content. Instead, link to the ultimate destination.
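If you want a quick way to audit a list of internal link targets, a small script like the sketch below can flag redirects and non-200 responses and show where each URL ultimately resolves. It's only an illustrative example: it assumes the third-party requests library and a hypothetical urls.txt file containing one link target per line.

```python
import requests

def audit_link_targets(urls, timeout=10):
    """Report link targets that redirect or return non-200 status codes."""
    for url in urls:
        try:
            response = requests.get(url, timeout=timeout, allow_redirects=True)
        except requests.RequestException as exc:
            print(f"{url} -> request failed: {exc}")
            continue
        if response.history:  # one or more redirects were followed
            hops = len(response.history)
            print(f"{url} -> redirects ({hops} hop(s)) to {response.url} [{response.status_code}]")
        elif response.status_code != 200:
            print(f"{url} -> returns {response.status_code}")

if __name__ == "__main__":
    with open("urls.txt", encoding="utf-8") as f:
        audit_link_targets(line.strip() for line in f if line.strip())
```

Any URL this flags is a link you should update to point directly at the final, preferred destination.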
Also, avoid common XML sitemap mistakes such as:
- Listing non-indexable pages such as non-200s, non-canonicals, non-HTML, and no-indexed URLs
- Forgetting to update your sitemap after URLs change during a site migration
- Omitting important pages
Including only live, preferred URLs and making sure you're not leaving out key pages that you want search engines to crawl and index is critical. Have old product pages? Make sure to expire them and remove them from your sitemap.
You can use Botify to audit your sitemap for errors to reduce your crawl waste.
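If you just want a quick spot check of your own, a minimal script can parse a sitemap and flag URLs that don't return a clean 200 or that carry a noindex directive. The sketch below is illustrative only: it assumes the requests library, a standard <urlset> sitemap (not a sitemap index), and a placeholder sitemap URL.

```python
import re
import xml.etree.ElementTree as ET

import requests

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def audit_sitemap(sitemap_url: str, timeout: int = 10) -> None:
    """Flag sitemap URLs that redirect, error, or carry a noindex directive."""
    sitemap = requests.get(sitemap_url, timeout=timeout)
    sitemap.raise_for_status()
    root = ET.fromstring(sitemap.content)

    for loc in root.findall(".//sm:url/sm:loc", SITEMAP_NS):
        url = loc.text.strip()
        response = requests.get(url, timeout=timeout, allow_redirects=True)

        if response.history or response.status_code != 200:
            print(f"{url} -> {response.status_code} (final URL: {response.url})")
        elif "noindex" in response.headers.get("X-Robots-Tag", "").lower():
            print(f"{url} -> 200 but noindexed via X-Robots-Tag header")
        # Rough check only: assumes the name attribute appears before content.
        elif re.search(r'<meta[^>]+name=["\']robots["\'][^>]+noindex', response.text, re.I):
            print(f"{url} -> 200 but noindexed via meta robots tag")

if __name__ == "__main__":
    audit_sitemap("https://www.example.com/sitemap.xml")
```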
4. Checking your crawl rate limit in Google Search Console
Google gives you the option to change Googlebot's crawl rate on your site. This tool can affect your crawl rate limit, which is part of how Google determines your site's crawl budget, so it's an important one to understand.
While you don't have to use this function, it's available if you want to modify what Google's algorithms have determined is the appropriate crawl rate for your site.
If the crawl rate is too high, Googlebot's crawl may put too much strain on your server, which is why Google gives webmasters the option to limit it. However, limiting the crawl rate could result in Google finding less of your important content, so use it with caution.
💡 The solution? Adjust your crawl rate in GSC
To adjust your crawl rate, go to the crawl rate settings page for the property you want to adjust. You'll see two options: "Let Google optimize" and "Limit Google's maximum crawl rate."
If you want to increase your crawl rate, it's a good idea to check and see if "Limit Google's maximum crawl rate" has been selected accidentally.
5. Increasing the popularity of your pages
URLs that are more popular on the internet tend to be crawled more often by Google.
One way that Google might judge the popularity or at least the relative importance of a page is by viewing its depth. Page depth (or "click depth") is the number of clicks it takes to get to a page from the home page.
Another signal of popularity on your site is internal linking. If a page is linked to several times, it implies that the page is popular.
💡 The solution? Decrease depth and increase internal links to important pages
To help Google better understand how important and popular your pages are, it's a good idea to move your important pages closer to the home page, as well as link to them more often.
While you can't link to every page from your home page, be strategic about your internal linking and site architecture. If a page is buried on your site and/or not linked to very often, there's a good chance Google will view it as less popular and crawl it less often.
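Page depth itself is easy to compute once you have your internal link graph: run a breadth-first search starting from the home page and record the minimum number of clicks to reach each URL. The sketch below assumes you already have that graph as a simple dictionary (for example, exported from a crawl); the example data is made up.

```python
from collections import deque

def page_depths(link_graph: dict, home: str) -> dict:
    """Breadth-first search from the home page; depth = minimum clicks from home."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in depths:  # first time we reach it = shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

if __name__ == "__main__":
    # Made-up example graph: URL -> list of internally linked URLs
    graph = {
        "/": ["/clothing", "/blog"],
        "/clothing": ["/clothing/women"],
        "/clothing/women": ["/clothing/women/jeans"],
        "/blog": [],
    }
    for url, depth in sorted(page_depths(graph, "/").items(), key=lambda item: item[1]):
        print(f"depth {depth}: {url}")
```

Pages with a high depth (or that never appear in the results at all) are the ones to surface with better internal linking.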
6. Refreshing stale content
Google may no longer be crawling a page because it's stale and hasn't changed the last few times Googlebot visited it. Google also wants to prevent pages from becoming stale in its index.
One way to identify whether you have stale content on your site is to isolate posts that were published before a certain date. For example, if you have a pretty aggressive publishing cadence (i.e. multiple posts every day), you may want to filter and view posts older than three months. For sites that publish less often, you might choose to view posts older than three years. It just depends on your cadence.
Additionally, you could pair this filter with the "active/not active" filter. This would allow you to see all older posts that aren't getting organic search traffic.
These may be good candidates to improve.
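If you can export your posts with their publish dates and organic visit counts to a CSV, a few lines of pandas can surface those candidates. This is a sketch with assumed column names (url, published_at, organic_visits) and an assumed three-month cutoff; adjust both to match your own export and publishing cadence.

```python
from datetime import datetime, timedelta

import pandas as pd

def stale_inactive_posts(csv_path: str, months: int = 3) -> pd.DataFrame:
    """Return posts older than the cutoff that received no organic search visits."""
    posts = pd.read_csv(csv_path, parse_dates=["published_at"])
    cutoff = datetime.now() - timedelta(days=30 * months)  # rough month length
    old = posts["published_at"] < cutoff
    inactive = posts["organic_visits"] == 0  # rough stand-in for the "not active" filter
    return posts[old & inactive].sort_values("published_at")

if __name__ == "__main__":
    candidates = stale_inactive_posts("posts.csv")
    print(candidates[["url", "published_at", "organic_visits"]].to_string(index=False))
```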
💡 The solution? Refresh your stale content
There are lots of ways you can refresh your stale content, such as:
- Correct any out-of-date information
- Scan for spelling and grammatical errors
- Update any internal links that point to old resources, and add new ones where relevant
- Identify which queries the page used to rank for, see what the SERP landscape for those terms looks like now, and update accordingly
There are lots more ideas you can use in the article How To Identify & Refresh Stale Evergreen Content: The Underdogs that Drive Long-Term Traffic.
How one site increased crawl by 19x to double their organic search traffic
Now that you're familiar with crawl budget and how to optimize it, you may be eager to get going on some projects on your own. However, your boss may be asking you to make the business case for a project like this before they let you invest the time and resources.
We've got you covered.
A large online auto marketplace website came to Botify with a huge problem -- 99% of the pages on their site were invisible to Google. Talk about the need for crawl budget optimization.
Here are the exact steps they took to increase their crawl by 19x:
- Crawl all the pages in the site structure
- Import log files to understand which of those pages Google is/isn't crawling
- Identify all non-indexable pages in the site structure
- Update the robots.txt file to eliminate the crawl waste uncovered in step 2
- Improve internal linking, including decreasing page depth and overhauling the breadcrumb structure
- Update the sitemap to include only indexable URLs
💡 You can download the full case study, which provides a bit more detail on how they performed each step, here: The Invisible Site: How an Online Auto Marketplace Increased Their Google Crawl by 19x.
Improving crawl can improve your revenue
Applying these optimizations on a site with millions of pages can open up a wealth of opportunity -- not only for your crawl budget but your site's traffic and revenue, too!
That's because of the SEO funnel principle: improvements at the crawl phase have downstream benefits for the ranking, traffic, and revenue phases as well, which your stakeholders will definitely be happy about.
Crawl budget isn't just a technical thing. It's a revenue thing. So bring the bots - and visitors - only to the good stuff! Botify was built for solving these kinds of problems, so if you'd like to learn more or see it in action, get in touch! We'd love to show you around.