The first step is awareness: do search engines know that your content exists? For some of it, no doubt. But all of it? Not so sure.
That’s why crawl ratio is a key SEO indicator: it shows how much of your content Google sees. What could be more important than that? It’s plain and simple: a search engine cannot show a page in search results if it doesn’t know that page exists.
Crawl ratio, the basis of it all
Being thoroughly crawled by Google is often taken for granted. It shouldn’t be. Crawl ratio deserves to be on your radar, and it should be a priority for medium to large websites in particular.
We need to compare two things: what’s really on your website vs what Google actually sees.
The former means crawling your website with a crawler that will explore every single page that exists (and that is linked in a way that search engines can explore). That’s what the Botify crawler that comes with the Logs Analyzer does.
The latter means extracting, from your web server log files, the list of pages explored by Google’s bots over a month or so - beyond that delay, pages are very unlikely to rank, as Google needs fresh information.
This is one of the key takeaways provided by the Botify Logs Analyzer and its website analysis.
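At its core, the comparison is a set intersection: the URLs your crawler found versus the URLs Googlebot requested in your logs. Here is a minimal sketch of that computation - the log format and the user-agent check are simplified assumptions (a production setup should verify Googlebot via reverse DNS rather than trust the user-agent string):

```python
import re

# Simplified Googlebot detection via user-agent string (an assumption:
# real verification should use reverse DNS lookups).
GOOGLEBOT_RE = re.compile(r"Googlebot")

def googlebot_urls(log_lines):
    """Extract the set of URLs requested by Googlebot from access log
    lines in a combined-log-style format."""
    urls = set()
    for line in log_lines:
        if GOOGLEBOT_RE.search(line):
            m = re.search(r'"(?:GET|HEAD) (\S+)', line)
            if m:
                urls.add(m.group(1))
    return urls

def crawl_ratio(site_urls, crawled_urls):
    """Share of the site's URLs that Googlebot requested."""
    if not site_urls:
        return 0.0
    return len(site_urls & crawled_urls) / len(site_urls)
```

For example, a site with four known pages of which Googlebot fetched one has a crawl ratio of 25%.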
For instance, for this large website, the crawl ratio is 81%:
This other website has a very low crawl ratio of 11%:
Note that these graphs also show orphan pages, those seen by Google but not found on the website (in grey). These also need our attention, but that’s another topic - we’ll talk about it in another post. Right now, let’s focus on pages which ARE on your website.
The graph above provides the big picture. Then the question to ask is: should all pages found by Botify be crawled by Google? Are they all legitimate, high-quality pages that are good target pages for SEO? Chances are, some are crawled when they shouldn’t be, and some are not crawled when they should be.
A view by type of pages helps answer that question. The Botify Logs Analyzer’s website crawl report also provides a detailed view of pages found by Botify on the website, with crawl ratios by type of page. In the graph below, each bar shows the crawl volume for a page category. Green represents pages that were crawled by Google over the analyzed 30-day period (that would be the overlap between the two discs in a graph like the one above - although here it’s not the same website). Red represents pages that weren’t.
This view showing page categories, which are usually defined to match page templates, tells us which pages are important from a user’s perspective: we know which templates correspond to important content. But perhaps, if we look closer within a category, a significant number of pages are duplicates?
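Grouping by category typically amounts to matching URLs against per-template patterns and computing the ratio within each group. A sketch of the idea - the patterns below are illustrative examples, not Botify’s actual categorization rules:

```python
import re

# Hypothetical category patterns: in practice these are defined to
# match the site's own URL structure / page templates.
CATEGORY_PATTERNS = [
    ("product", re.compile(r"^/product/")),
    ("category", re.compile(r"^/category/")),
]

def categorize(url):
    """Return the first matching category name, or 'other'."""
    for name, pattern in CATEGORY_PATTERNS:
        if pattern.match(url):
            return name
    return "other"

def ratio_by_category(site_urls, crawled_urls):
    """Crawl ratio (crawled / total) for each page category."""
    totals, crawled = {}, {}
    for url in site_urls:
        cat = categorize(url)
        totals[cat] = totals.get(cat, 0) + 1
        if url in crawled_urls:
            crawled[cat] = crawled.get(cat, 0) + 1
    return {cat: crawled.get(cat, 0) / n for cat, n in totals.items()}
```

A category with a much lower ratio than its neighbors is exactly the kind of signal to investigate: it may contain duplicates, or simply be too deep.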
Only show the good stuff: reduce the scope to quality pages
In all likelihood, a website with a very large number of pages includes:
- Duplicates or near-duplicates. There can be technical causes such as tracking parameters in URLs (read about this in a post on excessive depth), or print versions of pages. There can also be business reasons such as a product page for each color or option that is available.
- Low-quality pages (containing very little information). These can be generated, for instance, by user action links such as ‘share this page’ placed in an `<a href>` tag, or by a contact form with URL parameters that create a different URL from each page.
- Pages that are not target pages for SEO, for instance lists of products resulting from a combination of too many navigation filters, as explained as part of depth issues.
Some of these pages, such as duplicates due to tracking parameters, should not exist at all - not for search engines, not for users. Others, such as pages resulting from combinations of many filters, shouldn’t be seen by search engines but should remain available to users.
Bottom line: we need to make sure that low-quality pages that have no chance of generating visits (and that lower the overall quality score of the website, from the search engine’s perspective) are no longer crawled by Google.
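One common way to keep such pages out of Google’s crawl is robots.txt Disallow rules with wildcards, which Google supports. The patterns below are hypothetical examples to adapt to your own URLs - and note that robots.txt only prevents crawling; for duplicates that must stay accessible to users, a canonical tag or a noindex directive may be the better fit:

```
# Hypothetical robots.txt rules (adjust patterns to your own URL structure)
User-agent: *
# Block duplicates created by tracking parameters
Disallow: /*?*utm_
# Block print versions of pages
Disallow: /*print=1
# Block combinations of two or more navigation filters
Disallow: /*?*filter=*&filter=
```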
Once we have done that, we would like to see 100% of remaining pages crawled by Google.
How can we make that happen?
To increase crawl rate: facilitate access to pages
Your website’s internal linking defines how pagerank flows inside your website, and for large sites, ‘link juice’ remains a key factor in getting crawled by search engines.
- Reduce depth
As you know, pagerank mechanically decreases with depth. As a result, Google’s crawl rate also decreases with depth. Here is an example of Google’s crawl volume and crawl rate by depth (number of clicks from the home page), and the same information as percentages. The overall volume by depth is that of pages found by Botify; green shows pages crawled by Google, red those that were not.
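Depth here means the minimum number of clicks from the home page, which is what a breadth-first traversal of the link graph computes. A minimal sketch, assuming the link graph is available as an adjacency dict (a real crawler builds this while fetching pages):

```python
from collections import deque

def page_depths(links, home="/"):
    """Compute each page's depth (minimum clicks from the home page)
    by breadth-first search over an adjacency dict {url: [linked urls]}."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in depths:
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths
```

Bucketing Googlebot’s log hits by these depths is what produces a crawl-rate-by-depth graph like the one above.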
What should you do?
Work on navigation and pagination. See our post on the top 5 depth issues and their solutions
- Avoid pagerank waste
If you haven’t looked into it closely, chances are that the pages that receive the largest amount of ‘juice’ are not those that deserve it most.
What should you do?
Work on internal linking. There is no miracle recipe, actual actions are very site-specific.
Crawl efficiency: allow Google to see more with the same crawl budget
In addition to encouraging Google to get key content first, it would be great to get the search engine to see more of it, wouldn’t it?
The crawl budget that Google allocates to your website - loosely, the time it is willing to spend crawling it - is based on criteria that are built over time. It can change, but not overnight. The safest and quickest way to optimize Google’s crawl is to make sure it crawls your site in a more ‘efficient’ or ‘useful’ way, using the same budget.
What should you do?
Check your website’s performance, more specifically the HTML page download time, which is what matters for search engines’ bots. As we explained recently, increased performance has the power to significantly boost crawl volume. Do this for key content pages (product pages, article pages, main navigation pages…).
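Download times can be measured from the same log files, provided the server logs response time - for example Apache can append `%D` (microseconds) to the combined log format. A sketch under that assumption, averaging download time per URL prefix (the log layout is hypothetical; adapt the parsing to your own format):

```python
import re

# Assumes the last field of each log line is the response time in
# microseconds (e.g. Apache's %D appended to the combined log format).
LINE_RE = re.compile(r'"(?:GET|HEAD) (\S+)[^"]*".* (\d+)$')

def avg_download_time_ms(log_lines, prefix="/"):
    """Average response time (ms) for URLs under a given prefix,
    or None if no matching requests were logged."""
    times = []
    for line in log_lines:
        m = LINE_RE.search(line)
        if m and m.group(1).startswith(prefix):
            times.append(int(m.group(2)) / 1000.0)  # microseconds -> ms
    return sum(times) / len(times) if times else None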
Botify helps you prioritize by showing page performance and active page ratio (% of pages that generate organic visits) by type of page: you will be able to start with pages that are slower AND are worth working on.
The graph below shows crawl volume and page performance by type of page.