Fact: Google doesn't know half of your website. What can you do?

In Google's worldview, a significant portion of your website doesn't exist. "Half of it" may be an overstatement for some websites, but it may also be an understatement for others, particularly large websites.

Why do we want to draw your attention to this? Because getting priorities right is key to any type of optimization, all the more so for SEO. And if search engines don't know a significant part of your website, isn't that the first SEO problem we should solve?

Pages can't rank if search engines don't know they exist

This may sound obvious, but it is worth stating: we shouldn't take for granted that Google crawls everything that exists on the Internet. I've heard puzzled clients say: “Google is so powerful, with virtually unlimited resources, so surely that shouldn't be an issue?”

Truth is, however powerful the search engine is, it still has to manage priorities. It may favor discovering new content over refreshing existing content. It will also typically explore highly popular pages hundreds or thousands of times a day while completely ignoring other pages, either knowingly or simply because it never came across a link to them.

So where do your pages stand? Do they exist in Google's worldview? This type of graph can be a real eye-opener:

Graph from the Botify report / Search Engines tab / Google / Top charts.
(The Search Engines section appears in your reports if you subscribed to Botify Log Analyzer).

This graph contrasts two views:

  • Blue disc: what the website looks like when Botify explored it thoroughly, starting from the home page and following links down to the very last page (which also most closely reflects what a user browsing the website can see)
  • Red disc: what the website looks like from Google's perspective (as detected in 30 days of web server log files, where every single request to the web server is registered, whether from an actual user or from a search engine robot exploring the website).
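
As a rough illustration of how the red disc can be derived from log files, the sketch below extracts the URLs requested by Googlebot from access-log lines. It assumes the common Apache/Nginx "combined" log format and recognizes Googlebot by its user-agent string only. The sample lines and the function name are made up for this example, and a production setup would also need to verify the requester (for instance with a reverse DNS lookup), since user agents can be spoofed.

```python
import re

# Assumed "combined" log format:
# IP, identity, user, [date], "request", status, size, "referer", "user agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_paths(lines):
    """Return the set of paths requested by clients whose user agent mentions Googlebot."""
    paths = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            paths.add(match.group("path"))
    return paths

# Hypothetical sample lines, for illustration only.
sample_log = [
    '66.249.66.1 - - [10/Mar/2014:06:25:14 +0000] "GET /products/42 HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Mar/2014:06:25:20 +0000] "GET /products/43 HTTP/1.1" 200 5120 '
    '"http://example.com/" "Mozilla/5.0 (Windows NT 6.1)"',
    '66.249.66.1 - - [10/Mar/2014:06:26:02 +0000] "GET /old-page HTTP/1.1" 404 312 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

crawled_by_google = googlebot_paths(sample_log)
print(sorted(crawled_by_google))
```

Only the two Googlebot requests are kept; the regular visitor's request to /products/43 is ignored.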

This example shows a very typical situation, although the size of each disc and of the overlapping area may vary.

The bottom line is that, in the vast majority of cases, Google has a very skewed view of your website:

  1. The search engine only knows about a small portion of the pages that are on your website (the overlapping part in purple). As a result, the blue part is simply invisible to Google search users.
  2. The search engine also explores a number of pages which are not currently linked on your website (orphan pages, in the red, non-overlapping part). Such a waste of crawl resources is bad news (you probably don't want Google to see most of these), but the good news is that this crawl budget exists, and you can make efforts to encourage Google to spend it on pages that are actually on your website.
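
The two discs and their overlap boil down to set operations. The minimal sketch below assumes you already have two collections of URLs, one from a crawl of the website and one from the log files; the function name and sample URLs are hypothetical.

```python
def compare_views(site_pages, google_pages):
    """Split URLs into the three areas of the two-disc graph."""
    site_pages, google_pages = set(site_pages), set(google_pages)
    return {
        # Purple overlap: pages in the site structure that Google crawled.
        "known_to_google": site_pages & google_pages,
        # Blue only: pages in the site structure that Google never requested.
        "invisible_to_google": site_pages - google_pages,
        # Red only: pages Google requested that are not linked on the site.
        "orphans": google_pages - site_pages,
    }

# Hypothetical sample data.
site = ["/", "/a", "/b", "/c", "/d"]
google = ["/", "/a", "/old-1", "/old-2"]

views = compare_views(site, google)
crawl_ratio = len(views["known_to_google"]) / len(set(site))
print(f"Google crawled {crawl_ratio:.0%} of the pages found in the site structure")
```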

Why so many orphan pages?

Orphan pages, crawled by Google but not found on your website, may have several causes:

  • A normal phenomenon if your website includes pages that rapidly expire (such as classifieds): the website crawler takes a quicker snapshot of the website than search engines, which need more time to explore the website and get a picture with a "longer exposure" (hence the 30 days of log files). The Botify analysis lets you check this, as it also shows crawl rate by URL segment (template): in the case of a classifieds website, you will be able to verify that orphan pages are only ads.
  • Older pages (previous versions, etc.) that are still explored by search engines for some time after they are removed from the website, if not indefinitely when they are linked from other websites. In that case, make sure they return the appropriate status code (HTTP 404/410, or an HTTP 301 redirect to similar content).
  • The analysis scope that you defined: pages that are linked on your website, but deeper than you allowed the Botify crawler to go, will also appear as orphan pages. For instance, the website is very large, with several million pages, and you decided to analyze only the first million. The finding remains valuable: the graph shows the top of your website vs. what Google explores.

Let's focus on the first problem - making sure Google explores your website as thoroughly as possible.

Which structural indicators have an impact on Google's crawl rate?

Google uses a number of signals to decide which pages to crawl, and how often to crawl them. Among the top signals, besides website popularity and authority of course, are user visits and behavior, as well as content quality. But these rich signals are only available for a comparatively small number of pages on the Internet. What about the rest? For those, Google only has PageRank to fall back on.

Which is why the Botify report includes an Internal Pagerank indicator: it lets you see how PageRank flows through the website structure and how it is distributed among pages. Hopefully, it primarily goes to important pages, and accurately reflects the pages' importance.

See below, an example of the percentage of pages crawled by Google on a website, shown by Internal Pagerank:

Graph from the Botify report / Search Engines tab / Google / Top charts

Now, the Internal Pagerank is not something you can directly tweak. It depends on the internal linking of the website (which is what you can adjust), and is heavily related to page depth. Most of a website's "link juice" is at the top, and the deeper you go in the website, the less there is.
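
To get a feel for how internal linking shapes this distribution, here is a minimal sketch of the textbook PageRank power iteration restricted to internal links. It is not Botify's Internal Pagerank computation, which this article does not describe; the damping factor of 0.85, the iteration count and the sample link graph are assumptions chosen for illustration.

```python
def internal_pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank over an internal link graph {page: [linked pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Pages without outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical site: the home page links to a category, which links to two items.
links = {
    "/": ["/category", "/about"],
    "/category": ["/", "/category/item-1", "/category/item-2"],
    "/about": ["/"],
    "/category/item-1": ["/category"],
    "/category/item-2": ["/category"],
}

ranks = internal_pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{score:.3f}  {page}")
```

In this toy graph the item pages, which receive a single internal link each, end up with less rank than the home page. Adding links to them from other pages would raise their share.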

Page depth is measured as follows: the home page is at depth 0, pages linked from the home page are at depth 1, and so on. When there are several paths to reach a page, its depth is the number of clicks in the shortest path.
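
This definition corresponds to a breadth-first traversal of the link graph starting at the home page. A minimal sketch, assuming the internal links are available as a dictionary (the sample graph and function name are hypothetical):

```python
from collections import deque

def page_depths(links, home="/"):
    """Return {page: depth}, where depth is the fewest clicks from the home page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # Breadth-first order guarantees the first visit uses a shortest path.
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical link graph: "/d" is reachable in 3 clicks via "/a" and "/c",
# but also in 2 clicks via "/b", so its depth is 2.
links = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
    "/c": ["/d", "/e"],
}

depths = page_depths(links)
print(depths)
```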

Let's look at examples of the impact of page depth on Google's crawl rate. The graphs below show, in blue, pages explored by Google, and in red, those that weren't.

Graphs from the Botify report / Search Engines tab / Google / Top charts

Beyond a certain depth, usually 3 (or perhaps 4 for high-volume websites, and that of course also depends on the website's popularity), it's extremely rare to find close to 100% of pages explored, and the proportion of pages crawled by search engines steadily decreases with depth.
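
The crawl ratio by depth can be approximated from two inputs: the depth of each page found on the website, and the set of pages Google requested. The sketch below assumes both are already available; the names and sample data are hypothetical.

```python
from collections import defaultdict

def crawl_ratio_by_depth(depths, google_pages):
    """Return {depth: share of pages at that depth that Google crawled}."""
    found = defaultdict(int)
    crawled = defaultdict(int)
    for page, depth in depths.items():
        found[depth] += 1
        if page in google_pages:
            crawled[depth] += 1
    return {depth: crawled[depth] / found[depth] for depth in sorted(found)}

# Hypothetical sample: everything is crawled down to depth 1, little beyond that.
depths = {"/": 0, "/a": 1, "/b": 1, "/c": 2, "/d": 2, "/e": 2, "/f": 2}
google_pages = {"/", "/a", "/b", "/c"}

ratio_by_depth = crawl_ratio_by_depth(depths, google_pages)
for depth, ratio in ratio_by_depth.items():
    print(f"depth {depth}: {ratio:.0%} crawled")
```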

Rule of thumb: try to have most of the website's volume no deeper than 5.
And of course, check that your key content (products for an e-commerce website, articles for editorial content…) is not too deep. Read about the most common causes of deep pages, and how to understand what your deep pages are.

Working template by template

So... once you are aware of the global situation, what can you do?

First, look at the same information (overall crawl rate, crawl rate by depth, crawl rate by Internal Pagerank), template by template. Botify lets you analyze a website by template: you define Segments, based on URL patterns, in the project settings prior to the analysis.
This view by template will help you define priorities and see which internal linking optimizations can be done. For instance, for product pages, you can add product-to-product links (horizontal navigation, with user justifications such as "Similar products", "Accessories for this product", etc.).
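
To illustrate the idea of segmenting by URL pattern, the sketch below assigns each URL to a template with regular expressions and then computes Google's crawl ratio per segment. The rules, segment names and sample URLs are hypothetical, and this is not Botify's segment definition syntax.

```python
import re
from collections import Counter

# Hypothetical rules: the first matching pattern wins.
SEGMENT_RULES = [
    ("product", re.compile(r"^/products/[^/]+$")),
    ("category", re.compile(r"^/category/")),
    ("article", re.compile(r"^/blog/")),
]

def segment_of(path):
    """Return the name of the first segment whose pattern matches the path."""
    for name, pattern in SEGMENT_RULES:
        if pattern.search(path):
            return name
    return "other"

def crawl_ratio_by_segment(site_pages, google_pages):
    """Return {segment: share of the segment's pages that Google crawled}."""
    found = Counter(segment_of(p) for p in site_pages)
    crawled = Counter(segment_of(p) for p in site_pages if p in google_pages)
    return {segment: crawled[segment] / found[segment] for segment in found}

site_pages = ["/", "/category/shoes", "/products/1", "/products/2",
              "/products/3", "/products/4", "/blog/hello"]
google_pages = {"/", "/category/shoes", "/products/1", "/blog/hello"}

ratios = crawl_ratio_by_segment(site_pages, google_pages)
print(ratios)
```

Here the product template stands out with only a quarter of its pages crawled, which is the kind of signal that would make it a priority for internal linking work.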

The graphs below show the distribution by segment, for all pages crawled by Google (on the left), and for all active pages (those that generated visits from Google results, on the right):

Graph from the Botify report / Search Engines tab / Google / Segments

This graph shows Google's crawl ratio, by segment:

Graph from the Botify report / Search Engines tab / Google / Segments

You can also get more details for a given segment, and Google's crawl ratio and active pages ratio by depth for that segment, using a report filter:

Graph from the Botify report / Search Engines tab / Google / Top Charts, with report filter applied (one pagetype segment selected)

This other graph shows, for each segment, how often Google crawls pages (among all those found by the Botify crawler on the website). The percentage indicates the number of days with Google crawls over a 30-day period: for instance, >= 80% means at least 24 days out of the 30-day period considered for the analysis.
This is a great indicator of the interest Google is showing in each of your website's templates:

Graph from the Botify report / Search Engines tab / Google / Segments
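
The crawl frequency percentage described above can be sketched as follows: for each page, count the distinct days with at least one Google crawl and divide by the length of the period. The input format (a list of date and path pairs extracted from the logs) and the sample data are assumptions made for this example.

```python
from datetime import date

def crawl_day_share(crawl_events, period_days=30):
    """Return {path: share of days in the period with at least one Google crawl}."""
    days_seen = {}
    for day, path in crawl_events:
        # A set keeps one entry per day, however many crawls happened that day.
        days_seen.setdefault(path, set()).add(day)
    return {path: len(days) / period_days for path, days in days_seen.items()}

# Hypothetical events: the home page is crawled on 24 distinct days
# (twice on the first day), a deep page on only 3 days.
events = [(date(2014, 3, d), "/") for d in range(1, 25)]
events.append((date(2014, 3, 1), "/"))
events += [(date(2014, 3, d), "/deep/page") for d in (2, 9, 16)]

shares = crawl_day_share(events)
frequent = [path for path, share in shares.items() if share >= 0.8]
print(shares, frequent)
```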

And for an overview of SEO efficiency for each of your templates, check out the graph below. Each horizontal bar represents a template on your website. The size of the bar on the left shows the number of distinct pages crawled by Google over 30 days, and the size of the bar on the right indicates the number of organic visits from Google over the same period.

Graph from the Botify report / Search Engines tab / Google / Conversion

  • When there is virtually no volume on the left, we're looking at top tail pages (typically the home page and top navigation).
  • On the other hand, when there is significant volume on the left, but nothing much on the right, this can be seen as Google crawl waste: it's worth considering whether these pages should be disallowed to search engines, or perhaps completely removed from the website (if they are not interesting for users either).
  • When there is significant volume on both sides, it is usually a long tail area of the website which converts well - this can be confirmed by another chart, with organic visits frequency from Google results:

Graph from the Botify report / Search Engines tab / Google / Segments
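
The reading grid above can be turned into a rough triage helper. The thresholds below are arbitrary assumptions that would need tuning for each website, and the function name and labels are made up for this sketch. It also adds a fourth case, low volume on both sides, which the list above does not cover.

```python
def classify_template(pages_crawled, organic_visits, min_pages=1000, min_visits=1000):
    """Label a template from its 30-day crawl volume and organic visits.

    min_pages and min_visits are hypothetical cut-offs, not Botify values.
    """
    many_pages = pages_crawled >= min_pages
    many_visits = organic_visits >= min_visits
    if many_pages and many_visits:
        return "long tail"
    if many_pages:
        return "possible crawl waste"
    if many_visits:
        return "top tail"
    return "low activity"

# Hypothetical templates: (distinct pages crawled, organic visits) over 30 days.
templates = {
    "home and navigation": (12, 80000),
    "internal search results": (250000, 300),
    "product pages": (400000, 120000),
}

labels = {name: classify_template(*volumes) for name, volumes in templates.items()}
print(labels)
```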

All this is a sneak preview of what you can find in the new Search Engines section of the Botify report, with the Log Analyzer option.