Could you be unknowingly cutting loose valuable pages? To put it another way, do you have orphan pages with existing or potential organic traffic? The answer is probably yes. It is for most websites. Re-attaching some of these pages to your website structure would allow you to tap into their full potential.
Orphan pages are pages explored by Google that users can't find while navigating your website: they are not linked anywhere on your website. As a result, the Botify crawler doesn't find them either.
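In practice, this definition boils down to a set difference: the URLs Googlebot fetched (seen in your server logs) minus the URLs a site crawler reached by following links. A minimal sketch, using hypothetical example URLs in place of real log and crawl exports:

```python
def find_orphans(log_urls, crawl_urls):
    """URLs fetched by Google (seen in server logs) but never
    reached by the site crawler, i.e. orphan pages."""
    return set(log_urls) - set(crawl_urls)

# Hypothetical data: what the logs and the site crawler each saw.
googlebot_urls = {"/p/1", "/p/2", "/old-promo", "/p/3"}
site_crawl_urls = {"/p/1", "/p/2", "/p/3"}

orphans = find_orphans(googlebot_urls, site_crawl_urls)
print(sorted(orphans))  # → ['/old-promo']
```

On a real website the two input sets would come from a log analyzer and a full site crawl, and could each contain millions of URLs, but the comparison itself stays this simple.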
Orphan pages have weakened traffic potential, but that's not all. The other problem, actually even more frequent, is an enormous amount of crawl waste from Google. We recently talked about pages on your site that Google doesn't know exist, because the search engine can't or won't explore them (the red part on the left, in the graph below). In the vast majority of cases, there are also orphan pages (the grey part on the right).
In the following example, more than 70% of the pages explored by Google on the website are orphan pages:
There are two kinds of orphan pages: the expected, inevitable, normal orphan pages resulting from known causes; and the unexpected.
So the first thing to do when we see a high volume of orphan pages is to check what they look like and whether they are expected or not.
Expected reasons for orphan pages:
Pages linked on external websites, usually redirected. Redirected pages are all orphans, as internal links should always go directly to the correct page.
A few pages returning errors. These errors were identified and corrected on the website but Google still crawls the URLs for a while. Nothing to worry about.
Expired pages from a website with a large number of pages with a short lifespan, for instance classifieds that expire very quickly. They expire within the analysis time frame. We should only start worrying if they remain orphans for a long time. Otherwise, the number of orphan pages simply hints at the website's page rotation rate and should be seen as food for thought.
Frequent causes of orphan pages that shouldn't exist but are crawled by Google:
Expired pages still returning content: some websites simply stop linking expired content (such as products removed from the catalog) and fail to return a status code indicating that the content does not exist any more (HTTP 404 or 410), or to redirect the page to similar content (a newer version of the product, for instance). As a result, the old page is still available.
Pages left out from a previous migration: they are not redirected and the old content is still available. Either there is some similar content on your website, and these old pages should be redirected to these current pages (page-to-page redirections); or there isn't, and these pages should return HTTP 404 (not found) or 410 (gone).
A syntax error while generating sitemaps, creating erroneous URLs (which may still return content and create duplicates, or return HTTP errors).
A syntax error while generating canonical tags, which creates erroneous URLs (HTTP 200 or errors as well).
Pages that are not always linked in the website structure. Some websites use navigation pages (lists of content, such as category pages or internal search result pages) that are only linked when one or several criteria are met. For instance, sub-categories will only appear in a menu when the list is not empty or reaches a minimum number of items. The right approach is to determine when a page ceases to be a target for organic traffic according to business criteria, and when it does, remove it once and for all: remove links and return HTTP 404 or 410. Until then, it should always be linked in the website.
Yes, you read correctly! There can be both expected and unexpected orphan pages generated by expired content. The difference is in the HTTP status code. Both were linked on the website when Google crawled the pages, and they were not linked any more when the Botify crawler explored the website. But once the content expired, the normal orphan page says it's gone (it returns HTTP 404 or 410), while the abnormal one still exists (it returns HTTP 200). The difference will show in the logs analyzer: with normal orphan pages, the number of HTTP 404s grows steadily while the number of HTTP 200s remains relatively stable; with abnormal orphan pages, the number of HTTP 200s keeps growing over time.
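The trend check described above can be sketched in a few lines. This assumes you have already aggregated daily Googlebot hit counts per status code from your logs (the counts below are invented for illustration):

```python
def is_growing(daily_counts, tolerance=0):
    """True if the daily count trends upward overall
    (last day clearly above the first day)."""
    return daily_counts[-1] - daily_counts[0] > tolerance

# Hypothetical daily Googlebot hits on expired orphan pages, by status code.
daily_200 = [120, 140, 165, 190, 230]  # HTTP 200 keeps growing
daily_404 = [300, 320, 350, 380, 400]  # HTTP 404 grows steadily

if is_growing(daily_200):
    print("HTTP 200 keeps growing: expired pages still serve content (abnormal)")
if is_growing(daily_404) and not is_growing(daily_200):
    print("Only HTTP 404 grows: content is expiring cleanly (expected)")
```

A real check would smooth out daily noise (a rolling average, for instance) rather than comparing raw endpoints, but the signal to look for is the same.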
So, what next? How do we know what we're looking at?
The logs analyzer helps identify orphan pages. It also tells us whether some would be worth reintegrating into the website structure, using information about the visits that orphan pages generate.
Let's go back to our example. There are approximately 800K orphan pages crawled by Google, way more than the 300K pages explored in the website structure. The Log Analyzer's crawl report shows page distribution by type of page, for each.
The distribution by type of page is very different from what Google finds in the website structure.
A quick look at the log analyzer's daily history graphs tells us that the green pages that represent 61% of orphan pages in the graph above are redirected:
This graph shows Google's daily crawl volume on this particular category of pages, by status code. The pages almost always return an HTTP 301 status code (permanent redirection), shown in orange.
The report also tells us which types of orphan pages are active (an active page is a page that generated at least one visit over the analyzed 30-day period), and how this compares to active pages in the website structure:
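Computing the active-page rate from log data is straightforward once visits have been attributed to URLs. A small sketch, with hypothetical per-URL visit counts standing in for 30 days of aggregated log data:

```python
# Hypothetical organic visit counts per URL over the analyzed 30-day period.
visits_per_url = {"/p/1": 12, "/old-promo": 3, "/p/2": 0, "/expired/99": 0}
orphan_urls = {"/old-promo", "/expired/99"}

# An active page generated at least one organic visit over the period.
active_orphans = {u for u in orphan_urls if visits_per_url.get(u, 0) >= 1}
rate = len(active_orphans) / len(orphan_urls)
print(f"{rate:.0%} of orphan pages are active")  # → 50% of orphan pages are active
```

The same computation on the pages found in the website structure gives the comparison point the report shows.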
And most of all, the report indicates how this translates into organic visits. On this website, 5% of organic visits are generated by orphan pages.
In this example, the type of page that generates 79% of organic traffic on the website (in the structure) also generates 7% of traffic on orphan pages. And the two categories of pages that generate most traffic on orphan pages are actually broad buckets for "other" types of pages, which were not categorized more precisely, as there are very few of them on the website (the graphs above combine all values below 1%, but the report can show finer details).
The full list of categorized orphan pages, along with their number of organic visits and number of crawls from Google, is provided with the report. This allows us to investigate these orphan pages and decide how to treat them.
And if you happen to find a surprisingly large number of orphan pages with organic visits, you can bet these are mostly Google Adwords visits that were not properly identified - a missing Adwords identifier parameter in the URL, for instance.