Log File Analysis Technical SEO

Orphan Pages & SEO: What Are Orphan Pages & How Do I Find Them?


Do you have pages with ranking and organic search traffic potential that are missing from your site structure? Or pages that intentionally aren’t in your site structure but that Google is finding anyway?

The answer is probably yes. At least, it is for most websites!

These are called orphan pages, and re-attaching the good ones to your website structure allows you to tap into their full potential (as does blocking search engine bots from your low-value ones!).

So, what exactly are orphan pages?

Orphan pages are pages that aren’t linked to from anywhere on your site. Because there are no internal links to them, website visitors and site crawlers that follow links can’t reach them by navigating your site.

So, how do you find orphan pages?

You’ll need to use both a site crawler and log file analyzer. Here’s how.

How to find orphan pages

If a site crawler helps you find pages in your site structure, then a log file analysis tool can help you find orphan pages that aren’t in your site structure. You’ll want both to find all the orphan pages on your website.

We talk a lot about pages in your site structure that Google doesn’t know exist. These are the pages that search engines can’t or won’t visit, and they’re represented by the blue circle in the Venn diagram below: pages Botify found (which you can see from a site crawl) but Google has not (which you can see from your log files).

The other side of that Venn diagram, the red circle, represents pages Google has found but your site crawler didn’t because they aren’t linked to anywhere on your site. Those are your orphan pages.

So, Google-missed pages and orphan pages are two sides of the same coin, and you need access to both a full crawl of your website and your server log files in order to find them.

Venn diagram: pages found by the site crawler (blue) versus pages found by Google (red, orphan pages)
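The comparison described above boils down to a set difference between two URL lists. Here is a minimal sketch in Python, assuming your server writes access logs in the common "combined" format and that a user agent containing "Googlebot" is good enough to identify Google (user agents can be spoofed, so a production check would also verify the requesting IP). The function names and sample data are hypothetical and are not part of Botify.

```python
import re

# Extracts the requested path from a combined-format access log line.
REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/')

def googlebot_paths(log_lines):
    """Return the set of paths requested by clients whose user agent mentions Googlebot."""
    paths = set()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        match = REQUEST.search(line)
        if match:
            paths.add(match.group(1))
    return paths

def compare_crawl_and_logs(crawled_paths, log_lines):
    """Split URLs into the two sides of the Venn diagram."""
    crawled = set(crawled_paths)
    seen_by_google = googlebot_paths(log_lines)
    return {
        # Google requested these, but the site crawler never reached them.
        "orphan": seen_by_google - crawled,
        # The site crawler reached these, but Google never requested them.
        "missed_by_google": crawled - seen_by_google,
    }

# Hypothetical sample data for illustration only.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample_logs = [
    f'66.249.66.1 - - [05/May/2021:10:00:00 +0000] "GET /products/blue-widget HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [05/May/2021:10:00:05 +0000] "GET /old-campaign HTTP/1.1" 200 2048 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.7 - - [05/May/2021:10:00:09 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
crawled = ["/products/blue-widget", "/about"]

result = compare_crawl_and_logs(crawled, sample_logs)
print(sorted(result["orphan"]))            # ['/old-campaign']
print(sorted(result["missed_by_google"]))  # ['/about']
```

On a real site you would feed in the full URL export from your crawl and at least 30 days of log lines, since Google does not visit every page every day.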

Why are orphan pages bad for SEO?

Orphan pages cause two main SEO problems:

  1. Low Rankings & Traffic: Even if they contain great content, orphan pages typically don’t rank well in SERPs or get much organic search traffic.
  2. Crawl Waste: Low-value orphan pages (e.g. duplicate pages) can steal crawl budget from your important pages.

When orphan pages comprise a sizeable chunk of the pages Google explores on your website, like the more than 70% in the example below, you get a good idea of just how big the problem is.

Botify Log Analyzer report pages found by crawler versus pages found by Google

How do I fix orphan pages?

There are two kinds of orphan pages:

  1. The expected orphan pages you don’t typically need to be concerned about
  2. The unexpected orphan pages that you probably should be concerned about

The route you take to fix your orphan pages will depend on what type they are. So, the first thing to do when we see a high volume of orphan pages is to check what they look like and if they are expected or not.

Expected orphan pages: not typically cause for concern

Once you run a site crawl and compare it with your server log files to find pages Google is finding but that aren’t in your site structure, you can click on “found by Google” to get a list of all your orphan pages.

Many of these orphan pages will be coming from:

1. Pages that don’t currently exist on your site, but that another site still links to. It’s common to get an external link to a page that you then remove or redirect. Because the old link still exists on that other website, Google will still find it.

How to fix: Since you don’t control the links on other websites, the only way to fix this type of orphan page is to reach out to the site owner and ask them to update to the correct new location of the page.

2. Pages returning non-200 status codes. Google may still choose to crawl pages returning things like 4xx status codes even after they’ve been corrected on your site.

How to fix: Google will eventually stop crawling these. Nothing to worry about.

3. Expired pages. This is common on websites with a large number of short lifespan pages, for instance, classifieds that expire very quickly.

How to fix: We should only start worrying about expired pages found by Google if they remain orphans for a long time. Otherwise, the number of orphan pages simply hints at the website’s page rotation rate and should be seen as food for thought.

Unexpected orphan pages: potential cause for concern

1. Expired pages still returning content. Some websites simply stop linking to expired content (such as products removed from the catalog) and fail to return an error status code (like HTTP 404 or 410) to indicate that the content doesn’t exist anymore. As a result, the old page is still available and still returns HTTP 200.

How to fix: In addition to removing links to expired content, you should make sure to update the expired page with the proper status code. If the content is no longer available, make sure to 404 or 410 it.

2. Pages left out from a previous site migration: These are pages that aren’t redirected and therefore old content might still be available.

How to fix: If there is similar content on your new website, you should redirect these old URLs to them. If there isn’t, these old/left-out pages should be returning a 404 or 410 status code.

3. A syntax error while generating sitemaps: These create erroneous URLs, which can still return content and create duplicates, or return HTTP errors.

How to fix: If you spot erroneous URLs created by a syntax error, get with your development team to collaborate on a solution.

4. A syntax error while generating canonical tags: These create erroneous URLs. These URLs could be serving 200 OK status codes or error codes.

How to fix: If you spot erroneous URLs created by a syntax error, get with your development team to collaborate on a solution.

5. High-quality, important pages that aren’t linked in your website structure: Some websites use navigation pages (lists of content, such as category pages or internal search result pages) that are only linked when one or several criteria are met. For instance, sub-categories will only appear in a menu when the list is not empty or reaches a minimum number of items. Whether an error of automation or not, there are plenty of cases in which we might neglect to link to high-value pages.

How to fix: The right approach is to determine when a page ceases to be a target for organic traffic according to business criteria, and when it does, remove it once and for all: remove links and return HTTP 404 or 410. Until then, it should always be linked to somewhere on the website.
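The fixes above can be condensed into a small decision helper. This is only a sketch of the reasoning in this article, not a Botify feature: the function name, its parameters, and the "review manually" fallback for status codes the article doesn’t discuss are all assumptions made for illustration.

```python
def triage_orphan(status_code, is_traffic_target, has_similar_page=False):
    """Suggest an action for one orphan URL.

    status_code: the HTTP status the URL currently returns.
    is_traffic_target: business decision, is the page still meant to earn organic traffic?
    has_similar_page: whether similar content exists elsewhere on the site.
    """
    if status_code in (404, 410):
        # Expected orphan: Google will eventually stop crawling it.
        return "no action: already reported as gone"
    if 300 <= status_code < 400:
        # Expected orphan: the old URL already points somewhere else.
        return "no action: already redirected"
    if status_code == 200:
        if is_traffic_target:
            return "link to it from the site structure"
        if has_similar_page:
            return "redirect to the similar page"
        return "return 404 or 410"
    # Assumption: anything else (for example 5xx) needs a human look.
    return "review manually"

print(triage_orphan(200, is_traffic_target=True))
print(triage_orphan(200, is_traffic_target=False, has_similar_page=True))
print(triage_orphan(200, is_traffic_target=False))
print(triage_orphan(410, is_traffic_target=False))
```

The helper deliberately takes the business decision (`is_traffic_target`) as an input, because no status code can tell you whether a page is still worth ranking.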

Expired content orphan pages

When pages expire, they can become orphan pages. Sometimes, this is normal and expected. In other cases, it’s abnormal and you need to take action to fix it.

The difference between expected and unexpected orphan pages for expired content is in the HTTP status code. In both cases, the pages were linked to on the website at the time Google crawled the pages, and they were not linked any more when the Botify crawler explored the website. Then, once the content expired, the normal orphan page says it’s gone (it returns HTTP 404 or 410), while the abnormal one still exists (it returns HTTP 200).

Here’s how to spot the difference in LogAnalyzer:

  • Normal orphan pages: The number of HTTP 404 pages will grow steadily and the number of HTTP 200 will be relatively stable.
  • Abnormal orphan pages: The number of HTTP 200 will keep growing over time.
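If you want to approximate this check outside of LogAnalyzer, you can count how many distinct orphan URLs Google requests per day for each status code and see whether the HTTP 200 series keeps climbing. The sketch below assumes you have already extracted `(day, url, status)` tuples for orphan URLs from your logs. The function names and the trend rule (never decreasing and higher at the end than at the start) are illustrative assumptions, and a real analysis would want a longer window and a less crude test.

```python
from collections import defaultdict

def daily_url_counts(hits, status):
    """Count distinct URLs per day that returned `status`.

    hits: iterable of (day, url, status_code) tuples, with day as an ISO date string.
    Returns the counts as a list in chronological order, including days with zero matches.
    """
    per_day = defaultdict(set)
    for day, url, code in hits:
        urls = per_day[day]  # creates the day even when the status does not match
        if code == status:
            urls.add(url)
    return [len(per_day[day]) for day in sorted(per_day)]

def is_growing(counts):
    """Crude trend check: the series never decreases and ends higher than it started."""
    if len(counts) < 2:
        return False
    never_decreases = all(later >= earlier for earlier, later in zip(counts, counts[1:]))
    return never_decreases and counts[-1] > counts[0]

# Hypothetical orphan-page hits from Googlebot over three days.
hits = [
    ("2021-05-01", "/ad/1", 200),
    ("2021-05-02", "/ad/1", 200),
    ("2021-05-02", "/ad/2", 200),
    ("2021-05-03", "/ad/1", 200),
    ("2021-05-03", "/ad/2", 200),
    ("2021-05-03", "/ad/3", 200),
    ("2021-05-03", "/ad/0", 404),
]

ok_counts = daily_url_counts(hits, 200)
print(ok_counts)              # [1, 2, 3]
print(is_growing(ok_counts))  # True: expired pages are probably still returning content
```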

So, what next? How do we know what type of orphan page we’re looking at so that we can know what action to take?

To LogAnalyzer!

How to analyze your orphan pages

Let’s go back to our example (the site with more than 70% of their pages orphaned). On that site, there are approximately 800K orphan pages crawled by Google, way more than the 300K pages explored on the website.
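As a quick sanity check on those figures, here is the arithmetic behind "more than 70%", assuming the percentage is the share of orphan pages among all pages in the comparison (orphan pages plus pages in the site structure). The function name is hypothetical.

```python
def orphan_share(orphan_pages, structure_pages):
    """Share of orphan pages among orphan pages plus pages in the site structure."""
    return orphan_pages / (orphan_pages + structure_pages)

share = orphan_share(800_000, 300_000)
print(f"{share:.1%}")      # 72.7% of pages are orphans
print(f"{1 - share:.1%}")  # 27.3% of pages are in the site structure
```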

With our site crawl data, we can also understand the difference between how Google crawls orphan pages vs. pages in the site structure. As you can see from the example below, the distribution by type of page is very different from what Google finds in the website structure.

Botify Log Analyzer report graph showing distribution by category of pages crawled by Google in the site structure vs outside the structure (orphan)

A quick look at LogAnalyzer’s daily history graph tells us that the green pages that represent 61% of orphan pages in the graph above are redirected:

Botify Log Analyzer pages from a category crawled by Google, by HTTP status code

The history graph shows Google’s daily crawl volume on this particular category of pages, by status code. The pages almost always return an HTTP 301 status code (permanent redirection), shown in orange.

This graph also tells us which types of orphan pages are active (i.e. generated at least one visit from organic search over the analyzed 30-day period), and how this compares to active pages in the website structure. As you can see, far fewer orphan pages get organic traffic than pages in the site structure.

Botify Log Analyzer report graph showing distribution by category of pages, of pages with visits from Google in the site structure vs outside the structure (orphan)

And, perhaps even more importantly, the report indicates how this translates into total organic visits. On this website, just 5% of the organic visits are generated by orphan pages, meaning 95% of this site’s organic visits are coming from pages in the site structure, even though less than 30% of the site’s total pages are in the structure.

Botify Log Analyzer report graph showing distribution by page category of visits from Google, in the site structure vs outside the structure (orphan pages)

In this example, the type of page that generates 79% of organic traffic on the website (in the structure) also generates 7% of traffic on orphan pages. And the two categories of pages that generate the most traffic on orphan pages are actually broad buckets for “other” types of pages, which were not categorized more precisely because there are very few of them on the website (the graphs above combine all values below 1%, but the report can show finer details).

In Botify, you can get the full list of categorized orphan pages, along with their number of organic visits and the number of crawls from Google, in the same report. This will allow you to investigate these orphan pages so you can decide how to treat them.

(P.S. If you happen to find a surprisingly large number of orphan pages with organic visits, you can bet these are mostly Google Ads visits that were not properly identified – e.g. missing the Ads identifier parameter in the URL.)
