It started with a good idea: implementing customized 404 pages with a few relevant links to deep content in addition to top navigation. But for the website we are going to look at, it went wrong. Suddenly the website doubled in size.
Usually, SEO work regarding 404s mainly consists of:
- Making sure the website returns the appropriate HTTP 404 response when content is not found, to avoid indexing empty pages (which happens if they return HTTP 200 – OK)
- Removing links to 404 pages from the website.
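The first check, that missing content really answers with HTTP 404, can be scripted. A minimal sketch in Python (the helper names and sample URLs are illustrative, not part of any Botify feature):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def fetch_status(url):
    """HEAD the URL and return its HTTP status code."""
    try:
        with urlopen(Request(url, method="HEAD")) as resp:
            return resp.status
    except HTTPError as err:
        return err.code

def soft_404_candidates(statuses):
    """URLs that should return 404 but answer something else
    (typically 200), and so risk being indexed as empty pages."""
    return [url for url, code in statuses.items() if code != 404]

# Statuses gathered for URLs known to have been removed (sample data):
checked = {"/old-page": 200, "/retired-product": 404, "/typo-url": 301}
soft_404_candidates(checked)
# → ["/old-page", "/typo-url"]
```

Anything this returns is a "soft 404" candidate worth fixing server-side.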
Customized 404 Pages
For a better user experience, it is a good idea to implement customized 404 pages instead of a laconic “Page not found” message, to present the user with rich navigation back to content. We may think at first that this is only about user experience, but actually, this approach potentially impacts SEO as well: Google makes no secret that the links returned in the content of a 404 page may be crawled.
In theory, links found in the 404 pages are also found elsewhere. The most common approach is to include top level navigation and links to popular pages. Alternative approach: suggest content that is as close as possible to the content the user was expecting, through internal search results based on words found in the requested URL.
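The alternative approach, deriving a search query from the requested URL, can be sketched as follows (a hypothetical helper; a real implementation would also strip stop words and IDs):

```python
import re
from urllib.parse import urlparse

def keywords_from_url(url):
    """Split the path of a 404'd URL into candidate search keywords."""
    path = urlparse(url).path
    tokens = re.split(r"[/\-_.]+", path.lower())
    # Keep alphabetic tokens only, and drop common file extensions.
    return [t for t in tokens
            if t.isalpha() and len(t) > 2
            and t not in {"html", "htm", "php", "asp"}]

keywords_from_url("https://example.com/blue-widgets/model-42.html")
# → ["blue", "widgets", "model"]
```

As the article shows later, this only works when the URL actually carries semantic information.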
But what happens if, for some reason, the 404 pages generate plenty of new URLs? Google could crawl a large number of pages that are counterproductive for SEO: they are not SEO landing pages, and they may consume significant crawl budget. These URLs can flood the rest of the website with useless content.
Let’s look at a website which found out the hard way, and see how the problem is identified with Botify Analytics.
In the report, things look quite good in terms of HTTP status codes: a small number of redirections (in blue), few errors (404 in orange).
We know something’s off though, as soon as we look at the page depth graph:
The deepest URL is 22 clicks deep, as we can find out by clicking on the “URLs by depth” block to see URLs and depth, with the deepest URLs first. It is also obvious that URLs listed in the sample are search results with pagination.
Let’s see how many search pages there are on the website.
In the URL Explorer, select the following filter:
- Path starts with “/search” (path is the part of the URL which starts with “/” after the domain, and stops before the “?” if any, or at the end of the URL)
And click on “Apply”.
There are 2,232 search pages, including pagination. That’s more than half of the website!
If we look at one of these pages, the search query is quite strange: it looks more random than your typical search query. Now, let's try to understand where these search results are linked from: we need to identify the incoming links of the first page of a results list.
Let’s look at first pages of search results, that is to say URLs which don’t have any pagination parameter. We need to add a filter which indicates there is no pagination parameter:
- Query String does not contain “page=”
The query string, when there is one, is the part of the URL which begins with “?” and ends at the end of the URL.
Note that we could as well have used a filter on the full URL instead:
- URL does not contain “page=” (it works because we won’t find this in the URL path)
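These path and query-string definitions match what Python's standard `urlparse` returns, which can help double-check a filter before running it (the example URL is made up):

```python
from urllib.parse import urlparse

parts = urlparse("https://example.com/search?query=foo&page=2")

# The path starts at the "/" after the domain and stops before the "?":
assert parts.path == "/search"

# The query string is everything between the "?" and the end of the URL:
assert parts.query == "query=foo&page=2"

# So both the 'path starts with "/search"' filter and the
# 'query string contains "page="' filter would match this URL:
assert parts.path.startswith("/search") and "page=" in parts.query
```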
Let’s display incoming links sample (start typing the field name in the fields area and click in the drop-down list to select):
And click on “Apply”.
You won’t see it on the anonymized screenshot, but it turns out that all these URLs without pagination parameter only have incoming links from other search pages with pagination.
So where could these search pages be linked from?
Let’s try another approach: Let’s look at the search pages which are the least deep.
Let’s remove the pagination filter (click on the little red arrow), click on “Apply” and select the “depth distribution” tab in the results table:
We can see that search pages begin at depth 4. So let’s add a filter to look only at depth 4:
- Depth = 4
Click on “Apply”.
The results list shows there is no search page without pagination at depth 4. Let’s click on a URL with page=1 from the results list, and look at its inlinks:
The first inlink listed is not a search URL.
Let’s click on it:
It’s a 404. If we click on the red “LINK”, we’ll see what the page looks like on the website.
So now we understand what’s going on:
404 pages return search results. These search results include pagination, and paginated pages link back to a duplicate first page (P0 in the graph below, which has a URL without a pagination parameter):
We’ll need to identify all 404 pages that link to a search URL with a page=1 parameter.
Let’s remove the depth filter and add the following filter for pagination:
- Query String contains “page=1\&”
We added “\&” because we don’t want to include page=10, 11, 100, etc., so we need to specify the character which follows the “1”: in our case “&”, because there are other parameters following pagination in the URL. The “&” is preceded by “\” as special characters need to be escaped.
Note that we could as well have used filters on the full URL instead:
- URL contains “/search\?query=” (we have to escape the “?” by placing a “\” before it, as it is a special character)
- URL contains “page=1\&” (it works because we won’t find this in the URL path)
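The escaping rule works the same way in any regex-flavored filter language. Here is the same matching logic checked with Python's `re` module, where `re.escape` adds the backslashes for us (the sample URLs are made up):

```python
import re

urls = [
    "https://example.com/search?query=foo&page=1&sort=asc",
    "https://example.com/search?query=foo&page=10&sort=asc",
    "https://example.com/contact",
]

# Requiring the "&" after "1" keeps page=10, page=100, etc. out:
first_pages = [u for u in urls if re.search(re.escape("page=1&"), u)]

# "?" is a regex metacharacter, so it must be escaped too:
search_pages = [u for u in urls if re.search(re.escape("/search?query="), u)]
```

Only the true first page matches `page=1&`; the `page=10` URL does not.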
We can browse these results just like we did earlier to identify where the sample depth-4 URLs were linked from.
But if there are quite a few, we can also use a top-to-bottom approach. What we know about the URLs we are looking for is that:
- They return 404
- They have outgoing links (not all 404s do)
- They have very few incoming links, probably only one
This translates into the following filters:
- HTTP code = 404
- Unique number of follow internal outlinks > 0 (the page has outgoing links to pages on the website)
- Unique number of internal inlinks <= 3 (there are no more than 3 incoming links from distinct pages. In all likelihood there will be only one, but let’s allow for a couple in case the same error link appears on a couple of pages)
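Applied to a crawl export, the three filters above amount to a simple conjunction. A sketch with made-up rows and illustrative field names:

```python
# Hypothetical crawl-export rows; the field names are illustrative.
pages = [
    {"url": "/bad-link",  "http_code": 404, "follow_outlinks": 12, "inlinks": 1},
    {"url": "/plain-404", "http_code": 404, "follow_outlinks": 0,  "inlinks": 5},
    {"url": "/ok-page",   "http_code": 200, "follow_outlinks": 30, "inlinks": 40},
]

suspects = [
    p["url"] for p in pages
    if p["http_code"] == 404        # the page is an error page...
    and p["follow_outlinks"] > 0    # ...that still links out to the site...
    and p["inlinks"] <= 3           # ...and is barely linked to itself
]
# → ["/bad-link"]
```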
Let’s display the filtered fields as well as a sample of internal inlinks (start typing the field name in the fields area and click in the drop-down list to select):
Here are the results (as suspected, there is only one inlink per URL):
We now have, in the “sample of internal inlinks URLs” column, the list of pages where a link needs to be corrected (the link to the URL shown in the URL column).
In our example, the 404s are caused by:
- Contact information links which are intended to send an email, but where the “mailto:” part is missing from the link,
- Tags which are not closed properly and include a portion of the code or text that follows,
- A missing “/” which causes a URL path to be treated as relative: the resulting URL is incorrect, as it includes a repeated portion.
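The missing-“/” case is standard relative-URL resolution: without a leading slash, the browser resolves the href against the current page's directory, duplicating part of the path. Python's `urljoin` reproduces it (the URLs are made up):

```python
from urllib.parse import urljoin

base = "https://example.com/products/widgets"   # the page containing the link

# Missing leading "/": the href is resolved relative to the base directory,
# so a portion of the path gets repeated.
broken = urljoin(base, "products/widgets/blue")
# → "https://example.com/products/products/widgets/blue"

# With the leading "/", the path is absolute and resolves as intended.
fixed = urljoin(base, "/products/widgets/blue")
# → "https://example.com/products/widgets/blue"
```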
The problem of useless search pages linked on the website is solved by removing the 404s listed above. Removing pagination from the 404 search results will prevent future regressions, and will also prevent orphan search pages created when external websites link to 404 URLs.
Two Reasons Why it Went Wrong
In summary, things went wrong because:
- A search query based on words from the URL does not always work very well. It is only adequate under two conditions: the URLs are well structured and contain enough keywords; and the 404 URLs are either old URLs which we failed to redirect, or not too far off the correct URL, with only a small portion altered. Results start going awry when there is very little semantic information in the URL, or nothing relevant at all (which was the case here for URLs built from email addresses)
- Search results included A LOT of pagination. For the purpose of redirecting a user from a 404 to the actual content, the first page is generally enough. This means that internal search results included in 404 pages should of course be sorted by relevance, and should include first page results only.
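A 404 suggestion widget that follows both rules, relevance-sorted and first page only with no pagination links, can be as simple as this sketch (“score” stands in for whatever relevance measure the internal search engine returns):

```python
def suggestions_for_404(results, limit=5):
    """Return a short, relevance-sorted list of links for a 404 page.
    No pagination: crawlers never see page=2, page=3, ... URL chains."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [r["url"] for r in ranked[:limit]]

results = [
    {"url": "/blue-widgets", "score": 0.9},
    {"url": "/widgets",      "score": 0.7},
    {"url": "/about",        "score": 0.1},
]
suggestions_for_404(results, limit=2)
# → ["/blue-widgets", "/widgets"]
```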
Any related experience? Let the community know!