It started with a good idea: implementing customized 404 pages with a few relevant links to deep content in addition to top navigation. But for the website we are going to look at, it went wrong. Suddenly the website doubled in size.
Usually, SEO work regarding 404s mainly consists of:
For a better user experience, it is a good idea to implement customized 404 pages instead of a laconic “Page not found” message, to present the user with rich navigation back to content. We may think at first that this is only about user experience, but actually, this approach potentially impacts SEO as well: Google makes no secret that the links returned in the content of a 404 page may be crawled.
In theory, links found on 404 pages are also found elsewhere. The most common approach is to include top-level navigation and links to popular pages. An alternative approach is to suggest content as close as possible to what the user was expecting, through internal search results based on words found in the requested URL.
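The keyword-extraction step of that alternative approach can be sketched as follows; the function name and the sample URL are illustrative assumptions, not the site's actual implementation:

```python
import re

def keywords_from_url(path):
    """Extract candidate search words from a requested URL path.

    Strips any file extension, then splits on non-alphanumeric
    separators and drops very short fragments. (Hypothetical helper.)
    """
    path = re.sub(r"\.\w+$", "", path)  # drop extension such as ".html"
    words = re.split(r"[^a-zA-Z0-9]+", path)
    return [w.lower() for w in words if len(w) > 2]

# A 404 on this path yields words to feed the internal search engine
print(keywords_from_url("/blog/2014/red-widgets-guide.html"))
# → ['blog', '2014', 'red', 'widgets', 'guide']
```

The extracted words would then be passed to the site's internal search to build the suggestion list on the 404 page.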
But what happens if, for some reason, the 404 pages generate plenty of new URLs? Google could crawl a large number of pages, which is counterproductive for SEO: they are not SEO landing pages, and they may consume significant crawl budget. These URLs may flood the rest of the website with useless content.
Let’s look at a website which found out the hard way, and see how the problem is identified with Botify Analytics.
In the report, things look quite good in terms of HTTP status codes: a small number of redirections (in blue), few errors (404 in orange).
We know something’s off though, as soon as we look at the page depth graph:
The deepest URL is 22 clicks deep, as we can find out by clicking on the “URLs by depth” block to see URLs and depth, with the deepest URLs first. It is also obvious that URLs listed in the sample are search results with pagination.
Let’s see how many search pages there are on the website.
In the URL Explorer, select the following filter:
And click on “Apply”.
There are 2,232 search pages, including pagination. That’s more than half of the website!
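Outside the URL Explorer, the same filter can be sketched on a crawl export; the `/search` path prefix and the sample URLs are assumptions for illustration, not the actual site's patterns:

```python
# Minimal sketch: count search pages (including pagination) in a crawl
# export. The "/search" prefix and the URLs below are illustrative.
crawled_urls = [
    "/",
    "/category/widgets",
    "/search?q=red+widget",
    "/search?q=red+widget&page=2",
    "/search?q=blue+widget",
]

search_pages = [u for u in crawled_urls if u.startswith("/search")]
print(len(search_pages))  # → 3
```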
If we look at one of these pages, the search query is quite strange: it looks more random than a typical search query. Now we are going to try to understand where these search results are linked from: we need to identify the incoming links of the first page of a results list.
Let’s look at first pages of search results, that is to say URLs which don’t have any pagination parameter. We need to add a filter which indicates there is no pagination parameter:
Note that we could as well have used a filter on the full URL instead:
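For reference, the "no pagination parameter" condition can be sketched in code, assuming pagination uses a `page` query parameter (an assumption for illustration; the actual parameter name depends on the site):

```python
from urllib.parse import urlparse, parse_qs

def is_first_page(url):
    """True when a search URL carries no pagination parameter.

    Assumes pagination uses a "page" query parameter (illustrative).
    """
    params = parse_qs(urlparse(url).query)
    return "page" not in params

urls = [
    "/search?q=red+widget",
    "/search?q=red+widget&page=2",
]
print([u for u in urls if is_first_page(u)])  # → ['/search?q=red+widget']
```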
Let’s display a sample of incoming links (start typing the field name in the fields area and click in the drop-down list to select):
And click on “Apply”.
You won’t see it on the anonymized screenshot, but it turns out that all these URLs without pagination parameter only have incoming links from other search pages with pagination.
So where could these search pages be linked from?
Let’s try another approach: let’s look at the shallowest search pages.
Let’s remove the pagination filter (click on the little red arrow), click on “Apply” and select the “depth distribution” tab in the results table:
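The depth-distribution step can be sketched on a crawl export as follows; the `(url, depth)` pairs are illustrative sample data and the `/search` prefix is an assumption:

```python
from collections import Counter

# Count search URLs per crawl depth (sample data for illustration)
crawl = [
    ("/", 0),
    ("/category/widgets", 1),
    ("/search?q=a", 4),
    ("/search?q=a&page=2", 5),
    ("/search?q=b&page=1", 4),
]
depths = Counter(depth for url, depth in crawl if url.startswith("/search"))
print(sorted(depths.items()))  # → [(4, 2), (5, 1)]
```

Here, the smallest key in the counter plays the role of the "search pages begin at depth N" reading of the depth distribution tab.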
We can see that search pages begin at depth 4. So let’s add a filter to look only at depth 4:
Click on “Apply”.
The results list shows there is no search page without pagination at depth 4. Let’s click on a URL with page=1 from the results list, and look at its inlinks:
The first inlink listed is not a search URL.
Let’s click on it:
It’s a 404. If we click on the red “LINK”, we’ll see what the page looks like on the website.
So now we understand what’s going on:
404 pages return search results. These search results include pagination, and paginated pages link back to a duplicate first page (P0 in the graph below, which has a URL without a pagination parameter):
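The duplication is easy to see once URLs are normalized: a URL with `page=1` collapses onto the first page's URL without any pagination parameter. A minimal sketch, assuming the pagination parameter is named `page`:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def strip_page1(url):
    """Normalize a paginated URL: drop "page=1" so the duplicate first
    page collapses onto the canonical URL without a pagination
    parameter. The "page" name is an assumption for illustration."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if not (k == "page" and v == "1")]
    return urlunparse(parts._replace(query=urlencode(query)))

print(strip_page1("/search?q=red+widget&page=1"))  # → /search?q=red+widget
```

In other words, P1's URL and P0's URL serve the same content under two addresses, which is exactly the duplicate the crawler keeps finding.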
We’ll need to identify all 404 pages that link to a search URL with a page=1 parameter.
Let’s remove the depth filter and add the following filter for pagination:
Note that we could as well have used filters on the full URL instead:
We can browse these results just like we did earlier to identify where the sample depth-4 URLs were linked from.
But if there are quite a few, we can also use a top-to-bottom approach. What we know about the URLs we are looking for is that:
This translates into the following filters:
Let’s display the filtered fields as well as a sample of internal inlinks (start typing the field name in the fields area and click in the drop-down list to select):
Here are the results (as suspected, there is only one inlink per URL):
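The whole top-to-bottom filter can also be sketched in code on a crawl export: keep search URLs with `page=1` whose inlinks are all 404 pages. The data layout, URL patterns, and sample URLs below are illustrative assumptions, not the actual export format:

```python
# Sketch: each crawled URL maps to its HTTP status and its inlinks.
pages = {
    "/search?q=xyzzy&page=1": {"status": 200, "inlinks": ["/old-article"]},
    "/search?q=red+widget":   {"status": 200, "inlinks": ["/category/widgets"]},
    "/old-article":           {"status": 404, "inlinks": ["/blog/archive"]},
    "/category/widgets":      {"status": 200, "inlinks": ["/"]},
}

suspects = [
    url
    for url, page in pages.items()
    if "/search" in url and "page=1" in url
    and all(pages[link]["status"] == 404 for link in page["inlinks"])
]
print(suspects)  # → ['/search?q=xyzzy&page=1']
```

For each suspect, the inlinks list then gives the 404 URLs to act on.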
We now have, in the “sample of internal inlinks URLs” column, the list of pages where a link needs to be corrected (the link to the URL shown in the URL column).
In our example, the 404s are caused by:
The problem of useless search pages linked across the website is solved by correcting the 404s listed above. Removing pagination from the 404 search results will prevent future regressions, and will also prevent orphan pages generated by 404s linked from external websites.
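The preventive fix can be sketched as a 404 template that renders only a capped list of suggestions and emits no pagination links; the function name and signature are hypothetical:

```python
def render_404_suggestions(results, limit=10):
    """Build the suggestion list for a custom 404 page.

    Only the top results are kept and no pagination links are emitted,
    so the 404 page cannot spawn crawlable paginated search URLs.
    (Function name and signature are hypothetical.)
    """
    return [r["url"] for r in results[:limit]]

# Usage: pass the internal search results for words found in the
# requested URL; the 404 template links only to these suggestions.
```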
In summary, things went wrong because:
Any related experience? Let the community know!