
How to Clean Up Bad Status Codes

By Annabelle Bouard

You might be asking yourself, “Why are bad status codes a problem?” When Google or another search engine hits a webpage with a bad status code, there is no actual content for it to crawl. Even redirects, although not errors per se, result in missed opportunities: the search engine bot will not immediately crawl the redirect destination but will instead store it for evaluation and, potentially but not systematically, crawl it later. As for users, redirects are transparent, but a page that returns an error gives them a poor experience when they could have landed on rich, helpful content on your website. Cleaning up these status codes can lead to a more streamlined crawl and a better experience for your users.

What Are Bad Status Codes? 

A bad HTTP status code is any status code that is not “HTTP 200 – Success”. 

We’re going to address the three main types of bad HTTP codes: client errors (HTTP 4xx), server errors (HTTP 5xx), and redirects (HTTP 3xx). The various status codes are detailed in this Wikipedia article, and a recent Google Developer guide highlights how the search engine interprets them. 
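To make these groups concrete, here is a minimal Python sketch that buckets status codes the way this article does. The function names and the sample codes are illustrative assumptions; they do not come from any Botify export.

```python
from collections import Counter


def classify_status(code: int) -> str:
    """Map an HTTP status code to the groups used in this article."""
    if code == 200:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client_error"
    if 500 <= code < 600:
        return "server_error"
    # Anything else (1xx, 2xx other than 200, unknown) is left for manual review.
    return "other"


def summarize(codes):
    """Count how many responses fall into each group."""
    return Counter(classify_status(code) for code in codes)


if __name__ == "__main__":
    sample = [200, 200, 301, 404, 410, 500, 503]  # made-up crawl results
    print(summarize(sample))
```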

Example of a site with a large amount of new bad HTTP codes

Step 1: Build a list of URLs that will be part of the cleanup

HTTP 404 and other 4xx errors

  • Create a list of pages returning 4xx
  • Remember to exclude those that are normal, if applicable: unlike internal redirects and server errors, some HTTP 4xx responses may be standard on your site. For example:
    • HTTP 404 or 410 due to a page’s life cycle, such as classified ads that have expired. Bots will still find these while users won’t: a user clicks through from a list of advertisements that was just generated with fresh data, whereas search engine robots (and Botify, which behaves similarly) do not follow links immediately after finding them. They evaluate them, place the URLs in a queue of pages to crawl, and process them later. By the time the crawler requests an ad page found earlier in a list, the ad may have expired. If there are expiring pages on your site, you can simply use a filter to exclude the types of pages that expire from the list of 404s you want to deal with.
    • HTTP errors related to access control, like HTTP 401 – unauthorized or HTTP 403 – forbidden. If it is normal for these pages to be forbidden (they are reserved for logged-in users, for example), then these are not broken links; they just don’t serve any SEO purpose. These are not actual 404 pages.

Segmenting by the page type will help differentiate 4xx errors that you don’t need to worry about cleaning up

You can also gain additional insights from the HTTP status codes returned to Googlebot when it explored your site (from your web server log data imported into the Botify SiteCrawler report). They show whether the site always returned an error to the search engine over the last 30 days, or returned a mix of HTTP 200 (before the page expired or before you implemented a change on the site) and HTTP 404 (after). This can help diagnose 4xx errors.
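If you want to run the same check on raw log data yourself, the idea can be sketched as follows. This is only an illustration: it assumes you have already extracted (URL, status code) pairs for Googlebot requests over the period, and the function name and labels are hypothetical. It does not track the order of responses, only which codes were seen.

```python
from collections import defaultdict


def status_history(log_hits):
    """Label each URL that returned at least one 4xx to the bot.

    log_hits: iterable of (url, status_code) pairs, assumed to be
    Googlebot requests extracted from server logs.
    Returns {url: "always_4xx" | "mixed"}.
    """
    seen = defaultdict(set)
    for url, status in log_hits:
        seen[url].add(status)

    labels = {}
    for url, statuses in seen.items():
        client_errors = {s for s in statuses if 400 <= s < 500}
        if not client_errors:
            continue  # this URL never returned a 4xx, so it is out of scope
        # "always_4xx": every response in the period was a client error.
        # "mixed": the URL also returned something else (typically 200).
        labels[url] = "always_4xx" if client_errors == statuses else "mixed"
    return labels
```

A URL labeled "mixed" is a good candidate for the expired-page or recent-change explanations described above, while "always_4xx" points to a link that has been broken for the whole period.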

  • When you categorize by type of issue, you may find malformed URLs or old pages (for example, on an e-commerce website, product URLs that changed after a migration but are still linked on the site). Segmenting by type of page should help here as well.
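Outside of a crawler's own filters, the triage described above can be sketched in a few lines of Python. The URL pattern for expiring pages (`/ads/<id>`) is a hypothetical example, and the sketch assumes that every 401 and 403 on the site is intentional; adjust both to match your own page types.

```python
import re

# Hypothetical template for pages that expire as part of their life cycle.
EXPECTED_EXPIRING = [re.compile(r"^/ads/\d+")]

# Assumption: access-controlled responses are intentional on this site.
ACCESS_CONTROL = {401, 403}


def needs_cleanup(path: str, status: int) -> bool:
    """Return True if this 4xx response belongs on the cleanup list."""
    if not 400 <= status < 500:
        return False  # not a client error at all
    if status in ACCESS_CONTROL:
        return False  # forbidden by design, not a broken link
    if status in (404, 410) and any(p.search(path) for p in EXPECTED_EXPIRING):
        return False  # normal expiry for this page type
    return True


def cleanup_list(pages):
    """pages: iterable of (path, status). Keep only the 4xx worth fixing."""
    return [(path, status) for path, status in pages if needs_cleanup(path, status)]
```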

HTTP 3xx internal redirects: beware of redirect chains! 

  • Get a list of redirected URLs 
  • See where they redirect to. In all likelihood, they will redirect directly to the appropriate URL (known as a single-hop redirect). 

But there might be redirect chains, in which case what we need is the end of the chain, with the URL that delivers content successfully. If there are chains, you will want to identify apparent causes for additional hops (such as redirecting to an HTTP page followed by a redirect to the same page with HTTPS, adding a folder, and then adding a trailing slash). 

In redirect loops, the correction will also involve adjusting the redirect rewrite rules to break the loop. 
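As an illustration of finding the end of a chain, here is a minimal sketch that walks a redirect map until it reaches a URL that no longer redirects. It assumes you have exported redirects as a simple source-to-target dictionary; the function name, the return shape, and the `max_hops` limit are choices made for this example.

```python
def resolve_redirect(url, redirects, max_hops=10):
    """Follow a chain of redirects through a {source: target} map.

    Returns (final_url, hops, is_loop). If the chain is longer than
    max_hops, the walk stops early and final_url may still redirect.
    """
    seen = [url]
    current = url
    while current in redirects and len(seen) <= max_hops:
        current = redirects[current]
        if current in seen:
            # We came back to a URL already visited: this is a loop.
            return current, len(seen), True
        seen.append(current)
    return current, len(seen) - 1, False


if __name__ == "__main__":
    # Made-up example: HTTP to HTTPS, then a trailing slash is added.
    redirects = {
        "http://example.com/shoes": "https://example.com/shoes",
        "https://example.com/shoes": "https://example.com/shoes/",
    }
    print(resolve_redirect("http://example.com/shoes", redirects))
```

Any URL that resolves with more than one hop is a chain worth flattening, and the final URL is the one that links should point to once you have confirmed it returns content successfully.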

HTTP 5xx server errors: is the error due to the page? 

  • Get a list of pages returning HTTP 5xx
  • Separate those that are:
    • Permanent (really due to the page): this page always returns a server error whenever requested. This would most likely be an HTTP 500 error (internal server error). However, we shouldn’t rely too heavily on the specific HTTP 5xx status (500, 502, 503, 504…), as web servers aren’t always able to map the issue to a specific error code. Considering 5xx at large is a safer approach.
    • Transient (just bad luck): the Botify crawler happened to request the page when the server wouldn’t respond, and the error had nothing to do with the page itself. This generally happens during temporary server unavailability, when any page requested during the issue returns a server error.

How do you find what pages are returning HTTP 5xx error messages? 

  • The quick way: 
    • Check the HTTP status code stats throughout the crawl in the Analysis info section. If there was an obvious moment when all page requests got a server error, it will show there (see below), and you can confirm by checking this against the “date crawled” metric (when Botify crawled the page) for pages with HTTP 5xx.
    • It is also unlikely that the same page returned a server error in two consecutive reports purely because the crawler happened to request it both times when the server wouldn’t respond. So we can identify pages that consistently return a server error by using a filter to select pages that returned a server error in both the current and the previous report – either the same error (1st example below) or any 5xx error both times (2nd example below).
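The two-report comparison can also be reproduced on exported data. The sketch below assumes each report is available as a dictionary mapping URL to status code; the function name and the `same_code` flag are illustrative and mirror the two filter variants described above.

```python
def persistent_server_errors(previous, current, same_code=False):
    """Return URLs that returned a 5xx in both reports.

    previous, current: {url: status_code} for two consecutive crawls.
    same_code=True keeps only URLs with the identical 5xx both times;
    same_code=False accepts any 5xx in each report.
    """
    persistent = set()
    for url, status in current.items():
        earlier = previous.get(url)
        if earlier is None:
            continue  # new page: cannot be judged from history
        if not (500 <= status < 600 and 500 <= earlier < 600):
            continue
        if same_code and earlier != status:
            continue
        persistent.add(url)
    return persistent
```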

This will not address new pages returning a server error, but chances are this will handle the bulk of your server errors. Then, you can check existing pages vs. new pages with server errors to see if the same templates tend to be problematic. If this is not conclusive and you have many new pages with server errors, then turn to the other option:

  • The more complicated way: export all pages returning a server error and crawl this list, using an ad-hoc crawl, to check which HTTP status code they produce now.

Step 2: Identify how robots find these URLs with bad status codes

For each set of errors or redirected pages, identify where robots find these URLs, as this is where you will need to go to update the broken links. We’ll replace the error URL with a correct one. In the case of redirects, replace the redirected URL with the redirect target (after checking this is a valid, indexable URL) to avoid unnecessary hop(s).

  • In most cases, robots will simply find these URLs because they are linked from other pages, so we need to see these source pages (inlinks metrics for the broken URLs). The inlinks sample includes up to 300 source URLs, which means that in many cases all inlinks will be listed. If there are more than 300, the sample will probably be large enough to understand which templates these links are coming from.

You can also look at individual links (instead of looking at a list of links to a given page as shown above) and display source-destination data. As soon as one filter or one selected metric for the columns belongs to the “full link graph,” the results are displayed with one link per line, with its source URL (1st column and chart widgets) and destination URL (3rd column below). Then, you can also display information about the link, such as the anchor text (2nd column below), which helps understand what these links are, or whether the link is followed or nofollow (not shown here). Another option is to add a filter to show follow links only (exclude source pages with a nofollow meta tag and links with a nofollow in the link itself).

  • Robots may also find these URLs with bad status codes not through links but through other types of relationships between URLs on your site: canonical tags or redirects. So make sure you check those types of sources as well. The metrics to do this are “No. of incoming canonical tags” and “No. of incoming redirects” to identify when these sources exist, and “Canonical from” and “Redirected from” to see samples.

The link view mentioned earlier is also helpful to investigate all links to a page when the sample of inlinks is large and contains several different types of pages, or when we need details related to the individual links (for instance, we want to filter based on link anchor text patterns).
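To picture how the three types of sources fit together, here is a small sketch that groups source pages by the bad URL they point to and by the type of relationship. The edge format (source, destination, kind) and the kind labels are assumptions made for this example; they do not reflect a Botify export format.

```python
from collections import defaultdict


def sources_of_bad_urls(edges, statuses):
    """Group source pages by bad destination URL and relationship type.

    edges: iterable of (source, destination, kind), where kind is
           "link", "canonical", or "redirect" (hypothetical labels).
    statuses: {url: status_code} for crawled URLs.
    Returns {destination: {kind: [sources]}} for non-200 destinations.
    """
    found = defaultdict(lambda: defaultdict(list))
    for source, destination, kind in edges:
        status = statuses.get(destination)
        if status is None or status == 200:
            continue  # unknown or healthy destination: nothing to fix
        found[destination][kind].append(source)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {dest: dict(kinds) for dest, kinds in found.items()}
```

The output tells you, for each broken or redirected URL, which pages need an updated link, an updated canonical tag, or an updated redirect rule.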

There can be a few corner cases that turn out to be a bit complex to understand (such as a succession of redirects and canonical tags). In this case, the critical questions are: How many such URLs are there? Are these strategic pages? Ultimately, do they deserve a lot of time and energy? If so, it may help to look at examples and walk backward through the paths robots follow by clicking through related URLs in their URL info page (click on incoming canonical URLs, incoming redirection URLs).

Step 3: Prepare separate to-do lists based on priorities and types of corrective action

Prioritize based on:

  • Type of pages that return errors: we want to correct the most business-critical pages first
    • See where the volumes are, by group of status codes and segment, in the HTTP section, Segments subsection:

Or in the Insights, by individual type of status code:

  • By type of source page: we want to correct broken links that are found in strategic pages. 
    • See what types of pages the errors are mainly linked from. You can start by clicking in the broken links destinations pie chart in the Outlinks section, for instance, on the 4xx outlinks, and then refine using the page type pie chart:
  • By the number of errors to correct in the source pages: first address pages that contain a large number of broken links, so that by processing a small number of source pages, we’ll address the bulk of errors. 
    • For instance, the previous view can also be filtered on a specific number of broken 4xx outlinks by updating the filter to the desired threshold.
    • You can also expand this view to see pages with any type of broken outlinks, which is convenient if you want to first address the pages that contain the highest number of broken outlinks of any kind. To do so, simply update the filter to use the metric that looks at all types of broken outlinks:

And update the columns to see the total count for any broken link as well as the count and sample for each of the three types:

  • If there is a lot to correct, you may also need to address the most urgent items first based on organic performance, either of the error pages themselves or of the pages they are linked from
    • Select broken URLs with a history of organic traffic and/or impressions and/or significant Google crawl: they were generating traffic before, but now they don’t return content successfully anymore. 
    • Select source pages with broken outlinks that have a history of organic traffic and/or impressions and/or significant Google crawl: because the source pages are highly visible, users are more likely to come across the broken links they contain, and Google to see them. 
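One possible way to turn these criteria into an ordered to-do list is sketched below. The field names (`broken_outlinks`, `organic_visits`) and the ordering rule are assumptions made for this example: pages with any organic traffic come first, then pages are ranked by how many broken links they contain. Your own weighting may well differ.

```python
def prioritize_sources(pages, min_broken=1):
    """Order source pages for correction.

    pages: list of dicts with "url", "broken_outlinks", "organic_visits"
           (hypothetical field names).
    min_broken: threshold on the number of broken outlinks.
    """
    candidates = [p for p in pages if p["broken_outlinks"] >= min_broken]
    return sorted(
        candidates,
        key=lambda p: (
            p["organic_visits"] > 0,  # visible pages first
            p["broken_outlinks"],     # then the most broken links
            p["organic_visits"],      # then the most traffic
        ),
        reverse=True,
    )


if __name__ == "__main__":
    pages = [  # made-up data
        {"url": "/archive", "broken_outlinks": 50, "organic_visits": 0},
        {"url": "/guide", "broken_outlinks": 5, "organic_visits": 100},
        {"url": "/category", "broken_outlinks": 20, "organic_visits": 10},
        {"url": "/home", "broken_outlinks": 0, "organic_visits": 900},
    ]
    for page in prioritize_sources(pages):
        print(page["url"], page["broken_outlinks"], page["organic_visits"])
```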

Cleaning up broken or bad status codes might seem daunting, but it is critical to the long-term health of your website if they represent a significant portion of your pages. Your crawl budget can become inundated with URLs that return bad status codes, meaning that your most important content isn’t getting discovered by search engines. Once you’ve gone through the work of cleaning up problematic status codes, the accessibility of your key content will be improved, and you can move on to the next step towards optimal discoverability: optimizing depth and linking.
