A website is rarely completely static. New pages appear, old pages disappear. A couple of weeks ago, we talked about checking how well Google crawls brand new content.
Let’s now see how we can keep an eye on URLs that drop off Google’s radar – pages that no longer return content, but an error or a redirection instead. In some cases, this is the expected behavior; in others, it’s not. On an editorial website, for instance, pages are not supposed to vanish: older articles are expected to remain available. On a classifieds website, on the other hand, ads expire. An e-commerce website also includes content that may expire, although usually at a slower pace than on a classifieds website.
Indicator in Botify Log Analyzer: Lost pages
In Botify Log Analyzer, pages which disappear are monitored through the “Lost pages” indicator. This indicator is found in the Crawl Distribution section of the log analyzer, in the Data Overview tab.
A “Lost page” is a page which, when crawled by Google:
- Used to return content normally (with an HTTP 200 – OK, or HTTP 304 – Not Modified status code),
- Started returning an error or a redirection HTTP status code at some point during the displayed period (the last 30 days by default, or any custom period),
- Still returned an error or a redirection when last crawled during the displayed period: this was not a temporary problem after which the page reverted to normal behavior.
As a result, the page was “lost” during the displayed period.
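The classification above can be sketched in a few lines of code. This is a minimal illustration, not Botify’s implementation: it assumes a simplified log of (URL, timestamp, HTTP status) tuples covering the displayed period, and treats 200 and 304 as “returns content normally”.

```python
from collections import defaultdict

# Statuses that count as "returns content normally".
OK_STATUSES = {200, 304}

def lost_pages(log_entries):
    """Return URLs that were 'lost' during the analyzed period.

    log_entries: iterable of (url, timestamp, http_status) tuples
    from Google's crawl, e.g. parsed out of web server logs.
    """
    history = defaultdict(list)
    for url, ts, status in sorted(log_entries, key=lambda e: e[1]):
        history[url].append(status)

    lost = set()
    for url, statuses in history.items():
        was_ok = any(s in OK_STATUSES for s in statuses)
        still_ok = statuses[-1] in OK_STATUSES
        # Lost: returned content at least once, but the most recent
        # crawl in the window got an error or a redirection.
        if was_ok and not still_ok:
            lost.add(url)
    return lost

entries = [
    ("/ad/123", 1, 200),
    ("/ad/123", 10, 410),  # expired ad: gone, and stays gone -> lost
    ("/ad/456", 2, 200),
    ("/ad/456", 5, 404),
    ("/ad/456", 9, 200),   # temporary glitch, then recovered -> not lost
]
print(sorted(lost_pages(entries)))  # ['/ad/123']
```

Note how the second URL is not counted: it returned an error at some point, but reverted to normal behavior before the end of the period, so it is a temporary problem rather than a lost page.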
Here is an example with a website specialized in housing classifieds, where most lost pages correspond to expired ads which now return a 410 HTTP status code (Gone):
The distribution of lost pages provides valuable information:
- Distribution over time helps verify that we’re dealing with expected changes, such as periodic product catalog updates or a one-time website migration.
- Distribution by type of page (by “tag”, according to the URL categorization defined at Botify Log Analyzer’s setup) shows whether lost pages are pages we didn’t want to remove, or unwanted pages that we removed knowingly.
On this other classifieds website, for instance, there is a low, regular number of lost pages in the classifieds category (expired ad detail pages, in light green), combined with a surge of lost pages, including classifieds as well as dealer (“Pro”) pages, toward the end of the period. In this example, lost URLs are now redirected.
From the data table below the graph, we can zoom in on a page category and see details within that category (click on the category name), or click on “View URLs samples” to go to the Export tab and get a CSV file.
In this case, the cause is an update of the dealer section, which impacts both dealer navigation pages and classifieds pages.
Here are the details for the classifieds category:
And here are the details for dealer pages:
Compare lost pages to new crawled pages to analyze content rotation
Comparing the number of lost pages with the number of new pages crawled (“New Unique URLs crawled”, i.e. pages crawled for the first time ever) is a good way to get a grasp of content rotation – typically for products or ads. With normal content rotation, the number of lost pages will be in the same ballpark as the number of newly crawled pages.
This comparison needs to be made over a long period (30 to 60 days), especially when dealing with a large number of crawled pages: Google needs time to crawl the content, and it doesn’t make sense to analyze content rotation on a partial view.
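The “same ballpark” check can be made concrete with a simple ratio. This is an illustrative sketch, not a Botify feature: the function name and the counts are hypothetical, and in practice you would take the two totals from the log analyzer over a 30–60 day window.

```python
def rotation_ratio(lost_count, new_count):
    """Lost pages divided by new crawled pages.

    A ratio close to 1.0 over a long enough window suggests normal
    content rotation; a ratio far above 1.0 means the site is losing
    more pages than it gains.
    """
    if new_count == 0:
        return float("inf") if lost_count else 1.0
    return lost_count / new_count

# Hypothetical 60-day totals: 48,000 ads expired while 52,000 new
# ads were crawled for the first time.
print(f"{rotation_ratio(48_000, 52_000):.2f}")  # 0.92
```

A ratio of 0.92 here is in the same ballpark as 1.0, consistent with normal rotation of expiring ads.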
Monitor page removal
The lost pages indicator is also very convenient for monitoring page removal. Let’s say, for instance, that you just removed some useless pages that were generated by mistake. Google doesn’t know these pages are gone until it crawls them again and gets the appropriate HTTP status code (HTTP 404 – Not Found, or HTTP 410 – Gone). The fact that the URLs appear as Lost Pages in Botify Log Analyzer indicates that Google got the information.
This is more precise than simply looking at HTTP status codes: if we just look at all 404s crawled by Google, we may also see pages that have been returning HTTP 404 for some time and that Google keeps checking. Lost pages tell us precisely which pages were lost during the time frame we are looking at.
Do you find this useful? Do you see other interesting usage scenarios? Let us know!