What Is Log File Analysis?

Log file analysis uses logs, or records, from web servers to measure the crawl behavior of search engines and identify potential issues or opportunities for SEO.

Every HTTP request for a web page or resource (such as an image, CSS or JavaScript file) sends request headers to the server, which identify the client, or browser, making the request. The web server's initial response, the HTTP status code, is recorded alongside the request.

On most web servers, these request logs are recorded and stored for a period of time (how long usually depends on how much traffic the server gets and how quickly log data accumulates). 

The request headers indicate the User Agent requesting the content and include some detail on its features, which servers use to return the most suitable code and content to the client or browser.

Search engines use automated crawlers (or “bots”) to discover, render and index web content. These crawlers are identifiable by their User Agents, which the search engines publish.

Log file analysis for SEO focuses on the crawling behavior of search engine crawlers in particular, excluding other User Agents.
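
As a minimal sketch of how crawler requests might be separated from other traffic, the Python below matches a User Agent string against a small set of substrings. The `CRAWLER_TOKENS` mapping and the `identify_crawler` function are hypothetical names chosen for illustration, and the two tokens are only examples, not a complete list. User Agent strings can be spoofed, so a match here is an indication rather than proof.

```python
# Hypothetical mapping of crawler names to substrings expected in their
# User Agent strings. Extend it with the crawlers relevant to the site.
CRAWLER_TOKENS = {
    "Googlebot": "googlebot",
    "Bingbot": "bingbot",
}


def identify_crawler(user_agent):
    """Return the crawler name if a known token appears in the string, else None."""
    lowered = user_agent.lower()
    for name, token in CRAWLER_TOKENS.items():
        if token in lowered:
            return name
    return None


googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

print(identify_crawler(googlebot_ua))  # Googlebot
print(identify_crawler(browser_ua))    # None
```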

The Data Included in Log Files

The data contained within log files is normally limited to basic information:

  • URL path of the requested resource
  • Query string of the requested resource
  • User Agent requesting the resource
  • IP address of the client sending the request (from which an approximate location can be inferred)
  • Time stamp (when the request was received)
  • Request method (such as GET or POST, indicating whether the request is to receive or provide data)
  • HTTP status code of the server’s response
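
To show how these fields might be pulled out of a single log entry, here is a rough Python sketch. It assumes the server writes the widely used "combined" log format; real log formats vary by server and configuration, so the pattern would need adjusting. The `LOG_PATTERN`, `parse_line` and `SAMPLE` names are hypothetical, and the sample line is made up for illustration.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit

# Assumed layout: ip ident user [time] "method target protocol" status bytes "referer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of the fields listed above, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    target = urlsplit(match["target"])  # separates the URL path from the query string
    return {
        "path": target.path,
        "query": target.query,
        "user_agent": match["user_agent"],
        "ip": match["ip"],
        "timestamp": datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "method": match["method"],
        "status": int(match["status"]),
    }


# Made-up sample entry, not taken from a real server.
SAMPLE = (
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /products?page=2 HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

record = parse_line(SAMPLE)
print(record["path"], record["query"], record["status"])
```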

The Purpose of Log File Analysis for SEO

While log file analysis can be used for a variety of objectives, in SEO the primary aims are usually to:

  1. Identify problematic pages or sections that cause search engines to waste resources crawling low-value or invalid URLs
    • For example, search engine crawlers may discover and crawl apparent URLs in JavaScript code that are invalid or do not contain useful content – these can be blocked via robots.txt
  2. Monitor the HTTP status codes returned to search engine crawlers, which impact the rendering, indexing and ranking of pages in search results
    • For example, search engine crawlers may receive a large number of 302 redirects from a web server (which may not pass authority/PageRank or other indexing properties) – these redirects can be changed to 301 redirects, or the references to these redirected URLs can be updated to new/valid destination URLs
  3. Otherwise optimize crawl budget to ensure search engines can efficiently access pages intended for search results
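
As a rough illustration of the second aim, the Python sketch below tallies the status codes returned to crawlers and lists the URL paths that answered with a 302. It assumes the log entries have already been parsed into dictionaries with `path` and `status` keys; that record shape, the function names and the sample records are all hypothetical.

```python
from collections import Counter


def status_summary(records):
    """Count how many requests received each HTTP status code."""
    return Counter(record["status"] for record in records)


def paths_with_status(records, status):
    """Return the unique URL paths that returned the given status, sorted."""
    return sorted({record["path"] for record in records if record["status"] == status})


# Made-up records standing in for parsed crawler requests.
records = [
    {"path": "/", "status": 200},
    {"path": "/old-page", "status": 302},
    {"path": "/old-page", "status": 302},
    {"path": "/missing", "status": 404},
]

print(status_summary(records))
print(paths_with_status(records, 302))  # ['/old-page']
```

A list like the second one could then be reviewed to decide which redirects to change to 301s and which internal references to update.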

How To Do Log Analysis

Analyzing log data can be technically complex, but the process can be summarized in three basic steps:

  1. Collect/export the right log data (usually filtered for search engine crawler User Agents only) for as wide a time frame as possible
    • While there is no “right” amount of time, two months (8 weeks) of search engine crawler logs is often sufficient 
  2. Parse log data to convert it into a format readable by data analysis tools (often tabular format for use in databases or spreadsheets)
    • This technical step often requires programming in Python or a similar language to extract data fields from logs and convert them to CSV or a database file format
  3. Group and visualize log data as needed (typically by date, page type and status code) and analyze it for issues and opportunities
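
Steps 2 and 3 can be sketched in a few lines of Python. The example below groups already-parsed requests by date and status code, then writes the counts as CSV for a spreadsheet or database. It assumes each record is a dictionary with a `timestamp` (a `datetime`) and a `status`; the function names and sample records are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def daily_status_counts(records):
    """Count requests per (ISO date, status code) pair."""
    return Counter(
        (record["timestamp"].date().isoformat(), record["status"])
        for record in records
    )


def write_counts_csv(counts, out):
    """Write the grouped counts to a file-like object as CSV rows."""
    writer = csv.writer(out)
    writer.writerow(["date", "status", "requests"])
    for (day, status), total in sorted(counts.items()):
        writer.writerow([day, status, total])


# Made-up records standing in for parsed crawler requests.
records = [
    {"timestamp": datetime(2023, 10, 10, 13, 55), "status": 200},
    {"timestamp": datetime(2023, 10, 10, 14, 2), "status": 200},
    {"timestamp": datetime(2023, 10, 11, 9, 30), "status": 302},
]

buffer = io.StringIO()  # stands in for a file opened with open(..., "w", newline="")
write_counts_csv(daily_status_counts(records), buffer)
print(buffer.getvalue())
```

Grouping by page type would work the same way, with a site-specific rule that maps each URL path to a page type added to the grouping key.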

While log data is usually simple in format, it can quickly add up to gigabytes of data even when filtered for requests from search engine crawlers in a limited time frame. This data is often too large for desktop analysis tools like Excel.
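
When scripting the analysis instead, one way to cope with this volume is to read the log one line at a time rather than loading the whole file into memory. The Python below is a minimal sketch of that idea; the function names, the default tokens and the file path passed to `count_crawler_requests` are assumptions for illustration.

```python
def crawler_lines(lines, tokens=("googlebot", "bingbot")):
    """Lazily yield only the lines that mention one of the crawler tokens."""
    for line in lines:
        lowered = line.lower()
        if any(token in lowered for token in tokens):
            yield line


def count_crawler_requests(path):
    """Count crawler requests in a log file without reading it all at once."""
    # Iterating over the file handle reads one line at a time.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in crawler_lines(handle))


# Works on any iterable of lines; a short made-up list is used here.
sample = [
    '1.2.3.4 - - "GET / HTTP/1.1" 200 "Googlebot/2.1"',
    '5.6.7.8 - - "GET / HTTP/1.1" 200 "Firefox"',
]
print(len(list(crawler_lines(sample))))  # 1
```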

It is often most efficient to use specialized log analysis software to parse, organize and visualize log data.