As Mike King wrote in his tour de force article about the technical SEO renaissance, "it has always been a crapshoot as to whether that content actually gets crawled and, more importantly, indexed." You need to measure and monitor this content, because search engines use it for crawling, indexing, and ranking. It affects your bottom line.
- Pair this information with crawl metrics from log files and traffic metrics from web analytics
Being able to crawl a website and join that data with other highly relevant information, like server log files and web analytics, is a core competency. With these sources combined, you can get an accurate view of your site structure and how search engines are crawling it.
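As a rough sketch of what that join looks like, here is a pandas example that combines crawl results with log-file hits and analytics sessions on a shared URL key. The column names (`status`, `googlebot_hits`, `sessions`) and the sample rows are illustrative, not the export format of any particular tool.

```python
# Sketch: joining crawl data with log-file hits and analytics sessions.
# Column names and values are illustrative, not a specific tool's export.
import pandas as pd

crawl = pd.DataFrame({
    "url": ["/a", "/b", "/c"],
    "status": [200, 200, 404],
})
logs = pd.DataFrame({
    "url": ["/a", "/c"],
    "googlebot_hits": [120, 4],
})
analytics = pd.DataFrame({
    "url": ["/a", "/b"],
    "sessions": [300, 15],
})

joined = (
    crawl
    .merge(logs, on="url", how="left")       # keep every crawled URL
    .merge(analytics, on="url", how="left")
    .fillna({"googlebot_hits": 0, "sessions": 0})
)

print(joined)
```

Once all three sources share one key, pages that Googlebot crawls heavily but users never reach (or the reverse) surface immediately.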
In a normal crawl of the page, we found 141 outgoing links, none of them to other products.
We can grab one of the comments and search to see whether it’s indexed… and it is:
We created a custom HTML extract to capture the number of comments (using `<div class="shout-body">`) on the Last.fm page, as well as the amount of content overall. In our normal crawl, we found no comments.
Success: 11 comments found! Now we can evaluate on-page content in a way that is much closer to how search engines see it than we could before.
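The extraction step above can be sketched with a few lines of standard-library Python: count `div` elements carrying the `shout-body` class in the rendered HTML. The HTML snippet here is illustrative, not Last.fm's actual markup.

```python
# Sketch: counting comments via the same selector idea as the custom
# extract (div elements with class "shout-body"), standard library only.
from html.parser import HTMLParser

class CommentCounter(HTMLParser):
    """Count <div class="shout-body"> elements in a rendered page."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "shout-body" in classes:
            self.count += 1

# Illustrative markup, not Last.fm's real HTML.
rendered_html = """
<div class="shout-body">Great track!</div>
<div class="shout-body">Love this one.</div>
<div class="track-info">not a comment</div>
"""

counter = CommentCounter()
counter.feed(rendered_html)
print(counter.count)  # 2 comments in this sample
```

Run against the raw HTML and the rendered HTML of the same page, the difference in counts is exactly the content a non-rendering crawler misses.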
- Respect robots.txt rules
- Be configured to not execute certain JS files, such as web analytics (to avoid inflating traffic metrics)
- Cache resources to reduce load on the website
- Capture and follow links created using onClick or other handlers
- Render JS content that only loads in response to requests from specific user agents
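The first three behaviours above can be sketched with the standard library: parse robots.txt with `urllib.robotparser`, keep a blocklist of analytics hosts so their scripts are never requested, and route fetches through a cache. The robots rules, hostnames, and function names are illustrative; a real rendering crawler would hook these checks into its headless browser's network layer.

```python
# Sketch: robots.txt compliance, analytics blocking, and resource caching.
# Rules, hostnames, and helper names are illustrative assumptions.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hosts whose scripts we refuse to fetch, to avoid inflating traffic metrics.
ANALYTICS_PATTERNS = ("google-analytics.com", "googletagmanager.com")

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())
robots.modified()  # mark the rules as loaded so can_fetch consults them

cache = {}  # url -> body, so repeat requests don't hit the site again

def should_fetch(url: str, user_agent: str = "MyCrawler") -> bool:
    """Apply the analytics blocklist and robots rules before any request."""
    if any(pat in url for pat in ANALYTICS_PATTERNS):
        return False  # never execute analytics JS
    return robots.can_fetch(user_agent, url)

def fetch(url: str, fetcher) -> str:
    """Fetch through the cache; `fetcher` performs the actual HTTP request."""
    if url not in cache:
        cache[url] = fetcher(url)
    return cache[url]
```

Following `onClick` links and serving user-agent-dependent content are rendering-engine concerns and sit outside a snippet like this, but the gatekeeping pattern is the same: every URL the renderer wants passes through `should_fetch` first.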