A much-anticipated feature is now available: the ability to customize robots.txt rules for the Botify crawler.
It’s called Virtual Robots.txt. Simply enter the new rules in Botify’s crawl setup interface, and this Virtual Robots.txt will override the robots.txt from your website.
Virtual Robots.txt, what for?
The Botify crawler’s default behavior is to follow the rules defined for Google in your website’s robots.txt file or, if there are none, those defined for any robot.
What if you only want to crawl a subset of the URLs currently allowed to robots? You may want to leave out some content that is not central to your website analysis but might take a serious toll on crawl time (such as forums or user comments).
What if, on the contrary, you want to crawl URLs that are currently disallowed to robots? For instance, a brand new version of your website only available in a staging environment.
Anything becomes possible with the Virtual Robots.txt.
How it works
It’s extremely simple. You will find an “Add Virtual Robots.txt” button at the bottom of the Botify crawl setup page.
The easiest – and safest – way to go is to copy and paste the existing robots.txt file from your website, and apply your changes. The Botify crawler supports the standard robots.txt syntax, as well as Google’s most common extensions (such as a mid-string wildcard, for instance “Disallow: /resources/*/data/”).
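To make the mid-string wildcard concrete, here is a minimal Python sketch of how a rule such as “Disallow: /resources/*/data/” could be matched against a URL path. It assumes Google's documented conventions: "*" matches any run of characters, a trailing "$" anchors the end of the path, and rules otherwise match by prefix. The function name rule_matches is hypothetical, and this is an illustration rather than Botify's actual implementation.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the given URL path."""
    # A trailing "$" anchors the pattern to the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal chunks and let each "*" match any run of characters.
    regex = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so unanchored rules behave as prefixes.
    return re.match(regex, path) is not None


# The rule from the text blocks any "data" directory one or more levels
# below /resources/.
print(rule_matches("/resources/*/data/", "/resources/images/data/file.csv"))
print(rule_matches("/resources/*/data/", "/resources/images/"))
```

Testing a few representative URLs this way before launching a crawl is a cheap way to check that a new rule excludes what you expect.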
The Botify crawler follows the directives for the Botify user-agent if there are any, otherwise those for the Googlebot user-agent, otherwise those for any (*) user-agent. It selects one set of rules only, the first available in that order. This provides flexibility when setting up the Virtual Robots.txt: you can update the Googlebot section, or create a new section for Botify.
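The selection order can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the crawler's real parser: the names parse_groups, select_rules and PREFERRED_AGENTS are hypothetical, user-agent names are compared as exact lowercase strings, and only Allow and Disallow lines are kept.

```python
from typing import Dict, List, Optional

# Order of preference described above: Botify, then Googlebot, then any robot.
PREFERRED_AGENTS = ("botify", "googlebot", "*")


def parse_groups(robots_txt: str) -> Dict[str, List[str]]:
    """Map each lowercased user-agent to the rule lines of its group."""
    groups: Dict[str, List[str]] = {}
    current: List[str] = []  # user-agents of the group being read
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif current and field.lower() in ("allow", "disallow"):
            in_rules = True
            for agent in current:
                groups[agent].append(f"{field}: {value}")
    return groups


def select_rules(robots_txt: str) -> Optional[List[str]]:
    """Return the single set of rules to apply, or None if no group matches."""
    groups = parse_groups(robots_txt)
    for agent in PREFERRED_AGENTS:
        if agent in groups:
            return groups[agent]
    return None
```

Because only one group is selected, a Botify section does not inherit anything from the Googlebot or (*) sections: any rule you still want applied has to be repeated in it.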
In the case of multi-domain crawls
What if you are crawling subdomains (*.mywebsite.com) or multiple domains (www.mywebsite.com, www.mywebsite.co.uk, www.mywebsite.de, etc.)? There could be as many distinct robots.txt files. Well, no problem, the Virtual Robots.txt can combine several regular robots.txt files. All you need to do is add a specific header which indicates what protocol and domain a robots.txt content applies to, and add those one after the other:
[header] # ex: [http://www.webmysite.com], for the website's main domain
regular robots.txt content
[header] # ex: [http://*.webmysite.com], for all other sub-domains
regular robots.txt content
[header] # ex: [https://*], for all https pages
regular robots.txt content
etc.
For header syntax and options, please refer to the Virtual Robots.txt FAQ.
No header is needed when the Virtual Robots.txt contains a single robots.txt content. Its rules will then apply to ALL crawled domains.
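As a rough illustration of this layout, the sketch below splits a Virtual Robots.txt into one robots.txt content per header. It assumes that a header is a whole line wrapped in square brackets, as in the examples above; the authoritative header syntax is in the Virtual Robots.txt FAQ. The function name split_sections and the use of None as the key for header-less content are hypothetical choices made for this example.

```python
from typing import Dict, List, Optional


def split_sections(virtual_robots: str) -> Dict[Optional[str], str]:
    """Split a Virtual Robots.txt into {header: robots.txt content}.

    Content that appears before any header is stored under the key None,
    meaning it applies to all crawled domains.
    """
    sections: Dict[Optional[str], List[str]] = {}
    header: Optional[str] = None
    for raw in virtual_robots.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank space
        # Assumed header shape: a whole line wrapped in square brackets.
        if line.startswith("[") and line.endswith("]"):
            header = line[1:-1].strip()
            sections.setdefault(header, [])
        elif line:
            sections.setdefault(header, []).append(line)
    return {key: "\n".join(lines) for key, lines in sections.items()}
```

With a single robots.txt content and no header, the result has one entry under None, which corresponds to the "applies to ALL crawled domains" case.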
Only cover the domains you want to change
As mentioned at the beginning of this post, the Virtual Robots.txt supersedes robots.txt files. This means that if the Virtual Robots.txt includes rules for a given domain (more specifically, for protocol + domain), then the robots.txt from the website will be ignored as a whole. However, the Botify crawler will still use the online robots.txt file for domains not covered in the Virtual Robots.txt.
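The fallback behavior can be pictured with the following Python sketch. Several things here are assumptions made for illustration only: headers are matched against protocol + domain as glob patterns using fnmatchcase, the first matching header wins, and the online file is obtained through a caller-supplied fetch_online function. The name robots_for is hypothetical; the actual matching rules are described in the Virtual Robots.txt FAQ.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, Optional
from urllib.parse import urlsplit


def robots_for(
    url: str,
    sections: Dict[Optional[str], str],
    fetch_online: Callable[[str], str],
) -> str:
    """Return the robots.txt content that governs the given URL."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}".lower()  # protocol + domain
    for header, content in sections.items():
        # A None header stands for header-less content: it covers everything.
        if header is None or fnmatchcase(origin, header.lower()):
            return content  # the website's own file is ignored as a whole
    # Not covered by the Virtual Robots.txt: use the online robots.txt.
    return fetch_online(origin)
```

Note that a URL on https://www.mywebsite.com is not covered by a section written for http://www.mywebsite.com, since the protocol is part of what is matched.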
Got any questions? Check out the Virtual Robots.txt FAQ.
While we’re covering robots.txt options for the Botify crawler:
There is an alternative to a Virtual Robots.txt: simply add a set of rules for Botify (user-agent: botify) to your website’s robots.txt file. However, for some of you, updating the robots.txt is not as straightforward as it seems: you may not have easy access to the file, it may involve a 3rd party, etc. It will most probably be more time-consuming than using the Virtual Robots.txt, especially when the crawl covers several domains. Either way, it’s up to you!
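For reference, a dedicated Botify section in a website's robots.txt might look like the sample below. The paths /staging/ and /forum/ are made up for the example. The snippet checks the sample with Python's standard urllib.robotparser module, which only approximates how a given crawler reads the file (for instance, it does not support wildcards).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: other robots are kept out of /staging/, while
# Botify may crawl /staging/ but skips the forum.
ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/

User-agent: botify
Allow: /staging/
Disallow: /forum/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://www.mywebsite.com"
botify_staging = parser.can_fetch("botify", base + "/staging/home")
botify_forum = parser.can_fetch("botify", base + "/forum/topic")
other_staging = parser.can_fetch("otherbot", base + "/staging/home")

print(botify_staging, botify_forum, other_staging)
```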