Did you ever wonder how to crawl only a portion of your website? Playing with the crawl's start URL and the Virtual Robots.txt option in Botify Analytics allows you to do just that. Let's look at situations where you might want to perform a partial crawl, and see how to do it.
Why Analyze a Section of Your Website
There are many situations where restricting the crawl to an area of your website might be useful. You may want to analyze:
- A folder with a linguistic version of a website
- Everything BUT a folder – for example, an e-commerce website with an editorial section you want to leave out
- A subdomain only, which happens to have its top navigation in another subdomain
- The website's navigation only, excluding content pages. For instance, in an e-commerce website, the category tree and search pages, but not product pages or user ratings
Crawl a Folder Only
Let's say we want to crawl the English version only. The crawler setup is as follows:
Add a Virtual Robots.txt to the crawler configuration: click on “Add Virtual Robots.txt”, copy your website’s robots.txt file content, and add rules to restrict the crawl to the folder.
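For instance, if the English version lives in an /en/ folder (a hypothetical folder name; adapt it to your site), and assuming the crawler honors Allow rules with longest-match precedence as Google's robots.txt parser does, the added rules could look like this:

```
# Hypothetical example: the English version is assumed to live under /en/
User-agent: *
Disallow: /
Allow: /en/
```

With everything disallowed by default, only URLs in /en/ remain crawlable. Make sure the start URL is inside that folder.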
Crawl Everything BUT a Folder
It’s even easier:
Start the crawl normally from the home page, copy your robots.txt file content into the Virtual Robots.txt, and simply add a disallow rule for that folder.
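For example, to leave out an editorial section located in a /blog/ folder (a hypothetical folder name used here for illustration), the added rule would be:

```
# Hypothetical example: exclude the editorial section under /blog/
User-agent: *
Disallow: /blog/
```

Any rules copied from your live robots.txt stay as they are; this line simply excludes the folder on top of them.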
Crawl a Subdomain Only, When the Start URL is Elsewhere
In many cases, crawling only a subdomain is very straightforward. No need to use a Virtual Robots.txt. All we need to do is list that subdomain as the only allowed domain, and make sure that the start URL is in that subdomain.
But what if the subdomain doesn’t have a proper home page? It becomes trickier if the top navigation for the subdomain’s content is placed in another subdomain. Which is quite common, actually: imagine that there is a forum subdomain (forum.mywebsite.com), and that the forum home – and perhaps its top navigation with the forum main themes – is on the main subdomain (www.mywebsite.com):
In that case, we need to allow both subdomains, and carefully list all the restrictions in the virtual robots.txt to only allow the top navigation in the main domain.
If the forum navigation is placed in a /forum/ folder on www.mywebsite.com, all we need to do is use the following Virtual Robots.txt:
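A sketch of that Virtual Robots.txt, using a per-domain [header] section so the rules apply only to the main subdomain:

```
[www.mywebsite.com]
User-agent: *
Disallow: /
Allow: /forum/
```

On www.mywebsite.com, only the /forum/ navigation pages remain crawlable; forum.mywebsite.com is left unrestricted here, apart from its own live robots.txt.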
This is a simple example. If forum URLs on www.mywebsite.com were not conveniently located in a folder, we would have to list more detailed rules to cover them all with “Allow:” lines.
Notice that we used a [www.mywebsite.com] header, which means that the rules will only be applied to that domain. For forum.mywebsite.com, the Botify crawler will use the robots.txt file found online. If we wanted to restrict the crawl on the forum subdomain as well (for instance, crawl only a folder), we would add another section with a [forum.mywebsite.com] header – in that case, don't forget the User-agent line, as each section corresponds to a full, independent robots.txt.
For more details on the Virtual Robots.txt option’s functionality, check out the FAQ page.
Crawl Navigation Only
If an e-commerce website has a large number of products, crawling the whole website is not the best approach to analyze the navigation structure: you may reach a very large number of URLs before covering all of the navigation. A better approach can be to perform an unrestricted crawl stopped at a certain depth, and then a second crawl on navigation only, with no depth limitation.
If the website has the following levels of navigation:
1) Universe (URL pattern: contains “_u[identifier]”)
2) Section (URL pattern: contains “_s[identifier]”)
3) Category (URL pattern: contains “_c[identifier]”)
4) Subcategory (URL pattern: contains “_m[identifier]”)
The Virtual Robots.txt that restricts the crawl to navigation pages only is as follows:
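Based on the URL patterns above, and assuming the crawler supports the `*` wildcard with longest-match precedence (as Google's robots.txt parser does), the rules could look like this:

```
User-agent: *
Disallow: /
# Navigation levels: universe, section, category, subcategory
Allow: /*_u
Allow: /*_s
Allow: /*_c
Allow: /*_m
```

If the start URL (typically the home page) does not match any of the navigation patterns, you may also need to allow it explicitly, for instance with an `Allow: /$` rule.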
That’s it! You’re all set to start the crawl!
As a general rule, it is possible to analyze any part of a website, as long as we are able to define:
- An entry point (or, if needed, up to 3 entry points)
- URL patterns that define the only links that must be followed by the crawler to remain within the allowed area – or the full list of URL patterns that are not allowed.
Are you facing a situation which is not covered by our examples, and wondering how to proceed? Let us know!