Search is complex.
To get the organic traffic and conversions we want, we first need to make sure our pages are crawled, rendered, indexed, and ranking.
This is what we call the SEO funnel or “full-funnel methodology.”
As a funnel, this gets narrower at every step. In other words, there’s no guarantee that all the pages on your site will be crawled, no guarantee all your crawled pages will be indexed, and so on.
While that makes sense in theory, we wanted to understand it in practice. This involved a study of 413 million unique web pages and 6.2 billion Googlebot requests over a 30-day period.
We found that:
Why is that?
While there’s no simple answer, we knew that the growing size and complexity of the web had some role to play, so we reached out to Google’s Martin Splitt to shed some light on these issues.
You can keep scrolling to read the full interview, watch the conversation on-demand, or use the index below to jump to a particular section:
To help us understand the evolution of the search process, Martin walked us through the history of the web and how it’s evolved in the past 20 years.
“Originally, the web was a document platform. So you would have a bunch of documents like your homepage, a services page, etc. All of these things are informational. It’s basically like a page out of a book — a static document. That’s why it’s called a web page. Since then, we’ve introduced more interactivity to the web. The old web still exists, and it’s still perfectly fine to build static websites.”
“However, a lot of people want more. They want to have the opportunity to add comments, they want live chat, they might build entire applications. You could, for instance, build an application that allows you to manage all your household appliances or manage a shared shopping list. You can still put that application on the web but it’s more interactive. It’s not just informational. Someone doesn’t go there just to fetch information. Someone goes there to use the application.”
“The web is really transforming into an application platform rather than a purely document platform. That’s where things get challenging. The line between what is application and what is content blurs.”
“A lot of people are very worried about this happening, and this does happen, but then again, this also happens if you change your server-side configuration, if you accidentally add a new robots.txt and disallow everything, etc.”
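The robots.txt accident Martin mentions takes only two lines. This hypothetical file tells every crawler to stay away from the entire site:

```text
User-agent: *
Disallow: /
```

Ship that by mistake during a migration and your pages can drop out of the crawl queue just as surely as any rendering problem.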
The results were split pretty evenly across the board:
Someone specifically asked Martin if viewing the cached version of a page is a good way to know whether it has been rendered.
While Martin said there's no dedicated tool that lets you pop in a URL and get back a Yes/No answer to "has this been rendered?", we can safely assume that everything in the index has been rendered. There are some other tools we can use, but the cache option isn't one of them.
“Using the cache option to figure out if the page has been rendered is a bad idea because the cache feature is really old and hasn’t been maintained in the last couple of years. There’s no one actively working on it. The cache extracts information at some point during the search process, I believe sometimes it extracts it before rendering and sometimes after. So, it’s not a debug tool. It’s just a convenience feature so that if your server goes down, we have a copy saved of that page. That’s not necessarily what’s in the index.”
“If you want to know if we’ve seen your content, then you can use Google Search Console’s URL Inspection Tool — just click on ‘view crawled page’ and look at the HTML that we have rendered. If you want to test how that would look when we crawl, render, and index again, we can do a live test that does pretty much the same thing. There may be small differences because of the way we do caching, but for the most part it’s the most accurate depiction of what’s happening.”
Speaking of testing tools, someone in the audience was worried about the structured data testing tool being replaced.
💡 Structured data is code you can use to mark up your pages to help search engines better understand what they’re about. Some structured data can even make your page eligible to show up in special “rich” features in Google search results. Learn more about structured data here.
They were curious if this happened because Google’s recommendations around schema had changed — are SEOs supposed to only focus on schema that will work for getting rich snippets in Google?
“This exact question is the reason this happened. The structured data testing tool is a tool that isn’t Google specific, technically, because it uses a bunch of validators and rules that aren’t Google product specific. Basically, we were mixing things. The structured data testing tool showed things that wouldn’t necessarily make you eligible for rich results, but at the same time, also showed you validation rules that were not in schema.org but were specific to Google products.”
“So if you want a product to show up in rich results, for instance, then I believe it has to have an image, but that’s not required by schema.org. So people would run their schema.org markup through the structured data testing tool and wonder ‘hey why is this a requirement? Schema doesn’t say it’s a requirement!’ but Google was showing it as required because it’s what they specifically require on top of schema.org standards to show it in rich results.”
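As an illustration of the gap Martin describes, here's a minimal, hypothetical Product snippet in JSON-LD (the product name and URLs are made up). Schema.org itself doesn't require the `image` property, but Google's rich results guidelines do:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "image": "https://www.example.com/widget.jpg",
  "offers": {
    "@type": "Offer",
    "price": "19.99",
    "priceCurrency": "USD"
  }
}
</script>
```

Drop the `image` line and the markup is still perfectly valid schema.org — which is exactly why a schema-only validator and a Google-specific eligibility check can disagree about the same page.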
“That’s why we decided it’s not good to mix testing structured data accuracy and testing if something qualifies for Google rich snippets. They’re not exactly the same. We decided to make something that was just specific to the Google side of things, which is the rich results test. The rich results test tells you how you are performing in terms of rich results eligibility.”
“When it comes to the structured data testing tool… we would have to make large sweeping modifications to it to untangle the parts that are Google specific from the parts that aren’t. The structured data testing tool is not going away for a while, and there are other tools out there that can help you. Who knows where this is going. Maybe we can eventually open source a version or something like that — there’s no announcements for that yet but we’ll see where things go.”
Since Google’s inception, indexing the web has been accomplished through crawling. But recently, search engines like Bing have started to shift from this crawl-only approach to one that integrates an indexing API.
💡 Indexing APIs allow webmasters to submit content directly to the search engine, rather than relying on the search engine to crawl and find the content.
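To make the mechanism concrete, here's a hedged Python sketch of a submission to Google's Indexing API. The endpoint and payload shape come from Google's published documentation; the page URL and OAuth token are placeholders, and a real call needs a service-account token authorized for the site:

```python
import json
from urllib import request

# Endpoint documented in Google's Indexing API reference.
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

def build_notification(url: str, action: str = "URL_UPDATED") -> dict:
    """Build the JSON body for a URL notification.

    `action` is "URL_UPDATED" for new or changed pages,
    or "URL_DELETED" for removed ones.
    """
    assert action in ("URL_UPDATED", "URL_DELETED")
    return {"url": url, "type": action}

def publish(url: str, token: str) -> request.Request:
    """Prepare (but don't send) the HTTP request.

    Call urllib.request.urlopen(req) to actually submit it;
    `token` must be a real OAuth 2.0 access token in practice.
    """
    body = json.dumps(build_notification(url)).encode("utf-8")
    return request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # placeholder token
        },
    )

# Example: prepare a notification for a hypothetical job posting URL.
req = publish("https://www.example.com/jobs/12345", token="YOUR_OAUTH_TOKEN")
```

Note that, as discussed below, Google currently accepts only job postings and livestream events through this API.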
Botify even has a partnership with Bing that allows us to render pages for our customers and push them directly into Bing’s index. This saves a lot of time for both bots and websites.
Since we knew that only job posts and livestream events are eligible for Google’s indexing API, we asked Martin for more information on how Google views indexing APIs and where they might be going in the future.
“We don’t have any plans to announce in this area, but yes our indexing API is allowing two content types: livestream events and job postings. These are both pretty real time-y. That’s allowed us to experiment with this new format.”
“I’m looking forward to seeing the future of indexing APIs. I can see potential problems with it, and I’m pretty sure Bing has thought about those as well. For example, why not shove every URL that you have into these indexing APIs all the time? You basically go back to square one. No one can crawl and index and process content all the time for everything that is on the internet. The web is too large. If every website out there starts pushing every page every day into these indexing APIs created by whoever offers them, that’s going to be hard.”
Martin then drew a parallel between indexing APIs and sitemaps, which weren’t a thing when Google started crawling the web.
“In the beginning, all we did was find a URL somewhere, fetch that URL, get all the links on that page, and that’s how we discovered that your website had more pages, and we’d go from link to link, and we could understand the priority of your content roughly based on your structure.”
For a visual explanation of this concept, check out this throwback video of Matt Cutts explaining how Google crawls the web.
“So if it’s something that’s on the home page, it’s probably more important than something you can only reach by clicking through the home page, then a menu, then another link in that text, and so on. You probably don’t care as much about that content compared to something that’s linked on the home page. Now that’s not complete obviously, because if I have a bazillion products then they can’t all be on the home page, but they’re still important. Maybe the first thousand products are most important to me, but I don’t want to put a thousand products on the home page, so then the sitemap mechanism was invented, where you could tell us what you thought was most important on your site.”
“But as it turned out, eventually, a lot of people were saying everything was important, which didn’t help. That’s why that signal [the priority field in XML sitemaps] deteriorated in usefulness. So basically, we may run into the same problem with the indexing API where we have to give you a quota, then people are like ‘the quota is too small for me’ and everyone starts saying that, but we’d have to draw a line somewhere. I’m not sure if it’s the silver bullet that everyone hopes for. I do think it’s an interesting concept, and as you say, we are trying it out for certain types of content, but we’ll see.”
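The sitemap mechanism Martin describes is just an XML file listing URLs, optionally with the (now largely discounted) `<priority>` field. A minimal hypothetical example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/products/widget</loc>
    <priority>0.8</priority>
  </url>
</urlset>
```

Once everyone started marking everything as priority 1.0, the field stopped carrying information — which is exactly the inflation problem Martin anticipates for indexing API quotas.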
Next, we asked Martin about the dangers of infinite scroll.
At some point, we remembered hearing Google’s John Mueller saying something about Google rendering pages using a very tall viewport — something like 9,000 pixels — so we wanted to know if that was still the case today.
“Generally yes, it’s not limited to a certain amount of pixels. There are other heuristics that we use, but yeah generally we are using a viewport that allows us to make sure we see all your content. There are implementation-specific details that may change tomorrow, so I can’t give you a number of pixels, but I would just check with the testing tools if we can see your content.”
So we asked Martin to give us an overview of dynamic rendering.
“In both those scenarios, users and bots are treated the same way — they both get prerendered content from the server. But when you have problems that only concern bots, you can do dynamic rendering. So when the request comes in, you determine if it’s a user or a bot. If it is [a bot], you send that request to a dynamic rendering server that renders the page and gives the static HTML back to the bot, whereas if a user makes the request, they just get the client-side rendered version, which they render on their own device.”
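The routing decision Martin describes can be sketched in a few lines. This is a deliberately simplified, hypothetical example — the bot token list and function names are invented, and real setups usually lean on a prerendering service such as Rendertron rather than hand-rolled detection:

```python
# Hypothetical, simplified dynamic-rendering router.

BOT_TOKENS = ("googlebot", "bingbot", "duckduckbot")  # illustrative list

def is_bot(user_agent: str) -> bool:
    """Crude User-Agent sniffing; production detection is more robust."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)

def handle_request(user_agent: str) -> str:
    if is_bot(user_agent):
        # Bots get static HTML produced by the dynamic rendering server.
        return "prerendered-html"
    # Users get the JavaScript app and render the page on their own device.
    return "client-side-app"
```

For example, `handle_request("Mozilla/5.0 (compatible; Googlebot/2.1)")` would route to the prerendered path, while a desktop Chrome User-Agent would get the client-side app.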
We then asked Martin about how fast we need to respond to bots. This was based on something we had heard him say previously about “answering the bots as fast as possible to avoid timeouts.”
How fast is fast enough?
“As fast as you can. One of the things that happens when you’re crawling the web is you’re running into a tradeoff you have to make. On the one hand, you want to make as many HTTP requests as possible to get as much content back from a website as possible. If you are an e-commerce site with a million products, optimally as a crawler, I would make a million HTTP requests in one go, get all the product information back, and then I can update my index based on that, and tomorrow I’ll do the same thing. But at the same time, web servers vary in capability.”
“Maybe you’re an e-commerce provider and it’s Black Friday and everyone wants to buy things from your website at the same time, so your web server is already heavily loaded. Now let’s say Googlebot comes along and makes 10 million requests when normally only 100 customers are shopping on your website at a given time. So maybe your server crashes and serves error pages to Googlebot, or even worse, your visitors. That’s something that not only you don’t want, we at Google don’t want to overwhelm and crash your server either. So we’re in this tradeoff situation — we want to get all your content but we don’t want to crash your server.”
“So we look at things like whether your server responds with 5xx errors. When we see this, we know that maybe we need to make fewer HTTP requests, so we’ll slow down a bit. We’ll eventually start trying to see if we can go back to making more requests, but we’ll be very careful. Another thing that happens right before you push a web server over the edge is it starts getting slower; that’s a good sign a web server is about to be pushed over the edge. We make fewer requests then, too.”
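The backoff behavior Martin describes can be modeled roughly like this. Every threshold and multiplier here is invented for illustration — Google's actual crawl-rate heuristics are not public:

```python
class PoliteCrawler:
    """Toy model of adaptive crawl-rate control.

    Backs off sharply on 5xx responses or slow replies, and only
    cautiously speeds up again after a sustained run of healthy
    responses. All numbers are made up for illustration.
    """

    def __init__(self):
        self.delay = 1.0          # seconds between requests
        self.healthy_streak = 0   # consecutive fast, successful responses

    def record_response(self, status: int, response_time: float) -> None:
        if status >= 500 or response_time > 2.0:
            # Server erroring or slowing down: double the delay (capped).
            self.delay = min(self.delay * 2, 60.0)
            self.healthy_streak = 0
        else:
            self.healthy_streak += 1
            if self.healthy_streak >= 10:
                # Sustained health: carefully shave the delay back down.
                self.delay = max(self.delay * 0.9, 0.1)
                self.healthy_streak = 0
```

The asymmetry is the point: slowing down is immediate and aggressive (to avoid crashing the server), while speeding back up is gradual and conditional, mirroring the "we'll be very careful" behavior in the quote.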
During our audience Q&A, Martin got quite a few questions about whether dynamic rendering would be considered cloaking in a variety of scenarios.
According to Martin, no, you don’t have to worry about cloaking.
“Generally speaking, no. If all you do is dynamic rendering, which is serving a pre-rendered version of your page to bots, and a client-side rendered version to your users, and it’s the same or roughly the same content, you would not risk a cloaking penalty.”
Someone then asked a similar question, but in the context of showing or hiding ads when Googlebot requests their page. Again, Martin said this wouldn’t be considered cloaking.
“Generally, Google is relatively good at spotting ads and not requesting or rendering them, so that’s not a concern I would have. If you do, I wouldn’t worry about them. The risk of introducing additional complexity to make it easier for the crawler is higher than the potential benefit. If you’re not seeing problems with your crawl budget, nor with how your website is rendered, it should be fine. It’s also not considered cloaking.”
There was another question about whether serving Googlebot a single consolidated page rather than multiple parameter pages would be considered cloaking. Again, Martin responded that it would not be.
“Generally speaking, most of the cases where people ask if something is cloaking, it’s definitely not cloaking. Cloaking is when you misdirect the user. People are a lot more nervous about cloaking than I think is warranted. Cloaking is specifically about spammy techniques or misleading users. If your website is about cats but for Googlebot you say it’s about dogs, that’s cloaking. But if it’s a matter of showing 5 cats to the user and 3 or 10 cats to the bot, that’s not a problem from our perspective in terms of cloaking. If your parameters show different content and you choose not to show your parameters to Googlebot, it just means we’re not going to see that content.”
One last question related to cloaking was about whether it was OK to use dynamic rendering to remove some URLs from your page, provided that those URLs were already blocked by robots.txt.
“Yes you can do that, but I would advise against it because it feels like it’s adding complexity because you have a version of your website that’s different than what your users see, which means you have a harder time testing it or you might forget about testing it. That mechanism could go rogue and produce incorrect values or errors that you don’t see. It just seems like more complexity and risk than benefit.”
We then had a question from someone whose e-commerce platform was encouraging them to reduce Googlebot’s crawl rate. That seemed like an extreme measure, so they were wondering if that was necessary.
“I would argue that if this platform says it’s an enterprise platform, they should be able to deal with the load. But then again, if you have a really large site and it can’t be handled, that’s just an argument to consider a different platform. You can limit the rate. It obviously introduces the risk that we crawl fewer pages than you might want us to, but if you are below, say, 1 million pages, then you should definitely not worry about this unless you have lots of really frequently updating content.”
Another question was whether it’s OK to serve bots from a separate server than the one users hit.

“Sure! We don’t care. However, it might add complexity. Because if that server misbehaves, but the server you’re using to browse the website is fine, you might be confused when Google tells you something is a 5xx error. It’s just complexity that you should be careful with.”
Martin was also asked about serving personalized content.

“As far as the searcher goes, if they come to your website the first time through search results, they probably won’t get personalized results so I wouldn’t worry too much about it. Just make sure you have a good default experience.”
There’s a tool from the old Google Search Console that hasn’t been migrated, which is the Googlebot crawl time tool that shows how quickly Googlebot can access your pages. We asked Martin if this will be updated and added to the new (current) version of Google Search Console, and if so, if it would include rendering time.
“It’s in progress! But no, it won’t include rendering time. The idea behind Google Search Console is to give insights that are actionable. There’s nothing you can do about rendering. It takes however long it takes. You do get an idea of how websites perform in terms of the user — that’s what the Core Web Vitals is about — but don’t worry about us. Basically, you can pretend that it’s instant.”
💡 Core Web Vitals are new metrics that Google will soon begin to consider as part of their ranking signals. Read more about those metrics and the page experience update here!
If you’d like to watch the full interview with Martin Splitt, it’s available here!
For more information, we recommend checking out the following resources: