Technical SEO

Common Crawlability Issues & How to Solve Them

10 min read
April 23, 2025
RJ Salerno
Senior SEO Success Manager

Across the search industry, crawlability isn’t always well understood. The concept is simple: your site needs to be easily accessible to search and AI bots in order for your content to be found, indexed, and shared with consumers. In practice, applying these concepts and optimizing a large website at scale is where things get complex. Let’s solve that today!

What is crawlability?

Crawlability refers to a search engine or AI platform’s ability to access a website’s pages and resources. Search and AI bots crawl websites and explore their code to discover new content and update what they know about existing pages. A critical and often overlooked nuance is that what bots have access to crawl is just as important as what they can’t crawl. This is especially true at scale with enterprise websites.

It’s also important to understand that crawlability is the first part of the SEO funnel. If your pages can’t be crawled, or conversely, too many pages are crawlable, your site’s indexation and organic performance will suffer.

To properly understand crawlability, it’s also important to understand crawl budget.

What is crawl budget?

Crawling and rendering pages is an expensive endeavor, and search platforms don’t have infinite resources. Crawl budget refers to the amount of resources Googlebot and other bots will allocate to crawling your site. This is determined by crawl capacity and crawl demand:

  • Crawl capacity: This is determined based on a search or AI engine’s resources, as well as your site’s ability to respond quickly and reliably to bot requests.
  • Crawl demand: This is determined by the total perceived inventory, popularity, and freshness of your site’s content.

Google and crawl budget

It’s commonly assumed that Googlebot spends most of its time discovering new pages through links, but that’s not the case. Most of the time, it’s refreshing its index of pages it already knows about. In technical SEO, we refer to this as refresh and discovery crawl budget:

  • Refresh crawl budget: This is when Googlebot crawls pages it already knows exist. It typically makes up 75%–95% of total crawl budget.
  • Discovery crawl budget: This is when Googlebot crawls net new pages. It typically makes up 5%–25% of total crawl budget.

Pro tip: This data is available in Google Search Console under Settings > Crawl Stats.

AI bots and crawl budget

Most AI bots have fewer resources and less bandwidth for crawling than Google does. They crawl a fraction of the pages Google can, and they don’t discover content with as much breadth or depth due to their limited ability to render JavaScript.

Pro tip: If you’re looking to optimize for or control the behavior of AI bots, make sure to examine how they’re crawling your site via your log file data. We have an article that goes in-depth on the process here.
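
If you want a quick first look before investing in full log analysis, even a simple filter over your raw access logs shows which AI crawlers are hitting your site and where. Here’s a minimal Python sketch; the file path and user-agent tokens are illustrative, so check each platform’s documented user agents (and verify hits by IP) before drawing conclusions.

    # Minimal sketch: surface requests from common AI crawlers in an access log.
    # The file path and user-agent tokens below are assumptions; adjust to your setup.
    AI_BOT_TOKENS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")

    with open("access.log") as log:
        for line in log:
            if any(token in line for token in AI_BOT_TOKENS):
                print(line.rstrip())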

Now that you’re an expert on crawlability, let’s get into the weeds!

What are the most common crawlability issues?

Internal link parameterization

The number-one problem we see related to crawlability is the excessive parameterization of internal links. URL parameters, or query strings, add extra information to a URL after a “?”, and multiple parameters are separated by “&”. Parameterization is useful in many ways, including:

  • E-commerce sorting
  • Dynamic filtering
  • Pagination
  • Site search
  • Translation
  • Describing things like product details
  • Tracking traffic from campaigns or button clicks

Some of these use cases, however, have a significant negative impact on crawlability. If you want to track clicks from the main navigation or an internal linking module, for example, it’s safer to use event tracking or custom variables in analytics than URL parameters.
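
Where tracking-style parameters do end up in internal links, stripping them before the links are rendered keeps bots on one clean URL per page. Here’s a minimal Python sketch of that idea; the parameter list is an assumption, so only strip keys you’ve confirmed don’t change what the page renders.

    # Minimal sketch: remove tracking-style parameters that don't affect rendering.
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}  # illustrative

    def strip_tracking(url):
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k not in TRACKING_PARAMS]
        return urlunsplit(parts._replace(query=urlencode(kept)))

    print(strip_tracking("https://www.example.com/shoes?color=red&utm_source=nav&ref=home"))
    # -> https://www.example.com/shoes?color=red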

Here are some of the top issues excessive parameterization can cause:

Page duplication

Duplication is commonly caused by URL parameters that don’t impact rendering, like tracking parameters, but it can also be caused by malformed URL structures that render duplicate pages. While there’s no such thing as an algorithmic “duplicate content penalty,” duplication can cause issues at scale: Google may crawl millions of versions of the same page, all with different URL and query string structures. Until Google crawls and renders each one, every URL variation is treated as a distinct page.

Pro tip: Remember, canonicalization is a signal, not a directive, and Google commonly ignores canonicals. Canonicals don’t solve crawlability issues; in fact, when poorly implemented, they can perpetuate the problem by allowing Google to keep crawling duplicate pages. Canonicalization is perfectly fine in moderation, but it can cause large crawlability issues at scale when not used properly.

Spider traps

A “spider trap” is a situation where a crawler gets stuck in a never-ending loop of duplicate or similar pages, wasting crawl budget. The most common spider traps we see are caused by parameters that reference the URL the link came from, typically with “ref=”. This is an issue especially if the page contains a link to itself, which then compounds with each additional link referencing the previously parameterized URL, creating an endless spider trap.

We also commonly see spider traps on sites that use relative (vs. absolute) <a> hrefs, for example when pages render properly on subdomains that shouldn’t be accessible to bots, or when a relative URL is missing its leading “/”.
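
To see how quickly a missing leading “/” compounds, here’s a small Python illustration using urllib’s URL resolution, which applies the same logic a browser or crawler uses for relative hrefs; the URLs are made up.

    # Illustration: a relative href without a leading "/" resolves against the current
    # directory, so every crawled page links one level deeper into a trap.
    from urllib.parse import urljoin

    page = "https://www.example.com/products/shoes/"
    bad_href = "products/shoes/?ref=sidebar"   # should be "/products/shoes/?ref=sidebar"

    for _ in range(3):
        page = urljoin(page, bad_href)
        print(page)
    # .../products/shoes/products/shoes/?ref=sidebar
    # .../products/shoes/products/shoes/products/shoes/?ref=sidebar
    # ...and so on, forever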

Site search crawlability

While it’s directly related to parameterization, site search crawlability deserves a spot of its own. Site search pages typically don’t contain any useful content, and they can wreak havoc on crawl budget if left open for bots to crawl because they can generate a potentially infinite number of erroneous pages. These pages differ from dynamic listing pages, which are created intentionally to target specific long-tail search intent on e-commerce websites.

Deep facets

Last but not least, deep facets can get out of hand quickly. Filtered PLPs with multiple brands, colors, sizes, etc. are often not valuable for organic users. It’s best to audit the performance of facets to determine which facet patterns should be kept open to bots, and which ones should be blocked in your robots.txt file. 
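
As a sketch of what that blocking can look like, the robots.txt directives below disallow internal site search and a couple of facet parameters. The patterns and parameter names are illustrative only; audit your own URL inventory and test the rules before deploying them.

    User-agent: *
    # Internal site search results
    Disallow: /search
    Disallow: /*?q=
    # Facet parameters with little organic value (illustrative names)
    Disallow: /*?*size=
    Disallow: /*?*price=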

Soft 404s

Soft 404s can be anything from error pages (like a custom 404) that return 200 status codes, to deprecated out-of-stock PDPs, PLPs with no product results, or even completely blank pages. These are hard to identify if you don’t know what to look for. Google Search Console has a soft 404 report; it’s a good starting point for identifying patterns you can then fix at scale.
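
Beyond the Search Console report, a lightweight script can help you spot soft 404 patterns across a sample of URLs. A rough sketch using the requests library is below; the “not found” markers and the blank-page threshold are assumptions you’d tune to your own templates.

    # Rough sketch: flag URLs that return 200 but look like error or empty pages.
    import requests

    MARKERS = ("page not found", "no results found", "out of stock")

    def looks_like_soft_404(url):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:
            return False                      # real error codes aren't soft 404s
        body = resp.text.lower()
        return any(m in body for m in MARKERS) or len(body) < 512  # near-blank page

    for url in ("https://www.example.com/discontinued-product",):
        print(url, looks_like_soft_404(url))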

Site latency & server reliability

While commonly thought of from a user perspective in terms of Core Web Vitals, latency is even more important for bots. If bots can’t render your pages quickly and reliably, your crawl budget will undoubtedly suffer. This will mainly be determined by the server capacity and how resource-heavy your pages are. Think about it this way: if a bot spends time waiting for content to be rendered, that's time it cannot spend crawling other pages of your website.

Resource accessibility & speed

Google

Make sure Google can access all resources (mainly JavaScript and API calls) that are essential to render your pages. You can see if any resources are inaccessible by inspecting pages with Google’s Rich Results tool in Google Search Console, then navigating to “View Tested Page > More Info > Page resources” to view any errors.

It’s important that these resources render quickly. Google aggressively caches resources, so it’s best practice to keep consistent naming conventions for them. When asset names change frequently, Google is forced to refetch and recache those resources, which ultimately increases page rendering time.

Pro tip: POST requests cannot be cached. While this is slightly more in the weeds, it’s better not to leverage POST requests at scale, as their inability to be cached can eat up your crawl budget. Here at Botify, we’ve seen examples of POST requests consuming more than 75% of a client’s total crawl budget.

AI bots

As previously mentioned, most AI bots cannot render JavaScript at all, so any page content that relies on JavaScript is at risk of being invisible to AI search platforms. You can solve this by rendering and delivering content directly to bots with a solution like Botify’s SpeedWorkers.
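
For context, the general pattern (often called dynamic rendering or prerendering) is to detect bot user agents and serve them flat, fully rendered HTML while human visitors still get the JavaScript application. The Flask sketch below is only an illustration of that concept, not Botify’s implementation; the bot list and the prerendered cache are assumptions.

    # Conceptual sketch: serve prerendered HTML to bots, the JS app shell to everyone else.
    from flask import Flask, request

    app = Flask(__name__)
    BOT_TOKENS = ("Googlebot", "GPTBot", "ClaudeBot", "PerplexityBot")  # illustrative
    PRERENDERED = {"category/shoes": "<html>...fully rendered markup...</html>"}  # assumed cache

    @app.route("/<path:page>")
    def serve(page):
        ua = request.headers.get("User-Agent", "")
        if any(token in ua for token in BOT_TOKENS):
            return PRERENDERED.get(page, ("Not found", 404))  # bots: flat HTML
        return app.send_static_file("app.html")               # humans: JS application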

Page/click depth

The further away a page is from your homepage, the less likely that page is to get crawled. This is true across every website in every industry.

Outdated XML sitemaps

Dynamic XML sitemaps ensure that even if Google cannot discover a page through your internal architecture, they can discover it another way. As a best practice, XML sitemaps should only include your most important, indexable URLs.
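
If your platform doesn’t generate sitemaps for you, even a simple script built from your canonical URL inventory beats a stale, hand-maintained file. A minimal Python sketch is below; it assumes the urls list already contains only indexable, canonical pages.

    # Minimal sketch: write a sitemap from a list of indexable, canonical URLs.
    from xml.sax.saxutils import escape

    urls = ["https://www.example.com/", "https://www.example.com/category/shoes/"]  # assumed input

    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url in urls:
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")

    with open("sitemap.xml", "w") as f:
        f.write("\n".join(lines))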

Pages that have no means of discovery will never get crawled, which leads us to our last example:

Orphan pages

Orphan pages exist outside of your internal linking structure; they’re pages that don’t have internal links actively pointing to them.

Pages that can’t be discovered won’t get crawled. If they aren’t crawled, they’ll never be found by consumers in search. While most orphan pages have been linked to at some point, meaning Google will eventually return to refresh the crawl, the point still stands: make sure your important pages live within your website structure.
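
One common way to surface orphan candidates is to compare the URLs bots actually request (from log files or sitemaps) against the URLs a link-following crawl can discover. Here’s a rough Python sketch; the two input files are assumed exports from your crawler and your log analysis.

    # Rough sketch: URLs that bots request but a link crawl never found are orphan candidates.
    with open("crawled_urls.txt") as f:       # URLs reachable via internal links
        linked = {line.strip() for line in f if line.strip()}
    with open("log_urls.txt") as f:           # URLs search bots actually requested
        requested = {line.strip() for line in f if line.strip()}

    for url in sorted(requested - linked):
        print(url)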

How do SEOs solve crawlability issues?

Aside from the fixes we’ve covered for the examples shared above, in general, you can solve crawlability issues by doing the following:

  • Investigate your log file data for the most commonly crawled resources, URL patterns, and query strings, plus the status codes bots receive when they crawl (see the sketch after this list)
  • Leverage your robots.txt file! It’s very common to see brands under-utilize robots.txt.
  • Deparameterize internal links that don’t impact rendering
  • Create internal link modules that scale
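
As a starting point for that log investigation, the sketch below tallies Googlebot requests by status code, top-level URL pattern, and query-string key. It assumes combined-format access logs; the regex and file path are assumptions, and for production analysis you’d also verify Googlebot by IP.

    # Rough sketch: summarize Googlebot activity by status code, section, and parameter.
    import re
    from collections import Counter
    from urllib.parse import urlsplit

    REQUEST = re.compile(r'"(?:GET|POST|HEAD) (?P<url>\S+) [^"]*" (?P<status>\d{3})')
    statuses, sections, params = Counter(), Counter(), Counter()

    with open("access.log") as log:
        for line in log:
            if "Googlebot" not in line:
                continue
            m = REQUEST.search(line)
            if not m:
                continue
            parts = urlsplit(m.group("url"))
            statuses[m.group("status")] += 1
            sections["/" + parts.path.strip("/").split("/")[0]] += 1   # top-level URL pattern
            for pair in parts.query.split("&"):
                if pair:
                    params[pair.split("=")[0]] += 1                    # query-string keys

    print(statuses.most_common())
    print(sections.most_common(10))
    print(params.most_common(10))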

Site health audits can identify many issues, but the key to understanding and solving your crawlability issues lies in your log file data. Without access to log data analysis at scale, large websites are more or less looking at an entire search universe through a microscope. It’s important to invest in tools that automate and streamline holistic organic search data and log file analysis at scale, helping your search team do more with the same time and resources.

How search marketers can solve crawlability issues at scale with Botify 

The Botify suite solves crawlability issues proactively, automatically, and at scale. Here’s how:

  1. Log file analysis & reporting: Botify’s LogAnalyzer combines with our CustomReports templates to allow you to analyze log file data easily and at scale, and share those insights with your team and stakeholders.
  2. Automated indexing: With sitemap generation, partnerships with IndexNow, and our exclusive ability to submit site content directly to AI platforms, you can be certain that your most important pages are known by search engines and consumers.
  3. Rendering and delivering content directly to bots: SpeedWorkers renders your pages for Google and other bots in flat HTML to ensure they crawl your most valuable pages quickly and completely. You can also avoid wasting resources and easily redirect bots away from low-value pages by stripping parameters, 410ing malformed URLs, or cleaning up internal links.
  4. Optimize your site's internal linking structure: Botify's SmartLink identifies internal linking opportunities through your own first-party data and implements them at scale. By implementing a clean and intelligent link structure, you bring your revenue-generating pages closer to the surface of your site architecture and guide bots toward product and category pages more effectively.
Want to learn more? Connect with our team for a Botify demo!