โ“What is Crawled – Not Indexed and How to Fix for eCommerce

Reading time: 12 Minutes

Indexation is not a given; it is an achievement.

Search engines can have a hard time parsing and rendering your content. During the second wave of indexing, the resources that are critical to user experience (e.g. stylesheets and JavaScript files) are the most likely to be skipped because of their complexity.

In this post, I will present a few reasons why pages end up crawled but not indexed and put forward some practical solutions to fix them.

What is Crawled – Not Indexed

Crawled – Not Indexed is an indexing status surfaced in the Pages report of Google Search Console to alert webmasters that search engines have visited a set of pages but haven't indexed them yet.

As a result, if you run a site: search for a crawled but not indexed page, Google will return an empty search results page for that operator.

A "Crawled – not Indexed" page missing from the SERP for a site: search

Although it sounds similar to its "Discovered – Currently Not Indexed" counterpart, there are fewer nuances determining search engines' crawling conduct here. In fact, this status informs webmasters that a set of pages was crawled but indexing was put on hold.

Because indexation is the sticking point, we would normally classify this as an issue rather than a warning.

Warning ❌
Issue ✅

Why are pages being Crawled – Not Indexed

The non-indexing reason "Crawled – Not Indexed" may affect a set of pages for several reasons:

  1. XML sitemaps are missing
  2. Discontinued PDP or PLP
  3. Dynamic pages with query parameters
  4. Thin or low-worded content
  5. Poor headings structure
  6. High Time to First Byte (TTFB)
  7. Rendering bottlenecks on critical resources
  8. Poor mobile-friendly pages

Let's explore these in more depth.

XML sitemaps are Missing

Submitting an XML sitemap containing valid URLs is probably the oldest and most accessible safeguard against indexation problems.

Still, a large proportion of eCommerce sites appear reluctant to follow this practice. The impact of a well-maintained XML sitemap grows with the size of the online store, and the lack of a dedicated sitemap often correlates with a higher number of pages being crawled but not indexed.

In this example, you can see a PDP (product detail page) labeled as Crawled – not Indexed within the Google Search Console property of ysl.com/en-gb.

Crawled – not Indexed PDP that is not included in any XML sitemap

As you can see from the robots.txt file, no XML sitemap has been submitted for the /en-gb/ subdirectory.

YSL robots.txt file

Even though this page doesn't apparently present any issues, it can easily fall off Google's radar. This may represent a major issue for eCommerce sites, given that their SMART objectives are built around maximizing profit and revenue from existing PDPs.

💡 BONUS

XML sitemaps are often huge files that require time and dedication to sift through. You can learn how to automate an XML sitemap audit with Python.
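
If you want to try a lightweight version of that audit, here is a minimal Python sketch. It assumes the requests library is installed, and the sitemap URL is a hypothetical placeholder: it fetches the sitemap and flags every submitted URL that doesn't answer with HTTP 200.

```python
# Minimal sketch: flag sitemap URLs that don't return HTTP 200.
# Assumes `requests` is installed; SITEMAP_URL is a hypothetical placeholder.
import xml.etree.ElementTree as ET

import requests

SITEMAP_URL = "https://www.example.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def audit_sitemap(sitemap_url: str) -> None:
    """Fetch a sitemap and report every <loc> not answering with HTTP 200."""
    xml = requests.get(sitemap_url, timeout=10).text
    root = ET.fromstring(xml)
    for loc in root.findall(".//sm:loc", NS):
        url = loc.text.strip()
        # HEAD keeps the check cheap; some servers may require GET instead.
        status = requests.head(url, allow_redirects=True, timeout=10).status_code
        if status != 200:
            print(f"{status} -> {url}")

audit_sitemap(SITEMAP_URL)
```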

Discontinued PDP or PLP

“Out of stock” items or retired category pages are another classic reason for hampered indexation. This often translates into pages either returning a 404 or returning a subtle 2xx response that is highly likely to be treated as a Soft 404.

Just like in this example:

Page not found deriving from a discontinued PDP

A specific model of the SL24 sneakers from Saint Laurent was no longer available. Despite the presence of a traditional custom 404 error page, the URL still returns an HTTP 200 status code.
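
To catch this pattern at scale, here is a minimal sketch, assuming the requests library is installed; the candidate URLs and the "not found" marker phrases are hypothetical placeholders you would adapt to your own templates.

```python
# Minimal sketch: spot likely Soft 404s, i.e. "not found" templates served
# with an HTTP 200 status code. URLs and marker phrases are placeholders.
import requests

CANDIDATE_URLS = [
    "https://www.example.com/en-gb/discontinued-sneaker.html",  # hypothetical PDP
]
NOT_FOUND_MARKERS = ("page not found", "no longer available")

for url in CANDIDATE_URLS:
    resp = requests.get(url, timeout=10)
    body = resp.text.lower()
    if resp.status_code == 200 and any(m in body for m in NOT_FOUND_MARKERS):
        print(f"Likely Soft 404: {url}")
```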

โš ๏ธWARNINGโš ๏ธ

Handling discontinued products involves a number of in-depth considerations revolving around both revenue goals and an assessment of whether these pages should be temporarily or permanently retired.

You can find more on how to handle discontinued products in this blog post from Content King.

Dynamic pages with query parameters

What would happen if you left your house door open for the time you’re away?

Chances are burglars will sneak into your property and you will be robbed of valuable items.

Allowing search engines unrestricted access to pages with filters strains crawling and indexing, as well as wasting crawl budget.

This is a very common instance I often stumble across when auditing larger eCommerce sites:

PDP crawled – not Indexed with query parameters in the URL

Given that search engines have historically preferred to crawl and index pages with static URLs, you should always make sure to design robust URL structures for your eCommerce site.

What really matters here is paying attention to your robots.txt directives to ensure you deter web crawlers from accessing pages with filters.

Because they come with query parameters (e.g. ?search=), search engines might be inclined to leave these pages as "Crawled – Not Indexed".
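
As a quick triage, here is a minimal sketch that flags crawled URLs carrying filter-style query parameters so you can review them for a robots.txt exclusion. The URL list and the parameter names are assumptions to adapt to your own faceted navigation.

```python
# Minimal sketch: flag URLs with filter-style query parameters as candidates
# for robots.txt exclusion. URL list and parameter names are assumptions.
from urllib.parse import parse_qs, urlparse

CRAWLED_URLS = [
    "https://www.example.com/en-gb/shoes.html",
    "https://www.example.com/en-gb/shoes.html?search=red&size=42",
]
FILTER_PARAMS = {"search", "sort", "size", "color"}  # assumed filter parameters

for url in CRAWLED_URLS:
    params = set(parse_qs(urlparse(url).query))
    if params & FILTER_PARAMS:
        # e.g. add a rule such as: Disallow: /*?search=
        print(f"Candidate for a robots.txt Disallow rule: {url}")
```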

Thin or Low-Worded Content

Here is another classic reason for pages being prevented from indexation.

This is very common for luxury brands that rely on imagery from highly visual campaigns. Due to the fixed seasonality of fashion collections, eCommerce sites often showcase at least three sub-categories filled with additional sub-folders containing pages with barely any text.

Page with too little body content on YSL

It's now very clear why this example of a Summer 2020 collection campaign page on Saint Laurent (en-gb) is classified by search engines as "Crawled – currently Not Indexed".
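
A rough way to surface such pages is to count the visible words in the body. Here is a minimal sketch, assuming requests and beautifulsoup4 are installed; the URL and the word-count threshold are hypothetical placeholders.

```python
# Minimal sketch: estimate a page's visible word count to flag thin content.
# The URL and the MIN_WORDS threshold are placeholders to tune per template.
import requests
from bs4 import BeautifulSoup

URL = "https://www.example.com/en-gb/campaign-page.html"
MIN_WORDS = 150  # arbitrary threshold

soup = BeautifulSoup(requests.get(URL, timeout=10).text, "html.parser")
for tag in soup(["script", "style", "noscript"]):
    tag.decompose()  # strip non-visible content before counting
word_count = len(soup.get_text(separator=" ").split())
if word_count < MIN_WORDS:
    print(f"Thin content candidate ({word_count} words): {URL}")
```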

Poor Headings Structure

It's not always about thin content, though.

Google may find it hard to parse a set of pages where the headings are buried under tons of stylesheets and JavaScript, or even coded with unparsable HTML.

In a recent first-hand test on his site, Mordy Oberstein showed that the way you serve content to users can sometimes negatively impact crawlability and accessibility.

Long story short, you shouldn't underestimate measuring your website's overall JavaScript/CSS dependency, as meta tags and headings are often the most exposed to dynamic content injection triggered by user interaction.
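
A quick first step is to extract the heading outline from the raw HTML: if headings you can see in the browser are missing or empty here, they are probably injected client-side. A minimal sketch, assuming requests and beautifulsoup4 are installed and a hypothetical URL:

```python
# Minimal sketch: print the heading outline found in the raw (pre-render) HTML.
# Headings visible in the browser but absent here are likely injected via JS.
import requests
from bs4 import BeautifulSoup

URL = "https://www.example.com/en-gb/some-pdp.html"  # hypothetical PDP

soup = BeautifulSoup(requests.get(URL, timeout=10).text, "html.parser")
for heading in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
    print(heading.name, "->", heading.get_text(strip=True) or "(empty)")
```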

High Response Time for a Crawl Request

If search engines take too long to complete a crawl request to fetch a page, then chances are your pages will end up as "Crawled – not Indexed" or, at worst, "Discovered – currently not Indexed".

A high response time for a crawl request can hamper crawling and indexing performance, as Googlebot is forced to wait a long time for the very first bytes of the page to arrive.

The Crawl Stats report on your Google Search Console root property will help you with this check.

Avg. crawl response time vs. total crawl requests (Crawl Stats report)

Let’s break this graph down a bit.

Average page response time for a crawl request: the average response time for a crawl request to retrieve the page content. It does not include retrieving page resources (scripts, images, and other linked or embedded content) or page rendering time.

Total number of crawl requests: the total number of crawl requests issued for your site in the time span shown (Google says 90 days, though it can stretch a bit further). Duplicate requests to the same URL are counted.

To prevent your pages from being excluded from the index, you should make sure that the following relationship holds on that chart:

Total number of crawl requests > Avg. page response time for a crawl request

In other words, the crawl requests line should stay high while the response time line stays low; when response times climb, crawl requests tend to drop.

Arguably, there are several methods to validate such a relationship, and they all boil down to site speed optimizations targeting potential rendering bottlenecks.

Without further rambling, eyeball the chart to make sure the average response time stays below 300 ms, which allows search engines to achieve decent levels of crawlability.
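
For a rough self-check outside Search Console, here is a minimal sketch that approximates the server response time against that 300 ms guideline. It assumes the requests library and hypothetical URLs; Response.elapsed stops once the response headers are parsed, which loosely mirrors how the GSC metric excludes page resources.

```python
# Minimal sketch: approximate server response time (a TTFB proxy) per URL
# and compare it against the ~300 ms guideline. URLs are placeholders.
import requests

URLS = [
    "https://www.example.com/en-gb/shoes.html",
    "https://www.example.com/en-gb/bags.html",
]

for url in URLS:
    # stream=True avoids downloading the body; `elapsed` is measured from
    # sending the request to finishing parsing the response headers.
    resp = requests.get(url, stream=True, timeout=10)
    ms = resp.elapsed.total_seconds() * 1000
    print(f"{'OK' if ms < 300 else 'SLOW'} {ms:.0f} ms -> {url}")
    resp.close()
```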

Other considerations around rendering and Core Web Vitals should also be made. For instance, a poor LCP score in CrUX (the Chrome User Experience Report) may be a symptom.

Rendering Bottlenecks on Critical Resources

As mentioned earlier, pages may have a bad time out in the woods of the rendering process.

Pages being prevented from indexation may suffer from severe discrepancies between the pre-render version (raw HTML) and the post-render version.

Let’s see a few culprits that could prevent successful indexation.

The URL Inspection tool from Google Search Console can provide the answer.

You can head to the "More Info" tab when using the Google Search Console URL Inspection tool and look at the number of Other Errors.

Other Error ("Altro errore") pointing at stylesheets ("Foglio di stile") in the URL Inspection tool

Here, the resources you need to be looking at are stylesheets and JavaScript files.

Once you've found them, you can measure how many unused resources are shipped on your site (Chrome DevTools > Coverage) and assess to what extent "Other Errors" convert into proper render-blocking resources.

Here's how to run a quick check:

1. Open Chrome DevTools
2. Head to the Network tab
3. Using the search bar, type in a resource from "Other Errors"
4. Right-click the culprit resource, select "Block request URL", and reload the page to gauge the impact
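
To quantify the discrepancy between the pre-render and post-render versions mentioned above, here is a minimal sketch comparing the raw HTML with the rendered DOM. It assumes requests and playwright are installed (with `playwright install chromium` run once); the URL is a hypothetical placeholder.

```python
# Minimal sketch: compare raw (pre-render) HTML with the rendered DOM to
# gauge how much content only exists after JavaScript execution.
import requests
from playwright.sync_api import sync_playwright

URL = "https://www.example.com/en-gb/some-pdp.html"  # hypothetical PDP

raw_html = requests.get(URL, timeout=10).text

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")
    rendered_html = page.content()
    browser.close()

# A large size gap hints at heavy client-side rendering.
print(f"Raw HTML: {len(raw_html)} bytes, rendered DOM: {len(rendered_html)} bytes")
```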

Poor Mobile-friendly pages

This is commonly due to the presence of heavy resources that search engines couldn't fetch (e.g. critical CSS).

Mobile-friendly issues on a crawled – not Indexed page

During the second phase of indexing, search engines may skip rendering requests for files like CSS and JavaScript, considering them not worth the render budget.

As a result, important elements of a page can be lost, leading web crawlers to withdraw from the crawling process.

It is always good practice to test with the URL Inspection tool, as the Mobile-Friendly Test is not that accurate and is being retired by Google.

How to Fix Crawled – Not Indexed Pages

As with most SEO processes, fixing indexation restraints doesn't happen in a week.

In the first place, the problem should be diagnosed from different angles using the hints mentioned above. The Page indexing report in Google Search Console will carry most of the effort, so you need to work closely with this invaluable first-party tool.

Example of a Crawled – not Indexed PDP (Google Search Console)

Due to being one of the most nuanced non-indexing reasons, there’s no single cure for pages affected by indexation delays.

Instead, make sure to consider the following:

  • 1๏ธโƒฃ Submit a dedicated XML sitemap.
    It is widely recommended to consider submitting the XML sitemap via both the Search Console property and the Robots.txt file to prompt search engines to pay a visit to your website’s most relevant pages.

  • 2๏ธโƒฃ Fix discontinued PDP or PLP.
    Depending on items’ availability over time and how much revenue is generated, you should raise assumptions to consider removing these pages and internal links by returning an HTTP 410 response code. This will help you save crawl budget, given that Google tends to crawl less frequently pages returning HTTP 410 than HTTP 404

  • 3๏ธโƒฃ Ensure key PDPs are configured as static URLs.
    Although search engines can crawl parameters, having a clear URL structure benefits the overall website navigation. In case key PDPs and PLPs were caught with unpleasant query parameters, consider the extent to block them in your site’s Robots.txt file.

  • 4๏ธโƒฃ Remove render-blocking resources and keep crawl requests time at bay.
    As anticipated, you need to ensure your pages aren’t reliant on client-side rendering. This could lead to the rendered HTML coming with dynamically injected content that doesn’t exist in the raw HTML. This is an issue in case you expected the missing information to be discovered by web crawlers but it turns out it’s not.

  • 5๏ธโƒฃ Improve headings and evaluate room for content integration.
    This includes rephrasing sentences where the content is either too short or hard to read. Ideally, you should write concise sentences avoiding adverbs and too specific jargon. Rather, target the right entities and find proper synonyms to convey elaborated sentences. You can help rephrase your sentences using Text Analyzer and leverage fitting entities with Google’s NLP tool demo.

  • 6๏ธโƒฃ Adjust search intent so that it aligns with the intent shifts caused by a Core update. This includes tweaking title tags and meta descriptions accordingly but also finetuning your content with contextualized internal links.

Conclusion

There’s a lot going on when it comes to fixing indexation restraints despite pages being crawled.

Having a large proportion of crawled pages knocking at Google's index to no avail can wreak havoc on your crawl budget and harm crawl efficacy. On the flip side, you can look at the issue from several angles and use different methods to nail it down.

Hopefully, this post helped you tackle at least a small part of it.

Let me know in the comments or on Twitter if you're struggling with "Crawled – not Indexed" pages.

FAQ

What does Indexing mean?

Indexing refers to the practice search engines use to organize the information from the websites that they visit.

Indexing is a common practice for search engines and represents the final stage, following crawling and rendering, in which strings of structured and unstructured information are parsed and stored.

It's arguably the most important stage, as content that is not in the index can't rank in search results.

What is indexed vs non-indexed?

A page ends up indexed or non-indexed depending on the ability of search engines to successfully access it, parse its critical resources (e.g. CSS/JavaScript), and ultimately store it in Google's index so that users can freely retrieve it.

Why are my products not indexed on Google?

eCommerce websites may encounter a few indexation restraints due to the following reasons:

– Lack of a valid XML sitemap containing key product pages
– PDPs with out-of-stock items
– Dynamic pages with query parameters
– Thin or low-worded content
– Poor headings structure
– High Time to First Byte (TTFB)
– Rendering bottlenecks on critical resources

How long does it take to get indexed by Google?

It’s not possible to estimate how long a page will take to get indexed after being submitted to Google. This may vary depending on a number of factors ranging from a website’s size to the type of crawling issue encountered.

Further analysis might enable SEO professionals to make more accurate estimates, albeit ones affected by persistent outliers.

Never Miss a Beat

Subscribe now to receive weekly tips about Technical SEO and Data Science 🔥