Indexation is not an entitlement, but an achievement you have to earn.
Search engines can struggle when parsing and rendering your content. During the second wave of indexing, the resources most critical to user experience (e.g. stylesheets and JavaScript files) are also the ones most likely to be skipped because of their complexity.
In this post, I will present a few reasons why pages end up crawled but not indexed and put forward some practical solutions to fix them.
What is Crawled – Not Indexed
Crawled – Not Indexed is a non-indexing reason surfaced in the Pages report of Google Search Console to alert webmasters that search engines visited a set of pages but haven't indexed them yet.
As a result, if you run a site: search for a crawled but not indexed page, you will see Google returning an empty search results page for that operator.
A “Crawled – not Indexed” page being ghosted from the SERP
Despite sounding similar to its “Discovered – Currently Not Indexed” counterpart, there are fewer nuances around search engines' crawling behaviour here. In fact, this status informs webmasters that a set of pages were crawled but indexing was put on hold.
Because indexation is fingered as the culprit, we would normally classify this as an issue.
Warning: ❌
Issue: ✅
Why are pages being Crawled – Not Indexed
The “Crawled – Not Indexed” status may affect a set of pages for several reasons.
XML sitemaps are Missing
Discontinued PDP or PLP
Dynamic pages with query parameters
Thin or Low-Worded Content
Poor Headings Structure
High Time to First Byte (TTFB)
Rendering Bottlenecks on Critical Resources
Poor Mobile-friendly pages.
Let's explore these in more depth.
XML sitemaps are Missing
Submitting an XML sitemap containing valid URLs is probably the oldest and most DIY safeguard against indexation restraints.
Still, a large proportion of eCommerce sites appear reluctant to follow this practice. The impact of a consistent XML sitemap grows with the size of the online store, and the lack of a dedicated sitemap often goes hand in hand with a higher number of pages crawled but not indexed.
In this example, you can see a PDP (product detail page) being labeled as Crawled – not Indexed within the Google Search Console property of ysl.com/en-gb
As you can see from the Robots.txt file, no XML sitemap has been submitted for the /en-gb/ subdirectory.
YSL Robots.txt file
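For reference, adding one is usually a one-line change to Robots.txt. The snippet below is only a minimal sketch with a hypothetical sitemap path for the /en-gb/ locale, not YSL's actual configuration:

```
# Hypothetical Robots.txt entry pointing crawlers at a locale-specific sitemap
User-agent: *
Allow: /

Sitemap: https://www.ysl.com/en-gb/sitemap.xml
```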
Although this page doesn't apparently present issues, it easily falls off Google's radar. This may represent a major issue for eCommerce, given that their SMART objectives are set around maximizing profits and revenue from existing PDPs.
Discontinued PDP or PLP
“Out of stock” items or old page categories are another classic reason for hampered indexation. This often translates into pages either returning 404 errors or serving subtle 2xx responses that are highly likely to be treated as soft 404s.
Just like in this example
Page not found deriving from a discontinued PDP
A specific model of the sl24 sneakers from Saint Laurent was no longer available. Despite displaying a traditional custom 404 error page, the URL still returned an HTTP 200 status code.
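If you want to spot this pattern at scale, a small script can check the status codes for you. The sketch below uses the third-party requests library and a placeholder URL, so treat it as a starting point rather than a full audit tool:

```python
# Rough sketch: flag likely soft 404s by checking the status code of pages
# you already know are discontinued. The URL below is a placeholder.
import requests

discontinued_urls = [
    "https://www.example.com/en-gb/discontinued-sneaker.html",
]

for url in discontinued_urls:
    response = requests.get(url, allow_redirects=True, timeout=10)
    if response.status_code == 200:
        # A "page not found" template served with a 200 is a likely soft 404
        print(f"Possible soft 404: {url} returned HTTP 200")
    else:
        print(f"{url} returned HTTP {response.status_code}")
```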
⚠️ WARNING ⚠️
Handling discontinued products calls for a number of in-depth considerations revolving around both revenue goals and an assessment of whether these pages should be temporarily or permanently discontinued.
Dynamic pages with query parameters
What would happen if you left your house door open while you're away?
Chances are burglars will sneak into your property and you will be robbed of valuable items.
Allowing search engines free access to pages with filters exacerbates crawling and indexing problems, as well as wasting crawl budget.
This is a very common instance I often stumble across when auditing larger eCommerce sites:
Crawled – not Indexed PDP with query parameters
Given search engines historically prefer to crawl and index pages with static URLs, you should always make sure to design robust URL structures for your eCommerce.
Dynamic URLs vs Static URLs – How to Audit for improved Crawl Efficacy
Within a tech audit, there are a number of underdogs that we tend to bypass pretty quickly.
In one of my latest audits, I stumbled across dynamic URLs so many times that I felt like throwing up 🤢
What really matters here is paying attention to the Robots.txt directives to ensure you deter web crawlers from accessing pages with filters.
Because they come with query parameters (e.g. ?search=), search engines might be inclined to flag these pages as “Crawled – Not Indexed”.
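As a rough illustration, Robots.txt rules along these lines can keep crawlers away from filter and search parameters. The parameter names below are only examples, so swap them for the ones your platform actually generates:

```
# Hypothetical Robots.txt rules blocking crawlable filter/search parameters
User-agent: *
Disallow: /*?search=
Disallow: /*?sort=
Disallow: /*?colour=
```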
Thin or Low-Worded Content
Here is another classic reason for pages being prevented from indexation.
This is very common for luxury brands that rely on imagery from highly visual campaigns. Due to the fixed seasonality of fashion collections, some eCommerce sites often showcase at least three sub-categories filled with additional sub-folders containing pages with little to no copy.
Page with too little body content on YSL
It’s now very clear why this example of a summer 2020 collection campaign at Saint Laurent (en-gb) is being classified by search engines as “Crawled – currently Not Indexed”.
Poor Headings Structure
It's not always about thin content, though.
Google may find it hard to parse a set of pages where the headings are actually buried within tons of stylesheets, JavaScript, or even coded with unparsable HTML.
In a recent first-hand test on his site, Mordy Oberstein proved that sometimes it’s about how you serve content to users that can negatively impact crawlability and accessibility.
I don't know if it's me or if it's something happening more often these days in general, but I've seen more & more of my pages on the SEO Rant site being crawled but not indexed.
So I made some changes. Lo & behold the pages are now indexed
Long story short, you shouldn't underestimate measuring the overall JavaScript/CSS dependency of your website, as meta tags and headings are often exposed to dynamic content injection triggered by user interaction.
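A quick, scriptable sanity check is to count the headings present in the raw HTML, before any JavaScript runs. The sketch below assumes the third-party requests and beautifulsoup4 packages and uses a placeholder URL:

```python
# Quick sketch: verify headings exist in the raw, pre-render HTML.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/en-gb/some-category.html"  # placeholder

raw_html = requests.get(url, timeout=10).text
soup = BeautifulSoup(raw_html, "html.parser")

for level in ("h1", "h2", "h3"):
    headings = [h.get_text(strip=True) for h in soup.find_all(level)]
    # If this prints 0 for h1/h2, headings may only appear after rendering
    print(f"{level}: {len(headings)} found -> {headings[:3]}")
```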
High Response Time for a Crawl Request
If search engines are taking too long to accommodate a crawl request to fetch a page, then chances are your pages will end up as “Crawled – not Indexed” or, at worst, “Discovered – currently not Indexed”.
A high response time for a crawl request can hamper crawling and indexing performance, as Googlebot is forced to wait a long time until the very first bytes of the page arrive.
The Crawl Stats report in your Google Search Console property will help you with this check.
Let’s break this graph down a bit.
Average page response time for a crawl request
This is the avg. response time for a crawl request to retrieve the page content. It does not include retrieving page resources (scripts, images, and other linked or embedded content) or page rendering time.
Total number of crawl requests
This is the total number of crawl requests to your site, in the time span shown (Google says 90 days but this could get a bit more). Duplicate requests to the same URL are counted.
To prevent your pages from being excluded from the index, you should make sure that the following condition actually holds:
Total number of crawl requests > Avg. page response for a crawl request
Arguably, there are several methods to validate such an equation and they all boil down to site speed optimizations touching on potential rendering bottlenecks.
Without further rambling, you should eyeball the chart to make sure that the average response time is <300 ms to allow the search engines to achieve decent levels of crawlability.
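For spot checks outside of Search Console, the sketch below approximates the response time of a single URL. It relies on the third-party requests library, uses a placeholder URL, and response.elapsed is only a rough proxy for TTFB:

```python
# Rough sketch: approximate time-to-first-byte and compare it to ~300 ms.
import requests

url = "https://www.example.com/en-gb/some-product.html"  # placeholder

response = requests.get(url, stream=True, timeout=10)
# elapsed measures the time from sending the request to receiving the
# response headers, which is a rough proxy for TTFB.
ttfb_ms = response.elapsed.total_seconds() * 1000

verdict = "fine" if ttfb_ms < 300 else "too slow for comfortable crawling"
print(f"{url}: ~{ttfb_ms:.0f} ms to first byte ({verdict})")
```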
Other considerations around rendering and Core Web Vitals should be made. For instance, a poor LCP reported by CrUX (the Chrome User Experience Report) may be a symptom.
Rendering Bottlenecks on Critical Resources
As anticipated, pages may have a bad time out in the woods of the rendering process.
Pages being prevented from indexation may suffer from severe discrepancies between the pre-render version (raw HTML) and the post-render version.
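One rough way to quantify that gap is to compare the raw HTML with the DOM after rendering. The sketch below assumes the third-party requests and playwright packages (plus a one-off playwright install chromium) and a placeholder URL:

```python
# Exploratory sketch: compare raw (pre-render) HTML with rendered HTML.
import requests
from playwright.sync_api import sync_playwright

url = "https://www.example.com/en-gb/some-collection.html"  # placeholder

raw_html = requests.get(url, timeout=10).text

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")
    rendered_html = page.content()
    browser.close()

# A big gap between the two hints at content that only exists client-side
print(f"Raw HTML:      {len(raw_html):>10,} characters")
print(f"Rendered HTML: {len(rendered_html):>10,} characters")
```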
Let’s see a few culprits that could prevent successful indexation.
The URL inspection tool from Google Search Console would provide the answer.
You can head to the “More Info” tab when using the Google Search Console URL Inspection tool and look at the number of Other Errors.
Other Error (“Altro errore”) pinpointing at stylesheets (“Foglio di stile”)
Here the resources you need to be gazing at are stylesheets and JavaScript files.
Once you've found them, you can measure how many unused resources are shipped on your site (Chrome DevTools > Coverage) and assess to what extent “Other Errors” turn into proper render-blocking resources.
Here’s how to run a quick check
Open Chrome DevTools
Head to the Network tab
Using the search bar, type in a resource from “Other Errors”
Right-click on the culprit resource and select “Block request URL”
Poor Mobile-friendly pages
This is commonly due to the presence of heavy resources that search engines couldn't fetch (e.g. critical CSS).
Mobile-friendly issues on a Crawled – not Indexed page
During the second phase of indexing, search engines may skip rendering requests for files like CSS and JavaScript, considering them not worth the render budget.
As a result, important elements of a page can be lost, resulting in web crawlers withdrawing from the crawling process.
It is always good practice to test with the URL inspection tool, as the mobile-friendly tester is not that accurate and is being retired by Google.
How to Fix Crawled – Not Indexed Pages
As with most SEO processes, fixing indexation restraints doesn’t come in a week.
In the first place, the problem should be diagnosed from different angles using the hints mentioned above. The Page indexing report on Google Search Console will nail down most of the effort, so you need to work closely with this invaluable first-party tool.
Example of Crawled – not Indexed PDP (Google Search Console)
Due to being one of the most nuanced non-indexing reasons, there’s no single cure for pages affected by indexation delays.
Instead, make sure to consider the following:
1️⃣ Submit a dedicated XML sitemap. It is widely recommended to submit the XML sitemap via both the Search Console property and the Robots.txt file to prompt search engines to pay a visit to your website's most relevant pages.
2️⃣ Fix discontinued PDPs or PLPs. Depending on items' availability over time and how much revenue they generate, you should assess whether to remove these pages and their internal links and return an HTTP 410 response code (see the sketch after this list). This will help you save crawl budget, given that Google tends to crawl pages returning HTTP 410 less frequently than those returning HTTP 404.
3️⃣ Ensure key PDPs are configured as static URLs. Although search engines can crawl parameters, having a clear URL structure benefits the overall website navigation. In case key PDPs and PLPs were caught with unpleasant query parameters, consider whether to block those parameters in your site's Robots.txt file.
4️⃣ Remove render-blocking resources and keep crawl request times at bay. As anticipated, you need to ensure your pages aren't overly reliant on client-side rendering, which can lead to the rendered HTML containing dynamically injected content that doesn't exist in the raw HTML. This becomes an issue when you expect that missing information to be discovered by web crawlers but it never is.
5️⃣ Improve headings and evaluate room for content integration. This includes rephrasing sentences where the content is either too short or hard to read. Ideally, you should write concise sentences, avoiding adverbs and overly specific jargon. Rather, target the right entities and find proper synonyms to convey elaborate ideas. You can help rephrase your sentences using Text Analyzer and leverage fitting entities with Google's NLP tool demo.
6️⃣ Adjust search intent so that it aligns with the intent shifts caused by a Core update. This includes tweaking title tags and meta descriptions accordingly, but also fine-tuning your content with contextualized internal links.
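To illustrate point 2, here is a minimal sketch of how a permanently retired product URL could be made to return HTTP 410. It assumes an Nginx setup and a hypothetical URL path, so adapt it to your own server or eCommerce platform:

```
# Hypothetical Nginx rule returning HTTP 410 (Gone) for a retired PDP
location = /en-gb/discontinued-sneaker.html {
    return 410;
}
```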
Conclusion
There’s a lot going on when it comes to fixing indexation restraints despite pages being crawled.
Having a large proportion of crawled pages knocking on Google's index to no avail can wreak havoc on your crawl budget and harm crawl efficacy. On the flip side, you can look at the issue from several angles and use different methods to nail it down.
Hopefully, this post contributed to achieving at least a small part of it.
Let me know in the comments or on Twitter if you're struggling with “Crawled – not Indexed” pages.
FAQ
What does Indexing mean?
Indexing refers to the practice search engines use to organize the information from the websites that they visit.
Indexing is a common practice for search engines and represents the ultimate parsing stage of strings of structured and unstructured information following web crawling and rendering.
It's arguably the most important stage, as content not in the index can't rank in search results.
What is indexed vs non-indexed?
A page can be indexed or non-indexed depending on the ability of search engines to successfully access the page, parse critical resources (e.g. CSS/JavaScript), and ultimately store it in Google's index so that users can find it in search results.
Why are my products not indexed in Google?
eCommerce websites may encounter a few indexation restraints due to the following reasons:
– Lack of a valid XML sitemap containing key product pages
– PDPs with out-of-stock items
– Dynamic pages with query parameters
– Thin or low-worded content
– Poor headings structure
– High Time to First Byte (TTFB)
– Rendering bottlenecks on critical resources
How long does it take to get indexed by Google?
It’s not possible to estimate how long a page will take to get indexed after being submitted to Google. This may vary depending on a number of factors ranging from a website’s size to the type of crawling issue encountered.
Further analysis might enable SEO professionals to make more accurate estimates, albeit still subject to outliers.
Simone De Palma
Technical SEO Executive
Simone De Palma is a Technical SEO Executive at Dentsu and the founder of SEO Depths.
In his previous life, he was a grad student in Marketing and Management at Università IULM in Milan and worked as an SEO Specialist in digital agencies in Italy and the United Kingdom.