Indexation is not a given; it's an achievement.
In this post, I will walk through a few reasons why pages get crawled but not indexed and put forward some practical solutions to fix them.
What is Crawled – Not Indexed
Crawled – Not Indexed is a non-indexing status shown in the Pages report of Google Search Console to alert webmasters that Google visited a set of pages but hasn't indexed them yet.
As a result, if you run a site: search for a crawled but not indexed page, you will see that Google returns no results for that search operator.
Despite sounding similar to its "Discovered – Currently Not Indexed" counterpart, there are fewer nuances determining search engines' crawling behavior here. This status informs webmasters that a set of pages was crawled but indexing was put on hold.
Because indexing is flagged as the culprit, we would normally classify this as an issue.
Why are pages being Crawled – Not Indexed
The non-indexing reason “Crawled – Not Indexed” may affect a set of pages for a few reasons.
- XML sitemaps are Missing
- Discontinued PDP or PLP
- Dynamic pages with query parameters
- Thin or Low-Worded Content
- Poor Headings Structure
- High Time to First Byte (TTFB)
- Rendering Bottlenecks on Critical Resources
- Poor mobile-friendliness
Let’s explore these in more depth.
XML sitemaps are Missing
Submitting an XML sitemap containing valid URLs is probably the oldest and most DIY safeguard against indexation restraints.
Still, a large proportion of eCommerce sites appear reluctant to follow this practice. The impact of a consistent XML sitemap increases with the size of the online store. Besides, the lack of a dedicated sitemap often goes hand in hand with a higher number of pages being crawled but not indexed.
In this example, you can see a PDP (product detail page) being labeled as Crawled – Not Indexed within the Google Search Console property of ysl.com/en-gb.
As you can see from the robots.txt file, no XML sitemap has been submitted for the /en-gb/ subdirectory.
While this page doesn't apparently present issues, it easily falls off Google's radar. This may represent a major issue for eCommerce sites, given that their SMART objectives are configured around maximizing profit and revenue from existing PDPs.
XML sitemaps are often huge files that require time and dedication to sift through. You can learn how to automate an XML sitemap audit with Python.
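To illustrate, here is a minimal Python sketch (the function name and the sample sitemap are my own, purely for demonstration) that extracts the `<loc>` URLs declared in a sitemap, which you could then cross-check against the URLs flagged as Crawled – Not Indexed:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace, as defined by sitemaps.org
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return the list of <loc> URLs declared in a sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

# Hypothetical sitemap fragment for illustration
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/en-gb/product-1</loc></url>
  <url><loc>https://example.com/en-gb/product-2</loc></url>
</urlset>"""

print(extract_sitemap_urls(sample))
```

From there, requesting each extracted URL and logging its status code is a natural next step for a fuller audit.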
Discontinued PDP or PLP
"Out of stock" items or old category pages are another classic reason for hampered indexation. This often translates into pages either returning a 404 status or a misleading 200 status that is highly likely to be treated as a soft 404.
Just like in this example
A specific model of the sl24 sneakers from Saint Laurent wasn’t available any longer. Despite the configuration of a traditional custom 404 error page, the URL still returns an HTTP 200 status code.
Handling discontinued products involves a number of in-depth considerations revolving around both revenue goals and an assessment of whether these pages should be temporarily or permanently discontinued.
You can find more on how to handle discontinued products in this blog post from ContentKing.
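As a rough illustration, a soft-404 check can be sketched as a heuristic in Python. The phrase list below is an assumption for demonstration purposes, not the actual signal set Google uses:

```python
# Hypothetical phrases that often betray a "gone" page served with HTTP 200
SOFT_404_PHRASES = ("page not found", "no longer available", "out of stock")

def looks_like_soft_404(status_code: int, page_text: str) -> bool:
    """Flag a 200 page whose copy reads like an error or discontinued page."""
    if status_code != 200:
        return False  # a real 404/410 is not a *soft* 404
    text = page_text.lower()
    return any(phrase in text for phrase in SOFT_404_PHRASES)

print(looks_like_soft_404(200, "Sorry, this item is no longer available."))  # True
print(looks_like_soft_404(404, "Page not found"))                            # False
```

Running a check like this over a crawl export helps you spot PDPs at risk of being treated as soft 404s before Search Console reports them.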
Dynamic pages with query parameters
What would happen if you left your house door open for the time you’re away?
Chances are burglars will sneak into your property and you will be robbed of valuable items.
Allowing search engines free access to pages with filters exacerbates crawling and indexing issues, besides wasting crawl budget.
This is a very common instance I often stumble across when auditing larger eCommerce sites:
Given that search engines have historically preferred to crawl and index pages with static URLs, you should always make sure to design robust URL structures for your eCommerce site.
Dynamic URLs vs Static URLs – How to Audit for improved Crawl Efficacy— Simone De Palma 🦊 (@SimoneDePalma2) November 22, 2022
Within a tech audit, there are a number of underdogs that we tend to bypass pretty quickly.
In one of my latest audits, I stumbled across so many times dynamic URLs to the point I felt like throwing up🤢
What really matters here is paying attention to the robots.txt directives to ensure you deter web crawlers from accessing pages with filters.
Because they come with query parameters (e.g. ?search=), search engines might be inclined to classify these pages as "Crawled – Not Indexed".
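If you export your crawled URLs, a quick Python sketch can flag the parameterized ones worth reviewing against your robots.txt rules (the URL and the function name here are illustrative assumptions):

```python
from urllib.parse import urlparse, parse_qs

def query_parameters(url: str) -> list[str]:
    """Return the sorted query parameter names that make a URL dynamic."""
    return sorted(parse_qs(urlparse(url).query))

# A faceted-navigation URL of the kind that often ends up Crawled - Not Indexed
url = "https://example.com/en-gb/shoes?search=sneakers&colour=black"
print(query_parameters(url))  # ['colour', 'search']
```

Any URL where this returns a non-empty list is a candidate for a wildcard Disallow rule in robots.txt, provided the filtered view carries no search value of its own.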
Thin or Low-Worded Content
Here is another classic reason for pages being prevented from indexation.
This is very common for luxury brands that rely on imagery from highly visual campaigns. Due to the fixed seasonality of fashion collections, some eCommerce sites often showcase at least 3 sub-categories filled with additional sub-folders containing pages with little to no text.
It’s now very clear why this example of a summer 2020 collection campaign at Saint Laurent (en-gb) is being classified by search engines as “Crawled – currently Not Indexed”.
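A rough way to surface such pages is to count the visible words in the raw HTML. Here is a stdlib-only Python sketch; the threshold you pick (say, under 100 words) is a judgment call, not an official cutoff:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def visible_word_count(html: str) -> int:
    parser = TextExtractor()
    parser.feed(html)
    return len(" ".join(parser.chunks).split())

# Hypothetical campaign page with almost no copy
page = "<html><body><h1>Summer 2020</h1><script>var x=1;</script><p>Lookbook</p></body></html>"
print(visible_word_count(page))  # 3
```

Pages scoring near zero, like the campaign example above, are prime candidates for content integration or consolidation.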
Poor Headings Structure
It's not always about thin content, though.
In a recent first-hand test on his site, Mordy Oberstein showed that sometimes the way you serve content to users can negatively impact crawlability and accessibility.
I don’t know if it’s me or if it’s something happening more often these days in general, but I’ve seen more & more of my pages on the SEO Rant site being crawled but not indexed.— Mordy Oberstein 🇺🇦 (@MordyOberstein) September 13, 2022
So I made some changes. Lo & behold the pages are now indexed
Here’s what I found fixed my issues pic.twitter.com/RjvjEkvz6G
High Response Time for a Crawl Request
If search engines take too long to complete a crawl request to fetch a page, then chances are your pages will end up as "Crawled – not Indexed" or, at worst, "Discovered – currently not Indexed".
A high response time for a crawl request can hamper crawling and indexing performance, as Googlebot is forced to wait a long time until the very first bytes of the pages finish loading.
The Crawl Stats report on your Google Search Console root property will help you with this check.
Let’s break this graph down a bit.
| Metric | Description |
| --- | --- |
| Average page response time for a crawl request | This is the avg. response time for a crawl request to retrieve the page content. It does not include retrieving page resources (scripts, images, and other linked or embedded content) or page rendering time. |
| Total number of crawl requests | This is the total number of crawl requests to your site in the time span shown (Google says 90 days, but this could stretch a bit further). Duplicate requests to the same URL are counted. |
To prevent your pages from being excluded from the index, you should make sure that the average response time for a crawl request stays low.
Arguably, there are several methods to achieve this, and they all boil down to site speed optimizations touching on potential rendering bottlenecks.
Without further rambling, eyeball the chart to make sure that the average response time is below 300 ms, which allows search engines to achieve decent levels of crawlability.
Other considerations around rendering and Core Web Vitals should be made. For instance, a high LCP value in CrUX (Chrome User Experience Report) data may be a symptom.
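To make the 300 ms rule of thumb concrete, here is a small Python sketch. Measuring true server TTFB precisely requires lower-level instrumentation, so treat this first-byte timing as an approximation; the network call is commented out so the snippet stays self-contained:

```python
import time
import urllib.request

TTFB_BUDGET_SECONDS = 0.3  # the ~300 ms rule of thumb discussed above

def within_ttfb_budget(elapsed_seconds: float) -> bool:
    """Check a measured first-byte time against the budget."""
    return elapsed_seconds < TTFB_BUDGET_SECONDS

def measure_first_byte(url: str) -> float:
    """Approximate time until the very first byte of the response arrives."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as response:
        response.read(1)  # block until the first byte is available
        return time.perf_counter() - start

# measure_first_byte("https://example.com/")  # requires network access
print(within_ttfb_budget(0.25))  # True
print(within_ttfb_budget(0.45))  # False
```

Sampling a set of key PDPs this way gives you a quick sanity check to compare against the averages reported in Crawl Stats.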
Rendering Bottlenecks on Critical Resources
As anticipated, pages may have a bad time out in the woods of the rendering process.
Pages being prevented from indexation may suffer from severe discrepancies between the pre-render version (raw HTML) and the post-render version.
Let’s see a few culprits that could prevent successful indexation.
The URL inspection tool from Google Search Console would provide the answer.
You can head to the "More Info" tab when using the Google Search Console URL Inspection tool and look at the number of Other Errors.
Once you've found them, you can measure how many unused resources are being wasted on your site (Chrome DevTools > Coverage) and assess to what extent "Other Errors" convert into proper render-blocking resources.
Here's how to run a quick check:
1. Open Chrome DevTools.
2. Head to the Network tab.
3. Using the search bar, type in a resource from "Other Errors".
4. Right-click on the culprit resource and select "Block request URL".
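To gauge the pre-render vs post-render discrepancy mentioned above, you can diff the visible words of the raw HTML against the rendered HTML. In practice the rendered version would come from a headless browser; in this hypothetical sketch both are plain strings:

```python
import re

def visible_words(html: str) -> set[str]:
    """Crude word set: strip tags, lowercase, split on non-alphanumerics."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def injected_only(raw_html: str, rendered_html: str) -> set[str]:
    """Words that exist only after client-side rendering."""
    return visible_words(rendered_html) - visible_words(raw_html)

# Hypothetical example: an empty app shell vs its client-rendered output
raw = "<html><body><div id='app'></div></body></html>"
rendered = "<html><body><div id='app'><p>Sneakers in stock</p></div></body></html>"
print(sorted(injected_only(raw, rendered)))  # ['in', 'sneakers', 'stock']
```

A large set of injected-only words signals content that crawlers may miss or defer, which is exactly the discrepancy that can hold indexing back.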
Poor Mobile-friendly pages
Poor mobile-friendliness is commonly due to the presence of heavy resources that search engines couldn't fetch (e.g. critical CSS).
As a result, important elements of a page can be lost, leading web crawlers to withdraw from the crawling process.
It is always good practice to test with the URL Inspection tool instead, as the Mobile-Friendly Test is not that accurate and is going to be retired by Google.
How to Fix Crawled – Not Indexed Pages
As with most SEO processes, fixing indexation restraints doesn't happen in a week.
In the first place, the problem should be diagnosed from different angles using the hints mentioned above. The Page Indexing report on Google Search Console will nail down most of the effort, so you need to work closely with this invaluable first-party tool.
Due to being one of the most nuanced non-indexing reasons, there’s no single cure for pages affected by indexation delays.
Instead, make sure to consider the following:
- 1️⃣ Submit a dedicated XML sitemap.
It is widely recommended to submit the XML sitemap via both the Search Console property and the robots.txt file to prompt search engines to pay a visit to your website's most relevant pages.
- 2️⃣ Fix discontinued PDP or PLP.
Depending on items' availability over time and how much revenue they generate, you should consider removing these pages and their internal links and returning an HTTP 410 response code. This will help you save crawl budget, given that Google tends to crawl pages returning HTTP 410 less frequently than those returning HTTP 404.
- 3️⃣ Ensure key PDPs are configured as static URLs.
Although search engines can crawl parameters, having a clear URL structure benefits the overall website navigation. In case key PDPs and PLPs were caught with unwanted query parameters, consider to what extent you should block them in your site's robots.txt file.
- 4️⃣ Remove render-blocking resources and keep crawl requests time at bay.
As anticipated, you need to ensure your pages aren't overly reliant on client-side rendering. This could lead to the rendered HTML containing dynamically injected content that doesn't exist in the raw HTML. This becomes an issue when you expect the missing information to be discovered by web crawlers, but it turns out it isn't.
- 5️⃣ Improve headings and evaluate room for content integration.
This includes rephrasing sentences where the content is either too short or hard to read. Ideally, you should write concise sentences, avoiding adverbs and overly specific jargon. Rather, target the right entities and find proper synonyms to convey elaborate concepts. You can rephrase your sentences using Text Analyzer and leverage fitting entities with Google's NLP tool demo.
- 6️⃣ Adjust search intent so that it aligns with the intent shifts caused by a core update. This includes tweaking title tags and meta descriptions accordingly, but also fine-tuning your content with contextualized internal links.
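The decision tree for discontinued PDPs from point 2️⃣ can be sketched as a small Python helper. The inputs and the redirect branch are my own assumptions about a typical setup, not a universal rule:

```python
def discontinued_status(permanently_discontinued, has_close_replacement):
    """Sketch of the decision tree for a discontinued product page."""
    if permanently_discontinued and has_close_replacement:
        return "301 -> replacement PDP"  # preserve equity where a close substitute exists
    if permanently_discontinued:
        return 410                       # gone for good: crawled less often than a 404
    return 200                           # temporarily out of stock: keep the page live

print(discontinued_status(True, False))   # 410
print(discontinued_status(True, True))    # 301 -> replacement PDP
print(discontinued_status(False, False))  # 200
```

Encoding the policy this way makes it easy to apply consistently across a large catalogue, whatever platform actually serves the responses.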
There’s a lot going on when it comes to fixing indexation restraints despite pages being crawled.
Having a large proportion of crawled pages knocking on Google's index to no avail can wreak havoc on your crawl budget and harm crawl efficacy. On the flip side, you can look at the issue from several angles and use different methods to nail it down.
Hopefully, this post contributed to achieving at least a small part of it.
Let me know in the comments or on Twitter if you’re struggling with “Crawled – not Indexed” pages.
What does Indexing mean?
Indexing refers to the practice search engines use to organize the information from the websites that they visit.
Indexing is a common practice for search engines and represents the ultimate parsing stage of strings of structured and unstructured information following web crawling and rendering.
It's arguably the most important stage, as content that is not in the index can't rank in search results.
What is indexed vs non-indexed?
An indexed page is stored in a search engine's database and eligible to appear in search results, whereas a non-indexed page may have been discovered or crawled but cannot rank.
Why my products are not indexed in Google?
eCommerce websites may encounter a few indexation restraints due to the following reasons:
– Lack of a valid XML sitemap containing key product pages
– PDP with Out of Stock Items
– Dynamic pages with query parameters
– Thin or Low-Worded Content
– Poor Headings Structure
– High Time to First Byte (TTFB)
– Rendering Bottlenecks on Critical Resources
How long does it take to get indexed by Google?
It's not possible to estimate exactly how long a page will take to get indexed after being submitted to Google. This may vary depending on a number of factors, ranging from a website's size to the type of crawling issue encountered.
Further analysis might enable SEO professionals to make more accurate estimates, albeit ones affected by persistent outliers.