Orphan pages are pages on a website that no internal link points to, which means they cannot be reached through the main navigation.
Typically, these pages are not indexed unless they were linked in the past or can be discovered through other sources such as XML sitemaps or external links.
While having a small number of orphan pages is generally not a significant concern according to Google, it can become problematic at scale. Orphan pages can contribute to index bloat and waste crawl budget, leading to lower search rankings.
During my audits of large eCommerce websites, I often come across orphan pages as one of the most common issues.
The main causes of orphan pages include:
- Discontinued product pages: This is the most common cause, where out-of-stock items are still present on valid and indexable pages (returning HTTP 200 and allowing indexing and following).
- Old unlinked pages: Pages that were once published but are no longer linked within the website’s structure.
- Site architecture issues: Poor vertical linking, where pages are not connected properly within the hierarchy of the website.
- Auto-generation of unknown URLs at the CMS level.
- Massive rendering issues that prevent search engines from accessing internal links.
Now that we have a better understanding of orphan pages, we can dive into the details using Python and SEO techniques.
For this tutorial, I will be using sample screenshots from a recent audit conducted on a large fashion luxury eCommerce website.
In this post, I will outline an accurate method using basic Python operations to streamline the process of auditing orphan pages.
🔦 Important Note – I strongly advise you to follow along with this Google Colab, which I will refer to throughout the tutorial.
Orphan Pages Tech Stack
Auditing Orphan pages requires you to use a combination of first-party and third-party SEO tools.
|Tool|Purpose|
|---|---|
|Screaming Frog (or another web crawler)|Retrieve technical insights (e.g. HTTP status codes)|
|Google Search Console|Retrieve search performance insights (Clicks, Impressions, CTR, Position)|
|Google Analytics|Retrieve search and business insights (Sessions, Transactions)|
The insights gained from these tools should then be compared using several Vlookup operations, which involve comparing a cleaned dataset from Screaming Frog with a cleaned dataset that includes Google’s data.
Any URLs that are present in the latter dataset but not found in the Screaming Frog dataset will be classified as orphan pages.
Crawl a Website
Let’s start off by running an ordinary crawl of a website with Screaming Frog.
First, make sure the crawl excludes any URLs that are blocked by robots.txt.
This is the crawler's default behaviour, so the configuration can be left as it is.
Start the crawl and once it’s complete, export just the internal HTML report.
Next, let’s get the ball rolling with a comprehensive data cleaning session on the dataset.
- Remove any noindex pages
- Remove any URLs that return a non-200 status code
- Remove any non-indexable pages that canonicalize to other URLs.
Lastly, make sure to keep this dataset safe as it will be used later on.
Note that I will refer to this file as “Cleaned Screaming Frog” throughout this tutorial.
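The three cleaning rules above could be sketched in pandas as follows. This is only an illustration on toy data: the column names ("Address", "Status Code", "Meta Robots 1", "Canonical Link Element 1") are assumptions about the export and should be adjusted to match your own file.

```python
import pandas as pd

# Toy stand-in for the internal HTML export.
# Column names are assumed; adjust them to match your own export.
crawl = pd.DataFrame(
    {
        "Address": [
            "https://example.com/",
            "https://example.com/dress",
            "https://example.com/old-bag",
            "https://example.com/dress?color=red",
            "https://example.com/private",
        ],
        "Status Code": [200, 200, 404, 200, 200],
        "Meta Robots 1": ["index, follow", None, None, None, "noindex, follow"],
        "Canonical Link Element 1": [
            "https://example.com/",
            "https://example.com/dress",
            None,
            "https://example.com/dress",
            "https://example.com/private",
        ],
    }
)

# 1. Keep only pages that return HTTP 200.
is_200 = crawl["Status Code"] == 200

# 2. Flag pages carrying a noindex directive.
is_noindex = crawl["Meta Robots 1"].fillna("").str.contains("noindex", case=False)

# 3. Flag pages whose canonical points to a different URL.
#    A missing canonical is treated as self-referencing.
canonical = crawl["Canonical Link Element 1"].fillna(crawl["Address"])
is_canonicalised = canonical != crawl["Address"]

cleaned_sf = crawl[is_200 & ~is_noindex & ~is_canonicalised].reset_index(drop=True)
print(cleaned_sf["Address"].tolist())
```

Only the homepage and the self-canonical product page survive the three filters in this toy dataset.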
Work Out Google Search Data
The next step is to retrieve a full list of organic landing pages from Google Search Console and Google Analytics from the last 3 months.
🔦 Important note: the Search Console interface only exports up to 1,000 rows, so use either Looker Studio or the Google Search Console API to retrieve a larger list of landing pages.
I found this blog post very useful to connect with the GSC API.
After obtaining the list of landing page URLs, this data must be cleaned:
- Remove duplicates
- Remove canonicalised pages
- Remove non-indexable pages (noindex pages, redirects and 404 pages).
This can be done by running a crawl in list mode with Screaming Frog for both the Search Console and the Analytics batch of URLs.
Performing a crawl of the GSC and GA pages brings you more technical information that you could use to add more value to your dataset in order to structure your audit.
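Before the list-mode crawl, the duplicate removal could look like the sketch below. The column names ("page", "clicks") are hypothetical, and summing the metric across duplicate rows is my own assumption; a plain `drop_duplicates` would also work if you only need the URLs.

```python
import pandas as pd

# Toy stand-in for a GSC export; "page" and "clicks" are assumed column names.
gsc_export = pd.DataFrame(
    {
        "page": [
            "https://example.com/dress",
            "https://example.com/dress",
            "https://example.com/bag",
        ],
        "clicks": [10, 5, 3],
    }
)

# Collapse duplicate URLs, summing the metric so no clicks are lost.
gsc_unique = gsc_export.groupby("page", as_index=False, sort=False)["clicks"].sum()

# One URL per line, ready to paste into a list-mode crawl.
url_list = "\n".join(gsc_unique["page"])
print(url_list)
```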
At the end of the crawl, you will have:
– The original GSC/GA export (Fig.1)
– The crawl of the GSC/GA URLs (Fig.2)
Here comes the tricky part as we’re going to use a lot of Vlookup.
I used to dread this function back when my executive skills were limited to Excel alone.
However, with the power of Python, not only was I able to streamline the process, but I also discovered how much easier it can be.
Before we dive into the full data transformation process, I remind you to follow along with the Colab shared at the top of this section. It will enhance your understanding and allow you to implement the steps more effectively.
- Use a VLOOKUP formula to compare the original GSC export with the GSC crawl of URLs. This will generate a new file that we will refer to as “GSC cleaned”.
- Use a VLOOKUP formula to compare the original GA export with the GA crawl of URLs. This will generate a new file that we will refer to as “GA cleaned”.
- Use a VLOOKUP formula to compare “GSC cleaned” against “GA cleaned”. This will generate a new file that we will refer to as “Google”.
- You should have the Screaming Frog dataset stored somewhere safe (see the “Crawl a Website” section). So, let’s use VLOOKUP once more to compare “Google” with the “Cleaned Screaming Frog”.
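The first three lookups can be sketched with `pandas.DataFrame.merge`, which plays the role of VLOOKUP. This is a simplified illustration on toy data and not a copy of the Colab: the column names, the `keep_indexable` helper and the choice of an outer merge for the "Google" file are all assumptions.

```python
import pandas as pd

A, B, C, D = (f"https://example.com/{p}" for p in "abcd")

# Toy exports; every column name here is an assumption for illustration.
gsc_export = pd.DataFrame({"url": [A, B, C], "clicks": [12, 4, 7]})
ga_export = pd.DataFrame({"url": [A, D], "sessions": [30, 9]})

# List-mode crawls of the same URLs, carrying the indexability verdict.
gsc_crawl = pd.DataFrame(
    {"url": [A, B, C], "indexability": ["Indexable", "Non-Indexable", "Indexable"]}
)
ga_crawl = pd.DataFrame({"url": [A, D], "indexability": ["Indexable", "Indexable"]})


def keep_indexable(export, crawl):
    """Hypothetical helper: look up the crawl verdict, then keep indexable rows."""
    merged = export.merge(crawl, on="url", how="left")
    return merged[merged["indexability"] == "Indexable"].drop(columns="indexability")


gsc_cleaned = keep_indexable(gsc_export, gsc_crawl)  # "GSC cleaned"
ga_cleaned = keep_indexable(ga_export, ga_crawl)     # "GA cleaned"

# Assumption: an outer merge, so a URL seen by either tool is kept in "Google".
google = gsc_cleaned.merge(ga_cleaned, on="url", how="outer")
print(google)
```

With an outer merge, a URL that only appears in one tool keeps its row and gets NaN for the other tool's metric.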
If we observe how the Vlookup is performed, we will notice that a left merge includes all rows from 'site_crawl' and only matching rows from 'google'. The 'how' parameter is set to 'left', indicating that the merge should prioritize all the rows in the 'site_crawl' data frame and keep only the matching rows from the 'google' data frame.
So, any row without a match in the other data frame is filled with missing values (NaN), which can be described as Missing at Random (MAR).
In data analysis, this term refers to a systematic relationship between the missing values and the observed data.
In simple terms, the NaN values are not random noise: they are a by-product of the merge and flag URLs that exist in one dataset but not in the other.
Mind the direction of the comparison: as defined earlier, the orphan pages are the URLs present in the “Google” dataset that have no match in the “Cleaned Screaming Frog” dataset.
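A minimal sketch of this final lookup is shown below. It follows the definition given earlier in the post (URLs found in the Google data but absent from the crawl), so the Google data sits on the left of the merge here. That placement is an assumption of this sketch and may differ from the Colab. The `indicator=True` argument adds a `_merge` column that makes unmatched rows easy to isolate.

```python
import pandas as pd

# Toy data; column names are assumptions.
google = pd.DataFrame(
    {
        "url": ["https://example.com/a", "https://example.com/c", "https://example.com/d"],
        "clicks": [12, 7, 0],
    }
)
site_crawl = pd.DataFrame(
    {
        "url": ["https://example.com/a", "https://example.com/b"],
        "status_code": [200, 200],
    }
)

# Left merge: keep every Google URL, attach crawl columns where a match exists.
merged = google.merge(site_crawl, on="url", how="left", indicator=True)

# Unmatched rows have NaN in the crawl columns and "_merge" == "left_only".
orphans = merged.loc[merged["_merge"] == "left_only", ["url", "clicks"]]
print(orphans["url"].tolist())
```

Swapping the two data frames would instead return crawled pages with no Google data, which is a different list, so it is worth checking which side of the merge each dataset is on.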
Data Verification of the Output
Did we do a good job? If you thought we were finished, think again!
It’s crucial to double-check the accuracy of our previous operations, and the best way to do that is by running another crawl.
This time, the crawl will focus exclusively on the orphan pages we identified in the previous exercise. We want to gather fresh data and ensure that we have the most up-to-date information for analysis.
- Activate the Search Console and Analytics API in Screaming Frog. Then, initiate a crawl in List Mode, specifically targeting the orphan pages.
Remember to synchronize both APIs with the appropriate country settings for inspection.
- Once the crawl is complete, export the data from both the search_console and analytics tabs. We’ll need these datasets for further analysis.
- Now comes the important step of data cleaning. Take the search_console file, and manually filter out URLs labeled as “page with redirect” and “duplicate with user canonical.” We want to focus solely on the relevant data for our analysis.
- Next, perform a Vlookup operation on the search_console dataset against the analytics dataset. This step will allow us to combine the information and gain deeper insights. As a final step, conduct the last round of data cleaning by renaming and removing unnecessary columns.
By following these steps, we can ensure the accuracy of our analysis and have a comprehensive dataset for further investigation.
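The cleaning and lookup described in this verification step could be sketched as follows. The column names ("Address", "Coverage", "Clicks", "Sessions") are assumptions about the two exports, and the excluded labels are simply the ones named above, matched case-insensitively.

```python
import pandas as pd

# Toy versions of the two tab exports; column names are assumed.
search_console = pd.DataFrame(
    {
        "Address": [
            "https://example.com/c",
            "https://example.com/d",
            "https://example.com/e",
        ],
        "Clicks": [7, 0, 2],
        "Coverage": [
            "Submitted and indexed",
            "Page with redirect",
            "Duplicate with user canonical",
        ],
    }
)
analytics = pd.DataFrame(
    {"Address": ["https://example.com/c", "https://example.com/e"], "Sessions": [15, 4]}
)

# Labels to exclude, taken from the step above.
excluded = {"page with redirect", "duplicate with user canonical"}
keep = ~search_console["Coverage"].str.lower().isin(excluded)

verified = (
    search_console[keep]
    .merge(analytics, on="Address", how="left")  # VLOOKUP against analytics
    .drop(columns="Coverage")                    # remove columns no longer needed
    .rename(columns={"Address": "url", "Clicks": "clicks", "Sessions": "sessions"})
)
print(verified)
```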
Finally, run another crawl on the final list of URLs to spot any issues.
- Do not connect the APIs for this crawl.
- Run a crawl analysis to find URLs that are not in the sitemap, and check for missing canonicals.
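If you prefer to cross-check the sitemap outside the crawler, the comparison can be approximated with the standard library. The sketch below parses an inline sample sitemap, which is a stand-in for the site's real XML file, and the list of orphan URLs is hypothetical.

```python
import xml.etree.ElementTree as ET

# Inline sample used as a stand-in for the site's real sitemap file.
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""

# Sitemap elements live in this namespace, so it has to be passed to findall().
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
sitemap_urls = {loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns)}

# Hypothetical final list of orphan URLs.
orphan_urls = ["https://example.com/c", "https://example.com/d"]
not_in_sitemap = [url for url in orphan_urls if url not in sitemap_urls]
print(not_in_sitemap)
```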
Descriptive Analysis to Inform SEO Recommendations
Now that we have identified the orphan pages and are happy with the results, let’s take our audit to the next level.
Performing a brief exploratory data analysis (EDA) on the output will set you apart from the average SEO executive. More importantly, it will provide you with invaluable insights that can guide targeted SEO recommendations.
For instance, we can begin by examining the measures of central tendency, such as the mean, median, and mode, for clicks and sessions. Additionally, we can explore the distribution of data points that impact these metrics.
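As a rough illustration, the measures of central tendency can be computed in a few lines of pandas. The data frame below is made up for the example and does not come from the audit.

```python
import pandas as pd

# Made-up orphan page metrics, for illustration only.
orphans = pd.DataFrame(
    {
        "url": [f"https://example.com/p{i}" for i in range(1, 8)],
        "clicks": [0, 0, 1, 2, 3, 5, 40],
        "sessions": [1, 1, 2, 4, 4, 9, 60],
    }
)

metrics = orphans[["clicks", "sessions"]]
summary = pd.DataFrame(
    {
        "mean": metrics.mean(),
        "median": metrics.median(),
        # mode() can return several values; keep the first (smallest) one.
        "mode": metrics.mode().iloc[0],
    }
)
print(summary.round(2))
```

In this toy dataset the mean for clicks (about 7.3) sits well above the median (2), a sign that a handful of high-traffic URLs skews the distribution.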
Let’s consider a specific scenario and analyze the proportion of clicks and sessions based on different classes. The boxplot below illustrates the distribution of clicks and sessions, highlighting the presence of outliers (represented by the blue bubbles) within the dataset.
Based on this analysis, a potential recommendation could involve adding more internal links and including orphan pages in the XML sitemap for URLs whose clicks or sessions exceed the median or, as a stricter cut-off, the third quartile.
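One way to turn that recommendation into a shortlist is to filter the orphan pages by the median and the third quartile. The sketch below uses made-up click data, and which threshold to apply remains a judgment call.

```python
import pandas as pd

# Made-up orphan page metrics, for illustration only.
orphans = pd.DataFrame(
    {
        "url": [f"https://example.com/p{i}" for i in range(1, 8)],
        "clicks": [0, 0, 1, 2, 3, 5, 40],
    }
)

median = orphans["clicks"].median()
q3 = orphans["clicks"].quantile(0.75)

# Broader shortlist: everything above the median.
above_median = orphans[orphans["clicks"] > median]

# Stricter shortlist: only URLs above the third quartile.
above_q3 = orphans[orphans["clicks"] > q3]

print(above_median["url"].tolist())
print(above_q3["url"].tolist())
```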
Conducting a data-informed orphan pages audit reveals the complexity involved in the process. Despite the challenges, this framework offers an accurate method for SEO executives.
Now, it’s time for you to analyze the orphan page data and determine the set of recommendations. You could choose between the following:
- Improve internal linking: Enhance internal linking by strategically connecting orphan pages to relevant ones, boosting visibility and SEO performance.
- Remove unlinked pages: Identify and eliminate outdated or irrelevant pages to enhance user experience and SEO.
- Review discontinued product pages: Evaluate unavailable product pages, updating information or redirecting to relevant alternatives.
- Monitor CMS issues: Stay vigilant for glitches due to either auto-generated templates at the CMS level or massive reliance on render-blocking resources.
As a final note, bear in mind that auditing orphan pages is an ongoing process where continuous monitoring and optimization will ensure long-term success.
What are Orphan Pages?
Orphan pages are pages on a website that no internal link points to, which means they cannot be reached through the main navigation.
Do Orphan Pages get indexed by search engines?
Orphan pages are typically not indexed unless they were linked in the past or can be discovered through other sources such as XML sitemaps or external links.
Why can Orphan Pages be problematic at scale?
While a small number of orphan pages is generally not a significant concern, they can become problematic at scale by contributing to index bloat and wasting crawl budget, leading to lower search rankings.
What are the main causes of Orphan Pages?
The main causes of orphan pages include discontinued product pages, old unlinked pages, site architecture issues, auto-generation of unknown URLs at the CMS level, and massive rendering issues.
What recommendations can be implemented to address Orphan Pages?
Recommendations to address orphan pages include:
– improving internal linking
– removing unlinked pages
– reviewing discontinued product pages
– monitoring CMS issues related to auto-generation and rendering.