👨🏻‍🤝‍👨🏻Duplicate Content Audit with a taste of Analytics with Python

Reading time: 12 Minutes

Running a duplicate content audit is often an overlooked task.

SEO professionals are affected by a blend of confirmation bias and attentional bias: the brain's tendency to focus on information that supports our beliefs and ignore information that contradicts them.

In SEO, confirmation bias occurs whenever you shape the approach to an audit in a way that supports your past experience. This combines with attentional bias, as you tend to focus on information rooted in emotions and beliefs while disregarding the rest.

As a result of this selective perception of what to prioritise within an audit, we are likely to filter out information that does not fit our goals or expectations.

From a technical standpoint, auditing duplicate content can be tricky due to how Google flags the problem within the Coverage report in Search Console.

Should you investigate duplicates from the “Duplicate without user-selected canonical” status or the “Duplicate, Google chose different canonical than user” status?

And once you’ve found out, is there a way to smooth the auditing process?

In this post, I’m going to cover these questions by taking you through an automated approach to auditing duplicate content with Python.

Requirements for this Tutorial
1️⃣ Google Search Console API
2️⃣ Screaming Frog (or a web crawler of your choice)
3️⃣ Pandas – Python (the basics)
4️⃣ Plotly – Python

What is Duplicate Content

Let me spend a few words on the subject of the audit.

Duplicate content is content available on multiple URLs on a website. Because more than one URL shows the same content, search engines don’t know which URL to list higher in the search results.

As John Mueller has long stated, having the same content repeated across multiple pages won’t in itself lower search rankings. In fact, it’s inevitable for eCommerce or news websites to deal with a certain amount of duplication.

Nevertheless, identical content served on different URLs not only can hamper user experience but can also significantly throttle crawling, with immediate effects on indexation.

For this reason, Google first introduced the Panda update in 2011. Its aim of rewarding high-quality sites with little to no thin or duplicate content has recently been reinforced with the rollout of the Helpful Content Update.

But when exactly do you face instances of duplication?

When you have different language versions of a single page, and the primary content remains in the same language.

In other words, if the body of a webpage remains the same – but the header and the footer differ – pages are considered to be duplicates.
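To make that concrete, one common way to detect this kind of duplication is to hash the normalised main-content text while ignoring the boilerplate around it. A minimal sketch, where the `body_fingerprint` helper and the sample strings are purely hypothetical:

```python
import hashlib

def body_fingerprint(body_text: str) -> str:
    """Hash the normalised main-content text (header/footer excluded upstream)."""
    # Collapse case and whitespace so cosmetic differences don't change the hash
    normalised = " ".join(body_text.lower().split())
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

page_a = "Red running shoes. Lightweight and durable."
page_b = "Red  running shoes.\nLightweight and durable."  # same body, different whitespace

# Identical fingerprints mean the two pages count as duplicates
print(body_fingerprint(page_a) == body_fingerprint(page_b))  # True
```

Two pages with the same fingerprint share the same primary content, regardless of how their headers and footers differ.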

In line with the purposes of an audit, you ultimately need to be aware of common causes for duplicate content:

  • Auto-generated pages via in-built CMS (Magento is likely the world champion in page duplication)
  • Poor Faceted/filtered navigation
  • Incorrect canonicalisation
  • A significant lack of unique content
  • Device variants: a page with both a mobile and a desktop version
  • Protocol variants: the HTTP and HTTPS versions of a site
  • URL variants – URLs with and without trailing slashes and URLs with and without capital letters
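As a quick illustration of the last three causes, a few lines of Python can show how many distinct URLs collapse into a single page once protocol, case, and trailing slashes are normalised. The `normalize_url` helper below is a hypothetical sketch, not part of the audit itself:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Collapse protocol, case, and trailing-slash variants into one form."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    scheme = "https"                         # protocol variants
    netloc = netloc.lower()                  # host casing
    path = path.lower().rstrip("/") or "/"   # case + trailing-slash variants
    return urlunsplit((scheme, netloc, path, query, fragment))

variants = [
    "http://example.com/Shoes/",
    "https://example.com/shoes",
    "https://EXAMPLE.com/shoes/",
]

# All three variants collapse into one canonical form
print({normalize_url(u) for u in variants})
```

Three URL variants, one page: exactly the kind of duplication search engines have to resolve on their own when no canonical is declared.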

How to Audit Duplicate Content

  • Head to Google Search Console > Pages > look for “Duplicate without user-selected canonical” > export the list

  • Segment the pages into templates > navigate the website and search for the right CSS selector

  • Use a crawler of your choice (e.g. Screaming Frog) to add the custom extraction selectors and crawl the list of pages using the Search Console API

  • Export the following reports:

    • Search Console All
    • Custom Extraction
    • Duplicate page titles
    • Duplicate Meta descriptions

Search Console and the Duplicate Content Status Dilemma

In the Pages report on Google Search Console, you might trip up on similar-sounding statuses pointing at duplication.

The difference between “Duplicate without user-selected canonical” and “Duplicate, Google chose different canonical than user” is in how Google determines which page should be indexed when it finds duplicate content on a website.

🔦BONUS

I recommend reading an in-depth take from JC Chouinard nailing down a Google patent to get an understanding of how the search engine handles duplicate content.

  • “Duplicate without user-selected canonical” means the website owner has not specified which URL should be considered as the canonical version.
  • “Duplicate, Google chose different canonical than user” means the website owner has specified a canonical URL but Google has chosen a different one.

Setting aside the pivotal importance of a canonical audit for visibility in search results, targeting the “Duplicate without user-selected canonical” status turns out to be the best entry point for duplicate content checks.

Duplicate Content Audit with Python

Python is the GOAT for data analysis, but it also works very well for automating certain SEO checks, such as an XML sitemap audit.

To audit the “Duplicate without user-selected canonical” status, you first have to install and import a few dependencies.

Here’s what you need to know about them:

  • Plotly – an open-source data visualization and analytics library that provides a high-level interface for creating interactive, web-based visualizations.
  • Pandas – a library for data analysis and manipulation that provides the data structures and functions needed to work with structured data, such as spreadsheets or SQL tables.
  • Numpy – a library for mathematical and scientific computing in Python that provides functions for working with arrays, the main data structure used in scientific computing.
  • Plotly Express – a high-level library for data visualization built on top of Plotly that provides a simple way to create interactive visualizations with just a few lines of code.
!pip install plotly
import pandas as pd
import numpy as np
import plotly.express as px

Data Processing

Once the environment has been set up, it’s time to let Python do the heavy lifting for you.

First, let’s import the Google Search Console report that you should have derived from the API via a third-party crawler.

search_console = pd.read_excel('search_console_all.xlsx')
df = pd.DataFrame(search_console, columns=['Address', 'Status Code', 'Clicks', 'Impressions', 'Coverage', 'Days Since Last Crawled'])
df.head()

Next, we import the custom extraction file retrieved from the web crawler – in my case, Screaming Frog.

extraction = pd.read_excel('custom_extraction_all.xlsx')
df1 = pd.DataFrame(extraction, columns=['Address','Page Template'])

df1.head()

Rinse and repeat the same process with the file exports for duplicate page titles and meta descriptions.

#import the duplicate meta description export

meta_description = pd.read_excel('meta_description_duplicate.xlsx')
df2 = pd.DataFrame(meta_description, columns=['Address','Meta Description 1'])

#import the duplicate page titles export

page_titles = pd.read_excel('page_titles_duplicate.xlsx')
df3 = pd.DataFrame(page_titles, columns=['Address','Title'])

💡BONUS

Learn how meta descriptions can impact your rankings using a powerful neural network such as Sentence Transformers applied to SEO

Next, merge the results into a single data frame and filter out any non-2xx pages. This will allow you to focus exclusively on valid URLs with multiple versions.

result = pd.merge(df, df1, on='Address', how='left')
result = pd.merge(result, df2, on='Address', how='left')
result = pd.merge(result, df3, on='Address', how='left')

#convert data types
result['Days Since Last Crawled'] = result['Days Since Last Crawled'].fillna(0).astype('int64')
result['Status Code'] = result['Status Code'].round(0).astype('int64')
result['Clicks'] = result['Clicks'].fillna(0).astype('int64')
result['Impressions'] = result['Impressions'].fillna(0).astype('int64')

#keep only 200 (OK) pages
result = result[result['Status Code'] == 200]
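Before moving on, it can be worth sanity-checking how many crawled URLs actually matched a custom-extraction row, since a left merge silently produces NaN for misses. A quick sketch using pandas’ `indicator` flag, with toy frames standing in for the real exports:

```python
import pandas as pd

# Toy frames standing in for the Search Console and custom-extraction exports
df = pd.DataFrame({"Address": ["/a", "/b", "/c"], "Clicks": [3, 0, 1]})
df1 = pd.DataFrame({"Address": ["/a", "/c"], "Page Template": ["PLP", "PDP"]})

# indicator=True adds a _merge column flagging which side each row came from
check = pd.merge(df, df1, on="Address", how="left", indicator=True)
match_rate = (check["_merge"] == "both").mean() * 100

print(f"URLs with a template match: {match_rate:.1f}%")
```

A low match rate usually means the custom extraction selectors missed some templates and the crawl should be re-run before trusting the merged report.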

Before downloading the full report, you should apply some further data cleaning:

duplicate = result.dropna(subset=['Meta Description 1', 'Title'])
duplicate.to_excel('duplicate.xlsx',index=False)
duplicate.head()

And this is the Excel file you might end up with.

That should be the last stop of your duplicate content audit.

Duplicate Content Analysis

If you wish to go down the rabbit hole and get your hands dirty with a bit of data analysis, stick around and keep reading.

Before acting on the list of dupes, you can leverage the full power of your Search Console insights to understand the duplicate pages from a search performance standpoint.

To start off, you should make sure dupes don’t generate a significant wave of search traffic.

Although this is generally not the case, some pages may show a streak of historical clicks or impressions depending on the timeframe you configured prior to running the crawl.

That said, let’s investigate the number of duplicates with at least one click:

at_least_one_click = (duplicate.groupby('Address')['Clicks'].sum() > 0).sum()
print("Number of unique addresses with at least 1 click:", at_least_one_click)

Likewise, compute the share of duplicate pages that earned at least one click:

click_percentage = at_least_one_click / len(duplicate) * 100
print("Percentage of Pages with at least 1 click: {:.2f}%".format(click_percentage))
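The same pattern works for impressions. A sketch reusing the logic above, shown here with a toy stand-in for the `duplicate` frame so it runs on its own:

```python
import pandas as pd

# Toy stand-in for the `duplicate` frame built earlier in the audit
duplicate = pd.DataFrame({
    "Address": ["/a", "/b", "/c", "/d"],
    "Impressions": [120, 0, 4, 0],
})

# Count addresses whose summed impressions exceed zero
at_least_one_impression = (duplicate.groupby("Address")["Impressions"].sum() > 0).sum()
impression_percentage = at_least_one_impression / len(duplicate) * 100

print("Pages with at least 1 impression: {:.2f}%".format(impression_percentage))
```

If a meaningful share of dupes still earn impressions, de-duplication work should be prioritised (and redirected) with extra care.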

Next, you may want to know how long it’s been since Googlebot last touched these URLs.

This is an interesting stage of the analysis because it allows us to formulate assumptions on the number of pages still floating in the crawl queue.

Once I had computed the median of the “Days Since Last Crawled” column (90 days in my case), I let Python calculate the percentage of URLs last crawled at or above that value.
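The median itself can be pulled straight from the column. A sketch with toy values standing in for the real `duplicate` frame (the 90-day threshold used below is simply what this step returned for my dataset):

```python
import pandas as pd

# Toy stand-in for the `duplicate` frame built earlier in the audit
duplicate = pd.DataFrame({"Days Since Last Crawled": [12, 45, 90, 120, 200]})

# The median is robust to outliers, unlike the mean
median_days = duplicate["Days Since Last Crawled"].median()

print("Median days since last crawled:", median_days)
```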


mask = duplicate['Days Since Last Crawled'] >= 90

# Count the number of True values in the mask
count = mask.sum()

percentage = count / len(duplicate) * 100

print("Percentage of values >= 90: {:.2f}%".format(percentage))

The results were not surprising, as they confirmed Google’s crawling behaviour on duplicate pages.

As slightly more than 50% of duplicate URLs were last crawled more than three months ago, we can infer that page variants are crawled less frequently than their unique versions.

🔦BONUS

Measuring crawl behaviour can help you keep track of bottlenecks or indexation opportunities. You can use Python to measure crawl efficacy and understand how responsive your website’s pages are.

Plotting Duplicate Content

Whether you’re auditing for duplicate content or running an in-depth analysis, bear in mind that data visualization is a critical step. As such, it has to be done right.

And when I say right, I mean clear and straightforward. I recommend catching up with this post, which nails down the data visualization techniques you need to improve your SEO storytelling.

Coming back to the output of your audit, you could break down duplicates by the Coverage status reported in Google Search Console.

💡Pro Tip

Please note the GSC Coverage report stores historical data, whereas third-party crawlers process your URLs with real-time data. Using the GSC API in combination with a web crawler helps you overcome this temporal friction by outputting real-time indexing statuses.

As the original conditions of the audit may have shifted since the export, it’s important to cross-check how Google considers your duplicates in real time.

# Group by Coverage and count the number of occurrences
coverage = duplicate.groupby(['Coverage']).count().reset_index()

# Sort the results in descending order by count
coverage = coverage.sort_values(by='Address', ascending=False)

# Create a bar chart of the top 5 Coverage statuses using Plotly Express
fig = px.bar(coverage.head(5), x='Coverage', y='Address')

# Update the chart layout
fig.update_layout(
    title='Duplicate Pages by Index Coverage',
    xaxis_title='Coverage',
    yaxis_title='Count of URLs',
    width=1024, 
    height=600,
    template='simple_white', 
    yaxis = dict(
        tickmode = 'array',
    ),
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=0.01
    ))

# Show the chart
fig.show()

Another advantageous check is to leverage the page-template segmentation to understand which sections of your site are most affected by duplicate content.

This is a crucial check to run in technical SEO because the end goal of your deliverable is to give web developers clear areas of focus so they can implement changes site-wide.

# Group by Page Template and count the number of occurrences
page_template = duplicate.groupby(['Page Template']).count().reset_index()

# Sort the results in descending order by count
page_template = page_template.sort_values(by='Address', ascending=False)

# Create a bar chart of the top 5 Page Templates using Plotly Express
fig = px.bar(page_template.head(5), x='Page Template', y='Address')

# Update the chart layout
fig.update_layout(
    title='Duplicate Pages by Page Templates',
    xaxis_title='Page Template',
    yaxis_title='Count of URLs',
    width=1024, 
    height=600,
    template='simple_white', 
    yaxis = dict(
        tickmode = 'array',
    ),
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=0.01
    ))

# Show the chart
fig.show()

Conclusion

Sometimes we’re so biased by previous experience and beliefs that we stay glued to the same old method. As long as it works, there’s nothing wrong with that, but beware that it may cement your bias wonderwall.

I like to draw three takeaways from this post:

  1. We need to do our best to raise awareness around psychographic biases. If we don’t, we’re doomed to see only half of a vast painting.
  2. Running a duplicate content audit is easy if you know where to fetch your stash. Python can help you with that, as it allows you to automate a plethora of boring processes.
  3. Bear in mind the difference between an audit and analytics. Despite this post covering both, the purpose of an audit is to describe potential issues, whereas analytics is about understanding the issue to formulate assumptions.

Ad Maiora!

FAQ

What is Duplicate Content?

Duplicate content is content available on multiple URLs on a website. Search engines don’t know which URL to list higher in the search results because more than one URL shows the same content.

Why is Duplicate Content important for SEO?

Having identical content being served on different URLs can not only hamper user experience but can significantly throttle crawling with immediate effects on indexation.

What are common causes for duplicate content?

– Auto-generated pages via in-built CMS
– Poor Faceted/filtered navigation
– Incorrect canonicalisation
– A significant lack of unique content.
– Device variants: a page with both a mobile and a desktop version
– Protocol variants: the HTTP and HTTPS versions of a site
– URL variants: URLs with and without trailing slashes and URLs with and without capital letters

What is the difference between the Duplicate statuses from the Google Search Console?

– “Duplicate without user-selected canonical” means the website owner has not specified which URL should be considered as the canonical version

– “Duplicate, Google chose different canonical than user” means the website owner has specified a canonical URL but Google has chosen a different one.

What libraries do you need to install to perform a Duplicate Content Audit with Python?

You need to install the following libraries: Plotly, Pandas, Numpy, and Plotly Express.

This blog post was generated using automated technology and thoroughly edited and supervised by Simone De Palma