Running a Canonical Audit in Python


When indexing a site, Google needs to determine the primary content of each page. If Google finds multiple pages on the same site that seem to be the same, it chooses the page that it thinks is the most complete and useful, and marks it as canonical. As a result, canonical pages will be crawled more regularly than duplicates.

Example of rel=canonical

Running a canonical audit doesn’t mean going through each URL in a desperate attempt to emulate web crawlers. Instead, it’s best practice to start by segmenting your website into page templates.

Likewise, running technical audits means sharing insights with both the client and the developers’ team. Segmenting key pages will enable web devs to learn about the impacted areas of the website so they can automate technical implementation from the backend.

At the same time, this gives the client an overview of the most affected sections of their website, which can feed strategy considerations ranging from business to pure marketing.

Running a consistent audit on key page sections, without crossing the fine line between audit and analytics, is pivotal to meeting stakeholders’ demands.

In this post, I’ll first recap how canonical mishaps occur, the most frequent issues, and potential fixes, and then show you how to automate a quick canonical audit using Python.

🔦 Beware: this post offers an alternative way to understand your website’s canonicals, not an investigation into the reasons behind the dysfunctions.

Common Canonical Issues

Canonicalization helps prevent duplicate content and improves crawl efficacy, that is, how quickly your content gets crawled and indexed.

The main reason for devising consistent canonicalization across your website is to prevent Google from wasting time seeking the best version of a page to show on the SERP.

From a technical standpoint, canonicals come with a limited set of common mishaps that you will find referenced in any third-party crawler.

For example, Screaming Frog flags the following issues:

| Issue | Description |
| --- | --- |
| Canonicalised | The page has a “canonical URL” that differs from its own URL. This tells search engines not to index the original page but to consolidate crawl and link properties onto the target “canonical URL”. Ideally, only one version of a URL should be linked to, but cases such as duplicate content require canonicalising a page to the most similar version. |
| Missing | There’s no canonical URL present either as a link element or via HTTP header. If a page doesn’t indicate a canonical URL, Google will identify what they think is the best version or URL. This can lead to ranking unpredictability, and hence generally all URLs should specify a canonical version. |
| Multiple | There are multiple canonicals set for a URL (either multiple link elements, HTTP header, or both combined). This can lead to unpredictability, as there should only be a single canonical URL set by a single implementation (link element, or HTTP header) for a page. |
| Non-Indexable Canonical | The canonical URL is a non-indexable page. This includes canonicals that are blocked by robots.txt, return no response, redirect (3XX), client error (4XX), server error (5XX), or are ‘noindex’. Canonical versions of URLs should always be indexable, ‘200’ response pages. Therefore, canonicals that point to non-indexable pages should be corrected to the resolving indexable versions. |
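
To make these labels concrete, here is a minimal sketch of how they could be assigned to a single page. The function name `classify_canonical`, its arguments, the order of the checks, and the extra "Self-Referencing" label are assumptions made for illustration; this is not how Screaming Frog implements its checks.

```python
def classify_canonical(page_url, declared_canonicals, canonical_is_indexable=True):
    """Return an issue label for one page (hypothetical helper).

    page_url               -- the URL that was crawled
    declared_canonicals    -- every canonical found (link elements + HTTP header)
    canonical_is_indexable -- whether the canonical target is an indexable 200 page
    """
    if not declared_canonicals:
        return "Missing"
    if len(declared_canonicals) > 1:
        return "Multiple"
    if not canonical_is_indexable:
        return "Non-Indexable Canonical"
    if declared_canonicals[0] != page_url:
        return "Canonicalised"
    return "Self-Referencing"


# Made-up URLs, purely for demonstration
page = "https://example.com/shoes?colour=red"
print(classify_canonical(page, []))
print(classify_canonical(page, ["https://example.com/shoes"]))
print(classify_canonical(page, [page, "https://example.com/shoes"]))
print(classify_canonical(page, ["https://example.com/old"], canonical_is_indexable=False))
```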

Automate a Canonical Audit with Python

The very first thing to do before starting an audit is to run a crawl and export the findings.

When collecting data, you should avoid combining the insights from multiple web crawlers.

Each crawler parses data in its own way and relies on its own algorithms, so two tools can classify the same URL differently.

Although blending crawl insights from Screaming Frog with those from Lumar would increase the amount of data at your disposal, it would eventually affect accuracy, as you’re drifting away from a standardized measurement.

🔦 You might be interested in automating a Core Web Vitals audit with Python code

To start off an automated audit of your website’s canonicals, you need to export a few files into a machine learning environment. This is how I do it:

  • Run a crawl with your favourite crawler – I’d use Screaming Frog.

  • File export of Canonical Issues (e.g. Missing Canonicals, Canonicalised)

  • File export of Search Console_All to see how Google evaluates your canonicalization strategy (Search Console > export all)

Once you’ve downloaded the above reports, it’s time to set up a machine learning environment that will help us process the data.

I recommend using Google Colab. This is an extremely user-friendly notebook as it comes with most of the data libraries useful to run basic data analysis and machine learning tasks.

Install and Import dependencies

We only need to install and import a few libraries.

In the interest of time, I won’t dwell on every dependency, but here are the most useful ones.

%%capture
!pip install polyfuzz plotly dash dash_bootstrap_components 

PolyFuzz is one of my favourite Python libraries for its robust string matching and grouping. If you want to test URL similarity or build your own SERP similarity tool then I highly recommend using it.

Plotly is a high-level, declarative charting library for building clear, interactive charts. By far my top data visualization Python library, it comes in handy when aiming for clear plotting of automated market analysis with Python.

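from polyfuzz.models import TFIDF
from polyfuzz import PolyFuzz
import plotly.express as px
import pandas as pd
import pandas  # the cells below use both `pandas` and the `pd` alias
import numpy as np
# IPython.core.display is deprecated, so the helpers come from IPython.display
from IPython.display import display, HTML, display_markdown

# Widen the notebook container so wide data frames are easier to read
display(HTML("<style>.container { width:100% !important; }</style>"))

def md(text):
    """Render a string as Markdown in the notebook output."""
    return display_markdown(text, raw=True)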

Importing display_markdown lets us render results as Markdown, for example as a bold heading that states the main point of a finding. This helps keep the reader’s attention on the key takeaways of the audit.

Cleaning the Google Search Console Report

Before importing the Search Console report to see how Google evaluates your canonicalization strategy, it’s crucial to clean the data in the file.

As part of a canonical screening, you want to rule out those URLs that:

  • Are blocked by Robots.txt
  • Have a noindex tag
  • Have redirect rules in place
  • Return server errors

As a result, you will have to filter out those URLs from your spreadsheet before progressing to the next phase.

🔦 Please note that URLs labeled as “Canonicalised,noindex” from Search Console do not have to be ruled out. On the contrary, we need them to learn more about adverse canonical patterns.
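
If you would rather do this filtering in the notebook than in the spreadsheet, a minimal sketch could look like the one below. It assumes the export has an 'Indexability Status' column and that the statuses are spelled as in the `EXCLUDED` set; both are assumptions, so check the exact labels in your own file. The sample rows are made up.

```python
import pandas as pd

# Hypothetical rows standing in for the Search Console export
gsc = pd.DataFrame({
    "Address": [
        "https://example.com/blocked",
        "https://example.com/hidden",
        "https://example.com/moved",
        "https://example.com/broken",
        "https://example.com/duplicate",
        "https://example.com/fine",
    ],
    "Indexability Status": [
        "Blocked by robots.txt",
        "noindex",
        "Redirected",
        "Server Error",
        "Canonicalised,noindex",
        None,
    ],
})

# Assumed status labels to rule out; "Canonicalised,noindex" is deliberately absent
EXCLUDED = {"Blocked by robots.txt", "noindex", "Redirected", "Server Error"}

# Keep every row whose status is not in the excluded set
gsc_clean = gsc[~gsc["Indexability Status"].isin(EXCLUDED)].reset_index(drop=True)
print(gsc_clean)
```

Because the match is on the full status string, rows labeled "Canonicalised,noindex" survive the filter even though plain "noindex" rows are dropped.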

Importing Crawl Files

First, we’re going to import the Missing Canonical report.

Open the file and read the Address and Indexability Status columns. The first contains the URL, while the second holds NaN values because the canonical information is missing.

Next, we add an ‘Issue‘ column that labels every row as ‘Missing Canonical‘ and drop the ‘Indexability Status‘ column, which we no longer need.

df = pandas.read_excel('missing_canonicals.xlsx',
                   usecols=['Address', 'Indexability Status'])

# Every URL in this export lacks a canonical, so all rows get the same label
df['Issue'] = 'Missing Canonical'

# 'Indexability Status' only holds NaN here, so it can go
missing_canonicals = df.drop(columns=['Indexability Status'])

missing_canonicals.head()

Next, we open the Canonicalised URLs file and read the Address and Indexability Status columns, as they contain the URLs that are canonicalised and their status.

canonicalised = pandas.read_excel('canonicalised.xlsx',
                   usecols=['Address', 'Indexability Status'])
# Rename by column name rather than by position, so the result
# does not depend on the column order in the spreadsheet
canonicalised = canonicalised.rename(columns={'Indexability Status': 'Issue'})

canonicalised.head()

And finally, we import the Search Console report retrieved through the Search Console API in Screaming Frog.

You can keep all the default columns from the main report, as we’re only going to select the User-Declared Canonical and the Google-Selected Canonical.

🔦 Please make sure your dataset has been cleaned of all the above-mentioned indexability statuses before importing the file

search_console = pandas.read_excel('/content/2023-01-18 Bourcheonr KO search_console_all.xlsx',
                 usecols=['Address','User-Declared Canonical','Google-Selected Canonical'])

# Select the columns explicitly to fix their order; df2 is used in the merge later on
df2 = search_console[['Address','User-Declared Canonical','Google-Selected Canonical']]
df2

Now that we have two imported files outlining two of the most common canonical issues, it’s time to merge them into a single data frame that we’re calling issue_concat.

And finally, we use the Search Console report to merge findings with canonical issues and save them all into a new data frame we call result.

# Stack the two issue reports on top of each other
issue_concat = pd.concat([missing_canonicals, canonicalised], axis=0, ignore_index=True)

# Merge canonical issues with Search Console insights.
# The default inner join keeps only URLs present in both data frames.
result = issue_concat.merge(df2, on='Address')
result.to_excel('canonical_overview.xlsx', index=False)
result.head()
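
Since `merge` performs an inner join by default, any URL with an issue but no row in the Search Console report silently drops out of the result. A minimal sketch of how you could check for such URLs, using pandas' `how='left'` and `indicator=True` options on made-up data (the data frames `issues` and `gsc` are hypothetical stand-ins for issue_concat and df2):

```python
import pandas as pd

# Hypothetical stand-ins for issue_concat and df2
issues = pd.DataFrame({
    "Address": ["https://example.com/a", "https://example.com/b"],
    "Issue": ["Missing Canonical", "Canonicalised"],
})
gsc = pd.DataFrame({
    "Address": ["https://example.com/a"],
    "Google-Selected Canonical": ["https://example.com/a"],
})

# A left join keeps every issue row; the _merge column says where each row came from
checked = issues.merge(gsc, on="Address", how="left", indicator=True)

# URLs with an issue but no Search Console data
unmatched = checked.loc[checked["_merge"] == "left_only", "Address"].tolist()
print(unmatched)
```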

Understanding Canonical Issues

We’re now in a position to get a first impression of our canonical audit by outlining the main distribution of issues.

# Count URLs per issue, largest group first
issue = result.groupby('Issue').size().reset_index(name='counts').sort_values('counts', ascending=False)
# The counts are already aggregated, so a bar chart plots them directly
fig = px.bar(issue, x='Issue', y='counts')
# Y-axis ticks are left to Plotly: a fixed 0-10 tick list would hide larger counts
fig.update_layout(
    title='Main Canonical Issues',
    xaxis_title='Issues',
    yaxis_title='Count',
    width=1024,
    height=600,
    template='plotly')
fig.show()
Main canonical issues showcased with Plotly

Understanding Canonical Issues by Page Template

As anticipated, running a technical SEO audit is not about drilling into every single page.

Mapping out the most affected areas of the website will not only give the client an actionable understanding of the pain points, but also help developers focus on specific templates when automating technical fixes across the website.

You can use RegEx to extract the most common strings from a URL to draw page templates based on business goals and especially on the site structure.

To do so, we can go back to the first merged data frame we created: issue_concat.

# Later rules overwrite earlier ones, so the order below sets the precedence.
# na=False treats missing URLs as "no match" instead of raising an error.
addr = issue_concat['Address']
issue_concat.loc[addr.str.contains(r'\.html$', na=False), 'Page Template'] = 'PDP'
# Non-capturing group (?:...) avoids the pandas warning about match groups
issue_concat.loc[addr.str.contains(r'^(?!.*(?:\.html|faq|homepage|services|privacy|retailer|\?p=)).+', na=False), 'Page Template'] = 'PLP'
issue_concat.loc[addr.str.contains(r'faq', na=False), 'Page Template'] = 'FAQ'
issue_concat.loc[addr.str.contains(r'homepage', na=False), 'Page Template'] = 'Homepage'
issue_concat.loc[addr.str.contains(r'service|privacy-|contact-us', na=False), 'Page Template'] = 'Service'
issue_concat.loc[addr.str.contains(r'retailer', na=False), 'Page Template'] = 'Store Locator'
# Match p= only as a query parameter, not inside words such as "step="
issue_concat.loc[addr.str.contains(r'[?&]p=', na=False), 'Page Template'] = 'Pagination'
issue_concat
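
The same idea can also be written as a single function with an ordered list of rules, which some may find easier to maintain and test. This is only a sketch: the function name `page_template`, the rule list, the "first match wins" ordering, and the example URLs are assumptions; adapt the patterns to your own site structure.

```python
import re

# Ordered rules: the first pattern that matches wins.
# The order mirrors the precedence of the assignments above.
TEMPLATE_RULES = [
    ("Pagination", re.compile(r"[?&]p=")),
    ("Store Locator", re.compile(r"retailer")),
    ("Service", re.compile(r"service|privacy-|contact-us")),
    ("Homepage", re.compile(r"homepage")),
    ("FAQ", re.compile(r"faq")),
    ("PDP", re.compile(r"\.html$")),
]


def page_template(url, default="PLP"):
    """Return the template label for a URL, or `default` if no rule matches."""
    for name, pattern in TEMPLATE_RULES:
        if pattern.search(url):
            return name
    return default


# Made-up URLs for demonstration
for url in [
    "https://example.com/shoes/red-sneaker.html",
    "https://example.com/shoes",
    "https://example.com/shoes?p=2",
    "https://example.com/faq",
    "https://example.com/retailer/london.html",
]:
    print(url, "->", page_template(url))
```

With pandas, a function like this could be applied through `issue_concat['Address'].map(page_template)`, provided the column holds no missing values.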

Once we’ve found a common thread for our pages, we can plot a bar chart showcasing how canonical issues occur across each page template.

# 'Page Template' was added to issue_concat, so that is the data frame to group
template_issue = issue_concat.groupby(['Page Template'])['Issue'].count().reset_index(name='count').sort_values(by='count', ascending=False)
fig = px.bar(template_issue, x='Page Template', y='count')
fig.update_layout(
    title='Canonical Issues by Page Template',
    xaxis_title='Page Template',
    yaxis_title='Count of Issues',
    width=1024,
    height=600,
    template='plotly')
fig.show()
canonical issues by page template

Google’s Understanding of your Canonical Tags

This is the moment when you want to figure out if Google deems your original pages as canonical.

The Google-selected canonical reflects what the search engine deems to be the most representative version of the URL. Bear in mind, though, that Google’s choice does not always match the page you would pick yourself, as its algorithms have no direct view of your page’s business goals.

From the result table, we’re going to inspect the length of both the Address column and the Google-Selected Canonical column as a prerequisite for a similarity calculation that uses a TF-IDF model through PolyFuzz.

page_url = result['Address'].tolist()
# Drop missing values (NaN) so only real URLs are compared
cleanedList = [x for x in page_url if pd.notna(x)]
print(len(cleanedList))
google_url = result['Google-Selected Canonical'].tolist()
cleanedList2 = [x for x in google_url if pd.notna(x)]
print(len(cleanedList2))

Next up, we’re going to leverage PolyFuzz to measure the similarity between the two lists. The script first creates an instance of a TFIDF (term frequency-inverse document frequency) model and then uses the match() method to compare both lists.

Finally, the get_matches() method returns a data frame with the similarity score between the Address and the Google-selected canonical columns.

# TF-IDF on character 3-grams; min_similarity=0.95 keeps only near-identical URL pairs
tfidf = TFIDF(n_gram_range=(3,3), min_similarity=0.95, cosine_method='knn')
model = PolyFuzz(tfidf)
# Match every page URL against the list of Google-selected canonicals
model.match(cleanedList, cleanedList2)
similarity = model.get_matches()
google canonical understanding
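
Note that PolyFuzz matches each URL against the whole list of Google-selected canonicals rather than row by row. As a complement, a stricter row-level check could compare each Address with its own Google-selected canonical after light normalisation. The sketch below uses only the standard library; the helper names `normalise` and `google_agrees`, and the choice to ignore letter case in the host and a trailing slash, are assumptions you may want to change.

```python
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL to a comparable tuple (hypothetical normalisation rules)."""
    parts = urlsplit(url.strip())
    # Assumption: a trailing slash does not make a different page
    path = parts.path.rstrip("/") or "/"
    return (parts.scheme.lower(), parts.netloc.lower(), path, parts.query)


def google_agrees(address, google_canonical):
    """True when Google selected the page itself as canonical."""
    # Missing values from pandas arrive as NaN (a float), not as a string
    if not isinstance(google_canonical, str) or not google_canonical:
        return False
    return normalise(address) == normalise(google_canonical)


# Made-up URLs for demonstration
print(google_agrees("https://Example.com/shoes/", "https://example.com/shoes"))
print(google_agrees("https://example.com/shoes?p=2", "https://example.com/shoes"))
print(google_agrees("https://example.com/shoes", float("nan")))
```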

Next, we create a new data frame from the variable “similarity”, sort it in descending order, and rename the columns to ‘From’, ‘To’, ‘Self-Referencing Canonical’.

We also convert the values in the Self-Referencing Canonical column into booleans, so the data frame answers the question at hand: did Google select the page itself as canonical?

outcome = pd.DataFrame(similarity)
outcome.sort_values('Similarity', ascending=False, inplace=True)
cols = ['From','To','Self-Referencing Canonical']
outcome.columns = cols
# True only when the rounded similarity reaches 1.0, i.e. the two URLs are identical
outcome['Self-Referencing Canonical'] = outcome['Self-Referencing Canonical'].round(2) >= 1.0

outcome

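After performing a bit of data cleaning, we can produce the final dataset.

# Join on the URL rather than on row position: outcome was re-sorted
# and rows with missing URLs were dropped before matching
flags = outcome.drop_duplicates('From')[['From', 'Self-Referencing Canonical']]
final = result.merge(flags, left_on='Address', right_on='From', how='left').drop(columns=['From'])
final.to_excel('canonicalisation_audit.xlsx', index=False)
final.head()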
canonical issues final dataframe

Then we identify how many URLs have a User-Declared Canonical, using the md() helper built on display_markdown:

md(f"## URLs that have User-Declared Canonical: {final['User-Declared Canonical'].notna().sum():,} ({final['User-Declared Canonical'].notna().mean():.1%})")

>> URLs that have User-Declared Canonical: 24 (2.6%)

And ultimately, we find out how many URLs have a Google-Selected Canonical:

md(f"## URLs that have Google-Selected Canonical: {final['Google-Selected Canonical'].notna().sum():,} ({final['Google-Selected Canonical'].notna().mean():.1%})")

>> URLs that have Google-Selected Canonical: 788 (86.4%)

Conclusion

As anticipated, this framework is limited to auditing canonical issues. Its aim is to give your clients and web developers an overview of how both the eCommerce site and Google handle canonicalization across the website.

Although it is designed for auditing, nothing prevents you from extending it into more mature analytics. In fact, you can play around with this script using functions such as GroupBy and pivot tables to dig beneath the surface of the audit.
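
As one possible starting point for the pivot tables mentioned here, the sketch below cross-tabulates issues against page templates with pandas' `crosstab`. The `audit` data frame is made up; in the notebook you would use a data frame that carries both an 'Issue' and a 'Page Template' column, which is an assumption about how you structured your data.

```python
import pandas as pd

# Hypothetical audit rows, one per URL
audit = pd.DataFrame({
    "Page Template": ["PDP", "PDP", "PLP", "FAQ"],
    "Issue": ["Missing Canonical", "Canonicalised", "Canonicalised", "Missing Canonical"],
})

# Rows are templates, columns are issues, cells are URL counts
pivot = pd.crosstab(audit["Page Template"], audit["Issue"])
print(pivot)
```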

FAQ

The following FAQ section provides a recap of the most salient parts of this post.

This section was first generated with the aid of large language models such as ChatGPT-3 and manually fact-checked prior to being published.

What is canonicalization and why is it important?

Canonicalization is the process of specifying the preferred version of a web page to search engines. This helps prevent duplicate content and improves crawl efficacy, that is, how quickly your content gets crawled and indexed. The main reason for devising consistent canonicalization across your website is to prevent Google from wasting time seeking the best version of a page to show on the SERP.

What are the most common issues with canonicalization?

Some common issues with canonicalization include missing canonicals, canonicalized pages that are not the preferred version, and incorrect implementation of canonicals. These issues can lead to confusion for search engines and can negatively impact the visibility of your website on the SERP.

What is the best tool to use for an automated canonical audit?

One popular tool for an automated canonical audit is Screaming Frog. It allows you to run a crawl and export the findings, which can then be processed using a machine learning environment such as Google Colab.

How can I prevent duplicate content on my website?

To prevent duplicate content on your website, you should implement consistent canonicalization across all pages. This means specifying the preferred version of a page to search engines so that they know which version to index and show on the SERP.

Can I use multiple web crawlers for my canonical audit?

It is not recommended to combine multiple web crawlers for your canonical audit, as each one parses data in its own way and relies on its own algorithms. Combining crawl insights from different crawlers may increase the amount of data at your disposal, but it will also affect accuracy, as you’re drifting away from a standardized measurement. It is best to stick to one web crawler for your audit.
