Broken Backlinks Audit on 4xx Pages with Python

Reading time: 7 Minutes

Marketers and SEOs sometimes underestimate how differently tools parse and process data.

Each tool processes data with its own methodology.

One of the first things to note is that third-party tools (e.g. Screaming Frog) read information in real time and do not store historic data. This means they normally avoid returning cached information and prioritize what currently sits on a website.

On the other hand, first-party tools such as Google Search Console both read information in real time and store historic data. This means you can find cached information about your web pages and access it whenever you like.

Most of the time, blending data from different sources can wreak havoc on your SEO analysis, and it’s arguably one of the most common errors you can make.

SEO is a nuanced world. While you should generally avoid comparing apples to pears and strive for data consistency, sometimes blending data from different sources can help you zoom in on a subject of analysis.

Assuming you have a good reason to include another tool’s data processing in your SEO analysis, then the following post is for you.

It breaks down a step-by-step tutorial to identify broken backlinks pointing to pages with errors.

💡Tip

If you’re in a rush, you can jump to this Google Colab script, which walks through the whole process for you.

Data Blending to look for Error Pages with Broken Backlinks

Simulating a website crawl with a paid tool is incredibly handy but comes with a few pitfalls.

When you use a third-party crawler, you only learn about the proportion of pages with errors at the moment the crawl is executed.

By contrast, tools such as Google Search Console record the full history of pages with errors since the dawn of your website.

The truth is the artificial web crawler may not be able to mimic human behavior and may miss out on some errors that a real user might encounter. Additionally, the data outputs from the crawler may not be completely accurate and should be carefully evaluated and compared to other sources to ensure their reliability.

That’s where data blending may come in handy if used with suitable intentions.

Combining the results from the crawler with those from Google Search Console can provide a more comprehensive picture of the website and its potential issues.

And catching historic broken backlinks that linger on pages with errors is just one example.

Identify Error Pages with Broken Backlinks

Now that we know how valuable this check can be for juicy SEO optimizations, let’s break down the steps of the process.


⚠️ Notice

Please note the screenshots used in this tutorial may contain obfuscated parts. This measure was taken in compliance with confidentiality agreements with the owners of the websites under analysis.


  • Head to Google Search Console and select the root domain from your website’s property

    Select the root domain on Google Search Console
  • Head to the Index Coverage Report and open the “Not Found (404)” report

    Index coverage report on Google Search Console
  • Export the list of 404 pages from the right-hand side toggle

    export list of 404 from Google Search Console
  • Export the backlinks report from Ahrefs

    ahrefs backlink report
  • Open the list of 404 pages retrieved from Google Search Console and create a new column next to “Last crawled” to save some space for the Vlookup

    Pages with errors in excel from search console
  • Open the list of pages with backlinks exported with a third-party tool (e.g. Ahrefs)

    Pages with backlinks exported from Ahrefs
  • Run a traditional Vlookup in the cells we previously set aside, searching for the pages with errors within the exported list of pages with backlinks.

Then, assess the result:

  • If the Vlookup operation returns any matches, then historic 4xx pages come with broken backlinks. If this is the case, recommend redirects to prevent those backlinks from being lost.
  • If the Vlookup operation does not return any matches, then historic 4xx pages do not come with broken backlinks. If this is the case, just move forward.
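The decision above can also be sketched in plain Python. Assuming two hypothetical sets of URLs standing in for the two exports (the URLs below are made up for illustration), a set intersection plays the role of the Vlookup match test:

```python
# Hypothetical URLs standing in for the Search Console 404 export
error_pages = {
    "https://example.com/old-service",
    "https://example.com/discontinued-product",
}

# Hypothetical URLs standing in for the Ahrefs backlinks export
backlinked_pages = {
    "https://example.com/old-service",
    "https://example.com/blog/guide",
}

# 404 pages that still receive backlinks are candidates for redirects
needs_redirect = error_pages & backlinked_pages

if needs_redirect:
    print("Redirects recommended for:", sorted(needs_redirect))
else:
    print("No historic 4xx pages carry backlinks; move on.")
```

An empty intersection corresponds to the second bullet: nothing to fix, just move forward.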

You should end up with a similar structure:

Vlookup output

Operate a Vlookup with Pandas

Below, I share an alternative to the tedious Vlookup process in Excel.

I appreciate that many of you are happy with Excel, but I am aware of how much of a nightmare Vlookup can be.

First, let’s import the Google Search Console dataset full of pages with errors.

import pandas as pd

# Load the "Not Found (404)" export from Google Search Console
gsc = pd.read_excel('/content/gsc 404.xlsx')
df = pd.DataFrame(gsc, columns=['URL', 'Last crawled'])
df.head()

💡Tip

Please make sure the column labels in your spreadsheet contain no stray spaces; otherwise Pandas will not recognize them and the columns will come back empty.
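If you suspect stray spaces, a defensive clean-up avoids the problem altogether. A minimal sketch (the trailing spaces below are deliberate, hypothetical examples):

```python
import pandas as pd

# Simulate an export whose header and URL values carry trailing spaces
gsc = pd.DataFrame({
    "URL ": ["https://example.com/page "],
    "Last crawled": ["2023-01-01"],
})

# Strip whitespace from the column labels and the URL values
gsc.columns = gsc.columns.str.strip()
gsc["URL"] = gsc["URL"].str.strip()
```

After this, selecting `['URL', 'Last crawled']` works regardless of how the spreadsheet was exported.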

Next, we upload the Ahrefs backlinks report.

# Load the backlinks export from Ahrefs
internal = pd.read_excel('/content/ahrefs backlinks.xlsx')
df2 = pd.DataFrame(internal, columns=['URL', 'External links'])
df2.head()

Now we’re ready to compute the Vlookup.

# Inner join on URL: keeps only 404 pages that also appear in the backlinks report
backlinks = df.merge(df2, on='URL')
backlinks
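By default, `merge` performs an inner join, so only 404 pages that also appear in the backlinks report survive. If you also want to see which 404 pages found no match (the second outcome described earlier), a left join with `indicator=True` labels each row. A small sketch on made-up data:

```python
import pandas as pd

# Made-up stand-ins for the two exports
df = pd.DataFrame({"URL": ["/a", "/b"],
                   "Last crawled": ["2023-01-01", "2023-01-02"]})
df2 = pd.DataFrame({"URL": ["/a"], "External links": [12]})

# "_merge" reads "both" for 404 pages with backlinks, "left_only" otherwise
check = df.merge(df2, on="URL", how="left", indicator=True)
matches = check[check["_merge"] == "both"]
```

This keeps the non-matching pages visible instead of silently dropping them, which can be useful for sanity-checking the join.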

After that, we need a bit of data cleaning to remove duplicates.

# Keep only the first occurrence of each URL
output = backlinks.drop_duplicates('URL')
output

And finally, we can print and save the output.

Please note we sorted the output by number of broken links and applied a bit of styling to the data frame.

# Sort by number of external links and reorder the columns
x = output.sort_values(['External links'], ascending=False).reindex(columns=['URL', 'External links', 'Last crawled'])
# Save the report and apply a colour gradient for readability
x.to_excel('404 pages with broken links.xlsx', index=False)
x.style.background_gradient()
Vlookup in Pandas with Python: broken links on pages with errors (example used for demonstration purposes)

And here’s a handy table showing that a page with 4xx errors houses around 200 broken backlinks. As this task had already been archived, I can confirm these were both legacy service pages and permanently discontinued product pages.

However, they still had some backlinks. Hence, the recommendation was to redirect these URLs to either their parent category or the closest relevant destination in a bid to recover the broken backlinks.

Conclusions

The lesson to take from this tutorial is that you should avoid comparing data from different sources unless you sufficiently understand the data you’re handling.

Data blending can still be applied across your SEO analysis, provided you’re knowledgeable enough.

That being said, cross-referencing pages with errors from Google Search Console against a backlinks report can reveal quick fixes you don’t want to miss out on.