🔗 Broken Backlinks Audit on 4xx Pages with Python

Reading time: 7 minutes

Marketers and SEOs sometimes underestimate the different data-parsing processes that tools carry out behind the scenes.

Data is processed with several different methodologies.

One of the first things to note is that third-party tools (e.g. Screaming Frog) read information in real time and don’t store historical data. This means they normally avoid returning cached information, prioritizing what currently sits on a website.

On the other hand, first-party tools such as Google Search Console both read information in real time and store historical data. This means you can find cached information about your web pages and access it whenever you like.

Most of the time, blending different sources of data can wreak havoc on your SEO analysis, and it’s arguably one of the most common mistakes you could make.

SEO is a nuanced world. While you should avoid comparing apples to pears and strive for data consistency, blending data from different sources can sometimes help you zoom in on a subject of analysis.

Assuming you have more than a good reason to include another tool’s data in your SEO analysis, read on.

This post breaks down a step-by-step tutorial to identify broken backlinks lingering on pages with errors.

💡Tip

If you’re in a rush, you can jump straight to this Google Colab script, which runs the whole process for you.

Data Blending to look for Error Pages with Broken Backlinks

Simulating a website crawl with a paid tool is very handy but comes with a few pitfalls.

When you use a third-party crawler, you only learn about the proportion of pages with errors at the moment the crawler is instructed to execute its task.

By contrast, tools such as Google Search Console record the full history of pages with errors since the dawn of your website.

The truth is that an artificial web crawler cannot fully mimic human behavior and may miss some errors that a real user would encounter. Additionally, the crawler’s output may not be completely accurate and should be carefully evaluated against other sources to ensure its reliability.

That’s where data blending may come in handy, if used with the right intentions.

Combining the results from the crawler with those from Google Search Console can provide a more comprehensive picture of the website and its potential issues.

Catching up on historic broken backlinks that linger on pages with errors is just one example.

💡 BONUS

To streamline data blending in Excel, you would typically use the VLOOKUP function. With Python, you can perform the same lookup on much larger datasets using a pandas merge. You can find inspiration in the guide on auditing orphan pages.
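For a quick taste of the equivalence, here is a minimal sketch (with made-up example data) of how a pandas merge replicates a VLOOKUP:

import pandas as pd

# Made-up example data: two tables sharing a 'URL' key
errors = pd.DataFrame({'URL': ['https://example.com/a', 'https://example.com/b']})
links = pd.DataFrame({'URL': ['https://example.com/a'], 'External links': [12]})

# how='left' keeps every row from `errors`; unmatched rows get NaN, much like #N/A in Excel
print(errors.merge(links, on='URL', how='left'))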

Identify Error Pages with Broken Backlinks

Now that we know how valuable this check can be for juicy SEO optimizations, let’s break down the process:


  • 1️⃣ Head to Google Search Console and select your website’s root domain property

    [Screenshot: select the root domain on Google Search Console]
  • 2️⃣ Head to Pages and open the “Not Found (404)” report

    [Screenshot: index coverage report on Google Search Console]
  • 3️⃣ Export the list of 404 pages from the toggle on the right-hand side

    [Screenshot: export the list of 404s from Google Search Console]
  • 4️⃣ Head to Ahrefs and export the backlinks report

    [Screenshot: Ahrefs backlinks report]
  • 5️⃣ Open the list of 404 pages retrieved from Google Search Console and create a new column next to “Last crawled” to leave some room for the Vlookup

    [Screenshot: pages with errors in Excel from Search Console]
  • 6️⃣ Open the list of pages with backlinks exported from Ahrefs.

    We are only interested in the columns revealing information about the pages the backlinks point to.

    Make sure to rename “Target URL” to “URL” before running the list of URLs through the following Vlookup.

  • 7️⃣ Run a Vlookup in the cells where we previously left room, searching for each page with errors in the exported list of pages with backlinks. Something like =VLOOKUP(A2,Backlinks!A:A,1,FALSE) does the trick, assuming the Ahrefs export sits in a sheet named “Backlinks”: it returns the URL when a match exists and #N/A otherwise.

Then, assess the result:

  • If the Vlookup operation returns matches, then historical 4xx pages do carry broken backlinks. In that case, I recommend setting up redirects to prevent those backlinks from being lost.
  • If the Vlookup operation returns no matches, then historical 4xx pages carry no broken backlinks. In that case, just move on.

You should end up with a similar structure:

[Screenshot: Vlookup output]

Operate a Vlookup with Pandas

Below, I share an alternative to the tedious Vlookup process in Excel.

First, let’s import the Google Search Console dataset full of pages with errors.

We can safely drop the “Last crawled” column as it won’t add useful information to our analysis. Feel free to uncomment the drop line in the script.

import pandas as pd

# Load the GSC export of 404 pages (the data sits in a sheet named 'Table')
gsc = pd.read_excel('/content/gsc 404.xlsx', sheet_name='Table')
df = pd.DataFrame(gsc, columns=['URL', 'Last crawled'])
# OPTIONAL - remove the 'Last crawled' column
# df.drop('Last crawled', axis=1, inplace=True)
df.head()

💡Tip

Make sure the column labels in your export don’t contain stray spaces, otherwise pandas may fail to match them.
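If you’d rather fix this in pandas than edit the file by hand, a one-liner (a small addition on top of the script above) strips the whitespace for you:

# Strip leading/trailing whitespace from column labels so lookups don't fail
df.columns = df.columns.str.strip()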

Next, we load the Ahrefs backlinks report.

# Load the Ahrefs export of pages with backlinks
df2 = pd.read_excel('/content/ahrefs backlinks.xlsx')
df2.head()
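If you skipped renaming “Target URL” in the spreadsheet (step 6️⃣ above), you can do it here instead so both dataframes share the same key:

# Rename the Ahrefs column so both dataframes share a 'URL' key for the merge
df2 = df2.rename(columns={'Target URL': 'URL'})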

So we’re ready to compute the Vlookup. By default, merge performs an inner join, so the result keeps only the 404 pages that also appear in the backlinks report, which is exactly the set of matches we’re after.

# Inner join on 'URL': keep only the 404 pages that also carry backlinks
backlinks = df.merge(df2, on='URL')
backlinks
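If you also want to replicate the “no match” branch of the Excel check, one option (a sketch, not part of the original workflow) is a left join with an indicator column:

# indicator=True adds a '_merge' column: 'both' = 404 page has backlinks, 'left_only' = no match
check = df.merge(df2, on='URL', how='left', indicator=True)
print(check['_merge'].value_counts())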

After that, we might need a bit of data cleaning to remove potential duplicates. This step is not mandatory; whether it’s worth it depends entirely on the size of the output generated by the vlookup.

# Keep only the first occurrence of each URL
output = backlinks.drop_duplicates('URL')
output

Once the output is sorted by the number of backlinks, we can save and print the dataframe. Note that a bit of styling has been added to make it more readable; the “External links”, “Page traffic” and “Anchor” columns come straight from the Ahrefs export.

# Sort by backlink count and keep only the columns we care about
x = output.sort_values(['External links'], ascending=False).reindex(columns=['URL', 'External links', 'Page traffic', 'Anchor'])
x.to_excel('404 pages with broken links.xlsx', index=False)
# background_gradient() renders a color scale when displayed in a notebook such as Colab
x.style.background_gradient()

You now have a handy table listing 404 pages with broken backlinks.

In my case, they all turned out to be old service pages and permanently discontinued product pages housing alive-and-kicking backlinks.

Use Data Blending to Identify Broken Backlinks on 404 Pages

When working with data from multiple sources, it is important to have a thorough understanding of the data to avoid potential pitfalls. Data blending won’t hurt your SEO analysis, provided you master the data sources you’re going to handle.

One area where data blending can prove beneficial is in identifying broken backlinks using information from Google Search Console. By piecing together pages with errors, you can quickly identify and address any issues that may be negatively impacting your website’s performance.

In this scenario, you should redirect broken URLs either to their parent category or to the most relevant destination to ensure a seamless user experience and maintain the integrity of your website’s overall structure.
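As a minimal sketch of that last step (assuming the Excel output produced above and a naive, purely illustrative parent-category heuristic), you could even generate the redirect rules programmatically:

import pandas as pd
from urllib.parse import urlparse

# Load the 404-pages-with-backlinks report produced earlier
broken = pd.read_excel('404 pages with broken links.xlsx')

# Naive parent-category heuristic (an assumption; adjust to your site structure):
# strip the last path segment, e.g. /services/old-page -> /services/
def parent_category(url: str) -> str:
    path = urlparse(url).path.rstrip('/')
    return path.rsplit('/', 1)[0] + '/' if '/' in path else '/'

# Emit one 301 redirect rule per broken URL (nginx-style, purely illustrative)
for url in broken['URL']:
    print(f"rewrite ^{urlparse(url).path}$ {parent_category(url)} permanent;")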
