How to Measure Crawl Efficacy to Demystify Crawl Budget


As a marketer, should you be more focused on controlling the number of products you can market or the amount of time it takes to reach your customers?

As of today, the trade-off in digital marketing rewards quality over quantity. This is where an underrated old marketing KPI comes into play: Time-To-Market (TTM).

This KPI is aimed at measuring the time it takes for a product to become available to the end user and represents a powerful source of competitive advantage.

In SEO, TTM largely depends on crawl and rendering speed: the more crawlable a site is and the fewer render-blocking assets it carries, the sooner its pages become eligible for indexing.

TTM is important in both the news industry and eCommerce. Whether adding a new item or making on-page SEO changes, the faster the update to a product page, the quicker you’ll see the benefits of optimization.


Still, in SEO there is a common misconception that puts the size of the crawl budget ahead of the actual quality of the pages submitted to search engines.

Total crawl requests, a figure highlighted in the Crawl Stats report in Google Search Console, is one such metric: it is commonly read as the number of requests that could benefit a website.

The total number of crawl requests is a vanity metric though, as more crawls per day don’t necessarily result in faster indexing of important content; they only increase server load and expenses. Instead, the focus should be on quality crawling that provides SEO value.

A recent article from Search Engine Land shed light on the misleading value of crawl budget and offered a new perspective: shifting from a quantitative optimization of the crawl rate to a more qualitative approach.

“Crawl efficacy” measures the time between when a page is published and when search engines start to crawl it.

In other words, crawl efficacy estimates how quickly a newly submitted page gets crawled. The longer it takes, the higher the chances that your site is struggling with crawling issues.

In this post, I will provide you with a small Python script designed to calculate crawl efficacy in minutes for your target site.

🦊 Measure Crawl Efficacy using an XML sitemap and a crawl with GSC API
🦊 Inform your SEO strategy with reviewed tactics that improve indexation

Requirements

To run this Python script, you need to upload a couple of files beforehand.

I recommend using Google Colab for this purpose, as it is a great ready-made solution for beginners. Most of the required libraries are installed by default, so there is very little to set up.

The files to upload are:

  • A crawl of the XML sitemap from your target site, saved as an XLSX file with “Address” and “last mod” as columns. You can find a tutorial to scrape an XML sitemap with Python.

    You don’t need to go through every step of the tutorial, as the only requirement here is to extract the <lastmod> value for each URL. You can scrape it as soon as you’ve installed the required dependencies.

    Once you’ve got your list of URLs and their <lastmod> values, I suggest you trim down the dataset so it looks something like this:
Last mod XML sitemap scraping
  • A Screaming Frog crawl with the Google Search Console API enabled. Export the search_console_all report as an XLSX file and feel free to choose the parameters you prefer to keep in the dataset.

    The only requirement here is that you keep the “Address” and the “Last Crawl” columns.

    Here’s an example of what you may want to keep from the Screaming Frog crawl with the Google Search Console API enabled.
Screaming Frog crawl with the Google Search Console API enabled in an Excel spreadsheet
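
If you would rather build the first file yourself, the sketch below shows one possible way to pull each URL and its <lastmod> value out of a sitemap with the standard library. The function name sitemap_to_frame and the sample sitemap are hypothetical and only there for illustration; the sketch also assumes a plain urlset sitemap rather than a sitemap index.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Namespace defined by the sitemaps.org protocol for standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_frame(xml_text: str) -> pd.DataFrame:
    """Return one row per <url> entry with its <loc> and <lastmod> values."""
    root = ET.fromstring(xml_text)
    rows = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        # <lastmod> is optional in a sitemap, so it may be missing for some URLs.
        rows.append({"Address": loc, "last mod": lastmod.strip() if lastmod else None})
    return pd.DataFrame(rows, columns=["Address", "last mod"])


# Hypothetical sitemap content, used only to demonstrate the function.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2023-08-01</lastmod></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

sitemap_df = sitemap_to_frame(SAMPLE_SITEMAP)
print(sitemap_df)
```

The resulting data frame already uses the “Address” and “last mod” column names expected later on, so it could be saved with to_excel (which needs an Excel writer such as openpyxl) and uploaded as the first file.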

How to Measure Crawl Efficacy?

You can measure crawl efficacy by comparing the <lastmod> value from an XML sitemap with the last crawl date retrievable from the GSC API.

Crawl Efficacy = LastMod date - Last Crawl date 

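Remember, the lower the difference, the better the crawl efficacy. A positive value means the page was modified after its last recorded crawl, so the latest version has not been crawled yet; a negative value means the page was crawled after its last modification.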
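
To make the formula concrete, here is a small worked example with Pandas. The two pages and their dates are made up for illustration: the first page was crawled two days after its last modification, while the second was last crawled a week before it was modified.

```python
import pandas as pd

# Hypothetical dates for two pages, used only to illustrate the formula.
pages = pd.DataFrame({
    "Address": ["https://example.com/a", "https://example.com/b"],
    "last mod": pd.to_datetime(["2023-08-10 09:00", "2023-08-10 09:00"]),
    "Last Crawl": pd.to_datetime(["2023-08-12 09:00", "2023-08-03 09:00"]),
})

# Crawl Efficacy = last mod - Last Crawl
pages["Crawl Efficacy"] = pages["last mod"] - pages["Last Crawl"]
print(pages[["Address", "Crawl Efficacy"]])
```

With this formula the first page ends up with a difference of minus two days and the second with plus seven days, so the second page is the one to worry about.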

Import the Crawl with Search Console Parameters

Once we have the required files uploaded, we are going to import the Screaming Frog crawl into our environment so we can create a Pandas data frame and clean up messy data.

Once the dataset is created, we convert the Status Code column into integers so the codes are stored and displayed as whole numbers.

import pandas as pd

#load the Screaming Frog export and keep only the columns we need
Last_Crawl = pd.read_excel('/content/SF crawl Last Crawl GSC.xlsx')
df = pd.DataFrame(Last_Crawl, columns=['Address', 'Status Code', 'Title 1', 'Clicks', 'Impressions', 'CTR', 'Summary', 'Coverage', 'Last Crawl'])

#convert the Status Code column into integers to show integral values
df['Status Code'] = df['Status Code'].astype(int)

If we don’t make any changes, we will end up with a messy dataset that contains some missing observations (NaN values). These typically appear where the export holds no Search Console data for a URL.

In a sense, missing data is comparable to URLs that return a 301 HTTP response code and haven’t passed authority to their destination yet.

Regardless, we ultimately convert these missing values to 0 so they don’t get in the way of the data processing.

#replace the NaN values cluttering the dataset with 0
df = df.fillna(0)
#check that no missing values remain (every count should be 0)
df.isnull().sum()
df
Screaming Frog crawl with Search Console API parameters
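
As an alternative to a blanket fill, the sketch below cleans each column according to its type. It is not the approach used in this post, only a variation under stated assumptions: the rows are hypothetical stand-ins for the Screaming Frog export, and only a few of its columns are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the Screaming Frog export.
crawl = pd.DataFrame({
    "Address": ["https://example.com/a", "https://example.com/b"],
    "Status Code": [200.0, np.nan],
    "Clicks": [12.0, np.nan],
    "Last Crawl": ["2023-08-12 09:00:00", None],
})

# Nullable integers keep a missing status code as <NA> instead of raising.
crawl["Status Code"] = crawl["Status Code"].astype("Int64")

# Fill only the count-like metric with 0.
crawl["Clicks"] = crawl["Clicks"].fillna(0).astype(int)

# Parse the dates; a missing crawl date becomes NaT rather than 0.
crawl["Last Crawl"] = pd.to_datetime(crawl["Last Crawl"], errors="coerce")
print(crawl.dtypes)
```

Keeping missing crawl dates as NaT means they stay recognisable as missing once dates are subtracted, instead of being read as a date of 0.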

Import Last Mod from an XML Sitemap Crawl

Next up, we are going to proceed with the XML sitemap crawl file.

If you followed my advice on reshaping the dataset, you will have nothing to do here but let Pandas read the file and set up your data frame.

Last_mod = pd.read_excel('/content/XML sitemap last mod.xlsx')
df2 = pd.DataFrame(Last_mod, columns=['Address', 'last mod'])
df2
XML Sitemap crawl with last mod attribute
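
Before comparing dates, it can be worth checking how many URLs appear in both files, since only those can receive a crawl efficacy value. The sketch below uses the indicator option of a Pandas merge for that; the two small data frames are hypothetical stand-ins for the crawl and the sitemap data.

```python
import pandas as pd

# Hypothetical stand-ins for the crawl data frame and the sitemap data frame.
crawl_urls = pd.DataFrame({
    "Address": ["https://example.com/a", "https://example.com/b"],
})
sitemap_urls = pd.DataFrame({
    "Address": ["https://example.com/b", "https://example.com/c"],
    "last mod": ["2023-08-10", "2023-08-11"],
})

# An outer merge keeps every URL; the _merge column records where each one was found.
overlap = crawl_urls.merge(sitemap_urls, on="Address", how="outer", indicator=True)
counts = overlap["_merge"].value_counts()
print(counts)
```

URLs marked left_only were crawled but are missing from the sitemap, while right_only URLs are listed in the sitemap but absent from the crawl. A large number of either may point to mismatched URL formats, such as trailing slashes or protocol differences.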

Calculate the Crawl Efficacy

To calculate crawl efficacy, we need to create a new column using a lambda function.

Remember the formula:

Last Mod - Last Crawl

Then we translate it into a lambda function that is applied to each row.

Next, we’re going to sort the Crawl Efficacy column in descending order to surface the largest gaps between <lastmod> and Last Crawl.

Finally, we can save the data frame as a CSV file so it’s ready to download.

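#join the crawl and the sitemap data on the URL (sg is not defined in the earlier steps)
sg = df.merge(df2, on='Address', how='inner')
#parse both dates as UTC (assumes the crawl dates are in UTC); 0 placeholders and unparseable values become NaT
sg['Last Crawl'] = pd.to_datetime(sg['Last Crawl'].mask(sg['Last Crawl'] == 0), errors='coerce', utc=True)
sg['last mod'] = pd.to_datetime(sg['last mod'], errors='coerce', utc=True)
sg['Crawl Efficacy'] = sg.apply(lambda row: row['last mod'] - row['Last Crawl'], axis=1)
#sort Crawl Efficacy by worst values first
final = sg.sort_values('Crawl Efficacy', ascending=False)
#save the data frame
final.to_csv('crawl_efficacy.csv', index=False)
final.head() #remove .head() if you want to view the full results straight away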

Here’s an example of what you might get when printing final.head() for the data frame:

Crawl Efficacy Calculation

As we can see, the Pros & Cons article I wrote back in August is receiving the worst crawling treatment. Assuming there are no technical issues, I suspect there is something related to the content structure that I need to review, such as thin content and low-quality copy.
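
If you want to chart or filter the results, a time difference is easier to handle as a plain number. The sketch below converts the gap to hours and isolates the pages still waiting for a recrawl; the results data frame and its values are hypothetical stand-ins for the script's output.

```python
import pandas as pd

# Hypothetical results standing in for the output of the script.
results = pd.DataFrame({
    "Address": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
    "Crawl Efficacy": pd.to_timedelta([168, -48, -12], unit="h"),
})

# Express the gap in hours so it can be averaged, charted, or filtered.
results["Crawl Efficacy (hours)"] = results["Crawl Efficacy"].dt.total_seconds() / 3600

# A positive gap means the page changed after its last recorded crawl.
awaiting_recrawl = results[results["Crawl Efficacy (hours)"] > 0]
print(awaiting_recrawl["Address"].tolist())
print(results["Crawl Efficacy (hours)"].median())
```

Dividing by 60 instead of 3600 gives the same figure in minutes.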

Conclusion

This handy Python script is designed to offer a more reliable SEO metric for measuring crawl responsiveness and promoting crawling and indexation.

It seems like a promising approach to improving the accuracy of this metric. However, please don’t hesitate to reach out if you have any thoughts or critiques of the approach.
