How to Measure Crawl Efficacy to Demystify Crawl Budget

As a marketer, should you be more focused on controlling the number of products you can market or the amount of time it takes to reach your customers?

As of today, the trade-off in digital marketing is between quality and quantity. This is where an underrated, old-school marketing KPI comes into play: Time-To-Market (TTM).

This KPI measures the time it takes for a product to become available to the end user and represents a powerful source of competitive advantage.

In SEO, TTM is largely affected by crawl and rendering speed: the more crawlable a site is, and the freer it is of render-blocking assets, the sooner its pages become eligible for indexing.

TTM is important in both the news industry and eCommerce. Whether you are adding a new item or making on-page SEO changes, the faster a product page update is picked up, the quicker you’ll see the benefits of your optimisation.

In this post, I will provide you with a small Python script designed to calculate crawl efficacy in minutes for your target site.

Understanding Crawl Budget

Many SEO practitioners misunderstand crawl budget, often focusing on the total number of crawl requests instead of the quality and value of the pages being crawled.

Metrics like Total Crawl Requests in Google Search Console can be misleading: more requests per day don’t automatically lead to faster indexing of important pages. Instead, they can increase server load and costs without providing real SEO benefits.

The key is to focus on crawl efficacy: the time between when a page is published or submitted and when Google actually crawls it. Shorter times indicate an efficient crawl process; longer times suggest your site may have crawling issues.

Crawl Budget Explained

Crawl budget is the combination of two components: crawl rate, which is mainly a concern for very large websites, and crawl demand, which is the main constraint for small and mid-sized sites.

What drives Crawl Demand?

  • New URLs: pages Google hasn’t crawled before increase demand through freshness signals.

  • Content changes: frequently updated pages are crawled more often to stay current.

  • Quality and popularity: important, well-linked pages are prioritised and refreshed more often.

How to audit Crawl Demand (the right way)

  • Don’t focus on total crawl requests alone. More crawls per day do not guarantee faster indexation and often just increase server load and costs.

  • Do focus on crawl efficacy: the time between when a page is published or submitted and when Google starts crawling it.

What does this mean? The longer the gap between when a page is published and Google’s first fetch, the more hurdles are preventing Googlebot from finding an efficient path to frequent page discovery.

Google’s crawling process is now guided by predictive machine learning.

Google uses aggressive caching and throttling to crawl efficiently while protecting the servers it accesses. Sites with regularly updated content are more likely to get recrawled, while stale pages may be deprioritised.

For example, if a website is cluttered with UTM-tracking links or other low-value internal URLs, Googlebot recognises that these pages have little PageRank to pass along and will often skip crawling them altogether.

This can lead to a long list of “Discovered – currently not indexed” pages in Google Search Console.

This is not a bug but a missed opportunity: it stems from a trade-off that sacrifices SEO in favour of convenient web analytics tracking.
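As a quick sanity check, you can scan a crawl export for parameterised internal URLs before blaming crawl budget. Here is a minimal sketch, assuming a generic crawl export with an “Address” column; the file name and the parameter list are just illustrative:

import pandas as pd

# illustrative file name: any crawl export with an "Address" column will do
crawl = pd.read_excel('/content/internal_all.xlsx')

# flag URLs carrying common tracking parameters (adjust the list to your setup)
tracking_params = ['utm_source', 'utm_medium', 'utm_campaign', 'gclid']
mask = crawl['Address'].str.contains('|'.join(tracking_params), case=False, na=False)

print(f"{mask.sum()} of {len(crawl)} internal URLs carry tracking parameters")
crawl.loc[mask, 'Address'].head(10)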

Requirements

To run this Python script, you need to upload a couple of files beforehand.

I recommend using Google Colab for this purpose, as it is a great ready-made solution for beginners. Most of the required libraries are installed by default, and you can take advantage of a powerful GPU to speed up script execution.
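If you are working in Colab, the quickest way to get local files into the session is the built-in upload helper; a minimal sketch (the uploaded files simply land in /content/):

# run this cell in Google Colab and pick the XLSX files from your machine
from google.colab import files

uploaded = files.upload()  # uploaded files are saved to /content/
print(list(uploaded.keys()))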

The files to upload are:

  • A crawl of the XML sitemap of your target site, saved as an XLSX file with “Address” and “last mod” as columns. You can find a tutorial to scrape an XML sitemap with Python (there is also a short sketch after this list).

    You don’t need to go through every step of the tutorial, as the only requirement here is to extract the <lastmod> element. You can scrape it as soon as you’ve installed the required dependencies.

    Once you’ve got your list of URLs and their XML attributes, I suggest you trim down the dataset so it resembles something like this:
Last mod XML sitemap scraping
  • A Screaming Frog crawl with the Google Search Console API enabled. Export search_console_all as an XLSX file and feel free to choose the parameters you prefer to keep in the dataset.

    The only requirement here is that you don’t drop the “Address” and “Last Crawl” columns.

    Here’s an example of what you may want to keep from the Screaming Frog crawl with the Google Search Console API enabled.
Screaming Frog crawl with the Google Search Console API enabled, in an Excel spreadsheet
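If you’d rather skip the tutorial, here is a minimal sketch of the sitemap step, assuming a standard single XML sitemap (not a sitemap index); the sitemap URL and the output file name are placeholders:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# placeholder URL: swap in your target site's XML sitemap
sitemap_url = 'https://www.example.com/sitemap.xml'

xml = requests.get(sitemap_url).text
soup = BeautifulSoup(xml, 'xml')  # the 'xml' parser requires lxml

# pull each <loc> and its <lastmod>, if present
rows = []
for url in soup.find_all('url'):
    loc = url.find('loc').text
    lastmod = url.find('lastmod')
    rows.append({'Address': loc, 'last mod': lastmod.text if lastmod else None})

pd.DataFrame(rows).to_excel('XML sitemap last mod.xlsx', index=False)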

How to Measure Crawl Efficacy?

You can measure crawl efficacy by comparing the <lastmod> element from an XML sitemap with the last crawl date retrievable from the GSC API.

Crawl Efficacy = LastMod date - Last Crawl date 

Remember, the lower the difference, the better the crawl efficacy.
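As a tiny worked example of what that subtraction yields (the dates are made up):

import pandas as pd

last_mod = pd.Timestamp('2023-01-10')    # date the page was last modified
last_crawl = pd.Timestamp('2023-01-03')  # date Google last fetched it

print(last_mod - last_crawl)  # 7 days 00:00:00 -> the update has waited a week for a recrawl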

Import the Crawl with Search Console Parameters

Once we have the required files uploaded, we are going to import the Screaming Frog crawl into our environment so we can create a Pandas data frame and clean up messy data.

Once the data frame is created, we convert the Status Code column into integers so that response codes are stored as whole numbers rather than floats, which makes the data easier to validate.

import pandas as pd

# read the Screaming Frog export and keep only the columns we need
Last_Crawl = pd.read_excel('/content/SF crawl Last Crawl GSC.xlsx')
df = pd.DataFrame(Last_Crawl, columns=['Address', 'Status Code', 'Title 1', 'Clicks', 'Impressions', 'CTR', 'Summary', 'Coverage', 'Last Crawl'])

# convert the Status Code column to integers so response codes display as whole numbers
df['Status Code'] = df['Status Code'].astype(int)

Left untouched, the data frame will contain missing observations (NaN values), because not every crawled URL has data for every Search Console column.

Typically, the missing values belong to URLs such as 301 redirects, which have no search performance data of their own.

To keep the processing simple, we convert these missing values to 0 so they can be safely excluded from the calculations.

# replace NaN values with 0 so they don't skew the data frame
df = df.fillna(0)
# confirm no missing values remain
print(df.isnull().sum())
df
Screaming Frog crawl with Search Console API parameters
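Optionally, you may also want to restrict the frame to pages that respond with a 200 status code, so redirects and errors don’t distort the comparison later. This is just a suggestion, not part of the original workflow:

# optional: keep only URLs responding with a 200 status code
df = df[df['Status Code'] == 200].copy()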

Import Last Mod from an XML Sitemap Crawl

Next up, we are going to proceed with the XML sitemap crawl file.

If you followed my advice on reshaping the dataset, you will have nothing to do here except let Pandas read the file and set up the data frame.

# read the sitemap export and keep the URL and its <lastmod> date
Last_mod = pd.read_excel('/content/XML sitemap last mod.xlsx')
df2 = pd.DataFrame(Last_mod, columns=['Address', 'last mod'])
df2
XML Sitemap crawl with last mod attribute
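Before joining the two data frames in the next step, it can also help to normalise the “Address” values so the same URL is written identically in both exports. A small optional sketch, assuming trailing slashes and stray whitespace are the only mismatches to worry about:

# optional: normalise URLs so the upcoming merge matches reliably
for frame in (df, df2):
    frame['Address'] = frame['Address'].str.strip().str.rstrip('/')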

Calculate the Crawl Efficacy

To calculate crawl efficacy, we first join the two data frames on the “Address” column and then create a new column with a lambda function.

Remember the formula:

Last Mod - Last Crawl

And translate it into a lambda function applied row by row.

Next, we’re going to sort the Crawl Efficacy column in descending order to surface the largest gaps between <lastmod> and Last Crawl.

Finally, we save the data frame to a CSV file so it’s ready to download.

# join the crawl data and the sitemap data on the URL
sg = pd.merge(df, df2, on='Address')
# make sure both columns are parsed as dates before subtracting
sg['last mod'] = pd.to_datetime(sg['last mod'], errors='coerce')
sg['Last Crawl'] = pd.to_datetime(sg['Last Crawl'], errors='coerce')
sg['Crawl Efficacy'] = sg.apply(lambda row: row['last mod'] - row['Last Crawl'], axis=1)
# sort Crawl Efficacy by worst values first
final = sg.sort_values('Crawl Efficacy', ascending=False)
# save the data frame so it can be downloaded
final.to_csv('crawl_efficacy.csv', index=False)
final.head()  # remove .head() if you want to view the full results straight away

Here’s an example of what you might get when printing final.head() of the data frame:

Crawl Efficacy Calculation

As we can see, the Pros & Cons article I wrote back in August is receiving the worst crawling treatment. Assuming there are no technical issues, I suspect there is something related to the content structure that I need to review, such as thin content and low-quality copy.
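If the raw timedelta is hard to scan, a small optional tweak (assuming the Crawl Efficacy column holds pandas Timedelta values) expresses it in whole days:

# optional: express crawl efficacy in whole days for easier reading
final['Crawl Efficacy (days)'] = final['Crawl Efficacy'].dt.days
final[['Address', 'Crawl Efficacy (days)']].head()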

Conclusion

This handy Python script is designed to offer a more reliable SEO metric for measuring crawl responsiveness and for encouraging healthier crawling and indexation.

It seems like a promising approach to improving the accuracy of this metric. However, please don’t hesitate to reach out if you have any thoughts or critiques of the approach.
