📏 How to Measure Crawl Efficacy in Python

January 31, 2026

As a marketer, should you be more focused on controlling the number of products you can market or the amount of time it takes to reach your customers?

As of today, the trade-off in digital marketing is between quality and quantity. This is where an underrated old marketing KPI comes into play: Time-To-Market (TTM).

This KPI measures the time it takes for a product to become available to the end user and represents a powerful source of competitive advantage.

In SEO, TTM is largely affected by crawl and rendering speed: the more crawlable a site is and the fewer render-blocking assets it has, the quicker its pages become eligible for indexing.

TTM is important in both the news industry and eCommerce. Whether adding a new item or making on-page SEO changes, the faster the update to a product page, the quicker you’ll see the benefits of optimisation.

In this post, I will provide you with a small Python script designed to calculate crawl efficacy in minutes for your target site.

Understanding Crawl Budget

Many SEO practitioners misunderstand crawl budget, often focusing on the total number of crawl requests instead of the quality and value of the pages being crawled.

Metrics like Total Crawl Requests in Google Search Console can be misleading: more requests per day don’t automatically lead to faster indexing of important pages. Instead, they can increase server load and costs without providing real SEO benefits.

The key is to focus on crawl efficacy—the time between when a page is published or submitted and when Google actually crawls it. Shorter times indicate an efficient crawl process; longer times suggest your site may have crawling issues.

Crawl Budget Explained

📊 Crawl Budget has two components:

⏱️ Crawl Rate: mainly a concern for very large websites.

🎯 Crawl Demand: the main constraint for small & mid-sized sites.

🔍 What drives Crawl Demand?

🆕 New URLs

Pages Google hasn’t crawled before increase demand due to freshness signals.

🔄 Content Changes

Frequently updated pages are crawled more often to stay current.

⭐ Quality & Popularity

Important, well-linked pages are prioritised and refreshed more often.

🧪 How to audit Crawl Demand (the right way)

❌ Don’t focus on total crawl requests alone. More crawls per day do not guarantee faster indexation and often just increase server load and costs.

✅ Focus on Crawl Efficacy: measure the time between when a page is published or submitted and when Google starts crawling it.

What does it mean? The longer the gap between a page being published and Google’s first fetch, the more hurdles are preventing Googlebot from discovering new pages efficiently.

Google’s crawling process is now guided by predictive machine learning.

Google uses aggressive caching and throttling to crawl efficiently while protecting the servers it accesses. Sites with regularly updated content are more likely to get recrawled, while stale pages may be deprioritised.

For example, if a website is cluttered with UTM-tracking links or other low-value internal URLs, Googlebot recognises that these pages have little PageRank to pass along and will often skip crawling them altogether.

This can lead to a long list of “Discovered – currently not indexed” pages in Google Search Console.

This is not an issue so much as a missed opportunity: SEO performance is being traded away for the convenience of web analytics tracking.
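As a rough illustration of the clean-up involved, the sketch below strips tracking parameters from a list of internal links so that each page is linked through a single URL. The `strip_utm` helper and the example URLs are hypothetical, and the sketch assumes that tracking parameters are the ones whose names start with `utm_`.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_utm(url: str) -> str:
    """Return the URL without query parameters whose name starts with 'utm_'."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    # Rebuild the URL with the remaining parameters only
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Hypothetical internal links found on a site
links = [
    "https://example.com/shoes?utm_source=newsletter&utm_medium=email",
    "https://example.com/shoes?colour=red&utm_campaign=sale",
    "https://example.com/shoes",
]

# Three tracked links collapse into two distinct crawlable URLs
cleaned = sorted({strip_utm(link) for link in links})
print(cleaned)
# ['https://example.com/shoes', 'https://example.com/shoes?colour=red']
```

Comparing the number of raw links with the number of cleaned URLs gives a quick sense of how many crawl paths exist only because of tracking parameters.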

Requirements

To run this Python script, you need to upload a couple of files beforehand.

I recommend using Google Colab for this purpose, as it is a great ready-made solution for beginners. Most of the required libraries are installed by default, so there is very little to set up.

The files to upload are:

A crawl of the XML sitemap from your target site, saved as an XLSX file with “Address” and “last mod” as columns. You can find a tutorial to scrape an XML sitemap with Python.

You don’t need to go through every step of the tutorial, as the only requirement here is to extract the <lastmod> value of each URL. You can scrape it as soon as you’ve installed the required dependencies.

Once you’ve got your list of URLs and their XML attributes, I suggest you trim down the dataset so that it keeps only the “Address” and “last mod” columns.

A Screaming Frog crawl with the Google Search Console API enabled. Export the search_console_all report as an XLSX file and feel free to choose the parameters you prefer to keep in the dataset.

The only requirement here is that you keep the “Address” and “Last Crawl” columns.

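For example, you may want to keep the following columns from the Screaming Frog crawl with the Google Search Console API enabled: Address, Status Code, Title 1, Clicks, Impressions, CTR, Summary, Coverage and Last Crawl.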
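For the first file, the <lastmod> values can be pulled out of a sitemap with Python’s standard library alone. This is only a minimal sketch, not the method from the tutorial mentioned above: the `SITEMAP_XML` string, the `sitemap_rows` helper and the example URLs are hypothetical, and a real sitemap would be downloaded first.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; in practice this is the downloaded sitemap.xml
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-a</loc>
    <lastmod>2025-08-14T09:30:00+00:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/page-b</loc>
  </url>
</urlset>"""

# Namespace used by the sitemap protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_rows(xml_text: str) -> list:
    """Return one dict per <url> entry with its address and <lastmod> value."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        # <lastmod> is optional, so it may be missing for some URLs
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        rows.append({"Address": loc, "last mod": lastmod})
    return rows

rows = sitemap_rows(SITEMAP_XML)
for row in rows:
    print(row)
```

The keys are named “Address” and “last mod” so that the rows can be turned into a data frame with the same columns as the XLSX file described above.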

How to Measure Crawl Efficacy?

You can measure crawl efficacy by comparing the <lastmod> value from an XML sitemap with the last crawl date retrievable from the GSC API.

Crawl Efficacy = LastMod date - Last Crawl date

Remember, the lower the difference, the better the crawl efficacy.
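To make the formula concrete, here is a small worked example for a single URL. The two dates are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical dates for one URL
last_mod = datetime(2025, 8, 10, 9, 0)     # <lastmod> value in the XML sitemap
last_crawl = datetime(2025, 8, 12, 15, 0)  # Last Crawl reported by Search Console

# Crawl Efficacy = LastMod date - Last Crawl date
crawl_efficacy = last_mod - last_crawl
hours = crawl_efficacy / timedelta(hours=1)

print(crawl_efficacy)  # -3 days, 18:00:00
print(hours)           # -54.0
```

With this formula, a negative value means Google’s last crawl happened after the last modification (here, 54 hours later). A positive value means the page has been modified since Google last crawled it.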

Import the Crawl with Search Console Parameters

Once we have the required files uploaded, we are going to import the Screaming Frog crawl into our environment so we can create a Pandas data frame and clean up messy data.

Once the dataset is created, we convert the Status Code column into integers so that response codes are stored and displayed as whole numbers (for example, 200 rather than 200.0).

import pandas as pd
Last_Crawl = pd.read_excel('/content/SF crawl Last Crawl GSC.xlsx')
df = pd.DataFrame(Last_Crawl, columns=['Address','Status Code', 'Title 1','Clicks', 'Impressions', 'CTR', 'Summary', 'Coverage', 'Last Crawl'])
#convert the Status Code column into integers (200 rather than 200.0)
#missing codes are set to 0 first, because NaN cannot be cast to int
df['Status Code'] = df['Status Code'].fillna(0).astype(int)
 

If we don’t make any changes, we will end up with a messy dataset that contains some missing observations (NaN values). These appear wherever the crawl or the Search Console API returned nothing for a given URL and field.

In fact, missing data is similar to URLs that return 301 HTTP response codes, indicating that they haven’t passed authority to their destination yet.

Regardless, we ultimately convert these neutral values to 0 to safely exclude them from the data processing.

#replace the NaN values cluttering the dataset with 0
df = df.fillna(0)
#sanity check: every column should now report 0 missing values
print(df.isnull().sum())
df
 

Import Last Mod from an XML Sitemap Crawl

Next up, we are going to proceed with the XML sitemap crawl file.

If you followed my advice on reshaping the dataset, you will have nothing to do here but let Pandas read the file and set up your data frame.

Last_mod = pd.read_excel('/content/XML sitemap last mod.xlsx')
df2 = pd.DataFrame(Last_mod, columns=['Address','last mod'])
df2
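
Before the two data frames are combined, it can help to check how well their “Address” columns line up. The sketch below is only an illustration with made-up data: `crawl` and `sitemap` are hypothetical stand-ins for df and df2.

```python
import pandas as pd

# Hypothetical stand-ins for df (Screaming Frog crawl) and df2 (sitemap crawl)
crawl = pd.DataFrame({
    "Address": ["https://example.com/a", "https://example.com/b"],
    "Last Crawl": ["2025-08-12 15:00:00", "2025-08-01 08:00:00"],
})
sitemap = pd.DataFrame({
    "Address": ["https://example.com/a", "https://example.com/c"],
    "last mod": ["2025-08-10", "2025-08-11"],
})

# An outer join with indicator=True adds a _merge column whose values are
# 'both', 'left_only' (crawl only) or 'right_only' (sitemap only)
check = crawl.merge(sitemap, on="Address", how="outer", indicator=True)
print(check["_merge"].value_counts())

# URLs listed in the sitemap but absent from the crawl export
missing_from_crawl = check.loc[check["_merge"] == "right_only", "Address"].tolist()
print(missing_from_crawl)  # ['https://example.com/c']
```

Only the URLs marked as “both” have a last modification date and a last crawl date, so they are the only ones for which crawl efficacy can be calculated.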
 

Calculate the Crawl Efficacy

To calculate crawl efficacy, we first join the two data frames on the “Address” column and then create a new column using a lambda function.

Remember the formula:

Last Mod - Last Crawl

We then translate it into a lambda function that is applied to each row.

Next, we’re going to sort the Crawl Efficacy column in descending order to surface the largest gaps between <lastmod> and Last Crawl.

Finally, we can save the data frame as a CSV file so it’s ready to download.

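#join the crawl and the sitemap data frames on the URL
sg = df.merge(df2, on='Address', how='inner')
#parse both columns as dates; 0 placeholders and unparsable values become NaT
for col in ['last mod', 'Last Crawl']:
    sg[col] = pd.to_datetime(sg[col].replace(0, pd.NaT), errors='coerce', utc=True).dt.tz_localize(None)
sg['Crawl Efficacy'] = sg.apply(lambda row: row['last mod'] - row['Last Crawl'], axis=1)
#sort Crawl Efficacy by worst values
final = sg.sort_values('Crawl Efficacy', ascending=False)
#save the data frame
final.to_csv('crawl_efficacy.csv', index=False)
final.head() #remove .head() if you want to view the full results straight away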
 

Here’s an example of what you might get when printing final.head() of the data frame.

As we can see, the Pros & Cons article I wrote back in August is receiving the worst crawling treatment. Assuming there are no technical issues, I suspect there is something related to the content structure that I need to review, such as thin content and low-quality copy.
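On larger sites, reading the raw time differences row by row gets tedious. As a possible follow-up, the values could be converted into days and flagged by sign. The sketch below uses a made-up stand-in for the final data frame, and the “Days” and “Awaiting Recrawl” column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the final data frame produced by the script
final = pd.DataFrame({
    "Address": [
        "https://example.com/pros-and-cons",
        "https://example.com/guide",
        "https://example.com/news",
    ],
    "Crawl Efficacy": pd.to_timedelta(["12 days", "-2 days", "3 hours"]),
})

# Express the gap as a decimal number of days, which is easier to sort and chart
final["Days"] = final["Crawl Efficacy"].dt.total_seconds() / 86400

# A positive gap means the page was modified after Google's last crawl
final["Awaiting Recrawl"] = final["Crawl Efficacy"] > pd.Timedelta(0)

# Share of URLs modified since their last crawl
share = final["Awaiting Recrawl"].mean()
print(final[["Address", "Days", "Awaiting Recrawl"]])
print(f"{share:.0%} of URLs are awaiting a recrawl")
```

A high share of URLs awaiting a recrawl, combined with large values in the “Days” column, points to the pages worth investigating first.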

Conclusion

This handy Python script is designed to offer a more reliable SEO metric for measuring crawl responsiveness and promoting crawling and indexation.

It seems like a promising approach to improving the accuracy of this metric. However, please don’t hesitate to reach out if you have any thoughts or critiques of the approach.

Simone De Palma

Technical SEO Executive

Simone De Palma is a Technical SEO Executive at iProspect UK and the founder of SEO Depths.

He graduated in Marketing and Management from Università IULM before completing a degree in Digital Marketing and Data Science at Leeds Beckett University.
Simone has worked as an SEO Specialist in digital agencies in Italy and the United Kingdom, and he is a contributor to Search Engine Land.

When he’s away from his double screens, he enjoys cooling down with a refreshing swim at the pool. You can also find him exploring art museums or enjoying the company of a classic romance.
