As a marketer, should you be more focused on controlling the number of products you can market or the amount of time it takes to reach your customers?
As of today, the trade-off in digital marketing rewards quality over quantity. This is where an underrated old marketing KPI comes into play: Time-To-Market (TTM).
This KPI is aimed at measuring the time it takes for a product to become available to the end user and represents a powerful source of competitive advantage.
In SEO, TTM is largely affected by crawl and rendering speed: the more crawlable a site is, and the freer it is from render-blocking assets, the sooner its pages become eligible for indexing.
TTM is important in both the news industry and eCommerce. Whether adding a new item or making on-page SEO changes, the faster the update to a product page, the quicker you’ll see the benefits of optimization.
Still, in SEO there is a common misconception that prioritizes the size of the crawl budget over the actual quality of the pages submitted to search engines.
Fostered by the Crawl Stats report in Google Search Console, Total Crawl Requests is one such metric: it reflects the raw number of requests, not whether those requests actually benefit the website.
The total number of crawl requests is a vanity metric, though: more crawls per day don’t necessarily result in faster indexing of important content; they only increase server load and expenses. Instead, the focus should be on quality crawling that provides SEO value.
A recent article from Search Engine Land shone a light on the misleading value of crawl budget and offered a new perspective: shifting from a quantitative optimization of the crawl rate to a more qualitative approach.
That approach centers on crawl efficacy, which estimates how quickly a newly submitted page gets crawled. The longer it takes, the higher the chance that your site is struggling with crawling issues.
In this post, I will provide you with a small Python script designed to calculate crawl efficacy in minutes for your target site.
To run this Python script, you need to upload a couple of files beforehand to feed it with data.
I recommend using Google Colab for this purpose, as it is a great ready-made solution for beginners. Most of the required libraries are installed by default, and you can take advantage of a powerful GPU to speed up script execution.
The files to upload are:
- A crawl of the XML sitemap from your target site, saved as an XLSX file with “Address” and “last mod” as columns. You can find a tutorial to scrape an XML sitemap with Python.
You don’t need to go through every step of the tutorial, as the only requirement here is to extract the `<lastmod>` attribute. You can scrape it as soon as you’ve installed the required dependencies.
Once you’ve got your list of URLs and their XML attributes, I suggest trimming down the dataset so it looks something like this:
- A Screaming Frog crawl with the Google Search Console API enabled. Export `search_console_all` as an XLSX file, and feel free to choose which parameters to keep in the dataset. The only requirement is that you don’t drop the “Address” and “Last Crawl” columns.
Here’s an example of what you may want to keep from the Screaming Frog crawl with the Google Search Console API enabled.
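For the first file, here’s a minimal sketch of extracting `<lastmod>` values from a sitemap with the standard library. The sitemap content and URLs below are hypothetical stand-ins; in practice you’d fetch the live XML (e.g. with `requests`) instead of using an inline string.

```python
import xml.etree.ElementTree as ET

# hypothetical sitemap snippet; in practice, fetch the live XML from your site
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2023-01-10</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2023-01-05</lastmod></url>
</urlset>"""

# the sitemap protocol puts every element in this namespace
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(sitemap_xml)

# collect one row per URL, matching the "Address" / "last mod" columns used later
rows = [
    {'Address': url.find('sm:loc', ns).text, 'last mod': url.find('sm:lastmod', ns).text}
    for url in root.findall('sm:url', ns)
]
print(rows)
```

From here, `pd.DataFrame(rows)` would give you the two-column dataset described above.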
How to Measure Crawl Efficacy?
You can measure crawl efficacy by comparing the `<lastmod>` value from an XML sitemap with the last crawl date retrievable from the GSC API.
Crawl Efficacy = LastMod date - Last Crawl date
Remember, the lower the difference, the better the crawl efficacy.
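As a quick sketch of this formula with pandas (the URLs and dates below are hypothetical, not taken from real exports):

```python
import pandas as pd

# toy stand-ins for the sitemap crawl and the GSC export
sitemap = pd.DataFrame({
    'Address': ['https://example.com/a', 'https://example.com/b'],
    'last mod': pd.to_datetime(['2023-01-10', '2023-01-05']),
})
gsc = pd.DataFrame({
    'Address': ['https://example.com/a', 'https://example.com/b'],
    'Last Crawl': pd.to_datetime(['2023-01-03', '2023-01-04']),
})

# join on the shared URL column, then subtract the two dates
merged = sitemap.merge(gsc, on='Address')
merged['Crawl Efficacy'] = merged['last mod'] - merged['Last Crawl']
print(merged[['Address', 'Crawl Efficacy']])
```

Here page `/a` shows a 7-day gap and `/b` a 1-day gap, so `/b` is being recrawled more responsively.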
Import the Crawl with Search Console Parameters
Once we have the required files uploaded, we are going to import the Screaming Frog crawl into our environment so we can create a Pandas data frame and clean up messy data.
Once the dataset is created, we convert the Status Code column into integers so the response codes read as whole values, keeping the dataset clean and consistent.
```python
import pandas as pd

Last_Crawl = pd.read_excel('/content/SF crawl Last Crawl GSC.xlsx')
df = pd.DataFrame(Last_Crawl, columns=['Address', 'Status Code', 'Title 1', 'Clicks',
                                       'Impressions', 'CTR', 'Summary', 'Coverage', 'Last Crawl'])

# convert the Status Code column into integer to show integral values
df['Status Code'] = df['Status Code'].astype(int)
```
If we don’t make any changes, we end up with a messy dataset containing missing observations (NaN values), since some data points simply aren’t returned and would take thorough digging to retrieve. In fact, much of this missing data relates to URLs returning 301 HTTP response codes, which haven’t yet passed authority to their destination.
Regardless, we convert these neutral values to 0 so we can safely exclude them from the data processing.
```python
# remove the NaN values messing up the dataset
df = df.fillna(0)
df.isnull().sum()
df
```
Import Last Mod from an XML Sitemap Crawl
Next up, we are going to proceed with the XML sitemap crawl file.
If you followed my advice about reshaping the dataset, there is nothing left to do here but let Pandas read and set up your data frame.
```python
Last_mod = pd.read_excel('/content/XML sitemap last mod.xlsx')
df2 = pd.DataFrame(Last_mod, columns=['Address', 'last mod'])
df2
```
Calculate the Crawl Efficacy
To calculate crawl efficacy, we need to create a new column using a lambda function.

Remember the formula:

Last Mod - Last Crawl

And translate it into our lambda function.
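The final snippet operates on a data frame called `sg`, a step the walkthrough doesn’t show. Presumably it results from merging the two frames on their shared “Address” column and parsing the date columns so they can be subtracted. Here’s a hedged sketch with toy data (the URLs are hypothetical, and the `sg` construction is my assumption, not the original code):

```python
import pandas as pd

# toy stand-ins for the Screaming Frog frame (df) and the sitemap frame (df2)
df = pd.DataFrame({
    'Address': ['https://example.com/a', 'https://example.com/b'],
    'Last Crawl': ['2023-01-03', None],  # None mimics a page GSC never returned
})
df2 = pd.DataFrame({
    'Address': ['https://example.com/a', 'https://example.com/b'],
    'last mod': ['2023-01-10', '2023-01-05'],
})

# join on the shared Address column and align the column names used by the lambda
sg = df.merge(df2, on='Address', how='inner')
sg = sg.rename(columns={'last mod': 'Last Mod'})

# parse both columns as datetimes so they can be subtracted; unparseable
# placeholders become NaT and are dropped before the calculation
sg['Last Mod'] = pd.to_datetime(sg['Last Mod'], errors='coerce')
sg['Last Crawl'] = pd.to_datetime(sg['Last Crawl'], errors='coerce')
sg = sg.dropna(subset=['Last Mod', 'Last Crawl'])
```

With both columns as datetimes, the subtraction in the next snippet yields a Timedelta per row.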
Next, we’re going to sort the Crawl Efficacy column in descending order to surface the largest gaps between `<lastmod>` and the last crawl date.
Finally, we can store the data frame in our machine so it’s ready to download.
```python
sg['Crawl Efficacy'] = sg.apply(lambda row: row['Last Mod'] - row['Last Crawl'], axis=1)

# sort Crawl Efficacy by worst values
final = sg.sort_values('Crawl Efficacy', ascending=False)

# save the data frame
final.to_csv('crawl_efficacy.csv', index=False)

final.head()  # remove .head() if you want to view the full results straight away
```
Here’s an example of what you might get when printing `final.head()` of the data frame.
As we can see, the Pros & Cons article I wrote back in August is receiving the worst crawling treatment. Assuming there are no technical issues, I suspect something related to the content structure needs review, such as thin content or low-quality copy.
This handy Python script offers a more reliable SEO metric for measuring crawl responsiveness, helping to promote crawling and indexation of the pages that matter.
It seems like a promising approach to improving the accuracy of this metric. However, please don’t hesitate to reach out if you have any thoughts or critiques of the approach.