The SEO industry is often plagued by biased myths due to the complexity of the discipline.
“Crawl budget” is one such myth that prompts SEOs to make different assumptions depending on the size of a website. Moreover, its definition has some limitations as it fails to include qualitative signals that describe the health of the crawling process.
In fact, this measure only focuses on the quantity of submittable pages. In Google’s words:
Crawl budget represents the number of simultaneous parallel connections Googlebot may use to crawl the site, as well as the time it has to wait between the fetches.
Since crawling is the entry point for websites into Google’s search results, making it efficient can significantly impact indexation. However, if crawl budget cannot provide qualitative measures, what makes crawling efficient? How can we measure it?
A recent article from Search Engine Land sheds light on the misleading value of crawl budget and offers a new perspective that focuses on shifting from a quantitative optimization of the crawl rate to a more qualitative approach.
In other words, crawl efficacy estimates how quickly a newly submitted page gets crawled. The longer it takes, the higher the chances that your site is struggling with crawling issues.
In this post, I will provide you with a small Python script designed to calculate crawl efficacy in minutes for your target site.
To run this Python script, you need to upload a couple of files beforehand to train the model.
I recommend using Google Colab for this purpose, as it is a great ready-made solution for beginners. Most of the required libraries are installed by default, and you can take advantage of a powerful GPU to speed up script execution.
The files to upload are:
- A crawl of the XML sitemap from your target site, exported as an XLSX file with “Address” and “last mod” as columns. You can find a tutorial to scrape an XML sitemap with Python. You don’t need to go through every step of the tutorial, as the only requirement here is to extract the <lastmod> attribute, which you can scrape as soon as you’ve installed the required dependencies.
Once you’ve got your list of URLs and their XML attributes, I suggest you trim down the dataset so it looks something like this:
- A Screaming Frog crawl with the Google Search Console API enabled. Export the search_console_all report as an XLSX file, and feel free to choose whichever parameters you prefer to keep in the dataset. The only requirement here is that you don’t drop the “Address” and “Last Crawl” columns.
Here’s an example of what you may want to keep from the Screaming Frog crawl with the Google Search Console API enabled.
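If you’d rather not follow the full tutorial, the sitemap step can be sketched in a few lines. This is a minimal sketch rather than the tutorial’s exact code: the sitemap URL is a placeholder, and I’m assuming a standard single-file sitemap that uses the sitemaps.org namespace.

```python
import pandas as pd
from urllib.request import urlopen
from xml.etree import ElementTree

# standard sitemaps.org namespace used by most XML sitemaps
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_bytes):
    """Extract <loc> and <lastmod> into the two-column layout used later in the script."""
    root = ElementTree.fromstring(xml_bytes)
    rows = []
    for entry in root.findall("sm:url", NS):
        rows.append({
            "Address": entry.findtext("sm:loc", default="", namespaces=NS).strip(),
            "last mod": entry.findtext("sm:lastmod", default="", namespaces=NS).strip(),
        })
    return pd.DataFrame(rows)

# placeholder URL -- point this at your target site's sitemap
# xml_bytes = urlopen("https://example.com/sitemap.xml", timeout=30).read()
# parse_sitemap(xml_bytes).to_excel("XML sitemap last mod.xlsx", index=False)
```

The two column names match the ones the rest of the script expects, so the resulting XLSX can be uploaded to Colab as-is.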
How to Measure Crawl Efficacy?
You can measure crawl efficacy by comparing the <lastmod> parameter from an XML sitemap with the last crawl date retrievable from the GSC API.
Crawl Efficacy = LastMod date - Last Crawl date
Remember, the lower the difference, the better the crawl efficacy.
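As a quick sanity check on the formula, here is what the subtraction produces with two hypothetical dates (both dates are made up for illustration):

```python
import pandas as pd

# hypothetical dates: the page was updated on Nov 10 but last crawled on Nov 3
last_mod = pd.Timestamp("2022-11-10")
last_crawl = pd.Timestamp("2022-11-03")

crawl_efficacy = last_mod - last_crawl
print(crawl_efficacy)  # 7 days 00:00:00
```

A seven-day gap means Googlebot still hasn’t picked up a week-old update, which is exactly the kind of lag the metric is meant to surface.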
Import the Crawl with Search Console Parameters
Once we have the required files uploaded, we are going to import the Screaming Frog crawl into our environment so we can create a Pandas data frame and clean up messy data.
Once the dataset is created, we convert the Status Code column to integers so status codes display as whole values rather than floats, which makes the data easier to validate.
import pandas as pd

Last_Crawl = pd.read_excel('/content/SF crawl Last Crawl GSC.xlsx')
df = pd.DataFrame(Last_Crawl, columns=['Address', 'Status Code', 'Title 1', 'Clicks', 'Impressions', 'CTR', 'Summary', 'Coverage', 'Last Crawl'])

#convert the Status Code column into integers to show whole values
#fill missing status codes with 0 first, since astype(int) fails on NaN
df['Status Code'] = df['Status Code'].fillna(0).astype(int)
If we don’t make any changes, we will end up with a messy dataset containing missing observations (NaN values), since some fields simply can’t be retrieved for every URL.
For example, URLs returning a 301 HTTP response code often lack Search Console data because they haven’t passed authority to their destination yet.
To handle this, we convert these missing values to 0 so we can safely exclude them from the data processing.
#replace NaN values that mess up the dataset
df = df.fillna(0)
df.isnull().sum()
df
Import Last Mod from an XML Sitemap Crawl
Next up, we are going to proceed with the XML sitemap crawl file.
If you followed my advice in reshaping the dataset, you will have nothing to do here but let Pandas read and set up your data frame.
Last_mod = pd.read_excel('/content/XML sitemap last mod.xlsx')
df2 = pd.DataFrame(Last_mod, columns=['Address', 'last mod'])
df2
Calculate the Crawl Efficacy
To calculate crawl efficacy, we need to create a new column using a lambda function.
Remember the formula:
Last Mod - Last Crawl
And translate it into a lambda function applied row by row.
Next, we’re going to sort the Crawl Efficacy column in descending order to surface the largest gaps between <lastmod> and Last Crawl dates.
Finally, we can store the data frame on our machine so it’s ready to download.
#merge the two data frames on the shared Address column
#(assuming both crawls report URLs in the same format)
sg = df.merge(df2, on='Address')

#make sure both date columns are parsed as datetimes before subtracting
sg['last mod'] = pd.to_datetime(sg['last mod'], errors='coerce')
sg['Last Crawl'] = pd.to_datetime(sg['Last Crawl'], errors='coerce')

sg['Crawl Efficacy'] = sg.apply(lambda row: row['last mod'] - row['Last Crawl'], axis=1)

#sort Crawl Efficacy by worst values
final = sg.sort_values('Crawl Efficacy', ascending=False)

#save the data frame
final.to_csv('crawl_efficacy.csv', index=False)
final.head() #remove .head() if you want to view the full results straight away
Here’s an example of what you might get when printing final.head() of the data frame.
As we can see, the Pros & Cons article I wrote back in August is receiving the worst crawling treatment. Assuming there are no technical issues, I suspect something related to the content structure needs review, such as thin content or low-quality copy.
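If you want to go one step further, you can flag the worst offenders automatically. The mini data frame and the seven-day threshold below are hypothetical, as a sketch of how you might triage the script’s output:

```python
import pandas as pd

# hypothetical mini data frame mirroring the structure of "final"
final = pd.DataFrame({
    "Address": ["https://example.com/pros-cons", "https://example.com/fresh-post"],
    "Crawl Efficacy": [pd.Timedelta(days=30), pd.Timedelta(days=2)],
})

# flag pages whose latest update has waited more than a week for a recrawl
laggards = final[final["Crawl Efficacy"] > pd.Timedelta(days=7)]
print(laggards["Address"].tolist())  # ['https://example.com/pros-cons']
```

Pick whatever threshold matches your publishing cadence; a news site might use hours where a blog uses weeks.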
This handy Python script is designed to offer a more reliable SEO metric for measuring crawl responsiveness and supporting crawling and indexation.
It seems like a promising approach to improving the accuracy of this metric. However, please don’t hesitate to reach out if you have any thoughts or critiques of the approach.