Analyzing the Causal Impact of Robots.txt Block on SEODepths

In a dusty bedroom on a warm evening in Rome, my best friend and I came back from a long walk past the old Roman buildings, wearing typical shirts and round sunglasses.

Blame it on the exhaustion and the intense daylight, but that was the moment I decided to block search engines from my website.

I wasn’t just interested in the outcomes – rather predictable – but in the methodology behind the test. I wanted to experiment with advanced statistics tools and see how blocking search engines with robots.txt would impact visibility.


TL;DR

If you’re about to bounce, here are all the basics you need to know about the test I ran.

Experiment Overview
Blocked search engines for 3 months to analyze robots.txt impact on organic traffic indicators.

Objective
Assess the impact of a site-wide blocking directive on organic traffic indicators (i.e. clicks) with the aid of Bayesian probability.

Highlights

Initial traffic increase despite blocking search engines.
Indexing issues phased out from the Page Indexing report (e.g. pages with redirects dropped right after the robots.txt block).
Lost my Knowledge Panel
Lost SEO Depths favicon in the search results

And the main takeaway:

After blocking the website for three months, the organic performance declined at a steady pace.

Right, but how bad was the debacle? How did I run the test? And were the outcomes statistically acceptable?

Keep reading to find out!

Methodologies & Disclaimers

I chose to focus the test on clicks because this metric is based entirely on (semi-conscious?) user interaction with search results. This makes clicks more robust to the constant visibility fluctuations that can easily distort impressions.
The experiment is purely observational: it reports on associations between blocking search engines (independent variable) and organic performance (dependent variable). Although I ran hypothesis tests based on the expected outcomes, I couldn’t prove causation, and what happened to my small website may not hold true for others.

Observational vs Experimental case studies properties – Source: seotistics

Apart from the publication of a guide to Technical SEO for video content on November 15th, the website has not faced any significant changes since the test rolled out.
The time series for each issue in the Page Indexing report on Google Search Console show aggregated data based on historical trends. These lines are generated using a standard statistical smoothing technique known as a moving average.

The “Examples” array below the Indexing graph on the page provides a random sample of affected pages determined by the search index cache. If a page was reported to be affected by certain issues on, say, October 31st, it doesn’t necessarily mean that other pages on the same day are free from similar issues. They might just not be available in the sample.

Page indexing report search console
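For readers who have never met a moving average, here is a minimal Python sketch of the trailing (simple) variant. It is only a generic illustration of the smoothing idea: the `moving_average` helper, the window size, and the daily counts are made up by me, and I am not claiming Search Console computes its lines exactly this way.

```python
from collections import deque


def moving_average(values, window):
    """Trailing simple moving average: one output per full window."""
    if window < 1:
        raise ValueError("window must be at least 1")
    buffer = deque(maxlen=window)  # keeps only the last `window` values
    smoothed = []
    for value in values:
        buffer.append(value)
        if len(buffer) == window:
            smoothed.append(sum(buffer) / window)
    return smoothed


# Made-up daily counts, purely for illustration.
daily_counts = [10, 12, 11, 15, 14, 13, 18]
print(moving_average(daily_counts, window=3))
```

The larger the window, the smoother the line, and the slower it reacts to a sudden change such as a site-wide block.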

Test Setup

This was a pretty simple setup.

First off, I determined the roles of variables. Because the objective of the analysis was assessing how clicks would react to an external factor, clicks were set as my dependent variable.

The external factor looming over clicks is my independent variable, which happened to be the mighty robots.txt blocking directive.

types of variables for the test setup – elaborated from Seotistics

So, I simply blocked the whole website from being accessed by all search engines on October 6th.
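As a rough illustration, a site-wide block usually boils down to two lines of robots.txt. The Python sketch below uses the standard library's `urllib.robotparser` to confirm that such a file shuts out every crawler. The exact file contents, the `example.com` URL, and the `is_blocked` helper are my assumptions for the example, not a copy of the real setup.

```python
from urllib.robotparser import RobotFileParser

# Assumed site-wide block: every crawler is disallowed from every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_txt, user_agent, url):
    """Return True if robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# example.com stands in for the real domain.
for agent in ("Googlebot", "Bingbot"):
    print(agent, is_blocked(ROBOTS_TXT, agent, "https://example.com/blog/post"))
```

Keep in mind that robots.txt controls crawling, not indexing, which is why already-indexed pages can linger in the search results for a while.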

Since a rise in indexed pages blocked by robots.txt was to be expected, I used hypothesis testing to frame the following assumptions about the impact on organic traffic.

H0 = The decrease in traffic share over 3 months is >= 25%. I hypothesize that link equity is poorly distributed across the website.
H1 = The decrease in traffic share over 3 months is < 25%. I hypothesize that link equity is solid across the website.
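The decision rule behind these two hypotheses can be written down in a few lines of Python. This is only a sketch: the `supported_hypothesis` helper is hypothetical, the input is the measured decrease expressed as a positive fraction, and treating a decrease of exactly 25% as H0 is my own choice for the boundary case.

```python
def supported_hypothesis(relative_decrease, threshold=0.25):
    """Map a measured decrease (0.30 means -30%) to the hypothesis it supports."""
    if relative_decrease < 0:
        raise ValueError("pass the decrease as a positive fraction")
    # Boundary assumption: a decrease exactly at the threshold counts as H0.
    return "H0" if relative_decrease >= threshold else "H1"


# Hypothetical values, only to show both branches.
for decrease in (0.30, 0.10):
    print(f"{decrease:.0%} decrease -> {supported_hypothesis(decrease)}")
```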

Causal Impact for the Win

The above hypotheses were tested with the aid of the CausalImpact R package developed by Google.

Causal impact analysis works by using historical data to make a prediction on a counterfactual, a conditional statement of something that has not happened.

📌 In layman’s terms:

Counterfactual = what an outcome would have looked like if a given event had never happened

In more educated terms, CausalImpact is based on a statistical model called Bayesian Structural Time Series (or BSTS for short), which uses prior information to predict the outcome of a variable in the absence of a treatment.
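To make the counterfactual idea tangible, here is a deliberately naive Python sketch. It is not BSTS and not what CausalImpact does internally: it simply projects the pre-period average forward as the "no block" scenario and measures how far the actual clicks fall from it, with no trend, seasonality, or uncertainty intervals. The function names and the click numbers are invented for the example.

```python
from statistics import mean


def naive_counterfactual(pre_period, n_post_days):
    """Flat projection of the pre-period daily average (assumption: no trend)."""
    baseline = mean(pre_period)
    return [baseline] * n_post_days


def relative_effect(actual_post, predicted_post):
    """(actual - predicted) / predicted, computed on period totals."""
    predicted_total = sum(predicted_post)
    if predicted_total == 0:
        raise ValueError("predicted total must be non-zero")
    return (sum(actual_post) - predicted_total) / predicted_total


# Made-up daily clicks before and after a hypothetical intervention.
pre_clicks = [20, 22, 18, 20, 21, 19]
post_clicks = [18, 16, 15, 15]

predicted = naive_counterfactual(pre_clicks, len(post_clicks))
effect = relative_effect(post_clicks, predicted)
print(f"Relative effect: {effect:.1%}")
```

A proper BSTS model replaces the flat baseline with a forecast learned from the pre-period and wraps the result in a credible interval, which is what allows it to say whether the effect is likely to be real.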

CausalImpact is a versatile tool with wide-ranging applications. In addition to assessing the impact of robots.txt blocks on organic performance, it finds utility in marketing analytics. For example, it proves valuable for measuring the outcomes of influencer marketing campaigns and evaluating upper-funnel metrics like site views, clicks, sessions, and user engagement.

I decided to leverage this tool to provide a data-informed solution to evaluate the impact of technical changes on my website. Despite a slight learning curve due to its statistical nature, CausalImpact provides a robust approach to assess the impact of technical implementations and draw meaningful conclusions.

To be fair, I find that dedicating a few hours to brushing up on data analysis and statistics basics is more rewarding than blindly relying on third-party services whose data parsing and forecasting processes you may know very little about.

Plus, CausalImpact is a free library for the R language.

💡PRO TIP

I strongly advise downloading the RStudio desktop app so that you can install CausalImpact as a library and get started with your tests.

The Roadmap to a Full Website Block

Crawl requests dropped as soon as the block was applied, especially for HTML resources.

October

Thirty days after the block, web crawlers still seemed to be circling certain prominent blog posts, causing crawl requests to pick up again on a few occasions.

Google Search Console gathers random samples of pages to serve as examples of issues. Regardless of the sampling process of my URLs, I am led to assume that the blocking directives were taking some time before becoming effective and web crawlers knocked at my blog’s door now and then as part of their daily routine.

The XML sitemaps had also been inaccessible since the test launched, which backs up my assumption: web crawlers couldn’t discover any of my content from that endpoint either.

While crawling was heavily compromised, on the indexation front Google didn’t drop any of my pages. Instead, it moved them into the “Indexed, though blocked by robots.txt” bucket.

However, the most visible impact on indexed pages I saw was on pages with redirects.

pages with redirects

Right after the robots.txt block, they gradually tapered off, causing a decrease in the proportion of non-indexed pages. Interestingly, this coincided with a slight increase in impressions on the website.

Nothing surprising so far.

Because Googlebot is prevented from accessing the website, it can no longer see the pages that redirect to other destinations.

Traffic-wise, the first month of testing revealed something quite unexpected. Despite the blocks, the site saw a noticeable increase in traffic, suggesting that my pages were still very much alive and kicking in the SERPs.

However, the robots.txt impact on impressions turned out to be spurious or affected by a certain degree of randomness. A Bayesian test conducted with CausalImpact on data from August to November showed that the impact of the block on clicks, by contrast, was statistically significant.

The reason I included data from August and September is that, as a rule of thumb, CausalImpact works best when the pre-intervention period is roughly twice as long as the period under analysis. Even though the increase in clicks is evident in the Search Console UI, I believe it doesn’t capture the complete picture, which a Bayesian analysis does.
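If you want to sanity-check the length of your own pre- and post-periods before running a test, a few lines of Python are enough. In this sketch the dates come from this first analysis window (start and end read as day/month from the chart caption, block applied on October 6th), while the `period_lengths` helper and the choice to count the block day as the first post-period day are my assumptions.

```python
from datetime import date


def period_lengths(start, intervention, end):
    """Return (pre_days, post_days); the intervention day opens the post-period."""
    if not start < intervention <= end:
        raise ValueError("expected start < intervention <= end")
    pre_days = (intervention - start).days
    post_days = (end - intervention).days + 1
    return pre_days, post_days


pre_days, post_days = period_lengths(
    start=date(2023, 8, 7),
    intervention=date(2023, 10, 6),
    end=date(2023, 11, 7),
)
print(pre_days, post_days, round(pre_days / post_days, 2))
```

The script prints both lengths and their ratio, which makes it easy to see how close a given window is to the 2:1 guideline.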

CausalImpact test – Date: 07/08/2023 – 07/11/2023

The first panel illustrates the actual click trend (black line) compared to what the clicks would have been (dotted line) if no change had been applied. The vertical dotted line marks the start of the test.

The third panel shows the cumulative causal effect trending downward, suggesting the impact of the test was negative.

In the words of the CausalImpact report:

In the time period from August to November, there was a decrease of -23% in clicks with a 95% confidence level.

The probability of obtaining this effect by chance is very small (p = 0.022). This means the causal effect can be considered statistically significant.

November

Crawl requests followed a similar trend to the previous month, with occasional requests primarily focused on HTML resources. This suggests that web crawlers weren’t giving up on my blog pages just yet.

crawl requests – November 2023

The increase in the number of pages that were indexed but blocked by robots.txt came to a standstill as Google Search Console gradually stopped reporting on the proportion of indexed and not-indexed pages.

indexed and not indexed pages in November 2023

When search engine spiders run into a blocking directive, they decrease the crawl rate, and the Search Console report stops adding events to the time series.

But let’s dive into the traffic performance bits.

Between the 15th and the 20th of November, I published and promoted a new blog post about video SEO. A newsletter called Seotistics, which spreads the data science gospel in SEO, gave my blog a shout-out for featuring useful resources to streamline tasks with Python and data analysis.

organic performance – November

So, after publishing my new blog post, I had to wait for the test to finish before it got indexed. I think sharing it on social media helped search engines notice the new info on my site, making it look like a fresh resource and leading to a sudden spike in traffic.

But sadly, the good times didn’t stick around. Once search engines saw that access to my site was blocked, the traffic dropped sharply.

However, according to the data in Search Console, the organic traffic outlook for November looked better than the previous month on average.

Impressions and clicks showed a slight increase MoM.

And the statistics proved me right.

Another Bayesian analysis using data from August to December revealed that the rate of traffic decline had slowed down compared to the previous month (-16% vs. -23%).

CausalImpact test – Date: 07/08/2023 – 07/12/2023

Just to refresh what we see on the first panel, the grey line marks the date the test started whereas the black line shows the actual click trends compared to what the clicks would have been if no change was applied (counterfactual prediction).

Finally, the last panel showcases the causal impact effect sticking to a downward trend.

In the words of the CausalImpact report:

In the time span from August to December, there was a decrease of -16% in clicks with a 95% confidence level.

The probability of obtaining this effect by chance is very small (p = 0.039). This means the causal effect can be considered statistically significant.

Admittedly, I couldn’t resist the hindsight bias because I was expecting a slowdown in traffic decline MoM.

Think about it. Denying access to my website for search engines caused a sudden drop in traffic. As bots realized the situation wasn’t changing soon, they slowed down the crawl rate and halted reporting on indexed pages.

This ultimately led to a slower decrease in overall traffic.

December

Unsurprisingly, crawl requests were annihilated throughout December, despite occasional requests flaring up for HTML resources.

Google Search Console stopped reporting on the proportion of indexed and not-indexed pages, leaving organic visibility metrics drifting up and down erratically.

Search performance seemed to go into free fall as well.

Because no posts were promoted on any marketing channel, there was no way users would stumble upon my website and lift its visibility.

As you can see, though, the picture is very nuanced. It’s close to impossible to draw a trend line just by looking at the Google Search Console time series.

As usual, CausalImpact comes to the rescue, and here’s the verdict on the impact on organic traffic performance.

A BSTS (Bayesian Structural Time Series) analysis covering the entire testing period revealed that:

There was a decrease of -21% in clicks with a 95% confidence level.

The probability of obtaining this effect by chance is very small (p = 0.012). This means the causal effect can be considered statistically significant.

CausalImpact test – Date: 07/08/2023 – 07/01/2024

A Slow (Yet Steady) Decline

During the test, I noticed a dip in impressions and clicks first, followed by a consistent decline until the end of the test.

Now, let’s review the hypothesis testing proposal to determine if we can validate the H1 assumption.

Can you recall?

H1 = The decrease in traffic share over 3 months is < 25%. I hypothesize that link equity is solid across the website.

In the three-month testing period, my website did experience a 21% decrease in click share, so I can validate the H1 assumption.

This suggests that SEO Depths has a robust internal linking structure, which helped the website cushion the effects of a complete robots.txt block.

I Lost My KP and SEODepths Favicon

Before the test, a navigational branded term for my full name would return a Knowledge Panel in the search results for the UK.

Simone De Palma Knowledge Panel

I was so excited that I even interrupted my summer holidays in Portugal to write a post detailing how I earned the knowledge panel.

Shortly after disabling crawling for the entire website, I lost the panel and I was left in oblivion from Google’s knowledge graph.

This is something that occurred in a similar test performed by Kristina Azarenko as reported in her post.

Another significant change brought about by this experiment involved the swift removal of the favicon.

missing favicon and knowledge panel for simone de palma – UK SERP

In the words of Google, both the favicon file and the home page must be crawlable by Googlebot for the favicon to be displayed in search results. As the website was completely blocked by robots.txt, the favicon file was naturally knocked out from the SERP.
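The crawlability condition above is easy to check yourself. Below is a small Python sketch using the standard library's `urllib.robotparser`. The `favicon_prerequisites_met` helper, the `example.com` URLs, and the use of a single `Googlebot` user agent for both checks are my simplifying assumptions. Passing this check only means crawling is allowed, not that the favicon will actually show up.

```python
from urllib.robotparser import RobotFileParser


def favicon_prerequisites_met(robots_txt, home_url, favicon_url,
                              user_agent="Googlebot"):
    """True only if both the home page and the favicon file are crawlable."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return (parser.can_fetch(user_agent, home_url)
            and parser.can_fetch(user_agent, favicon_url))


# Assumed site-wide block, with example.com standing in for the real domain.
SITE_WIDE_BLOCK = "User-agent: *\nDisallow: /\n"
print(favicon_prerequisites_met(
    SITE_WIDE_BLOCK,
    "https://example.com/",
    "https://example.com/favicon.ico",
))
```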

Words of Wisdom (A Few)

I want to give a shoutout to Marco Giordano for giving me great guidance on handling SEO testing and interpreting claims on steroids in SEO case studies. You should subscribe to his newsletter which I linked throughout the blog post – did you notice? 🧐

Also, recall this test wasn’t meant to be the ultimate guide for small websites. I was mainly interested in using strong statistics to learn a solid method for testing and understanding the results.

So, try to read less and test more.

I will keep testing,

you will keep testing,

they will keep testing

So that we will keep testing together


Simone De Palma

Technical SEO Executive

Simone De Palma is a Technical SEO Executive at iProspect UK and the founder of SEO Depths.

He graduated in Marketing and Management from Università IULM before completing a degree in Digital Marketing and Data Science at Leeds Beckett University.
Simone has worked as an SEO Specialist in digital agencies in Italy and the United Kingdom, and he’s a contributor to Search Engine Land.

When he’s away from his double screens, he enjoys cooling down with a refreshing swim at the pool. You can also find him exploring art museums or enjoying the company of a classic romance.


I Blocked SEO Depths for 3 months: What Happened