How to Automate a Robots.txt Competitor Analysis with Python and Advanced Data Viz

I’ve always found competitor analysis involving robots.txt files to be a bit of a headache, because there’s hardly a tool out there that makes the cut.

But recently I was tasked with carrying out a competitive analysis from a technical SEO standpoint, and I couldn’t help including a breakdown of this infamous txt file for every competitor.

To streamline the process, I built a Python workflow using Advertools to fetch and parse robots.txt files, and Plotly to visualize disallowed directives across competitors with advanced heatmaps.

The result? A scalable and fairly decent approach to streamlining a robots.txt competitor analysis that may come in very handy. At least it did for me.

In this tutorial, I’ll walk you through a script that analyzes the robots.txt files of four competitors in the News industry, providing:

  • An overview of each site’s blocked User-Agents
  • A heatmap to visually compare blocked URL strings clustered in page templates (after rule-based classification)

Requirements & Assumptions

To follow along, you need to open a new notebook in Google Colab.

You can find the full notebook on GitHub to give you a head start:
👉 Robots.txt Competitor Analysis – Google Colab Export

We’re going to use Advertools to help us fetch the robots.txt files and Plotly to create advanced heatmaps.

You don’t need to be a Python expert for this tutorial, but some familiarity with data pre-processing will be useful. That’s because defining labels to cluster disallowed directives can be both time- and resource-consuming.

This is mainly due to the fact that the process depends on the industry you’re analyzing—different sectors have different needs when it comes to drafting a robots.txt file.

Install Dependencies & Fetch All Robots.txt files

Install advertools and plotly first:

!pip install advertools
!pip install plotly

Let’s kick off by requesting the robots.txt files and storing all the output in a Pandas dataframe.

Just a word of warning: be as polite as possible with the servers. You don’t want your IP address blocked for being too aggressive against the sites you’re fetching data from.

That’s why we’ll add a time.sleep(5) pause between requests as a minimum threshold.

It’s still possible you’ll get blocked, though; if that happens, try increasing the sleep interval.
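If you want to be a little more defensive than a fixed pause, you can wrap the fetch in a retry with exponential backoff. This is a minimal sketch, not part of the original notebook: `fetch_fn` is a placeholder for whatever fetcher you use (e.g. `lambda u: adv.robotstxt_to_df(robotstxt_url=u)`).

```python
import time

def fetch_with_backoff(fetch_fn, url, retries=3, base_delay=5):
    """Call fetch_fn(url), doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return fetch_fn(url)
        except Exception as exc:
            wait = base_delay * (2 ** attempt)  # 5s, 10s, 20s, ...
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError(f"Giving up on {url} after {retries} attempts")
```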

import advertools as adv
import pandas as pd
import time

# List of robots.txt URLs
robotstxt_urls = [
'https://www.bbc.com/robots.txt',
'https://www.theguardian.com/robots.txt',
'https://www.thesun.co.uk/robots.txt',
'https://www.mirror.co.uk/robots.txt'
]

robots_dfs = []

# Loop through each URL, fetch and parse with advertools
for url in robotstxt_urls:
    try:
        print(f"Fetching: {url}")
        # Use advertools.robotstxt_to_df with the URL
        df = adv.robotstxt_to_df(robotstxt_url=url)
        df['robots_url'] = url
        robots_dfs.append(df)

    except Exception as e:
        print(f"Error processing {url}: {e}")

    time.sleep(5)  # Be polite to servers

robots_df = pd.concat(robots_dfs, ignore_index=True)
robots_df

And the output will be slotted into a nice Pandas dataframe.

Data Pre-processing

Let’s filter out some noise, namely unwanted columns and the special characters you’ll find in the disallow directives.

import re
robots_df = robots_df.drop(columns=['etag', 'robotstxt_url', 'download_date', 'robotstxt_last_modified'], errors='ignore')
# Keep rows where 'directive' is 'User-agent' or 'Disallow'
robots_df = robots_df[robots_df['directive'].isin(['User-agent', 'Disallow'])]

# Function to clean content using regex
def clean_content(text):
    """
    Clean text by removing special characters and keeping only alphanumeric characters and spaces.

    Args:
        text (str): Input text to clean

    Returns:
        str: Cleaned text with only alphanumeric characters and spaces
    """
    if pd.isna(text):
        return text

    # Convert to string if not already
    text = str(text)

    # Use regex to keep only alphanumeric characters and spaces
    # This removes all special characters like *, =, **, etc.
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)

    # Remove extra whitespace and strip
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    return cleaned_text

# Apply the cleaning function to the 'content' column
robots_df['content'] = robots_df['content'].apply(clean_content)

robots_df

User-Agents Table

The robots_df dataframe has been cleaned up but it contains a broad span of information.

Let’s split it into:

  • User-agents table
  • Directives table
user_agent = robots_df.copy()
user_agent = user_agent[user_agent['directive'] == 'User-agent']
#user_agent.to_excel('user_agent.xlsx',index=False)
user_agent

⚠️ An empty (NaN) content value in the user-agent table above means a wildcard matching all user-agents
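If you’d rather make that wildcard explicit before grouping or plotting, you can fill the missing values yourself. A minimal sketch, with a toy frame standing in for the real user_agent table:

```python
import pandas as pd

# Toy user-agent table; a missing 'content' value stands for the '*' wildcard
user_agent = pd.DataFrame({
    "directive": ["User-agent", "User-agent"],
    "content": [None, "GPTBot"],
})

# Make the wildcard explicit so it survives later grouping and plotting
user_agent["content"] = user_agent["content"].fillna("*")
```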

⚠️⚠️ #user_agent.to_excel('user_agent.xlsx', index=False)

You can uncomment this line to export the output to an XLSX file and do a bit of manual clean-up yourself in Excel. That’s fine, sometimes it’s easier and quicker than fiddling with Python. You can then import the cleaned XLSX file with the first line of code right below.

# Upload the cleaned XLSX file
import pandas as pd
user_agent_cleaned = pd.read_excel('/content/user_agent.xlsx')

If you decided to do some DIY cleanup, make sure to replace user_agent with user_agent_cleaned from now on.

Pre-processing & Rule-based classification

If you didn’t clean anything up, stick with the user_agent dataset and follow along, because we’re going to create labels for the user-agents used by the competitors.

# Fix column name if there's a typo
if 'directive]' in user_agent.columns:
    user_agent = user_agent.rename(columns={'directive]': 'directive'})

# Drop irrelevant columns (optional, if they exist)
user_agent = user_agent.drop(
    columns=['etag', 'download_date', 'robotstxt_last_modified'],
    errors='ignore'
)

# Define the user agent classification function
def classify_user_agent(value):
    if pd.isna(value):
        return "Other/Unclassified"
    
    v = value.lower().strip()
    
    # Google/Alphabet
    if any(k in v for k in ["googlebot", "google extended", "google cloudvertexbot", "mediapartners google"]):
        return "Google"
    
    # OpenAI
    elif any(k in v for k in ["gptbot", "chatgpt user", "oai searchbot"]):
        return "OpenAI"
    
    # Anthropic
    elif any(k in v for k in ["claude web", "claudebot", "anthropic ai"]):
        return "Anthropic"
    
    # Meta
    elif any(k in v for k in ["facebookbot", "meta externalagent"]):
        return "Meta"
    
    # Microsoft
    elif "bingbot" in v:
        return "Bing"
    
    # Apple
    elif any(k in v for k in ["applebot"]):
        return "Apple"
    
    # Yandex
    elif any(k in v for k in ["yandex"]):
        return "Yandex"
    
    # ByteDance
    elif "bytespider" in v:
        return "ByteDance"
    
    # Huawei
    elif "petalbot" in v:
        return "Huawei"
    
    # Cohere
    elif "cohere ai" in v:
        return "Cohere"

    # Perplexity
    elif "perplexity" in v:
        return "Perplexity"
   
    # Baidu
    elif any(k in v for k in ["baiduspider","baidubaikebot"]):
        return "Baidu"
    
    #Amazon
    elif "amazonbot" in v:
      return "Amazon"

    # SEO/Marketing Tools
    elif any(k in v for k in ["ahrefsbot", "mj12bot", "awario", "sentione", "meltwater", "grapeshot", "semetrical"]):
        return "SEO/Marketing Tools"
    
    # Other Bots
    elif any(k in v for k in ["slurp", "ccbot", "scrapy", "magpie crawler", "coccocbot", "newsnow", "news please", 
                             "rogerbot", "daumoa", "sosospider", "ia archiver", "omgili", "piplbot", 
                             "imagesift", "jenkersbot", "scalepostai", "buck"]):
        return "General Crawlers/Scrapers"
    
    # Anything that falls through the rules above
    else:
        return "Other/Unclassified"
    
# Apply classification to the 'content' column
user_agent['user_agent_bucket'] = user_agent['content'].apply(classify_user_agent)

# Preview result
user_agent.value_counts('user_agent_bucket')
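Before moving on, it’s worth eyeballing whatever fell through the rules so you can extend the keyword lists. A minimal sketch, with a toy frame in place of the real user_agent table:

```python
import pandas as pd

# Toy classified table; in the notebook this would be the real user_agent frame
ua = pd.DataFrame({
    "content": ["Googlebot", "MysteryBot"],
    "user_agent_bucket": ["Google", "Other/Unclassified"],
})

# List the raw strings the classifier failed to bucket
unmatched = ua.loc[ua["user_agent_bucket"] == "Other/Unclassified", "content"]
```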

Plot a Heatmap of Blocked User-Agents

import pandas as pd
import plotly.express as px

# Function to extract clean site names from robots.txt URLs
def get_site_name(url):
    """Extract clean site name from robots.txt URL"""
    url_mapping = {
        'https://www.bbc.com/robots.txt': 'BBC',
        'https://www.theguardian.com/robots.txt': 'The Guardian',
        'https://www.thesun.co.uk/robots.txt': 'The Sun',
        'https://www.mirror.co.uk/robots.txt': 'The Mirror'
    }
    
    # Return mapped name if exact match, otherwise extract from URL
    if url in url_mapping:
        return url_mapping[url]
    
    # Fallback: extract domain name
    if 'bbc.com' in url:
        return 'BBC'
    elif 'theguardian.com' in url:
        return 'The Guardian'
    elif 'thesun.co.uk' in url:
        return 'The Sun'
    elif 'mirror.co.uk' in url:
        return 'The Mirror'
    else:
        # Generic extraction from domain
        domain = url.replace('https://www.', '').replace('http://www.', '').replace('/robots.txt', '')
        return domain.replace('.com', '').replace('.co.uk', '').title()

# Create a cross tab of robots_url and user_agent_bucket
heatmap_data = pd.crosstab(user_agent['robots_url'], user_agent['user_agent_bucket'])

# Create mapping of site names
site_names = [get_site_name(url) for url in heatmap_data.index]

# Create the heatmap
fig = px.imshow(
    heatmap_data,
    labels=dict(x="User-Agents", y="Websites Robots.txt", color="Count"),
    x=heatmap_data.columns,
    y=site_names,  # Use clean site names instead of URLs
    color_continuous_scale="viridis",
    text_auto=True  # adds value annotations like sns.heatmap(annot=True)
)

fig.update_layout(
    title="Distribution of Blocked User-Agents",
    height=max(400, len(heatmap_data.index) * 60),  # Adjust height based on number of sites
    margin=dict(l=120, r=50, t=80, b=50)  # Increase left margin for site names
)

fig.show()
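One caveat with absolute counts: a site with a very long robots.txt can dominate the colour scale. If that bothers you, pandas’ crosstab accepts normalize='index' to plot per-site shares instead. A small sketch with toy data standing in for the user_agent table:

```python
import pandas as pd

# Toy stand-in for the user_agent table
df = pd.DataFrame({
    "robots_url": ["site_a", "site_a", "site_a", "site_b"],
    "user_agent_bucket": ["Google", "Google", "OpenAI", "Google"],
})

# normalize='index' converts counts to per-site shares: every row sums to 1
shares = pd.crosstab(df["robots_url"], df["user_agent_bucket"], normalize="index")
```

You can pass the resulting frame to px.imshow exactly as before.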

Disallow Directive Table

From now on, it’s all a rinse-and-repeat process applied to the Disallow directives.

But this part can be more daunting, due to the unpredictable number of directives blocked in robots.txt files and the variety of industries you could be analyzing.

Let’s roll back to the original robots_df dataframe:

directive = robots_df.copy()
directive = directive[directive['directive'] == 'Disallow']
directive.to_excel('directive.xlsx',index=False)
directive

Just like before, I suggest you export the output to an XLSX file so you can look through it yourself and read all the blocked folders.
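A quick way to get a feel for the data before writing rules is to count the first token of each cleaned directive, which is usually the top-level folder. A sketch with toy data standing in for the directive table (remember the special characters were already stripped, so '/search/results' has become 'search results'):

```python
import pandas as pd

# Toy cleaned directive table
directive = pd.DataFrame({
    "content": ["search results", "sport olympics 2024", "search"],
})

# The first token is usually the top-level folder: a handy starting
# point for deciding which classification rules to write
directive["top_folder"] = directive["content"].str.split().str[0]
top = directive["top_folder"].value_counts()
```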

Pre-processing & Rule-based classification – Ask Claude for Help

If the prospect of clustering loads of subfolders sounds like overkill, you can always ask Claude to help you out. Here’s how.

  1. Export the Directive Table in XLSX
  2. Copy the list of blocked directives (“content” header)
  3. Paste it into Claude with the following prompt:

Feel free to add a new cell in the Colab to give yourself some extra space to run the code you get back from the conversation with Claude.

Here’s an example of the output I received from the prompt:

import re

def classify_content_term(term):
    """
    Classify content terms into functional categories
    """
    if pd.isna(term):
        return "Other/Unclassified"
    
    # Convert to lowercase and clean
    t = str(term).lower().strip()
    
    # Search & Navigation
    if any(keyword in t for keyword in ['search', 'chwilio', 'websearch', 'find', 'query']):
        return "Search & Navigation"
    
    # Educational Content
    elif any(keyword in t for keyword in ['bitesize', 'education', 'newsround', 'learning', 'curriculum']):
        return "Educational Content"
    
    # Food & Lifestyle
    elif any(keyword in t for keyword in ['food', 'recipes', 'menus', 'shopping list', 'favourites', 'cooking']):
        return "Food & Lifestyle"
    
    # User Management & Authentication
    elif any(keyword in t for keyword in ['users', 'userinfo', 'login', 'sso', 'profile', 'auth', 'account', 'user']):
        return "User Management & Auth"
    
    # Media & Entertainment
    elif any(keyword in t for keyword in ['sounds', 'music', 'artist', 'album', 'radio', 'audio', 'player', 'tv']):
        return "Media & Entertainment"
    
    # Sports
    elif any(keyword in t for keyword in ['sport', 'olympics', 'horseracing', 'racecards', 'results', 'medals', 'events']):
        return "Sports"
    
    # Technical/System
    elif any(keyword in t for keyword in ['ajax', 'css', 'js', 'php', 'api', 'embed', 'wp', 'admin', 'apps', 'json', 'xml', 'asset']):
        return "Technical/System"
    
    # News & Articles
    elif any(keyword in t for keyword in ['news', 'articles', 'headline', 'stories', 'feedarticle', 'most read', 'breaking']):
        return "News & Articles"
    
    # User-Generated Content
    elif any(keyword in t for keyword in ['ugc', 'comment', 'discussion', 'report abuse', 'permalink', 'handlers']):
        return "User-Generated Content"
    
    # External Services & Integrations
    elif any(keyword in t for keyword in ['whsmiths', 'overture', 'brightcove', 'tealium', 'external', 'third party']):
        return "External Services"
    
    # Commerce & Shopping
    elif any(keyword in t for keyword in ['shop', 'buy', 'cart', 'checkout', 'payment', 'order']):
        return "Commerce & Shopping"
    
    # Help & Support
    elif any(keyword in t for keyword in ['help', 'support', 'contact', 'feedback', 'faq']):
        return "Help & Support"
    
    # Archive & Historical
    elif any(keyword in t for keyword in ['archive', 'historical', 'past', 'old']):
        return "Archive & Historical"
    
    # Entertainment & Lifestyle Content
    elif any(keyword in t for keyword in ['celeb', 'celebrity', 'weird', 'cartoons', 'lifestyle', 'entertainment']):
        return "Entertainment & Lifestyle"
    
    # Geographic/Location
    elif any(keyword in t for keyword in ['travel', 'location', 'seaside', 'uk', 'local']):
        return "Geographic/Location"
    
    else:
        return "Other/Unclassified"

# Apply the classification to your directive dataframe
# Assuming your dataframe is called 'directive' and has a 'content' column
directive['content_category'] = directive['content'].apply(classify_content_term)

# Display the distribution of categories
directive['content_category'].value_counts()

Plotting a Heatmap of Disallowed Directives

And now let’s plot the groups of subfolders covered by a Disallow directive in the robots.txt files of the competitor pool.

import pandas as pd
import plotly.express as px

# Function to extract clean site names from robots.txt URLs
def get_site_name(url):
    """Extract clean site name from robots.txt URL"""
    url_mapping = {
        'https://www.bbc.com/robots.txt': 'BBC',
        'https://www.theguardian.com/robots.txt': 'The Guardian',
        'https://www.thesun.co.uk/robots.txt': 'The Sun',
        'https://www.mirror.co.uk/robots.txt': 'The Mirror'
    }
    
    # Return mapped name if exact match, otherwise extract from URL
    if url in url_mapping:
        return url_mapping[url]
    
    # Fallback: extract domain name
    if 'bbc.com' in url:
        return 'BBC'
    elif 'theguardian.com' in url:
        return 'The Guardian'
    elif 'thesun.co.uk' in url:
        return 'The Sun'
    elif 'mirror.co.uk' in url:
        return 'The Mirror'
    else:
        # Generic extraction from domain
        domain = url.replace('https://www.', '').replace('http://www.', '').replace('/robots.txt', '')
        return domain.replace('.com', '').replace('.co.uk', '').title()

# Create a cross tab of robots_url and content_category
heatmap_data = pd.crosstab(directive['robots_url'], directive['content_category'])

# Create mapping of site names
site_names = [get_site_name(url) for url in heatmap_data.index]

# Create the heatmap
fig = px.imshow(
    heatmap_data,
    labels=dict(x="Disallowed Directives", y="Websites Robots.txt", color="Count"),
    x=heatmap_data.columns,
    y=site_names,  # Use clean site names instead of URLs
    color_continuous_scale="viridis",
    text_auto=True  # adds value annotations like sns.heatmap(annot=True)
)

fig.update_layout(
    title="Distribution of Disallowed Directives",
    height=max(400, len(heatmap_data.index) * 60),  # Adjust height based on number of sites
    margin=dict(l=120, r=50, t=80, b=50)  # Increase left margin for site names
)

fig.show()

And that’s a wrap!

With a couple of tweaks here and there, you can secure yourself a reliable workflow to streamline a robots.txt competitive analysis.
