🍁XML Sitemap Audit with Python

Reading time: 25 Minutes

XML sitemaps are designed to make life easier for search engines by providing an index of a site’s URLs.

However, they also play a crucial role in gaining a competitive edge, as they quickly deliver an overview of the pages a website considers most relevant and wants to bring to the attention of web crawlers.

In this post, you will learn how to automate an XML sitemap audit to improve your SEO decision-making.


💡Shortcut

Jump straight to the Google Colab script if you are in a hurry.


XML Sitemap: What It Is and How to Audit It

A sitemap is a repository of the most important pages on your website. It helps Google and other search engines crawl your site more efficiently.

In the words of Google:

A sitemap is a file where you provide information about the pages, videos, and other files on your site, and the relationships between them.

A sitemap helps search engines discover URLs on your site, but it doesn’t guarantee that all the items in your sitemap will be crawled and indexed.

Who needs an XML Sitemap?

When your website’s pages are linked correctly, Google is usually able to find and index most of them. This means that all the important pages on your website can be accessed through navigation options, such as your site’s menu or links placed on other pages.

Despite optimal internal linking, an XML sitemap can help Google find and index larger or more complex websites or specialized files effectively.

Some websites may benefit more from an XML sitemap than others:

  • Your site is really large. As a result, it’s more likely that Google’s web crawlers will overlook some of your new or recently updated pages.
  • Your site has a large archive of content pages that are isolated or not well linked to each other. If your site pages don’t naturally reference each other, you can list them in a sitemap to ensure that Google doesn’t overlook some of your pages.
  • Your site is new and has few external links to it. Googlebot and other web crawlers crawl the web by following links from one page to another. As a result, Google might not discover your pages if no other sites link to them.
  • Your site has a lot of rich media content (video, images) or is shown in Google News. If provided, Google can take additional information from sitemaps into account for search, where appropriate.

Some websites may not receive equivalent benefits from an XML sitemap:

  • Your site is “small”. The site should be about 500 pages or fewer.
  • Your site is comprehensively linked internally. This means that Google can find all the important pages on your site by following links starting from the homepage.
  • You don’t have many media files (video, image) or news pages that you want to show in search results. Sitemaps can help Google find and understand video and image files, or news articles, on your site. If you don’t need these results to appear in images, videos, or news results, you might not need a sitemap.

XML Sitemap Best Practices

Before jumping into the analysis, it is worth reviewing a few points so you can approach the audit with a clear understanding of what a well-formed XML sitemap should look like (a minimal example of a compliant sitemap entry follows the list).

  • Make sure to use consistent URLs. Don’t omit the protocol (HTTPS) and don’t mix URLs with and without www; always list absolute URLs exactly as your site is served.
  • Don’t include session IDs and other user-dependent identifiers in the URLs in your sitemap. This reduces duplicate crawling of those URLs.
  • Break up large sitemaps into smaller files: a sitemap can contain up to 50,000 URLs and must not exceed 50MB uncompressed.
  • List only canonical URLs in your sitemaps. If you have multiple versions of a page, list in the sitemap only the one you prefer to appear in search results.
  • Point to only one website version in a sitemap. If you have different URLs for mobile and desktop versions of a page, use one version in a sitemap.
  • Tell Google about alternate language versions of a URL using hreflang annotations. Use the hreflang tag in a sitemap to indicate the alternate URLs if you have different pages for different languages or regions.
  • Encode your sitemap with UTF-8 and use ASCII characters to ensure it is readable by Google. This is usually done automatically if you are using a script, tool, or log file to generate your URLs.
  • Avoid submitting pagination. URLs in the sitemap should only be pages that you want to rank, and paginated series do not usually fit this criterion. Including paginated series in a sitemap increases the chance that they rank on the SERP and wastes crawl requests, which can compromise your crawl efficiency.
  • Keep in mind sitemaps are a recommendation to Google about which pages you think are important; Google does not pledge to crawl every URL in a sitemap.
  • Google ignores <priority> and <changefreq> values.
  • Google uses the <lastmod> value if it’s consistently and verifiably (for example by comparing to the last modification of the page) accurate.
  • The position of a URL in a sitemap is not important; Google does not crawl URLs in the order in which they appear in your sitemap.
  • Submit your sitemap to Google. Google examines sitemaps only when it first discovers them or when it is notified of an update. Despite plenty of ways to make your sitemap available to Google, it is best practice to submit it through the Search Console.
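
For reference, here is a minimal sketch of what a single compliant sitemap entry looks like, generated with the lxml library that we also import later in this tutorial (the URL and date are placeholders):

from lxml import etree

SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'

# Build a minimal <urlset> with a single <url> entry (placeholder values)
urlset = etree.Element(f'{{{SITEMAP_NS}}}urlset', nsmap={None: SITEMAP_NS})
url = etree.SubElement(urlset, f'{{{SITEMAP_NS}}}url')
etree.SubElement(url, f'{{{SITEMAP_NS}}}loc').text = 'https://www.example.com/sample-page'
etree.SubElement(url, f'{{{SITEMAP_NS}}}lastmod').text = '2023-07-05'

print(etree.tostring(urlset, pretty_print=True,
                     xml_declaration=True, encoding='UTF-8').decode())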

💡Pro Tip

Referencing XML sitemaps in your robots.txt files is not an official requirement.
This is an SEO myth based on the assumption that placing the sitemap in the robots.txt consolidates the file’s submission signals to Google.

Instead, this may expose your site to security breaches as someone could scrape important information from your website.

Submitting the XML sitemap from your Google Search Console property is enough.

Audit XML Sitemaps in Python

Working agency-side, there are many opportunities to tackle a wide range of SEO tasks. When screening the XML sitemaps of a client website, I like to follow a machine learning-based playbook that allows me to process huge chunks of data and automate most of the boring stuff.
Before kicking off this tutorial on how to automate a sitemap audit, there are a few premises to be aware of.

For the purpose of this tutorial, I am going to test the XML sitemap from Halfords, the largest British retailer of motoring and cycling products and services. 


Install and Import Libraries

We need to install and import a couple of fundamental libraries.

%%capture
!pip install advertools dash dash_bootstrap_components jupyter_dash plotly bs4 matplotlib datasets adviz dash_bootstrap_templates

The following packages deserve a brief introduction:

  • Advertools: A Python library designed to assist in SEO automation projects. Used here to retrieve the website’s sitemap.xml.
  • Pandas: A library for data manipulation and analysis in Python. Imported for easy data manipulation maneuvers.
  • Adviz: A Python library built on top of Advertools for clean data visualization and basic descriptive analysis.
  • Plotly: A high-level, declarative charting library for creating highly visual plots.

import os
import requests
import urllib.parse
from bs4 import BeautifulSoup
import time
import warnings

import advertools as adv
import adviz
import pandas as pd
import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go

from IPython.display import display, display_html, display_markdown, HTML

from lxml import etree

import matplotlib
import sklearn

from dash_bootstrap_templates import load_figure_template
import dash
from dash import dcc, html  # dash_core_components / dash_html_components are deprecated

# Set display options
pd.options.display.max_columns = None
warnings.filterwarnings("ignore")
display(HTML("<style>.container { width:100% !important; }</style>"))

def md(text):
    return display_markdown(text, raw=True)

# Check package versions
for pkg in [adv, pd, plotly, sklearn]:
    print(f'{pkg.__name__:-<30}v{pkg.__version__}')

load_figure_template(['darkly','bootstrap','flatly','cosmo'])

Fetch URLs from the Sitemap

The next step is the actual scraping of our sitemap.xml.

Having imported the required libraries, we are now in a position to set up a few lines of code that fetch and parse the sitemap.

💡 Append either /sitemap.xml or /sitemap_index.xml to the end of the domain to find out whether the target site has a sitemap.xml
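
If you want to automate that quick check, here is a minimal sketch; the paths probed are just common conventions, not an exhaustive list:

# Probe a couple of common sitemap locations and print the response codes
for path in ['/sitemap.xml', '/sitemap_index.xml']:
    url = 'https://www.halfords.com' + path
    print(url, requests.get(url).status_code)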

Next, we fetch the target site’s sitemap.xml using the adv.sitemap_to_df() function from Advertools to convert the sitemap into a data frame.

The script then saves the result to a CSV file and reads it back, parsing the “lastmod” and “download_date” columns as dates to make them easier to work with.

Finally, the script drops duplicate URLs as a basic data-cleaning step before printing the dataset.

sitemap = adv.sitemap_to_df('https://www.halfords.com/robots.txt',
                            max_workers=8,
                            recursive=True)
sitemap.to_csv('sitemap_collection.csv',index=False)

#parse the sitemaps

sitemap_raw = pd.read_csv('sitemap_collection.csv', parse_dates=['lastmod', 'download_date'], low_memory=False)

#overview duplicates
print(f'Original: {sitemap_raw.shape}')
sitemap = sitemap_raw.drop_duplicates(subset=['loc'])
print(f'After removing dupes: {sitemap.shape}')
sitemap.to_csv('sitemap.csv',index=False)
sitemap
scraped URLs from a sitemap

Audit the Size and Number of URLs on the XML Sitemap

A single XML sitemap can contain up to 50,000 URLs and must not exceed 50MB uncompressed, therefore we need to make sure our files do not break these limits.

Check how many URLs a single file contains

sitemap['sitemap'].value_counts()

Check the size of the XML sitemap

sitemap['sitemap_size_mb'].drop_duplicates().sort_values(ascending=False)
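
To flag any single file that breaks those limits automatically, here is a small sketch that assumes the sitemap, loc and sitemap_size_mb columns returned by adv.sitemap_to_df() are still in place:

# Aggregate URL count and size per sitemap file and flag rule breakers
per_file = (
    sitemap
    .groupby('sitemap')
    .agg(url_count=('loc', 'size'), size_mb=('sitemap_size_mb', 'first'))
)
per_file[(per_file['url_count'] > 50_000) | (per_file['size_mb'] > 50)]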

<lastmod>, <changefreq> and <priority>

Sitemaps offer a number of optional tags aimed at informing search engines about crawling patterns or providing information on the current site architecture.

Some have been recently deprecated, whereas others are still valid.

As anticipated, Google dismissed <priority> and <changefreq>, whilst it still uses <lastmod> alongside the mandatory <loc>.

<lastmod> is a valuable element that helps search engines schedule crawls to known URLs.

As confirmed from a recent update on Google’s documentation, the <lastmod> element should be in a supported date format, and its value should accurately reflect the page’s last modification date to maintain trust with search engines.

Using <lastmod> helps highlight updated pages in the sitemap and notify Google about updates.

To get a grip on the <lastmod> usage, we are going to use the md() helper (based on the display_markdown module) that we defined earlier.

#How many URLs have a date in the URL

md(f"## URLs that have lastmod implemented: {sitemap['lastmod'].notna().sum():,} ({sitemap['lastmod'].notna().mean():.1%})")

URLs that have lastmod implemented: 3,136 (99.5%)

💡Pro Tip

XML sitemap attributes can come with different nomenclature. Instead of “lastmod”, use whatever name the column containing the last modified date has in your sitemap.

What about the other 0.5% of pages that don’t have a <lastmod>?

You can inspect them by filtering missing values

sitemap[sitemap['lastmod'].isna()]['loc'].tolist()
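
Beyond checking that <lastmod> is present, you may also want a quick sanity check of the values themselves. Here is a small sketch, assuming the lastmod column can be parsed as dates:

# Parse lastmod defensively and flag suspicious values
lastmod = pd.to_datetime(sitemap['lastmod'], errors='coerce', utc=True)
now = pd.Timestamp.now(tz='UTC')

print(f'lastmod values in the future: {(lastmod > now).sum()}')
print(f'Most recent lastmod: {lastmod.max()}')
print(f'Oldest lastmod: {lastmod.min()}')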

Now we want to audit existing <changefreq> and <priority>.

Google does not use such elements either for NLP purposes or for rankings.

  • <changefreq> is an element that overlaps with the concept of <lastmod>
  • <priority> is very subjective, hence may not accurately represent a page’s actual priority compared to other pages on a website.

Nevertheless, these fields are very popular with large publishers and eCommerce sites that dynamically inject fresh content or products into their websites.

Despite Google’s deprecation of the values, many are still using them at full capacity.

#check on priority
md(f"## URLs that have priority implemented: {sitemap['priority'].notna().sum():,} ({sitemap['priority'].notna().mean():.1%})")

#check on changefreq
md(f"## URLs that have changefreq implemented: {sitemap['changefreq'].notna().sum():,} ({sitemap['changefreq'].notna().mean():.1%})")

URLs that have priority implemented: 14,179 (100.0%)

URLs that have changefreq implemented: 14,179 (100.0%)

Inspect Publishing Trends from the XML sitemap

With the aid of Plotly, we can visualize the publishing trends of the website through the <lastmod> tag, in case it exists in the XML sitemap.

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=sitemap['lastmod'].sort_values(),
        y=np.arange(1, len(sitemap) + 1) / len(sitemap),
        mode='markers',
        marker=dict(size=15, opacity=0.8, color='gold'),
        name='ECDF'
    )
)

fig.add_shape(
    type='rect',
    x0='2022-01-01', x1='2022-12-31', y0=0, y1=1,
    fillcolor='gray', opacity=0.2,
    xref='x', yref='paper',
    layer='below',
    name='No publishing',
)

fig.add_shape(
    type='rect',
    x0='2023-01-01', x1='2023-07-05', y0=0, y1=1,
    fillcolor='gray', opacity=0.2,
    xref='x', yref='paper',
    layer='below',
    name='Frequent publishing',
)

fig.update_layout(
    title='{your_client_site} publishing trends<br>(lastmod tags of XML sitemap)',
    template='plotly_dark',
    height=500
)

fig.show()
Publishing trends plotted on a time series chart with Plotly

The above time series illustrates a growing publishing trend. This suggests that the website has been adding pages steadily over the past couple of years, despite a few breaks, as you can see from the shaded areas behind the gold markers.
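
If you prefer an aggregated view, a complementary sketch (again assuming lastmod parses cleanly as dates) is to count modified URLs per month and plot them as a bar chart:

# Count URLs by month of last modification
pub = sitemap.dropna(subset=['lastmod']).copy()
pub['lastmod'] = pd.to_datetime(pub['lastmod'], errors='coerce', utc=True)

monthly = (
    pub.set_index('lastmod')
    .resample('M')['loc']
    .count()
    .reset_index(name='urls')
)

px.bar(monthly, x='lastmod', y='urls',
       title='URLs modified per month (lastmod)',
       template='plotly_dark')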

Pagination Sequences on XML sitemaps

As anticipated, it is recommended to avoid cluttering an XML sitemap with paginated series that bring little to no value to your organic rankings and business goals. You can find out if an XML sitemap contains pagination by searching for pagination parameters (such as p= or page=) in the loc column of our sitemap.

This will return the previous data frame with an extra column called “Pagination” holding boolean values that answer the main question.

sitemap['Pagination'] = sitemap['loc'].str.contains(r'/p=|\?page=', regex=True, na=False)

if sitemap['Pagination'].any():
    print('Pagination found')
else:
    print('No Pagination')
    sitemap.to_csv('sitem.csv', index=False)

sitemap.head()
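
If paginated URLs do show up, a small sketch like the one below can isolate them for review and export a cleaned-up list (the output file name is arbitrary):

# Review paginated URLs and export the sitemap without them
paginated = sitemap[sitemap['Pagination']]
print(f'Paginated URLs found: {len(paginated)}')

sitemap[~sitemap['Pagination']].to_csv('sitemap_without_pagination.csv', index=False)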

External Link Ratio

To enhance the audit, we can examine the backlink influence of each URL listed in the XML sitemap.

To do this, we first define the homepage address of the target domain so we can tell internal links apart from external ones.

homepage = 'https://www.halfords.com'
def count_links( page_url, domain ):
    """Given input page_url, output the total number of outbound links"""
    links_internal = {}
    links_external = {}
    
    # download the html
    res = requests.get(page_url)
    if "html" not in res.headers.get('Content-Type', ''):
        # not an HTML page (e.g. an image or PDF)
        return {'parseable': False}

    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    
    for a in soup.find_all('a'):
        link = a.get('href')

        # skip missing
        if link is None:
            continue

        # remove params
        link = link.split("?")[0]
        # remove shortcuts
        link = link.split("#")[0]

        # skip missing
        if (link is None) or (link == ''):
            continue

        if (domain in link) or (len(link)>1 and link[0:1]=="/") or (len(link)>2 and link[0:2]=="./"):
            # is internal
            links_internal[link]= links_internal.get(link,0) + 1
        else:
            # external
            links_external[link]= links_external.get(link,0) + 1

    return {"parseable": True, "external": links_external, "internal": links_internal }


domain = urllib.parse.urlparse(homepage).netloc

# test one url
page_url = sitemap.iloc[0]['loc']
print(page_url)

links = count_links( page_url, domain )
links

This was just a test, but we can’t go much further unless we want to overload the Google Colab CPU.

Due to the vast number of URLs contained in our XML sitemap, it would be detrimental to the memory of our Colab runtime to parse outbound links for every single URL.

Small sites (<10,000 pages) usually have limited sitemaps with fewer submitted URLs. If this is your case, it might be reasonable to take a random sample of 25% of pages and cap the output at no more than 40 sampled pages.

Please bear in mind that this section may not be useful if you’re running an international website with multiple sitemaps (e.g. eCommerce).

# sample 25% of the site
sample_size = 0.25 
# not more than n number of pages
max_n_samples = 40

Let’s run the script and store the outcomes into a Pandas data frame.

# table of pages to test
subset_of_sitemap_df = sitemap.sample(min(max_n_samples, round(sample_size * len(sitemap))))

# get domain
domain = urllib.parse.urlparse(homepage).netloc

# dictionary to hold the results
data = {'page_url': [], 'external_links': []}

# get the count of external links per sampled page (one request per page)
for index, row in subset_of_sitemap_df.iterrows():

    page_url = row['loc']

    # count outbound links
    links = count_links(page_url, domain)

    # keep track of the external links per page
    if links.get('parseable'):
        data['page_url'].append(page_url)
        data['external_links'].append(len(links['external']))

# convert dictionary to dataframe
df = pd.DataFrame(data)
df.to_excel('external_links_on_XML.xlsx', index=False)
df.head()
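
To put the sampled numbers into perspective, a short sketch can summarise the distribution of external links across the sampled pages:

# Summarise external link counts across the sampled pages
print(f"Sampled pages: {len(df)}")
print(f"Average external links per page: {df['external_links'].mean():.1f}")
print(f"Max external links on a single page: {df['external_links'].max()}")

df.sort_values('external_links', ascending=False).head(10)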

🔦 Shout out to Alton Alexander for the inspiration coming from one of his brilliant workarounds on SEO and data science. You can find the full code for the analysis of sitewide link quality on GitHub.

Inspect the URL structure of Submitted Pages

One of the advantages of this Python framework is that it gives you a first-hand look at your target website’s URL structure.

This is possible using Advertools and its built-in functions.

sitemap_url = adv.url_to_df(sitemap["loc"].fillna(''))
sitemap_url.head()

Here’s an excerpt of what you might be able to retrieve at the moment.

deconstruction of the site tree from a sitemap

At a glance, we can get an overview of the top directories used in the sitemap.

Depending on the type of file, Adviz will help you identify the most frequent directories/sub-directories, categories, and the most used words per directory included in the sitemap.

Top Geo Location Values

The first directory of a URL often indicates geolocation.

It’s not rare to stumble across eCommerce shops with multiple XML sitemaps containing an unfathomable list of URLs serving anything but the main market or language.

Let’s say that for this example we’re auditing an XML sitemap for a large eCommerce site with the following URL structure:

tw/sitemap_index.xml

You’ll judge for yourself what the XML sitemap may contain instead.

adviz.value_counts_plus(sitemap_url['dir_1'], size=20, name='Market')
count of dir_1 of the URL structure presented in an XML sitemap

Now, let’s plot the count with a polished histogram using plotly.express

country = sitemap_url.groupby('dir_1').url.count().sort_values(ascending=False).reset_index()
country.rename(columns=
 {'dir_1':'country',
  'url': 'count'},
               inplace=True)

#plot a histogram
px.histogram(country,
             x='country',
             y='count',
             title='Ratio of Markets in XML sitemap',
             template='plotly_dark')
histogram showing the distribution of markets in an XML sitemap

Top Page Categories

We can inspect the second directory of the URL path from the XML sitemap to get a grip on the top page categories submitted to the attention of Google.

adviz.value_counts_plus(sitemap_url['dir_2'], size=20, name='Main Directory')
count of dir_2 of the URL structure presented in an XML sitemap

Plotting the count with Plotly Express, you should get a histogram similar to this one.

parent = sitemap_url.groupby('dir_2').url.count().sort_values(ascending=False).reset_index()
parent.rename(columns=
 {'dir_2':'parent category',
  'url': 'count'},
               inplace=True)

#plot a histogram

fig = px.histogram(
    parent.head(25),
    x='parent category',
    y='count',
    title='Main Parent Category pages in the XML sitemap',
    template='plotly_dark'
)

fig.update_layout(
    xaxis_tickangle=25
)

fig.show()
histogram showing the distribution of the most common categories in an XML sitemap

Most common n-grams or bi-grams

Coming back to the Halfords sitemap (the only one crawled with Advertools so far), we can look at the most frequent words in the second directory of the URLs included in the sitemap.

(adv.word_frequency(
    sitemap_url['dir_2']
    .dropna()
    .str.replace('-', ' '),
    rm_words=['to', 'for', 'the', 'and', 'in', 'of', 'a', 'with', 'is'])
 .head(15)
 .style.format({'abs_freq': '{:,}'}))
most used n-grams from URLs on a sitemap

🔦 Shout out to Elias Dabbas for inspiring this specific section. You can find similar XML sitemap file manipulation and analysis in his Foreign Policy XML sitemap analysis on Kaggle.

Spotting Irrelevant Country Pages from an XML sitemap folder

Despite not posing a threat to indexability, having multiple URLs that drift from the main purpose of the XML sitemap file may confuse search engines when picking the most relevant pages you’ve submitted.

Building on the current example, we can fetch and download the submitted URLs that drift from the main purpose of the XML sitemap.

# Use boolean indexing to filter out URLs that DO NOT target Taiwan (e.g.)
filtered_urls = sitemap[~sitemap['loc'].str.contains('/tw/', na=False)]

# Create a new dataframe with the filtered rows
new_dataframe = pd.DataFrame(filtered_urls)
new_dataframe.to_excel('Filtered XML.xlsx',index=False)
new_dataframe['loc'].head()

💡BONUS

eCommerce is having a great time nowadays and optimizing the right features can definitely help you boost your online store visibility. Here are 3 common issues with images on luxury eCommerce.

Robots.txt

We can ask the machine to return the robots.txt file’s status code for the domain.

import requests

r = requests.get("https://www.halfords.com/robots.txt")
r.status_code

If the response status code is 200, it means there is a robots.txt file for user-agent-based crawling control.
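
Since Advertools is already imported, you can also parse the robots.txt into a data frame to review its directives and any Sitemap references it declares. A minimal sketch:

# Parse the robots.txt into a data frame and list any Sitemap directives
robots_df = adv.robotstxt_to_df('https://www.halfords.com/robots.txt')
robots_df[robots_df['directive'].str.contains('sitemap', case=False, na=False)]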

By the way, we can dig even deeper into the robots.txt analysis by extending the audit to all of the URLs listed in the sitemap.

sitemap_df_robotstxt_check = adv.robotstxt_test("https://www.halfords.com/robots.txt", urls=sitemap["loc"], user_agents=["*"])

sitemap_df_robotstxt_check["can_fetch"].value_counts()

What we have just done is perform the audit for the generic user agent (*). You should get something like this:

Outcome of the bulk robots.txt audit of the URLs in the sitemap

As you can see, we received a True value meaning that all of the URLs in the sitemap.xml are crawlable.

In case the value turned out to be False, it means that some URLs are being disallowed.

You can identify them by running the following lines of code

pd.set_option("display.max_colwidth",255)
sitemap_df_robotstxt_check[sitemap_df_robotstxt_check["can_fetch"] == False]

Status Code

You can continue your journey through the sitemap.xml by inspecting the response codes returned by the URLs it includes.

To do that, we run a crawl with Advertools, making sure that the built-in crawler checks the headers of every URL found in the sitemap.

adv.crawl_headers(sitemap["loc"], output_file="sitemap_df_header.jl")
df_headers = pd.read_json("sitemap_df_header.jl", lines=True)
df_headers["status"].value_counts()

Next, we load the resulting JSON Lines file into a Pandas data frame to display the URL status codes as a clean table.

df_headers = pd.read_json('sitemap_df_header.jl', lines=True)
df_headers.head()

Once you run the script, here is what you might get.

To wrap up the status code discovery, we want to make sure that the sitemap.xml does not contain any 404 URLs. According to Google Search Central, an XML sitemap should contain only relevant 2xx URLs and avoid pages returning error response codes.

df_headers[df_headers["status"] == 404]

It goes without saying that if the script returns nothing, there are no 404s.
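
While you are at it, it can be worth flagging redirected URLs as well, since a sitemap should ideally list final destination URLs only. A small sketch:

# Flag sitemap URLs that respond with a redirect (3xx) status
df_headers[df_headers['status'].between(300, 399)][['url', 'status']]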

Canonicalization

Using canonicalization hints in the response headers is beneficial for crawling and indexing.

If you want to include a canonicalization hint on the HTTP header, you need to guarantee that the HTML canonical tags and the response header canonical tags are the same.

To untangle this knot, we are going to trawl through the HTTP headers and chase down the resp_headers_link

df_headers.columns

🚨 WARNING 🚨

In case the script above doesn’t return a resp_headers_link column, it means that a response header canonical is not in place. You want to jump to the next section.

Given that the script returned a resp_headers_link column, we are able to compare the response header canonical to the HTML canonical.

df_headers["resp_headers_link"]
print("Checking any links within the Response Header")

df_headers["response_header_canonical"] = df_headers["resp_headers_link"].str.extract(r"([^<>][a-z:\/0-9-.]*)")
(df_headers["response_header_canonical"] == df_headers["url"]).value_counts()

If the result is False, just like in the example, it means that the response header canonical does not equal the URL canonical on the audited website.

If the result is True, obviously the response header canonical equals the URL canonical.
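
If you do get False values, a short sketch can list the rows where the two values differ:

# List URLs whose response header canonical differs from the URL itself
mismatch = df_headers[
    df_headers['response_header_canonical'].notna()
    & (df_headers['response_header_canonical'] != df_headers['url'])
]
mismatch[['url', 'response_header_canonical']].head()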

💡BONUS

You can automate a canonical audit using Python. Learn more and cut off plenty of manual and boring work!

X-Robots-Tag

Indexing directives are sometimes applied temporarily at the HTTP header level, and as a result they may affect the URLs listed in the XML sitemap.

Because of that, you want to go through the sitemap URLs and check whether such tags exist within the HTTP headers.

def robots_tag_checker(dataframe: pd.DataFrame):
    """Return the first column whose name mentions 'robots', if any."""
    for col in dataframe.columns:
        if "robots" in col:
            return col
    return "There is no robots tag"

robots_tag_checker(df_headers)

Once again, the script returns one of two outcomes.

There is no robots tag = fair enough, jump to the next section

A column name containing robots = you may want to dig deeper and see what it refers to

To narrow down, we might want to check if these X-Robots-Tag headers come with a noindex directive (the exact column name depends on your crawl output, so check df_headers.columns first).

💡 TIPS 💡

In the Google Search Console Coverage report, those URLs normally appear as “Submitted – marked as noindex”. Contradicting indexing and canonicalization hints and signals might make a search engine ignore all of them and trust the user-declared signals less.

df_headers["response_header_x_robots_tag"].value_counts()
df_headers[df_headers["response_header_x_robots_tag"] == "noindex"]

Audit Meta Tag Robots

Even if a web page is not disallowed by any constraining robots.txt directive, it can still be blocked from indexing by its HTML meta robots tags.

Hence, checking the HTML meta tags is necessary and very much recommended for better indexation and crawling.

A handy method I fancy is using custom XPath selectors to audit the HTML meta tags of the URLs uploaded to a sitemap.

For the purposes of this tutorial, we are going to leverage Advertools to run an additional crawl drilled down on a specific XPath selector aimed at extracting all the robots commands from the URLs in the sitemap.

adv.crawl(
    url_list=sitemap["loc"][:1000],
    output_file="meta_command_audit.jl",
    follow_links=False,
    # extract all the robots commands from the URLs in the sitemap
    xpath_selectors={"meta_command": "//meta[@name='robots']/@content"},
    # we have capped the crawl at 1,000 URLs from the sitemap
    custom_settings={"CLOSESPIDER_PAGECOUNT": 1000},
)

df_meta_check = pd.read_json("meta_command_audit.jl", lines=True)

df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True).value_counts()

As usual, a dual output will be provided. Having instructed the machine to spot the actual robots directives, we will learn whether the sitemap hosts URLs that threaten the crawling and indexing processes.

True = there are URLs with “noindex” or “nofollow” directives uploaded to the sitemap

False = there are no URLs with “noindex” or “nofollow” directives uploaded to the sitemap
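
If the check returns True, a quick sketch can surface the offending URLs together with their directives:

# Show sitemap URLs carrying a noindex or nofollow meta robots directive
df_meta_check[
    df_meta_check['meta_command'].str.contains('nofollow|noindex', regex=True, na=False)
][['url', 'meta_command']]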

💡 TIP 💡

If your Search Console property raises indexing issues stemming from a specific URL despite it displaying an index,follow directive, you should inspect the <body> section of that page to assess whether a noindex,follow is actually in place.

H/T to Kristina Azarenko and her tests about a curious case of a noindexed page.

Moving back to our meta robots audit of the target sitemap’s URLs, we can now get a final roundup of the meta robots tags applied.

To run this check, I used the XML sitemap of seodepths, as the script handles sitemaps with fewer URLs much better.

df_meta_check[df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True) == False][["url", "meta_command"]]

Audit Duplicate URLs

This crazy journey through an XML sitemap culminates with a screening of duplicate URLs.

The following script reports the ratio between the original number of URLs in the sitemap and the number remaining after de-duplication, that is, after duplicates have been removed.

print(f'Original: {sitemap_raw.shape}')
sitemap = sitemap_raw.drop_duplicates(subset=['loc'])
print(f'After de-duplication: {sitemap.shape}')
duplicate URLs from sitemap

The following will recap the number of duplicate URLs identified

duplicated_urls = sitemap_raw['loc'].duplicated()
md(f'## Duplicated URLs: {duplicated_urls.sum():,} ({duplicated_urls.mean():.1%})')

Should your sitemap contain duplicated pages, the following script will use a Pandas pivot table to tell you how many there are per sitemap file.

pd.pivot_table(sitemap_raw[sitemap_raw["loc"].duplicated()], index="sitemap", values="loc", aggfunc="count").sort_values(by="loc", ascending=False)

And most importantly, display URLs that cause duplicate issues.

sitemap[sitemap["loc"].duplicated() == True]

Compare Multiple XML Sitemaps

Large eCommerce sites may come with a ton of XML sitemaps. They may be submitted via Google Search Console or via robots.txt. In truth, it’s not rare to detect sitemaps uploaded only via the robots.txt file and other files submitted exclusively via Google Search Console.

In case you’re in such a messy situation, it would be helpful to learn at a glance whether they are unique files or just duplicates. In other words, you want to know whether each document contains the same URLs that might be found elsewhere.

Fear no more, Advertools and Pandas will come to the rescue.

You just need to scrape each similar XML sitemap, stack their <loc> columns with the pd.concat function, and count how many URLs are duplicated across the files.


# Fetch both sitemaps and keep only the <loc> column
f = adv.sitemap_to_df("https://www.example.com/media/google_sitemap_9.xml")
sitemap_url_9 = f[['loc']]

g = adv.sitemap_to_df("https://www.example.com/media/google_sitemap_8.xml")
sitemap_url_8 = g[['loc']]

# Stack the <loc> columns on top of each other
combined = pd.concat([sitemap_url_9, sitemap_url_8], axis=0, ignore_index=True)

# URLs that appear in more than one sitemap file
duplicate_urls = combined[combined.duplicated(subset='loc', keep=False)]
print(f"URLs shared across the two sitemaps: {duplicate_urls['loc'].nunique()}")

# Share of each sitemap's URLs that also appear in the other file
col_percent_9 = sitemap_url_9['loc'].isin(sitemap_url_8['loc']).mean() * 100
col_percent_8 = sitemap_url_8['loc'].isin(sitemap_url_9['loc']).mean() * 100

# Return the result in percentage format
print(f"Percentage of sitemap_url_9 URLs also present in sitemap_url_8: {col_percent_9:.2f}%")
print(f"Percentage of sitemap_url_8 URLs also present in sitemap_url_9: {col_percent_8:.2f}%")
percentage of duplicate URLs in different sitemaps

Conclusion

Retrieving insights and juicy data with Python can be a time saver, especially if you’re working on a small to medium-sized website.

Although it might appear daunting at first glance, once you start getting along with the code your SEO research will become a proper piece of cake.


FAQ

How do I export a sitemap to Excel?

To export the sitemap to a CSV file you can open in Excel, execute the df.to_csv function, appending the directory path where you want to store the sitemap on your PC.

sitemap_url.to_csv(r'YOUR-PATH-DIRECTORY.csv', index=False, header=True)

You need to append and execute this command right after the lines of code that scrape the sitemap.
Please note that we are using “sitemap_url” as the custom name of the data frame.

How to do a sitemap audit?

To do a sitemap audit, first, ensure that your sitemap submission is up-to-date. Then, use a sitemap audit tool such as Screaming Frog, Google Search Console, or SEMrush to check for errors, missing pages, and duplicate content. Finally, review and fix any issues found to improve your site’s crawlability and search engine visibility.

What is sitemap in Python?

In Python, a sitemap is an XML file that lists the URLs for a website along with additional metadata about each URL such as the frequency of updates and the date of the last modification. Using a sitemap can improve the efficiency of search engine crawlers and thus enhance your website’s visibility in search results

Further Readings

I highly recommend following Elias Dabbas for Python tips and actionable tutorials drilled on SEO.

This post was inspired by his take on the foreign policy XML sitemap analysis from Kaggle
