🍁XML Sitemap Audit with Python

Reading time: 25 Minutes

XML sitemaps are designed to make life easier for search engines by providing an index of a site’s URLs.

However, they also play a crucial role in gaining a competitive edge, as they quickly deliver an overview of the pages a website considers most relevant and wants to bring to the attention of web crawlers.

In this post, you will learn how to automate an XML sitemap audit to improve your SEO decision-making.


💡Shortcut

Jump straight to the Google Colab script if you are in a hurry.


XML Sitemap: What It Is and How to Audit It

A sitemap is a repository of the most important pages on your website. It helps Google and other search engines crawl your site more efficiently.

In the words of Google:

A sitemap is a file where you provide information about the pages, videos, and other files on your site, and the relationships between them.

A sitemap helps search engines discover URLs on your site, but it doesn’t guarantee that all the items in your sitemap will be crawled and indexed.

Who needs an XML Sitemap?

When your website’s pages are linked correctly, Google is usually able to find and index most of them. This means that all the important pages on your website can be accessed through navigation options, such as your site’s menu or links placed on other pages.

Despite optimal internal linking, an XML sitemap can help Google find and index larger or more complex websites or specialized files effectively.

Some websites may benefit more from an XML sitemap than others:

  • Your site is really large. As a result, it’s more likely that Google’s web crawlers will overlook some of your new or recently updated pages.
  • Your site has a large archive of content pages that are isolated or not well linked to each other. If your site pages don’t naturally reference each other, you can list them in a sitemap to ensure that Google doesn’t overlook some of your pages.
  • Your site is new and has few external links to it. Googlebot and other web crawlers crawl the web by following links from one page to another. As a result, Google might not discover your pages if no other sites link to them.
  • Your site has a lot of rich media content (video, images) or is shown in Google News. If provided, Google can take additional information from sitemaps into account for search, where appropriate.

Some websites may not receive equivalent benefits from an XML sitemap:

  • Your site is “small”. The site should be about 500 pages or fewer.
  • Your site is comprehensively linked internally. This means that Google can find all the important pages on your site by following links starting from the homepage.
  • You don’t have many media files (video, image) or news pages that you want to show in search results. Sitemaps can help Google find and understand video and image files, or news articles, on your site. If you don’t need these results to appear in images, videos, or news results, you might not need a sitemap.

XML Sitemap Best Practices

Before jumping into the analysis, it is worth reviewing a few points so you can approach the audit with a clear understanding of what a well-formed XML sitemap should look like (a minimal example of a compliant sitemap entry follows the list).

  • Make sure to use consistent URLs. Don’t omit the protocol (HTTPS) and don’t mix URLs with and without www; always list absolute URLs exactly as your site is served.
  • Don’t include session IDs and other user-dependent identifiers in the URLs in your sitemap. This reduces duplicate crawling of those URLs.
  • Break up large sitemaps into smaller files: a sitemap can contain up to 50,000 URLs and must not exceed 50MB uncompressed.
  • List only canonical URLs in your sitemaps. If you have multiple versions of a page, list in the sitemap only the one you prefer to appear in search results.
  • Point to only one website version in a sitemap. If you have different URLs for mobile and desktop versions of a page, use one version in a sitemap.
  • Tell Google about alternate language versions of a URL using hreflang annotations. Use the hreflang tag in a sitemap to indicate the alternate URLs if you have different pages for different languages or regions.
  • Encode your sitemap with UTF-8 and use ASCII characters to ensure it is readable by Google. This is usually done automatically if you are using a script, tool, or log file to generate your URLs.
  • Avoid submitting pagination. URLs in the sitemap should only be pages that you want to rank, and paginated series do not usually fit this criterion. Including paginated series in a sitemap increases the chance that they rank on the SERP and wastes crawl requests, which can compromise your crawl efficiency.
  • Keep in mind sitemaps are a recommendation to Google about which pages you think are important; Google does not pledge to crawl every URL in a sitemap.
  • Google ignores <priority> and <changefreq> values.
  • Google uses the <lastmod> value if it’s consistently and verifiably (for example by comparing to the last modification of the page) accurate.
  • The position of a URL in a sitemap is not important; Google does not crawl URLs in the order in which they appear in your sitemap.
  • Submit your sitemap to Google. Google examines sitemaps only when it first discovers them or when it is notified of an update. Despite plenty of ways to make your sitemap available to Google, it is best practice to submit it through the Search Console.
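
For reference, here is a minimal sketch of what a single compliant sitemap entry looks like, generated with the lxml library that we also import later in this tutorial (the URL and date are placeholders):

from lxml import etree

SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'

# Build a minimal <urlset> with a single <url> entry (placeholder values)
urlset = etree.Element(f'{{{SITEMAP_NS}}}urlset', nsmap={None: SITEMAP_NS})
url = etree.SubElement(urlset, f'{{{SITEMAP_NS}}}url')
etree.SubElement(url, f'{{{SITEMAP_NS}}}loc').text = 'https://www.example.com/sample-page'
etree.SubElement(url, f'{{{SITEMAP_NS}}}lastmod').text = '2023-07-05'

print(etree.tostring(urlset, pretty_print=True,
                     xml_declaration=True, encoding='UTF-8').decode())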

💡Pro Tip

Referencing XML sitemaps in your robots.txt files is not an official requirement.
This is an SEO myth based on the assumption that placing the sitemap in the robots.txt consolidates the file’s submission signals to Google.

Instead, this may expose your site to security breaches as someone could scrape important information from your website.

Submitting the XML sitemap from your Google Search Console property is enough.

Audit XML Sitemaps in Python

Working agency-side, there are many opportunities to tackle a wide range of SEO tasks. When screening the XML sitemaps of a client website, I like to follow a machine learning-based playbook that allows me to process huge chunks of data and automate most of the boring stuff.
Before kicking off this tutorial on how to automate a sitemap audit, there are a few premises to be aware of.

For the purpose of this tutorial, I am going to test the XML sitemap from Halfords, the largest British retailer of motoring and cycling products and services. 


Install and Import Libraries

We need to install and import a couple of fundamental libraries.

%%capture
!pip install advertools dash dash_bootstrap_components jupyter_dash plotly bs4 matplotlib datasets adviz dash_bootstrap_templates

The following packages deserve a brief introduction:

  • Advertools: A Python library designed to assist in SEO automation projects. Used here to retrieve the website’s sitemap.xml.
  • Pandas: A library for data manipulation and analysis in Python. Imported for easy data manipulation maneuvers.
  • Adviz: A Python library built on top of Advertools for clean data visualization and basic descriptive analysis.
  • Plotly: A high-level, declarative charting library for creating highly visual plots.

import os
import requests
import urllib.parse
from bs4 import BeautifulSoup
import time
import warnings

import advertools as adv
import adviz
import pandas as pd
import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go

from IPython.display import display, display_html, display_markdown, HTML

from lxml import etree

import matplotlib
import sklearn

from dash_bootstrap_templates import load_figure_template
import dash
from dash import dcc, html  # dash_core_components / dash_html_components are deprecated

# Set display options
pd.options.display.max_columns = None
warnings.filterwarnings("ignore")
display(HTML("<style>.container { width:100% !important; }</style>"))

def md(text):
    return display_markdown(text, raw=True)

# Check package versions
for pkg in [adv, pd, plotly, sklearn]:
    print(f'{pkg.__name__:-<30}v{pkg.__version__}')

load_figure_template(['darkly','bootstrap','flatly','cosmo'])

Fetch URLs from the Sitemap

The next step is the actual scraping of our sitemap.xml.

Having imported the required libraries, we are now in a position to set up a few lines of code that fetch and parse the sitemap.

💡 Append either /sitemap.xml or /sitemap_index.xml to the end of the domain to find out whether the target site has a sitemap.xml
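
If you want to automate that quick check, here is a minimal sketch; the paths probed are just common conventions, not an exhaustive list:

# Probe a couple of common sitemap locations and print the response codes
for path in ['/sitemap.xml', '/sitemap_index.xml']:
    url = 'https://www.halfords.com' + path
    print(url, requests.get(url).status_code)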

Next, we fetch the target site’s sitemap.xml using the adv.sitemap_to_df() function from Advertools to convert the sitemap into a data frame.

The script then saves the result to a CSV file and reads it back, parsing the “lastmod” and “download_date” columns as dates to make them easier to work with.

Finally, the script drops duplicate URLs as a basic data-cleaning step before printing the dataset.

sitemap = adv.sitemap_to_df('https://www.halfords.com/robots.txt',
                            max_workers=8,
                            recursive=True)
sitemap.to_csv('sitemap_collection.csv',index=False)

#parse the sitemaps

sitemap_raw = pd.read_csv('sitemap_collection.csv', parse_dates=['lastmod', 'download_date'], low_memory=False)

#overview duplicates
print(f'Original: {sitemap_raw.shape}')
sitemap = sitemap_raw.drop_duplicates(subset=['loc'])
print(f'After removing dupes: {sitemap.shape}')
sitemap.to_csv('sitemap.csv',index=False)
sitemap
scraped URLs from a sitemap

Audit the Size and Number of URLs on the XML Sitemap

A single XML sitemap can contain up to 50,000 URLs and must not exceed 50MB uncompressed, therefore we need to make sure our files do not break these limits.

Check how many URLs a single file contains

sitemap['sitemap'].value_counts()

Check the size of the XML sitemap

sitemap['sitemap_size_mb'].drop_duplicates().sort_values(ascending=False)
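
To flag any single file that breaks those limits automatically, here is a small sketch that assumes the sitemap, loc and sitemap_size_mb columns returned by adv.sitemap_to_df() are still in place:

# Aggregate URL count and size per sitemap file and flag rule breakers
per_file = (
    sitemap
    .groupby('sitemap')
    .agg(url_count=('loc', 'size'), size_mb=('sitemap_size_mb', 'first'))
)
per_file[(per_file['url_count'] > 50_000) | (per_file['size_mb'] > 50)]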

<lastmod>, <changefreq> and <priority>

Sitemaps offer a number of optional tags aimed at informing search engines about crawling patterns or providing information on the current site architecture.

Some have been recently deprecated, whereas others are still valid.

As anticipated, Google dismissed <priority> and <changefreq>, whilst it still uses <lastmod> alongside the mandatory <loc>.

<lastmod> is a valuable element that helps search engines schedule crawls to known URLs.

As confirmed from a recent update on Google’s documentation, the <lastmod> element should be in a supported date format, and its value should accurately reflect the page’s last modification date to maintain trust with search engines.

Using <lastmod> helps highlight updated pages in the sitemap and notify Google about updates.

To get a grip on the <lastmod> usage, we are going to use the md() helper (based on the display_markdown module) that we defined earlier.

#How many URLs have a date in the URL

md(f"## URLs that have lastmod implemented: {sitemap['lastmod'].notna().sum():,} ({sitemap['lastmod'].notna().mean():.1%})")

URLs that have lastmod implemented: 3,136 (99.5%)

💡Pro Tip

XML sitemap attributes can come with different nomenclature. Instead of “lastmod”, use whatever name the column containing the last modified date has in your sitemap.

What about the other 0.5% of pages that don’t have a <lastmod>?

You can inspect them by filtering missing values

sitemap[sitemap['lastmod'].isna()]['loc'].tolist()
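
Beyond checking that <lastmod> is present, you may also want a quick sanity check of the values themselves. Here is a small sketch, assuming the lastmod column can be parsed as dates:

# Parse lastmod defensively and flag suspicious values
lastmod = pd.to_datetime(sitemap['lastmod'], errors='coerce', utc=True)
now = pd.Timestamp.now(tz='UTC')

print(f'lastmod values in the future: {(lastmod > now).sum()}')
print(f'Most recent lastmod: {lastmod.max()}')
print(f'Oldest lastmod: {lastmod.min()}')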

Now we want to audit existing <changefreq> and <priority>.

Google does not use such elements either for NLP purposes or for rankings.

  • <changefreq> is an element that overlaps with the concept of <lastmod>
  • <priority> is very subjective, hence may not accurately represent a page’s actual priority compared to other pages on a website.

Nevertheless, these fields are very popular with large publishers and eCommerce sites that dynamically inject fresh content or products into their websites.

Despite Google’s deprecation of the values, many are still using them at full capacity.

#check on priority
md(f"## URLs that have priority implemented: {sitemap['priority'].notna().sum():,} ({sitemap['priority'].notna().mean():.1%})")

#check on changefreq
md(f"## URLs that have changefreq implemented: {sitemap['changefreq'].notna().sum():,} ({sitemap['changefreq'].notna().mean():.1%})")

URLs that have priority implemented: 14,179 (100.0%)

URLs that have changefreq implemented: 14,179 (100.0%)

Inspect Publishing Trends from the XML sitemap

With the aid of Plotly, we can visualize the publishing trends of the website through the <lastmod> tag, in case it exists in the XML sitemap.

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=sitemap['lastmod'].sort_values(),
        y=np.arange(1, len(sitemap) + 1) / len(sitemap),
        mode='markers',
        marker=dict(size=15, opacity=0.8, color='gold'),
        name='ECDF'
    )
)

fig.add_shape(
    type='rect',
    x0='2022-01-01', x1='2022-12-31', y0=0, y1=1,
    fillcolor='gray', opacity=0.2,
    xref='x', yref='paper',
    layer='below',
    name='No publishing',
)

fig.add_shape(
    type='rect',
    x0='2023-01-01', x1='2023-07-05', y0=0, y1=1,
    fillcolor='gray', opacity=0.2,
    xref='x', yref='paper',
    layer='below',
    name='Frequent publishing',
)

fig.update_layout(
    title='{your_client_site} publishing trends<br>(lastmod tags of XML sitemap)',
    template='plotly_dark',
    height=500
)

fig.show()
Publishing trends plotted on a time series chart with Plotly

The above time series illustrates a growing publishing trend. This suggests that the website has been adding pages steadily over the past couple of years, despite a few breaks, as you can see from the shaded areas behind the gold markers.
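
If you prefer an aggregated view, a complementary sketch (again assuming lastmod parses cleanly as dates) is to count modified URLs per month and plot them as a bar chart:

# Count URLs by month of last modification
pub = sitemap.dropna(subset=['lastmod']).copy()
pub['lastmod'] = pd.to_datetime(pub['lastmod'], errors='coerce', utc=True)

monthly = (
    pub.set_index('lastmod')
    .resample('M')['loc']
    .count()
    .reset_index(name='urls')
)

px.bar(monthly, x='lastmod', y='urls',
       title='URLs modified per month (lastmod)',
       template='plotly_dark')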

Pagination Sequences on XML sitemaps

As anticipated, it is recommended to avoid cluttering an XML sitemap with paginated series that bring little to no value to your organic rankings and business goals. You can find out if an XML sitemap contains pagination by searching for pagination parameters (such as p= or page=) in the loc column of our sitemap.

This will return the previous data frame with an extra column called “Pagination” holding boolean values that answer the main question.

sitemap['Pagination'] = sitemap['loc'].str.contains(r'/p=|\?page=', regex=True, na=False)

if sitemap['Pagination'].any():
    print('Pagination found')
else:
    print('No Pagination')
    sitemap.to_csv('sitem.csv', index=False)

sitemap.head()
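
If paginated URLs do show up, a small sketch like the one below can isolate them for review and export a cleaned-up list (the output file name is arbitrary):

# Review paginated URLs and export the sitemap without them
paginated = sitemap[sitemap['Pagination']]
print(f'Paginated URLs found: {len(paginated)}')

sitemap[~sitemap['Pagination']].to_csv('sitemap_without_pagination.csv', index=False)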

External Link Ratio

To enhance the audit, we can examine the backlink influence of each URL listed in the XML sitemap.

To do this, we first define the homepage address of the target domain so we can tell internal links apart from external ones.

homepage = 'https://www.halfords.com'
def count_links( page_url, domain ):
    """Given input page_url, output the total number of outbound links"""
    links_internal = {}
    links_external = {}
    
    # download the html
    res = requests.get(page_url)
    if "html" not in res.headers.get('Content-Type', ''):
        # not an HTML page (e.g. an image or PDF)
        return {'parseable': False}

    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    
    for a in soup.find_all('a'):
        link = a.get('href')

        # skip missing
        if link is None:
            continue

        # remove params
        link = link.split("?")[0]
        # remove shortcuts
        link = link.split("#")[0]

        # skip missing
        if (link is None) or (link == ''):
            continue

        if (domain in link) or (len(link)>1 and link[0:1]=="/") or (len(link)>2 and link[0:2]=="./"):
            # is internal
            links_internal[link]= links_internal.get(link,0) + 1
        else:
            # external
            links_external[link]= links_external.get(link,0) + 1

    return {"parseable": True, "external": links_external, "internal": links_internal }


domain = urllib.parse.urlparse(homepage).netloc

# test one url
page_url = sitemap.iloc[0]['loc']
print(page_url)

links = count_links( page_url, domain )
links

This was just a test, but we can’t go much further unless we want to overload the Google Colab CPU.

Due to the vast number of URLs contained in our XML sitemap, it would be detrimental to the memory of our Colab runtime to parse outbound links for every single URL.

Small sites (<10,000 pages) usually have limited sitemaps with fewer submitted URLs. If this is your case, it might be reasonable to take a random sample of 25% of pages and cap the output at no more than 40 sampled pages.

Please bear in mind that this section may not be useful if you’re running an international website with multiple sitemaps (e.g. eCommerce).

# sample 25% of the site
sample_size = 0.25 
# not more than n number of pages
max_n_samples = 40

Let’s run the script and store the outcomes into a Pandas data frame.

# table of pages to test
subset_of_sitemap_df = sitemap.sample(min(max_n_samples, round(sample_size * len(sitemap))))

# get domain
domain = urllib.parse.urlparse(homepage).netloc

# dictionary to hold the results
data = {'page_url': [], 'external_links': []}

# get the count of external links per sampled page (one request per page)
for index, row in subset_of_sitemap_df.iterrows():

    page_url = row['loc']

    # count outbound links
    links = count_links(page_url, domain)

    # keep track of the external links per page
    if links.get('parseable'):
        data['page_url'].append(page_url)
        data['external_links'].append(len(links['external']))

# convert dictionary to dataframe
df = pd.DataFrame(data)
df.to_excel('external_links_on_XML.xlsx', index=False)
df.head()
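
To put the sampled numbers into perspective, a short sketch can summarise the distribution of external links across the sampled pages:

# Summarise external link counts across the sampled pages
print(f"Sampled pages: {len(df)}")
print(f"Average external links per page: {df['external_links'].mean():.1f}")
print(f"Max external links on a single page: {df['external_links'].max()}")

df.sort_values('external_links', ascending=False).head(10)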

🔦 Shout out to Alton Alexander for the inspiration coming from one of his brilliant workarounds on SEO and data science. You can find the full code for the analysis of sitewide link quality on GitHub.

Inspect the URL structure of Submitted Pages

One of the advantages of this Python framework is that it gives you a first-hand look at your target website’s URL structure.

This is possible using Advertools and its built-in functions.

sitemap_url = adv.url_to_df(sitemap["loc"].fillna(''))
sitemap_url.head()

Here’s an excerpt of what you might be able to retrieve at the moment.

deconstruction of the site tree from a sitemap

At a glance, we can get an overview of the top directories used in the sitemap.

Depending on the type of file, Adviz will help you identify the most frequent directories/sub-directories, categories, and the most used words per directory included in the sitemap.

Top Geo Location Values

The first directory of a URL often indicates geolocation.

It’s not rare to stumble across eCommerce shops with multiple XML sitemaps containing an unfathomable list of URLs serving anything but the main market or language.

Let’s say that for this example we’re auditing an XML sitemap for a large eCommerce site with the following URL structure:

tw/sitemap_index.xml

You’ll judge for yourself what the XML sitemap may contain instead.

adviz.value_counts_plus(sitemap_url['dir_1'], size=20, name='Market')
count of dir_1 of the URL structure presented in an XML sitemap

Now, let’s plot the count with a polished histogram using plotly.express

country = sitemap_url.groupby('dir_1').url.count().sort_values(ascending=False).reset_index()
country.rename(columns=
 {'dir_1':'country',
  'url': 'count'},
               inplace=True)

#plot a histogram
px.histogram(country,
             x='country',
             y='count',
             title='Ratio of Markets in XML sitemap',
             template='plotly_dark')
histogram showing the distribution of markets in an XML sitemap

Top Page Categories

We can inspect the second directory of the URL path from the XML sitemap to get a grip on the top page categories submitted to the attention of Google.

adviz.value_counts_plus(sitemap_url['dir_2'], size=20, name='Main Directory')
count of dir_2 of the URL structure presented in an XML sitemap

Plotting the count with Plotly Express, you should get a histogram similar to this one.

parent = sitemap_url.groupby('dir_2').url.count().sort_values(ascending=False).reset_index()
parent.rename(columns=
 {'dir_2':'parent category',
  'url': 'count'},
               inplace=True)

#plot a histogram

fig = px.histogram(
    parent.head(25),
    x='parent category',
    y='count',
    title='Main Parent Category pages in the XML sitemap',
    template='plotly_dark'
)

fig.update_layout(
    xaxis_tickangle=25
)

fig.show()
histogram showing the distribution of the most common categories in an XML sitemap

Most common n-grams or bi-grams

Coming back to the Halfords sitemap (the only one crawled with Advertools so far), we can look at the most frequent words in the second directory of the URLs included in the sitemap.

(adv.word_frequency(
    sitemap_url['dir_2']
    .dropna()
    .str.replace('-', ' '),
    rm_words=['to', 'for', 'the', 'and', 'in', 'of', 'a', 'with', 'is'])
 .head(15)
 .style.format({'abs_freq': '{:,}'}))
most used n-grams from URLs on a sitemap

🔦 Shout out to Elias Dabbas for inspiring this specific section. You can find similar XML sitemap file manipulation and analysis in his Foreign Policy XML sitemap analysis on Kaggle.

Spotting Irrelevant Country Pages from an XML sitemap folder

Despite not posing a threat to indexability, having multiple URLs that drift from the main purpose of the XML sitemap file may confuse search engines when picking the most relevant pages you’ve submitted.

Building on the current example, we can fetch and download the submitted URLs that drift from the main purpose of the XML sitemap.

# Use boolean indexing to filter out URLs that DO NOT target Taiwan (e.g.)
filtered_urls = sitemap[~sitemap['loc'].str.contains('/tw/', na=False)]

# Create a new dataframe with the filtered rows
new_dataframe = pd.DataFrame(filtered_urls)
new_dataframe.to_excel('Filtered XML.xlsx',index=False)
new_dataframe['loc'].head()

💡BONUS

eCommerce is having a great time nowadays and optimizing the right features can definitely help you boost your online store visibility. Here are 3 common issues with images on luxury eCommerce.

Robots.txt

We can ask the machine to return the robots.txt file’s status code for the domain.

import requests

r = requests.get("https://www.halfords.com/robots.txt")
r.status_code

If the response status code is 200, it means there is a robots.txt file for user-agent-based crawling control.
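
Since Advertools is already imported, you can also parse the robots.txt into a data frame to review its directives and any Sitemap references it declares. A minimal sketch:

# Parse the robots.txt into a data frame and list any Sitemap directives
robots_df = adv.robotstxt_to_df('https://www.halfords.com/robots.txt')
robots_df[robots_df['directive'].str.contains('sitemap', case=False, na=False)]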

By the way, we can dig even deeper into the robots.txt analysis by extending the audit to all of the URLs listed in the sitemap.

sitemap_df_robotstxt_check = adv.robotstxt_test("https://www.halfords.com/robots.txt", urls=sitemap["loc"], user_agents=["*"])

sitemap_df_robotstxt_check["can_fetch"].value_counts()

What we have just done is perform the audit for the generic user agent (*). You should get something like this:

Outcome of the bulk robots.txt audit of the URLs in the sitemap

As you can see, we received a True value meaning that all of the URLs in the sitemap.xml are crawlable.

In case the value turned out to be False, it means that some URLs are being disallowed.

You can identify them by running the following lines of code

pd.set_option("display.max_colwidth",255)
sitemap_df_robotstxt_check[sitemap_df_robotstxt_check["can_fetch"] == False]

Status Code

You can continue your journey through the sitemap.xml by inspecting the response codes returned by the URLs it includes.

To do that, we run a crawl with Advertools, making sure that the built-in crawler checks the headers of every URL found in the sitemap.

adv.crawl_headers(sitemap["loc"], output_file="sitemap_df_header.jl")
df_headers = pd.read_json("sitemap_df_header.jl", lines=True)
df_headers["status"].value_counts()

Next, we load the resulting JSON Lines file into a Pandas data frame to display the URL status codes as a clean table.

df_headers = pd.read_json('sitemap_df_header.jl', lines=True)
df_headers.head()

Once you run the script, here is what you might get.

To wrap up the status code discovery, we want to make sure that the sitemap.xml does not contain any 404 URLs. According to Google Search Central, an XML sitemap should contain only relevant 2xx URLs and avoid pages returning error response codes.

df_headers[df_headers["status"] == 404]

It goes without saying that if the script returns nothing, there are no 404s.
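
While you are at it, it can be worth flagging redirected URLs as well, since a sitemap should ideally list final destination URLs only. A small sketch:

# Flag sitemap URLs that respond with a redirect (3xx) status
df_headers[df_headers['status'].between(300, 399)][['url', 'status']]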

Canonicalization

Using canonicalization hints in the response headers is beneficial for crawling and indexing.

If you want to include a canonicalization hint on the HTTP header, you need to guarantee that the HTML canonical tags and the response header canonical tags are the same.

To untangle this knot, we are going to trawl through the HTTP headers and chase down the resp_headers_link

df_headers.columns

🚨 WARNING 🚨

In case the script above doesn’t return a resp_headers_link column, it means that a response header canonical is not in place. You want to jump to the next section.

Given that the script returned a resp_headers_link column, we are able to compare the response header canonical to the HTML canonical.

df_headers["resp_headers_link"]
print("Checking any links within the Response Header")

df_headers["response_header_canonical"] = df_headers["resp_headers_link"].str.extract(r"([^<>][a-z:\/0-9-.]*)")
(df_headers["response_header_canonical"] == df_headers["url"]).value_counts()

If the result is False, just like in the example, it means that the response header canonical does not equal the URL canonical on the audited website.

If the result is True, obviously the response header canonical equals the URL canonical.
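
If you do get False values, a short sketch can list the rows where the two values differ:

# List URLs whose response header canonical differs from the URL itself
mismatch = df_headers[
    df_headers['response_header_canonical'].notna()
    & (df_headers['response_header_canonical'] != df_headers['url'])
]
mismatch[['url', 'response_header_canonical']].head()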

💡BONUS

You can automate a canonical audit using Python. Learn more and cut off plenty of manual and boring work!

X-Robots-Tag

Indexing directives are sometimes applied temporarily at the HTTP header level, and as a result they may affect the URLs listed in the XML sitemap.

Because of that, you want to go through the sitemap URLs and check whether such tags exist within the HTTP headers.

def robots_tag_checker(dataframe: pd.DataFrame):
    """Return the first column whose name mentions 'robots', if any."""
    for col in dataframe.columns:
        if "robots" in col:
            return col
    return "There is no robots tag"

robots_tag_checker(df_headers)

Once again, the script returns one of two outcomes.

There is no robots tag = fair enough, jump to the next section

A column name containing robots = you may want to dig deeper and see what it refers to

To narrow down, we might want to check if these X-Robots-Tag headers come with a noindex directive (the exact column name depends on your crawl output, so check df_headers.columns first).

💡 TIPS 💡

In the Google Search Console Coverage report, those URLs normally appear as “Submitted – marked as noindex”. Contradicting indexing and canonicalization hints and signals might make a search engine ignore all of them and trust the user-declared signals less.

df_headers["response_header_x_robots_tag"].value_counts()
df_headers[df_headers["response_header_x_robots_tag"] == "noindex"]

Audit Meta Tag Robots

Even if a web page is not disallowed by any constraining robots.txt directive, it can still be blocked from indexing by its HTML meta robots tags.

Hence, checking the HTML meta tags is necessary and very much recommended for better indexation and crawling.

A handy method I fancy is using custom XPath selectors to audit the HTML meta tags of the URLs uploaded to a sitemap.

For the purposes of this tutorial, we are going to leverage Advertools to run an additional crawl drilled down on a specific XPath selector aimed at extracting all the robots commands from the URLs in the sitemap.

adv.crawl(
    url_list=sitemap["loc"][:1000],
    output_file="meta_command_audit.jl",
    follow_links=False,
    # extract all the robots commands from the URLs in the sitemap
    xpath_selectors={"meta_command": "//meta[@name='robots']/@content"},
    # we have capped the crawl at 1,000 URLs from the sitemap
    custom_settings={"CLOSESPIDER_PAGECOUNT": 1000},
)

df_meta_check = pd.read_json("meta_command_audit.jl", lines=True)

df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True).value_counts()

As usual, a dual output will be provided. Having instructed the machine to spot the actual robots directives, we will learn whether the sitemap hosts URLs that threaten the crawling and indexing processes.

True = there are URLs with “noindex” or “nofollow” directives uploaded to the sitemap

False = there are no URLs with “noindex” or “nofollow” directives uploaded to the sitemap
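
If the check returns True, a quick sketch can surface the offending URLs together with their directives:

# Show sitemap URLs carrying a noindex or nofollow meta robots directive
df_meta_check[
    df_meta_check['meta_command'].str.contains('nofollow|noindex', regex=True, na=False)
][['url', 'meta_command']]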

💡 TIP 💡

If your Search Console property raises indexing issues stemming from a specific URL despite it displaying an index,follow directive, you should inspect the <body> section of that page to assess whether a noindex,follow is actually in place.

H/T to Kristina Azarenko and her tests about a curious case of a noindexed page.

Moving back to our meta robots audit of the target sitemap’s URLs, we can now get a final roundup of the meta robots tags applied.

To run this check, I used the XML sitemap of seodepths, as the script handles sitemaps with fewer URLs much better.

df_meta_check[df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True) == False][["url", "meta_command"]]

Audit Duplicate URLs

This crazy journey through an XML sitemap culminates with a screening of duplicate URLs.

The following script reports the ratio between the original number of URLs in the sitemap and the number remaining after de-duplication, that is, after duplicates have been removed.

print(f'Original: {sitemap_raw.shape}')
sitemap = sitemap_raw.drop_duplicates(subset=['loc'])
print(f'After de-duplication: {sitemap.shape}')
duplicate URLs from sitemap

The following will recap the number of duplicate URLs identified

duplicated_urls = sitemap_raw['loc'].duplicated()
md(f'## Duplicated URLs: {duplicated_urls.sum():,} ({duplicated_urls.mean():.1%})')

Should your sitemap contain duplicated pages, the following script will use a Pandas pivot table to tell you how many there are per sitemap file.

pd.pivot_table(sitemap_raw[sitemap_raw["loc"].duplicated()], index="sitemap", values="loc", aggfunc="count").sort_values(by="loc", ascending=False)

And most importantly, display URLs that cause duplicate issues.

sitemap[sitemap["loc"].duplicated() == True]

Compare Multiple XML Sitemaps

Large eCommerce sites may come with a ton of XML sitemaps. They may be submitted via Google Search Console or via robots.txt. In truth, it’s not rare to detect sitemaps uploaded only via the robots.txt file and other files submitted exclusively via Google Search Console.

In case you’re in such a messy situation, it would be helpful to learn at a glance whether they are unique files or just duplicates. In other words, you want to know whether each document contains the same URLs that might be found elsewhere.

Fear no more, Advertools and Pandas will come to the rescue.

You just need to scrape each similar XML sitemap, stack their <loc> columns with the pd.concat function, and count how many URLs are duplicated across the files.


# Fetch both sitemaps and keep only the <loc> column
f = adv.sitemap_to_df("https://www.example.com/media/google_sitemap_9.xml")
sitemap_url_9 = f[['loc']]

g = adv.sitemap_to_df("https://www.example.com/media/google_sitemap_8.xml")
sitemap_url_8 = g[['loc']]

# Stack the <loc> columns on top of each other
combined = pd.concat([sitemap_url_9, sitemap_url_8], axis=0, ignore_index=True)

# URLs that appear in more than one sitemap file
duplicate_urls = combined[combined.duplicated(subset='loc', keep=False)]
print(f"URLs shared across the two sitemaps: {duplicate_urls['loc'].nunique()}")

# Share of each sitemap's URLs that also appear in the other file
col_percent_9 = sitemap_url_9['loc'].isin(sitemap_url_8['loc']).mean() * 100
col_percent_8 = sitemap_url_8['loc'].isin(sitemap_url_9['loc']).mean() * 100

# Return the result in percentage format
print(f"Percentage of sitemap_url_9 URLs also present in sitemap_url_8: {col_percent_9:.2f}%")
print(f"Percentage of sitemap_url_8 URLs also present in sitemap_url_9: {col_percent_8:.2f}%")
percentage of duplicate URLs in different sitemaps

Conclusion

Retrieving insights and juicy data with Python can be a time saver, especially if you’re working on a small to medium-sized website.

Although it might appear daunting at first glance, once you start getting along with the code your SEO research will become a proper piece of cake.


FAQ

How do I export a sitemap to Excel?

To export the sitemap to a CSV file you can open in Excel, execute the df.to_csv function, appending the directory path where you want to store the sitemap on your PC.

sitemap_url.to_csv(r'YOUR-PATH-DIRECTORY.csv', index=False, header=True)

You need to append and execute this command right after the lines of code that scrape the sitemap.
Please note that we are using “sitemap_url” as the custom name of the data frame.

How to do a sitemap audit?

To do a sitemap audit, first, ensure that your sitemap submission is up-to-date. Then, use a sitemap audit tool such as Screaming Frog, Google Search Console, or SEMrush to check for errors, missing pages, and duplicate content. Finally, review and fix any issues found to improve your site’s crawlability and search engine visibility.

What is sitemap in Python?

In Python, a sitemap is an XML file that lists the URLs for a website along with additional metadata about each URL such as the frequency of updates and the date of the last modification. Using a sitemap can improve the efficiency of search engine crawlers and thus enhance your website’s visibility in search results

Further Readings

I highly recommend following Elias Dabbas for Python tips and actionable tutorials drilled on SEO.

This post was inspired by his take on the foreign policy XML sitemap analysis from Kaggle
