XML Sitemap Audit with Python

Reading time: 22 Minutes

XML sitemaps are designed to make life easier for search engines by providing an index of a site's URLs.

However, they also play a crucial role in gaining a competitive edge, as they quickly deliver an overview of the pages a website considers most relevant and wants to bring to the attention of web crawlers.

In this post, you will learn how to automate an XML sitemap audit to improve your SEO decision-making.


💡 Shortcut

Jump straight to the Google Colab script if you are in a hurry.


XML Sitemap: What it is and How to Audit

Sitemaps are structured repositories of the most important pages on your website. The file helps Google and other search engines crawl your website more efficiently.

In the words of Google:

A sitemap is a file where you provide information about the pages, videos, and other files on your site, and the relationships between them.

A sitemap helps search engines discover URLs on your site, but it doesn’t guarantee that all the items in your sitemap will be crawled and indexed.

Who needs an XML Sitemap?

When your website’s pages are linked correctly, Google is usually able to find and index most of them. This means that all the important pages on your website can be accessed through navigation options, such as your site’s menu or links placed on other pages.

Despite optimal internal linking, an XML sitemap can help Google find and index larger or more complex websites or specialized files effectively.

Some websites may benefit more from an XML sitemap than others:

  • Your site is really large. As a result, Google's web crawlers are more likely to overlook some of your new or recently updated pages.
  • Your site has a large archive of content pages that are isolated or not well linked to each other. If your site pages don’t naturally reference each other, you can list them in a sitemap to ensure that Google doesn’t overlook some of your pages.
  • Your site is new and has few external links to it. Googlebot and other web crawlers crawl the web by following links from one page to another. As a result, Google might not discover your pages if no other sites link to them.
  • Your site has a lot of rich media content (video, images) or is shown in Google News. If provided, Google can take additional information from sitemaps into account for search, where appropriate.

Some websites may not receive equivalent benefits from an XML sitemap:

  • Your site is “small”. The site should be about 500 pages or fewer.
  • Your site is comprehensively linked internally. This means that Google can find all the important pages on your site by following links starting from the homepage.
  • You don’t have many media files (video, image) or news pages that you want to show in search results. Sitemaps can help Google find and understand video and image files, or news articles, on your site. If you don’t need these results to appear in images, videos, or news results, you might not need a sitemap.

XML Sitemap Best Practices

Before jumping into the analysis, it is worth reviewing a few points so that you approach the audit with a clear understanding of how XML sitemaps should be set up on a website.

  • Make sure to use consistent URLs. Don't omit the protocol (HTTPS) or drop the www if your site uses it; always list absolute, fully qualified URLs.
  • Don't include session IDs and other user-dependent identifiers in the URLs in your sitemap. This reduces duplicate crawling of those URLs.
  • Break up large sitemaps into smaller sitemaps: a sitemap can contain up to 50,000 URLs and must not exceed 50MB uncompressed (see the sketch after this list).
  • List only canonical URLs in your sitemaps. If you have multiple versions of a page, list in the sitemap only the one you prefer to appear in search results.
  • Point to only one website version in a sitemap. If you have different URLs for mobile and desktop versions of a page, use one version in a sitemap.
  • Tell Google about alternate language versions of a URL using hreflang annotations. Use the hreflang tag in a sitemap to indicate the alternate URLs if you have different pages for different languages or regions.
  • Encode your sitemap with UTF-8 and use ASCII characters to ensure it is readable by Google. This is usually done automatically if you are using a script, tool, or log file to generate your URLs.
  • Avoid submitting pagination. The sitemap should only include pages that you want to rank, and paginated series do not usually fit this criterion. Including paginated series in a sitemap increases the chances of having them ranked on the SERP. Even if you want pagination to be indexed, ranking those pages brings little to no value and in turn wastes crawl efficiency.
  • Keep in mind sitemaps are a recommendation to Google about which pages you think are important; Google does not pledge to crawl every URL in a sitemap.
  • Google ignores <priority> and <changefreq> values.
  • Google uses the <lastmod> value if it’s consistently and verifiably (for example by comparing to the last modification of the page) accurate.
  • The position of a URL in a sitemap is not important; Google does not crawl URLs in the order in which they appear in your sitemap.
  • Submit your sitemap to Google. Google examines sitemaps only when it first discovers them or when it is notified of an update. Despite plenty of ways to make your sitemap available to Google, it is best practice to submit it through the Search Console.
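To make the 50,000-URL limit above more concrete, here is a minimal sketch (not part of the original workflow) that splits a hypothetical list of URLs called all_urls into several sitemap files, each within the limit, using the lxml library we import later in this tutorial.

from lxml import etree

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def write_sitemap_chunks(all_urls, chunk_size=50_000, prefix="sitemap"):
    """Write one or more sitemap files, each holding at most chunk_size URLs."""
    for n, start in enumerate(range(0, len(all_urls), chunk_size), start=1):
        # one <urlset> per output file, with the standard sitemap namespace
        urlset = etree.Element(f"{{{SITEMAP_NS}}}urlset", nsmap={None: SITEMAP_NS})
        for url in all_urls[start:start + chunk_size]:
            url_el = etree.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            etree.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = url
        etree.ElementTree(urlset).write(
            f"{prefix}_{n}.xml", xml_declaration=True, encoding="UTF-8", pretty_print=True
        )

# hypothetical usage:
# write_sitemap_chunks(["https://www.example.com/page-1", "https://www.example.com/page-2"])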

💡 Pro Tip

Referencing XML sitemaps in your robots.txt file is not an official requirement.
This is an SEO myth based on the assumption that placing the sitemap in robots.txt consolidates the file's submission signals to Google.
Instead, this may expose your site to security risks, since someone could easily scrape important information from your website.
Submitting the XML sitemap from your Google Search Console property is enough and is the safest option.

Audit XML Sitemaps in Python

Working in-agency, you come across many specific SEO tasks. When screening the XML sitemaps of a client website, I like to follow a data-driven playbook that allows me to process huge chunks of data and automate most of the boring stuff.
Before kicking off this tutorial on how to automate a sitemap audit, there are a few premises to be aware of.

For the purpose of this tutorial, I am going to test the XML sitemap from Halfords, the largest British retailer of motoring and cycling products and services. 


Packages and Libraries

We need to install and import a couple of fundamental libraries

%%capture
!pip install advertools plotly

Next, we import them into our environment, including Pandas and IPython's display and HTML utilities, which give us easy data manipulation and text output visualization.

import advertools as adv
import pandas as pd
import os
import requests
import urllib.parse
import time
import matplotlib
import plotly.express as px

from lxml import etree
from bs4 import BeautifulSoup
from IPython.display import display_html, display_markdown
from IPython.core.display import display, HTML

display(HTML("<style>.container { width:100% !important; }</style>"))
def md(text):
    return display_markdown(text, raw=True)

Scrape URLs from the Sitemap

The next step is the actual scraping of our sitemap.xml.

Having imported the required libraries, we are now in a position to set up a few functions that parse the sitemap.

💡 Append either /sitemap.xml or /sitemap_index.xml to the end of the domain's URL to find out whether the target site has a sitemap.
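If you want to check this programmatically, here is a minimal, hypothetical probe using the requests library imported above; a 200 status code for one of the two common paths suggests a sitemap is available there.

homepage = 'https://www.halfords.com/'

# probe the two most common sitemap locations
for path in ['sitemap.xml', 'sitemap_index.xml']:
    response = requests.get(homepage + path)
    print(f"{homepage + path} -> {response.status_code}")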

To dive in, we start with the homepage of the domain we want to retrieve the XML sitemap from.

Next, we fetch the target site’s sitemap.xml using the adv.sitemap_to_df() function from Advertools to convert the sitemap into a data frame.

The script then converts the date format of the “lastmod” column in the data frame to make it more readable.

Finally, the script drops a few columns to comply with ordinary data cleaning techniques before printing the dataset.

homepage = 'https://www.halfords.com/'

x = adv.sitemap_to_df("https://www.halfords.com/sitemap_index.xml")
df = pd.DataFrame(x, columns=['loc', 'lastmod', 'changefreq', 'priority', 'image',
                              'image_loc', 'image_caption', 'image_title', 'sitemap',
                              'sitemap_size_mb', 'download_date'])

# convert the date format to make it more readable
df['Lastmod'] = df['lastmod'].dt.strftime('%Y-%m-%d')

# drop the original lastmod column
sitemap_url = df.drop(['lastmod'], axis=1)
sitemap_url.head()
scraped URLs from a sitemap

Audit the Size and Number of URLs on the XML Sitemap

A single XML sitemap can contain up to 50,000 URLs and must not exceed 50MB uncompressed, so we need to make sure the file does not break these rules.

Check how many URLs a single file contains

sitemap_url['sitemap'].value_counts()

Check the size of the XML sitemap

sitemap_url['sitemap_size_mb'].drop_duplicates().sort_values(ascending=False)
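As a quick sanity check, the following sketch (using the columns already present in our data frame) flags any sitemap file that exceeds either of the two official limits.

# flag sitemap files breaking the 50,000-URL or 50MB limits
url_counts = sitemap_url['sitemap'].value_counts()
sizes = sitemap_url[['sitemap', 'sitemap_size_mb']].drop_duplicates()

print(f"Sitemaps over 50,000 URLs: {(url_counts > 50_000).sum()}")
print(f"Sitemaps over 50MB: {(sizes['sitemap_size_mb'] > 50).sum()}")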

Audit Usage of XML Sitemap Attributes

Sitemaps offer a set of attributes aimed at instructing search engines about crawling patterns or providing information on the current site architecture.

Some have been recently deprecated, whereas others are still valid.

As anticipated, Google dismissed <priority> and <changefreq> whilst still relying on values such as <loc> (required) and <lastmod>.

Using <lastmod> helps highlight updated pages in the sitemap and notify Google about updates.

To get a grip on <lastmod> usage, we are going to use the md() helper (built on display_markdown) that we defined earlier.

#How many URLs have a date in the URL

md(f"## URLs that have lastmod implemented: {sitemap_url['Lastmod'].notna().sum():,} ({sitemap_url['Lastmod'].notna().mean():.1%})")

URLs that have lastmod implemented: 14,213 (100.0%)

💡 Pro Tip

XML sitemap attributes can come with different nomenclature. Instead of "Lastmod", use whatever name the column containing the last modified date has in your data frame.

Next, we want to audit the existing <changefreq> and <priority> fields. These are very popular among large publishers and eCommerce sites that dynamically inject fresh content or products on their websites. Despite Google's deprecation of the values, many sites still use them at full capacity.

#check on priority
md(f"## URLs that have priority implemented: {sitemap_url['priority'].notna().sum():,} ({sitemap_url['priority'].notna().mean():.1%})")

#check on changefreq
md(f"## URLs that have changefreq implemented: {sitemap_url['changefreq'].notna().sum():,} ({sitemap_url['changefreq'].notna().mean():.1%})")

URLs that have priority implemented: 14,179 (100.0%)

URLs that have changefreq implemented: 14,179 (100.0%)

Pagination

As anticipated, it is recommended to avoid cluttering an XML sitemap with paginated series that bring little to no value to your organic rankings and business goals. You can find out whether an XML sitemap contains pagination by searching for the /p=/ parameter in the loc column of our sitemap data frame.

This will return the previous data frame with an extra column called "Pagination" containing boolean values that answer the main question.

sitemap_url['Pagination'] = sitemap_url['loc'].str.contains('/p=/', na=False)

if sitemap_url['Pagination'].any():
    print('Pagination found')
else:
    print('No pagination')
sitemap_url.head()

External Link Ratio

To spice up the audit, we can investigate the backlink weight of each URL included on the XML sitemap.

def count_links( page_url, domain ):
    """Given input page_url, output the total number of outbound links"""
    links_internal = {}
    links_external = {}
    
    # download the html
    res = requests.get(page_url)
    if "html" not in res.headers.get('Content-Type', ''):
        # not an HTML page (e.g. an image)
        return {'parseable': False}

    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    
    for a in soup.find_all('a'):
        link = a.get('href')

        # skip missing
        if link is None:
            continue

        # remove params
        link = link.split("?")[0]
        # remove shortcuts
        link = link.split("#")[0]

        # skip missing
        if (link is None) or (link == ''):
            continue

        if (domain in link) or (len(link)>1 and link[0:1]=="/") or (len(link)>2 and link[0:2]=="./"):
            # is internal
            links_internal[link]= links_internal.get(link,0) + 1
        else:
            # external
            links_external[link]= links_external.get(link,0) + 1

    return {"parseable": True, "external": links_external, "internal": links_internal }


domain = urllib.parse.urlparse(homepage).netloc

# test one url
page_url = sitemap_url.iloc[0]['loc']
print(page_url)

links = count_links( page_url, domain )
links

All right. This was just a test, but we can't go much further unless we want to exhaust the Google Colab CPU.

Given the vast number of URLs contained in our XML sitemap, parsing outbound links for every single URL would be too heavy for Colab's memory.

Small sites (<10,000 pages) usually have limited sitemaps with fewer submitted URLs. If this is your case, it might be reasonable to run a random sampling of 25% of pages and cap the output at 40 sampled pages.

Please bear in mind that this section may not be useful if you're running an international website with multiple sitemaps (e.g. eCommerce).

# sample 25% of the site
sample_size = 0.25 
# not more than n number of pages
max_n_samples = 40

Let’s run the script and store the outcomes into a Pandas data frame.

# table of pages to test
subset_of_sitemap_df = sitemap_url.sample(min(max_n_samples, round(sample_size*len(sitemap_url))))

# get domain
domain = urllib.parse.urlparse(homepage).netloc

# dictionary to hold the results
data = {'page_url': [], 'external_links': []}

# get count of external links per page (a single pass avoids fetching each URL twice)
for index, row in subset_of_sitemap_df.iterrows():

    page_url = row['loc']

    # count outbound links
    links = count_links(page_url, domain)

    # keep track of external links per page
    if links.get('parseable'):
        external_links = len(links['external'])
        data['page_url'].append(page_url)
        data['external_links'].append(external_links)

# convert dictionary to dataframe
df = pd.DataFrame(data)
df.to_excel('external_links_on_XML.xlsx', index=False)
df.head()
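Optionally, since plotly express is already imported, a short sketch like the following gives a quick view of how external links are distributed across the sampled pages.

# distribution of external links across the sampled sitemap URLs
px.histogram(df, x='external_links', nbins=20)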

🔦 Shout out to Alton Alexander for the inspiration, which comes from one of his brilliant works on SEO and data science. You can find the full code for the analysis of sitewide link quality on Github.

Deconstruct the XML sitemap structure

One of the advantages of this Python framework is that you can get a first-hand look at your target website's structure. This is possible thanks to advertools and its built-in functions.

sitemap_url_df = adv.url_to_df(sitemap_url["loc"])
sitemap_url_df

Here’s an excerpt of what you might be able to retrieve at the moment.

deconstruction of the site tree from a sitemap

At a glance, we can obtain an overview of the top directories used in the sitemap.

Depending on the type of file, this can help you identify the most frequent sub-directories, learn more about the most common categories included in the sitemap, and find the most used words per directory.

sitemap_url_df.notna().mean().to_frame().rename(columns={0: 'URL element used %'}).style.format('{:.2%}').background_gradient(cmap='cividis')
percentage of utilization of URLs components from a sitemap

Top Geo Location Values

The first directory of a URL often indicates geolocation.

It's not rare to stumble across eCommerce shops with multiple XML sitemaps containing endless lists of URLs that serve anything but the main market or language.

Let’s say that for this example we’re auditing an XML sitemap for a large eCommerce with the following URL structure:

/tw/sitemap_index.xml

You can judge for yourself what the XML sitemap actually contains instead.

counts_df = sitemap_url_df['dir_1'].value_counts().reset_index()
counts_df.columns = ['country', 'count']

#using plotly to plot a histogram
px.histogram(counts_df.head(5), x='country', y='count')

Despite not posing a threat to indexability, having multiple URLs that differ from the main purpose of the XML sitemap file may confuse search engines when picking up the most relevant pages you’ve submitted.

Building on the current example, we can fetch and download the submitted URLs that drift from the main purpose of the XML sitemap.

# Use boolean indexing to filter the rows that do not contain "/tw/"
filtered_urls = sitemap_url[~sitemap_url['loc'].str.contains('/tw/')]

# Create a new dataframe with the filtered rows
new_dataframe = pd.DataFrame(filtered_urls)
new_dataframe.to_excel('Filtered XML.xlsx',index=False)
new_dataframe['loc'].head()

💡 BONUS

eCommerce is having a great time nowadays, and optimizing the right features can definitely help you boost your online store's visibility. Here are 3 common issues with images on luxury eCommerce.

Top Page Categories

We can inspect the second directory of the URL paths in the XML sitemap to get a grip on the top page categories submitted to Google.

top_categories = sitemap_url_df['dir_2'].value_counts().reset_index()
top_categories.columns = ['category', 'count']

#using plotly to plot a histogram
px.histogram(top_categories.head(), x='category', y='count')

Most common n-grams or bi-grams

Coming back to the Halfords sitemap, the only one we crawled with advertools, we can look at the most used unigrams in the second directory of the URLs included in the sitemap.

(adv.word_frequency(
    sitemap_url_df['dir_2']
    .dropna()
    .str.replace('-', ' '),
    rm_words=['to', 'for', 'the', 'and', 'in', 'of', 'a', 'with', 'is'])
 .head(15)
 .style.format({'abs_freq': '{:,}'}))
most used n-grams from URLs on a sitemap

🔦 Shout out to Elias Dabbas for inspiring this specific section. You can find similar XML sitemap manipulation and analysis in the Foreign Policy XML sitemap analysis on Kaggle.

HTTPS Usage on URLs in the Sitemap

You can look through the sitemap URLs to get a grip on the overall HTTPS usage on your site.

sitemap_url["scheme"].value_counts().to_frame()

Here’s what you might get.

A screen grab from the python script providing a clue of HTTPS Usage on your Site
HTTPS Usage on your Site

Robots.txt

We can ask the machine to return the robots.txt file’s status code for the domain.

import requests
r = requests.get("https://www.halfords.com/robots.txt")
r.status_code

If the response status code is 200, it means there is a robots.txt file for the user-agent-based crawling control.

We can dig deeper into the robots.txt analysis by extending the audit to all of the URLs in the sitemap.

sitemap_df_robotstxt_check = adv.robotstxt_test("https://www.halfords.com/robots.txt", urls=sitemap_url["loc"], user_agents=["*"])

sitemap_df_robotstxt_check["can_fetch"].value_counts()

What we have just done is perform the audit for all user agents. You should get something like this:

Outcome of the bulk audit Robots.txt of the URLs in the sitemap
Bulk audit Robots.txt of the URLs

As you can see, we received a True value meaning that all of the URLs in the sitemap.xml are crawlable.

In case the value turned out to be False, it means that some URLs are being disallowed.

You can identify them by running the following lines of code

pd.set_option("display.max_colwidth",255)
sitemap_df_robotstxt_check[sitemap_df_robotstxt_check["can_fetch"] == False]

Status Code

You can continue your journey through the sitemap.xml tags by inspecting the response codes returned by the included URLs.

To do that, we run a header-only crawl with Advertools over all the URLs listed in the sitemap.

adv.crawl_headers(sitemap_url["loc"], output_file="sitemap_df_header.jl")
df_headers = pd.read_json("sitemap_df_header.jl", lines=True)
df_headers["status"].value_counts()

Next, we load the output file into a Pandas data frame to display the URLs' status codes in a clean table.

df_headers = pd.read_json('sitemap_df_header.jl', lines=True)
df_headers.head()

Once you run the script, here is what you might get.

To wrap up the status code discovery, we want to make sure that the sitemap.xml does not contain any 404 URLs. According to Google Search Central, an XML sitemap should contain only relevant 2xx URLs and avoid pages returning error response codes.

df_headers[df_headers["status"] == 404]

It goes without saying that if the script returns nothing, there are no 404s.
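If you want to broaden the check beyond 404s, a small variation of the same filter (a sketch reusing the status column we already crawled) lists every URL that did not return a 2xx response.

# every sitemap URL that did not return a 2xx status code
df_headers[~df_headers["status"].between(200, 299)][["url", "status"]]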

Canonicalization

Using canonicalization hints in the response headers is beneficial for crawling and indexing.

If you want to include a canonicalization hint on the HTTP header, you need to guarantee that the HTML canonical tags and the response header canonical tags are the same.

To untangle this knot, we are going to trawl through the HTTP headers and chase down the resp_headers_link column.

df_headers.columns

🚨 WARNING 🚨

If the script above doesn't return a resp_headers_link column, a response header canonical is not in place and you can skip to the next section.

Given that the script returned a resp_headers_link, we can compare the response header canonical to the HTML canonical.

df_headers["resp_headers_link"]
print("Checking any links within the Response Header")

df_headers["response_header_canonical"] = df_headers["resp_headers_link"].str.extract(r"([^<>][a-z:\/0-9-.]*)")
(df_headers["response_header_canonical"] == df_headers["url"]).value_counts()

If the result is False, just like in the example, the response header canonical does not match the URL on the audited website.

If the result is True, the response header canonical matches the URL.
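To see exactly which pages are affected, a short follow-up sketch (reusing the columns created above) lists the URLs whose response header canonical does not match the URL itself.

# URLs whose response header canonical disagrees with the URL itself
mismatches = df_headers[
    df_headers["response_header_canonical"].notna()
    & (df_headers["response_header_canonical"] != df_headers["url"])
]
mismatches[["url", "response_header_canonical"]]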

💡 BONUS

You can automate a canonical audit using Python. Learn more and cut out plenty of manual, boring work!

X-Robots-Tag Headers

Temporary robots directives are sometimes applied through HTTP headers (X-Robots-Tag), and as a result they may affect the URLs listed in the XML sitemap.

Because of that, you want to peep through the sitemap and assess the existence of such tags within the HTTP headers.

def robots_tag_checker(dataframe: pd.DataFrame):
    # look for a column holding a robots directive among the crawled response headers
    for column in dataframe.columns:
        if "robots" in column:
            return column
    return "There is no robots tag"

robots_tag_checker(df_headers)

Once again, the script returns one of two outcomes.

There is no robots tag = fair enough, jump to the next section

A robots tag column is returned = you may want to dig deeper and see what the directives refer to

To narrow things down, we might want to check whether these X-Robots-Tag headers carry a noindex directive.

💡 TIPS 💡

In the Google Search Console Coverage report, these normally appear as "Submitted URL marked 'noindex'". Contradicting indexing and canonicalization hints might make a search engine ignore all of them and trust the user-declared signals less.

df_headers["response_header_x_robots_tag"].value_counts()
df_headers[df_headers["response_header_x_robots_tag"] == "noindex"]

Audit Meta Tag Robots

Even if a web page is not disallowed by any robots.txt directive, it can still be blocked from indexing by its HTML meta robots tag.

Hence, checking the HTML Meta Tags for better indexation and crawling is necessary and very much recommended.

A handy method I fancy is using custom XPath selectors to audit the HTML meta tags of the URLs uploaded in a sitemap.

For the purposes of this tutorial, we are going to leverage Advertools to run an additional crawl with a custom XPath selector that extracts all the robots directives from the URLs in the sitemap.

adv.crawl(
    url_list=sitemap_url["loc"][:1000],
    output_file="meta_command_audit.jl",
    follow_links=False,
    # extract all the robots commands from the URLs of the sitemap
    xpath_selectors={"meta_command": "//meta[@name='robots']/@content"},
    # we have capped the crawl at 1000 URLs from the sitemap
    custom_settings={"CLOSESPIDER_PAGECOUNT": 1000},
)

df_meta_check = pd.read_json("meta_command_audit.jl", lines=True)

df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True).value_counts()

As usual, one of two outputs will be provided. Having instructed the machine to spot the actual robots directives, we learn whether the sitemap hosts URLs that hamper crawling and indexing.

True = there are URLs with "noindex" or "nofollow" directives uploaded to the sitemap

False = there are no URLs with "noindex" or "nofollow" directives uploaded to the sitemap
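If the check returns True, a sketch like the following (mirroring the filter used later in this section) lists the offending URLs together with their directives.

# URLs in the sitemap carrying a noindex or nofollow directive
df_meta_check[
    df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True) == True
][["url", "meta_command"]]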

💡 TIP 💡

If your Search Console property raises indexing issues for a specific URL even though it displays an index,follow directive, you should inspect the <body> section of that page to check whether a noindex,follow is actually in place.

I must pay tribute for this tip to Kristina Azarenko and her tests on a curious case of a noindexed page.

Moving back to our meta tag robots audit of the target sitemap's URLs, we can now get a final roundup of the meta tag robots applied.

To run this check, I decided to use the XML sitemap of seodepths, as the script handles sitemaps with fewer URLs much better.

df_meta_check[df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True) == False][["url", "meta_command"]]

Audit Duplicate URLs

This crazy journey through an XML sitemap culminates with a screening of duplicate URLs.

The following script will report the ratio between the original number of URLs in the sitemap and the number remaining after de-duplication, that is, after removing duplicates.

print(f'Original: {sitemap_url.shape}')
sitemap = sitemap_url.drop_duplicates(subset=['loc'])
print(f'After de-duplication: {sitemap.shape}')
duplicate URLs from sitemap

The following will recap the number of duplicate URLs identified

duplicated_urls = sitemap_url['loc'].duplicated()
md(f'## Duplicated URLs: {duplicated_urls.sum():,} ({duplicated_urls.mean():.1%})')

Should your sitemap get caught with duplicated pages, the following script will use a pivot table in Pandas to tell you how many are out there

pd.pivot_table(sitemap_url[sitemap_url["loc"].duplicated()==True], index="sitemap", values="loc", aggfunc="count").sort_values(by="loc", ascending=False)

And most importantly, display URLs that cause duplicate issues.

sitemap_url[sitemap_url["loc"].duplicated() == True]

Compare Multiple XML Sitemaps

Large eCommerce sites may come with a ton of XML sitemaps. They may be submitted either via Google Search Console or via robots.txt. In fact, it's not rare to find some sitemaps referenced only in the robots.txt file and others submitted exclusively via Google Search Console.

In case you're in such a messy situation, it would be helpful to learn at a glance whether they are unique files or just duplicates. In other words, you want to know whether each document contains the same URLs that might be found elsewhere.

Fear no more, Advertools and Pandas will come to the rescue.

You just need to scrape each similar XML sitemap, concatenate them by <loc> using the pd.concat function and count the number of duplicates across each sitemap.


f = adv.sitemap_to_df("https://www.example.com/media/google_sitemap_9.xml")
df38 = pd.DataFrame(f, columns=['loc', 'changefreq', 'priority', 'sitemap', 'etag',
                                'sitemap_last_modified', 'sitemap_size_mb', 'download_date'])
sitemap_url_9 = df38.drop(['etag', 'download_date', 'changefreq', 'priority', 'sitemap',
                           'sitemap_last_modified', 'sitemap_size_mb'], axis=1)

g = adv.sitemap_to_df("https://www.example.com/media/google_sitemap_8.xml")
df39 = pd.DataFrame(g, columns=['loc', 'changefreq', 'priority', 'sitemap', 'etag',
                                'sitemap_last_modified', 'sitemap_size_mb', 'download_date'])
sitemap_url_8 = df39.drop(['etag', 'download_date', 'changefreq', 'priority', 'sitemap',
                           'sitemap_last_modified', 'sitemap_size_mb'], axis=1)

# Concatenate <loc> columns side by side
result = pd.concat([sitemap_url_9, sitemap_url_8], axis=1, join='outer')
result.columns = ['sitemap_url_9', 'sitemap_url_8']

# Drop NaN values
result = result.dropna()

# Count the number of values in each column
col_count_9 = result['sitemap_url_9'].count()
col_count_8 = result['sitemap_url_8'].count()

# Divide the number of values by the total number of rows
col_percent_9 = (col_count_9 / sitemap_url_9.shape[0]) * 100
col_percent_8 = (col_count_8 / sitemap_url_8.shape[0]) * 100

# Return the result in percentage format
print(f"Percentage of unique URLs in sitemap_url_9: {col_percent_9:.2f}%")
print(f"Percentage of unique URLs in sitemap_url_8: {col_percent_8:.2f}%")
percentage of duplicate URLs in different sitemaps
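As an alternative to the positional concatenation above, a hedged sketch like the following compares the <loc> values directly with isin(), which counts the URLs that genuinely appear in both files regardless of their order.

# URLs of sitemap 9 that also appear in sitemap 8, independent of row order
shared = sitemap_url_9['loc'].isin(sitemap_url_8['loc'])
print(f"URLs shared by both sitemaps: {shared.sum():,} ({shared.mean():.1%})")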

Conclusion

Retrieving insights and juicy data with Python can be a time saver, especially if you're working on a small-to-medium-sized website.

Although it might appear daunting at first glance, once you start getting along with the code, your SEO research will become a piece of cake.


FAQ

How do I export a sitemap to excel?

To export the sitemap to a CSV file (which opens in Excel), execute the df.to_csv function with the directory path where you want to store the sitemap on your PC.

sitemap_url_df.to_csv(r'YOUR-PATH-DIRECTORY.csv', index=False, header=True)

You need to append and execute this command right after the lines of code that scrape the sitemap.
Please note that we are using "sitemap_url_df" as the custom name of the data frame.
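If you need a native Excel file rather than a CSV, pandas also offers to_excel (already used earlier in this post); note that this sketch assumes the openpyxl engine is installed in your environment.

sitemap_url_df.to_excel(r'YOUR-PATH-DIRECTORY.xlsx', index=False)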

Further Readings

I highly recommend following Elias Dabbas for Python tips and actionable tutorials drilled on SEO.

This post was inspired by his take on the Foreign Policy XML sitemap analysis on Kaggle.
