XML sitemaps are designed to make life easier for search engines by providing an index of a site’s URLs.
However, they also play a crucial role in competitive analysis, as they quickly deliver an overview of the pages a website considers most relevant and wants to bring to the attention of web crawlers.
In this post, you will learn how to automate an XML sitemap audit to improve your SEO decision-making.
💡Shortcut
Jump straight to the Google Colab script if you're in a hurry
XML Sitemap: What it is and How to Audit
A sitemap is a curated list of the most important pages on your website. The file helps Google and other search engines crawl your site more efficiently.
In the words of Google:
A sitemap is a file where you provide information about the pages, videos, and other files on your site, and the relationships between them.
A sitemap helps search engines discover URLs on your site, but it doesn’t guarantee that all the items in your sitemap will be crawled and indexed.
Who needs an XML Sitemap?
When your website’s pages are linked correctly, Google is usually able to find and index most of them. This means that all the important pages on your website can be accessed through navigation options, such as your site’s menu or links placed on other pages.
Despite optimal internal linking, an XML sitemap can help Google find and index larger or more complex websites or specialized files effectively.
Some websites may benefit more from an XML sitemap than others:
- Your site is really large. As a result, it’s more likely Google web crawlers might overlook crawling some of your new or recently updated pages.
- Your site has a large archive of content pages that are isolated or not well linked to each other. If your site pages don’t naturally reference each other, you can list them in a sitemap to ensure that Google doesn’t overlook some of your pages.
- Your site is new and has few external links to it. Googlebot and other web crawlers crawl the web by following links from one page to another. As a result, Google might not discover your pages if no other sites link to them.
- Your site has a lot of rich media content (video, images) or is shown in Google News. If provided, Google can take additional information from sitemaps into account for search, where appropriate.
Some websites may not receive equivalent benefits from an XML sitemap:
- Your site is “small”, meaning about 500 pages or fewer.
- Your site is comprehensively linked internally. This means that Google can find all the important pages on your site by following links starting from the homepage.
- You don’t have many media files (video, image) or news pages that you want to show in search results. Sitemaps can help Google find and understand video and image files, or news articles, on your site. If you don’t need these results to appear in images, videos, or news results, you might not need a sitemap.
XML Sitemap Best Practices
Before jumping into the analysis, it is worth going over a few points that explain how XML sitemaps are expected to work on a website.
- Make sure to use consistent URLs. Don’t omit the protocol (https) or the www subdomain if your site uses it; always list fully qualified, absolute URLs.
- Don’t include session IDs or other user-dependent identifiers in the URLs in your sitemap. This reduces duplicate crawling of those URLs.
- Break up large sitemaps into smaller files: a sitemap can contain up to 50,000 URLs and must not exceed 50MB uncompressed.
- List only canonical URLs in your sitemaps. If you have multiple versions of a page, list in the sitemap only the one you prefer to appear in search results.
- Point to only one website version in a sitemap. If you have different URLs for mobile and desktop versions of a page, use one version in a sitemap.
- Tell Google about alternate language versions of a URL using hreflang annotations. Use the hreflang tag in a sitemap to indicate the alternate URLs if you have different pages for different languages or regions.
- Encode your sitemap with UTF-8 and use ASCII characters to ensure it is readable by Google. This is usually done automatically if you are using a script, tool, or log file to generate your URLs.
- Avoid submitting pagination. A sitemap should only contain pages that you want to rank, and paginated series do not usually fit this criterion. Including them increases the chance of paginated pages ranking on the SERP and wastes crawl requests, which can compromise your crawl efficiency.
- Keep in mind sitemaps are a recommendation to Google about which pages you think are important; Google does not pledge to crawl every URL in a sitemap.
- Google ignores <priority> and <changefreq> values.
- Google uses the <lastmod> value if it is consistently and verifiably accurate (for example, by comparing it to the last modification of the page).
- The position of a URL in a sitemap is not important; Google does not crawl URLs in the order in which they appear in your sitemap.
- Submit your sitemap to Google. Google examines sitemaps only when it first discovers them or when it is notified of an update. Despite plenty of ways to make your sitemap available to Google, it is best practice to submit it through the Search Console.
💡Pro Tip
Referencing XML sitemaps in your robots.txt files is not an official requirement.
This is an SEO myth based on the assumption that listing the sitemap in robots.txt consolidates the file’s submission signals to Google.
Instead, it may expose your site, as anyone can read your robots.txt and scrape the URLs listed in your sitemap.
Submitting the XML sitemap through your Google Search Console property is enough.
Audit XML Sitemaps in Python
Working agency-side, there are many opportunities to tackle specific SEO tasks. When screening the XML sitemaps of a client website, I like to follow a machine learning-based playbook that allows me to process huge chunks of data and automate most of the boring stuff.
Before kicking off this tutorial on how to automate a sitemap audit, there are a few premises to be aware of.
- Run the script either on Google Colab or on a Jupyter Notebook. I suggest Google Colab because it is ready to run Python out of the box and lets you execute large scripts in the browser rather than on your own machine.
- Make sure you change the runtime to GPU to make the parsing process smoother.
- Bear in mind that sitemaps are hints to Google stressing the pages that webmasters deem most important. In fact, Google does not pledge to crawl every URL in a sitemap.
- If you are going to run the script on Colab, please do not forget to prepend an exclamation mark to the “pip install“ commands.
For the purpose of this tutorial, I am going to test the XML sitemap from Halfords, the largest British retailer of motoring and cycling products and services.
Install and Import Libraries
We need to install and import a few fundamental libraries.
%%capture
!pip install advertools dash dash_bootstrap_components jupyter_dash plotly bs4 matplotlib datasets adviz dash_bootstrap_templates
The following packages deserve a brief introduction:
| Library | Description |
|---|---|
| Advertools | A Python library designed to assist in SEO automation projects. Used here to retrieve the website’s sitemap.xml. |
| Pandas | A library for data manipulation and analysis in Python. Imported for easy data manipulation. |
| Adviz | A Python library built on top of Advertools for clean data visualization and basic descriptive analysis. |
| Plotly | A high-level, declarative charting library for creating highly visual plots. |
import os
import requests
import urllib.parse
from bs4 import BeautifulSoup
import time
import warnings
import advertools as adv
import adviz
import pandas as pd
import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display, display_html, display_markdown, HTML
from lxml import etree
import matplotlib
import sklearn
from dash_bootstrap_templates import load_figure_template
import dash
from dash import dcc, html  # dash_core_components / dash_html_components are deprecated
# Set display options
pd.options.display.max_columns = None
warnings.filterwarnings("ignore")
display(HTML("<style>.container { width:100% !important; }</style>"))
def md(text):
    return display_markdown(text, raw=True)
# Check package versions
for pkg in [adv, pd, plotly, sklearn]:
    print(f'{pkg.__name__:-<30}v{pkg.__version__}')
load_figure_template(['darkly','bootstrap','flatly','cosmo'])
Fetch URLs from the Sitemap
The next step is the actual scraping of our sitemap.xml.
Having imported the required libraries, we are now in a position to set up a few functions that prompt the machine to parse the sitemap.
💡 Append either /sitemap.xml or /sitemap_index.xml to the end of the URL to find out whether the target site has an XML sitemap
Next, we fetch the target site’s sitemap.xml using the adv.sitemap_to_df() function from Advertools, which converts the sitemap into a data frame.
The script then parses the “lastmod” column of the data frame as dates to make it easier to work with.
Finally, the script removes duplicate URLs as a basic data-cleaning step before printing the dataset.
sitemap = adv.sitemap_to_df('https://www.halfords.com/robots.txt',
max_workers=8,
recursive=True)
sitemap.to_csv('sitemap_collection.csv',index=False)
#parse the sitemaps
sitemap_raw = pd.read_csv('sitemap_collection.csv', parse_dates=['lastmod', 'download_date'], low_memory=False)
#overview duplicates
print(f'Original: {sitemap_raw.shape}')
sitemap = sitemap_raw.drop_duplicates(subset=['loc'])
print(f'After removing dupes: {sitemap.shape}')
sitemap.to_csv('sitemap.csv',index=False)
sitemap
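The date conversion and column clean-up described above can be sketched as follows. This is a minimal, optional step; the column names etag and sitemap_last_modified are assumptions based on the typical advertools output and may differ in your file.
# Normalise lastmod into a clean datetime and add a readable date column,
# then drop helper columns that aren't needed for the audit (if present).
sitemap['lastmod'] = pd.to_datetime(sitemap['lastmod'], errors='coerce', utc=True)
sitemap['lastmod_date'] = sitemap['lastmod'].dt.strftime('%Y-%m-%d')
cols_to_drop = [c for c in ['etag', 'sitemap_last_modified'] if c in sitemap.columns]
sitemap = sitemap.drop(columns=cols_to_drop)
sitemap.head()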
Audit the Size and Number of URLs on the XML Sitemap
A single XML sitemap can contain up to 50,000 URLs and must not exceed 50MB uncompressed, so we need to make sure the files do not break these rules.
Check how many URLs a single file contains
sitemap['sitemap'].value_counts()
Check the size of the XML sitemap
sitemap['sitemap_size_mb'].drop_duplicates().sort_values(ascending=False)
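To flag files that break either limit automatically, here is a minimal sketch that groups the data frame by sitemap file; the sitemap and sitemap_size_mb columns come from the advertools output shown above.
# Flag sitemap files exceeding 50,000 URLs or 50 MB uncompressed.
limits = (
    sitemap.groupby('sitemap')
    .agg(url_count=('loc', 'size'), size_mb=('sitemap_size_mb', 'first'))
    .assign(over_limit=lambda d: (d['url_count'] > 50_000) | (d['size_mb'] > 50))
    .sort_values('url_count', ascending=False)
)
limits[limits['over_limit']]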
<lastmod>, <changefreq> and <priority>
Sitemaps offer a handful of attributes aimed at informing search engines about crawling patterns or providing information on the current site architecture.
Some have recently been deprecated, whereas others are still valid.
As anticipated, Google dismissed <priority> and <changefreq> while keeping <lastmod> alongside the required <loc>.
<lastmod> is a valuable element that helps search engines schedule crawls to known URLs.
As confirmed by a recent update to Google’s documentation, the <lastmod> element should use a supported date format, and its value should accurately reflect the page’s last modification date to maintain trust with search engines.
Using <lastmod> helps highlight updated pages in the sitemap and notifies Google about updates.
To get a grip on <lastmod> usage, we are going to use the md() helper built on the display_markdown function we imported earlier.
#How many URLs have a date in the URL
md(f"## URLs that have lastmod implemented: {sitemap['lastmod'].notna().sum():,} ({sitemap['lastmod'].notna().mean():.1%})")
URLs that have lastmod implemented: 3,136 (99.5%)
💡Pro Tip
XML sitemap attributes can come with different names. If the last-modified column in your sitemap appears under a different name, simply replace “lastmod” with that column name.
What about the other 0.5% of pages that don’t have a <lastmod>?
You can inspect them by filtering missing values
sitemap[sitemap['lastmod'].isna()]['loc'].tolist()
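Beyond coverage, you can also sanity-check how plausible the <lastmod> values are, for instance by flagging dates that post-date the sitemap download itself. A quick sketch, assuming the download_date column from the advertools output:
# Flag lastmod values later than the download date (likely auto-generated
# or inaccurate timestamps that could erode Google's trust in the field).
lastmod_check = sitemap.dropna(subset=['lastmod']).copy()
lastmod_check['lastmod'] = pd.to_datetime(lastmod_check['lastmod'], errors='coerce', utc=True)
lastmod_check['download_date'] = pd.to_datetime(lastmod_check['download_date'], errors='coerce', utc=True)
suspicious = lastmod_check[lastmod_check['lastmod'] > lastmod_check['download_date']]
print(f'lastmod dates later than the download date: {len(suspicious):,}')
suspicious[['loc', 'lastmod', 'download_date']].head()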
Now we want to audit the existing <changefreq> and <priority> values.
Google does not use these elements for either NLP purposes or for rankings.
<changefreq> is an element that overlaps with the concept of <lastmod>, while <priority> is very subjective and hence may not accurately represent a page’s actual priority compared to other pages on the website.
Nevertheless, these fields are very popular with large publishers and eCommerce sites that dynamically inject fresh content or products into their websites.
Despite Google’s deprecation of the values, many sites still use them at full capacity.
#check on priority
md(f"## URLs that have priority implemented: {sitemap['priority'].notna().sum():,} ({sitemap['priority'].notna().mean():.1%})")
#check on changefreq
md(f"## URLs that have changefreq implemented: {sitemap['changefreq'].notna().sum():,} ({sitemap['changefreq'].notna().mean():.1%})")
URLs that have priority implemented: 14,179 (100.0%)
URLs that have changefreq implemented: 14,179 (100.0%)
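Since both fields are fully populated here, it is worth a quick look at which values they actually carry; a short sketch:
# Distribution of the deprecated fields' values
print(sitemap['priority'].value_counts(dropna=False).head(10))
print(sitemap['changefreq'].value_counts(dropna=False).head(10))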
Inspect Publishing Trends from the XML sitemap
With the aid of Plotly, we can visualize the publishing trends of the website through the <lastmod> tag, in case it exists on the XML sitemap.
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=sitemap['lastmod'].sort_values(),
y=np.arange(1, len(sitemap) + 1) / len(sitemap),
mode='markers',
marker=dict(size=15, opacity=0.8, color='gold'),
name='ECDF'
)
)
fig.add_shape(
type='rect',
x0='2022-01-01', x1='2022-12-31', y0=0, y1=1,
fillcolor='gray', opacity=0.2,
xref='x', yref='paper',
layer='below',
name='No publishing',
)
fig.add_shape(
type='rect',
x0='2023-01-01', x1='2023-07-05', y0=0, y1=1,
fillcolor='gray', opacity=0.2,
xref='x', yref='paper',
layer='below',
name='Frequent publishing',
)
fig.update_layout(
title='{your_client_site} publishing trends<br>(lastmod tags of XML sitemap)',
template='plotly_dark',
height=500
)
fig.show()
The above time series illustrates a growing publishing trend. This suggests that the website has been steadily adding pages over the past couple of years, despite a few pauses, which are visible as gaps in the gold markers.
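If you prefer an aggregated view of the same data, you can bucket the <lastmod> dates by month; a minimal sketch, assuming lastmod parses cleanly as a datetime:
# Count URLs by last modification month for a simpler publishing overview
monthly = (
    pd.to_datetime(sitemap['lastmod'], errors='coerce', utc=True)
    .dt.to_period('M')
    .value_counts()
    .sort_index()
)
px.bar(x=monthly.index.astype(str), y=monthly.values,
       labels={'x': 'month', 'y': 'URLs modified'},
       title='URLs by last modification month',
       template='plotly_dark')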
Pagination Sequences on XML sitemaps
As anticipated, it is recommended to avoid cluttering an XML sitemap with paginated series that bring little to no value to your organic rankings and business goals. You can find out if an XML sitemap contains pagination by searching for pagination parameters (such as /p= or ?page=) in the loc column of our sitemap.
This will return the previous data frame with an extra column called “Pagination” containing boolean values that answer the main question.
sitemap['Pagination'] = sitemap['loc'].str.contains(r'/p=|\?page=', regex=True, na=False)
if sitemap['Pagination'].any():
    print('Pagination found')
else:
    print('No Pagination')
sitemap.to_csv('sitem.csv', index=False)
sitemap.head()
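If pagination is found, you can list a sample of the offending URLs directly; a quick sketch:
# Show a sample of paginated URLs submitted in the sitemap, if any
sitemap.loc[sitemap['Pagination'], 'loc'].head(10).tolist()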
External Link Ratio
To enhance the audit, we can examine the external link ratio of each URL listed in the XML sitemap, that is, how many outbound links each page contains.
To do this, we first define the homepage address of the domain whose sitemap we retrieved.
homepage = 'https://www.halfords.com'
def count_links(page_url, domain):
    """Given an input page_url, output the internal and external outbound links"""
    links_internal = {}
    links_external = {}
    # download the html
    res = requests.get(page_url)
    if "html" not in res.headers.get('Content-Type', ''):
        # not an HTML page (e.g. an image)
        return {'parseable': False}
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        link = a.get('href')
        # skip missing
        if link is None:
            continue
        # remove params
        link = link.split("?")[0]
        # remove fragments
        link = link.split("#")[0]
        # skip empty
        if link == '':
            continue
        if (domain in link) or (len(link) > 1 and link[0:1] == "/") or (len(link) > 2 and link[0:2] == "./"):
            # internal
            links_internal[link] = links_internal.get(link, 0) + 1
        else:
            # external
            links_external[link] = links_external.get(link, 0) + 1
    return {"parseable": True, "external": links_external, "internal": links_internal}
domain = urllib.parse.urlparse(homepage).netloc
# test one url
page_url = sitemap.iloc[0]['loc']
print(page_url)
links = count_links( page_url, domain )
links
This was just a test; we can’t go much further unless we want to exhaust Google Colab’s resources.
Due to the vast number of URLs contained in our XML sitemap, it would be detrimental to our Colab’s memory to parse outbound links for every single URL.
Small sites (<10,000 pages) usually have limited sitemaps with fewer submitted URLs. If this is your case, it is reasonable to run a random sample of 25% of pages and cap the output at no more than 40 sampled pages.
Please bear in mind that this section may not be useful if you’re running an international website with multiple sitemaps (e.g. eCommerce).
# sample 25% of the site
sample_size = 0.25
# not more than n number of pages
max_n_samples = 40
Let’s run the script and store the outcomes into a Pandas data frame.
# table of pages to test
subset_of_sitemap_df = sitemap.sample(min(max_n_samples, round(sample_size * len(sitemap))))
# get domain
domain = urllib.parse.urlparse(homepage).netloc
# links per page
links_per_page = []
# create a dictionary to hold the data
data = {'page_url': [], 'external_links': []}
# get the count of external links per page
for index, row in subset_of_sitemap_df.iterrows():
    page_url = row['loc']
    # count outbound links
    links = count_links(page_url, domain)
    # keep track of links per page
    if links.get('parseable'):
        external_links = len(links['external'])
        links_per_page.append(external_links)
        data['page_url'].append(page_url)
        data['external_links'].append(external_links)
# convert the dictionary to a dataframe
df = pd.DataFrame(data)
df.to_excel('external_links_on_XML.xlsx', index=False)
df.head()
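A quick summary of the sampled counts helps spot pages with an unusually high external link ratio; a short sketch:
# Summary statistics and the most outbound-heavy sampled pages
print(df['external_links'].describe())
df.sort_values('external_links', ascending=False).head(10)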
🔦 Shout out to Alton Alexander for the inspiration, which comes from one of his brilliant workflows around SEO and data science. You can find the full code for the analysis of sitewide link quality on GitHub
Inspect the URL structure of Submitted Pages
One of the advantages of this Python framework is that you can get a first-hand look at your target website’s URL structure.
This is possible using advertools and their built-in functions.
sitemap_url = adv.url_to_df(sitemap["loc"].fillna(''))
sitemap_url.head()
Here’s an excerpt of what you might be able to retrieve at the moment.
At a glance, we can get an overview of the top directories used in the sitemap.
Depending on the type of file, Adviz will help you identify the most frequent directories/sub-directories, categories, and the most used words per directory included in the sitemap.
Top Geo Location Values
The primary directory of a URL often indicates geolocation.
It’s not rare to stumble across eCommerce shops with multiple XML sitemaps containing an unfathomable list of URLs serving anything but the main market or language.
Let’s say that for this example we’re auditing an XML sitemap for a large eCommerce site with the following URL structure:
/tw/sitemap_index.xml
You can judge for yourself what the XML sitemap may contain instead
adviz.value_counts_plus(sitemap_url['dir_1'], size=20, name='Market')
Now, let’s plot the count with a polished histogram using plotly.express
country = sitemap_url.groupby('dir_1').url.count().sort_values(ascending=False).reset_index()
country.rename(columns=
{'dir_1':'country',
'url': 'count'},
inplace=True)
#plot a histogram
px.histogram(country,
x='country',
y='count',
title='Ratio of Markets in XML sitemap',
template='plotly_dark')
Top Page Categories
We can inspect the second directory of the URL paths from the XML sitemap to get a grip on the top page categories submitted to Google.
adviz.value_counts_plus(sitemap_url['dir_2'], size=20, name='Main Directory')
Plotting the count with Plotly Express, you should get a histogram similar to this one.
parent = sitemap_url.groupby('dir_2').url.count().sort_values(ascending=False).reset_index()
parent.rename(columns=
{'dir_2':'parent category',
'url': 'count'},
inplace=True)
#plot a histogram
fig = px.histogram(
parent.head(25),
x='parent category',
y='count',
title='Main Parent Category pages in the XML sitemap',
template='plotly_dark'
)
fig.update_layout(
xaxis_tickangle=25
)
fig.show()
Most common n-grams or bi-grams
Coming back to the Halfords sitemap (the only one we crawled with Advertools), we can look at the most used unigrams in the directories of the URLs included in the sitemap
(adv.word_frequency(
sitemap_url['dir_2']
.dropna()
.str.replace('-', ' '),
rm_words=['to', 'for', 'the', 'and', 'in', 'of', 'a', 'with', 'is'])
.head(15)
.style.format({'abs_freq': '{:,}'}))
🔦 Shout out to Elias Dabbas for inspiring this specific section. You can find similar XML sitemap file manipulation and analysis in the Foreign Policy XML sitemap analysis on Kaggle.
Spotting Irrelevant Country Pages from an XML sitemap folder
Despite not posing a threat to indexability, having multiple URLs that differ from the main purpose of the XML sitemap file may confuse search engines when picking up the most relevant pages you’ve submitted.
Following the current example, we can fetch and download the submitted URLs that drift from the main purpose of the XML sitemap
# Use boolean indexing to filter out URLs that DO NOT target Taiwan (e.g.)
filtered_urls = sitemap_url[~sitemap_url['url'].str.contains('/tw/')]
# Create a new dataframe with the filtered rows
new_dataframe = pd.DataFrame(filtered_urls)
new_dataframe.to_excel('Filtered XML.xlsx', index=False)
new_dataframe['url'].head()
💡BONUS
eCommerce is having a great time nowadays, and optimizing the right features can definitely help you boost your online store’s visibility. Here are 3 common issues with images on luxury eCommerce
Robots.txt
We can ask the machine to return the robots.txt file’s status code for the domain.
import requests
r = requests.get("https://www.halfords.com/robots.txt")
r.status_code
If the response status code is 200, it means there is a robots.txt file available for user-agent-based crawling control.
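If the file exists, Advertools can also parse its directives into a data frame for a quick overview; a minimal sketch using adv.robotstxt_to_df:
# Parse the robots.txt directives (user-agents, allow/disallow rules, sitemaps)
robots_df = adv.robotstxt_to_df('https://www.halfords.com/robots.txt')
robots_df.head(20)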
We can dig further into the robots.txt analysis by extending the audit to all of the URLs in the sitemap.
sitemap_df_robotstxt_check = adv.robotstxt_test("https://www.yoursite.com/robots.txt", urls=sitemap["loc"], user_agents=["*"])
sitemap_df_robotstxt_check["can_fetch"].value_counts()
What we have just done is perform the audit for all user agents. You should get something like this
As you can see, we received a True value meaning that all of the URLs in the sitemap.xml are crawlable.
In case the value turned out to be False, it means that some URLs are being disallowed.
You can identify them by running the following lines of code
pd.set_option("display.max_colwidth",255)
sitemap_df_robotstxt_check[sitemap_df_robotstxt_check["can_fetch"] == False]
Status Code
You can continue the audit by inspecting the response status codes returned by the URLs included in the sitemap.
To do that, we run a header crawl with Advertools, making sure the built-in crawler checks every URL listed in the sitemap.
adv.crawl_headers(sitemap["loc"], output_file="sitemap_df_header.jl")
df_headers = pd.read_json("sitemap_df_header.jl", lines=True)
df_headers["status"].value_counts()
Next, we load the JSON Lines output into Pandas to display the URLs’ status codes in a clean table.
df_headers = pd.read_json('sitemap_df_header.jl', lines=True)
df_headers.head()
Once you run the script, here is what you might get.
To wrap up the status code discovery, we want to make sure that the sitemap.xml does not present any 404 URLs. According to Google Search Central, an XML sitemap should contain only relevant 2xx URLs and avoid pages returning error response codes.
df_headers[df_headers["status"] == 404]
It goes without saying that if the script returns nothing, there are no 404s in the sitemap.
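The same data frame also lets you surface redirects and server errors, which are worth reviewing even if they are less harmful than 404s; a quick sketch:
# List redirected (3xx) and erroring (5xx) sitemap URLs
redirects = df_headers[df_headers['status'].between(300, 399)]
server_errors = df_headers[df_headers['status'] >= 500]
print(f'Redirects: {len(redirects):,} | Server errors: {len(server_errors):,}')
redirects[['url', 'status']].head()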
Canonicalization
Using canonicalization hints in the response headers is beneficial for crawling and indexing.
If you want to include a canonicalization hint in the HTTP header, you need to guarantee that the HTML canonical tags and the response header canonical tags are the same.
To untangle this knot, we are going to trawl through the HTTP headers and chase down the resp_headers_link column.
df_headers.columns
🚨 WARNING 🚨
In case the script above doesn’t return a resp_headers_link column, it means that a response header canonical is not in place. You can jump to the next section.
Given that the script returned a resp_headers_link column, we can extract the response header canonical and compare it to the URL itself.
df_headers["resp_headers_link"]
print("Checking any links within the Response Header")
df_headers["response_header_canonical"] = df_headers["resp_headers_link"].str.extract(r"([^<>][a-z:\/0-9-.]*)")
(df_headers["response_header_canonical"] == df_headers["url"]).value_counts()
If the result is False, just like in the example, the response header canonical does not match the URL itself on the audited website.
If the result is True, the response header canonical matches the URL.
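To complete the comparison described at the start of this section, you can also extract the HTML canonical with a small crawl and check it against the header canonical. This is a sketch under assumptions: the XPath selector, the html_canonical column name, and the 500-URL cap are illustrative choices, not part of the original workflow.
# Extract the HTML rel="canonical" for a subset of URLs and compare it
# to the canonical advertised in the HTTP response header.
adv.crawl(url_list=df_headers['url'][:500],
          output_file='canonical_audit.jl',
          follow_links=False,
          xpath_selectors={'html_canonical': '//link[@rel="canonical"]/@href'})
canon = pd.read_json('canonical_audit.jl', lines=True)
canon = canon.merge(df_headers[['url', 'response_header_canonical']], on='url', how='left')
(canon['html_canonical'] == canon['response_header_canonical']).value_counts()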
💡BONUS
You can automate a canonical audit using Python. Learn more and cut out plenty of manual, boring work!
X-Robots-Tag Headers
Temporary indexing directives are sometimes applied at the HTTP header level (X-Robots-Tag), and the URLs listed in the XML sitemap may end up carrying them.
Because of that, you want to go through the sitemap URLs and check whether such tags exist in their HTTP headers.
def robots_tag_checker(dataframe: pd.DataFrame):
    # look for any response header column that mentions "robots"
    for col in dataframe.columns:
        if "robots" in col:
            return col
    return "There is no robots tag"
robots_tag_checker(df_headers)
Once again, the script returns one of two responses.
There is no robots tag = fair enough, jump to the next section
There is a robots tag = you may want to dig deeper and see what it refers to
To narrow down, we might want to check if these X-Robots Meta Tags come with a noindex directive.
💡 TIPS 💡
In the Google Search Console Coverage report, those URLs normally appear as “Submitted URL marked ‘noindex’”. Contradictory indexing and canonicalization hints and signals can make a search engine ignore all of the signals and trust user-declared signals less.
# use the column name found by robots_tag_checker() above
robots_col = robots_tag_checker(df_headers)
df_headers[robots_col].value_counts()
df_headers[df_headers[robots_col].str.contains("noindex", na=False)]
Audit Meta Tag Robots
Even if a web page is not disallowed by any constraining robots.txt directive, it can still be excluded from indexing by its HTML meta robots tags.
Hence, checking the HTML Meta Tags for better indexation and crawling is necessary and very much recommended.
A handy method I fancy is using custom XPath selectors to perform the HTML Meta Tag audit for the audited URLs uploaded on a sitemap.
For the purposes of this tutorial, we are going to leverage Advertools to run an additional crawl with a specific XPath selector that extracts all the robots directives from the URLs in the sitemap
adv.crawl(url_list=sitemap["loc"][:1000], output_file="meta_command_audit.jl",
follow_links=False,
# extract all the robots commands from the URLs from the sitemap.
xpath_selectors= {"meta_command": "//meta[@name='robots']/@content"},
# we have set the crawling to 1000 URLs from the sitemap
custom_settings={"CLOSESPIDER_PAGECOUNT":1000})
df_meta_check = pd.read_json("meta_command_audit.jl", lines=True)
df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True).value_counts()
As usual, a dual output will be provided. Having instructed the machine to spot the actual robots directives, we are going to learn whether the sitemap hosts URLs that threaten the crawling and indexing processes.
True = there are URLs with noindex or nofollow directives submitted in the sitemap
False = there are no URLs with noindex or nofollow directives submitted in the sitemap
💡 TIP 💡
If your Search Console property raises indexing issues for a specific URL despite that page displaying index,follow, you should inspect the <body> section of the page to assess whether a noindex is actually in place
H/T to Kristina Azarenko and her tests about a curious case of a noindexed page.
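A minimal sketch of that check with requests and BeautifulSoup; the URL is a placeholder:
# Check whether a robots meta tag sits inside <body> instead of <head>,
# which can still be picked up and cause unexpected noindexing.
page = requests.get('https://www.example.com/some-page/')  # placeholder URL
soup = BeautifulSoup(page.text, 'html.parser')
body_robots = soup.body.find_all('meta', attrs={'name': 'robots'}) if soup.body else []
if body_robots:
    for tag in body_robots:
        print(tag.get('content'))
else:
    print('No robots meta tag found in <body>')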
Moving back to our meta tag robots audit from the target sitemap’s URLs, we can now get a final roundup of the meta tag robots applied.
To run this check, I decided to use the XML sitemap of seodepths, as the script digests smaller sitemaps with fewer URLs much more easily.
df_meta_check[df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True) == False][["url", "meta_command"]]
Audit Duplicate URLs
This crazy journey throughout an XML sitemap culminates with a screening of duplicate URLs.
The following script compares the original number of URLs in the sitemap with the number remaining after de-duplication, i.e. after duplicates are removed
print(f'Original: {sitemap_raw.shape}')
sitemap_dedup = sitemap_raw.drop_duplicates(subset=['loc'])
print(f'After de-duplication: {sitemap_dedup.shape}')
The following will recap the number of duplicate URLs identified
duplicated_urls = sitemap_raw['loc'].duplicated()
md(f'## Duplicated URLs: {duplicated_urls.sum():,} ({duplicated_urls.mean():.1%})')
Should your sitemap get caught with duplicated pages, the following script will use a pivot table in Pandas to tell you how many there are per sitemap file
pd.pivot_table(sitemap_raw[sitemap_raw["loc"].duplicated()==True], index="sitemap", values="loc", aggfunc="count").sort_values(by="loc", ascending=False)
And most importantly, display URLs that cause duplicate issues.
sitemap_raw[sitemap_raw["loc"].duplicated() == True]
Compare Multiple XML Sitemaps
Large eCommerce sites often come with a ton of XML sitemaps. They may be submitted either via Google Search Console or via robots.txt. In truth, it’s not rare to detect sitemaps referenced only in the robots.txt file and other files submitted exclusively via Google Search Console.
In case you’re in such a messy situation, it would be helpful to learn at a glance whether they are unique files or just duplicates. In other words, you want to know if each file contains URLs that can also be found elsewhere.
Fear no more, Advertools and Pandas will come to the rescue.
You just need to scrape each similar XML sitemap, concatenate them by <loc> using the pd.concat function, and count the number of duplicates across each sitemap.
f = adv.sitemap_to_df("https://www.example.com/media/google_sitemap_9.xml")
df38 = pd.DataFrame(f, columns=['loc', 'changefreq', 'priority', 'sitemap', 'etag', 'sitemap_last_modified', 'sitemap_size_mb', 'download_date'])
sitemap_url_9 = df38.drop(['etag', 'download_date','changefreq', 'priority', 'sitemap', 'sitemap_last_modified', 'sitemap_size_mb'], axis=1)
g = adv.sitemap_to_df("https://www.example.com/media/google_sitemap_8.xml")
df39 = pd.DataFrame(g, columns=['loc', 'changefreq', 'priority', 'sitemap', 'etag', 'sitemap_last_modified', 'sitemap_size_mb', 'download_date'])
sitemap_url_8 = df39.drop(['etag', 'download_date','changefreq', 'priority', 'sitemap', 'sitemap_last_modified', 'sitemap_size_mb'], axis=1)
# Concatenate <loc> cols
result = pd.concat([sitemap_url_9, sitemap_url_8], axis=1, join='outer')
result.columns = ['sitemap_url_9','sitemap_url_8']
# Drop Nan values
result = result.dropna()
# Count the number of values of each column
col_count_9 = result['sitemap_url_9'].count()
col_count_8 = result['sitemap_url_8'].count()
# Divide the number of values by the total number of rows
col_percent = (col_count_9 / sitemap_url_9.shape[0]) * 100
col_percent1 = (col_count_8 / sitemap_url_8.shape[0]) * 100
# Return the result in percentage format
print(f"Percentage of unique URLs in sitemap_url_9: {col_percent:.2f}%")
print(f"Percentage of unique URLs in sitemap_url_8: {col_percent1:.2f}%")
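A simpler way to quantify the overlap is with plain Python sets on the two data frames above; a short sketch:
# Measure how many URLs the two sitemap files share
urls_9 = set(sitemap_url_9['loc'])
urls_8 = set(sitemap_url_8['loc'])
shared = urls_9 & urls_8
print(f'URLs in both files: {len(shared):,}')
print(f'Unique to google_sitemap_9: {len(urls_9 - urls_8):,}')
print(f'Unique to google_sitemap_8: {len(urls_8 - urls_9):,}')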
Conclusion
As far as my experience is concerned, polishing your XML sitemap often leads to improved crawling and indexing performance. This translates into fewer “grey boxes” in the Coverage section of Google Search Console, as pages increase their chances of getting indexed in the short term.
Retrieving insights and juicy data with Python can be a time saver, especially if you’re working on a small-to-medium-sized website.
Although it might appear daunting at first glance, once you get comfortable with the code your SEO research will become a proper piece of cake.
FAQ
How do I export a sitemap to Excel?
To export the sitemap to Excel, you need to execute the df.to_csv function, passing the directory path where you want to store the sitemap on your PC.
sitemap_url_df.to_csv(r'YOUR-PATH-DIRECTORY.csv', index=False, header=True)
You need to append and execute this command right after the lines of code that scrape the sitemap.
Please note that “sitemap_url_df” is the custom name we gave to the data frame.
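If you specifically need an .xlsx file rather than a CSV, pandas can also write Excel directly (this assumes the openpyxl package is installed, which Google Colab ships with):
# Write the sitemap data frame straight to an Excel workbook
sitemap_url_df.to_excel(r'YOUR-PATH-DIRECTORY.xlsx', index=False)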
How to do a sitemap audit?
To do a sitemap audit, first, ensure that your sitemap submission is up-to-date. Then, use a sitemap audit tool such as Screaming Frog, Google Search Console, or SEMrush to check for errors, missing pages, and duplicate content. Finally, review and fix any issues found to improve your site’s crawlability and search engine visibility.
What is sitemap in Python?
In Python, a sitemap is handled as an XML file that lists the URLs of a website along with additional metadata about each URL, such as the frequency of updates and the date of the last modification. Using a sitemap can improve the efficiency of search engine crawlers and thus enhance your website’s visibility in search results.
Further Readings
I highly recommend following Elias Dabbas for Python tips and actionable tutorials focused on SEO.
This post was inspired by his take on the Foreign Policy XML sitemap analysis from Kaggle