🍞 Audit a Site Structure by Breadcrumbs and H1 Tags

Reading time: 10 minutes

Have you ever heard of breadcrumbs and H1 headings?

They may not be the most glamorous parts of your website, but they can reveal a lot about the structure and content.

Breadcrumbs are often overlooked in internal linking, while H1 headings tend to receive attention only from those working in content marketing.

But what happens if we blend them together for a deeper analysis?

In this blog post, I’ll share a method to audit your site’s structure based on the entities stemming from the breadcrumb paths and the H1 headings of a page.

Get ready to discover new insights into your website’s structure and content!

This post was proofread with the aid of ChatGPT-4 and manually reviewed prior to publication.

Image: a giant breadcrumb holding an H1 flag, created by Bing AI

Breadcrumbs can improve user experience and help search engines discover a website. This can reduce bounce rate and prevent issues that may arise from faceted navigation, such as duplicate pages and inefficiency in crawling.

On the other hand, H1 headings can help both users and search engines understand the topic of an article. A well-crafted and unique H1 also ensures that the content is fully accessible to users with visual impairments.

Fine-tuning breadcrumbs and H1 headings can reduce the guesswork of natural language processing (NLP), resulting in a faster crawl rate and improved topical authority of a page.

The purpose of this script is to help users identify areas for improvement in their site architecture by matching the main entities a product detail page (PDP) expresses through both its breadcrumb path and its H1 heading.


Fine-Tuning your Site Structure with Fuzzy Matching

What better way to perform text mining tasks based on fuzzy matching techniques than using Python?

This programming language is well-known for its versatility and power in performing analytical tasks, and finding similarities between a corpus of text is just one of its many capabilities.

Fuzzy matching (FM), also known as approximate string matching, is a technique used in artificial intelligence and machine learning to identify similar, but not identical, elements in data sets.

In terms of SEO optimization, a wider spectrum of similar items will help us make more informed decisions than using exact-matching models.

Requirements & Assumptions

To start using this automated framework, you should be aware of a few mandatory requirements.

1️⃣ A Custom Extraction file export of your site’s breadcrumbs.

This involves extracting the CSS selector responsible for pulling the breadcrumbs on a page of your target site.

You can use Screaming Frog to run a crawl with the custom extraction and export the output with Address, Status Code and Breadcrumb as headers.

2️⃣ H1 Export file.

If you’re using Screaming Frog as your trusted web crawler, then all you have to do is export the H1 report and ensure it comes with Address and H1-1 as headers.
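Before crawling, it can help to sketch the shape both exports should have. The dataframes below are hypothetical stand-ins for the real Screaming Frog files (the URL and values are made up; only the column names matter):

```python
import pandas as pd

# Hypothetical rows sketching the two exports this tutorial expects;
# the column names mirror the headers mentioned above.
breadcrumb_export = pd.DataFrame({
    'Address': ['https://example.com/bags/cassette'],
    'Status Code': [200],
    'Breadcrumb': ['Homepage > Shop Women > Bags > Cassette'],
})
h1_export = pd.DataFrame({
    'Address': ['https://example.com/bags/cassette'],
    'H1-1': ['Cassette'],
})

# Quick sanity check before running the audit
assert {'Address', 'Status Code', 'Breadcrumb'} <= set(breadcrumb_export.columns)
assert {'Address', 'H1-1'} <= set(h1_export.columns)
```

If your exports use different headers, rename them to match before moving on, as the rest of the script relies on these exact column names.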

Great! Now that you’re all set up, let’s jump on the optimisation bandwagon! 😎

Install and Import Dependencies

To get the ball rolling, we need to install and import a few libraries. The stack required to fine-tune breadcrumbs and H1 tags is largely based on Pandas, PolyFuzz for fuzzy matching, and Plotly Express for plotting elegant charts.

!pip install polyfuzz plotly

import pandas as pd
import plotly.express as px
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

Import the Datasets and Data Pre-Processing

Now it’s time to upload the Breadcrumb custom extraction and the H-1 file export.

Breadcrumb = pd.read_excel('/content/Breadcrumb.xlsx')
df = pd.DataFrame(Breadcrumb, columns=['Breadcrumb'])

H1 = pd.read_excel('/content/h1_all.xlsx')
df2 = pd.DataFrame(H1, columns=['H1-1'])
df2.columns = ['H1']  # rename for consistency with the rest of the script

Next up, we concatenate the datasets using Pandas:

Comparison = pd.concat([df,df2], axis=1)
Comparison = Comparison.dropna()

Then we pre-process the data so that both lists have the same size before running the similarity checks.

Breadcrumbs = Comparison['Breadcrumb'].tolist()
cleanedList = [x for x in Breadcrumbs if str(x) != 'nan']
H1 = Comparison['H1'].tolist()
cleanedList2 = [x for x in H1 if str(x) != 'nan']
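Since every similarity score assumes the breadcrumb and H1 were crawled from the same URL, a quick alignment check is worthwhile before matching. A sketch with two hypothetical rows:

```python
# Hypothetical rows standing in for the Comparison dataframe's columns
Breadcrumbs = ['Homepage > Bags > Cassette', 'Homepage > Accessories > Belts']
H1 = ['Cassette', 'Leather Belt']

cleanedList = [x for x in Breadcrumbs if str(x) != 'nan']
cleanedList2 = [x for x in H1 if str(x) != 'nan']

# Both lists must have the same length, or rows will be mismatched
assert len(cleanedList) == len(cleanedList2), 'breadcrumbs and H1s are misaligned'
print(len(cleanedList))  # → 2
```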

Now we can calculate TF-IDF vectors over character 3-grams, using a minimum similarity score of 0.95 and the k-nearest neighbours (kNN) cosine method.

The match function is used to match the two cleaned lists, and the resulting similarity scores are stored in a dataframe called outcome.

The outcome dataframe is sorted in descending order of similarity and shows the top matches between the two lists, with the Breadcrumbs and H1 columns alongside their respective similarity scores.

tfidf = TFIDF(n_gram_range=(3, 3), min_similarity=0.95, cosine_method='knn')
model = PolyFuzz(tfidf)
model.match(cleanedList, cleanedList2)
similarity = model.get_matches()

# display the output; get_matches() returns From/To/Similarity columns,
# so rename them to match our data
outcome = pd.DataFrame(similarity)
outcome.columns = ['Breadcrumbs', 'H1', 'Similarity']

# sort by similarity score
outcome.sort_values('Similarity', ascending=False, inplace=True)
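To build intuition for what the model is doing, the core idea of character-trigram similarity can be illustrated in plain Python. This is a simplified sketch using raw trigram counts and cosine similarity, not the full TF-IDF weighting PolyFuzz applies, and the example strings are hypothetical:

```python
from collections import Counter
from math import sqrt

def trigrams(text: str) -> Counter:
    """Count character 3-grams, the same unit the TFIDF model above uses."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[g] * b[g] for g in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A breadcrumb containing the H1 entity shares many trigrams and scores high...
print(round(cosine(trigrams('Home > Sunglasses'), trigrams('Sunglasses')), 2))
# ...while a pair with no overlapping trigrams scores zero.
print(round(cosine(trigrams('Home > Accessories'), trigrams('Belts')), 2))
```

Word order and small spelling differences barely affect trigram overlap, which is what makes fuzzy matching more forgiving than exact matching.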

Let’s now take a look at the distribution of the similarity scores between breadcrumbs and H1 headings.


As the distribution is skewed toward low scores, we can safely claim the similarity score doesn’t follow a normal distribution: overall, breadcrumbs and H1 tags aren’t similar across the crawled pages of the website.

This suggests room for SEO optimisation, as the closer a score is to zero, the more opportunity there is to optimise the semantics of the main navigation.

💡 Can you feel the vibe of a data-informed SEO strategy? Jump on a fully automated SEO market analysis for semantic search and make the most of it.

Discussing and Plotting Similarities

Once we have obtained a statistical response from the TF-IDF calculation for both datasets, we can use a scatterplot to assess the degree of similarity between each breadcrumb path and the corresponding H1 tag.

fig = px.scatter(outcome.head(100), x="H1", y="Breadcrumbs", size="Similarity",
                 color="Similarity", hover_name="Similarity",
                 title="Similarity Breadcrumbs vs H1")

# Show the plot
fig.show()

Here’s what you’ll get.

However, let’s focus for a minute on what we can derive from this sophisticated scatterplot.

As an example, the scatterplot indicates that pages with “Accessories” as the H1 tag are poorly related to the semantics of their breadcrumb paths. The purple bubble, which refers to a similarity score of approximately 0.7, shows that “Belts” is the specified entity in the breadcrumb, but it doesn’t appear in the H1 of the pages.

This lack of consistency between the vertical linking and the semantic information conveyed by the PDP is evident in this case.

On the other hand, pages with “Sunglasses” as the H1 tag are positively related to the breadcrumb paths that describe their parent category. The multiple warm-colored bubbles, which refer to a similarity score of around 0.78, show that “Sunglasses” is the specified entity in the breadcrumb path and that it appears in the H1 tag as well.

This indicates that there is consistency between the vertical linking and the semantic information conveyed by the PDP in this example.

Discussing and Plotting Dissimilarities

Once we have identified the similarities in our dataset, let’s focus on the quick wins: the cases where the similarity between breadcrumb paths and H1 tags is weak.

To achieve this, we filter the similarity dataset down to scores below 0.25, the first quartile of the 0-to-1 similarity range.

dissimilarity = outcome[outcome['Similarity'] < 0.25].sort_values('Similarity', ascending=False)

# save the output (the filename is just an example)
dissimilarity.to_csv('dissimilarity.csv', index=False)
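A small variation worth considering: instead of hard-coding 0.25, the cut-off can be derived from the observed scores with pandas’ `quantile`. A sketch with hypothetical scores standing in for the real `outcome` dataframe:

```python
import pandas as pd

# Hypothetical scores standing in for the real `outcome` dataframe
outcome = pd.DataFrame({'Similarity': [0.05, 0.10, 0.20, 0.40, 0.70, 0.90]})

# Derive the first-quartile threshold from the data itself
threshold = outcome['Similarity'].quantile(0.25)
dissimilarity = outcome[outcome['Similarity'] < threshold]
print(threshold, len(dissimilarity))
```

This keeps the “quick wins” bucket proportional to your site’s actual score distribution rather than an arbitrary constant.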

Grouping Breadcrumbs by Similarity

To wrap up, we can group breadcrumb paths by similarity score and plot the top 10 most frequent ones. This will help us understand the entire vertical linking structure and identify areas of the website where link juice is passed on less frequently.

breadcrumb_count = outcome.groupby('Breadcrumbs').agg({'Similarity': 'count'}).sort_values('Similarity', ascending=False).head(10)

barplot_fig = px.bar(breadcrumb_count,
                     labels={'Similarity': 'Count', 'x': 'Breadcrumbs'})

barplot_fig.update_layout(title='Top 10 Most Frequent Breadcrumbs')
barplot_fig.show()

As the plot above suggests, the most frequent breadcrumb by similarity group is “Homepage > Shop Women > Bags > Cassette”, while the least prominent one is “Homepage > Shop Men > Bags > Padded > Tech > Cassette”.

Because there is no analytics process behind this plot, we cannot definitively explain the underlying reasons for this discrepancy. However, we could assume that:

  • Longer breadcrumb paths indicate higher crawl depth and are reasonably less frequent in the site structure.
  • The main focus of the eCommerce site seems to be on selling bags, and any entity that deviates from the underlying semantics (such as “Tech”) receives less exposure.
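The crawl-depth assumption in the first bullet is easy to quantify: counting the “>” separators in a breadcrumb path gives a rough proxy for depth. A quick sketch using the two example paths from the plot:

```python
# Counting '>' separators in a breadcrumb path approximates crawl depth.
paths = [
    'Homepage > Shop Women > Bags > Cassette',
    'Homepage > Shop Men > Bags > Padded > Tech > Cassette',
]
for path in paths:
    depth = path.count('>')  # levels below the homepage
    print(depth, path)
```

The longer path sits two levels deeper, which is consistent with it appearing less often in the crawl.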

Grouping H1 Tags by Similarity

Likewise, we can group H1 tags by similarity score and plot the top 10 most frequent ones.

Needless to say, this will help us get a grip on the most prominent entities the website leverages to introduce its web pages.

headings_count = outcome.groupby('H1').agg({'Similarity': 'count'}).sort_values('Similarity', ascending=False).head(10)

barplot_fig = px.bar(headings_count,
                     labels={'Similarity': 'Count', 'x': 'H1'})

barplot_fig.update_layout(title='Top 10 Most Frequent H1')
barplot_fig.show()

If you followed along with the tutorial, you’ll notice, to no surprise, that “Accessories” is the most popular among the H1 tags.

When cross-checked against the breadcrumb paths, we learned this entity was largely misused, as it turned out to be too generic with respect to the vertical linking semantics.

Again, since this is just an audit, we can’t infer the reasons why “Accessories” is more frequent and prominent than “Arco”. However, it suggests clear room for improvement: devising more granular, topic-oriented H1 tags.


Conclusion

Fine-tuning the breadcrumb paths against the H1 of a page can help reduce search engines’ NLP guesswork and lead to improved crawl quality.

Think about it: if web crawlers can easily read through your website structure, you will improve the crawl rate and help search engines establish the entity connections needed for stronger topic modeling.