Despite the rise of semantic understanding driven by ongoing developments in NLP and NLU machine learning, traditional structured data still represents a goldmine for your SEO.
A detailed and exhaustive implementation of schema markup on valuable pages can positively impact CTR and prompt Google to reward your pages with ‘sexy’ SERP features.
But auditing structured data across a website can take ages and even drive you up the wall, with all those bloated lines of HTML to wade through.
In this post, I am going to share a method to bulk audit structured data formats and schema types for a batch of pages, so that you can advance your technical SEO with juicy insights at your fingertips.
Requirements and Assumptions
To kick things off, we need to meet a few requirements to set up our working environment.
- Run the script on Google Colab and change the runtime type to GPU. This offloads execution of the script to Google's servers rather than your own PC's CPU.
- Make sure to install the extruct and w3lib libraries
!pip install extruct
!pip install w3lib
Import Packages
Next, we are going to import a few packages on top of the extruct and w3lib libraries we just installed.
Requests will handle fetching the pages, while pandas will be used to create a data frame that collects the findings from the scraping procedure.
import pandas as pd
import extruct
import requests
from w3lib.html import get_base_url
from urllib.parse import urlparse
Extraction of Structured Data Format
It’s time to define the full list of URLs that we intend to scrape in order to obtain hints on schema markup usage.
As you can see from the script, I picked a few landing pages along with a bunch of product pages from a British grocery chain, in a bid to assess their holistic approach to structured data.
💡 You can provide as many URLs as you want, provided that they make sense for your overall structured data audit.
sites = ['https://www.morrisons.com/help/',
         'https://groceries.morrisons.com/products/pukka-steak-slice-569177011',
         'https://www.morrisons-corporate.com/about-us/',
         'https://groceries.morrisons.com/webshop/bundle/breaded-chicken-salad-wrap-bundle/1006702395',
         'https://groceries.morrisons.com/',
         'https://groceries.morrisons.com/content/recipes-by-morrisons-33805?clkInTab=Recipes',
         'https://groceries.morrisons.com/on-offer',
         'https://my.morrisons.com/storefinder/',
         'https://www.morrisons.jobs/'
         ]
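By the way, if your URL sample lives elsewhere, for instance in an export from your crawler, you don't have to hard-code it. A minimal sketch, assuming a hypothetical urls.txt file with one URL per line:

# Optional: load the URL sample from a plain-text file instead of hard-coding it
# 'urls.txt' is a hypothetical file with one URL per line
with open('urls.txt') as f:
    sites = [line.strip() for line in f if line.strip()]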
Let's start by extracting the metadata for a single page.
To do so, we fetch the given URL with the aid of Requests and then extract the structured data formats present on the sampled page with extruct.
def extract_metadata(url):
    # Fetch the page and work out its base URL
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    # Extract every supported structured data syntax into a uniform dictionary
    metadata = extruct.extract(r.text,
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph',
                                         'rdfa'])
    return metadata
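Some sites are picky about automated requests, so you may occasionally hit timeouts or error pages with the plain call above. Below is a slightly more defensive sketch of the same function; the User-Agent string and timeout value are placeholder assumptions, not requirements of extruct.

def extract_metadata_safe(url):
    # Hypothetical hardened variant: identify the script and avoid hanging on slow pages
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; schema-audit-script)'}
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()  # surface 4xx/5xx responses instead of silently parsing error pages
    base_url = get_base_url(r.text, r.url)
    return extruct.extract(r.text,
                           base_url=base_url,
                           uniform=True,
                           syntaxes=['json-ld', 'microdata', 'opengraph', 'rdfa'])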
Once the function is defined, we run the extraction, which delivers a preliminary output giving an overview of the structured data formats currently in place on the sampled URL.
metadata = extract_metadata('https://www.morrisons-corporate.com/about-us/')
metadata
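The returned dictionary can be dense to read in Colab. If you prefer a tidier view, a quick sketch with the standard json module pretty-prints it:

import json

# Pretty-print the extracted metadata so each syntax block is easier to scan
print(json.dumps(metadata, indent=2))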
So far, we have only scratched the surface of the analysis. It’s therefore time to narrow down our audit and investigate which structured data formats the sampled URL is actually using.
def uses_metadata_type(metadata, metadata_type):
    # True if the syntax is present in the output and holds at least one item
    if (metadata_type in metadata.keys()) and (len(metadata[metadata_type]) > 0):
        return True
    else:
        return False
Is the website using Open Graph?
uses_metadata_type(metadata, 'opengraph')
Is the website using the RDFa structured data format?
uses_metadata_type(metadata, 'rdfa')
Is the website using the JSON-LD format?
uses_metadata_type(metadata, 'json-ld')
Is the website using the Microdata format?
uses_metadata_type(metadata, 'microdata')
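Rather than calling the helper four times by hand, you can also loop over the syntaxes in one pass. A small convenience sketch:

# Check every extracted syntax for the sampled URL in one go
for syntax in ['json-ld', 'microdata', 'opengraph', 'rdfa']:
    print(syntax, uses_metadata_type(metadata, syntax))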
With those checks done for a single page, we can finally build a pandas data frame containing the structured data format findings for the whole list of URLs we defined above.
df = pd.DataFrame(columns=['url', 'microdata', 'json-ld', 'opengraph', 'rdfa'])

rows = []
for url in sites:
    metadata = extract_metadata(url)
    urldata = urlparse(url)
    rows.append({
        'url': urldata.netloc,
        'microdata': uses_metadata_type(metadata, 'microdata'),
        'json-ld': uses_metadata_type(metadata, 'json-ld'),
        'opengraph': uses_metadata_type(metadata, 'opengraph'),
        'rdfa': uses_metadata_type(metadata, 'rdfa')
    })

# DataFrame.append() has been removed in recent pandas releases, so we build the frame from the collected rows
df = pd.DataFrame(rows, columns=df.columns)

df.head(10).sort_values(by='microdata', ascending=False)
💡 Please note that you can adjust the number within the head() function depending on how many URLs you submitted for the analysis.
You should get an output like this:
As you can see, the audited pages are bloated with a variety of structured data formats. Needless to say, this can be detrimental for SEO, as it contributes to confusing web crawlers and thereby potentially slows down crawling and indexing.
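If you want to keep or share the findings beyond the notebook, exporting the data frame takes one line; the file name below is just an example:

# Save the format-level audit for reporting or later comparison
df.to_csv('structured_data_formats.csv', index=False)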
Extraction of Structured Data Type
Time to wrap our heads around the scraping of the structured data types in place on our list of URLs. We start with a small helper that checks whether a given @type appears among the items extracted for a syntax.
def key_exists(items, key):
    # True if any extracted item declares the given @type
    return any(item.get('@type') == key for item in items)
As we did with the structured data format extraction, we first need to scrape the schema types for a sample URL.
metadata = extract_metadata('https://groceries.morrisons.com/')
metadata
This returns a raw output which might be a bit messy to go through, yet it reveals the specific schema markup currently in place on the audited URL.
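Before committing to a fixed list of schema types to check, it can help to list every @type actually declared on the sampled page. A minimal sketch, assuming the uniform extruct output used above, where each item is a dictionary and @type can be a string or a list:

# Collect every @type declared in the JSON-LD and Microdata blocks of the sampled page
found_types = set()
for syntax in ['json-ld', 'microdata']:
    for item in metadata.get(syntax, []):
        item_type = item.get('@type')
        if isinstance(item_type, list):
            found_types.update(item_type)
        elif item_type:
            found_types.add(item_type)

print(found_types)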
Finally, we loop over the URLs, scrape the HTML, extract the metadata, and check each schema type to see whether it is implemented in a given metadata format.
df_specific = pd.DataFrame(columns=['url',
                                    'organization-json-ld',
                                    'organization-microdata',
                                    'product-json-ld',
                                    'product-microdata',
                                    'offer-json-ld',
                                    'offer-microdata',
                                    'review-json-ld',
                                    'review-microdata',
                                    'aggregaterating-json-ld',
                                    'aggregaterating-microdata',
                                    'breadcrumblist-json-ld',
                                    'breadcrumblist-microdata'])
rows = []
for url in sites:
    metadata = extract_metadata(url)
    urldata = urlparse(url)
    rows.append({
        'url': urldata.netloc,
        'organization-json-ld': key_exists(metadata['json-ld'], 'Organization'),
        'organization-microdata': key_exists(metadata['microdata'], 'Organization'),
        'product-json-ld': key_exists(metadata['json-ld'], 'Product'),
        'product-microdata': key_exists(metadata['microdata'], 'Product'),
        'offer-json-ld': key_exists(metadata['json-ld'], 'Offer'),
        'offer-microdata': key_exists(metadata['microdata'], 'Offer'),
        'review-json-ld': key_exists(metadata['json-ld'], 'Review'),
        'review-microdata': key_exists(metadata['microdata'], 'Review'),
        'aggregaterating-json-ld': key_exists(metadata['json-ld'], 'AggregateRating'),
        'aggregaterating-microdata': key_exists(metadata['microdata'], 'AggregateRating'),
        'breadcrumblist-json-ld': key_exists(metadata['json-ld'], 'BreadcrumbList'),
        'breadcrumblist-microdata': key_exists(metadata['microdata'], 'BreadcrumbList'),
    })

# As above, build the frame from the collected rows instead of the removed DataFrame.append()
df_specific = pd.DataFrame(rows, columns=df_specific.columns)
df_specific.sort_values(by='url', ascending=False).head(9).T
💡 Feel free to adjust the number within head() depending on how many URLs you decided to include in the audit.
This is what the final output should look like:
As for schema type usage, the grocery retailer does not seem to implement any of the relevant schema types checked in our analysis.
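If you want a quick sense of how widespread each schema type is across the sample, you can sum the boolean columns of the data frame built above; this is just a convenience view on top of df_specific:

# Count how many of the audited pages implement each schema type / format combination
type_coverage = df_specific.drop(columns=['url']).sum().sort_values(ascending=False)
print(type_coverage)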
Conclusion
In a matter of minutes, I was able to gain an overall picture of structured data usage across a single website. Although the outcomes may not always be exhaustive or perfectly accurate, this framework represents a fair starting point to narrow down a structured data audit on the targeted website.
When it comes to auditing a big, messy website, it is incredibly helpful to have a few automation tools at your disposal to empower your tech SEO audit. As part of that, you may want to check out the handy sitemap audit that takes only a few lines of easy Python code.
Further Readings
This post is inspired by the original take on structured data scraping provided by Matt Clarke in his post How To Use Extruct to identify Schema.org metadata usage.
For further reference, please go check it out and see what improvements you could apply to suit the needs of your custom structured data scraping.