How to Bulk Audit Structured Data with Python

A detailed and exhaustive implementation of schemas on valuable pages can positively impact CTR and prompt Google to award your pages with ‘sexy’ SERP features.

But auditing structured data on a website can take ages and even drive you up the wall, with all those bloated lines of HTML getting in the way.

In this post, I am going to provide a method to bulk audit structured data formats and schema types for a bunch of pages so that you can advance your technical SEO with juicy insights at your fingertips.


Requirements and Assumptions

To kick things off, we need to meet a few requirements to set up our Python environment. The leading ! runs the command from a notebook cell, such as Jupyter or Google Colab.

!pip install extruct

!pip install w3lib

Import Packages

Next, we are going to import a few packages in addition to the freshly installed extruct and w3lib.

Requests fetches the HTML of each page, extruct extracts the metadata from it, and Pandas provides the data frame that stores the findings from the scraping. We also import urlparse to pull the domain out of each URL.

import pandas as pd
import extruct
import requests
from w3lib.html import get_base_url
from urllib.parse import urlparse

Extraction of Structured Data Format

It’s time to define a full list of URLs that we intend to scrape to ultimately obtain hints on schema markup usage.

As you can see from the script, I picked a few landing pages along with a bunch of product pages from a British grocery chain, in a bid to assess its overall approach to structured data.

💡You can provide as many URLs as you want, provided that they make sense to your overall structured data audit.

sites = ['https://www.morrisons.com/help/',
         'https://groceries.morrisons.com/products/pukka-steak-slice-569177011',
         'https://www.morrisons-corporate.com/about-us/',
         'https://groceries.morrisons.com/webshop/bundle/breaded-chicken-salad-wrap-bundle/1006702395',
         'https://groceries.morrisons.com/',
         'https://groceries.morrisons.com/content/recipes-by-morrisons-33805?clkInTab=Recipes',
         'https://groceries.morrisons.com/on-offer',
         'https://my.morrisons.com/storefinder/',
         'https://www.morrisons.jobs/',
         ]

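def extract_metadata(url):
    # Fetch the page; the timeout stops a slow server from hanging the audit
    r = requests.get(url, timeout=30)
    # Resolve the base URL so relative links in the markup are interpreted correctly
    base_url = get_base_url(r.text, r.url)
    # uniform=True returns every syntax in the same JSON-LD-like shape
    metadata = extruct.extract(r.text,
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph',
                                         'rdfa'])
    return metadata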
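
A single unreachable or malformed page would stop a bulk run halfway through. As a minimal sketch of one way around that, the wrapper below calls whatever extraction function it is given and falls back to an empty result on failure. The names `safe_extract` and `EMPTY_METADATA` are hypothetical and not part of extruct or Requests.

```python
# Hypothetical helper: `safe_extract` and `EMPTY_METADATA` are illustrative
# names, not part of extruct or Requests.
EMPTY_METADATA = {"json-ld": [], "microdata": [], "opengraph": [], "rdfa": []}


def safe_extract(url, extractor):
    """Call extractor(url); on any failure, return an empty result instead of raising."""
    try:
        return extractor(url)
    except Exception as error:  # deliberately broad: network, HTTP, and parsing errors
        print(f"Skipping {url}: {error}")
        # Build a fresh dictionary so callers cannot mutate the shared template.
        return {syntax: [] for syntax in EMPTY_METADATA}
```

Inside the audit loops, this could be called as `safe_extract(url, extract_metadata)`, so a failing page simply shows up as a row with no structured data.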

With the function defined, we can run the extraction on a sample URL. The preliminary output gives an overview of the structured data formats currently in place on that page.

metadata = extract_metadata('https://www.morrisons-corporate.com/about-us/')
metadata

So far, we have only scratched the surface of the analysis. It’s time to narrow down our audit and check which structured data formats the sampled URL actually uses.

def uses_metadata_type(metadata, metadata_type):
    # True when the format is present in the results and holds at least one item
    return bool(metadata.get(metadata_type))

Is the website using the Open Graph format?

uses_metadata_type(metadata, 'opengraph')

Is the website using the RDFa structured data format?

uses_metadata_type(metadata, 'rdfa')

Is the website using the JSON-LD format?

uses_metadata_type(metadata, 'json-ld')

Is the website using the Microdata format?

uses_metadata_type(metadata, 'microdata')
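
The four checks above can also be rolled into a single call. The sketch below assumes the metadata dictionary has the shape returned by the extraction step; `formats_in_use` and `SYNTAXES` are hypothetical names, and the sample dictionary is hand-written rather than scraped.

```python
SYNTAXES = ("json-ld", "microdata", "opengraph", "rdfa")


def formats_in_use(metadata):
    """Return the syntaxes that have at least one extracted item."""
    return [syntax for syntax in SYNTAXES if metadata.get(syntax)]


# Hand-written sample shaped like the extraction output (not real scraped data).
sample = {
    "json-ld": [{"@type": "Organization"}],
    "microdata": [],
    "opengraph": [{"@type": "website"}],
}
print(formats_in_use(sample))  # ['json-ld', 'opengraph']
```

A missing key (here `rdfa`) is treated the same way as an empty list.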

With those checks in place, we can loop over the list of URLs defined above and collect the findings of the structured data format extraction in a Pandas data frame.

rows = []

for url in sites:
    metadata = extract_metadata(url)
    urldata = urlparse(url)

    # One row per page; note that 'url' stores only the domain (netloc)
    rows.append({
        'url': urldata.netloc,
        'microdata': uses_metadata_type(metadata, 'microdata'),
        'json-ld': uses_metadata_type(metadata, 'json-ld'),
        'opengraph': uses_metadata_type(metadata, 'opengraph'),
        'rdfa': uses_metadata_type(metadata, 'rdfa')
    })

# DataFrame.append was removed in pandas 2.0, so we build the frame from the
# collected rows; the column order follows the dictionary keys
df = pd.DataFrame(rows)

df.head(10).sort_values(by='microdata', ascending=False)

💡Please note that you can adjust the number passed to head() depending on how many URLs you submitted for the analysis.

You should get an output like this:

Output of structured data format audit processed in python

As you can see, the audited pages are bloated with a variety of structured data formats. Needless to say, this is detrimental for SEO as it risks confusing web crawlers, thereby potentially slowing down crawling and indexing.
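
To make the "variety of formats" observation explicit, pages that use more than one format can be flagged programmatically. This is only a sketch: `pages_mixing_formats` and `FORMAT_COLUMNS` are hypothetical names, and the rows are made up for illustration rather than taken from the audited site.

```python
FORMAT_COLUMNS = ("microdata", "json-ld", "opengraph", "rdfa")


def pages_mixing_formats(rows):
    """Return (url, formats) pairs for rows that use more than one format."""
    mixed = []
    for row in rows:
        used = [name for name in FORMAT_COLUMNS if row.get(name)]
        if len(used) > 1:
            mixed.append((row["url"], used))
    return mixed


# Made-up rows for illustration only; not results from the audited site.
rows = [
    {"url": "example.com", "microdata": True, "json-ld": True,
     "opengraph": False, "rdfa": False},
    {"url": "example.org", "microdata": False, "json-ld": True,
     "opengraph": False, "rdfa": False},
]
print(pages_mixing_formats(rows))  # [('example.com', ['microdata', 'json-ld'])]
```

Rows in this shape can be obtained from the audit data frame with `df.to_dict('records')`.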

Extraction of Structured Data Type

Time to wrap our heads around the scraping of the structured data types in place on our list of URLs.

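def key_exists(items, key):
    # True when at least one extracted item declares the given schema type;
    # .get() avoids a KeyError on items that carry no '@type'
    return any(item.get('@type') == key for item in items)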
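
The `key_exists` check only compares the top-level `@type` of each item against a single string. In JSON-LD, `@type` may also be a list, and entities are sometimes nested under `@graph`, so some types can go unnoticed. As a hedged sketch of a more tolerant lookup, the hypothetical helper `collect_types` below gathers every type it can find; the sample items are hand-written.

```python
def collect_types(items):
    """Return the set of every '@type' value found in a list of extracted items."""
    found = set()
    for item in items:
        if not isinstance(item, dict):
            continue
        types = item.get("@type") or []
        # '@type' may be a single string or a list of strings.
        found.update([types] if isinstance(types, str) else types)
        # Entities may be nested under '@graph' as a list or a single object.
        graph = item.get("@graph", [])
        if isinstance(graph, dict):
            graph = [graph]
        found |= collect_types(graph)
    return found


# Hand-written sample items (not real scraped data).
sample_items = [
    {"@type": ["Product", "Thing"]},
    {"@graph": [{"@type": "Organization"}, {"@type": "BreadcrumbList"}]},
    {"name": "item without a type"},
]
print("Organization" in collect_types(sample_items))  # True
```

A check such as `'Product' in collect_types(metadata['json-ld'])` could then stand in for `key_exists` where nested markup is expected.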

As we did with the structured data format extraction, we first need to scrape the schema types for a sample URL.

metadata = extract_metadata('https://groceries.morrisons.com/')
metadata

This returns a raw output which might be a bit messy to go through, yet it delivers bits of information about the specific schema markup currently in place on the audited URL.

Finally, we need to loop over the URLs, scrape the HTML, extract the metadata, and check whether each schema type is implemented in a given structured data format.

rows = []

for url in sites:
    metadata = extract_metadata(url)
    urldata = urlparse(url)

    # One row per page: a True/False flag for each schema type and format
    rows.append({
        'url': urldata.netloc,
        'organization-json-ld': key_exists(metadata['json-ld'], 'Organization'),
        'organization-microdata': key_exists(metadata['microdata'], 'Organization'),
        'product-json-ld': key_exists(metadata['json-ld'], 'Product'),
        'product-microdata': key_exists(metadata['microdata'], 'Product'),
        'offer-json-ld': key_exists(metadata['json-ld'], 'Offer'),
        'offer-microdata': key_exists(metadata['microdata'], 'Offer'),
        'review-json-ld': key_exists(metadata['json-ld'], 'Review'),
        'review-microdata': key_exists(metadata['microdata'], 'Review'),
        'aggregaterating-json-ld': key_exists(metadata['json-ld'], 'AggregateRating'),
        'aggregaterating-microdata': key_exists(metadata['microdata'], 'AggregateRating'),
        'breadcrumblist-json-ld': key_exists(metadata['json-ld'], 'BreadcrumbList'),
        'breadcrumblist-microdata': key_exists(metadata['microdata'], 'BreadcrumbList'),
    })

# DataFrame.append was removed in pandas 2.0, so we build the frame from the
# collected rows; the column order follows the dictionary keys
df_specific = pd.DataFrame(rows)

df_specific.sort_values(by='url', ascending=False).head(9).T

💡 Feel free to adjust the number within head() depending on how many URLs you decided to include in the audit.

This is what the final output should look like:

Output of a bulk structured data type audit processed in python

As for schema type usage, the grocery retailer does not seem to use any of the schema types covered in our analysis.

Conclusion

In a matter of minutes, I was able to get an overall picture of structured data usage across a single website. Although the results may not always be complete or accurate, this framework is a fair starting point for narrowing down a structured data audit on the target website.

Further Reading

This post is inspired by the original take on structured data scraping provided by Matt Clarke and his post How To Use Extruct to identify Schema.org metadata usage.

For further reference, please go check it out and see what improvements you could apply to suit the needs of your structured data custom scraping 😉