📈How to Bulk Audit Structured Data with Python

Reading time: 11 Minutes

A detailed and exhaustive implementation of schema markup on valuable pages can have a positive impact on CTR and prompt Google to award your pages attractive SERP features.

However, auditing structured data on a website can take ages, and wading through all those bloated lines of HTML can drive you up the wall.

In this post, I am going to take you through structured data optimization for semantic search and reveal a method to bulk audit schema markup so that you can step up your SEO game.

Optimize Structured Data for Semantic Search

Structured data is a way of marking up your website’s content so that search engines can better understand it. This can help your website rank higher in search results, and it can also make your content more visible in rich results such as product listings.

🤔 Curious to improve the search appearance of your product review pages?
Learn the benefits of marking up your content’s pros and cons annotations in this post.

There are a few things you can do to optimize your structured data for semantic search:

📌Make sure your structured data is valid.

There are a number of different structured data formats, and each one has its own set of rules. Make sure you’re using the correct format for your content.

📌 Use descriptive labels.

When you’re marking up your content, use descriptive labels that will help search engines understand what it is. For example, if you’re marking up a product, use a label like “ProductName” or “ProductDescription.”

📌Use the right properties.

There are a number of different properties you can use to mark up your content. Choose the properties that are most relevant to your content.

📌Use the right values.

When you’re marking up your content, use the right values for the properties you’re using. For example, if you’re marking up a product, use the price of the product as the value for the “Price” property.

These tips will help you optimize your structured data for semantic search and improve your website’s visibility in search results.
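To make the tips above concrete, here is a minimal sketch of JSON-LD Product markup built as a Python dictionary. The product name is borrowed from the product page sampled later in this post, but the description, price, and currency are hypothetical placeholders, not real values from the site:

```python
import json

# Hypothetical JSON-LD Product markup illustrating the tips above:
# a valid format (JSON-LD), descriptive schema.org types, and the
# right properties with the right values (e.g. the price as "price").
product_markup = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Pukka Steak Slice",  # descriptive label
    "description": "A flaky pastry slice filled with seasoned steak.",
    "offers": {
        "@type": "Offer",
        "price": "1.50",          # placeholder price as the property value
        "priceCurrency": "GBP"
    }
}

# This is the JSON you would embed in a <script type="application/ld+json"> tag
print(json.dumps(product_markup, indent=2))
```

Keeping the markup as a dictionary first and serializing it with `json.dumps` guarantees the output is syntactically valid JSON, which covers the first tip for free.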

However, any optimization effort requires accurate insights. We can extract structured data formats and markup types in bulk using Python.

Let’s explore how.

Requirements and Assumptions

To kick things off, we need to meet a few requirements to adequately set up our Python environment.

!pip install extruct

!pip install w3lib

Import Packages

Next, we are going to import a few packages other than the already installed extruct and w3lib.html.

While Requests will fetch the pages we want to analyze, Pandas will be used to create a data frame to store the findings from the scraping procedures.

import pandas as pd
import extruct
import requests
from w3lib.html import get_base_url
from urllib.parse import urlparse

Extraction of Structured Data Format

It’s time to define a full list of URLs that we intend to scrape to ultimately obtain hints on schema markup usage.

As you can see from the script, I picked a few landing pages from a British grocery chain, along with a bunch of product pages, in a bid to assess their holistic approach to structured data.

💡You can provide as many URLs as you want, provided that they make sense to your overall structured data audit.

sites = ['https://www.morrisons.com/help/',
         'https://groceries.morrisons.com/products/pukka-steak-slice-569177011',
         'https://www.morrisons-corporate.com/about-us/',
         'https://groceries.morrisons.com/webshop/bundle/breaded-chicken-salad-wrap-bundle/1006702395',
         'https://groceries.morrisons.com/',
         'https://groceries.morrisons.com/content/recipes-by-morrisons-33805?clkInTab=Recipes',
         'https://groceries.morrisons.com/on-offer',
         'https://my.morrisons.com/storefinder/',
         'https://www.morrisons.jobs/'
         ]
def extract_metadata(url):
    # Fetch the page and extract every supported structured data syntax
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    metadata = extruct.extract(r.text,
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph',
                                         'rdfa'])
    return metadata

Once the function is defined, we run the extraction on a sample URL, which delivers a preliminary output framing an overview of the structured data formats currently in place.

metadata = extract_metadata('https://www.morrisons-corporate.com/about-us/')
metadata

So far, we have only scratched the surface of the analysis. It's therefore time to narrow down our audit and investigate whether the sampled URL is using a single structured data format or several metadata types at once.

def uses_metadata_type(metadata, metadata_type):
    # True if the syntax key is present and holds at least one item
    return metadata_type in metadata and len(metadata[metadata_type]) > 0

Is the website using the Open Graph format?

uses_metadata_type(metadata, 'opengraph')

Is the website using the RDFa structured data format?

uses_metadata_type(metadata, 'rdfa')

Is the website using the JSON-LD format?

uses_metadata_type(metadata, 'json-ld')

Is the website using the Microdata format?

uses_metadata_type(metadata, 'microdata')

With that established, we can finally build a Pandas data frame containing the findings of the structured data format extraction for the list of URLs we defined above.

rows = []

for url in sites:
    metadata = extract_metadata(url)
    urldata = urlparse(url)

    rows.append({
        'url': urldata.netloc,
        'microdata': uses_metadata_type(metadata, 'microdata'),
        'json-ld': uses_metadata_type(metadata, 'json-ld'),
        'opengraph': uses_metadata_type(metadata, 'opengraph'),
        'rdfa': uses_metadata_type(metadata, 'rdfa')
    })

df = pd.DataFrame(rows, columns=['url', 'microdata', 'json-ld', 'opengraph', 'rdfa'])

df.head(10).sort_values(by='microdata', ascending=False)

💡Please note that you can adjust the number passed to head() depending on how many URLs you submitted for the analysis.

You should obtain an output like this:

As you can see, the audited pages are bloated with a variety of structured data formats. Needless to say, this can be detrimental to SEO, as it contributes to confusing web crawlers and thereby potentially slows down crawling and indexing.
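To quantify the issue, you can count how many formats each page mixes by summing the boolean columns of the audit data frame. A minimal sketch with made-up values (the URLs and True/False flags below are placeholders, not actual audit output):

```python
import pandas as pd

# Hypothetical audit results in the same shape as the data frame built above
audit_df = pd.DataFrame([
    {'url': 'www.morrisons.com', 'microdata': True, 'json-ld': True,
     'opengraph': True, 'rdfa': False},
    {'url': 'groceries.morrisons.com', 'microdata': False, 'json-ld': True,
     'opengraph': True, 'rdfa': True},
])

# Summing boolean columns row-wise counts the formats in use per page
format_cols = ['microdata', 'json-ld', 'opengraph', 'rdfa']
audit_df['formats_used'] = audit_df[format_cols].sum(axis=1)

print(audit_df[['url', 'formats_used']])
```

Pages with a high count are the first candidates for consolidating markup into a single format, ideally JSON-LD.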

Schema Markup Extraction

Time to wrap our heads around the extraction of the types of schema markup in place on our list of URLs.

def key_exists(items, key):
    # True if any extracted item declares the given @type
    return any(item.get('@type') == key for item in items)
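Before running it against live pages, you can sanity-check the helper on toy data shaped like extruct's uniform output; the function is repeated here so the snippet runs on its own, and the item values are invented for illustration:

```python
def key_exists(items, key):
    # True if any extracted item declares the given @type
    return any(item.get('@type') == key for item in items)

# Toy data mimicking extruct's uniform output for a single syntax,
# e.g. metadata['json-ld'] (not real scrape results)
json_ld_items = [
    {'@context': 'https://schema.org', '@type': 'Organization', 'name': 'Example'},
    {'@context': 'https://schema.org', '@type': 'BreadcrumbList'},
]

print(key_exists(json_ld_items, 'Organization'))  # True
print(key_exists(json_ld_items, 'Product'))       # False
```

Using item.get('@type') rather than item['@type'] keeps the check from raising a KeyError on extracted items that lack a type declaration.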

As we did with the structured data format extraction, we first need to scrape the schema types for a sample URL.

metadata = extract_metadata('https://groceries.morrisons.com/')
metadata

The raw output may be a bit messy to go through, yet it delivers bits of information about the specific schema markup currently in place on the audited URL.

Finally, we need to loop over the URLs, scrape the HTML, extract the metadata, and ultimately check over each key to see whether it is implemented by a given metadata type.

rows_specific = []

for url in sites:
    metadata = extract_metadata(url)
    urldata = urlparse(url)

    rows_specific.append({
        'url': urldata.netloc,
        'organization-json-ld': key_exists(metadata['json-ld'], 'Organization'),
        'organization-microdata': key_exists(metadata['microdata'], 'Organization'),
        'product-json-ld': key_exists(metadata['json-ld'], 'Product'),
        'product-microdata': key_exists(metadata['microdata'], 'Product'),
        'offer-json-ld': key_exists(metadata['json-ld'], 'Offer'),
        'offer-microdata': key_exists(metadata['microdata'], 'Offer'),
        'review-json-ld': key_exists(metadata['json-ld'], 'Review'),
        'review-microdata': key_exists(metadata['microdata'], 'Review'),
        'aggregaterating-json-ld': key_exists(metadata['json-ld'], 'AggregateRating'),
        'aggregaterating-microdata': key_exists(metadata['microdata'], 'AggregateRating'),
        'breadcrumblist-json-ld': key_exists(metadata['json-ld'], 'BreadcrumbList'),
        'breadcrumblist-microdata': key_exists(metadata['microdata'], 'BreadcrumbList'),
    })

df_specific = pd.DataFrame(rows_specific)

df_specific.sort_values(by='url', ascending=False).head(9).T

💡 Feel free to adjust the number within head() depending on how many URLs you decided to include in the audit.

This is what the final output should look like:

As for schema type usage, the grocery retailer does not seem to implement any of the relevant schema types covered in our analysis.

Conclusion

Within just a few minutes, you can obtain a comprehensive overview of structured data usage across your site. While the results may not always be perfect, this framework provides a solid foundation for starting a structured data audit and identifying areas that require further attention.

When it comes to auditing a large, complex website, having automated tools at your disposal can make all the difference in streamlining the process and maximizing efficiency.

If you want to start speaking the language of search engines to improve their understanding of your content, this Python framework is definitely worth considering.