Using Keyword Density for Topical Coverage and Entity Discovery


For years, SEOs have been proclaiming the death of keyword density, pinning the blame on those who still dared to believe in the myth.

Yet few SEOs realise that keyword density in copy is still a thing, especially when it comes to gauging the semantic distance between the topics on a site and evaluating keyword ranking potential. What if your site surfaced a bunch of entities resonating with everything but the real gist of what should be your site’s USP? What if your copywriters were targeting the wrong keywords when writing your landing pages?

In this post, I will take you through a Python script designed to automate topical coverage analysis on a handful of pages and make sure they resonate with the website’s value proposition.

If your most-used keywords don’t align with the site’s core identity, your overall E-A-T signals will likely take a hit.

Requirements & Assumptions


Before kicking off, you should know that the following framework is designed to return the 10 most frequently used n-grams per page. This gives you an overview of each web page’s topical coverage so that you can quickly determine whether the pages reflect the website’s identity or go astray.

There are two main requirements to get started: a Python environment (this walkthrough assumes Google Colab) and a Google Knowledge Graph Search API key.

⚠️ Please note that the script is only meant to check whether the pages reflect the identity of their website, not to generate entities.

Install and Import the Packages

First things first, we are going to install a few libraries.

!pip install fake_useragent
!pip install bs4

import requests                       # fetch the web pages
from bs4 import BeautifulSoup         # parse the HTML
from collections import Counter       # count word frequencies
import pandas as pd                   # build the output data frames
import time                           # throttle the requests
import io
import json                           # parse the Knowledge Graph API response
from fake_useragent import UserAgent  # rotate a realistic user agent
from google.colab import files        # upload/download files in Colab
import numpy as np

Upload the URLs

You can submit either a restricted list of URLs or an entire CSV file containing the top-performing landing pages of your target website.

⭐️ Upload a CSV File

If you choose to upload a file, you only need to execute the following script:

crawldf = pd.read_csv('COPY_PATH')        # read the CSV export; replace 'COPY_PATH' with your file path
addresses = crawldf['Address'].tolist()   # list of URLs to crawl
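
If you are working in Colab and the crawl export is not on the runtime yet, you can upload it interactively with the files module imported earlier and then run the cell above (the upload dialog is a Colab-only feature):

# Colab only: open a file picker and upload the crawl export to the runtime
uploaded = files.upload()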

⭐️ Upload a List of URLs

Alternatively, if you would rather analyse just a handful of URLs, you only need to run the following:

addresses = ['URL 1', 'URL 2', 'URL 3']

Make sure to run only one of the two options above to avoid confusing the crawler.
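
If you would rather catch a misconfiguration before the crawl starts than halfway through it, an optional guard like the one below (not part of the original script) confirms that the list was actually populated:

# Optional guard: stop early if `addresses` was never populated by either option above
assert isinstance(addresses, list) and len(addresses) > 0, "Populate addresses via the CSV upload or the URL list first"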

Set up the HTTP Request User Agent

Let’s jump to the more technical part. We need to set up a fake user agent to reduce the chance that the target server blocks the web page scraping that follows.

ua = UserAgent()
 
headers = {
    'User-Agent': ua.chrome
}

Knowledge Graph API Key

The helper function below queries the Google Knowledge Graph Search API for a keyword and returns the entity type(s) associated with it, falling back to ["none"] when nothing is found.

def gkbAPI(keyword):
    # query the Knowledge Graph Search API for the keyword
    url = "https://kgsearch.googleapis.com/v1/entities:search?query="+keyword+"&key=YOUR_API_KEY&limit=1&indent=True"

    payload = {}
    headers = {}

    response = requests.request("GET", url, headers=headers, data=payload)  # make the call and store the response

    data = json.loads(response.text)

    try:
        getlabel = data['itemListElement'][0]['result']['@type']
    except (KeyError, IndexError):
        # no entity found for this keyword
        getlabel = ["none"]
    return getlabel

In the url variable, make sure to replace YOUR_API_KEY with your own Knowledge Graph API key.
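
As a quick sanity check, you can call the helper on a single term. The keyword below is just a hypothetical example; the types returned depend on your API key and on what the Knowledge Graph currently holds:

# Hypothetical example: print the Knowledge Graph types for one term
# (falls back to ["none"] if the lookup returns nothing)
print(gkbAPI("whisky"))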

Scraping and Parsing the Web Pages

Next, we start scraping the selected web pages. First, we create an empty list that we’ll use to store the site-wide data, and then we run a for loop over the URLs in the addresses list.

fulllist = []
 
for row in addresses:
    time.sleep(1)
    url = row
    print(url)
 
    res = requests.get(url,headers=headers)
    html_page = res.content

⚠️ If the code errors out at this point, make sure you have submitted your web pages via either a CSV file or a list of URLs.

Once we fetch the URL contents, we can load them into a bs4 object that we’re going to name soup:

soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True) #scrape the text within the HTML from the above URLs

The find_all() function with the text=True parameter extracts only the text nodes between HTML tags.
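
As a minimal illustration, here is what text=True returns on a throwaway HTML snippet (not part of the crawl). Note that the contents of the script tag come back too, which is exactly why we filter by parent tag in the data cleaning step below:

# Throwaway example: text=True returns every text node, including script contents
demo = BeautifulSoup("<p>Hello <b>world</b></p><script>x=1;</script>", "html.parser")
print([str(t) for t in demo.find_all(text=True)])
# ['Hello ', 'world', 'x=1;']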

Data Cleaning

Let’s whizz through a bit of data cleaning on the strings we have just fetched.

First, we remove stopwords, such as pronouns and articles we don’t need to count. The list could easily grow endless, so feel free to tweak it to your needs.

stopwords = ['get','ourselves', 'hers','us','there','you','for','that','as','between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than','its','(en)']

Next, we can filter out irrelevant HTML tags whose text would only muddy the n-gram counts and, in turn, the entity lookups.

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head', 
    'input',
    'script',
    'style',
 
]

Finally, we can filter out special characters that would otherwise get in the way of generating clean n-grams later on:

ban_chars = ['|','/','&','()']

With the data dusted off, we can concatenate the page text into one large string and then split it into a list of words:

for t in text:
    if t.parent.name not in blacklist:
        output += t.replace("\n","").replace("\t","")
output = output.split(" ")

Finally, we apply the filters established above for the data cleaning:

output = [x for x in output if not x=='' and not x[0] =='#' and x not in ban_chars] 
output = [x.lower() for x in output]
output = [word for word in output if word not in stopwords]
 
fulllist += output

Fetch the Top 10 N-Grams Count

Do you recall the Counter class we imported from collections at the beginning?

Here is where Counter comes into play. We pass it the cleaned list of words and ask for the 10 most frequent terms with most_common(10).

As mentioned, the reason behind this number is to give you a quick grasp of how closely the web pages recall their domain’s identity, while keeping the process as manageable as possible.

counts = Counter(output).most_common(10)
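
Since the loop body has been shown in separate chunks, here is a condensed sketch of how the fetching, cleaning, and counting steps fit together inside the for loop over addresses. It reuses the headers, blacklist, ban_chars, and stopwords variables defined earlier; resetting output for every URL is an assumption I am adding so per-page counts do not bleed into one another:

fulllist = []

for url in addresses:
    time.sleep(1)                               # small pause between requests
    print(url)

    # fetch and parse the page
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.content, 'html.parser')
    text = soup.find_all(text=True)

    # rebuild the visible text, skipping blacklisted tags
    output = ''                                 # reset per URL (assumption, see above)
    for t in text:
        if t.parent.name not in blacklist:
            output += t.replace("\n", "").replace("\t", "")

    # tokenise and clean
    words = output.split(" ")
    words = [x for x in words if x != '' and not x.startswith('#') and x not in ban_chars]
    words = [x.lower() for x in words]
    words = [w for w in words if w not in stopwords]

    fulllist += words                           # site-wide pool used later
    counts = Counter(words).most_common(10)     # top 10 n-grams for this URL
    print(counts)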

Top 10 N-Grams for a Single URL

The last step is to loop over the counts, look up each term with the Knowledge Graph helper, and return the most frequent n-grams along with their entity types. We do this first for a single URL and ultimately site-wide.

After the loop, we set up a pandas DataFrame with a few styling rules for a clear-cut presentation of the data.

Finally, you can also export the DataFrame to a CSV file.

all_term_data = []
for key, value in counts:
    labels = gkbAPI(key)
    term_data = {
        'Topic': key,
        'Density': value,
        'Entity': ', '.join(labels)
    }
    all_term_data.append(term_data)
df = pd.DataFrame(all_term_data)
selection = ['Topic','Density','Entity']
df = df[selection]
df.head(20).style.set_table_styles(
[{'selector': 'th',
  'props': [('background', '#7CAE00'), 
            ('color', 'white'),
            ('font-family', 'verdana')]},
 
 {'selector': 'td',
  'props': [('font-family', 'verdana')]},

 {'selector': 'tr:nth-of-type(odd)',
  'props': [('background', '#DCDCDC')]}, 
 
 {'selector': 'tr:nth-of-type(even)',
  'props': [('background', 'white')]},
 
]
).hide_index()  # note: hide_index() is deprecated in newer pandas; use .hide(axis="index") instead

df.to_csv(r'PATH\topical_coverage.csv', index = False, header=True)  # replace PATH with your output folder
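
If you are running the notebook in Colab, the exported CSV lives on the runtime. To pull it down to your machine you can use the files module imported earlier (the filename below assumes you saved the file in the working directory under the name above):

# Colab only: download the export to your local machine
files.download('topical_coverage.csv')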

The output you receive will look something like this:

[Image: Topical coverage and entity calculation for a set of pages from the Johnnie Walker website]

Top 10 N-Grams Site-Wide

Similarly, we now return the most frequent n-grams site-wide.

The only difference from the code above is that we count each word’s frequency across the whole site rather than per URL. This is handled by fullcounts, built from fulllist, instead of counts.

print("------ AGGREGATE COUNT -------")


fullcounts = Counter(fulllist).most_common(10)

all_term_data = []
for key, value in fullcounts:
    labels = gkbAPI(key)
    term_data = {
        'Topic': key,
        'Density': value,
        'Entity': ', '.join(labels)
    }
    all_term_data.append(term_data)
df = pd.DataFrame(all_term_data)
selection = ['Topic','Density','Entity']
df = df[selection]
df.head(20).style.set_table_styles(
[{'selector': 'th',
  'props': [('background', '#7CAE00'), 
            ('color', 'white'),
            ('font-family', 'verdana')]},
 
 {'selector': 'td',
  'props': [('font-family', 'verdana')]},

 {'selector': 'tr:nth-of-type(odd)',
  'props': [('background', '#DCDCDC')]}, 
 
 {'selector': 'tr:nth-of-type(even)',
  'props': [('background', 'white')]},
 
]
).hide_index()  # note: hide_index() is deprecated in newer pandas; use .hide(axis="index") instead

df.to_csv(r'PATH\topical_coverage.csv', index = False, header=True)  # consider a distinct filename here (e.g. a "_sitewide" suffix) so the per-URL export is not overwritten

Conclusion

This framework gives you a quick, automated way to check whether the most frequent n-grams on your key landing pages, and across the site as a whole, line up with the entities your website should be known for. If they don’t, you know where your copy needs rework.

Further Readings

This post was inspired by the comprehensive work on entity calculation by JC Chouinard in his post Keyword Density and Entity Calculator.

Check out his post if you need further references to back up this framework.
