Using Keyword Density for Topical Coverage and Entity Discovery


For years, SEOs have been proclaiming the death of keyword density, pinning the blame on those who still dared to believe in the myth.

Yet few SEOs realise that keyword density in copy is still a thing, especially when it comes to gauging the semantic distance between the topics on a site and evaluating keyword ranking potential. What if your site surfaced a bunch of entities resonating with everything but the real gist of what should be your site’s USP? What if your copywriters were targeting the wrong keywords when writing your landing pages?

In this post, I will take you through a Python script designed to automate topical coverage analysis on a handful of pages and make sure they resonate with the website’s value proposition.

If your most-used keywords don’t align with the site’s core identity, your overall E-A-T signals will likely take a hit.

Requirements & Assumptions


Before kicking off, you should know that the following framework is designed to return the 10 most frequently used n-grams per page. This gives you an overview of each web page’s topical coverage so that you can quickly determine whether the pages reflect the website’s identity or go astray.

There are two main requirements to get started: a Python environment (this walkthrough assumes Google Colab) and a Google Knowledge Graph Search API key.

⚠️ Please note that the script is only meant to check whether the pages reflect the identity of their website, not to generate entities.

Install and Import the Packages

First things first, we are going to install a few libraries.

!pip install fake_useragent
!pip install bs4

import requests                       # fetch the web pages
from bs4 import BeautifulSoup         # parse the HTML
from collections import Counter       # count word frequencies
import pandas as pd                   # build the output data frames
import time                           # throttle the requests
import io
import json                           # parse the Knowledge Graph API response
from fake_useragent import UserAgent  # rotate a realistic user agent
from google.colab import files        # upload/download files in Colab
import numpy as np

Upload the URLs

You can submit either a restricted list of URLs or an entire CSV file containing the top-performing landing pages of your target website.

⭐️ Upload a CSV File

If you choose to upload a file, you only need to execute the following script:

crawldf = pd.read_csv('COPY_PATH')        # read the CSV export; replace 'COPY_PATH' with your file path
addresses = crawldf['Address'].tolist()   # list of URLs to crawl
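
If you are working in Colab and the crawl export is not on the runtime yet, you can upload it interactively with the files module imported earlier and then run the cell above (the upload dialog is a Colab-only feature):

# Colab only: open a file picker and upload the crawl export to the runtime
uploaded = files.upload()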

⭐️ Upload a List of URLs

Alternatively, if you would rather analyse just a handful of URLs, you only need to run the following:

addresses = ['URL 1', 'URL 2', 'URL 3']

Make sure to run only one of the two options above to avoid confusing the crawler.
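
If you would rather catch a misconfiguration before the crawl starts than halfway through it, an optional guard like the one below (not part of the original script) confirms that the list was actually populated:

# Optional guard: stop early if `addresses` was never populated by either option above
assert isinstance(addresses, list) and len(addresses) > 0, "Populate addresses via the CSV upload or the URL list first"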

Set up the HTTP Request User Agent

Let’s jump to the more technical part. We need to set up a fake user agent to reduce the chance that the target server blocks the web page scraping that follows.

ua = UserAgent()
 
headers = {
    'User-Agent': ua.chrome
}

Knowledge Graph API Key

The helper function below queries the Google Knowledge Graph Search API for a keyword and returns the entity type(s) associated with it, falling back to ["none"] when nothing is found.

def gkbAPI(keyword):
    # query the Knowledge Graph Search API for the keyword
    url = "https://kgsearch.googleapis.com/v1/entities:search?query="+keyword+"&key=YOUR_API_KEY&limit=1&indent=True"

    payload = {}
    headers = {}

    response = requests.request("GET", url, headers=headers, data=payload)  # make the call and store the response

    data = json.loads(response.text)

    try:
        getlabel = data['itemListElement'][0]['result']['@type']
    except (KeyError, IndexError):
        # no entity found for this keyword
        getlabel = ["none"]
    return getlabel

In the url variable, make sure to replace YOUR_API_KEY with your own Knowledge Graph API key.
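
As a quick sanity check, you can call the helper on a single term. The keyword below is just a hypothetical example; the types returned depend on your API key and on what the Knowledge Graph currently holds:

# Hypothetical example: print the Knowledge Graph types for one term
# (falls back to ["none"] if the lookup returns nothing)
print(gkbAPI("whisky"))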

Scraping and Parsing the Web Pages

Next, we start scraping the selected web pages. First, we create an empty list that we’ll use to store the site-wide data, and then we run a for loop over the URLs in the addresses list.

fulllist = []
 
for row in addresses:
    time.sleep(1)
    url = row
    print(url)
 
    res = requests.get(url,headers=headers)
    html_page = res.content

⚠️ If the code errors out at this point, make sure you have submitted your web pages via either a CSV file or a list of URLs.

Once we fetch the URL contents, we can load them into a bs4 object that we’re going to name soup:

soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True) #scrape the text within the HTML from the above URLs

The find_all() function with the text=True parameter extracts only the text nodes between HTML tags.
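
As a minimal illustration, here is what text=True returns on a throwaway HTML snippet (not part of the crawl). Note that the contents of the script tag come back too, which is exactly why we filter by parent tag in the data cleaning step below:

# Throwaway example: text=True returns every text node, including script contents
demo = BeautifulSoup("<p>Hello <b>world</b></p><script>x=1;</script>", "html.parser")
print([str(t) for t in demo.find_all(text=True)])
# ['Hello ', 'world', 'x=1;']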

Data Cleaning

Let’s whizz through a bit of data cleaning on the strings we have just fetched.

First, we remove stopwords, such as pronouns and articles we don’t need to count. The list could easily grow endless, so feel free to tweak it to your needs.

stopwords = ['get','ourselves', 'hers','us','there','you','for','that','as','between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than','its','(en)']

Next, we can filter out irrelevant HTML tags whose text would only muddy the n-gram counts and, in turn, the entity lookups.

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head', 
    'input',
    'script',
    'style',
 
]

Finally, we can filter out special characters that would otherwise get in the way of generating clean n-grams later on:

ban_chars = ['|','/','&','()']

With the data dusted off, we can concatenate the page text into one large string and then split it into a list of words:

for t in text:
    if t.parent.name not in blacklist:
        output += t.replace("\n","").replace("\t","")
output = output.split(" ")

Finally, we apply the filters established above for the data cleaning:

output = [x for x in output if not x=='' and not x[0] =='#' and x not in ban_chars] 
output = [x.lower() for x in output]
output = [word for word in output if word not in stopwords]
 
fulllist += output

Fetch the Top 10 N-Grams Count

Do you recall the Counter class we imported from collections at the beginning?

Here is where Counter comes into play. We pass it the cleaned list of words and ask for the 10 most frequent terms with most_common(10).

As mentioned, the reason behind this number is to give you a quick grasp of how closely the web pages recall their domain’s identity, while keeping the process as manageable as possible.

counts = Counter(output).most_common(10)
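
Since the loop body has been shown in separate chunks, here is a condensed sketch of how the fetching, cleaning, and counting steps fit together inside the for loop over addresses. It reuses the headers, blacklist, ban_chars, and stopwords variables defined earlier; resetting output for every URL is an assumption I am adding so per-page counts do not bleed into one another:

fulllist = []

for url in addresses:
    time.sleep(1)                               # small pause between requests
    print(url)

    # fetch and parse the page
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.content, 'html.parser')
    text = soup.find_all(text=True)

    # rebuild the visible text, skipping blacklisted tags
    output = ''                                 # reset per URL (assumption, see above)
    for t in text:
        if t.parent.name not in blacklist:
            output += t.replace("\n", "").replace("\t", "")

    # tokenise and clean
    words = output.split(" ")
    words = [x for x in words if x != '' and not x.startswith('#') and x not in ban_chars]
    words = [x.lower() for x in words]
    words = [w for w in words if w not in stopwords]

    fulllist += words                           # site-wide pool used later
    counts = Counter(words).most_common(10)     # top 10 n-grams for this URL
    print(counts)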

Top 10 N-Grams for a Single URL

The last step is to loop over the counts, look up each term with the Knowledge Graph helper, and return the most frequent n-grams along with their entity types. We do this first for a single URL and ultimately site-wide.

After the loop, we set up a pandas DataFrame with a few styling rules for a clear-cut presentation of the data.

Finally, you can also export the DataFrame to a CSV file.

all_term_data = []
for key, value in counts:
    labels = gkbAPI(key)
    term_data = {
        'Topic': key,
        'Density': value,
        'Entity': ', '.join(labels)
    }
    all_term_data.append(term_data)
df = pd.DataFrame(all_term_data)
selection = ['Topic','Density','Entity']
df = df[selection]
df.head(20).style.set_table_styles(
[{'selector': 'th',
  'props': [('background', '#7CAE00'), 
            ('color', 'white'),
            ('font-family', 'verdana')]},
 
 {'selector': 'td',
  'props': [('font-family', 'verdana')]},

 {'selector': 'tr:nth-of-type(odd)',
  'props': [('background', '#DCDCDC')]}, 
 
 {'selector': 'tr:nth-of-type(even)',
  'props': [('background', 'white')]},
 
]
).hide_index()  # note: hide_index() is deprecated in newer pandas; use .hide(axis="index") instead

df.to_csv(r'PATH\topical_coverage.csv', index = False, header=True)  # replace PATH with your output folder
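
If you are running the notebook in Colab, the exported CSV lives on the runtime. To pull it down to your machine you can use the files module imported earlier (the filename below assumes you saved the file in the working directory under the name above):

# Colab only: download the export to your local machine
files.download('topical_coverage.csv')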

The output you receive will look something like this:

[Image: Topical coverage and entity calculation for a set of pages from the Johnnie Walker website]

Top 10 N-Grams Site-Wide

Similarly, we now return the most frequent n-grams site-wide.

The only difference from the code above is that we count each word’s frequency across the whole site rather than per URL. This is handled by fullcounts, built from fulllist, instead of counts.

print("------ AGGREGATE COUNT -------")


fullcounts = Counter(fulllist).most_common(10)

all_term_data = []
for key, value in fullcounts:
    labels = gkbAPI(key)
    term_data = {
        'Topic': key,
        'Density': value,
        'Entity': ', '.join(labels)
    }
    all_term_data.append(term_data)
df = pd.DataFrame(all_term_data)
selection = ['Topic','Density','Entity']
df = df[selection]
df.head(20).style.set_table_styles(
[{'selector': 'th',
  'props': [('background', '#7CAE00'), 
            ('color', 'white'),
            ('font-family', 'verdana')]},
 
 {'selector': 'td',
  'props': [('font-family', 'verdana')]},

 {'selector': 'tr:nth-of-type(odd)',
  'props': [('background', '#DCDCDC')]}, 
 
 {'selector': 'tr:nth-of-type(even)',
  'props': [('background', 'white')]},
 
]
).hide_index()  # note: hide_index() is deprecated in newer pandas; use .hide(axis="index") instead

df.to_csv(r'PATH\topical_coverage.csv', index = False, header=True)  # consider a distinct filename here (e.g. a "_sitewide" suffix) so the per-URL export is not overwritten

Conclusion

This framework gives you a quick, automated way to check whether the most frequent n-grams on your key landing pages, and across the site as a whole, line up with the entities your website should be known for. If they don’t, you know where your copy needs rework.

Further Readings

This post was inspired by the comprehensive work on entity calculation by JC Chouinard in his post Keyword Density and Entity Calculator.

Check out his post if you need further references to back up this framework.
