For years, SEOs have been proclaiming the death of keyword density, blaming it on those who still dared to believe the myth.
Yet few SEOs realise that keyword density in copy is still a thing, especially when it comes to gauging the semantic distance between the topics on a site and their keyword ranking potential. What if your site surfaced a bunch of entities resonating with everything but the real gist of what should be your site’s USP? What if your copywriters were targeting the wrong keywords when writing your landing pages?
In this post, I will take you through a Python script designed to automate a topical coverage check on a handful of pages, so you can make sure they resonate with the website’s value proposition.
If your most frequently used keywords don’t align with the site’s core identity, the overall E-A-T signal is likely to suffer.
Requirements & Assumptions
Before kicking off, you should know that the following framework returns the 10 most frequently used n-grams. This gives you an overview of your web pages’ topical coverage so that you can quickly determine whether they reflect the website’s identity or drift away from it.
Here is what you need to run the task.
- Run the script on Google Colab
- Get a Google Knowledge API
- Either a list of URLs or a CSV file with high-traffic landing pages that you can retrieve from the Performance tab in Google Search Console
- If submitting a CSV file, make sure the column containing your URLs is named “Address“, as shown in the sketch after this list
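For reference, here is a minimal sketch of the expected CSV shape, with made-up URLs; only the “Address“ column name matters:
import pandas as pd

# Hypothetical example: the CSV just needs an "Address" column holding your URLs
example = pd.DataFrame({
    'Address': [
        'https://www.example.com/',
        'https://www.example.com/landing-page-1'
    ]
})
example.to_csv('landing_pages.csv', index=False)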
⚠️ Please note that the script is only meant to check whether the pages reflect their website’s identity, not to generate entities.
For that purpose, you may want to dive into the process of generating Entities and Sentiment in Python for a website’s landing page.
Install and Import the Packages
First things first, we are going to install a few libraries.
!pip install fake_useragent
!pip install bs4
fake_useragent will generate a fake user agent for each web page request. We rely on a fake user agent because the script is meant for a staging environment and personal use only.
bs4 provides the BeautifulSoup library, which we will use to parse the HTML code from the scraped URLs.
Next, it’s time to import a few additional libraries. Besides Pandas and Numpy, which we use to shape the final data output, we need:
- time, which delays the requests to prevent bottlenecks with the server
- Counter, from the collections module, which counts the occurrences of each word
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
import time
import io
import json
from fake_useragent import UserAgent
from google.colab import files
import numpy as np
Upload the URLs
As mentioned, you can submit either a short list of URLs or a full CSV file with the top-performing landing pages of your target website.
⭐️ Upload a CSV File
If you choose to upload a file, you only need to execute the following script:
crawldf = pd.read_csv('COPY_PATH')
addresses = crawldf['Address'].tolist()
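Since the script runs on Google Colab, the file first needs to land in the Colab runtime. Here is a minimal sketch using the files module imported earlier; 'landing_pages.csv' is just a hypothetical file name:
uploaded = files.upload()  # opens the Colab file picker and returns a dict of name -> bytes

# Read the uploaded file into a data frame; replace the name with the one you uploaded
crawldf = pd.read_csv(io.BytesIO(uploaded['landing_pages.csv']))
addresses = crawldf['Address'].tolist()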
⭐️ Upload a List of URLs
Alternatively, you only need to run the following:
addresses = ['URL 1', 'URL 2', 'URL 3']
Make sure to run only one of the two options above to avoid confusing the crawler.
Set up the HTTP Request User Agent
Let’s jump into the more technical part. We need to set up our fake user agent to reduce the chance that the target website’s server blocks the web page scraping that follows.
ua = UserAgent()
headers = {
'User-Agent': ua.chrome
}
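If you want to double-check which user agent string was picked, an optional one-liner does the trick:
# Optional sanity check: print the randomly generated Chrome user agent
print(headers['User-Agent'])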
Knowledge API Key
To complete the technical set-up of the environment, we need to call the Google Knowledge Graph Search API. You just need to copy and paste your API key into the section of the script shown below.
def gkbAPI(keyword):
    url = "https://kgsearch.googleapis.com/v1/entities:search?query="+keyword+"&key=YOUR_API_KEY&limit=1&indent=True"
    payload = {}
    headers = {}
    response = requests.request("GET", url, headers=headers, data=payload)  # make the API call and store the response
    data = json.loads(response.text)
    try:
        getlabel = data['itemListElement'][0]['result']['@type']
    except (KeyError, IndexError):
        getlabel = ["none"]
    return getlabel
In the url variable, make sure to replace the YOUR_API_KEY placeholder in the key parameter with your own API key.
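Once the key is in place, a quick sanity check helps confirm the call works before looping over dozens of n-grams. The keyword below is just an arbitrary example:
# Hypothetical test keyword; expect a list of schema.org types, or ["none"] if nothing matches
print(gkbAPI("whisky"))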
Scraping and Parsing the Web Pages
Next, we start scraping the selected web pages. First, we create an empty list variable that we’ll use to store the site-wide data, and then we run a for loop over the URLs in the addresses list.
fulllist = []
for row in addresses:
    time.sleep(1)
    url = row
    print(url)
    res = requests.get(url, headers=headers)
    html_page = res.content
⚠️ If the code runs into an error at this point, make sure you have actually submitted your web pages, either via a CSV file or a list of URLs.
Once we fetch the URL contents, we can load them into a bs4 object that we’re going to name soup.
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True) #scrape the text within the HTML from the above URLs
The find_all() function, with the text=True parameter, will extract only the text sitting between HTML tags.
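To see why the blacklist introduced in the next section matters, here is a tiny, self-contained illustration (not part of the main script) of what find_all(text=True) actually returns:
from bs4 import BeautifulSoup

# find_all(text=True) returns every text node, including the ones sitting
# inside <script> or <style> tags, which is why we later check t.parent.name
sample_html = "<html><head><script>var x = 1;</script></head><body><p>Hello world</p></body></html>"
sample_soup = BeautifulSoup(sample_html, 'html.parser')
for node in sample_soup.find_all(text=True):
    print(node.parent.name, "->", repr(node))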
Data Cleaning
Let’s whizz through a bit of data cleaning of the strings we have just fetched.
First, we remove stopwords, such as pronouns and articles we don’t need to analyse. The list can easily grow endless, so feel free to tweak it at your own convenience.
stopwords = ['get','ourselves', 'hers','us','there','you','for','that','as','between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than','its','(en)']
Next, we can filter out non-relevant HTML tags, whose text would do nothing but pollute the n-gram counts and the entity lookups that follow.
output = ''
blacklist = [
'[document]',
'noscript',
'header',
'html',
'meta',
'head',
'input',
'script',
'style',
]
Finally, we can filter out special characters which would likely obstruct the correct n-gram generation later on.
ban_chars = ['|','/','&','()']
Once our data is dusted off, we can build one giant string from the page text and then split it into a list of words.
for t in text:
    if t.parent.name not in blacklist:
        output += t.replace("\n","").replace("\t","")
output = output.split(" ")
Finally, we apply the filters we established earlier during data cleaning.
output = [x for x in output if not x=='' and not x[0] =='#' and x not in ban_chars]
output = [x.lower() for x in output]
output = [word for word in output if word not in stopwords]
fulllist += output
Fetch the Top 10 N-Grams Count
Do you recall the Counter class we imported from collections at the beginning? Here is where it comes into play. We feed all the cleaned-up words into Counter() and ask it for the 10 most common ones.
As mentioned, the reason behind this number is to let you get a quick grasp of how well the web pages recall their domain identity, while keeping the process as manageable as possible.
counts = Counter(output).most_common(10)
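If you have never used Counter before, here is a minimal, stand-alone illustration with made-up words:
from collections import Counter

# Hypothetical word list, just to show what most_common() returns
sample_words = ['whisky', 'blend', 'whisky', 'cask', 'whisky', 'blend']
print(Counter(sample_words).most_common(2))  # [('whisky', 3), ('blend', 2)]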
Top 10 N-Grams for a Single URL
The last step of the process is to return the most frequent n-grams, first for a single URL and ultimately site-wide.
Once we loop through the n-gram counts, we set up our Pandas data frame with a few styling touches for the sake of a clear-cut data representation.
Lastly, you can also download the data frame as a CSV file.
all_term_data = []
for key, value in counts:
    labels = gkbAPI(key)
    term_data = {
        'Topic': key,
        'Density': value,
        'Entity': ', '.join(labels)
    }
    all_term_data.append(term_data)
df = pd.DataFrame(all_term_data)
selection = ['Topic','Density','Entity']
df = df[selection]
df.head(20).style.set_table_styles(
    [{'selector': 'th',
      'props': [('background', '#7CAE00'),
                ('color', 'white'),
                ('font-family', 'verdana')]},
     {'selector': 'td',
      'props': [('font-family', 'verdana')]},
     {'selector': 'tr:nth-of-type(odd)',
      'props': [('background', '#DCDCDC')]},
     {'selector': 'tr:nth-of-type(even)',
      'props': [('background', 'white')]},
    ]
).hide_index()
df.to_csv(r'PATH\topical_coverage.csv', index = False, header=True)
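If you are working on Colab and want the CSV on your local machine, you can download it with the files module imported at the start. This assumes you pointed the export above at the Colab working directory under the same file name:
# Download the exported CSV from the Colab runtime to your local machine
files.download('topical_coverage.csv')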
The output you receive will look similar to this.

Top 10 N-Grams Site-Wide
Similarly, we are now going to return the most frequent n-grams site-wide.
The only difference with respect to the lines of code above is that we count the occurrences of each word across the whole set of URLs, not just on a single page. This is handled by fullcounts, built from fulllist, instead of just counts.
print("------ AGGREGATE COUNT -------")
fullcounts = Counter(fulllist).most_common(10)
all_term_data = []
for key, value in fullcounts:
    labels = gkbAPI(key)
    term_data = {
        'Topic': key,
        'Density': value,
        'Entity': ', '.join(labels)
    }
    all_term_data.append(term_data)
df = pd.DataFrame(all_term_data)
selection = ['Topic','Density','Entity']
df = df[selection]
df.head(20).style.set_table_styles(
    [{'selector': 'th',
      'props': [('background', '#7CAE00'),
                ('color', 'white'),
                ('font-family', 'verdana')]},
     {'selector': 'td',
      'props': [('font-family', 'verdana')]},
     {'selector': 'tr:nth-of-type(odd)',
      'props': [('background', '#DCDCDC')]},
     {'selector': 'tr:nth-of-type(even)',
      'props': [('background', 'white')]},
    ]
).hide_index()
df.to_csv(r'PATH\topical_coverage.csv', index = False, header=True)
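Note that this cell writes to the same file name as the single-URL export, so the second run will overwrite the first. If you want to keep both, a different file name for the site-wide table works just as well; the name below is only a suggestion:
# Hypothetical alternative file name to avoid overwriting the per-URL export
df.to_csv('topical_coverage_sitewide.csv', index=False, header=True)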
Conclusion
As you may notice, the model deals with a large proportion of outliers. Despite the limited accuracy, the output provides a sneak peek at the topical coverage stemming from a handful of content landing pages on the Johnnie Walker website. Moreover, the table delivers an approximate mapping of the entities descending from each n-gram.
Nevertheless, this Python script provides a ready-made method to make the most of keyword density and anchor your SEO research in the semantic search realm.
Further Readings
This post was inspired by the comprehensive work on entity calculation provided by JC Chouinard in his post Keyword Density and Entity Calculator.
Please do check his post in case you need to back up this framework with further references.