How to Benchmark Entity Opportunities with Google NLP

Reading time: 9 Minutes

šŸ„‡ HONORABLE DISCLAIMER šŸ„‡

This post was inspired by the in-depth work on entities conducted by Greg Bernhardt. For further reference, I recommend checking out the original project


One of the great benefits of using state-of-the-art NLP models is that they let you compare the entities detected across pages as plain natural language.

I have already covered NLP applications for SEO in another post, which is why I’m not expanding too much on that here. However, I will never grow tired of emphasizing the importance of the shift from keyword research to entity research.

This is Google’s daily bread, nothing less than the kernel of a search engine that has grown progressively reliant on NLP machine learning models for years now.

As a result, the SEO and Data Science industries are running in tandem, as we observe daily releases of in-depth data science projects, such as topic modelling analyses, aimed at predicting how Google’s algorithms will likely evaluate the text-based content on a page.

If we limit ourselves to the NLP radar, Data Science can give birth to a plethora of projects suiting the needs of SEO, such as entity research, sentiment analysis, and a ton of minor daily tasks.

Not to mention that you can expand your analysis and start running competitor research based entirely on entities.

This could turn out beneficial for those devising content strategies and aiming to collect low-hanging fruits straight from the search engine’s underlying mechanisms.

Here’s some good news for you! There is a method to benchmark entities using NLP so you can inform your decision-making toward outranking the competition.

And in this post, I’m going to show you how.

Learning Objectives Summary
1ļøāƒ£ Compare entities and their salience between two web pages
2ļøāƒ£ Display missing entities between two pages

Settings

To kick off the model, you will need to make sure that you are equipped with just a few tools and features.

Don’t panic; I know it might sound alarming, but for the most part it’s going to be a piece of cake. For everything else, just make sure you carefully follow the steps in this guide.

I strongly advise running the model on Google Colab, as it requires less coding experience and lets you flex the data intuitively. Plus, using Google Colab means your code runs in the cloud, which, in turn, helps prevent overwhelming your computer’s RAM.

āš ļøWARNINGāš ļø

Make sure you upload the NLP API key file to Colab every time you run the script, as uploaded files don’t persist between sessions

How to Get the NLP API Key

To obtain an NLP API key, you need an active Google Cloud account. The Google Cloud console is the web UI used to configure and manage Google Cloud services, including the wide span of Google APIs.

Let me illustrate the process step by step.

Create a Project

To use services provided by Google Cloud, you must create a project.

In the Google Cloud console, under “IAM & Admin”, choose the “Manage Resources” page and create a Google Cloud project. You can name it whatever you want; there are no limitations in place. In most cases, when you create a project for the first time, you’ll configure billing as well.

Does it mean that you have to pay to use the NLP API?

No, you just need to set up your billing account with your credit card credentials. Unless you actively upgrade to a paid account, nothing is going to happen to your finances during the free trial.

Enable the API

You must enable the Cloud Natural Language API for your project.

Once you’ve created a project, head over to “APIs & Services”, open the API Library, search for “Cloud Natural Language API” and enable it.

Create a service account and download the private key file

Finally, you need to create a service account:

  1. In the Google Cloud console, select your project and open the Service Accounts page under IAM & Admin.
  2. Click Create service account.
  3. In the Service account name field, enter a name.
  4. Click Create and continue.
  5. Click Done to finish creating the service account.

Now you can properly create a service account key:

  1. In the Google Cloud console, click the email address for the service account that you created.
  2. Click Keys.
  3. Click Add key, and then click Create new key.
  4. Select JSON as the key type and click Create. A JSON key file is downloaded to your computer.
  5. Enjoy the NLP API
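
If you prefer the command line, the same setup can be sketched with the gcloud CLI straight from a Colab cell, after authenticating with gcloud auth login. Note that my-project and nlp-benchmark below are placeholder names, so swap in your own:

!gcloud services enable language.googleapis.com --project=my-project
!gcloud iam service-accounts create nlp-benchmark --project=my-project
!gcloud iam service-accounts keys create nlp-key.json --iam-account=nlp-benchmark@my-project.iam.gserviceaccount.com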

Install Missing Packages

Now that you’ve successfully made it to the next stage of this tutorial, you can start setting up the machine learning environment.

To do so, make sure to install fake_useragent, the only truly external dependency for this project; the other two lines below simply pin library versions that match the code in this guide.

!pip install fake_useragent

!pip install pandas==1.1.2

!pip install "google-cloud-language<2"
  • fake_useragent: this library will help us generate a user agent when making a request
  • pandas==1.1.2: this pins the pandas version the project was built against (not the newest one)
  • google-cloud-language<2: the function below relies on the v1-style client and its enums module, which were removed in version 2.0 of the library, so pinning an earlier release should keep the imports working

šŸ’”BONUSšŸ’”

Don’t forget to prepend an exclamation mark to pip install commands when running them in a Colab cell

Import Libraries

Next, you’re going to need to import a number of libraries that will be actively used throughout the tutorial.

import os

from google.cloud import language_v1
from google.cloud.language_v1 import enums

from fake_useragent import UserAgent
import requests
import pandas as pd
  • os: a library that enables you to set environment variables, which in our case will point to the NLP API key
  • google.cloud: the official Google Python library that enables you to connect with their APIs
  • fake_useragent: a library that generates a browser-like user agent for our requests
  • requests: a library that enables you to fetch pages from the Web
  • pandas: a library that enables you to build and manipulate data frames

Call up the NLP API Key

You should have already downloaded your NLP API key and have it stored on your device in JSON format. Hence, you can now use os to point the environment at it.

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "COPY_NLP_KEY_PATH"
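
If you’re working on Colab, here’s a minimal sketch for getting the key file into the session each run; the upload lands in the working directory, so the path above can then simply be the uploaded filename:

from google.colab import files

# Opens a file picker; choose the JSON key you downloaded earlier
files.upload()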

Build the NLP Function

Since we are about to benchmark entities from two pages, we can wrap the logic in a function. This helps reduce redundant code.

This function, named processhtml() and shown in the code below, will:

  1. Create a new user agent for the request header
  2. Make the request to the web page and store the HTML content
  3. Initialize the Google NLP client
  4. Communicate to Google that you are sending HTML, rather than plain text
  5. Send the request to Google NLP
  6. Store the API response
  7. Convert the response into a Python dictionary of entities and their salience scores (adjust the rounding as needed)
  8. Convert the keys to lower case (for comparing)
  9. Return the new dictionary to the main script
def processhtml(url):

    # 1-2. Build a request header with a fresh user agent and fetch the page HTML
    ua = UserAgent()
    headers = {'User-Agent': ua.chrome}
    res = requests.get(url, headers=headers)
    html_page = res.text

    url_dict = {}

    # 3. Initialize the Google NLP client
    client = language_v1.LanguageServiceClient()

    # 4. Tell Google we are sending HTML, rather than plain text
    type_ = enums.Document.Type.HTML
    document = {"content": html_page, "type": type_, "language": "en"}

    encoding_type = enums.EncodingType.UTF8

    # 5-6. Send the request and store the response
    response = client.analyze_entities(document, encoding_type=encoding_type)

    # 7. Map each entity to its (rounded) salience score
    for entity in response.entities:
        url_dict[entity.name] = round(entity.salience, 4)

    # 8. Lower-case the keys so the two pages compare cleanly
    url_dict = {k.lower(): v for k, v in url_dict.items()}

    # 9. Return the dictionary to the main script
    return url_dict
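
Before comparing two pages, it can help to sanity-check the function on a single URL. A quick sketch, using one of the example pages from the next step:

entities = processhtml("https://www.greetingscards.co.uk/cards/birthday/")

# Peek at the five most salient entities detected on the page
print(sorted(entities.items(), key=lambda kv: kv[1], reverse=True)[:5])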

Process NLP Data and Calculate Salience Difference

Now that we have our function we can set the variables storing the web page URLs we want to compare and then send them to the function we have just created.

url1 = "https://www.greetingscards.co.uk/cards/birthday/"
competitor_url = "https://www.cardfactory.co.uk/cards/celebration-cards/birthday-cards/"

url1_dict = processhtml(url1)
competitor_url_dict = processhtml(competitor_url)

Benchmark Entities against your Competitor

So, we loop over the entities detected on both pages and, whenever the competitor’s salience score is higher, record the difference.

Next, we’ll wrap the findings in a Pandas data frame and sort the values by difference for an improved data visualization that returns the top 10 entities.

rows = []

# Only look at entities detected on both pages
for key in set(url1_dict) & set(competitor_url_dict):
    url1_score = url1_dict[key]
    competitor_score = competitor_url_dict[key]

    # Record the gap only when the competitor's salience is higher
    if competitor_score > url1_score:
        diff = round(competitor_score - url1_score, 3)
    else:
        diff = 0.0

    rows.append({'Entity': key, 'URL': url1_score,
                 'Competitor URL': competitor_score, 'Difference': diff})

df = pd.DataFrame(rows, columns=['Entity', 'URL', 'Competitor URL', 'Difference'])
df = df.sort_values(by='Difference', ascending=False)
df.head(10)

Here’s what you receive

The URL column reflects the page we used as a benchmark against the Competitor URL, and holds the salience score identified for each entity on that page.

If your competitor’s salience score for a given entity is greater than yours, record the difference.

ā— “Salience score” is a metric of calculated importance in relation to the rest of the text.

Age cards and home cards are entities found on both our page and the competitor’s, but their semantic weight appears to carry a slightly greater impact on the competitor’s page.

These are entities you may want to consider investigating in order to find ways to better communicate them on your page.

You can also plot the results leveraging Pandas’ built-in plotting (DataFrame.plot.bar). This will help you sift through the results. The values below are hardcoded from the earlier output for illustration.

url = [0.002, 0.002, 0.0019, 0.002, 0.0012, 0.0012, 0.0004, 0.0003, 0.0004, 0.0004]
competitor_url = [0.0113, 0.0113, 0.0114, 0.0113, 0.0029, 0.0022, 0.0008, 0.0006, 0.0008, 0.0008]
index = ['baby cards', 'home cards', 'age cards', 'job cards', 'christmas cards','you cards', 'mothers day','thinking of you cards', 'valentines day','fathers day']
df = pd.DataFrame({'url': url,
                   'competitor url': competitor_url}, index=index)
ax = df.plot.bar(rot=35)

And here’s the output

Data Visualization from the Entity Benchmarking process in Pandas
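
Rather than hardcoding the values, you could also plot straight from the benchmark data frame built earlier. A sketch, assuming df still holds the top entities with numeric URL and Competitor URL columns:

top = df.head(10).set_index('Entity')

# Bar chart comparing both pages' salience scores per entity
ax = top[['URL', 'Competitor URL']].plot.bar(rot=35)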

Find Entity Opportunities from Outranking pages

To narrow down the analysis, you could take a quick peek at the entities your competitor targets that you’re missing out on.

What you’re going to do is essentially compute a set difference between competitor_url’s entities and url1’s, then look up the salience score of each missing entity.

Finally, you’ll use Pandas to wrap up the findings in a new data frame which will display the top 25 entities by salience on the competitor page.

# Entities detected on the competitor's page but not on ours
diff_lists = set(competitor_url_dict) - set(url1_dict)

# Look up the salience score of each missing entity
final_diff = {key: competitor_url_dict[key] for key in diff_lists}

df = pd.DataFrame(final_diff.items(), columns=['Entity', 'Score'])
df = df.sort_values(by='Score', ascending=False)
df.head(25)

This is what you might get

This is useful as it returns a list of entities sorted by prominence that are used by your competitor to outrank your page.
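
If you want to hand the findings over to a content team, you can persist any of these data frames with Pandas’ built-in CSV export. A minimal sketch, with an arbitrary filename:

df.to_csv('missing_entities.csv', index=False)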

āš ļøWARNINGāš ļø

Please, bear in mind that the above listed entities were generated from the previous comparison and do NOT appear on your page, at least as far as the API could detect

Conclusion

It’s impressive what machine learning models can provide to your daily SEO efforts.

With this framework, you can now save tons of the time you used to spend on content research.

But what makes this framework super cool is that you can flex the code to devise some new outstanding NLP projects.
