Google has recently been training its search ranking algorithms with machine learning models designed to improve the clustering of search results and better understand natural, human-like search queries.
As a result, content strategists are changing their approach to copywriting, viewing the World Wide Web as a collection of “Things” rather than “Strings”.
The search engine itself is not dying, it’s the way people search and browse the Internet that has changed over the years. A few personal considerations about this enchanting dive into the state of Google Search from @cwarzel. Here’s a thread 🧵 #seo https://t.co/0zUY9VTKvB — Simone De Palma 🦊 (@SimoneDePalma2) June 25, 2022
Why is that?
Because this shift is the gateway to the realm of Entity Research.
In this post, I am going to show you how to kick-start Entity research from scratch with NLP. The Python framework that follows tokenizes excerpts of text to extract first-hand entities.
Requirements & Assumptions
For the purpose of this project, we are going to abide by a few preliminary requirements.
What is Tokenization?
Tokenization is a data science technique that breaks the text of a sentence into a list of distinct words or values, called tokens. It’s the entry gate to processing text data with Natural Language Processing (NLP).
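To make the idea concrete, here is a minimal sketch of what tokenization produces, using plain Python (a naive alphabetic-only split, not the NLTK tokenizer used later in this post):

```python
import re

def naive_tokenize(text):
    # Keep only runs of alphabetic characters, mimicking the
    # alphabetic filter we apply later to the NLTK tokens.
    return re.findall(r"[A-Za-z]+", text)

print(naive_tokenize("Google sees the Web as Things, not Strings."))
# → ['Google', 'sees', 'the', 'Web', 'as', 'Things', 'not', 'Strings']
```

The punctuation and the sentence structure disappear; what remains is a list of individual word tokens ready for further processing.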
In fact, before diving into more granular NLP machine learning techniques, you need to fragment your data. Many programming languages can handle this task, but Python is undoubtedly the easiest to work with.
Here, we are going to leverage Python’s Natural Language Toolkit (NLTK) to take text data from a Pandas data frame and return a tokenized list of words.
How To Kick Off Entity Research in NLP
First things first, we need to import Pandas to load and manipulate our data, and the Natural Language Toolkit (NLTK) to perform the tokenization.
import pandas as pd
import nltk
Import the Data
We use Pandas to import the data extracted from Screaming Frog or another crawler of your choice. For demonstration purposes, I am importing a dataset of titles and descriptions from SEO Depths.
However, I do suggest importing the copy from your own landing pages or product pages. This is an excellent way to evaluate whether your copy contains potential entities.
df = pd.read_excel('/content/interni_html.xlsx')
df.head()
Concatenate the text into a single column
When performing NLP tasks, you want to audit all of the available text at once rather than narrowing the analysis down to individual columns. We are going to merge the text from the two columns using concatenation via the + operator, with a space (' ') as separator.
df['text'] = df['title'] + ' ' + df['description']
Remove NaN values and cast to string
Now we need to make sure that no NaN values interfere with our new column, and that we are dealing with string values only.
df.dropna(subset=['text'], inplace=True)
df['text'] = df['text'].astype(str)
df.head()
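To see why this step matters, consider a toy data frame (with made-up titles) where one description is missing: concatenating a string with a missing value yields NaN, which the dropna step then removes.

```python
import pandas as pd

# Hypothetical mini-dataset: the second row has no description.
df = pd.DataFrame({
    'title': ['Entity SEO Guide', 'Tokenization 101'],
    'description': ['How entities shape rankings.', None],
})

# Title + missing description propagates NaN into the combined column.
df['text'] = df['title'] + ' ' + df['description']
print(df['text'].isna().tolist())  # → [False, True]

# Drop the incomplete rows, then force string dtype.
df.dropna(subset=['text'], inplace=True)
df['text'] = df['text'].astype(str)
print(len(df))  # → 1
```

Without this cleanup, the NaN rows would later reach the tokenizer as non-string values and raise errors.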
Create a tokenizer using NLTK
Next, we use NLTK to create a tokenizer function. The command below launches the NLTK downloader and installs the punkt data, a pre-trained model responsible for splitting a sentence into individual word tokens.

nltk.download('punkt')
In order to make it easier to apply tokenization to our new Pandas data frame column, we need to cook up a few further lines of code.
def tokenize(column):
    """Tokenizes a Pandas dataframe column and returns a list of tokens.

    Args:
        column: Pandas dataframe column (i.e. df['text']).

    Returns:
        tokens (list): Tokenized list, i.e. [Donald, Trump, tweets]
    """
    tokens = nltk.word_tokenize(column)
    return [w for w in tokens if w.isalpha()]
Tokenize your text data using NLTK
Finally, we apply the function to the Pandas column. A lambda function lets us pass each value of the text column to NLTK for tokenization and return a new Pandas column containing the resulting lists of tokens.
df['tokenized'] = df.apply(lambda x: tokenize(x['text']), axis=1)
df[['tokenized']].head()
Here’s what you’ll get
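If you want to try the same apply-with-lambda pattern without downloading the NLTK data, here is a self-contained sketch that swaps in a simplified stand-in tokenizer (a plain whitespace split with the same alphabetic filter):

```python
import pandas as pd

# Simplified stand-in for the NLTK-based tokenize() above, so this
# sketch runs without the punkt data.
def tokenize(column):
    return [w for w in column.split() if w.isalpha()]

# Hypothetical text column standing in for the crawled titles.
df = pd.DataFrame({'text': ['Entities beat strings', 'SEO in 2022']})

# Same apply-with-lambda pattern as in the post.
df['tokenized'] = df.apply(lambda x: tokenize(x['text']), axis=1)
print(df['tokenized'].tolist())
# → [['Entities', 'beat', 'strings'], ['SEO', 'in']]
```

Note how the non-alphabetic token '2022' is filtered out, just as `w.isalpha()` does in the NLTK version.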
As usual, Python provides fantastic support when the SEO workload piles up.
Despite this being only the very first step of Entity research, you are now in a position to investigate room for entity optimization and better target your landing pages or product pages.
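A simple way to start that investigation is to count how often each token appears across your pages: frequently recurring tokens are candidate entities worth a closer look. A sketch with collections.Counter over hypothetical token lists, shaped like the df['tokenized'] column produced above:

```python
from collections import Counter

# Hypothetical tokenized rows, as the pipeline above would produce.
tokenized_rows = [
    ['Entity', 'SEO', 'Guide'],
    ['Entity', 'research', 'Python'],
    ['Entity', 'Python', 'scripts'],
]

# Flatten the lists and tally how often each token appears.
counts = Counter(token for row in tokenized_rows for token in row)
print(counts.most_common(2))
# → [('Entity', 3), ('Python', 2)]
```

Tokens that dominate the tally across many pages are the ones to validate against a knowledge base in the next stage of Entity research.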
This post was inspired by a couple of existing resources, whose authors I credit here:
- Getting started with Google NLP API using Python by Greg Bernhardt