Recently, Google has been training its search ranking algorithms with machine learning models that aim to home in on search result clustering performance and improve the understanding of human-like search queries.
As a result, content strategists are changing their approach to content copywriting, adopting a view of the World Wide Web as "Things" rather than "Strings".

Although this mantra has been around since RankBrain took over in 2015, recent advancements in the humanization of Google, along with the ongoing evolution of search patterns, have prompted the industry to sharpen their analysis of NLP and NLU in SEO.
A few personal considerations about this enchanting dive into the state of Google Search from @cwarzel. The search engine itself is not dying, it's the way people search and browse the Internet that has changed over the years. Here's a thread 🧵 #seo https://t.co/0zUY9VTKvB
— Simone De Palma (@SimoneDePalma2) June 25, 2022
Why is that?
Because this is the main gate to the realm of Entity Research.
In this post, I am going to show you how to kick-start Entity research from the very basics of NLP. This means that the following Python framework will tokenize excerpts of text to extract first-hand entities.
Requirements & Assumptions
For the purpose of this project, we are going to abide by a few preliminary requirements.
- Run the script either on Google Colab or Jupyter Notebook
- Make sure you run
!pip install nltk
before starting.
- Run a crawl with Screaming Frog and export an internal_html CSV file containing the following columns:
- url
- title
- description
- H1
What is Tokenization?
Tokenization is a data science method that splits the words in a sentence into a comma-separated list of distinct words or values. It's the entry gate to processing text data during Natural Language Processing (NLP).
Before diving into more granular NLP machine learning techniques, you need to fragment your data. There are many programming languages you can turn to for this task, but Python is undoubtedly the easiest one to work with.
In fact, we are going to leverage Python's Natural Language Toolkit (NLTK) to take text data from a Pandas data frame and return a tokenized list of words.
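To give you a feel for what that means in practice, here is a tiny standalone sketch (the sentence is just an invented example; it assumes nltk is installed and the punkt data downloaded, both of which we cover below):
import nltk
nltk.download('punkt')

sentence = "Google understands entities, not just strings."
print(nltk.word_tokenize(sentence))
# ['Google', 'understands', 'entities', ',', 'not', 'just', 'strings', '.']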
How To Kick Off Entity Research in NLP
First things first, we need to import Pandas to load and manipulate our data, and the Natural Language Toolkit (NLTK) to perform the tokenization.
import pandas as pd
import nltk
Import the Data
We use Pandas to import the data you extracted from Screaming Frog or another crawler of your choice. For demonstration purposes, I am importing a dataset of titles and descriptions from SEO Depths.
However, I do suggest importing the copy from your landing pages or product pages. This is an excellent pathway to evaluating whether your copy contains potential entities.
df = pd.read_excel('/content/interni_html.xlsx')
df.head()
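Side note: the requirements above mention a CSV export from Screaming Frog, while my demo file happens to be an Excel sheet. If you are working with the raw CSV instead, the equivalent step would look like this (the file name and path are just placeholders for your own export):
# Hypothetical path/name for a Screaming Frog CSV export; adjust to your own file
df = pd.read_csv('/content/internal_html.csv')
df.head()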
Concatenate the text into a single column
When performing NLP tasks, you want to audit all of the text available rather than a single column at a time. What we are going to do is merge the text from the two columns into one, concatenating them with the + operator and a ' ' separator.
df['text'] = df['title'] + ' ' + df['description']
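If your export also includes the H1 column listed in the requirements, you could fold it into the same field. Here is a sketch assuming the column is literally named 'H1' in your data frame:
# Optional: also include the H1 column (assumes your export names it 'H1')
df['text'] = df['title'] + ' ' + df['description'] + ' ' + df['H1'].fillna('')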
Remove NaN values and cast to string
Now we need to make sure that NaN values are not messing up our new data frame, and that we only deal with string values.
df.dropna(subset=['text'], inplace=True)
df['text'] = df['text'].astype(str)
df.head()
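If you want to double-check that the clean-up worked, a couple of quick sanity checks will confirm there are no NaN values left and that the column now holds strings:
print(df['text'].isna().sum())  # should print 0
print(df['text'].dtype)         # object, i.e. Python strings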
Create a tokenizer using NLTK
Next, we use NLTK to create a tokenizer function. The command below fires up the NLTK downloader and installs the punkt data, the pre-trained tokenizer data NLTK needs to break a sentence into individual word values.
nltk.download('punkt');
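Depending on the NLTK version installed in your environment, word_tokenize may also ask for the newer punkt_tab data package. If you hit a LookupError later on, this extra download should fix it (this is an environment-specific assumption, not part of the original workflow):
nltk.download('punkt_tab');  # only needed on recent NLTK versions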
To make it easier to apply tokenization to our new Pandas data frame column, we need to cook up a few more lines of code.
def tokenize(column):
    """Tokenizes a Pandas dataframe column and returns a list of tokens.

    Args:
        column: Pandas dataframe column (i.e. df['text']).

    Returns:
        tokens (list): Tokenized list, i.e. [Donald, Trump, tweets]
    """
    tokens = nltk.word_tokenize(column)
    return [w for w in tokens if w.isalpha()]
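Before applying it to the whole data frame, you can sanity-check the function on a single row, just a quick usage example:
# Try the tokenizer on the first row's text
print(tokenize(df['text'].iloc[0]))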
Tokenize your text data using NLTK
Finally, we run the function on our Pandas column. A lambda function passes each row's text value to tokenize(), which uses NLTK to split it up, and the result is a new Pandas column containing the comma-separated tokens.
df['tokenized'] = df.apply(lambda x: tokenize(x['text']), axis=1)
df[['tokenized']].head()
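As a side note, the same result can be achieved by applying the function to the Series directly, which skips the row-wise lambda. Functionally equivalent, just a stylistic choice:
df['tokenized'] = df['text'].apply(tokenize)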
Here's what you'll get: a new tokenized column where each row holds the list of word tokens extracted from that page's title and description.
Next Steps
Once your tokenization is complete, you can start performing plenty of analyses. Here are a few examples that may help your SEO strategy and lighten your daily workflow (a quick starting point is sketched right after this list):
- Falling short of ideas for kick-starting an SEO strategy? Run complete SEO market research
- Find out more about your website’s entities and how search engines perceive them by running a sentiment analysis
- Find out how to tweak your SERP positioning by benchmarking your competitor entities
- What is your site all about? Shape a topic strategy using topic modeling and find out how your site fits into your own competitive arena.
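As promised above, here is a quick starting point: counting the most frequent tokens across the corpus is a crude but handy way to surface candidate entities worth a closer look. A minimal sketch, assuming the tokenized column created earlier:
from collections import Counter

# Flatten all token lists into one corpus-wide list and count occurrences
all_tokens = [token.lower() for tokens in df['tokenized'] for token in tokens]
print(Counter(all_tokens).most_common(20))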
Further Readings
This post was inspired by a couple of existing walkthroughs, whose authors I gladly credit:
- Getting started with Google NLP API using Python by Greg Bernhardt