Google has recently been training its search ranking algorithms with machine learning models designed to improve the clustering of search results and better understand natural, human-like search queries.
As a result, content strategists are changing their approach to copywriting, viewing the World Wide Web as a collection of “Things” rather than “Strings”.
The search engine itself is not dying, it’s the way people search and browse the Internet that has changed over the years. A few personal considerations about this enchanting dive into the state of Google Search from @cwarzel. Here’s a thread 🧵 #seo https://t.co/0zUY9VTKvB — Simone De Palma 🦊 (@SimoneDePalma2) June 25, 2022
Why is that?
Because this shift is the gateway to the realm of Entity Research.
In this post, I am going to show you how to kick-start Entity research from scratch with NLP. The Python framework that follows tokenizes excerpts of text to extract first-hand entities.
Requirements & Assumptions
For the purpose of this project, we are going to abide by a few preliminary requirements.
What is Tokenization?
Tokenization is a data science technique that breaks the text of a sentence into a list of distinct words or values, called tokens. It’s the entry gate to processing text data with Natural Language Processing (NLP).
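To make the idea concrete, here is a minimal sketch of what tokenization produces, using plain Python (a naive alphabetic-only split, not the NLTK tokenizer used later in this post):

```python
import re

def naive_tokenize(text):
    # Keep only runs of alphabetic characters, mimicking the
    # alphabetic filter we apply later to the NLTK tokens.
    return re.findall(r"[A-Za-z]+", text)

print(naive_tokenize("Google sees the Web as Things, not Strings."))
# → ['Google', 'sees', 'the', 'Web', 'as', 'Things', 'not', 'Strings']
```

The punctuation and the sentence structure disappear; what remains is a list of individual word tokens ready for further processing.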
In fact, before diving into more granular NLP machine learning techniques, you need to fragment your data. Many programming languages can handle this task, but Python is undoubtedly the easiest to work with.
Here, we are going to leverage Python’s Natural Language Toolkit (NLTK) to take text data from a Pandas data frame and return a tokenized list of words.
How To Kick Off Entity Research in NLP
First things first, we need to import Pandas to load and manipulate our data, and the Natural Language Toolkit (NLTK) to perform the tokenization.
import pandas as pd
import nltk
Import the Data
We use Pandas to import the data extracted from Screaming Frog or another crawler of your choice. For demonstration purposes, I am importing a dataset of titles and descriptions from SEO Depths.
However, I do suggest importing the copy from your own landing pages or product pages. This is an excellent way to evaluate whether your copy contains potential entities.
df = pd.read_excel('/content/interni_html.xlsx')
df.head()
Concatenate the text into a single column
When performing NLP tasks, you want to audit all of the available text at once rather than narrowing the analysis down to individual columns. We are going to merge the text from the two columns using concatenation via the + operator, with a space (' ') as separator.
df['text'] = df['title'] + ' ' + df['description']
Remove NaN values and cast to string
Now we need to make sure that no NaN values interfere with our new column, and that we are dealing with string values only.
df.dropna(subset=['text'], inplace=True)
df['text'] = df['text'].astype(str)
df.head()
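To see why this step matters, consider a toy data frame (with made-up titles) where one description is missing: concatenating a string with a missing value yields NaN, which the dropna step then removes.

```python
import pandas as pd

# Hypothetical mini-dataset: the second row has no description.
df = pd.DataFrame({
    'title': ['Entity SEO Guide', 'Tokenization 101'],
    'description': ['How entities shape rankings.', None],
})

# Title + missing description propagates NaN into the combined column.
df['text'] = df['title'] + ' ' + df['description']
print(df['text'].isna().tolist())  # → [False, True]

# Drop the incomplete rows, then force string dtype.
df.dropna(subset=['text'], inplace=True)
df['text'] = df['text'].astype(str)
print(len(df))  # → 1
```

Without this cleanup, the NaN rows would later reach the tokenizer as non-string values and raise errors.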
Create a tokenizer using NLTK
Next, we use NLTK to create a tokenizer function. The command below launches the NLTK downloader and installs the punkt data, a pre-trained model responsible for splitting a sentence into individual word tokens.

nltk.download('punkt')
In order to make it easier to apply tokenization to our new Pandas data frame column, we need to cook up a few further lines of code.
def tokenize(column):
    """Tokenizes a Pandas dataframe column and returns a list of tokens.

    Args:
        column: Pandas dataframe column (i.e. df['text']).

    Returns:
        tokens (list): Tokenized list, i.e. [Donald, Trump, tweets]
    """
    tokens = nltk.word_tokenize(column)
    return [w for w in tokens if w.isalpha()]
Tokenize your text data using NLTK
Finally, we apply the function to the Pandas column. A lambda function lets us pass each value of the text column to NLTK for tokenization and return a new Pandas column containing the resulting lists of tokens.
df['tokenized'] = df.apply(lambda x: tokenize(x['text']), axis=1)
df[['tokenized']].head()
Here’s what you’ll get
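If you want to try the same apply-with-lambda pattern without downloading the NLTK data, here is a self-contained sketch that swaps in a simplified stand-in tokenizer (a plain whitespace split with the same alphabetic filter):

```python
import pandas as pd

# Simplified stand-in for the NLTK-based tokenize() above, so this
# sketch runs without the punkt data.
def tokenize(column):
    return [w for w in column.split() if w.isalpha()]

# Hypothetical text column standing in for the crawled titles.
df = pd.DataFrame({'text': ['Entities beat strings', 'SEO in 2022']})

# Same apply-with-lambda pattern as in the post.
df['tokenized'] = df.apply(lambda x: tokenize(x['text']), axis=1)
print(df['tokenized'].tolist())
# → [['Entities', 'beat', 'strings'], ['SEO', 'in']]
```

Note how the non-alphabetic token '2022' is filtered out, just as `w.isalpha()` does in the NLTK version.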
As usual, Python provides fantastic support when the SEO workload piles up.
Despite this being only the very first step of Entity research, you are now in a position to investigate room for entity optimization and better target your landing pages or product pages.
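A simple way to start that investigation is to count how often each token appears across your pages: frequently recurring tokens are candidate entities worth a closer look. A sketch with collections.Counter over hypothetical token lists, shaped like the df['tokenized'] column produced above:

```python
from collections import Counter

# Hypothetical tokenized rows, as the pipeline above would produce.
tokenized_rows = [
    ['Entity', 'SEO', 'Guide'],
    ['Entity', 'research', 'Python'],
    ['Entity', 'Python', 'scripts'],
]

# Flatten the lists and tally how often each token appears.
counts = Counter(token for row in tokenized_rows for token in row)
print(counts.most_common(2))
# → [('Entity', 3), ('Python', 2)]
```

Tokens that dominate the tally across many pages are the ones to validate against a knowledge base in the next stage of Entity research.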
This post was inspired by a couple of existing resources, whose authors I credit here:
- Getting started with Google NLP API using Python by Greg Bernhardt