Google has recently been training its search ranking algorithms with machine learning models designed to improve the clustering of search results and better understand natural, human-like search queries.
As a result, content strategists are changing their approach to copywriting, viewing the World Wide Web as a collection of “Things” rather than “Strings”.
The search engine itself is not dying, it’s the way people search and browse the Internet that has changed over the years. A few personal considerations about this enchanting dive into the state of Google Search from @cwarzel. Here’s a thread 🧵 #seo https://t.co/0zUY9VTKvB — Simone De Palma 🦊 (@SimoneDePalma2) June 25, 2022
Why is that?
Because this shift is the gateway to the realm of Entity Research.
In this post, I am going to show you how to kick-start Entity research from scratch with NLP. The Python framework that follows tokenizes excerpts of text to extract first-hand entities.
Requirements & Assumptions
For the purpose of this project, we are going to abide by a few preliminary requirements.
What is Tokenization?
Tokenization is a data science technique that breaks the text of a sentence into a list of distinct words or values, called tokens. It’s the entry gate to processing text data with Natural Language Processing (NLP).
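To make the idea concrete, here is a minimal sketch of what tokenization produces, using plain Python (a naive alphabetic-only split, not the NLTK tokenizer used later in this post):

```python
import re

def naive_tokenize(text):
    # Keep only runs of alphabetic characters, mimicking the
    # alphabetic filter we apply later to the NLTK tokens.
    return re.findall(r"[A-Za-z]+", text)

print(naive_tokenize("Google sees the Web as Things, not Strings."))
# → ['Google', 'sees', 'the', 'Web', 'as', 'Things', 'not', 'Strings']
```

The punctuation and the sentence structure disappear; what remains is a list of individual word tokens ready for further processing.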
In fact, before diving into more granular NLP machine learning techniques, you need to fragment your data. Many programming languages can handle this task, but Python is undoubtedly the easiest to work with.
Here, we are going to leverage Python’s Natural Language Toolkit (NLTK) to take text data from a Pandas data frame and return a tokenized list of words.
How To Kick Off Entity Research in NLP
First things first, we need to import Pandas to load and manipulate our data, and the Natural Language Toolkit (NLTK) to perform the tokenization.
import pandas as pd
import nltk
Import the Data
We use Pandas to import the data extracted from Screaming Frog or another crawler of your choice. For demonstration purposes, I am importing a dataset of titles and descriptions from SEO Depths.
However, I do suggest importing the copy from your own landing pages or product pages. This is an excellent way to evaluate whether your copy contains potential entities.
df = pd.read_excel('/content/interni_html.xlsx')
df.head()
Concatenate the text into a single column
When performing NLP tasks, you want to audit all of the available text at once rather than narrowing the analysis down to individual columns. We are going to merge the text from the two columns using concatenation via the + operator, with a space (' ') as separator.
df['text'] = df['title'] + ' ' + df['description']
Remove NaN values and cast to string
Now we need to make sure that no NaN values interfere with our new column, and that we are dealing with string values only.
df.dropna(subset=['text'], inplace=True)
df['text'] = df['text'].astype(str)
df.head()
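To see why this step matters, consider a toy data frame (with made-up titles) where one description is missing: concatenating a string with a missing value yields NaN, which the dropna step then removes.

```python
import pandas as pd

# Hypothetical mini-dataset: the second row has no description.
df = pd.DataFrame({
    'title': ['Entity SEO Guide', 'Tokenization 101'],
    'description': ['How entities shape rankings.', None],
})

# Title + missing description propagates NaN into the combined column.
df['text'] = df['title'] + ' ' + df['description']
print(df['text'].isna().tolist())  # → [False, True]

# Drop the incomplete rows, then force string dtype.
df.dropna(subset=['text'], inplace=True)
df['text'] = df['text'].astype(str)
print(len(df))  # → 1
```

Without this cleanup, the NaN rows would later reach the tokenizer as non-string values and raise errors.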
Create a tokenizer using NLTK
Next, we use NLTK to create a tokenizer function. The command below launches the NLTK downloader and installs the punkt data, a pre-trained model responsible for splitting a sentence into individual word tokens.

nltk.download('punkt')
In order to make it easier to apply tokenization to our new Pandas data frame column, we need to cook up a few further lines of code.
def tokenize(column):
    """Tokenizes a Pandas dataframe column and returns a list of tokens.

    Args:
        column: Pandas dataframe column (i.e. df['text']).

    Returns:
        tokens (list): Tokenized list, i.e. [Donald, Trump, tweets]
    """
    tokens = nltk.word_tokenize(column)
    return [w for w in tokens if w.isalpha()]
Tokenize your text data using NLTK
Finally, we apply the function to the Pandas column. A lambda function lets us pass each value of the text column to NLTK for tokenization and return a new Pandas column containing the resulting lists of tokens.
df['tokenized'] = df.apply(lambda x: tokenize(x['text']), axis=1)
df[['tokenized']].head()
Here’s what you’ll get
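If you want to try the same apply-with-lambda pattern without downloading the NLTK data, here is a self-contained sketch that swaps in a simplified stand-in tokenizer (a plain whitespace split with the same alphabetic filter):

```python
import pandas as pd

# Simplified stand-in for the NLTK-based tokenize() above, so this
# sketch runs without the punkt data.
def tokenize(column):
    return [w for w in column.split() if w.isalpha()]

# Hypothetical text column standing in for the crawled titles.
df = pd.DataFrame({'text': ['Entities beat strings', 'SEO in 2022']})

# Same apply-with-lambda pattern as in the post.
df['tokenized'] = df.apply(lambda x: tokenize(x['text']), axis=1)
print(df['tokenized'].tolist())
# → [['Entities', 'beat', 'strings'], ['SEO', 'in']]
```

Note how the non-alphabetic token '2022' is filtered out, just as `w.isalpha()` does in the NLTK version.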
As usual, Python provides fantastic support when the SEO workload piles up.
Despite this being only the very first step of Entity research, you are now in a position to investigate room for entity optimization and better target your landing pages or product pages.
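A simple way to start that investigation is to count how often each token appears across your pages: frequently recurring tokens are candidate entities worth a closer look. A sketch with collections.Counter over hypothetical token lists, shaped like the df['tokenized'] column produced above:

```python
from collections import Counter

# Hypothetical tokenized rows, as the pipeline above would produce.
tokenized_rows = [
    ['Entity', 'SEO', 'Guide'],
    ['Entity', 'research', 'Python'],
    ['Entity', 'Python', 'scripts'],
]

# Flatten the lists and tally how often each token appears.
counts = Counter(token for row in tokenized_rows for token in row)
print(counts.most_common(2))
# → [('Entity', 3), ('Python', 2)]
```

Tokens that dominate the tally across many pages are the ones to validate against a knowledge base in the next stage of Entity research.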
This post was inspired by a couple of existing resources, whose authors I credit here:
- Getting started with Google NLP API using Python by Greg Bernhardt