Natural language processing (NLP) is a fascinating discipline that bridges linguistics, computer science, and artificial intelligence. It revolves around the relationship between computers and human language, focusing on how to enable computers to process and analyze vast volumes of natural language data.
Some practical applications of NLP include:
- spellcheck and autocorrect (WhatsApp messages)
- auto-generated video captions (Netflix, YouTube)
- virtual assistants like Alexa, Cortana, or Siri
- Google autocomplete
- Dynamic product recommendations from an eCommerce PDP
Natural language processing finds application in many branches of the digital domain, and SEO is one of them.
💡 I wrote an in-depth article to investigate how NLP can affect SEO
In this post, I am going to cover the most common NLP techniques and discuss use cases for SEO and content marketing using a little bit of Python code.
TL;DR
Natural language processing (NLP) combines computer science, linguistics, and artificial intelligence to enable computers to process human languages
The range of NLP techniques is extremely varied and includes – but is not limited to:
- Text preprocessing is a stage of NLP focused on cleaning and preparing text for other NLP tasks.
- Parsing is an NLP technique concerned with breaking up text based on grammar and syntax.
- Word vectors and embeddings are models that transform words into structured data such as vectors or binary numbers, providing semantic context to machines. Common models include bag-of-words, TF-IDF, and neural language modeling (NLM).
- Topic modeling is the NLP process by which hidden topics are identified given a body of text.
- Text similarity is a facet of NLP concerned with the similarity between instances of language.
The guide will provide you with 4 SEO use cases to streamline and improve the following tasks:
- Data Cleaning from a Crawl’s Custom Extraction
- Run an Entity Gap Analysis
- Explore Entity Similarity to extract a text’s semantics
- Audit Redirect Matching in under 2 minutes
Disclaimer
A couple of caveats before setting off.
1️⃣ Please make sure to open a new Google Colab notebook and install the following dependencies
!pip install spacy
!pip install text-preprocessing
!pip install scikit-learn
All you have to do is click on the link to Colab and hit New Notebook
2️⃣ Please be advised, the scripts are for example purposes only. The code provided is intended to help the reader get started with NLP tasks applied to ideal SEO scenarios. Feel free to submit them to ChatGPT in order to adapt the code to your needs.
3️⃣ Please note that lines starting with >>> mark the start of the output from a script.
🏎️ Right, we are off!
Text Pre-processing
Text preprocessing is all about cleaning and prepping text data so that it’s ready for other tasks.
Cleaning and organizing data helps minimize the presence of outliers in the output, whether it is for descriptive analysis or for a dataset that will be used to train a machine learning model.
In SEO, text processing is also referred to as “data cleaning”. Many people make the mistake of skipping this step and utilizing raw data, which can result in errors in their analysis.
There are several tasks concerning text pre-processing in NLP, many of which can also be applied to SEO.
Let’s see the most common.
Noise removal
The very least you can do is remove unnecessary characters and words from the text in question so that only the meaningful values are kept.
Noise removal is a text preprocessing step concerned with removing unnecessary formatting from our text. Depending on the goal of your project and where you get your data from, you may want to remove unwanted information, such as:
- Punctuation and accents
- Special characters
- Numeric digits
- Leading, ending, and vertical whitespace
- HTML formatting
The type of noise that you need to remove from text usually depends on its source.
Fortunately, you can use the .sub() method in Python’s regular expression library, re, for most of your noise removal needs.
import re
headline_one = "{'source': {'description': <h1>Nation's Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini</h1>'}}''"
# Remove HTML tags
headline_no_tag = re.sub(r"</?h1>|{'source': {'description':|[^\w\s]", '', headline_one)
try:
    print('\nCleaned Text: ' + str(headline_no_tag))
except NameError:
    print('No variable called `headline_no_tag`')
>>> Cleaned Text: Nations Top Pseudoscientists Harness HighEnergy Quartz Crystal Capable Of Reversing Effects Of Being Gemini
Tokenization
Tokenization is another text preprocessing technique aimed at breaking up text into smaller units (usually words or discrete terms).
For many natural language processing tasks, we need access to each word in a string. To access each word, we first have to break the text into smaller components. The method for breaking text into smaller components is called tokenization and the individual components are called tokens.
A few common operations that require tokenization include:
- Finding how many words or sentences appear in a text.
- Determining how many times a specific word or phrase exists.
- Accounting for which terms are likely to co-occur.
As you will see in the following script, tokens don’t have to be single words only; they can also be sentences or other chunks of text.
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
ecg_text = "An electrocardiogram is used to record the electrical conduction through a person's heart. The readings can be used to diagnose cardiac arrhythmias."
tokenized_by_word = word_tokenize(ecg_text)
tokenized_by_sentence = sent_tokenize(ecg_text)
try:
    print('\nWord Tokenization:')
    print(tokenized_by_word)
except NameError:
    print('Expected a variable called `tokenized_by_word`')

try:
    print('\nSentence Tokenization:')
    print(tokenized_by_sentence)
except NameError:
    print('Expected a variable called `tokenized_by_sentence`')
>>>
Word Tokenization: ['An', 'electrocardiogram', 'is', 'used', 'to', 'record', 'the', 'electrical', 'conduction', 'through', 'a', 'person', "'s", 'heart', '.', 'The', 'readings', 'can', 'be', 'used', 'to', 'diagnose', 'cardiac', 'arrhythmias', '.']
Sentence Tokenization: ["An electrocardiogram is used to record the electrical conduction through a person's heart.", 'The readings can be used to diagnose cardiac arrhythmias.']
Tokenization and noise removal are staples of almost all text pre-processing pipelines. However, some data may require further processing through text normalization.
Normalization for NLP – Stemming & Lemmatization
Normalization is the process of reducing words to a common base form. It’s a very common technique used to recognize the linguistic stem of words.
Tasks involving normalization for NLP include Stemming and Lemmatization.
Stemming is a technique that bluntly removes prefixes and suffixes to reduce a word to its root form. For example, “apple” and “apples” both reduce to the stem “appl” and are treated the same during the vectorization stage – when normalized words are converted into numbers and passed on to machine learning algorithms for processing.
Two widely used stemmers are the Porter Stemmer and the Snowball Stemmer.
a. The Porter Stemmer is the most common and widely used stemmer.
b. The Snowball Stemmer is a refinement of the Porter Stemmer and is generally considered more accurate, as it handles a wider range of morphological variations (and supports multiple languages).
Lemmatization is a technique that involves replacing a single-word token with its root form while taking its part of speech into account.
Because it relies on dictionaries and part-of-speech information to bring words down to their root forms, lemmatization can require more computational power and be less efficient than stemming, but it is typically more accurate.
In fact, NLTK’s savvy lemmatizer knows very well that “am” and “are” are related to the verb “to be,” as opposed to common stemming, which would normally fail in this case.
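To see the difference in practice, here is a minimal sketch (the word list is just an illustrative assumption) comparing NLTK’s Porter and Snowball stemmers with the WordNet lemmatizer:
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
nltk.download('wordnet')  # required by the WordNet lemmatizer
# Illustrative words only
words = ["apples", "running", "studies", "are"]
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()
for word in words:
    print(
        word,
        "| Porter:", porter.stem(word),
        "| Snowball:", snowball.stem(word),
        # pos='v' tells the lemmatizer to treat the token as a verb,
        # which is how "are" resolves to "be"
        "| Lemma (as verb):", lemmatizer.lemmatize(word, pos='v'),
    )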
Normalization for SEO – Lowercasing and Stopword Removal
When it comes to SEO, there are other normalization techniques that can be used to ensure consistency within a text.
Lowercasing text and stopword removal are underrated steps; skipping them may compromise the quality of your descriptive analysis.
Changing the case of a string is useful to maintain consistency across a text and reduce the risk of incorrect parsing. However, removing stopwords can have significant implications if not done correctly.
Stopword removal is a preprocessing technique that involves eliminating words that add little to the semantic meaning of a sentence. These stopwords are typically the most common words in a language and do not provide much valuable information about the tone of a statement.
For instance, English articles like “a”, “an”, and “the” are stopwords that are usually removed to reduce noise and clutter in a text.
Imagine conducting a sentiment analysis on a text filled with these low-information words. It would likely lead to inaccurate results and hinder the efficacy of the analysis.
💡Learn more about stopwords removal with this hands-on tutorial
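Here is a minimal sketch of both steps with NLTK; the sample sentence is just a placeholder:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
# Placeholder sentence
sentence = "The Product Listing Pages Are An Essential Part Of The Site"
# Lowercase first so that stopword matching is case-insensitive
tokens = word_tokenize(sentence.lower())
stop_words = set(stopwords.words('english'))
filtered = [token for token in tokens if token not in stop_words]
print(filtered)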
SEO Use Case #1 – Text Normalization from Custom Extraction
Say you have a client asking you to optimize meta tags for certain product category pages.
Sometimes, running a crawl alone may not be sufficient. In such cases, you can use custom extraction to fetch all the PLPs based on their category ID.
However, custom extractions can result in noisy output, possibly due to using incorrect XPath or CSS selectors. Instead of struggling to find the perfect XPath, you can export the output and use a few text pre-processing techniques to clean up the messy category IDs.
Here is a complete example of how to normalize category IDs obtained from a custom extraction.
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
# Load the full list of English stopwords
stop_words = stopwords.words("english")
# Import visualization libraries
import matplotlib.pyplot as plt
# Read the Excel file 'Book2.xlsx' into a DataFrame
df = pd.read_excel('/content/Book2.xlsx')
df.head()
def preprocess_text(text: str, remove_stopwords: bool) -> str:
    """Function that cleans the input text by:
    - removing links
    - removing special characters
    - removing numbers
    - removing the noisy tokens: m, w, ss, cgid
    - removing stopwords (if specified)
    - converting to lowercase
    - removing excessive white spaces
    Arguments:
        text (str): input text to be cleaned
        remove_stopwords (bool): whether to remove stopwords or not
    Returns:
        str: cleaned text
    """
    # Remove links
    text = re.sub(r"http\S+", "", text)
    # Remove the noisy tokens m, w, ss, cgid left over from the extraction
    text = re.sub(r"\b(?:cgid|m|ss|w)\b", "", text, flags=re.IGNORECASE)
    # Remove numbers and special characters
    text = re.sub("[^A-Za-z]+", " ", text)
    # Remove stopwords
    if remove_stopwords:
        # 1. Tokenize the text
        tokens = nltk.word_tokenize(text)
        # 2. Keep only tokens that are not stopwords
        tokens = [w for w in tokens if not w.lower() in stop_words]
        # 3. Merge all the tokens
        text = " ".join(tokens)
    # Return the cleaned text without excessive white spaces, converted to lowercase
    text = text.lower().strip()
    return text
# Create a new column 'cleaned' with the cleaned text from the 'Category ID3 1' column
df['cleaned'] = df['Category ID3 1'].apply(lambda x: preprocess_text(x, remove_stopwords=True))
df.head()
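Since matplotlib was imported above, you can optionally chart the most frequent tokens that survive the cleaning. This is just a sketch: it assumes the df DataFrame and its 'cleaned' column from the previous step are available, along with the plt alias.
from collections import Counter
# Count token frequencies across all cleaned category IDs
token_counts = Counter(" ".join(df['cleaned']).split())
top_tokens = token_counts.most_common(10)
# Simple bar chart of the 10 most frequent tokens
labels, counts = zip(*top_tokens)
plt.bar(labels, counts)
plt.xticks(rotation=45, ha='right')
plt.title('Most frequent tokens after cleaning')
plt.tight_layout()
plt.show()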
Parsing
How do words relate to each other and the underlying syntax? How do they relate to the meaning of a sentence?
So far, we’ve learned that text can be broken into smaller units of discrete words or even entire sentences. Tokenization is an essential stage of text pre-processing, but it lacks semantic understanding.
Parsing is a more advanced technique concerned with segmenting text based on syntax. It involves breaking the text into smaller units while taking their grammatical role within the sentence into account.
The following techniques can be used to parse text:
- Part-of-speech (POS) tagging is a method of categorizing words in a text based on their grammatical function. It involves assigning labels to words to indicate whether they are nouns, verbs, adjectives, adverbs, or other parts of speech (see the short example right after this list).
- Named Entity Recognition (NER) is used to identify and classify specific named entities like people, organizations, and locations in a given text. It can be useful for enhancing brand recognition through schema markup by highlighting key individuals, locations, and organizations associated with your brand.
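Before moving on to the full NER script, here is a minimal POS-tagging sketch (the sentence is an arbitrary example). The first letter of each returned tag (N, V, J, R) is what the lemmatization helpers later in this post key off:
import nltk
from nltk import word_tokenize, pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
sentence = "Orphan pages quietly waste crawl budget"
# Each token comes back paired with a Penn Treebank tag,
# e.g. NN* for nouns, VB* for verbs, JJ* for adjectives, RB* for adverbs
print(pos_tag(word_tokenize(sentence)))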
The ne_chunk function from Python’s nltk library can help us identify the entities emerging from a text and label them with a list of entity types.
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# Function for NER to extract entities
def extract_entities(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Remove stopwords (optional)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.lower() not in stop_words]
    # Part-of-Speech tagging
    pos_tags = pos_tag(words)
    # Perform Named Entity Recognition
    entities = ne_chunk(pos_tags, binary=False)
    return entities
# Example text
text = """
Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.
The company's headquarters is located in Cupertino, California.
Tim Cook is the current CEO of Apple Inc.
"""
# Extract entities from the text
entities = extract_entities(text)
# Function to display named entities
def display_entities(tree):
    # Print subtrees labeled with the entity types we care about; recurse into everything else
    if hasattr(tree, 'label') and tree.label() in ['GPE', 'PERSON', 'ORGANIZATION']:
        print(' '.join(child[0] for child in tree), ':', tree.label())
    elif hasattr(tree, 'label'):
        for child in tree:
            display_entities(child)
# Display the named entities
display_entities(entities)
Here are some of the entity types commonly used in named entity recognition. Note that NLTK’s default ne_chunk emits only a subset of these, such as PERSON, ORGANIZATION, GPE, LOCATION, and FACILITY.
| Entity Type | Description |
| --- | --- |
| GPE | Geo-Political Entity |
| PERSON | Person’s Name |
| ORGANIZATION | Organization Name |
| FACILITY | Facility Name (e.g., airports, buildings) |
| GSP | Geopolitical Location (e.g., countries, cities) |
| DATE | Dates |
| TIME | Time Expressions |
| MONEY | Monetary Values |
| PERCENT | Percentage Expressions |
| CARDINAL | Numerals (e.g., one, two, three) |
| ORDINAL | Ordinal Numbers (e.g., first, second, third) |
| QUANTITY | Quantities (e.g., measurements, weights) |
| PRODUCT | Product Names |
| EVENT | Event Names |
| WORK_OF_ART | Titles of Works of Art |
| LAW | Legal Document Titles |
Word Vectors
Basic machine learning algorithms for NLP are based on the idea of word vectors.
Word vectors are models that convert each word (unstructured text) into vectors or binary numbers (structured data). This is useful to provide search engines with an appropriate context to understand a piece of content (or even a search query).
There are several techniques used to generate word vectors:
Bag of Words (BoW)
Bag-of-words (BoW) is a statistical language model that focuses on word count in a text. It represents a way for computers to understand language based on probability.
BoW counts how many times each word occurs in a document, typically after the words have been normalized to their stems or lemmas.
While unable to grasp the semantic relationships between words, it still captures their frequency.
Nevertheless, the model can have many use cases including:
- determining topics in a song
- filtering spam from your inbox
- finding out if a tweet has positive or negative sentiment
- creating word clouds
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ, ADV  # Import part-of-speech constants
from collections import Counter
nltk.download('wordnet')  # needed by the WordNet lemmatizer
nltk.download('averaged_perceptron_tagger')  # needed for POS tagging
# Define a function to get the WordNet part of speech for lemmatization
def get_part_of_speech(word):
    # Map the first letter of the Penn Treebank tag to a WordNet constant
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": ADJ, "N": NOUN, "V": VERB, "R": ADV}
    return tag_dict.get(tag, NOUN)
# Change text to another string:
text = '''
Today is gonna be the day that they're gonna throw it back to you
And by now, you should've somehow realised what you gotta do
I don't believe that anybody feels the way I do about you now
And backbeat, the word is on the street that the fire in your heart is out
I'm sure you've heard it all before, but you never really had a doubt
I don't believe that anybody feels the way I do about you now
And all the roads we have to walk are winding
And all the lights that lead us there are blinding
There are many things that I would like to say to you, but I don't know how
Because maybe
You're gonna be the one that saves me
And after all
You're my wonderwall
'''
# Helper function to expand contractions (e.g., "gonna" to "going to")
def expand_contractions(text):
    contractions_dict = {"gonna": "going to"}
    # Pattern to match contractions
    contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, text)

expanded_text = expand_contractions(text)

# Clean the text by removing non-word characters and converting to lowercase
cleaned = re.sub(r'\W+', ' ', expanded_text).lower()
# Tokenize the cleaned text
tokenized = word_tokenize(cleaned)
# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered = [word for word in tokenized if word not in stop_words]
# Lemmatize the filtered words
lemmatizer = WordNetLemmatizer()
# The lemmatization process is based on the word's part of speech for better accuracy
lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in filtered]
# Print the normalized words with Lemmatization
print("nNormalized words with Lemmatization: " + str(lemmatized))
# Define the Bag of Words & print it
bag_of_words = Counter(lemmatized)
print("nBag of Words: " + str(bag_of_words))
>>>
Normalized words with Lemmatization: ['today', 'go', 'day', 'go', 'throw', 'back', 'somehow', 'realise', 'get', 'ta', 'believe', 'anybody', 'feel', 'way', 'backbeat', 'word', 'street', 'fire', 'heart', 'sure', 'heard', 'never', 'really', 'doubt', 'believe', 'anybody', 'feel', 'way', 'road', 'walk', 'wind', 'light', 'lead', 'u', 'blinding', 'many', 'thing', 'would', 'like', 'say', 'know', 'maybe', 'go', 'one', 'save', 'wonderwall']
Bag of Words: Counter({'go': 3, 'believe': 2, 'anybody': 2, 'feel': 2, 'way': 2, 'today': 1, 'day': 1, 'throw': 1, 'back': 1, 'somehow': 1, 'realise': 1, 'get': 1, 'ta': 1, 'backbeat': 1, 'word': 1, 'street': 1, 'fire': 1, 'heart': 1, 'sure': 1, 'heard': 1, 'never': 1, 'really': 1, 'doubt': 1, 'road': 1, 'walk': 1, 'wind': 1, 'light': 1, 'lead': 1, 'u': 1, 'blinding': 1, 'many': 1, 'thing': 1, 'would': 1, 'like': 1, 'say': 1, 'know': 1, 'maybe': 1, 'one': 1, 'save': 1, 'wonderwall': 1})
Unlike other language models, BoW focuses on individual words rather than word sequences, offering a host of advantages:
- it reduces data sparsity (i.e., the problem of having too few observations of each word pattern to learn from)
- it reduces overfitting, which happens when a model over-relies on its training dataset, learns its outliers, and therefore returns inaccurate results on new data
The combination of reduced data sparsity and less overfitting makes the bag-of-words model particularly reliable, especially when dealing with smaller training data sets.
However, be advised that BoW has higher perplexity than other models, making it less ideal for language prediction.
TF-IDF
TF-IDF (term frequency–inverse document frequency) is a statistical measure used in NLP to understand the importance of words in a collection of documents. It calculates a score for each word based on its relevance to a specific document relative to the rest of the corpus: the higher the score, the more important the word.
💡 TF-IDF deprioritizes the most common words to focus on less frequently used terms. Although it may seem counter-intuitive, deprioritizing the most frequent words makes sense when working with a large amount of data because it automatically downweights stopwords like “the” and “is.”
While Bag-of-Words merely counts words and their frequencies in a text, TF-IDF evaluates the significance of each word based on both its frequency and rarity. This means TF-IDF provides a more contextual measure by comparing the importance of the most frequent words with the least frequent ones in a text.
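To make that concrete, here is a tiny sketch on a toy corpus (the three sentences are made up). The term “seo” appears in every document, so it receives a lower weight than the rarer terms:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
toy_corpus = [
    "seo audit for orphan pages",
    "seo migration checklist",
    "seo entity gap analysis",
]
vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(toy_corpus)
# Rows = documents, columns = terms; 'seo' gets the lowest idf because it appears everywhere
df_scores = pd.DataFrame(scores.todense(), columns=vectorizer.get_feature_names_out())
print(df_scores.round(2))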
SEOs have been using TF-IDF in the past few years to assess the semantic importance of entities on their web pages. While the idea is not wrong, Google now gives more importance to semantic search rather than lexical search. Therefore, relying on TF-IDF might be a flaky approach to determining relevance for rankings.
SEO Use Case #2 – Entity Gap Analysis of Article Intros
Say you want to outrank a competitor that’s using compelling blog post intros.
The first step is to compare your introductions to your competitors to identify potential weaknesses and room for improvement.
By analyzing what makes your competitors’ intros stand out, you can gain valuable insights into crafting compelling intros that grab readers’ attention.
But how?
You can run an entity gap analysis using text preprocessing and normalization techniques before applying TF-IDF.
The following example benchmarks the intro to my recent Orphan Pages audit with Python (document 4 below) against articles from SEJ, BrightEdge, and Ahrefs (documents 1 to 3).
import pandas as pd
import re
import string
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')  # needed for POS-aware lemmatization below
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
document_1 = '''
Finding webpages that have no links is difficult, but not impossible.
If there are pages on your website that users and search engines can’t reach, this is a problem you need to fix.
Fast.
These types of pages have a name: orphan pages.
In this post, you’ll learn what orphan pages are, why fixing them is important for SEO, and how to find every orphan page on your site.
'''
document_2 = '''
Orphan pages are website pages that are not linked to from any other page or section of your site. This means a user cannot access the page without knowing the direct URL. Additionally, these pages can’t be followed from another page by search engine crawlers, which means they are rarely indexed by search engines. In order for crawlers to find your pages, they need to be linked to other pages. Think of it like an actual web for a spider to crawl on. If parts of it are broken, the spider will have a difficult time getting from one place to another.
Most importantly, orphan pages represent missed opportunities to acquire and engage customers and can hurt your bounce rate. Fortunately, losing out on page traffic, retention, and revenue and hurting your SEO success because of orphan pages is something that can be easily remedied. Here is how you can use BrightEdge to cure your site of orphan pages.
'''
document_3 = '''
These pits of technical site errors, littered by several generations of previous agencies, slow down and hinder SEO efforts and progress.
And when you’re the one tasked to clean it up, finding the quick fixes is your number one task.
So you may start with a basic site audit and see several orphan pages. You’ve probably heard that orphan pages are bad for a site but do not fully understand what they are and how to fix them.
In this article, you’ll learn:
What orphan pages are
What causes orphan pages
Why orphan pages are bad for SEO
How to find orphan pages
How to fix orphan pages
How to prevent orphan pages
'''
document_4 = '''
Orphan pages are pages within a website’s architecture that are not linked to the main navigation.
Typically, these pages are not indexed unless they are linked historically or from external sources like XML Sitemaps or external links.
While having a small number of orphan pages is generally not a significant concern according to Google, it can become problematic at scale. Orphan pages can contribute to index bloat and waste crawl budget, leading to lower search rankings.
During my audits of large eCommerce websites, I often come across orphan pages as one of the most common issues.
The main causes of orphan pages include:
Discontinued product pages: This is the most common cause, where out-of-stock items are still present on valid and indexable pages (returning HTTP 200 and allowing indexing and following).
Old unlinked pages: Pages that were once published but are no longer linked within the website’s structure.
Site architecture issues: Poor vertical linking, where pages are not connected properly within the hierarchy of the website.
Auto-generation of unknown URLs at the CMS level.
Massive rendering issues that prevent search engines from accessing internal links.
Now that we have a better understanding of orphan pages, we can dive into the details using Python and SEO techniques.
For this tutorial, I will be using sample screenshots from a recent audit conducted on a large fashion luxury eCommerce website.
In this post, I will outline an accurate method using basic Python operations to streamline the process of auditing orphan pages.
'''
# Preprocess documents
def preprocess_text(text):
    # Remove non-word characters and tokenize
    cleaned = re.sub(r'\W+', ' ', text)
    tokenized = word_tokenize(cleaned)
    # Remove stopwords, punctuation, and numbers
    stop_words = set(stopwords.words('english'))
    text_without_stopwords = [word for word in tokenized if word.lower() not in stop_words]
    text_without_stopwords = [word.translate(str.maketrans('', '', string.punctuation + string.digits)) for word in text_without_stopwords]
    # Map a word's Penn Treebank tag to the WordNet part-of-speech label expected by the lemmatizer
    def get_part_of_speech(word):
        tag = nltk.pos_tag([word])[0][1]
        if tag.startswith('J'):
            return 'a'
        elif tag.startswith('V'):
            return 'v'
        elif tag.startswith('N'):
            return 'n'
        elif tag.startswith('R'):
            return 'r'
        else:
            return 'n'
    # Lemmatization (POS-aware), then lowercasing
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word, get_part_of_speech(word)) for word in text_without_stopwords if word]
    lowercased_lemmas = [lemma.lower() for lemma in lemmatized]
    return ' '.join(lowercased_lemmas)
corpus = [document_1, document_2, document_3, document_4]
# Process documents
processed_corpus = [preprocess_text(doc) for doc in corpus if doc is not None and isinstance(doc, str)]
# Initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
tf_idf_scores = vectorizer.fit_transform(processed_corpus)
# Get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()
corpus_index = [f"Document {i+1}" for i in range(len(processed_corpus))]
# Create a DataFrame with the transpose of 'tf_idf_scores' and set 'feature_names' as index and 'corpus_index' as columns
df_tf_idf = pd.DataFrame(tf_idf_scores.T.todense(), index=feature_names, columns=corpus_index)
# Reset the index and move it to a separate column named 'Word'
df_tf_idf = df_tf_idf.reset_index()
df_tf_idf = df_tf_idf.rename(columns={'index': 'Word',
'Document 1':'SEJ',
'Document 2':'BrightEdge',
'Document 3':'Ahrefs',
'Document 4':'SEO Depths'})
# Save the DataFrame to an Excel file named 'tf_idf.xlsx' (excluding the index)
df_tf_idf.to_excel('tf_idf.xlsx', index=False)
# Print the DataFrame
df_tf_idf
This output allows you to identify areas where your article intros might be lagging in entity weights compared to the competition.
But what if you want to focus on the competition’s hotspot and see how the entities used in your intros actually perform?
A little bit of data wrangling can make it happen.
# Keep only the terms with a non-zero score in every document
competition = df_tf_idf.loc[(df_tf_idf != 0).all(axis=1)].copy()
#convert floats to integers
competition['Ahrefs'] = competition['Ahrefs'].astype(int)
competition['BrightEdge'] = competition['BrightEdge'].astype(int)
competition['SEJ'] = competition['SEJ'].astype(int)
competition['SEO Depths'] = competition['SEO Depths'].astype(int)
#print results
competition
💡 I wrote a full tutorial on how you can run a semantic competitor analysis with NLP techniques
Word Embeddings
The terms “word embeddings” and “word vectors” are often used interchangeably in the SEO industry.
This is inaccurate and potentially misleading, as it overlooks the fact that word embeddings, unlike simpler word vectors, are the ones capable of capturing semantic relationships.
💡 Word embeddings are a special subset of word vectors designed to grasp the semantic and syntactic meaning of words.
In layman’s terms, the idea behind word embeddings is that a word’s meaning can be understood by its context or the words that surround it.
In more technical terms, embeddings are dense vectors of floating-point numbers generated using neural network-based models, such as Word2Vec, Transformers, and BERT.
💡 Wordlift does an amazing job of preaching the advantages of NLP in SEO and machine learning. You can find more detailed information about word embeddings in this slideshare.
Word embeddings are not only the foundation of large GPT language models but also the core mechanism used in major search engines like Google.
An example of the application of vector embeddings in Google Search is the computation of cosine similarity to assess the relevance of search queries against a set of web pages and to determine the match between a source and a target URL.
In this scenario, Google would look into its index and calculate a semantic distance between the historical and current versions of the page to define parity.
The lower the distance, the more likely the redirect is valid, and it passes link equity to the destination.
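As a simplified illustration of the idea (the vectors below are made up; in practice they would come from an embedding model), cosine similarity between two vectors can be computed like this:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Hypothetical embeddings for a source URL and a redirect target
source_vec = np.array([[0.12, 0.85, 0.33, 0.47]])
target_vec = np.array([[0.10, 0.80, 0.30, 0.52]])
# Scores range from -1 to 1; the closer to 1, the more semantically similar
similarity = cosine_similarity(source_vec, target_vec)[0][0]
print(f"Cosine similarity: {similarity:.3f}")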
Word2Vec and Text Embeddings in Machine Learning
Let’s get a bit more technical.
Now that we know the gist of word embeddings, it’s important to address common models used to train them.
⚠️ Bear in mind that what follows is critical to understand how to build a word embedding model to use in semantic SEO.
Word2Vec is a neural network model that represents words in a vector space to generate word embeddings.
In plain English, Word2Vec converts words into digits in order to measure semantic similarity based on the surrounding context.
Word2Vec can be trained with two approaches: continuous bag-of-words (CBOW) and skip-gram. The skip-gram approach predicts the words surrounding a target word within a window of a document, so the model learns from nearby context. This leads to a better understanding of the connection between terms such as “strong” and “powerful,” as opposed to “strong” and “Paris.”
Given that word embeddings are created based on the context, the larger and more diverse the text, the better the word embeddings become.
This means that if we submitted a large article to be trained on Word2Vec using Gensim, the generated embeddings would help measure similarities between words by their underlying semantics.
SEO Use Case #3 – Entity Similarity to Learn Semantic Influences in an Article
In linguistics, words are signals used to convey meaning, and understanding how this works is important for creating content.
Using an article from Healthline, we’re going to explore how key entities from a text relate to the most similar words in the article. This will help us grasp the underlying copywriting strategy.
import gensim
import pandas as pd
from nltk.corpus import stopwords
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
# Load the blog data from the Excel file
blog = pd.read_excel('/content/text.xlsx')
# Load stop words
stop_words = stopwords.words('english')
# Preprocess text and remove stopwords
def preprocess_text(text: str, remove_stopwords: bool) -> list:
    # Remove links
    text = re.sub(r"http\S+", "", text)
    # Remove numbers and special characters
    text = re.sub("[^A-Za-z]+", " ", text)
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    # Remove stopwords
    if remove_stopwords:
        tokens = [w.lower().strip() for w in tokens if w.lower() not in stop_words]
    # Return a list of cleaned tokens
    return tokens
# Preprocess text in the 'text' column of the blog DataFrame
blog_processed = [preprocess_text(sentence, True) for sentence in blog['text']]
# Train word embeddings model
model = gensim.models.Word2Vec(blog_processed, vector_size=1000, window=5, min_count=1, workers=2, sg=1)
'''
FYI
vector_size = the dimensionality of the vectors. Common sizes range from 100 to 300; higher values can capture more nuance at the cost of more data and compute.
window = the maximum distance between the current and predicted word within a sentence. The larger the window, the broader the context captured by the model, while a smaller size focuses on local context.
sg = the training algorithm -->
0 = the model uses CBOW (Continuous Bag of Words)
1 = the model uses Skip-gram.
'''
# Measure similarity between "caregiver" and the other terms in the vocabulary
similarity_scores = {}
vocabulary = model.wv.index_to_key  # the vocabulary learned by the model
for term in vocabulary:
    similarity_scores[term] = model.wv.similarity('caregiver', term)
# Sort similarity scores in descending order
sorted_similarity_scores = {k: v for k, v in sorted(similarity_scores.items(), key=lambda item: item[1], reverse=True)}
# Print the top 10 most similar terms to "caregiver"
print("Top 10 most similar terms to 'caregiver':")
for term, score in list(sorted_similarity_scores.items())[:10]:
    print(f"{term}: {score}")
💡 If you are interested in training a Word2Vec model with Gensim, I highly recommend reading a guide on how to cluster and find keyword similarities from a blog called Diario di un Analista
However, you should note that Word2Vec has some limitations as it mainly focuses on nearby words and may not fully capture the meaning of words in a text.
Advanced word embeddings like BERT and Transformers are better at understanding the context and can provide more accurate results.
Topic Modeling
Another common NLP technique is topic modeling, an area dedicated to uncovering latent, or hidden, topics within a body of language.
Topic modeling involves using a probabilistic machine model of language to generate topics by vectorizing text from a corpus.
A commonly used vectorization technique is TF-IDF, which, as we learned, deprioritizes the most frequent terms so that more distinctive terms can emerge as topics.
Once you have vectorized words using TF-IDF, the next step in your topic modeling journey is often latent Dirichlet allocation (LDA), a statistical model that takes your documents and determines which words keep reappearing in the same contexts (i.e., documents).
💡 I have written an extensive guide on topic modeling for SEO and how to use it to improve your content strategy using Python
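For reference, here is a minimal sketch of the vectorize-then-LDA pipeline described above, using scikit-learn on a toy corpus (the four documents and the choice of 2 topics are placeholder assumptions):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
docs = [
    "orphan pages waste crawl budget and hurt indexing",
    "redirect mapping after a site migration protects link equity",
    "crawl budget optimization starts with internal linking",
    "audit redirects to preserve rankings after a migration",
]
# Vectorize the corpus, then fit LDA to uncover 2 latent topics
vectorizer = TfidfVectorizer(stop_words='english')
doc_term_matrix = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(doc_term_matrix)
# Print the top 5 terms for each discovered topic
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx + 1}: {', '.join(top_terms)}")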
Text Similarity
Text similarity is a fundamental area of NLP with a myriad of practical applications. Whether it’s identifying similar documents, enabling spell-check, or improving autocorrect features, the quest to measure similarity between texts is a crucial aspect of language processing.
Text similarity can be split into two key components:

| Component | Description |
| --- | --- |
| Semantic Similarity | It captures how closely two documents convey the same meaning. |
| Lexical Similarity | It evaluates the overlap in vocabulary between two texts (e.g., documents with the exact same vocabulary score a lexical similarity of 1, and vice versa). |
Handling word similarity and addressing misspellings are critical tasks in language processing.
The Levenshtein distance is one common approach. It is defined as the minimal edit distance, i.e. the number of insertions, deletions, and substitutions required to transform one word into another.
💡 The Levenshtein distance is an efficient tool to measure lexical similarity and offers insight into how many operations are required to transform one string into another.
A higher distance implies lower similarity, while a lower distance indicates stronger similarity.
For example, a distance of 0 indicates an exact match, while a higher distance means more changes are required to transform one word into the other.
This concept finds applications in NLP and SEO. As a lexical similarity measure, it opens up practical solutions for various technical SEO tasks.
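As a quick sanity check before the full audit, the classic textbook pair below shows how the distance is counted (the URLs are made up):
from nltk.metrics import edit_distance
# 'kitten' -> 'sitten' -> 'sittin' -> 'sitting' requires 3 edits
print(edit_distance("kitten", "sitting"))  # 3
# A renamed slug only needs a handful of edits, so the distance stays low
print(edit_distance("/shoes/nike-air-max/", "/shoes/nike-airmax/"))  # 1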
SEO Use Case #4 – Audit Redirect URLs for Similarity
In the context of a post-migration review or a simple technical audit, you can use the Levenshtein distance to audit redirect matching.
Here’s a 2-minute example you could use in the context of a tech audit.
!pip install python-Levenshtein==0.12.2
import pandas as pd
import Levenshtein as lev
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.metrics import edit_distance
# Function to calculate the Levenshtein distance between two URLs
def calculate_levenshtein_distance(source, destination):
    return edit_distance(source, destination)
# Read Excel with Source and Destination strings
df = pd.read_excel('/content/Book1.xlsx')
# Calculate Levenshtein distance for each row
df['Levenshtein Distance'] = df.apply(lambda row: calculate_levenshtein_distance(row['Source'], row['Destination']), axis=1)
# Print the DataFrame with the Levenshtein distances
df
Remember, the higher the score, the greater the difference between the URLs.
However, bear in mind that this is just one of many NLP techniques that can be applied to SEO, and it is merely a tool.
In fact, you should carefully examine the redirects and use your domain knowledge before providing SEO recommendations.
NLP for SEO – the sky is the limit
Natural Language Processing (NLP) is used by semantic search engines like Google to process the content of a page and match it up with a search query. This means that it mainly operates within the indexing and ranking stages of Google’s pipeline rather than the crawling stage.
NLP offers such a broad range of techniques that it can really help streamline and amplify the impact of SEO execution, where speed is key to gaining a competitive advantage nowadays.
On the flip side, NLP only provides raw tools that must be paired with SEO expertise and domain knowledge.
I highly discourage you from blindly relying on the provided scripts to deliver SEO recommendations. However, this should not stop you from testing and expanding their field of application.
The sky is the limit in NLP for SEO, but your domain knowledge and business acumen should always come first.
FAQ
What is Natural Language Processing (NLP)?
NLP is a discipline that enables computers to process and analyze human language data.
What are some practical applications of NLP?
NLP is used in spellcheck, autocorrect, virtual assistants, Google autocomplete, and dynamic product recommendations.
How does NLP relate to SEO?
NLP finds application in SEO for text processing, topic modeling, and text similarity measurement.
How does tokenization work in NLP?
Tokenization breaks text into smaller units, like words or sentences, for further analysis.
What are some common NLP techniques used in text preprocessing?
Noise removal, tokenization, stemming, and lemmatization are common text preprocessing techniques.
What is the purpose of noise removal in text preprocessing?
Noise removal eliminates unnecessary characters, punctuation, and other unwanted formatting from text.
How do parsing techniques help in NLP?
Parsing segments text based on syntax, helping machines understand the relationships between words.
What are word vectors and word embeddings in NLP?
Word vectors and embeddings convert words into structured data, providing semantic context.
Why are word embeddings useful?
Word embeddings enable us to compare and contrast how words are utilized and recognize words that appear in similar contexts.
What’s the difference between TF-IDF and Bag-of-Words?
BoW counts word occurrences without considering relevancy, while TF-IDF evaluates word significance based on frequency and rarity across the corpus, providing a more contextual measure of word importance in a text.
Why Use Gensim versus spaCy for generating word embeddings?
Gensim lets you train word embeddings on any corpus of text, while spaCy ships with pretrained vectors.
How does text similarity measurement work in NLP?
Text similarity is measured using metrics like Levenshtein distance, indicating how similar two words are.
How can NLP and SEO be combined effectively?
By utilizing NLP techniques in text preprocessing, understanding context, and measuring text similarity, NLP can enhance SEO practices.