The image most of us associate with the term "entity" in the context of a Google search is the long bar of clustered suggestion boxes that appears underneath the search bar in Google Images.
These clustered boxes of "things" are better known as entities, and they are usually identified through a process called Named Entity Recognition (NER).
Entities, though, do not only rule visual search; they also shape the traditional search journey.
As an example, two similar queries may return different results because of the addition of a core entity. While a search query for "white house" may refer to a place (img. 1), a slightly different query such as "white house paint" might instead refer to the colour "white" and the product category "paint" (img. 2).
Following the ongoing progress in AI and machine learning models, Google is able to read the web with increasing depth; in fact, Google now deciphers the Internet more as things than as strings.
What can we do as SEOs?
Google is learning to parse plain text copy and break down sentences in a bid to build its own entity network across the World Wide Web. SEOs, in turn, must master a few methods to automate the analysis and parsing of the sheer amount of daily SEO data.
In this post, I will walk you through a method for entity optimization, and then we will jump into a Python framework designed to extract entities and return a small audit of the sentiment emerging from a product page copy.
For the purpose of this task, I am going to refer to the Noggin board game.
⭐️ Learn how search engines understand search queries with NLP and NLU
Table of Contents
- How does Entity Optimization work?
- Requirements and Assumptions
- Import the Packages
- Upload your NLP API Key
- Text Tokenization
- Stemming the Text
- Lemmatization of the Text
- Identify Entities from Lemmatized Text
- Text-mine Sentiment Analysis with NLU
- Conclusion
How does Entity Optimization work?
Simply put, entity optimization brings together three concepts to provide the most holistic results:
- What: the central topic of the query and what the searcher expects to find in the content for that topic. Different queries may be distinct keywords, yet they represent a single topic when they mean the same thing.
- Why: The intent behind the query. Are they seeking information? Are they considering and evaluating options? Are they looking to transact and make a decision?
- How: How your content is delivered is just as important. If your audience expects a video and you deliver text, it may not have the desired impact. These are various entities that we put together to create the most holistic semantic SEO strategy. Therefore, page/content layout becomes a critical part of a semantic search strategy.
These content elements can be marked up with structured data to help search engines understand the context and relationships between the content elements and the entities in your content.
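As a minimal, hypothetical sketch of what such markup could look like for a board game product page, here is a schema.org Product description expressed as a Python dictionary (every name and value is a placeholder, not taken from a real page) that you could serialize with json.dumps and embed in a JSON-LD script tag:
import json

# Hypothetical schema.org Product markup for a board game page.
# All values below are placeholders for illustration only.
product_markup = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Noggin Board Game",
    "description": "A fast-paced card game of Letter Cards and Action Cards.",
    "brand": {"@type": "Brand", "name": "Example Games"},
    "category": "Board Games",
}

# Serialize to JSON-LD, ready to embed in a <script type="application/ld+json"> tag
print(json.dumps(product_markup, indent=2))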
Search algorithms have moved beyond keywords: they determine the context and intent behind queries by understanding the relationships that exist between entities.
To put this into context, let's move on to the operational side with the following instructions on how to extract entities and sentiment from a product page copy.
Requirements and Assumptions
For the purpose of this coding script, there are a few requirements that we need to satisfy before setting up the environment.
- Run the script on Google Colab.
- Sign up for the Google Cloud Platform and open an account. Next, create a new project, hop on the Credentials section in the sidebar and create a new service account key.
- Enable the Natural Language API for the project and download the key, which comes as a JSON file. Make sure to keep it safe and sound for later use.
- Be aware that the key file needs to be uploaded to the session every time you run this script (see the upload step in the "Upload your NLP API Key" section below).
Import the Packages
In the first part of this script, we are going to leverage NLP to tokenize a text from a product page from which we will extract entities.
To kick off, we need to import a few Python libraries.
import os  # interact with the operating system (used to set environment variables)

# Google Cloud Natural Language client libraries
# Note: the enums and types helper modules below only exist in older releases
# of google-cloud-language (pre-2.0)
from google.cloud import language_v1
from google.cloud.language_v1 import enums
from google.cloud import language
from google.cloud.language import types

# Numerical and plotting helpers for the sentiment scales
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

# NLTK components for lemmatization and tokenization
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
We import the OS module because it provides a way of interacting with the operating system, which we will use to register the API key.
Since we started our journey on the Google Cloud Platform, we also need to import the related client libraries.
NumPy and Matplotlib will play a crucial role in plotting a few semantic differential scales.
Finally, we retrieve from the main NLTK library the specific components used for the stemming and lemmatization techniques.
⭐️ Learn more about that and how NLP and NLU can impact SEO.
Upload your NLP API Key
Once we have imported our modules, we first need to make the API key available to the client libraries. For this purpose, we take advantage of the OS module and point the GOOGLE_APPLICATION_CREDENTIALS environment variable at the JSON key file.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "YOUR_API_KEY.json"
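If you are running the notebook on Google Colab, the JSON key must be present in the session's working directory before the line above can point at it. A minimal, optional upload sketch (Colab only; the filename is a placeholder for your own key file):
# Colab only: upload the service account key into the session's working directory
# so that GOOGLE_APPLICATION_CREDENTIALS can find it.
from google.colab import files

uploaded = files.upload()  # select YOUR_API_KEY.json from your machine when prompted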
Text Tokenization
We finally enter the core of the NLP process as we confront the very first task: tokenization.
For this specific project, I wanted to avoid any fuss and hassle with Python. Hence, the first thing we do is simply assign our product page text to a variable.
text = "In a Fastest Thinker First Frenzy, players take turns to deal their playing cards: a combination of Letter Cards and Action Cards, down on three piles of Letter Cards. As your playing cards are laid down in turn, one by one, on the equal three piles, an Action Card will appear, leaving the two Letter Cards visible."
print(text)
To tokenize the text, we can choose between two techniques: stemming and lemmatization.
Stemming the Text
To cover both of the options, let’s first get started with the Stemming technique.
As you will notice, we need to import an additional class from NLTK called PorterStemmer. Then we populate the word_list list with every single word from the sentence we imported a moment ago.
from nltk.stem import PorterStemmer
porter = PorterStemmer()
word_list = ['In',
'a',
'Fastest',
'Thinker',
'First',
'Frenzy',
'players',
'take',
'turns',
'to',
'deal',
'their',
'playing',
'cards,',
'a',
'combination',
'of',
'Letter',
'Cards',
'and',
'Action',
'Cards,',
'down',
'on',
'three',
'piles',
'of',
'Letter',
'Cards',
'As',
'your',
'playing',
'cards',
'are',
'laid',
'down',
'in',
'turn',
'one',
'by',
'one',
'on',
'the',
'equal',
'three',
'piles',
'an',
'Action',
'Card',
'will',
'appear',
'leaving',
'the',
'two',
'Letter',
'Cards',
'visible',
]
print("{0:20}{1:20}".format("Word","Porter Stemmer"))
for word in word_list:
    print("{0:20}{1:20}".format(word, porter.stem(word)))
💡 The Porter stemming algorithm is the most popular stemming algorithm used for semantic research purposes; in fact, it has been a benchmark for stemming since its publication in 1980.
Here is what you will get:
Lemmatization of the Text
The other option you have for tokenization is to apply the lemmatization technique.
First, we install the spaCy library on the fly and download its small English model, "en_core_web_sm". Then we import spaCy and load that model, which provides the tagging needed for lemmatization.
spaCy is relatively new in the space and is billed as an industrial-strength NLP engine.
Secondly, we paste the original sentence from our copy into the sentence variable and parse it with the loaded model object, called "nlp".
Ultimately, we can extract the lemma for each token and join them back together.
import sys

# Install spaCy and download its small English model on the fly
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm

import spacy

# Load the English model used for lemmatization
nlp = spacy.load("en_core_web_sm")

sentence = "In a Fastest Thinker First Frenzy, players take turns to deal their playing cards: a combination of Letter Cards and Action Cards, down on three piles of Letter Cards. As your playing cards are laid down in turn, one by one, on the equal three piles, an Action Card will appear, leaving the two Letter Cards visible."

# Parse the sentence and join the lemma of each token back into a single string
doc = nlp(sentence)
" ".join([token.lemma_ for token in doc])
# OUTPUT
in a Fastest Thinker First Frenzy , player take turn to deal their playing card : a combination of Letter Cards and Action Cards , down on three pile of Letter Cards . as your playing card be lay down in turn , one by one , on the equal three pile , an Action Card will appear , leave the two Letter Cards visible .
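Optionally, if you want a cleaner string to feed into the entity extraction step, you could drop punctuation (and, if you like, stop words) using spaCy's token attributes. A small sketch building on the doc object above:
# Keep only the lemmas of tokens that are not punctuation.
# Add "and not token.is_stop" to the condition if you also want to drop stop words.
clean_lemmas = [token.lemma_ for token in doc if not token.is_punct]
clean_text = " ".join(clean_lemmas)
print(clean_text)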
Identify Entities from Lemmatized Text
Since I reckon lemmatization is probably the best tokenization technique for extracting entities from a text, I am going to show you how to identify entities from a copy that has been lemmatized.
It's not even that difficult, given that we now have the fully lemmatized text at our fingertips.
All we need to do is paste the reviewed sentence into the text_content variable.
text_content = "in a Fastest Thinker First Frenzy , player take turn to deal their playing card : a combination of Letter Cards and Action Cards , down on three pile of Letter Cards . as your playing card be lay down in turn , one by one , on the equal three pile , an Action Card will appear , leave the two Letter Cards visible ."
text_content = text_content[0:1000]
client = language_v1.LanguageServiceClient()
type_ = enums.Document.Type.PLAIN_TEXT
language = "en"
document = {"content": text_content, "type": type_, "language": language}
encoding_type = enums.EncodingType.UTF8
response = client.analyze_entities(document, encoding_type=encoding_type)
for entity in response.entities:
    print(u"Entity Name: {}".format(entity.name))
    print(u"Entity type: {}".format(enums.Entity.Type(entity.type).name))
    print(u"Salience score: {}".format(round(entity.salience, 3)))
    # Print any metadata attached to the entity (e.g. Knowledge Graph mid, Wikipedia URL)
    for metadata_name, metadata_value in entity.metadata.items():
        print(u"{}: {}".format(metadata_name, metadata_value))
    print('\n')
Once you execute the above lines of code, you will get something similar to the following output.
As you may note, the output includes a "salience score", a metric that measures each entity's calculated importance relative to the rest of the text.
Since this score is produced by the Natural Language API on this particular piece of text, and not by Google's ranking algorithms, make sure to take it with a grain of salt.
Although it is not the case for this project, the output may sometimes return an additional piece of metadata called a MID (machine ID). This indicates that Google has strong confidence in its understanding of the entity, as the entity likely owns an entry in the Knowledge Graph.
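As a small optional extension, you could sort the entities by salience to surface the dominant topics of the page and flag the ones carrying Knowledge Graph metadata. A sketch building on the response object above:
# Sort the entities from most to least salient
ranked = sorted(response.entities, key=lambda e: e.salience, reverse=True)

for entity in ranked:
    # Entities exposing a "mid" (or a wikipedia_url) are likely matched to the Knowledge Graph
    in_kg = "mid" in entity.metadata
    print("{:<25} salience: {:.3f}  Knowledge Graph match: {}".format(
        entity.name, entity.salience, "yes" if in_kg else "no"))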
Text-mine Sentiment Analysis with NLU
In the second part of this framework, we are going to leverage NLU to carry out a simple sentiment analysis of the submitted product page text.
First, we plot an overview of the sentiment attitude emerging from the tone adopted on the product page, by setting up the sentiment analysis environment and using NumPy and Matplotlib to plot the outcomes.
# Build the document and request a document-level sentiment analysis
document = types.Document(
    content=text_content,
    type=enums.Document.Type.PLAIN_TEXT)
sentiment = client.analyze_sentiment(document=document).document_sentiment

sscore = round(sentiment.score, 4)    # sentiment score, from -1 (negative) to 1 (positive)
smag = round(sentiment.magnitude, 4)  # sentiment magnitude, from 0 upwards

# Translate the numeric score into a human-readable label
if sscore <= -0.5:
    sent_label = "Very Negative"
elif sscore < 0:
    sent_label = "Negative"
elif sscore == 0:
    sent_label = "Neutral"
elif sscore < 0.5:
    sent_label = "Positive"
else:
    sent_label = "Very Positive"

print('Sentiment Score: {} is {}'.format(sscore, sent_label))

# Plot the score on a -1 to 1 scale: red for negative, green for positive
predictedY = [sscore]
plotcolor = 'red' if sscore < 0 else 'green'

plt.scatter(predictedY, np.zeros_like(predictedY), color=plotcolor, s=100)
plt.yticks([])
plt.subplots_adjust(top=0.9, bottom=0.8)
plt.xlim(-1, 1)
plt.xlabel('Negative                                        Positive')
plt.title("Sentiment Attitude Analysis")
plt.show()
This is what you might obtain.
Next, we narrow things down a bit and calculate the perceived amount of emotion in the text.
# Translate the magnitude into a human-readable label
if smag < 1:
    sent_m_label = "No Emotion"
elif smag < 2:
    sent_m_label = "Low Emotion"
else:
    sent_m_label = "High Emotion"

print('Sentiment Magnitude: {} is {}'.format(smag, sent_m_label))

# Plot the magnitude on a 0 to 5 scale: red for low emotion, green for high emotion
predictedY = [smag]
plotcolor = 'red' if smag < 2 else 'green'

plt.scatter(predictedY, np.zeros_like(predictedY), color=plotcolor, s=100)
plt.yticks([])
plt.subplots_adjust(top=0.9, bottom=0.8)
plt.xlim(0, 5)
plt.xlabel('Low Emotion                               High Emotion')
plt.title("Sentiment Magnitude Analysis")
plt.show()
As a bonus, we can also try to predict a suitable categorization for our product page based on the sentiment emerging from the copy.
The estimation comes fully equipped with a confidence level, which hints at how reliable the outcome is.
# Ask the API to classify the document into content categories
response = client.classify_text(document)

for category in response.categories:
    print(u"Category name: {}".format(category.name))
    print(u"Confidence: {}%".format(int(round(category.confidence, 3) * 100)))
Even though the outcome does not return a statistically robust representation, the suggested category for our Noggin board game product page seems to be "/Adults".
Conclusion
Entity research is not as straightforward as keyword research, whether manual or automated. As long as Google keeps training its search algorithms, SEOs need to keep up with the changes and possibly adopt new methods, not only to double down on semantic optimization but also to improve time management.
Further Readings
This post was directly inspired by the comprehensive work on entities and sentiment analysis conducted by Greg Bernhardt. Please check out his post for further reference:
Getting started with Google NLP API using Python