Unveiling a Data-Informed SEO Strategy with Python and NLP

February 17, 2024

As NLP and machine empower search engine algorithms, SEOs are recommended to learn a way to gain access to them to improve entity relationships.

As I covered in my post inspecting the evolution of Google Search through the algorithms, all of this comes as no surprise. Yet, the industry appears to lag behind demystified old myths revolving around the black magic of vanity metrics (search volume, DA and LSI ghosts 👻) or nuking keyword exact-match targeting.

Today we shall learn to master a new language to do real SEO. Given the algorithmic landscape featuring the state of the Search, the new idiom appears strictly tied to Data Science and Statistics. In my humble opinion, regardless you work on the tech side or on the content side of SEO you should get a grasp on the unlimited impact that Data Science is holding on this industry.

Because yes, it’s gonna get worse.

In this post, I am going to reveal a comprehensive Python model designed to provide a solid data-informed method to kick off an outside-in strategy chasing the blueprint of a Blue Ocean strategy drilled on a semantic dissection of entities.

From a comprehensive overview of trends for selected entities, Python will help us draw an Exploratory Data Analysis (EDA) with semantic keyword research.

🎖Honourable Mentions

This post was conceived from a set of outstanding workaround Data Science and machine learning applied to SEO. Let’s give a shout-out to the following amazing guys and their resources.

🏅Ultimate Guide to Pytrends by Lazarina Stoy
🏅Clustering Product Names with Python by Lily Wu
🏅 Query count, N-Grams and Cluster Analysis for SEO (adapted) by big man Marco Giordano
🏅How to identify SEO keywords using Google Autocomplete by Matt Clarke

How to Fit this post in your Marketing Strategy

This post requires a bit of a necessary premise so you understand what the following Python enshrines within your overall marketing strategy.

There are two main approaches when it comes to strategy:

Inside-out: when a company looks at its internal asset and resources to explore the extent of marketing a brand-new product or taking different market-oriented approaches. In plain English, it’s all about understanding what’s going on inside an organisation and considering internal resources in developing a strategy.
Outside-in: when a company looks at the market needs in the first place to absorb valuable insights into the firm and explore the opportunity to market a new product or take advantageous initiatives to maximize profits. This approach revolves around an active orientation thinking consumers-first as it involves believing the company should adapt its business’s resources to the market needs.

If we look at the arrow as a continuum, we notice that by adopting only first an inside-out approach we could face some problems once the market target change, which potentially makes it slow to meet customer needs over the longer term.

Conversely, on the other side of the continuum, the marketing orientation literature suggests that we need to be more open to change and pioneer new ideas and concepts.

The pursuit of differentiation and low-cost opportunities to break into a new market space and create new demand is the pure belief behind the bespoke Blue Ocean strategy.

The Outside-In approach is the strategic tool that will help you pursue the Blue Ocean strategy. Capturing uncontested market space to chip away at the existing competition is the inner objective of this post.

Requirements & Assumptions

Before letting the ball rolling, there are a few technical requirements and advisories you should be aware of.

Upload a Kewyord.xlsx file with a list of terms that you want to investigate. The terms should be listed in the first column of the Excel
An open account on Serp API. A free plan allows you to play around with up to 100 searches per month, but you can upgrade whenever you want.
A GPU connection. I highly recommend tuning in with a Google Colab new notebook as you can rely on the browser bandwidth to process a massive amount of scripts (sorry, it’s quite a massive work).
Also, Google Colab has a low dependency on external libraries which will make it easier for us to import what we need straight away.
Although no in-depth Data Analysis expertise or coding experience is required, it is desirable at least an overall comprehension of data management and manipulation (e.g Pandas)

💡 DISCLAIMER

Before diving in, I highly recommend checking out the ultimate guide to NLP and use cases for SEO. This will help you gain a solid threshold of expertise before venturing out through the data pre-processing stage of this post.

Prepare environment

First and foremost, it’s really important to enable the GPU runtime in Colab to experience decent speed in this tutorial.

Once you click on “Runtime“, head on “Change Runtime type”, select “Hardware accelerator “ and set it on GPU

Install Dependencies

As I mentioned, using Google Colab will enable us to install a limited number of external libraries. Let’s have a look at what they are

Google Search Results: this package is meant to scrape and parse Google, Google Scholar, Bing, Baidu, Yandex, Yahoo, Ebay results using SerpApi.
Pytrends: the unofficial API for Google Trends that allows a simple interface for automating downloading of reports from Google Trends.
requests_html: a library that intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. There’s a nice tutorial covering the ins and outs of Python requests from JC Chouinard. I suggest you give it a look.

!pip install google-search-results
!pip install pytrends
!pip install requests_html

Import Packages

Next up, we’re going to import the required packages into our environment.

To put it more into context, the Python packages to call up can be grouped together by their main functionality.

Functionality	Packages
Scraping	– requests – urllib
Data Manipulation	– Pandas – Numpy
Data Visualization	– Seaborn – Matplotlib

#scraping
import requests
from serpapi import GoogleSearch
import urllib
import urllib.parse
import json
from urllib.parse import (parse_qsl, urlsplit)
from requests_html import HTML
from requests_html import HTMLSession

#data manipulation
import pandas as pd
import numpy as np

#data viz
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
large = 22; med = 16; small = 12
params = {'axes.titlesize': large,
          'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'axes.titlesize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
%matplotlib inline

Use Pytrends to find Interest Over Time

Once the machine learning environment is set up, we can run a preparatory market analysis of our choice by consulting Google Trends.

For the purpose of this tutorial, I decided to develop my research around the hot drinks market sector in the UK.

To kickstart, we need to import the bespoke Pytrends library and provide a few initial directives.

from pytrends.request import TrendReq
pytrends = TrendReq(hl='en-GB', tz=360, timeout=(5,15))

Attribute	Description
hl	Geolocation and Language
tz	Timezone Offset (Google uses 360 for the US by default)
timeout	timeout time (connection, read)

Next, we upload the keyword list on our XLSX spreadsheet containing a few terms aligning with our market selection that we want to investigate

colnames = ["keywords"]
df = pd.read_excel("/content/Keyword.xlsx", names=colnames)
df2 = df["keywords"].values.tolist()

Hence, we set up an empty data frame and loop for Pytrends, but first we need to build the payload.

Pytrends payload can be configured as follows:

PyTrends Attribute	Description
kw_list	Up to 5 keywords are allowed into the ordinary attribute, but not here as we loop for the uploaded XLSX
cat=0	Category to narrow results. Default is 0 but you can check the wiki page with all the categories
timeframe	Date to start from
geo	Two letter country abbreviation (e.g USA = US; UK = GB-ENG)

You can optionally include gprop that refers to what Google property to filter to.

Gprop could be, for example, images, news, youtube or froogle (for Google Shopping results)

You can find more on the pytrends official guidance


dataset = []

for x in range(0,len(df2)):
     keywords = [df2[x]]
     pytrends.build_payload(
     kw_list=keywords,
     cat=0,
     timeframe='2022-06-26 2022-09-23',
     geo='GB-ENG')
     data = pytrends.interest_over_time()
     if not data.empty:
          data = data.drop(labels=['isPartial'],axis='columns')
          dataset.append(data)

Next, we merge the results into the new Pandas dataframe created before building the payload and download the output as CSV.

result = pd.concat(dataset, axis=1)
result.to_csv('search_trends.csv')
result.head()

We can plot the results with a time series to get the full picture from this Pandas data frame

df = pd.read_csv(r'/content/search_trends.csv') 
df.plot(xlabel="Time", ylabel="Trends",figsize=(25,10), kind="line")

Bear in mind, though, there are multiple plotting methods you can leverage from dedicated Python libraries, such as Matplotlib or Seaborn.

Have a look at the top 50 matplotlib graphs you can build with Python and feel free to improve this time series for the better

Kick-Off Entity Research

Despite a recent descending trend, “Coffee” has been the trending entity in the beverage sector in the UK for the last 8 months.

Hence, I’d pick up this entity and start to investigate the spectrum of potential entity connections

To do so, we need to expand on:

Related Entities according to Google

Related Queries from Google Autosuggest

Related Entities surfacing Google

Here’s another quick preparatory analysis that will bring us inside the entity object of the research.

To explore early connections to “Coffee” as Google’s entity, we can still leverage the Pytrends library by calling up the pytrends.suggestions function.

After creating a Pandas data frame to store the related entities, we’re going to drop the unnecessary “mid” column that will be auto-generated by pytrends.

keywords = pytrends.suggestions('coffee')

df = pd.DataFrame(keywords)

cols = ['mid', 'Entity', 'Meaning']

df.columns = cols

df = df.drop("mid",axis=1)

df

⚠️WARNING⚠️

Despite we set a timeout for both connecting and reading Google Trends, your scraping request could still be blocked and return a 4xx error. You can wriggle out these events by either using a proxy in the payload or just delete and restart the runtime of your Google Colab.

Bear in mind that the names I attributed to the columns on this data frame are not casual.

“Coffee” is a physical and tangible sign, which encapsulates a large spectrum of meanings depending on cultural traits and psychographic features.

From a semiotic perspective, though, the main meaning of “Coffee” ties primarily with “Drink” . This should be indeed the main association that comes to mind to pretty much all of us.

💡Entity is a sign that depends on a spectrum of inner meanings and relates to a hundred others

Related Queries from Google Autosuggest

Having cleared the nuances around our primary target entity, it’s time to dig deeper and consult Google Autosuggest to explore the rest of the descending affinities of our target entity.

To do so, we need to establish connections with Google Autosuggest via requests and urllib_parse

def get_source(url):

    try:
        session = HTMLSession()
        response = session.get(url)
        return response
    except requests.exceptions.RequestException as e:
        print(e)

def get_results(query):
    query = urllib.parse.quote_plus(query)
    response = get_source("https://suggestqueries.google.com/complete/search?output=chrome&hl=en&q=" + query)
    results = json.loads(response.text)
    return results

Next, we type in our search term

lang_code = "en-gb" 
search_term = "coffee " 
results = get_results(search_term)
results

And we process a bit of results formatting.

def format_results(results):
    suggestions = []
    for index, value in enumerate(results[1]):
        suggestion = {'term': value, 'relevance': results[4]['google:suggestrelevance'][index]}
        suggestions.append(suggestion)
    return suggestions

formatted_results = format_results(results)
formatted_results

💡“Relevance” refers to an automated estimation based on the frequency that specific term is subjected to in the search results page

Right, now we’ve got our large set of connections to “Coffee”. Yet we should probably craft our findings based on our use cases.

Let’s start by spicing up the list with a few suffixes

def get_expanded_term_suffixes():
    expanded_term_suffixes = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm','n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
    return expanded_term_suffixes

So we can add a Keyword Modifier to provide a search intent angle to our list of queries

def get_expanded_term_prefixes():
    expanded_term_prefixes = ['what *', 'where *', 'how to *', 'why *', 'buy*', 'how much*','best *', 'worse *', 'rent*', 'sale*', 'offer*','vs*','or*'
                             ]
    return expanded_term_prefixes

Here’s a refresh of which modifiers can tweak your content angle to shape the search intent:

Next, we expand the range of searches on Google Autosuggest to make sure we can collect as many results as possible

def get_expanded_terms(query):

    expanded_term_prefixes = get_expanded_term_prefixes()
    expanded_term_suffixes = get_expanded_term_suffixes()   

    terms = []
    terms.append(query)

    for term in expanded_term_prefixes:
        terms.append(term + ' ' + query)

    for term in expanded_term_suffixes:
        terms.append(query + ' ' + term)

    return terms

#Look for further suggestions

def get_expanded_suggestions(query):

    all_results = []

    expanded_terms = get_expanded_terms(query)
    for term in expanded_terms:
        results = get_results(term)
        results = format_results(results)
        all_results = all_results + results
        all_results = sorted(all_results, key=lambda k: k['relevance'], reverse=True)
    return all_results

After a bit of data cleaning, we can download the output with the results.

expanded_results = get_expanded_suggestions(search_term)
expanded_results_df = pd.DataFrame(expanded_results)
expanded_results_df.columns = ['Keywords', 'Relevance']
expanded_results_df.to_csv('results.csv')
pd.set_option('display.max_rows',expanded_results_df.shape[0]+1)
expanded_results_df.drop_duplicates('Keywords')
expanded_results_df

If you want to be more granular, you can check those queries that are below the 80th quantile of the query dataset for Relevance.

In layman’s terms, we are filtering out all the fat-head entities to focus on more granular queries deemed as less relevant than the top 20% of the dataset but likely more approachable in your strategy.

top_relevance = expanded_results_df[expanded_results_df['Relevance'] <= expanded_results_df['Relevance'].quantile(0.8)]
top_relevance.to_csv('top relevance.csv')
top_relevance

Looking at the array, “what coffee is good for weight loss” sounds like a captivating search query disclosing a plethora of topics.

I note down this query as it will turn out useful later on.

Keyword Clusterization

Once we gain a full list of related entities from Google Autosuggest, we’d like to learn how they connect to each other.

The most probabilistic method is to classify samples of similar entities into different groups so that we can gain groups with similar occurrences and concurrently set apart those showcasing discrepancies

💡Cluster analysis is a multivariate method which aims to classify a sample of subjects (or ob- jects) on the basis of a set of measured variables into a number of different groups such that similar subjects are placed in the same group.
Stats Tutor

The method will consider a number of steps:

Preprocessing. Cleaning and rinsing up the text of our related entities into uncluttered words so that we can turn them into numerical data.
Vectorisation. From words to numbers, this is the crucial step for data conversion to set up our next machine learning environment.
Applying K-means clustering. Providing a few inputs to the machine so that it’ll work on its own through this unsupervised machine learning algorithm. This will group query names with similar words.
Assessing cluster quality through cluster labelling and visualisation.

Now that we have a roadmap to clusterization, let’s install the required libraries and import the necessary dependencies.

Library	Description
Fuzzywuzzy	For sting-matching purposes, this library uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
Plotly	This is a high-level, declarative charting library that enables to plot highly visual plots
Gensim	This is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

!pip install fuzzywuzzy
!pip install plotly
!pip install gensim

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import plotly.express as px

#Libraries for preprocessing
from gensim.parsing.preprocessing import remove_stopwords
import string
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

#Download once if using NLTK for preprocessing
import nltk
nltk.download('punkt')

#Libraries for vectorisation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV
from fuzzywuzzy import fuzz

#Libraries for clustering
from sklearn.cluster import KMeans

#Open the Output
df = pd.read_csv('results.csv')
text1 = df['Keywords']

Preprocessing

Within the Preprocessing stage of the analysis, also known as Tokenization, we remove unnecessary characters and words from the query text to only keep the meaningful values.

There are several Tokenization techniques and their selection varies on the use case.

For this tutorial, we are going to leverage some of the most popular:

Removing stopwords, punctuation and numbers. Stopwords are common words in a language like ‘the’, ‘a’, ‘is’, ‘and’.
Stemming words. Just like in the event of any basic entity and sentiment analysis in NLP, stemming is a tokenization technique that consists in shortening words to their root form.

There are several types of stemmers and the Porter’s is by far the most popular as it is available in the python NLTK module and it is also language-independent.

However, the Snowball stemmer works language-specific and might return better results.

#Remove stopwords, punctuation and numbers
text2 = [remove_stopwords(x)\
        .translate(str.maketrans('','',string.punctuation))\
        .translate(str.maketrans('','',string.digits))\
        for x in text1]

#Stem and make lower case
def stemSentence(sentence):
    stemmer = SnowballStemmer('english')
    token_words = word_tokenize(sentence)
    stem_sentence = [stemmer.stem(word) for word in token_words]
    return ' '.join(stem_sentence)

text3 = pd.Series([stemSentence(x) for x in text2])

Vectorizing Queries

Vectorisation is the process of turning cleaned text into numerical data so that we can perform statistical analysis on it. Again, there are plenty of techniques to choose from.

Bag of words: a basic model that counts the occurrences of words in a document, but still this is a very basic count of words that fails to properly extract the underlying meaning
TF-IDF (word level): this measures the frequency of a word in a document and compares it to the frequencies of all words in the text to assign it a weighted score of importance.
TF-IDF (n-grams level): this measures the frequency of sequences of words. In this scenario, we are going to use the ngram_range parameter to focus solely on n-grams between 1 (individual words) and 2 (sequences of 2 words or bi-grams).

#Bag of words
vectorizer_cv = CountVectorizer(analyzer='word')
X_cv = vectorizer_cv.fit_transform(text3)

#TF-IDF (word level)
vectorizer_wtf = TfidfVectorizer(analyzer='word')
X_wtf = vectorizer_wtf.fit_transform(text3)

#TF-IDF (n-gram level)
vectorizer_ntf = TfidfVectorizer(analyzer='word',ngram_range=(1,2))
X_ntf = vectorizer_ntf.fit_transform(text3)

🙏🏻 Shout out for this to Lily Wu and her Clustering Product Names with Python which I recommend you check out.

But the proper “coup de theatre” of vectorisation comes with the application of the LDA algorithm.

Vectorisation with Topic Modelling

The Latent Dirichlet Allocation (LDA) is a probabilistic unsupervised learning model typically used as a threshold for Topic Modelling tasks. The idea behind the algorithm is to parse multiple words in a document to probabilistically estimate the topics. Essentially, the algorithm assumes each text corpora is made up of several topics and similar topics use similar words.

Here is the result for 30 topics and the top 5 words per topic extracted from the original dataset of scraped queries.

lda = LatentDirichletAllocation(n_components=30, learning_decay=0.9)
X_lda = lda.fit(X_cv)

def plot_top_words(model, feature_names, n_top_words, title):
    fig, axes = plt.subplots(6, 5, figsize=(30, 30), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[:-n_top_words - 1:-1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f'Topic {topic_idx +1}',
                     fontdict={'fontsize': 30})
        ax.invert_yaxis()
        ax.tick_params(axis='both', which='major', labelsize=20)
        for i in 'top right left'.split():
            ax.spines[i].set_visible(False)
        fig.suptitle(title, fontsize=40)
    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()
    
#Show topics
n_top_words = 5
feature_names = vectorizer_cv.get_feature_names()
plot_top_words(X_lda, feature_names, n_top_words, '')

(Please note that the screenshot will show only 15 topics due to length reasons).

Vectorisation with Topic Similarity

The last stage of vectorisation is to calculate a similarity score between two strings. Fuzzywuzzy applies the fuzzy matching calculation of similarity between strings worded slightly differently and returns a similarity score.

X_fuzz = pd.crosstab([text3.index,text3],text3).apply(lambda col: [fuzz.token_sort_ratio(col.name, x) 
                                                                   for x in col.index.get_level_values(1)])
X_fuzz.head()

Here is a matrix showcasing each coffee query and its similarity score with every other coffee word. Please note that the similarity score ranges from 0-100.

Cluster Analysis with K-means

About time to get deep to the core of the cluster analysis.

What we have to do is to specify the number of clusters and the algorithm will assign each observation to a cluster until each cluster’s observations are as close as possible to its mean (centroid)

To define such a number, we assign K-means (K) to each occurrence

💡K-means is a machine learning algorithm typically employed in unsupervised learning for clustering problems.

K-Means Clustering: How Does it Work?

K-means is an unsupervised machine learning model, hence it requires a few manual prompts to get started.

Just for your information, here’s the roadmap implicated in the K-Means process.

Prompt	Description
1	Select the number of clusters you want to identify
2	The Kmeans algorithm will select a number of random data points (centroids)that match the number of clusters.
3	The algorithm calculates the distances between each point(centroids) to each cluster point
4	Within each cluster, the algorithm starts to identify the closest data points in its range (internal clustering)
5	Calculate the mean of each cluster and assign it to the new cluster centroid
6	Redefine clusters, thereby assigning a new mean as the next cluster centroid
7	Rinse and repeat the process as long as you identify clusters that minimize errors (SSE)

💡SSE [Distortion] refers to the distance of all points within a cluster from the centroid of other points. In other words, SSE stands for the difference between the observed value and the predicted value
GeeksForGeeks.com

The Elbow Method

Some of you may claim that you can’t improvise to draw a feasible number of clusters to assign to the K-means. What is Data Science all for by the way?

There is, in fact, a method that will help us select the optimal number of clusters according to the number of occurrences in our entity dataset.

The Elbow method defines a range of K values to run k-means clustering on and evaluates the SSE (Distortion) for the model using each of the defined numbers of clusters.

The Elbow Method, as the name suggests, will provide a breaking scree plot. The data point where the plot cracks represent the number of clusters that you’d be better picking up

The best data source to apply the Elbow method is probably our bespoke Bag of Words as it encloses a count of word occurrences

But how many words did we find within the Related Entity dataset?

Let’s ask the Bag of Words

text3.count()

Once we know how many samples we picked up, we can finally apply the Elbow Method by assigning it to n_sample= the number generated from the Bag of words

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

#setting the model
features, labels = make_blobs(
    n_samples=320,
    centers=15
)
 
# Instanciate the model by assigning the K-means a hypothetical number of clusters
# and 10 iterations with different centroid seed 
model = KMeans(
    n_clusters=10,
    n_init=10,
    random_state=42
    )
 
# train the model
model.fit(features)
 
# make a prediction on the data
p_labels = model.predict(features)

Next up, we visualize the Scree Plot.

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

ks = range(1,320)

ssr = []
# For each cluster K
for k in ks:
    # create model instance of K clusters
    model = KMeans(n_clusters=k, max_iter=1000).fit(X_cv)
 
 
    # append the inertial to a list
    ssr.append(model.inertia_)
     
# Plot
plt.plot(ks, ssr, '-o')
plt.xlabel('Clusters (k)')
plt.ylabel('SSR')
plt.xticks(ks)
plt.title('Elbow method for optimal K')
plt.axvline(x=5,linestyle='--',c='grey')
plt.show()

⚠️WARNING⚠️
Please note that the more clusters you submit to the check of the Elbow Method the longer it’s going to take to the script to parse the requests. So, grab a coffee in the meantime and have a little patience 😄

The scree plot looks to break at 300. However, we can assume that the real Elbow occurs at around 100, as this is the closest threshold where SSE starts to diminish.

Cluster Generation

Next up, we start generating the clusters by noting down the real number of groups we want to aim for. Then we concatenate the findings from our Bag of Words into a Pandas data frame.

kmeans = KMeans(n_clusters=100)
kmeans.fit(X_cv)
result = pd.concat([text1,pd.DataFrame(X_cv.toarray(),columns=vectorizer_cv.get_feature_names())],axis=1)
result['cluster'] = kmeans.predict(X_cv)
result[['Keywords','cluster']]

💡BONUS 💡
When you make a cluster analysis, it is imperative to compare matrices that have mutually exclusive data points. In other words, you can use SSE to compute a clustering task only on one single type of matrice due to their data format.

For example, we used the bag of words which uses integers (0, 1, 2, etc.) to represent the occurrence of a word. Alternatively, we could have used the TF-IDF (as words or n-grams) using float (0.1,0.2,0.3) but we would have never mixed it up with any other matrices using different data values.

Now we can give a name to each cluster to improve readability.

To do so, we’re going to replace the numerical values with the related categorical value

clusters = result['cluster'].unique()
labels = []
for i in range(len(clusters)):
    subset = result[result['cluster'] == clusters[i]]
    words = ' '.join([x for x in np.where(subset.all()!=0,subset.columns,None) if x and x!='Keywords' and x!='cluster' and len(x.split()) == 1])
    labels.append(words)
labels_table = pd.DataFrame(zip(clusters,labels),columns=['cluster','label'])
result_labelled = pd.merge(result,labels_table,on='cluster',how='left')

Next, we’re going to provide a numerical label for our new clusters so that we can be able to plot them accordingly

pd.pivot_table(result_labelled,index=['label'],values=['Keywords'],aggfunc='count').sort_values(['Keywords'], ascending=False)

The results show that at K=104 with the bag of words matrix, 104 out of 320 query names weren’t labelled, and 52 queries are in clusters of 1.

As a result, overall 33% of the query names were grouped into a clean cluster.

Should we be satisfied? Not really.

This is genuinely what Google Autosuggest provided to us by default, meaning there is little to no room for cluster improvement. Nevertheless, you could well toy around with the number of clusters to see if you can minimize the outliers and aim to cluster as many queries as possible.

As far as I’m concerned, I’ve been trying with 150, 160, 80, 90 and even 60 to strike the right balance. Though, all I ever found was either too many query names left unlabelled (excluded from the clusterization) or too many unique clusters.

None of the solutions would have helped, hence I agreed to make a compromise towards leaving some query names out of the process and collecting a few unique clusters.

Plot the Clusters on a Treemap

Let’s get some peace of mind now and plot the clusters onto a neat visual treemap.

A treemap visualization created with Plotly displays undoubtedly the clearest picture of the entire cluster analysis of our query dataset.

result_summary = pd.pivot_table(result_labelled,index=['label'],values=['Keywords'],aggfunc='count').reset_index().rename(columns={'Keywords':'count'})
result_treemap = result_summary[(result_summary['label'] != '') & (result_summary['count'] > 1)]
fig = px.treemap(result_treemap,path=['label'],values='count')
fig.show();

What we can learn is that Google seems prone to suggest “Coffee” search queries that align with:

Local search (“Coffee near”, “coffee shop”)
Discovery prompt (“bean coffee”, “coffee machine”)
Commercial Investigation (“best coffee”, “coffee vs”,”coffee worst”)

N-Grams Analysis on Top Queries

If you want to push your Related Entities analysis down the root of semantics, we can start digging into NLP and see how it applies to our research.

Natural Language Processing (NLP) is one of the most interesting subsets of Machine Learning that should lure any SEO practitioner.

These sorts of tasks could apply in case you want to explore a dataset imbued with categorical data to serve either Entity analysis , Sentiment analysis or even competitor analysis.

However, the primary convenient task that comes out in style is n-gram analysis. In layman’s terms, what we are going to investigate now is what are the most common sequence of words within our bespoke list of Related Entities collected from Google Autosuggest.

Let’s start with importing the NLTK library to set up the NLP text analysis

import nltk
textlist = expanded_results_df['Keywords'].to_list()
pd.Series(textlist).value_counts().head(10)

Next up, we intercept and remove stopwords to rinse up our text analysis

nltk.download('stopwords')
from nltk.corpus import stopwords
stoplist = stopwords.words('english')

Hence, import Counter to count related queries

from collections import Counter
x = Counter(textlist)
x.most_common(10)

And create a data frame including a list of bigrams and trigrams.

#create dataframe with bigrams and trigrams
from sklearn.feature_extraction.text import CountVectorizer
c_vec = CountVectorizer(stop_words=stoplist, ngram_range=(2,3)) #can also select bigrams only

# matrix of ngrams
ngrams = c_vec.fit_transform(expanded_results_df['Keywords'])

# count frequency of ngrams
count_values = ngrams.toarray().sum(axis=0)

# list of ngrams
vocab = c_vec.vocabulary_
df_ngram = pd.DataFrame(sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True)
            ).rename(columns={0: 'frequency', 1:'bigram/trigram'})

Print the most frequent bigrams/trigrams for every related query with the frequency they occur on the SERP for “Coffee”

df_ngram.head(20).style.background_gradient()

Big shout out on this to Marco Giordano and his Query count, N-Grams and Cluster Analysis for SEO which I adapted to this research.

SERP Analysis for a Specific Query

Remember that nice query connecting coffee to weight loss?

It’s about time to use its potential for a proper SERP analysis.

In the last part of this tutorial, we assign the search query “what coffee is good for weight loss” to an API that will help us extract fleshy insights from the SERP.

SERP Scraping: Snippets and About This Result

What can you scrape by using the Serp Api applied to the search results page?

The good news is that you can extract plenty of hidden SERP features that normally other APIs or scraping tools don’t explore.

Here’s a quick peek at the low-hanging fruits achievable from the next scraping analysis:

Feature	Description
Position	The ranking position of a URL in a given SERP
Title	A page’s title link as it appears on the SERP
Link	The webpage URL
Breadcrumbs	The path where users will find the exact result
Date	Just the last modified date of a webpage
Snippet	Any rich result that stands out in plain English (*If Any)
Highlighted Words in Snippet	Any bolded word from the snippet (*If Any)
About this Result	A semantic prompt of the source of information according to Google
About page	A link to expand on the About This Result feature for the specific URL
Cached page	A cached version of a URL
Related pages	A link to a similar SERP to find similar information

First thing, we need to import the Google Search library from Serp Api and then upload the API key.

Bear in mind that you have to sign up and open a free account to get access to the API.

Next, we build the payload with the following parameters and enclose the results into a Pandas data frame.

engine: specify the search engine you’ll be scraping from
query: this is the query you want to search for
location: the country you want to communicate the search is coming from
lang: the language the query search is in
num: how many results you want to be returned. 8 usually is the first page, 16 is two pages, etc, etc
google_domain: the country-specific domain you want to search from (ex. google.com or google.fr). This usually aligns with some of the parameters above.
gl: your target market where you can specify the country location further
hl: the language of the target market, AKA the SERP

You can check the full Google Search API documentation for additional parameters.


from serpapi import GoogleSearch

serp_apikey = "#######" 

params = {
    "engine": "google",
    "q": "what coffee is good for weight loss",
    "location": "United Kingdom",
    "google_domain": "google.com",
    "gl": "uk",
    "hl": "en",
    "num": 10,
    "api_key": serp_apikey
}

client = GoogleSearch(params)
data = client.get_dict()

# access "organic results"
df = pd.DataFrame(data['organic_results'])
df.to_csv('data.csv', index=False)
df

Here’s the result

Serp Api offers a large bag of credits at an advantageous price, so you shouldn’t worry too much about running this script. However, if you want to avoid subscribing for credits you can make sure that your SERP data frame is effectively downloaded by adding this tiny line:

! ls

Now, let’s process some data cleaning on the SERP data frame, as you will notice it’s all bloated with special characters that make it hard to get through the findings.

#remove special characters from values
Data = pd.read_csv('data.csv')
Data['snippet_highlighted_words'] =  Data['snippet_highlighted_words'].str.replace("\[|\"|\]", "")
Data['about_this_result'] = Data['about_this_result'].str.replace("{'source': {'description':", "")
Data['about_this_result'] = Data['about_this_result'].str.replace("https:\/\/serpapi.com\/searches\/6339f21a2c68788a45d8c065\/images\/\w*.png'}}", "")

#remove the first column that is mistakenly iterated win the process
Data = Data.iloc[: , 1:]

#fill NaN value 
Data = Data.fillna(0)
Data.isnull().sum()
Data.head()

Next, we can rename the columns to improve readability. Bear in mind you can’t append the following lines of code to the previous ones, so make sure to proceed in steps.

cols = ['Position','Title','Link', 'Breadcrumbs', 'Date', 'Snippet', 'Highlighted Words in Snippet', 'About this Result','About page', 'Cached page', 'Related pages']
Data.columns = cols
Data.head()

Et voilà.

You can now download the final masterpiece

Data.to_csv('SERP.csv', index=False)

Conclusion

By all means, this has been the hardest and longest Python framework I’ve put together. Although you can argue the model is not a hundred per cent probabilistic and may contain pitfalls, still it provides a strategic baseline to start off with entity research within the semantic realm.

I decided to put several data science bits together to provide solid roots to the project, but I couldn’t have done it without the fundamental expertise of the people I mentioned throughout this article. A special thanks to them for being such an inspiration source 🤩

To conclude, just bear in mind I am still a budding Pythonista and I am open to receiving critiques about how this model can be improved.

Cheers!

Simone De Palma

Technical SEO Executive

Simone De Palma is a Technical SEO Executive at iProspect UK and the founder of SEO Depths.

He graduated in Marketing and Management from Università IULM before completing a degree in Digital Marketing and Data Science at Leeds Beckett University.
Simone has worked as an SEO Specialist in digital agencies in Italy and the United Kingdom and he’s a contributor for the Search Engine Land.

When he’s away from his double screens, he enjoys cooling down with a refreshing swim at the pool. You could find him exploring art museums or enjoying the company of a classic romance.