Build your SERP Similarity Tool with Python

Reading time: 12 Minutes

Search Intent can be a bit of a beast to identify when it comes to search queries sharing the same semantic meaning.

Overlooking untapped search intent nuances can spark an unpleasant snowball effect on your SEO strategy, which in turn could affect your ROI.

The good news is that today there are plenty of automated tools out there that can take much of this grunt work off your hands. For instance, Keyword Insights recently released a juicy Similarity Tool for benchmarking subtle intent differences, which was welcomed with remarkable warmth by the SEO community.

I'm not gonna lie, this is a useful and cool tool in the realm of machine learning applied to SEO. Personally, what struck me the most was the algorithmic foundation behind the release. Hence, I grew interested in exploring whether I could recreate the brand-new masterpiece with a few Python libraries, and it looks like there is plenty of potential.

In this post, you will learn how to build a handmade Similarity Tool aimed at helping you better inform a market-oriented SEO strategy by benchmarking SERP parameters and identifying potential search intent discrepancies.

In other words, this post will help you to:

✅ Compare two SERPs concurrently
✅ Benchmark URLs, snippets, highlighted words in a snippet and more
✅ Analyze n-grams (strings only)

Requirements and Assumptions

Let's go through a few necessary requirements before kicking off.

  • Serp API: The only practical requirement for running this framework is an account on Serp API, which we use to retrieve a plethora of organic features from the SERPs. You can sign up for free and get an account in seconds, and you will be granted free starting credits to play around with the API.
  • Data analysis (beginner level): Because the framework relies heavily on Pandas, it is recommended that you are comfortable with the basics of this data analysis package. However, it's not mandatory, as this post is designed to walk you through each coding step.
  • Google Colab: Although you could use a Jupyter notebook, I'd still suggest jumping on Colab, as it provides plenty of ready-to-run libraries and it's the fastest option for executing basic scripts.

āš ļøWARNINGāš ļø
Despite looking quite lengthy, the framework is actually doubled to give users the chance to flex the code for each SERP. It is recommended to take a moment to read the instructions before running every scripts to avoid unpleasant bugs.

Honourable Mentions

The birth of this framework is largely due to the following sources of inspiration:

🎖 Redirect Matching Automation by Alex Romero Lopez

🎖 Query Analysis (adapted Google Colab) by Marco Giordano

🎖 Machine Learning Use Cases for Tech SEO, AKA Patrick Stox's impressive presentation at Brighton SEO

Install and Import Dependencies

To get the ball rolling with the implementation of our similarity tool, we need to install and import some dependencies.

Essentially, the external packages we install give a fair overview of what this project is going to cover.

• Polyfuzz: performs fuzzy string matching and string grouping, and ships with extensive evaluation functions. We'll use it to compute the similarity scores at the end of the framework.
• google-search-results: the official Serp API library for scraping Google Search results. We'll use this real-time API to retrieve a number of SERP features, like featured snippets.
• Plotly: a powerful library for building outstanding data visualization charts. We'll use it to plot a similarity score graph.
• gensim: a Python library for topic modelling, document indexing and similarity retrieval with large corpora. We'll use it to preprocess (tokenize) words from string-based corpora for the n-gram analysis.

Next up, we need to bring a few related packages into the environment.

You can refer to the comments after the hash (#) in the following lines of code.

!pip install polyfuzz
!pip install google-search-results
!pip install plotly
!pip install gensim

#Libraries for data manipulation
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import plotly.express as px

#Libraries for preprocessing/tokenization
from gensim.parsing.preprocessing import remove_stopwords
import string
from nltk.stem.snowball import SnowballStemmer
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

#Libraries for vectorisation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

#Libraries for similarity clustering
from sklearn.cluster import KMeans
from polyfuzz.models import TFIDF
from polyfuzz import PolyFuzz

Run the First SERP Analysis

It's about time to get started with the model.

First and foremost, we need to identify the object of our benchmark analysis.

If you work with content, it shouldn't be too hard for you to find similar queries whose subtle intent differences cross over multiple SERPs.

However, for the purposes of this tutorial, I'm going to use a research question drawn from a personal living experience.

As an Italian SEO living in the heart of England, I've always struggled to tell the difference between "Latte" and "Cappuccino".

This is a typical case of linguistic overlap, as both terms share the same signified but go by different signifiers.

In other words, both terms identify the same item (with tiny differences, c'mon 🤣) while being recognized under different expressions.


💡 Signified and signifier are the two main components of a sign: the signified pertains to the "plane of content", while the signifier is the "plane of expression".


If you're targeting "Latte" as your primary entity, will you end up competing against the websites ranking on the "Cappuccino" search results page in the UK?

The only way to find out is to compare both SERPs and pick up a few organic features to use as our benchmarking tools.

Moving on to the technical side of the methodology, we need to build a payload that sends our search query to the Serp API (authenticated with our API key), which in turn will fetch the first 10 search results.

There are a number of parameters that you can include, and this is especially convenient when it comes to kickstarting a comprehensive semantic market audit.

You can refer to the official Google Search Results API documentation from Serp Api.

from serpapi import GoogleSearch

serp_apikey = "#####" 

params = {
    "engine": "google",
    "q": "latte",
    "location": "United Kingdom",
    "google_domain": "google.com",
    "gl": "uk",
    "hl": "en",
    "num": 10,
    "api_key": serp_apikey
}

client = GoogleSearch(params)
data = client.get_dict()

# access "organic results"
df = pd.DataFrame(data['organic_results'])
df.to_csv('results_1.csv', index=False)
df

In layman's terms, all you are required to do is paste in your API key (serp_apikey) and include the search query in the payload (the "q" parameter).
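If you plan to compare several queries or markets, it can help to wrap the payload in a small helper. The function below is a hypothetical convenience wrapper (fetch_serp is not part of the Serp API library) built on the exact same parameters used above:

from serpapi import GoogleSearch
import pandas as pd

# Hypothetical helper: fetch the organic results for any query/market combination.
# Defaults mirror the parameters used in the payload above.
def fetch_serp(query, api_key, location="United Kingdom", gl="uk", hl="en", num=10):
    params = {
        "engine": "google",
        "q": query,
        "location": location,
        "google_domain": "google.com",
        "gl": gl,
        "hl": hl,
        "num": num,
        "api_key": api_key,
    }
    data = GoogleSearch(params).get_dict()
    return pd.DataFrame(data["organic_results"])

# Example usage (equivalent to the block above):
# df = fetch_serp("latte", serp_apikey)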

Manipulate Data from a Specific Column

At this point, Pandas has handed you the organic results enclosed in a data frame, which we are going to manipulate so that we focus solely on the SERP parameters needed for the analysis.

Given that our original research question is to explore the extent of cross-competition between the "Latte" and "Cappuccino" verticals in the UK, we probably need to look at the URLs and how they are structured.

To this end, we're going to create a smaller data frame called "SERP_One" that singles out the link column from the Serp API output and renames it URL1.

SERP_One = pd.read_csv('/content/results_1.csv')
df = pd.DataFrame(SERP_One, columns=['link'])
df.columns = ['URL1']
df
[Image: First SERP analysis - URL column inspection]

āš ļø Beware in case you were benchmarking snippets that you will have to replace link with snippet and rename the column as Snippet1.

Extra Data Cleaning (Optional)

In case you want to benchmark another string-based SERP feature – such as the highlighted snippet words column – you can apply a few extra lines of code to remove special characters from the strings.

This part is entirely optional, though, as you can skip it and stick with comparing URLs.

# Assumes you kept the highlighted words column from the Serp API output as Highlighted_Words1
df['Highlighted_Words1'] = df['Highlighted_Words1'].str.replace(r"[\[\]\"']", "", regex=True)
df = df.fillna(0)
df.isnull().sum()
df

N-Grams Analysis on Strings

On the other hand, this part is highly recommended if you are benchmarking string features from our first SERP dataset, such as snippets.

Performing an n-gram analysis involves a little preprocessing of the words in the dataset so that we can plot a list of bigrams or trigrams.

This is beneficial as it lets us learn which word roots occur most frequently on a given search results page.

To this end, the nltk library is a great fit for tokenizing words, whereas the Snowball stemmer can be used to reduce each word to its semantic root (see the optional sketch right after the next code block).

Here is an example, as if you were putting snippets under the spotlight of your benchmark analysis of both SERPs.

import nltk
textlist = df['Snippet1'].to_list()

from collections import Counter
x = Counter(textlist)

#download stopwords list to remove what is not needed
nltk.download('stopwords')
from nltk.corpus import stopwords
stoplist = stopwords.words('english')

#create dataframe with bigrams and trigrams
from sklearn.feature_extraction.text import CountVectorizer
c_vec = CountVectorizer(stop_words=stoplist, ngram_range=(2,3)) #can also select bigrams only
# matrix of ngrams
ngrams = c_vec.fit_transform(df['Snippet1'])
# count frequency of ngrams
count_values = ngrams.toarray().sum(axis=0)
# list of ngrams
vocab = c_vec.vocabulary_
df_ngram = pd.DataFrame(sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True)
            ).rename(columns={0: 'frequency', 1:'bigram/trigram'})
            
#Get the output
df_ngram.head(20).style.background_gradient()
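Note that the block above relies on CountVectorizer's built-in tokenizer and the nltk stop-word list. If you also want the explicit nltk tokenization and Snowball stemming mentioned earlier, here is a minimal, optional sketch; it assumes the Snippet1 column and the stoplist defined above, and stores the result in a new, hypothetical Snippet1_stemmed column that you could feed to CountVectorizer instead of the raw snippets.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def stem_text(text):
    # lowercase, tokenize, drop stopwords and punctuation, reduce each word to its root
    tokens = word_tokenize(str(text).lower())
    return ' '.join(stemmer.stem(t) for t in tokens if t.isalpha() and t not in stoplist)

df['Snippet1_stemmed'] = df['Snippet1'].apply(stem_text)
df[['Snippet1', 'Snippet1_stemmed']].head()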

Beware that if you want to calculate a similarity score on these sorts of string-based SERP parameters, this preprocessing is the main requirement to satisfy before progressing to the next stage.

Performing some data preprocessing on our snippet corpora of words will improve the accuracy of the ultimate similarity score.

[Image: n-gram snippet analysis of the first SERP]

Run the Second SERP Analysis

From now on you'll notice the following process mimics what we've just done.

Following up on our main research question, we're going to venture into the other side of the market that we presume could contaminate the "Latte" vertical.

Let's investigate the SERP for "Cappuccino" by building a payload with Serp API.


from serpapi import GoogleSearch

serp_apikey = "#######" 

params = {
    "engine": "google",
    "q": "cappuccino",
    "location": "United Kingdom",
    "google_domain": "google.com",
    "gl": "uk",
    "hl": "en",
    "num": 10,
    "api_key": serp_apikey
}

client = GoogleSearch(params)
data = client.get_dict()

# access "organic results"
df2 = pd.DataFrame(data['organic_results'])
df2.to_csv('results_2.csv', index=False)
df2

Manipulate Data on the Second SERP Output

Hence, we create SERP_Two as another mini data frame and skim through the SERP parameters to keep only the link column, which gets renamed URL2 with the aid of Pandas.

SERP_Two = pd.read_csv('/content/results_2.csv')
df2 = pd.DataFrame(SERP_Two, columns=['link'])
df2.columns = ['URL2']
df2
[Image: Second SERP analysis - URL column inspection]

āš ļø Beware in case you were benchmarking snippets that you will have to replace link with snippet and rename the column as Snippet2.

From here on, we just rinse and repeat the optional data cleaning and the n-gram analysis from above.

# Optional Data Cleaning
########################

# df2['Highlighted_Words2'] = df2['Highlighted_Words2'].str.replace(r"[\[\]\"']", "", regex=True)
# df2 = df2.fillna(0)
# df2.isnull().sum()
# df2

# N-gram Analysis
########################

import nltk
textlist = df2['Snippet2'].to_list()

from collections import Counter
x = Counter(textlist)

#download stopwords list to remove what is not needed
nltk.download('stopwords')
from nltk.corpus import stopwords
stoplist = stopwords.words('english')

#create dataframe with bigrams and trigrams
from sklearn.feature_extraction.text import CountVectorizer
c_vec = CountVectorizer(stop_words=stoplist, ngram_range=(2,3)) #can also select bigrams only
# matrix of ngrams
ngrams = c_vec.fit_transform(df2['Snippet2'])
# count frequency of ngrams
count_values = ngrams.toarray().sum(axis=0)
# list of ngrams
vocab = c_vec.vocabulary_
df_ngram = pd.DataFrame(sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True)
            ).rename(columns={0: 'frequency', 1:'bigram/trigram'})
            
#Get the output
df_ngram.head(20).style.background_gradient()

āš ļø Make sure to remove the # from the functions (starting with df2) if you want to execute the optional data cleaning. Also, be sure you are preprocessing the right string from the second data frame, which in our case refers to “Snippet2“.

Here's the output.

[Image: n-gram snippet analysis of the second SERP]

Comparing the Object of your Analysis

After preprocessing the URLs or snippets, it's time to merge the small data frames that we created from the scraped "Latte" and "Cappuccino" SERPs.

To serve this purpose, we use the pd.concat function from Pandas to produce a new data frame grouping the URL (or Snippet) columns from above, and we'll start setting up the similarity scoring environment.

Bearing in mind our initial research question, we are hereby comparing the URLs scraped from the "Latte" and "Cappuccino" search results pages.

Comparison = pd.concat([df, df2], axis=1)
Comparison = Comparison.dropna()
Comparison
[Image: URL similarity comparison from the SERPs of "Latte" and "Cappuccino"]

Fabulous. Let's just make sure that both data columns have the same length.

URL1 = Comparison['URL1'].tolist()
cleanedList = [x for x in URL1 if str(x) != 'nan']   # first SERP column, NaN-free
print(len(cleanedList))

URL2 = Comparison['URL2'].tolist()
cleanedList2 = [x for x in URL2 if str(x) != 'nan']  # second SERP column, NaN-free
print(len(cleanedList2))

āš ļø If you were benchmarking the Snippet parameter, make sure to replace URL1 and URL2 with Snippet1 and Snippet2 on all occurrences

Calculate TF-IDF of submitted values

Without fuzzy matching, we couldn't compute any similarity score on our URLs.

This is done by importing the TF-IDF model from PolyFuzz, the best fuzzy matching library out there in my opinion.

tfidf = TFIDF(n_gram_range=(3,3), min_similarity=0.95, cosine_method='knn')
model = PolyFuzz(tfidf)
model.match(cleanedList, cleanedList2)
similarity = model.get_matches()
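As a side note, min_similarity=0.95 is a fairly strict threshold, and depending on the PolyFuzz version weaker candidates may come back as empty or zero-similarity matches. If you prefer to keep every best match and filter afterwards, a looser sketch could look like this (PolyFuzz also accepts the "TF-IDF" string shorthand with its default settings):

# Looser configuration: keep every best match, regardless of score
tfidf_loose = TFIDF(n_gram_range=(3, 3), min_similarity=0, cosine_method='sparse')
model_loose = PolyFuzz(tfidf_loose)
model_loose.match(cleanedList, cleanedList2)
similarity_loose = model_loose.get_matches()

# Quick-start alternative with PolyFuzz's built-in TF-IDF defaults:
# similarity_loose = PolyFuzz("TF-IDF").match(cleanedList, cleanedList2).get_matches()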

Display and Plot the Similarity Score

Finally, we display the output of our URL benchmarking analysis.

āš ļø Please, note that the order of the items is sorted by Similarity and not by SERP ranking

outcome = pd.DataFrame(similarity)
outcome.sort_values('Similarity', ascending=False, inplace=True)
outcome
[Image: URL similarity between the SERPs for the queries "Latte" and "Cappuccino"]

As you can see, the "Latte" and "Cappuccino" SERPs cross over to a decent extent, with coffeebean.com being the only exact touchpoint between the two verticals.

However, it's interesting to learn who the established players are in both verticals and to infer their approach to content marketing. For instance, we can assume bbcgoodfood.com thrives by differentiating its content proposition across both SERPs. This is similar to what Wikipedia and Wiktionary do.

Now, we can't assume that bbcgoodfood.com is top-notch at providing helpful content that pinpoints the differences between Latte and Cappuccino. Still, we have learned that they cross over both verticals while maintaining a neat URL structure.


If we look at the snippet benchmark analysis instead, you will probably have received a comparison that looks like the following:

[Image: Snippet similarity for the queries "Latte" and "Cappuccino"]

This excerpt of the benchmark analysis possibly provides the most truthful answer to our original research question.

We can immediately spot a significant difference between the snippets surfacing on the two SERPs, as the similarity scores struggle to reach values that would indicate any meaningful overlap.


If you struggle to visualize the results and draw useful conclusions, you can always plot the benchmark analysis with a precision-recall curve, whether it is for URLs or snippets.

To give you an example, this is how the URL benchmark analysis looks for us:

model.visualize_precision_recall()
[Image: Precision-recall curve]
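Plotly was installed earlier precisely to chart similarity scores, so here is a minimal sketch of a bar chart you could plot alongside the precision-recall curve; it assumes the outcome data frame produced above, with the From, To and Similarity columns that PolyFuzz returns.

import plotly.express as px

# Bar chart of the similarity score for each matched pair
fig = px.bar(
    outcome,
    x='From',
    y='Similarity',
    hover_data=['To'],
    title='SERP similarity: "latte" vs "cappuccino"'
)
fig.show()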

Conclusions

The findings from our similarity study confirm that "Latte" and "Cappuccino" are to be considered rather different entities in the eyes of Google, in line with their distinct signifiers, even though in real life they tend to be perceived as very similar items.

It was interesting to get a grip on who is competing with whom and what goals each website is trying to achieve in the marketplace.

Likewise, learning about potential snippet gaps across two SERPs can help you adjust your keyword research to the semantic competitive arena.

Long story short, I was constantly affected by my Italian bias, and Google was right, as it normally is.

