Build your SERP Similarity Tool with Python

When search queries share similar semantic meaning, the underlying search intent can be nuanced and hard to identify.

Sleeping on untapped search intent can trigger an unpleasant snowball effect on your SEO strategy and ultimately hurt your ROI.

The good news is that there are now plenty of automated tools that can streamline the boring daily tasks. For instance, Keyword Insights has developed a SaaS version of what I’m going to showcase in this post. In a nutshell, it’s a way to compare search results by assumed search-intent similarity.

Let me be brutal: why spend thousands on yet another subscription when AI and language models can achieve the same results?

Well, this topic’s got legs, and that’s why I’m covering it in this post, where you will learn how to build a handmade SERP Similarity Tool to benchmark SERP parameters and identify potential search-intent discrepancies.

💡 Python can even help you gracefully visualize search intent.
Find out in this post how to visualize search intent discrepancies with a Sankey diagram

In other words, this post will help you to:

✅ Compare two SERPs at the same time
✅ Benchmark URLs, snippets, highlighted words in a snippet, and more
✅ Analyze n-grams (strings only)

What you Need to Get Started

Let’s go through a few requirements first.

  • Serp API: The only hard requirement for running this framework is an account on Serp Api, which we’ll use to retrieve a plethora of organic features from the SERPs. You can sign up for free in seconds, and you’ll be granted free starting credits to play around with the API.
  • Data analysis – beginner level: Since the framework relies heavily on Pandas, it helps to be confident with the basics of this data package, though it isn’t strictly mandatory at this stage.
  • Google Colab: although you could use a Jupyter notebook, I’d suggest jumping on Colab as it ships with plenty of ready-to-run libraries and is the fastest option for executing basic scripts.

⚠️WARNING⚠️
Despite looking quite lengthy, the framework is effectively doubled to give you the chance to flex the code for each SERP. Take a moment to read the instructions before running each script to avoid potential hiccups.

Honourable Mentions

The birth of this framework is largely due to the following sources of inspiration:

🎖Redirect Matching Automation by Alex Romero Lopez

🎖 Query Analysis (adapted Google Colab) by Marco Giordano

🎖 Machine Learning Use Cases for Tech SEO, AKA Patrick Stox‘s impressive presentation at Brighton SEO

Install and Import Dependencies

First you need to install and import some libraries that will be used as dependencies of our project.

Here are the libraries we’ll rely on:

  • Polyfuzz: performs fuzzy string matching and string grouping, and contains extensive evaluation functions. We’ll use it to compute the similarity scores at the end of the framework.
  • google-search-results: the official Serp Api library to scrape Google Search results. We’ll use this real-time API to retrieve a number of SERP features, such as featured snippets.
  • Plotly: a powerful library for outstanding data visualization charts. We’ll use it to plot a similarity score graph.
  • gensim: a Python library for topic modelling, document indexing and similarity retrieval with large corpora. We’ll use it to preprocess (tokenize) words from string-based corpora for the n-gram analysis.

Next, we must bring a few related packages into the environment.

You can refer to the comments after the hash (#) in the following lines of code.

!pip install polyfuzz
!pip install google-search-results
!pip install plotly
!pip install gensim

#Libraries for data manipulation
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import plotly.express as px

#Libraries for preprocessing/tokenization
from gensim.parsing.preprocessing import remove_stopwords
import string
from nltk.stem.snowball import SnowballStemmer
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

#Libraries for vectorisation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

#Libraries for similarity clustering
from sklearn.cluster import KMeans
from polyfuzz.models import TFIDF
from polyfuzz import PolyFuzz

Run the First SERP Analysis

First and foremost, we need to identify the object of our benchmark analysis.

If you work with content, it shouldn’t be too hard for you to find similar queries with subtle search intent crossing over multiple SERPs.

However, for this tutorial, I’m going to use a peculiar research question drawn from personal experience.

As an Italian SEO living in the heart of England, I’ve always struggled to tell the difference between “Latte” and “Cappuccino”.

This is a typical case of linguistic overlap, as both terms share the same signified but are expressed through different signifiers.

In other words, both terms identify the same item (with tiny differences, c’mon🤣) whilst getting recognized under different expressions.


💡 Signified and Signifier stand for the two main components of a sign, where signified pertains to the “plane of content”, while signifier is the “plane of expression”.


If you’re targeting “Latte” as your primary entity, are you also going to compete against websites ranking on the “Cappuccino” search results page in the UK?

The only way to find out is to compare both SERPs and pick up a few organic features to use as our benchmarking tools.

Moving on to the technical side of the methodology, we need to build a payload that sends our search query to the Serp API, which in turn will return the first 10 search results.

There are several parameters you can include, and this is especially convenient when it comes to kickstarting a comprehensive semantic market audit.

You can refer to the official Google Search Results API documentation from Serp Api.

from serpapi import GoogleSearch

serp_apikey = "YOUR_API_KEY" 

params = {
    "engine": "google",
    "q": "latte",
    "location": "United Kingdom",
    "google_domain": "google.com",
    "gl": "uk",
    "hl": "en",
    "num": 10,
    "api_key": serp_apikey
}

client = GoogleSearch(params)
data = client.get_dict()

# access "organic results"
df = pd.DataFrame(data['organic_results'])
df.to_csv('results_1.csv', index=False)
df

In layman’s terms, all you’re required to do is paste your API key into the serp_apikey variable and include the search query in the payload (after "q").
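Serp Api accepts many more parameters than this minimal payload. As a hedged sketch (the parameter names below are documented by Serp Api, but the values are purely illustrative), you could extend the payload to target mobile results or paginate deeper:

```python
# Illustrative extension of the payload; "device", "start" and "safe"
# are documented Serp Api parameters, and the values are just examples.
params = {
    "engine": "google",
    "q": "latte",
    "location": "United Kingdom",
    "google_domain": "google.com",
    "gl": "uk",
    "hl": "en",
    "num": 10,
    "device": "mobile",   # desktop | tablet | mobile
    "start": 10,          # pagination offset: 10 = second results page
    "safe": "active",     # turn SafeSearch on
    "api_key": "YOUR_API_KEY",
}
```

Swapping these in and out is handy when you want to benchmark, say, mobile versus desktop SERPs for the same query.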

Manipulate Data from a Specific Column

At this point, Pandas has wrapped the organic results in a data frame that we are going to manipulate to focus solely on the SERP parameters needed for the analysis.

Given that our original research demand is to explore the extent of a cross competition between the “Latte” and “Cappuccino” verticals in the UK, we probably need to look at the URLs and how they are structured.

To this end, we’re going to create a smaller data frame called “SERP_One” that singles out the link column from the Serp API output and renames it URL1.

SERP_One = pd.read_csv('/content/results_1.csv')
df = pd.DataFrame(SERP_One, columns=['link'])
df.columns = ['URL1']
df
First SERP analysis - URL column inspection

⚠️ Beware: if you’re benchmarking snippets instead, you will have to replace link with snippet and rename the column Snippet1.
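In that case, the column selection follows the same pattern as the URL version. Here’s a minimal sketch, with a toy data frame standing in for the Serp Api CSV output:

```python
import pandas as pd

# Toy stand-in for the CSV produced by the Serp Api call
SERP_One = pd.DataFrame({
    "link": ["https://example.com/latte-recipe"],
    "snippet": ["A latte is an espresso drink topped with steamed milk."],
})

# Same pattern as the URL version, but targeting the snippet column
df = pd.DataFrame(SERP_One, columns=["snippet"])
df.columns = ["Snippet1"]
```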

Extra Data Cleaning (Optional)

In case you want to benchmark another string-based SERP feature – such as the highlighted words column – you can apply a few extra lines of code to remove special characters from the strings.

This part is entirely optional though, as you may skip it and stick with comparing URLs.

df['Highlighted_Words1'] = df['Highlighted_Words1'].str.replace(r"[\[\]\"']", "", regex=True)
df = df.fillna(0)
df.isnull().sum()
df

N-Grams Analysis on Strings

On the other hand, this part is highly recommended if you benchmark string features from our first SERP dataset, such as any snippets.

Performing an n-gram analysis involves a bit of word preprocessing on the dataset so that we can plot a list of bigrams or trigrams.

This is beneficial as it enables us to learn more about the most frequent words at their semantic root on a certain search results page.

To do so, the nltk library is top-notch for tokenizing words, whereas the Snowball stemmer will be leveraged for reducing each word to its semantic root.

Let me show you an example as if you were putting snippets under the spotlight of your benchmark analysis between both SERPs.

import nltk
textlist = df['Snippet1'].to_list()

from collections import Counter
x = Counter(textlist)

#download stopwords list to remove what is not needed
nltk.download('stopwords')
from nltk.corpus import stopwords
stoplist = stopwords.words('english')

#create dataframe with bigrams and trigrams
from sklearn.feature_extraction.text import CountVectorizer
c_vec = CountVectorizer(stop_words=stoplist, ngram_range=(2,3)) #can also select bigrams only
# matrix of ngrams
ngrams = c_vec.fit_transform(df['Snippet1'])
# count frequency of ngrams
count_values = ngrams.toarray().sum(axis=0)
# list of ngrams
vocab = c_vec.vocabulary_
df_ngram = pd.DataFrame(sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True)
            ).rename(columns={0: 'frequency', 1:'bigram/trigram'})
            
#Get the output
df_ngram.head(20).style.background_gradient()

Note that if you want to calculate a similarity score on these sorts of string-based SERP parameters, this preprocessing is required before progressing to the next stage.

Performing some data preprocessing on our snippet corpora of words will improve the accuracy of the ultimate similarity score.

n-gram snippet analysis of the first SERP

Run the Second SERP Analysis

Moving forward, we’ll follow a similar approach to what we’ve just implemented.

In line with our primary research objective, we’ll now explore the other side of the market, which we suspect could impact the “Latte” vertical.

Let’s investigate the SERP for “Cappuccino” by building a payload with Serp Api.


from serpapi import GoogleSearch

serp_apikey = "YOUR_API_KEY" 

params = {
    "engine": "google",
    "q": "cappuccino",
    "location": "United Kingdom",
    "google_domain": "google.com",
    "gl": "uk",
    "hl": "en",
    "num": 10,
    "api_key": serp_apikey
}

client = GoogleSearch(params)
data = client.get_dict()

# access "organic results"
df2 = pd.DataFrame(data['organic_results'])
df2.to_csv('results_2.csv', index=False)
df2

Manipulate Data on the Second SERP Output

Next, we create SERP_Two as another small data frame, and we skim through the SERP parameters to keep only the link column, which gets renamed URL2 with the aid of Pandas.

SERP_Two = pd.read_csv('/content/results_2.csv')
df2 = pd.DataFrame(SERP_Two, columns=['link'])
df2.columns = ['URL2']
df2
Second SERP analysis - URL column inspection

⚠️ Beware: if you’re benchmarking snippets instead, you will have to replace link with snippet and rename the column Snippet2.

From now on, we’ll just rinse and repeat the optional data cleaning and the n-gram analysis.

# Optional Data Cleaning
########################

# df2['Highlighted_Words2'] = df2['Highlighted_Words2'].str.replace(r"[\[\]\"']", "", regex=True)
# df2 = df2.fillna(0)
# df2.isnull().sum()
# df2

# N-gram Analysis
########################

import nltk
textlist = df2['Snippet2'].to_list()

from collections import Counter
x = Counter(textlist)

#download stopwords list to remove what is not needed
nltk.download('stopwords')
from nltk.corpus import stopwords
stoplist = stopwords.words('english')

#create dataframe with bigrams and trigrams
from sklearn.feature_extraction.text import CountVectorizer
c_vec = CountVectorizer(stop_words=stoplist, ngram_range=(2,3)) #can also select bigrams only
# matrix of ngrams
ngrams = c_vec.fit_transform(df2['Snippet2'])
# count frequency of ngrams
count_values = ngrams.toarray().sum(axis=0)
# list of ngrams
vocab = c_vec.vocabulary_
df_ngram = pd.DataFrame(sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True)
            ).rename(columns={0: 'frequency', 1:'bigram/trigram'})
            
#Get the output
df_ngram.head(20).style.background_gradient()

⚠️ Make sure to remove the # from the lines starting with df2 if you want to execute the optional data cleaning. Also, be sure you are preprocessing the right column from the second data frame, which in our case is “Snippet2”.

Here’s the output.

n-gram snippet analysis of the second SERP

Comparing the Object of your Analysis

After preprocessing URLs or snippets, it’s time to merge the small data frames that we created from the scraped “Latte” and “Cappuccino” SERPs.

To serve this purpose, we use the pd.concat function from Pandas to build a new data frame combining the URL (or snippet) columns from above, and we’ll start setting up the similarity-score environment.

With our initial research question in mind, we’re about to compare the URLs scraped from the “Latte” and “Cappuccino” search results pages.

Comparison = pd.concat([df, df2], axis=1)
Comparison = Comparison.dropna()
Comparison
URL similarity comparison from the SERP of "Latte" and "Cappuccino"

Fabulous. Let’s just make sure both columns have the same length.

URL1 = Comparison['URL1'].tolist()
cleanedList = [x for x in URL1 if str(x) != 'nan']
len(cleanedList)
URL2 = Comparison['URL2'].tolist()
cleanedList2 = [x for x in URL2 if str(x) != 'nan']
len(cleanedList2)

⚠️ If you were benchmarking the Snippet parameter, make sure to replace URL1 and URL2 with Snippet1 and Snippet2 on all occurrences

Calculate TF-IDF of submitted values

Without fuzzy matching, we couldn’t compute any similarity score on our URLs.

This is done by importing the TF-IDF model from Polyfuzz, the peak of fuzzy matching in my opinion.

tfidf = TFIDF(n_gram_range=(3,3), min_similarity=0.95, cosine_method='knn')
model = PolyFuzz(tfidf)
model.match(cleanedList, cleanedList2)
similarity = model.get_matches()

URL Benchmark Output

Finally, we display the output of our URL benchmarking analysis.

⚠️ Please, note that the order of the items is sorted by Similarity and not by SERP ranking

outcome = pd.DataFrame(similarity)
outcome.sort_values('Similarity', ascending=False, inplace=True)
outcome
URL similarity between SERP for the queries "Latte" and "Cappuccino"

As you notice, both the “Latte” and the “Cappuccino” SERPs cross over to a decent extent, with coffeebean.com being the only exact touchpoint between the verticals.

However, it’s interesting to learn who the established players in both verticals are and infer their approach to content marketing. For instance, we can assume bbcgoodfood.com thrives by differentiating its content proposition across both SERPs. This is similar to what Wikipedia and Wiktionary do.

Now, we can’t assume that bbcgoodfood.com is top-notch at providing helpful content and pinpointing the differences between Latte and Cappuccino. Still, we have learned that they cross over both verticals with a neat URL structure.


If we look at the Snippet benchmark analysis, this is what you should have received:

Snippet similarity for the query Latte and Cappuccino

Right away, we see a big difference between the chunks of meta descriptions used as snippets in Google Search. The similarity scores show there’s very little correlation, pointing to very few similar elements.

Conclusions

Our similarity study confirms that “Latte” and “Cappuccino” are to be considered rather different entities in the eyes of Google, although in real life they tend to be perceived as very similar items.

It was interesting to get a grip on who is competing with whom and what sorts of goals each website pursues in the marketplace.

Likewise, learning about potential snippet gaps across two SERPs can help you adjust your keyword research to the semantic competitive arena.

Long story short, I was constantly affected by my Italian biases and Google was right – as it usually is.
