Drill your Content Strategy on Topic Modelling with Python

On August 25th, Google began the roughly two-week rollout of the “Helpful Content Update”, raising several questions about how SEOs will approach content strategy. Given that Google is no longer willing to reward content devised to please search engines, a large proportion of existing content is likely doomed to a tragic end in the long term.

So what?

Other than avoiding obsolete SEO shenanigans like click-baiting, keyword stuffing and so forth, you may want to stay away from any attempt to stretch your content strategy across unrelated topics. This means that if you’ve got a blog about make-up, you obviously don’t want to jump on the bandwagon of financial matters. In a nutshell, you should genuinely focus on your niche and ensure your content always delivers value and in-depth expertise straight to your users’ fingertips.

By the way, this content update is likely a natural continuation of what Google has been doing throughout the latest product reviews updates, that is, emphasising the Expertise component of E-A-T.

💡 If you want to learn more about E-A-T, do check out “Why expertise is the most important ranking factor of them all” by Lily Ray.

Nevertheless, the SEO community has been chatting a lot about it in an attempt to predict the likely outcomes. Since the details of the HCU are beyond the purpose of this article, I suggest going through the handy checklist compiled by Aleyda Solis to make sure your content aligns with the purposes of the “Helpful Content Update”.

To put it straight, though, the foundations of this update aren’t that complicated. You need to review your content to make sure it doesn’t drift from your main topic lane and that it provides in-depth value to your readers. As a matter of fact, if you’re looking after a website, you should be aware of the pros and cons of your content.

So far so good. But what if you’re in charge of optimizing a website you don’t have a clue about? What if you wanted a quick peek at the most relevant vs irrelevant topics, based not on your Google Analytics data but on what Google’s algorithms are likely to deem relevant? What if you wanted to know at least where to start with your SEO?

One method you can use to find out is a long-established machine learning technique: topic modelling.

Why does it matter?

Google uses multiple topic layers when trying to rank content. This means that if your content doesn’t meet the statistical expectations for a given topic, it will likely slip out of the Google index, as it won’t be considered relevant to users.

What is Topic Modelling

Topic Modelling is a suite of NLP techniques for identifying latent themes in a corpus, that is, a group of text documents such as newspaper articles or tweets. While topic modelling can be leveraged for a number of reasons, the people who developed the models like to think of it as a sort of amplified reading of large groups of texts that you could never possibly read yourself.

To get a nice overview of Topic Modelling, we should first provide a bit of context around content clusters.

Content Clusters in SEO and Machine Learning

You may already be familiar with content clusters in the context of SEO. These are groups of related pages organised under a shared topic, and they come in a broad range of formats. To keep it simple, think of a set of pages whose internal links all point to a pillar page on your website.

On the other hand, content clusters take a slightly different shape when it comes to machine learning and this is due to how they are addressed.

Whereas the old SEO practice was to have a standalone page relying on backlinks to signal its importance, the newer practices lean on semantic meaning and place that page within the context of the other content on your website. That is the main reason why you can think of content clusters as Topic Models.

To date, Latent Dirichlet Allocation (LDA) is one of the most widely used algorithms for Topic Modelling. It is a probabilistic, unsupervised learning model: the idea is to end up with groups of words that can stand as descriptors for a particular topic, with the algorithm estimating the probability of each word belonging to each topic.

What LDA gives webmasters, and especially SEOs, is the chance to look at the different words describing a particular subtopic in a document and infer clues about both their semantic meaning and the term frequency (TF) with which these subtopics appear.

In a nutshell, topic modelling is an unsupervised machine learning method that helps us discover hidden semantic structures in a text and learn about topic representations.
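To make the “probabilistic” part a little more tangible, here is a minimal sketch of the story LDA assumes behind every document: each document draws its own mix of topics, and each word is drawn from one of those topics. The vocabulary, the two topics and their probabilities are made up for illustration, and this is not how gensim trains a model; training works the other way round, inferring the topics from the words.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocabulary, alpha, length, rng):
    """Generate one toy document following LDA's generative story."""
    # 1. Each document gets its own mixture of topics.
    topic_mix = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(range(len(topic_mix)), weights=topic_mix)[0]
        # 3. Pick a word according to that topic's word distribution.
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return topic_mix, words

rng = random.Random(42)
vocabulary = ["seo", "ranking", "crawl", "lipstick", "mascara", "foundation"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "search" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # hypothetical "make-up" topic
]
topic_mix, words = generate_document(topic_word_probs, vocabulary, [0.5, 0.5], 8, rng)
print([round(p, 2) for p in topic_mix], words)
```

A document with a lopsided topic mix will be dominated by words from one topic, which is exactly the signal we later read off the trained model.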

In this post, we will explore topic relevance using this website as a “guinea pig”. LDA will be applied within a Python framework to convert a set of crawled pages into a set of topics. This will help inform your next content strategy moves and let you dig deeper into other specific SEO tasks:

  • 🏘 create topic clusters
  • 👵🏻 update old content
  • 🕳 find content gaps
  • ❌ find irrelevant topics to remove from Google Index

The Process

I highly recommend running the following Python framework in Google Colab because of its convenience and UX, especially if you are not fluent in programming.

Also, using Colab speeds up the setup, as we’re not required to install every single Python library from scratch; most of them come preinstalled and only need to be imported.

Alternatively, if you fancy using a Jupyter notebook on your own machine, make sure to kickstart the script with the following installations:

!pip install pandas
!pip install matplotlib
!pip install pyLDAvis
!pip install spacy
!pip install gensim
!pip install nltk
!pip install wordcloud

# re, os, pickle and pprint ship with Python's standard library,
# so they do not need to be installed with pip.

Here’s a small roadmap of the next steps through the topic modelling analysis:

  • Crawl your website with Screaming Frog or another web crawler and export the crawl as a CSV file. If you fancy a free web crawling solution, you can try the basic advertools web crawler by Elias Dabbas.
  • Set up the environment by importing Python libraries
  • Perform data cleaning by removing punctuation and lowercasing the text so that LDA can work on clean input
  • Prepare the text for LDA topic model training
  • Analysis of the LDA topic model output

Install and Import Packages

Since we are going to use Google Colab, we only need to install a single library that isn’t preinstalled there. It’s a bit of a peculiar one, as it is built specifically around the LDA algorithm.

pyLDAvis is a Python module designed to help interpret the topics in a topic model. The package extracts information from a topic model to ultimately deliver an interactive web-based visualization, which is the end goal of this post.

!pip install pyLDAvis

⚠️ IMPORTANT ⚠️

Please note that recent versions of pyLDAvis renamed the original gensim helper module, pyLDAvis.gensim, to pyLDAvis.gensim_models. You should therefore install pyLDAvis on its own and then import pyLDAvis.gensim_models.

Then, we can start importing the packages required to progress with the setups.

For the purposes of this tutorial, I’d rather take you through all the steps so you don’t get lost in an awful lot of code. Hence, for the moment we will only import the core packages needed to run the LDA algorithm, plus a few libraries for data visualization and data cleaning.

Please note that these activities will take place later in the process, but it’s just fine to have them already at our disposal.

import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

from pprint import pprint

import spacy

import pickle
import re 
import pyLDAvis
import pyLDAvis.gensim_models

import matplotlib.pyplot as plt 
import pandas as pd

Load the Crawled Pages

Since we’ve already imported Pandas to build and handle data frames in Python, the next step is to import os, which we will use later to save the output of this topic modelling framework.

Then we leverage Pandas to upload the crawled pages that you should have already saved as a CSV file, and let Python read the crawl for us.

If your crawl is an Excel file (.xlsx), swap pd.read_csv for pd.read_excel.

In that case, don’t forget to end the file path with .xlsx.

import os

os.chdir('..')

# Read the crawled pages
crawl = pd.read_csv('/DIRECTORY/PATH.csv')

# Print head
crawl.head()

This is what you get
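If you are not sure whether a crawl will arrive as CSV or Excel, a small helper can pick the right Pandas reader from the file extension. This is only a sketch: load_crawl is a hypothetical name, not part of Pandas, and reading .xlsx files assumes an Excel engine such as openpyxl is installed.

```python
from pathlib import Path

import pandas as pd

def load_crawl(path):
    """Read a crawl export into a DataFrame, choosing the reader by file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in (".xlsx", ".xls"):
        # Reading Excel files relies on an engine such as openpyxl being installed.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported crawl format: {suffix}")

# Usage (the path is a placeholder, as in the code above):
# crawl = load_crawl('/DIRECTORY/PATH.csv')
```

Raising an error for unknown extensions is a deliberate choice here, so a wrong export is caught before any cleaning starts.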

Data Cleaning

Now that we’ve got a full crawl of the target website’s pages, we need to fetch the text data from each page. The column we need is body_text, along with url, title, meta_desc and h1 to provide a bit of context to the analysis.

crawl = pd.read_csv(r'/DIRECTORY/PATH.csv')
df = pd.DataFrame(crawl, columns=['url', 'title', 'meta_desc', 'h1', 'body_text'])
df

As when loading the crawl as a CSV file, we use Pandas again, this time to keep only the columns required for the purposes of this analysis.

⚠️ BONUS ⚠️

For larger websites with plenty of editorial content, you may want to work on a random sample of pages instead by appending .sample(100) right after the closing parenthesis of the pd.DataFrame(...) call.

Now that we’ve got a proper table to work on, we can start the actual data cleaning by removing punctuation and lowercasing the text.

Hence, we are going to perform some simple preprocessing on the content of the body_text column to make it more amenable to analysis and to get more reliable results.

To do that, we’ll use a RegEx to remove any punctuation, and then lowercase the text.

We don’t have to worry about importing or installing the right package as we’ve done it before.

# Make sure every row is a string (empty pages come through as NaN)
crawl['body_text'] = crawl['body_text'].fillna('').astype(str)

# Remove punctuation
crawl['body_text_processed'] = \
crawl['body_text'].map(lambda x: re.sub(r'[,.!?]', '', x))

# Convert the body text to lowercase
crawl['body_text_processed'] = \
crawl['body_text_processed'].map(lambda x: x.lower())

# Print out the first rows of the processed text
crawl['body_text_processed'].head()
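It is worth seeing what this RegEx does and does not remove. The sketch below runs the same pattern on a made-up sentence and compares it with a broader alternative built from Python’s string.punctuation; the broader pattern is my own suggestion, not part of the framework above.

```python
import re
import string

sample = "Hello, World! Is SEO (really) dead? No: it's evolving..."

# The pattern used above strips only commas, full stops,
# exclamation marks and question marks.
basic = re.sub(r"[,.!?]", "", sample).lower()

# A broader, hypothetical alternative: strip every ASCII punctuation character.
broad_pattern = "[" + re.escape(string.punctuation) + "]"
broad = re.sub(broad_pattern, "", sample).lower()

print(basic)  # hello world is seo (really) dead no: it's evolving
print(broad)  # hello world is seo really dead no its evolving
```

Brackets, colons and apostrophes survive the basic pattern, but the tokenizer used later in the process drops leftover punctuation anyway.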

To test the preprocessing, we are going to put together a word cloud using the wordcloud package. This gives us a visual representation of the most common words after cleaning.

As a whole, this is a crucial step to understand the data and ensure we are on the right track, as well as to check whether any further preprocessing is necessary before kicking off the training of the LDA algorithm.

For this specific task, we are going to import wordcloud.

# Import the wordcloud library
from wordcloud import WordCloud

# Join the processed body text of all pages together.
long_string = ','.join(list(crawl['body_text_processed'].values))

# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=1000, contour_width=3, contour_color='steelblue')

# Generate a word cloud
wordcloud.generate(long_string)

# Visualize the word cloud
wordcloud.to_image()

This is likely what you might get
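If you prefer numbers to pictures, a quick word count gives you a textual cross-check of the word cloud. This is a minimal sketch with Python’s built-in Counter; the three strings are made-up stand-ins for a few rows of crawl['body_text_processed'].

```python
from collections import Counter

# Hypothetical stand-ins for a few rows of crawl['body_text_processed'].
processed_pages = [
    "topic modelling helps seo teams audit content",
    "seo content strategy starts with topic research",
    "python makes topic modelling accessible",
]

# Count every whitespace-separated word across all pages.
word_counts = Counter(word for page in processed_pages for word in page.split())

# Show the five most frequent words with their counts.
for word, count in word_counts.most_common(5):
    print(f"{word}: {count}")
```

On a real crawl, you would expect stopwords such as “the” and “and” to dominate at this stage, which is why we remove them in the next step.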

LDA Training Model Setup

Next, let’s transform the textual data into a format that can serve as input for training the LDA model.

First and foremost we need to update the range of Python packages to import.

From gensim we need to import simple_preprocess, a helper in the gensim.utils module. It lowercases and tokenizes the text, optionally removes accents, and returns the tokens as Unicode strings that need no further processing.

In addition to that, we need to import nltk, which provides the list of English stopwords to remove.

Let’s start with tokenizing the text and removing stopwords.

import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes, dropping punctuation;
        # deacc=True additionally strips accent marks
        yield simple_preprocess(str(sentence), deacc=True)

def remove_stopwords(texts):
    # texts is already a list of token lists, so we only need to filter
    return [[word for word in doc if word not in stop_words]
            for doc in texts]


data = crawl.body_text_processed.values.tolist()
data_words = list(sent_to_words(data))

# remove stop words
data_words = remove_stopwords(data_words)

# Preview the first 30 tokens of the first page
print(data_words[0][:30])
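To get a feel for what happens to a single page at this step, here is a rough standard-library approximation. It is not gensim’s implementation: the tokenize function only mimics the general behaviour of lowercasing and keeping tokens between 2 and 15 characters, and the stop-word list is a tiny made-up one.

```python
import re

# A tiny, hypothetical stop-word list; NLTK's English list is far longer.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in", "for"}

def tokenize(text, min_len=2, max_len=15):
    """Lowercase the text and keep alphabetic tokens within a length range."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def drop_stop_words(tokens):
    """Remove tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

page = "The purpose of Topic Modelling is to find a structure in the text."
tokens = drop_stop_words(tokenize(page))
print(tokens)
# ['purpose', 'topic', 'modelling', 'find', 'structure', 'text']
```

What is left is the handful of words that actually carry the meaning of the page, which is what the LDA model will be trained on.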

Next, we convert the tokenized objects into a dictionary and a corpus using gensim.corpora, another sub-package of gensim.

import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_words)

# Create Corpus
texts = data_words

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1][0][:30])
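The printed corpus is a list of (word id, count) pairs, which can look cryptic at first. The sketch below rebuilds the same idea with plain Python on two made-up documents. It only illustrates the data structure: gensim’s Dictionary may assign ids in a different order.

```python
from collections import Counter

docs = [
    ["topic", "modelling", "seo", "topic"],
    ["seo", "content", "strategy"],
]

# Assign an integer id to each unique token, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
# {'topic': 0, 'modelling': 1, 'seo': 2, 'content': 3, 'strategy': 4}
print(bow_corpus)
# [[(0, 2), (1, 1), (2, 1)], [(2, 1), (3, 1), (4, 1)]]
```

Word order is thrown away in this representation; LDA only looks at which words appear in a document and how often.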

And finally we kick off the LDA training.

Though we first need to import pprint, a Python module that we use to print data structures in a human-friendly way so that we can read them.

To keep things simple, we’ll keep all the parameters at their defaults except for the number of topics.

We set the number of topics to 10 (gensim’s own default is 100), where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.

from pprint import pprint

# number of topics
num_topics = 10

# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)

# Print the keywords in the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
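The topics are printed as strings of weighted keywords, which are handy to read but awkward to reuse. As a small sketch, the snippet below turns such strings into (word, weight) pairs with a RegEx. The sample_topics values are made up to mirror the shape of the print_topics() output, and parse_topic is a hypothetical helper, not a gensim function.

```python
import re

# Made-up sample in the shape returned by lda_model.print_topics().
sample_topics = [
    (0, '0.050*"search" + 0.030*"google" + 0.020*"content"'),
    (1, '0.040*"python" + 0.025*"data" + 0.010*"model"'),
]

def parse_topic(topic_string):
    """Turn a '0.050*"word" + ...' string into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

for topic_id, topic_string in sample_topics:
    print(topic_id, parse_topic(topic_string))
```

Once the weights are numbers rather than text, you can sort them, export them to a spreadsheet or compare topics across crawls.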

Interpretation of the LDA Topic Model Output

Now that we have a model fully trained, it’s time to visualize the topics for interpretability.

To do so, we’ll stick with pyLDAvis, the dedicated visualization package we installed at the beginning.

As mentioned, the package equips our framework with a handy overview of the topics covered by our target website.

In particular, this will help us to:

  • Better understand and interpret individual topics
  • Better understand the relationships between the topics

# Visualize the topics
pyLDAvis.enable_notebook()

# Path (without extension) where the prepared visualization data is stored
LDAvis_data_filepath = '/DIRECTORY_PATH' + str(num_topics)

# Prepare the visualization data and cache it on disk
LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
with open(LDAvis_data_filepath, 'wb') as f:
    pickle.dump(LDAvis_prepared, f)

# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

pyLDAvis.save_html(LDAvis_prepared, '/DIRECTORY_PATH'+ str(num_topics) +'.html')

LDAvis_prepared

⚠️BONUS⚠️

Please make sure to set the directory path where you want to save the LDA topic modelling output.

I suggest you use the file path you used to upload the crawl file at the beginning without the file extension (.csv / .xlsx).

This is finally what you might get

What can we learn from this output?

As briefly mentioned in the introduction to this post, the LDA algorithm helps inform topic relevance on the website under analysis, giving us a clue about which topics come across as relevant and which as irrelevant.

Due to the probabilistic nature of the model, the outcomes depend on the documents in the sample, in this case the crawled pages.

That is why we should expect the outcomes to reflect two things: entity salience, that is, the overall weight of a single topic against the sample (the website), and the relationships between entities, that is, the connections between topics within the same sample (the website).

However, the data visualization above appears to provide an even more in-depth portrait.

1️⃣ Topic Relationships

If you look at the chart on the left (the intertopic distance map), you get an overview of topic relationships on the targeted website.

In this example, we may assume that some topics drift away from the main topic clusters. This means you might need to investigate whether the pages covering these topics are worth removing.
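To move from a drifting topic to the actual pages behind it, you can look at the dominant topic of each page. The sketch below assumes you have paired each URL with its list of (topic id, probability) tuples, which is the shape doc_lda holds for every document. The URLs and probabilities are made up, and flagging a page because it is alone in its topic is just one possible heuristic.

```python
from collections import Counter

# Hypothetical per-page topic probabilities, in the (topic_id, probability)
# shape that gensim returns when you index a trained model with a document.
page_topics = {
    "/seo-guide": [(0, 0.85), (1, 0.15)],
    "/topic-modelling": [(0, 0.70), (2, 0.30)],
    "/python-tips": [(0, 0.55), (1, 0.45)],
    "/holiday-recipes": [(2, 0.90), (0, 0.10)],
}

def dominant_topic(topics):
    """Return the topic id with the highest probability for one page."""
    return max(topics, key=lambda pair: pair[1])[0]

dominant = {url: dominant_topic(topics) for url, topics in page_topics.items()}
pages_per_topic = Counter(dominant.values())

# Flag pages whose dominant topic is shared with no other page.
outliers = [url for url, topic in dominant.items() if pages_per_topic[topic] == 1]
print(dominant)
print(outliers)  # ['/holiday-recipes']
```

A flagged page is only a candidate for review; whether to update, consolidate or remove it remains an editorial decision.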

2️⃣ TF of Entities within Topics

If you look at the bar chart on the right, you get the overall term frequency along with the estimated term frequency within the selected topic. This helps to understand which entities Google is likely to deem more or less important across the website.

Only by cross-referencing the bar chart with the topic chart on the left will you learn that, within the main Topic Cluster (n.1), the estimated term frequency marks the term “search” as the most relevant, whereas “entities” and “algorithms” are considered the least relevant.

Final Thoughts

Machine learning may seem a bit of a beast at first touch, but there’s no need to fear it. Paradoxically, you should be more aware of the risks in fully relying on machine learning rather than on your first-hand experience and testing.

Just because LDA is a probabilistic model well suited to Topic Modelling analysis, it doesn’t mean that it’s 100% right. Data science comes with natural outliers, meaning errors characterize every single prediction based on sample analysis.

Machine learning and language technology models should be taken with a grain of salt. Hence, I highly recommend taking advantage of this Python framework as a complementary tool to help better inform your content strategy.

FAQ

How to upload a file with Pandas

To upload a file with Pandas in Google Colab, click the three dots next to your uploaded file, copy its path, and paste it within the brackets following the pd.read_csv function.

Why do we import NLTK in Python?

We import NLTK in Python to help us tokenize and tag excerpts of text before submitting them to NLP machine learning models.

What is Gensim Corpora?

Gensim Corpora relates to a set of core concepts:
– Document: some text.
– Corpus: a collection of documents.
– Vector: a mathematically convenient representation of a document.
– Model: an algorithm for transforming vectors from one representation to another.
Gensim corpora is oftentimes associated with the concept of a Dictionary, that is, a mapping between words and their integer ids.

Why do we import pprint?

You can import the pprint module in Python for debugging code bloated with API requests and large JSON files. It is useful to essentially print data structures in a readable, pretty way.

What is pyLDAvis?

pyLDAvis is a Python module designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.
