Backlinks are foundational to SEO. They signal authority, relevance, and trust to search engines.
Traditional backlink metrics like Domain Rating (DR) tell you how authoritative a linking site is, but they often miss how relevant those links are to your specific content.

The Backlink Semantic Analysis project tackles this by combining authority metrics with semantic relevance to return deeper insights into link value.
You can either fork the backlink analysis repository on GitHub or jump straight into the App.
This project is part of a growing SEO tool station, where I'm collecting practical applications that support real-world SEO analysis.
Honourable Mention
I'd like to thank Ramon Ejikemans for inspiring this project and helping to brainstorm it. Give him a follow on LinkedIn!
What This Tool Does
The Backlink Semantic Analysis project enhances backlink audits by blending Domain Rating and semantic similarity between backlinks and your target pages. It focuses on links placed in the main content area of pages rather than in footers or sidebars, which are often less valuable and might be spammy.
Instead of treating all backlinks equally, the tool measures how contextually related each backlink is to the content it points to. This helps you understand not just where links come from but how meaningful they are to your SEO goals.
How it Works
The workflow is straightforward:
- Export Backlinks from Ahrefs: Use a backlink tool such as Ahrefs to export only the links placed within content, not in footers or navigation.
- Prepare Your Data: Clean the exported CSV and convert it to .xlsx, keeping only the columns required for analysis.
- Upload to the Streamlit App: Upload your cleaned .xlsx file to the project's Streamlit interface.
- Choose a Model: Pick a sentence-embedding model (e.g., all-MiniLM-L6-v2). This model calculates semantic similarity efficiently, even for large backlink datasets.
The app then calculates semantic similarity between each backlink URL and the corresponding target page, producing visuals and exportable data for further analysis.
Understanding Semantic Similarity
Semantic similarity, measured here via cosine similarity, quantifies how topically aligned two pieces of text are. In this project, it compares backlink URLs to the pages they link to, giving you a numerical proxy for relevance.
Unlike raw string matching, this method captures deeper contextual signals. However, it has limitations: it doesn't fully capture nuances of meaning or how search engines like Google rank content holistically.
The reason lies in boring algebra. Cosine similarity compares the direction of two embedding vectors (their "angle") but ignores their "strength" (length).
This means it only measures general topic similarity; it doesn't account for word order or deeper relationships.
To illustrate, I elaborated on an example borrowed from Elie Berreby's piece on why cosine similarity misses the angle:

Cosine similarity would see these as practically the same, even though the meaning is different.
In turn, Google's ranking systems (e.g., RankEmbed) go further. They also consider vector length (magnitude), which allows them to combine meaning with other signals like PageRank, freshness, or click data.
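The difference between direction and length can be shown with a few lines of plain Python. This is a minimal sketch on made-up three-dimensional vectors (real embeddings have hundreds of dimensions); the names `page`, `short_vec`, and `long_vec` are purely illustrative.

```python
import math

def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: the dot product after normalising both lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy "embeddings" (made-up numbers, for illustration only).
page = [1.0, 2.0, 3.0]
short_vec = [0.5, 1.0, 1.5]  # same direction as `page`, half the length
long_vec = [2.0, 4.0, 6.0]   # same direction as `page`, double the length

cos_short = cosine(page, short_vec)
cos_long = cosine(page, long_vec)
dot_short = dot(page, short_vec)
dot_long = dot(page, long_vec)

print(cos_short, cos_long)  # both approximately 1.0: cosine ignores length
print(dot_short, dot_long)  # 7.0 and 28.0: the dot product does not
```

Both vectors point the same way, so cosine similarity cannot tell them apart, while the dot product scales with their length.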
What this means for SEO
- For indexation: Google doesn't rank pages by meaning alone; it blends authority, freshness, and relevance (DotProduct & RankEmbed algorithms).
- For audits: Cosine similarity can be a useful early indicator, but it's not the full picture. Don't base important SEO decisions on it alone.
Introducing the Contextual Authority Score (CAS)
A key innovation in this project is the Contextual Authority Score (CAS), a single score that combines:
- Page authority of the backlink source
- Link dilution (number of other outbound links)
- Semantic relevance to the target page
CAS helps you prioritise backlinks that are not only strong in authority but also contextually relevant. A high CAS means the link is likely valuable, while a low CAS could signal low relevance or dilution.
This metric is particularly useful when auditing your link profile or identifying new link opportunities that align with authority and content relevance.

- UR: Measures the strength of the specific linking page, not the overall domain.
- ExLC: Counts outbound links to external domains and acts as a dilution factor: the more links on a page, the less value each link carries.
- S: Indicates how topically relevant the linking page is to your content.
How to interpret CAS
- High CAS: The backlink is both authoritative and highly relevant.
- Low CAS: The link may be weak, diluted by many outbound links, or topically misaligned.
Important caveats with CAS
- CAS should not be used in isolation; always consider placement, anchor text, and other qualitative signals.
- The metric is influenced by the clickstream data sources that shape inputs like UR (e.g., Ahrefs). You don't know how this data has been pre-processed, so treat it directionally.
| Metric | Authority | Relevance | Link Equity Awareness |
|---|---|---|---|
| Domain Rating (DR) | Yes | No | No |
| URL Rating (UR) | Yes | No | No |
| TF*IDF Similarity | No | Yes | No |
| CAS | Yes | Yes | Yes |
Step 1: Export Backlinks from Ahrefs
If you want to fork the project, the steps below go into more detail on how to prepare the input file and run the underlying code.
Start by exporting backlinks from Ahrefs (or a similar tool).

Recommended filters
- Link type: In-content
- Exclude: footer, sidebar, navigation
- Include:
  - Referring page URL
  - Target URL
  - Referring page DR / URL rating
  - Number of outgoing links
This ensures the dataset reflects editorial links rather than structural ones.
Step 2: Prepare the Input File and Upload to the App
The app expects a clean .xlsx file.

Typical columns:
- referring_url
- referring_page_http_code
- domain_rating
- UR
- referring_domains
- External_links
- Page_traffic
- Target_url
Pro Tip: Although the app already takes care of pre-processing, I'd still suggest removing pages that don't return an HTTP 2xx status, as well as lost backlinks, at this stage.
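The status-code clean-up from the tip above could be scripted along these lines. This is a minimal sketch, not the app's own pre-processing: the helper name `clean_backlinks`, the sample rows, and the file names in the final comment are hypothetical, and lost backlinks are not handled because the post does not name the column that marks them.

```python
import pandas as pd

def clean_backlinks(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose referring page answered with an HTTP 2xx status."""
    codes = pd.to_numeric(df["referring_page_http_code"], errors="coerce")
    is_2xx = codes.between(200, 299)  # missing or non-numeric codes count as False
    cleaned = df[is_2xx].dropna(subset=["referring_url", "Target_url"])
    return cleaned.reset_index(drop=True)

# Tiny made-up sample standing in for a backlink export.
sample = pd.DataFrame({
    "referring_url": [
        "https://a.example/post",
        "https://b.example/old",
        "https://c.example/guide",
    ],
    "referring_page_http_code": [200, 404, 301],
    "domain_rating": [55, 70, 40],
    "External_links": [12, 3, 25],
    "Target_url": [
        "https://site.example/seo",
        "https://site.example/seo",
        "https://site.example/links",
    ],
})

cleaned = clean_backlinks(sample)
print(cleaned)

# With a real export you would read the CSV and write the workbook instead
# (writing .xlsx needs an Excel engine such as openpyxl), for example:
# pd.read_csv("backlinks.csv").pipe(clean_backlinks).to_excel("backlinks.xlsx", index=False)
```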
Next up, submit the file to the app and watch it gear up.

If you’d like to tweak the source code, follow along with the rest of the post or jump to the conclusions.
Step 3: Install Dependencies
Clone the repository and install the required Python packages.
```bash
git clone https://github.com/simodepth96/Backlink-Analysis.git
cd Backlink-Analysis
pip install -r requirements.txt
```
Key libraries used:
- pandas
- sentence-transformers
- scikit-learn
- streamlit
Step 4: Launch the Streamlit App
The analysis runs through a Streamlit interface.
```bash
streamlit run app.py
```
This opens a local web interface where you can upload your backlink file and configure the analysis.
Step 5: Load a Sentence Embedding Model
Semantic similarity is calculated using transformer-based embeddings.
Example model:
all-MiniLM-L6-v2
Representative code snippet:
```python
from sentence_transformers import SentenceTransformer

# Downloads the model on first use, then loads it from the local cache.
model = SentenceTransformer("all-MiniLM-L6-v2")
```
This model converts URLs (or extracted text) into numerical vectors that capture topical meaning.
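The post does not say whether the app pre-processes URLs before embedding them. If you experiment with your own fork, one option is to turn each URL into plain words first, since sentence-embedding models are trained on natural-language text. The helper `url_to_text` below is a hypothetical illustration of that idea, using only the standard library.

```python
import re
from urllib.parse import urlparse

def url_to_text(url: str) -> str:
    """Turn a URL into a space-separated string of its host and path words."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    raw = f"{host} {parsed.path}"
    # Split on anything that is not a letter or digit (dots, slashes, hyphens).
    words = re.split(r"[^A-Za-z0-9]+", raw)
    return " ".join(word for word in words if word)

example_text = url_to_text("https://www.example.com/blog/backlink-semantic-analysis/")
print(example_text)  # example com blog backlink semantic analysis
```

The resulting strings could be passed to `model.encode(...)` in place of the raw URLs. Whether that improves the scores for your dataset is something to test rather than assume.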
Step 6: Compute Semantic Similarity
The tool calculates cosine similarity between referring URLs and target URLs.
Example logic:
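```python
from sklearn.metrics.pairwise import cosine_similarity

# Assumes `df` is the uploaded backlink table and `model` is the
# SentenceTransformer loaded in Step 5.
ref_vectors = model.encode(df["referring_url"].tolist())
target_vectors = model.encode(df["target_url"].tolist())

# Compare each referring URL with its own target URL, row by row.
df["semantic_similarity"] = [
    cosine_similarity([r], [t])[0][0]
    for r, t in zip(ref_vectors, target_vectors)
]
```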
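The output is a score that, in practice, mostly falls between 0 and 1, where higher values indicate stronger topical alignment. Strictly speaking, cosine similarity ranges from -1 to 1, so small negative values are possible.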
Step 7: Calculate Contextual Authority Score (CAS)
CAS combines authority, dilution, and relevance into a single metric.
Simplified example:
```python
# Simplified CAS: authority times relevance, divided by outbound-link dilution.
df["cas"] = (
    df["dr"]
    * df["semantic_similarity"]
    / df["outgoing_links"]
)
```
What this captures:
- High DR pages carry more weight
- Links surrounded by fewer outbound links are stronger
- Contextually relevant links score higher
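To see how the three factors interact, here is a minimal worked sketch of the simplified formula on made-up numbers. The function name `contextual_authority_score`, the sample rows, and the `clip(lower=1)` guard (which treats a page reporting zero outbound links as having one, to avoid dividing by zero) are assumptions for illustration, not necessarily what the app does.

```python
import pandas as pd

def contextual_authority_score(
    df: pd.DataFrame,
    authority_col: str = "dr",
    similarity_col: str = "semantic_similarity",
    links_col: str = "outgoing_links",
) -> pd.Series:
    """Simplified CAS: authority * similarity / outbound links."""
    links = df[links_col].clip(lower=1)  # assumption: guard against zero links
    return df[authority_col] * df[similarity_col] / links

# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    "referring_url": ["big-but-off-topic", "small-but-relevant", "no-link-count"],
    "dr": [80, 40, 10],
    "semantic_similarity": [0.2, 0.8, 0.3],
    "outgoing_links": [100, 10, 0],
})

sample["cas"] = contextual_authority_score(sample)
ranked = sample.sort_values("cas", ascending=False).reset_index(drop=True)
print(ranked[["referring_url", "cas"]])
```

In this toy data the DR 80 page ends up last: its low relevance and 100 outbound links outweigh its authority.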
Visualise and Export Results
The Streamlit app generates:
- Distribution charts for semantic similarity
- CAS-based backlink rankings
- Downloadable table output
- Lowest-performing backlinks (HTTP 200 only)

This allows you to:
- Identify low-value but high-DR links
- Spot highly relevant backlinks worth protecting
- Prioritise future outreach targets
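If you export the table, the first two checks in the list above can be approximated with a simple triage rule. This is a sketch under stated assumptions: the function name `triage_backlinks`, the thresholds (DR 60, similarity 0.5), and the label wording are hypothetical and should be tuned to your own data.

```python
import pandas as pd

def triage_backlinks(
    df: pd.DataFrame,
    dr_threshold: float = 60,
    similarity_threshold: float = 0.5,
) -> pd.DataFrame:
    """Label each backlink by comparing DR and similarity to two thresholds."""
    high_dr = df["dr"] >= dr_threshold
    relevant = df["semantic_similarity"] >= similarity_threshold

    labels = pd.Series("review", index=df.index)
    labels[high_dr & ~relevant] = "high DR, low relevance"
    labels[relevant] = "relevant, worth protecting"
    return df.assign(triage=labels)

# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    "referring_url": ["a", "b", "c", "d"],
    "dr": [85, 30, 20, 90],
    "semantic_similarity": [0.15, 0.75, 0.20, 0.90],
})

triaged = triage_backlinks(sample)
print(triaged[["referring_url", "triage"]])
```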
Why This Matters for SEO
Semantic similarity does not replace traditional SEO metrics; it complements them. CAS should be interpreted directionally to help you reframe backlink evaluation around meaning and context, not just raw authority.
Traditional backlink audits often emphasise quantity and broad authority metrics like DR. While useful, these metrics overlook how topically relevant a link is to your content: a link from a high-DR domain may still sit on a weak page, be overly diluted, or be semantically off-topic.
Evaluating backlinks through the lens of semantic similarity and CAS helps you:
- Identify undervalued, contextually strong links
- Avoid irrelevant high-DR backlinks that don't support your SEO narrative
- Prioritise outreach and link building based on both authority and relevance
This approach aligns more closely with modern SEO priorities, where relevance and contextual alignment are critical to ranking.

