ML-Powered Internal Linking Opportunity Discovery with LinkBERT
Internal linking at scale requires more than manual review. This Google Colab notebook uses LinkBERT-large, a transformer model specifically trained on document relationships, to automatically identify semantic connections between articles for strategic internal linking. Created by Lazarina Stoy for her Introduction to Machine Learning for SEOs course, the workflow goes beyond simple keyword matching: it generates deep semantic embeddings from article titles and content, then calculates cosine similarity scores between all article pairs to reveal which pages discuss related topics and should link to each other. Unlike general-purpose embeddings, LinkBERT was pre-trained on hyperlinked text, making it particularly effective at surfacing meaningful internal linking opportunities that strengthen topical authority and site architecture.
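The core idea is turning each article into a dense vector. The notebook's exact code isn't reproduced here, but a minimal sketch of the approach, assuming the publicly released michiyasunaga/LinkBERT-large checkpoint on Hugging Face and mean pooling over the last hidden state (both assumptions; the notebook may load and pool differently), looks like this:

```python
# Minimal sketch: LinkBERT embeddings via Hugging Face transformers.
# The model ID and mean pooling are assumptions, not taken from the notebook.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "michiyasunaga/LinkBERT-large"  # public LinkBERT-large checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).to(device).eval()

@torch.no_grad()
def embed_texts(texts: list[str], batch_size: int = 16) -> torch.Tensor:
    """Return one 1024-dim embedding per text (mean-pooled token states)."""
    chunks = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(
            texts[i : i + batch_size],
            padding=True, truncation=True, max_length=512,
            return_tensors="pt",
        ).to(device)
        hidden = model(**batch).last_hidden_state      # (B, T, 1024)
        mask = batch["attention_mask"].unsqueeze(-1)   # ignore padding tokens
        pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean pooling
        chunks.append(pooled.cpu())
    return torch.cat(chunks)

titles = ["What is internal linking?", "Anchor text best practices"]
embeddings = embed_texts(titles)  # shape: (2, 1024)
```

Batching and the `device` check mirror the GPU-accelerated, batch-processing behavior described below; on a Colab GPU runtime the same code runs unchanged, just faster.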
The notebook provides a flexible, production-ready pipeline for analyzing entire content libraries. You upload a CSV with three columns (Address/URL, Title, Content), then choose your analysis approach: title embeddings only (fast, good for high-level topical clustering), content embeddings only (slower but comprehensive, capturing deep semantic relationships), or combined title+content embeddings (a balanced approach that weighs both headline topics and full article context).

The algorithm processes articles in batches with GPU acceleration when available, extracts a 1024-dimensional LinkBERT embedding for each article, and computes a similarity matrix scoring how related every article pair is on a 0-to-1 scale. Results are output in two formats (both sketched below): a “high similarity pairs” CSV (Source URL, Target URL, Similarity Score), filtered by your chosen threshold (0.5-0.8 recommended), that lists actionable linking opportunities; and a full similarity matrix with all articles as both rows and columns, useful for network visualization or advanced analysis. The example run processes 1,341 articles, demonstrating scalability for enterprise content libraries, and the notebook automatically downloads all embedding files so you can reuse them without recomputing.
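The downstream steps are standard vector math. As a hedged sketch (the threshold, URLs, and file names are illustrative, and stand-in vectors replace real LinkBERT embeddings so the snippet runs on its own), cosine similarity over L2-normalized embeddings yields the full matrix, and a simple filter produces the pairs file:

```python
# Sketch of the similarity + export steps. In practice, reuse embed_texts()
# from the previous sketch instead of the random stand-in vectors.
import numpy as np
import pandas as pd

urls = [f"https://example.com/post-{i}" for i in range(5)]  # hypothetical URLs
embeddings = np.random.rand(len(urls), 1024)                # stand-in vectors

# Cosine similarity: normalize rows, then take the dot-product matrix.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
sim = normed @ normed.T                                     # (N, N) matrix

# Output 1: full similarity matrix, articles as both rows and columns.
pd.DataFrame(sim, index=urls, columns=urls).to_csv("similarity_matrix.csv")

# Output 2: high-similarity pairs above a chosen threshold (0.5-0.8 recommended).
THRESHOLD = 0.6
rows = [
    {"Source URL": urls[i], "Target URL": urls[j], "Similarity Score": sim[i, j]}
    for i in range(len(urls)) for j in range(len(urls))
    if i != j and sim[i, j] >= THRESHOLD
]
pd.DataFrame(rows).to_csv("high_similarity_pairs.csv", index=False)
```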
Use this for:
‧ Automated internal linking strategy by identifying which articles should link to each other based on semantic relevance rather than manual editorial judgment
‧ Topical cluster optimization by discovering related content that should be interconnected to strengthen topical authority signals for search engines
‧ Content silo architecture by analyzing the similarity matrix to identify distinct content clusters and gaps in your linking structure (see the graph sketch after this list)
‧ Link opportunity prioritization using similarity scores—highest scores represent the strongest semantic relationships and most valuable linking opportunities
‧ Programmatic link insertion by feeding the high-similarity pairs output into content management systems or bulk editing tools
‧ Content audit insights by identifying orphaned content (pages with no high-similarity matches) that may need expansion or consolidation
‧ Competitive advantage through ML-powered site architecture that goes beyond what manual internal linking strategies can achieve at scale
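For the silo-architecture and orphaned-content bullets above, one simple approach (an illustrative sketch using a toy matrix, not the notebook's method) is to treat each above-threshold pair as a graph edge: connected components then correspond to content clusters, and single-article components are orphan candidates:

```python
# Sketch: derive clusters and orphans from the similarity matrix.
# THRESHOLD and the toy matrix are illustrative, not from the notebook.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

urls = ["/a", "/b", "/c", "/d"]
sim = np.array([
    [1.0, 0.8, 0.2, 0.1],   # /a and /b are strongly related
    [0.8, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.2],   # /c and /d match nothing strongly
    [0.1, 0.1, 0.2, 1.0],
])
THRESHOLD = 0.6

# Edges between articles whose similarity clears the threshold (diagonal excluded).
adj = csr_matrix((sim >= THRESHOLD) & ~np.eye(len(urls), dtype=bool))
n_clusters, labels = connected_components(adj, directed=False)

for c in range(n_clusters):
    members = [u for u, lab in zip(urls, labels) if lab == c]
    if len(members) == 1:
        print(f"orphan candidate: {members[0]}")  # no strong matches anywhere
    else:
        print(f"cluster {c}: {members}")
```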
This is perfect for SEO professionals, content strategists, and site architects managing large content libraries (100+ articles) where manual internal linking becomes impractical. It is particularly valuable for enterprise sites, publishers, and content-heavy SaaS companies that need to systematically optimize internal link structure based on semantic relationships, rather than relying on authors to spot relevant linking opportunities across hundreds or thousands of articles.
What’s Included
- Uses LinkBERT-large, a transformer model specifically pre-trained on document relationships and hyperlinked text, making it particularly effective for internal linking compared to general-purpose embeddings
- Processes large content libraries at scale (example shows 1,341 articles) with batch processing, GPU acceleration, and memory-efficient similarity computation
- Flexible analysis options: title embeddings only, content embeddings only, or a combined title+content approach (sketched after this list), with configurable similarity thresholds (0.5-0.8 recommended)
- Dual output formats—high-similarity pairs CSV for immediate actionable linking opportunities plus full similarity matrix for network visualization and advanced analysis
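The exact way the combined title+content option mixes the two vectors isn't specified here; one common technique, shown below purely as an assumption-laden sketch, is a weighted average of the two L2-normalized embeddings:

```python
# Sketch: one plausible way to combine title and content embeddings.
# The 0.3/0.7 weights are illustrative; the notebook may combine differently.
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def combine(title_emb: np.ndarray, content_emb: np.ndarray,
            title_weight: float = 0.3) -> np.ndarray:
    """Weighted average of normalized title and content vectors."""
    mixed = (title_weight * l2_normalize(title_emb)
             + (1.0 - title_weight) * l2_normalize(content_emb))
    return l2_normalize(mixed)  # renormalize so cosine scores stay comparable

title_emb = np.random.rand(1024)    # stand-ins for real LinkBERT embeddings
content_emb = np.random.rand(1024)
combined = combine(title_emb, content_emb)
```

Normalizing before and after the mix keeps all three modes on the same 0-to-1 cosine scale, so the same similarity threshold remains meaningful regardless of which analysis option you choose.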
Created by
Introduction to Machine Learning for SEOs
This resource is part of a comprehensive course. Access the full curriculum and learning path.
Available in Academy
This resource is available to academy members.