Web Content Clustering: Topic Modeling with BERTopic (Notebook)
Move beyond traditional topic modeling—this Google Colab notebook uses BERTopic to discover semantically coherent themes in your web content through transformer-based embeddings instead of word frequency patterns. Created by Lazarina Stoy for MLforSEO.com, the notebook uses sentence transformers to capture the actual meaning of your content, producing more accurate and interpretable topic clusters than traditional methods like LDA, especially for shorter texts or content with varied vocabulary.
The notebook walks you through the complete BERTopic pipeline: upload a CSV with URLs and content, and the code handles preprocessing (expanding contractions, removing stopwords, lemmatizing), generates transformer-based embeddings using Sentence-BERT, reduces dimensionality with UMAP, and clusters content using HDBSCAN. Unlike bag-of-words approaches, BERTopic understands semantic relationships—meaning “smartphone” and “mobile device” are recognized as related concepts even if they never co-occur. The customizable clustering parameters let you control granularity (minimum cluster size, neighbor relationships), and the interactive intertopic distance map visualizes how topics relate to each other in 2D space. You get two downloadable CSV files: one mapping each URL to its assigned topic, and another summarizing each topic with its most representative terms and document counts.
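To make that flow concrete, here is a minimal sketch of the pipeline described above (not the notebook's exact code). It assumes a CSV named content.csv with url and content columns; the embedding model name and the UMAP/HDBSCAN parameter values are illustrative placeholders.

```python
import pandas as pd
import contractions
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

nltk.download("stopwords")
nltk.download("wordnet")

# Assumed input: a CSV with "url" and "content" columns
df = pd.read_csv("content.csv")

# Light preprocessing: expand contractions, drop stopwords, lemmatize
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = contractions.fix(str(text).lower())
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in stop_words]
    return " ".join(tokens)

docs = df["content"].map(preprocess).tolist()

# Transformer embeddings -> UMAP dimensionality reduction -> HDBSCAN clustering
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)

topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

# Two exports: a URL-to-topic mapping, and a per-topic summary with top terms and counts
df["topic"] = topics
df[["url", "topic"]].to_csv("url_topic_assignments.csv", index=False)
topic_model.get_topic_info().to_csv("topic_summary.csv", index=False)
```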
Use this for:
‧ Discovering semantic content themes that traditional keyword-based methods miss, especially in varied or creative writing
‧ Clustering content with inconsistent terminology where different authors use different words for the same concepts
‧ Analyzing shorter content pieces (product descriptions, social posts, titles) where LDA struggles due to limited word co-occurrence
‧ Creating content taxonomies based on meaning rather than surface-level keyword patterns
‧ Identifying outlier content that doesn’t fit into any clear topic cluster (marked as topic -1; see the sketch after this list)
‧ Understanding topic relationships through interactive 2D visualization showing which themes are similar or distinct
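As a rough illustration of the last two points, and assuming the df and topic_model objects from the pipeline sketch above, you could inspect the unclustered documents and export the intertopic distance map like this:

```python
# Assumes `df` and `topic_model` from the pipeline sketch above.
outliers = df[df["topic"] == -1]
print(f"{len(outliers)} documents were left unclustered (topic -1)")
print(outliers[["url"]].head())

# Interactive 2D intertopic distance map (a Plotly figure); save it for sharing.
fig = topic_model.visualize_topics()
fig.write_html("intertopic_distance_map.html")
```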
This is perfect for content strategists and SEO professionals working with diverse content libraries who need more nuanced topic discovery than word frequency can provide, or anyone dealing with shorter content formats where traditional topic modeling methods fail to capture meaningful themes.
What’s Included
- Transformer-based semantic understanding captures meaning and context rather than just word co-occurrence—producing more accurate topics for modern web content
- Fully customizable clustering pipeline with adjustable UMAP and HDBSCAN parameters to control topic granularity and sensitivity (a parameter-tuning sketch follows this list)
- Interactive intertopic distance map visualizes topic relationships in 2D space, making it easy to understand which themes overlap or diverge
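A hedged example of how those knobs interact (the values are arbitrary starting points, not recommendations): lowering HDBSCAN's min_cluster_size yields more, finer-grained topics, while raising UMAP's n_neighbors emphasizes broader structure.

```python
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# Illustrative re-run with different granularity settings.
umap_model = UMAP(n_neighbors=30, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)               # broader neighborhoods
hdbscan_model = HDBSCAN(min_cluster_size=5, min_samples=2,
                        metric="euclidean", prediction_data=True)  # allow smaller clusters

# `embedding_model` and `docs` come from the earlier pipeline sketch.
topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model)
topics, _ = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head(10))  # compare topic counts and sizes
```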
Created by Lazarina Stoy
Introduction to Machine Learning for SEOs
This resource is part of a comprehensive course. Access the full curriculum and learning path.
This resource is available to academy members.