Coding Script or Notebook
Google Colab
Academy (Access via Course)

Web Content Clustering: Topic Modeling with LDA (Notebook)

This Google Colab notebook makes topic modeling accessible to SEO professionals. It uses Latent Dirichlet Allocation (LDA) to automatically discover hidden themes and content clusters within your website or content corpus. Created by Lazarina Stoy for MLforSEO.com, the template takes raw web content and reveals its underlying topic structure, helping you understand which themes dominate your content library, identify gaps, and organize pages into natural semantic groups.
The notebook guides you through the complete LDA workflow. You upload a CSV file containing URLs and their corresponding content, and the code preprocesses your text (handling contractions, removing stopwords, lemmatizing), generates word clouds for visual exploration, and builds an LDA model that identifies distinct topics within your content. The real power lies in the hyperparameter tuning section: you can test different numbers of topics and adjust the alpha and beta values to maximize the coherence score, so that the resulting topics are more likely to be meaningful and distinct. The interactive pyLDAvis visualization lets you explore topic relationships and see which terms define each topic. The final output assigns each URL to its primary topics with confidence scores, along with bar charts showing the highest-scoring articles for each discovered theme.
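To make the preprocessing step more concrete, here is a minimal standard-library sketch of contraction handling and stopword removal. It is not the notebook's own code: the `CONTRACTIONS` and `STOPWORDS` tables and the `preprocess` function are hypothetical stand-ins, and lemmatization is left out because it normally relies on an NLP library.

```python
import re

# Tiny illustrative lookup tables (assumptions, not the notebook's lists).
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot", "we're": "we are"}
STOPWORDS = {"a", "an", "the", "is", "are", "do", "not", "it", "we",
             "to", "of", "and", "for", "in", "on", "your"}

def preprocess(text):
    """Lowercase, expand contractions, tokenize, and drop stopwords."""
    # Normalize curly apostrophes so the contraction table matches.
    text = text.lower().replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep alphabetic tokens only; punctuation and digits are discarded.
    tokens = re.findall(r"[a-z]+", text)
    # Drop stopwords and very short tokens.
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

print(preprocess("It's the guide to keyword research, and we're updating it."))
# ['guide', 'keyword', 'research', 'updating']
```

The resulting token lists are the kind of input a topic model is trained on, one list per URL.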
Use this for:
‧ Discovering natural content themes across your website or content library without manual categorization
‧ Identifying content gaps by seeing which topics are underrepresented in your corpus
‧ Organizing hundreds of blog posts, articles, or product pages into semantic clusters for better site architecture
‧ Understanding topic overlap and relationships across your content using interactive visualizations
‧ Validating your content strategy by confirming whether your published content aligns with your intended topic coverage
‧ Creating data-driven content hubs by grouping URLs that share similar topic distributions
This is perfect for content strategists, SEO professionals, and digital marketers managing large content libraries who need to understand their content’s thematic structure at scale, identify organizational opportunities, and make data-driven decisions about content planning and site architecture.
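As an illustration of the content hub use case above, the sketch below groups URLs by their dominant topic once a per-URL topic distribution is available. The URLs, the probabilities, the `build_hubs` function, and the `min_share` threshold are all hypothetical; the notebook's own output format may differ.

```python
from collections import defaultdict

def build_hubs(doc_topics, min_share=0.5):
    """Group URLs under their dominant topic when its share reaches min_share."""
    hubs = defaultdict(list)
    for url, dist in doc_topics.items():
        # Find the topic index with the largest probability for this URL.
        topic, share = max(enumerate(dist), key=lambda pair: pair[1])
        # Pages without a clear dominant topic go to a separate "mixed" bucket.
        key = topic if share >= min_share else "mixed"
        hubs[key].append((url, share))
    # Sort each hub so the most representative pages come first.
    return {k: sorted(v, key=lambda item: -item[1]) for k, v in hubs.items()}

# Made-up topic distributions over three topics, for illustration only.
doc_topics = {
    "/blog/keyword-research": [0.82, 0.10, 0.08],
    "/blog/link-building": [0.05, 0.90, 0.05],
    "/blog/seo-basics": [0.40, 0.35, 0.25],
    "/blog/long-tail-keywords": [0.70, 0.20, 0.10],
}
hubs = build_hubs(doc_topics)
for key, pages in hubs.items():
    print(key, pages)
```

Topics that end up with few or no pages in their hub are candidates for content gaps, while the "mixed" bucket highlights pages that span several themes.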

What’s Included

  • Complete LDA pipeline with preprocessing, model training, hyperparameter tuning, and interactive topic visualization using pyLDAvis
  • Automated hyperparameter optimization that tests multiple combinations of topic counts, alpha, and beta values to find the highest coherence score for your specific dataset
  • URL-to-topic assignment with confidence scores, plus bar charts of the top articles per topic, making it easy to see which content best represents each theme
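For readers wondering what a coherence score measures, the following sketch computes one common variant, UMass coherence, from document co-occurrence counts. Which coherence measure the notebook uses is not stated here, so UMass is an assumption chosen because it needs only the standard library; `umass_coherence` and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Average UMass score over the word pairs of one topic.

    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        w_j, w_i = top_words[j], top_words[i]
        # The +1 smoothing avoids log(0) when two words never co-occur.
        scores.append(math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j)))
    return sum(scores) / len(scores)

# Toy tokenized documents, for illustration only.
docs = [
    ["keyword", "research", "volume"],
    ["keyword", "research", "intent"],
    ["backlink", "outreach", "anchor"],
    ["backlink", "anchor", "keyword"],
]
coherent = umass_coherence(["keyword", "research"], docs)
mixed = umass_coherence(["keyword", "outreach"], docs)
print(coherent > mixed)  # True
```

Scores closer to zero indicate top words that tend to appear in the same documents. A tuning loop would compute a score like this for each trained candidate (topic count, alpha, beta) and keep the combination with the highest average across topics.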
Academy Resource

Available in Academy

Introduction to Machine Learning for SEOs

This resource is available to academy members.

Community support
Regular updates