Map Keywords to Topics: Supervised (sBERT + Fuzzy) vs. Unsupervised (BERTopic) (Notebook)
Manually assigning thousands of keywords to topic clusters is time-intensive and inconsistent across team members. This comprehensive Google Colab notebook automates keyword-to-topic mapping using three methodologies, ranging from simple string matching to transformer-based semantic analysis, and supports both supervised classification (when you have predefined topic labels) and unsupervised discovery (when topics need to be extracted from the data itself). Created by Lazarina Stoy for MLforSEO Academy, the workflow provides flexibility for different keyword research scenarios: fuzzy string matching for lightweight approximate matches based on character-level similarity, Sentence-BERT (sBERT) for semantically aware topic assignment using embeddings and cosine similarity, and BERTopic for unsupervised topic modeling that automatically discovers thematic clusters without requiring predefined labels. Each method is progressively more sophisticated but requires more computational resources, so SEO professionals can choose the technique that fits their dataset size, available topic taxonomies, and semantic complexity requirements.
The notebook implements three distinct mapping approaches with clear input/output workflows. The fuzzy matching section uses FuzzyWuzzy’s partial_ratio algorithm to compare keywords against predefined topic labels through character-level similarity scoring—users upload two CSV files (seed.csv containing topic labels or seed keywords, and match.csv containing keywords to be categorized) and the script calculates similarity scores between each match keyword and all seed keywords, identifying the best match and recording the score. This lightweight method works well for scenarios where topics and keywords share similar terminology or when handling typos and slight variations. The output (fuzzy_matching_result.csv) includes three columns: Match Keyword, Best Seed Keyword, and Match Score—enabling filtering by confidence threshold to identify high-certainty assignments versus ambiguous cases requiring manual review.
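To make this step concrete, here is a minimal sketch of the fuzzy-matching pass using FuzzyWuzzy and pandas. The column names in seed.csv and match.csv are assumptions for illustration, not the notebook's exact schema, and the scoring loop is a simplified stand-in for the notebook's implementation.

```python
# Minimal fuzzy-matching sketch (assumed column names, not the notebook's exact code).
import pandas as pd
from fuzzywuzzy import fuzz

seeds = pd.read_csv("seed.csv")["Seed Keyword"].dropna().tolist()   # assumed column name
matches = pd.read_csv("match.csv")["Keyword"].dropna().tolist()     # assumed column name

rows = []
for kw in matches:
    # Score the keyword against every seed and keep the best-scoring seed.
    scored = [(seed, fuzz.partial_ratio(kw.lower(), seed.lower())) for seed in seeds]
    best_seed, best_score = max(scored, key=lambda pair: pair[1])
    rows.append({"Match Keyword": kw, "Best Seed Keyword": best_seed, "Match Score": best_score})

pd.DataFrame(rows).to_csv("fuzzy_matching_result.csv", index=False)
```

Filtering the resulting Match Score column (for example, keeping scores above 80 for automatic assignment) is how the confidence-threshold review described above would typically be applied.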
The sBERT supervised approach uses the all-MiniLM-L6-v2 Sentence-BERT model for semantic similarity-based mapping when you have predefined topic labels. Users upload keywords.csv (containing a Keywords column) and topics.csv (containing a Topics column), and the script generates vector embeddings for both datasets using transformer-based encoding that captures semantic meaning rather than just string similarity. Cosine similarity calculations create a similarity matrix comparing every keyword embedding against every topic embedding, and each keyword gets assigned to its highest-scoring topic. The example demonstrates dog owner keywords (first time dog owner, tips for first time dog owners, best small dogs for first time owners) being mapped to predefined topics like Dog Training and Behavior, Dog Accessories and Equipment, and Dog Breeds and Breed-Specific Care. The output (keyword_topic_mapping.csv) includes the original keywords, assigned topics, and similarity scores—providing semantic matching that understands meaning rather than just lexical overlap, making it superior to fuzzy matching for cases where keywords and topics use different terminology but refer to the same concepts.
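For orientation, the following is a minimal sketch of the sBERT mapping logic using the sentence-transformers library, assuming the Keywords and Topics column names described above; the output column names are illustrative rather than the notebook's exact ones.

```python
# Minimal sBERT supervised-mapping sketch (illustrative, not the notebook's exact code).
import pandas as pd
from sentence_transformers import SentenceTransformer, util

keywords = pd.read_csv("keywords.csv")["Keywords"].dropna().tolist()
topics = pd.read_csv("topics.csv")["Topics"].dropna().tolist()

model = SentenceTransformer("all-MiniLM-L6-v2")
kw_emb = model.encode(keywords, convert_to_tensor=True)
topic_emb = model.encode(topics, convert_to_tensor=True)

# Similarity matrix: one row per keyword, one column per topic.
sim = util.cos_sim(kw_emb, topic_emb)
best = sim.argmax(dim=1)

result = pd.DataFrame({
    "Keyword": keywords,
    "Assigned Topic": [topics[int(i)] for i in best],
    "Similarity Score": [float(sim[row, int(col)]) for row, col in enumerate(best)],
})
result.to_csv("keyword_topic_mapping.csv", index=False)
```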
The BERTopic unsupervised section addresses scenarios where you don’t have predefined topic labels and need the algorithm to discover thematic clusters directly from keyword data. Users upload a single keywords.csv file, and BERTopic applies transformer embeddings combined with clustering algorithms to automatically identify topics based on semantic patterns in the data—eliminating the need for manual topic definition or existing taxonomies. The script outputs two files: keywords_with_topics.csv (showing each keyword with its assigned topic number) and topic_summary.csv (providing topic metadata including count, representative keywords, and topic names). BERTopic handles short-form text effectively through transformer-based representations and includes model persistence (save and reload capabilities for multi-session work) and comprehensive visualization options using Plotly for interactive topic maps, word clouds showing representative terms for each topic, and bar charts displaying topic distribution by keyword count. The visualization section processes the top 50 topics by count for distribution charts and generates word clouds for the top 10 topics to reveal dominant terms and thematic coherence.
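As a rough sketch of this unsupervised step, the snippet below fits a BERTopic model on the keyword list and writes the two output files described above. The min_topic_size value is an illustrative assumption rather than the notebook's tuned setting, and the sketch assumes a reasonably large keyword list (BERTopic's clustering needs enough documents to form stable topics).

```python
# Minimal BERTopic discovery sketch (assumed parameters and file/column names).
import pandas as pd
from bertopic import BERTopic

keywords = pd.read_csv("keywords.csv")["Keywords"].dropna().tolist()

# min_topic_size=5 is an illustrative choice for short keyword phrases.
topic_model = BERTopic(min_topic_size=5)
topic_ids, _ = topic_model.fit_transform(keywords)

pd.DataFrame({"Keyword": keywords, "Topic": topic_ids}).to_csv("keywords_with_topics.csv", index=False)
topic_model.get_topic_info().to_csv("topic_summary.csv", index=False)  # Topic, Count, Name, representative terms

# Persist the fitted model so later sessions can reload it without refitting.
topic_model.save("bertopic_keyword_model")
# reloaded = BERTopic.load("bertopic_keyword_model")
```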
Use this for:
‧ Supervised keyword classification when you have established topic taxonomies or content categories and need to automatically assign new keyword research exports to existing buckets
‧ Semantic topic matching using sBERT when keywords and topic labels use different terminology but refer to the same concepts—capturing meaning rather than just string similarity
‧ Unsupervised topic discovery with BERTopic when exploring new markets, analyzing competitor keyword sets, or working with keyword universes where natural thematic clusters aren’t predefined
‧ Multi-method validation by running the same keyword list through fuzzy matching, sBERT, and BERTopic to compare results and identify where methods agree (high confidence) versus diverge (requires review)
‧ Content architecture planning by using BERTopic to discover natural thematic groupings in search query data, revealing how users conceptually organize topics versus imposed taxonomies
‧ Quality control benchmarking by comparing fuzzy string matching results against sBERT semantic results to understand when lexical similarity misleads versus semantic understanding provides better assignments
‧ Scalable keyword categorization for large datasets (1000+ keywords) using the appropriate method based on available resources—fuzzy matching for quick lightweight processing, sBERT for semantic accuracy with predefined labels, BERTopic for discovery
‧ Topic taxonomy validation using BERTopic’s unsupervised discovery to test whether your predefined topics align with natural semantic clusters in actual keyword data
‧ Similarity score filtering across all three methods to separate high-confidence assignments (implement automatically) from low-scoring matches (manual review required), as shown in the threshold sketch after this list
‧ Iterative topic refinement by starting with BERTopic discovery to identify natural clusters, then using those discovered topics as seed labels for sBERT supervised mapping on larger datasets
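The score-filtering workflow mentioned above amounts to a simple threshold pass over any of the scored outputs. The sketch below triages the sBERT output using the column names assumed in the earlier sBERT sketch; the 0.5 threshold is an illustrative assumption that should be calibrated against a manually reviewed sample.

```python
# Minimal score-based triage sketch (threshold and column names are assumptions).
import pandas as pd

mapping = pd.read_csv("keyword_topic_mapping.csv")
THRESHOLD = 0.5  # tune against a manually reviewed sample

auto_accept = mapping[mapping["Similarity Score"] >= THRESHOLD]
needs_review = mapping[mapping["Similarity Score"] < THRESHOLD]

auto_accept.to_csv("assignments_auto.csv", index=False)
needs_review.to_csv("assignments_review.csv", index=False)
print(f"{len(auto_accept)} auto-accepted, {len(needs_review)} flagged for review")
```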
This is perfect for SEO strategists, content planners, and keyword research specialists managing large keyword universes (500+ keywords) who need flexible, scalable topic assignment methods. It is particularly valuable when transitioning from manual keyword categorization to automated ML-driven approaches, when working with diverse keyword sources that need consistent topic assignment across multiple research exports, when exploring unfamiliar industries where topic discovery is needed before classification, when validating existing topic taxonomies against natural semantic clusters in search data, or when balancing accuracy needs against computational resources by choosing fuzzy matching for speed, sBERT for semantic precision with existing taxonomies, or BERTopic for unsupervised exploration. Everything is implemented with downloadable CSV outputs, similarity scores for confidence assessment, model persistence for long-term projects, and visualizations (word clouds and distribution charts) that make abstract topic assignments tangible.
What’s Included
- Three progressive methods accommodate different scenarios: fuzzy string matching for lightweight character-based matching, sBERT for supervised semantic classification with predefined topics, and BERTopic for unsupervised topic discovery when labels don't exist
- Supervised sBERT approach uses all-MiniLM-L6-v2 transformer embeddings with cosine similarity calculations to assign keywords to topics based on semantic meaning rather than lexical similarity—capturing conceptual relationships even when terminology differs
- Unsupervised BERTopic workflow automatically discovers topics from keyword data without predefined labels, generating topic summaries with representative keywords and keyword counts for data-driven taxonomy creation
- Comprehensive visualization suite includes Plotly interactive topic maps, word clouds for top topics showing dominant terms, and bar charts for topic distribution—plus model save/reload functionality for multi-session analysis
Created by
Lazarina Stoy, as part of the Semantic ML-enabled Keyword Research course (MLforSEO Academy).
This resource is part of a comprehensive course. Access the full curriculum and learning path.
