Coding Script or Notebook
Google Colab
Free (Access via Email)

Calculating Semantically Similar Terms with Fuzzy Matching for Keyword Clusters (Notebook)

Keyword research generates hundreds of variations. This streamlined Google Colab notebook automatically identifies the top 10 most similar keywords for every term in your list using Levenshtein distance-based fuzzy matching. Created by Lazarina Stoy for MLforSEO, it addresses a specific keyword research problem: you have a large keyword universe and need to quickly see which terms are variations of each other for clustering, grouping, or deduplication. Unlike semantic similarity (which requires embeddings and understands meaning), Levenshtein distance measures character-level similarity. That makes it well suited to catching typos, plurals, and modifier differences like “home gym equipment” vs “best home gym equipment” (88% similar in the example), and, to a lesser extent, small word order variations.
The notebook implements a straightforward pairwise comparison workflow. You upload a CSV with a “Keyword” column containing your keyword list, and the algorithm calculates a similarity score for every keyword pair using normalized Levenshtein distance. It counts the minimum number of character edits (insertions, deletions, substitutions) needed to transform one keyword into another, normalizes that count by the length of the longer keyword, and converts the result to a percentage (0-100%), where a higher score means a closer match. For each keyword, the notebook identifies the 10 most similar terms from the rest of the list, ranks them by similarity score, and outputs a CSV with two columns: “Keyword” (the original term) and “Top Similar Keywords” (a formatted list of the 10 closest matches with their percentage scores). The example shows “home gym equipment” matched with variations like “best home gym equipment,” “at home gym equipment,” and “home gym equipment for sale,” all scoring above 80% similarity because they share the core phrase and differ only in minor modifiers.
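To make the workflow above concrete, here is a minimal sketch in plain Python. It is not the notebook's actual code: the function names (`levenshtein`, `similarity`, `top_similar`, `format_matches`) and the output formatting are assumptions made for illustration, and CSV upload and download are left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized by the longer string, as a 0-100 percentage."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


def top_similar(keywords, n=10):
    """For each keyword, return its n closest matches from the rest of the list."""
    results = []
    for i, keyword in enumerate(keywords):
        scored = [
            (other, similarity(keyword, other))
            for j, other in enumerate(keywords)
            if j != i  # skip the keyword itself, but keep exact duplicates
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results.append((keyword, scored[:n]))
    return results


def format_matches(matches):
    """Hypothetical formatting for the 'Top Similar Keywords' column."""
    return ", ".join(f"{kw} ({score:.1f}%)" for kw, score in matches)


# Small in-memory list standing in for the uploaded "Keyword" column.
keywords = [
    "home gym equipment",
    "best home gym equipment",
    "at home gym equipment",
    "yoga mat",
]
for keyword, matches in top_similar(keywords, n=2):
    print(keyword, "->", format_matches(matches))
```

Scores depend on the normalization chosen. Under the longer-length normalization sketched here, “home gym equipment” vs “best home gym equipment” comes out at roughly 78% (5 edits over 23 characters), so figures produced by the notebook may differ. Because every keyword is compared with every other, the work grows quadratically with list size.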
Use this for:
‧ Keyword clustering by automatically grouping similar terms together for content planning or site architecture without manual categorization
‧ Duplicate and near-duplicate detection in keyword research exports from multiple tools (Google Keyword Planner, Ahrefs, SEMrush) that often contain overlapping terms
‧ PPC ad group organization by identifying which keywords should be grouped together based on string similarity for tighter match control
‧ Quality control for keyword lists by finding unintentional duplicates or typo variations that need cleaning before analysis
‧ Long-tail keyword analysis by understanding which longer queries are variations of shorter seed keywords
‧ Variation identification for title tag and meta description optimization—seeing which keyword variants can be naturally combined
‧ Data-driven keyword prioritization by identifying clusters of similar terms that collectively represent search volume opportunities
This is a good fit for SEO professionals, PPC managers, and content strategists managing large keyword universes (100+ terms) who need to quickly understand similarity patterns within their keyword list. It is particularly valuable when combining keyword research from multiple sources, organizing keywords into topic clusters, or deciding which variations should be consolidated versus targeted separately. All of this relies on string-level similarity, with no need for semantic understanding or manual review of every possible keyword pair.

What’s Included

  • Pairwise comparison of all keywords in a single list—for each keyword, returns the top 10 most similar terms with percentage similarity scores
  • Levenshtein distance-based matching captures character-level similarity, making it effective for typos, plurals, and modifier differences, and to a lesser extent small word order variations
  • Simple, focused output: two-column CSV showing each keyword and its 10 closest matches with scores, enabling quick visual review of similarity clusters
  • Normalized percentage scores (0-100%) make similarity intuitive to interpret and threshold for different use cases like strict deduplication (95%+) versus loose clustering (70%+)
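As a rough illustration of how the thresholds in the last bullet might be applied, the sketch below groups keywords whose pairwise score meets a chosen cutoff. This grouping step is an assumption, not something the notebook is described as doing: `group_by_threshold` is a hypothetical helper that links keywords transitively (a simple union-find), and the scores in `pairs` are illustrative values rather than notebook output.

```python
def group_by_threshold(scored_pairs, threshold):
    """Group keywords connected by at least one pair scoring >= threshold.

    scored_pairs: iterable of (keyword_a, keyword_b, score_percent).
    Returns a sorted list of sorted keyword groups.
    """
    parent = {}

    def find(keyword):
        # Register unseen keywords as their own group, then walk to the root.
        parent.setdefault(keyword, keyword)
        while parent[keyword] != keyword:
            parent[keyword] = parent[parent[keyword]]  # path compression
            keyword = parent[keyword]
        return keyword

    for a, b, score in scored_pairs:
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for keyword in parent:
        groups.setdefault(find(keyword), []).append(keyword)
    return sorted(sorted(group) for group in groups.values())


# Illustrative scores only (assumed, not taken from the notebook).
pairs = [
    ("home gym equipment", "home gym equipments", 94.7),
    ("home gym equipment", "at home gym equipment", 85.7),
    ("home gym equipments", "at home gym equipment", 81.0),
    ("home gym equipment", "yoga mat", 17.0),
]

loose = group_by_threshold(pairs, 70)   # loose clustering
strict = group_by_threshold(pairs, 95)  # strict deduplication
print(loose)
print(strict)
```

With these sample scores, the 70% cutoff merges the three “home gym equipment” variants into one cluster, while the 95% cutoff keeps every keyword separate. Transitive linking can chain loosely related terms together at low thresholds, so the resulting groups are worth a quick manual check.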
