Mapping 404 URLs to Live Pages with Fuzzy Matching (Guide)
Manually creating redirects for hundreds of 404 errors is tedious and error-prone—this production-ready Google Colab notebook automates the entire process by matching broken URLs to live ones using four different fuzzy matching algorithms (Levenshtein Distance, RapidFuzz, PolyFuzz with TF-IDF, FuzzyWuzzy) and automatically generating ready-to-implement .htaccess redirect rules. Created by Lazarina Stoy for MLforSEO, this workflow is specifically designed for technical SEO professionals managing large-scale 404 cleanup by finding the most semantically similar live URL for each broken link based on URL structure and path similarity. The quadruple-algorithm approach provides implementation confidence: when multiple algorithms agree on the same match with high similarity scores (75%+), you can implement redirects with minimal manual review; when algorithms diverge, those URLs are flagged for human judgment.
The notebook implements an intelligent, domain-aware matching pipeline with built-in safety validation. You upload two CSV files: 404 URLs (typically from Screaming Frog or server logs showing client errors) and live URLs (current working pages from your sitemap or crawl). The system prompts for your root domain (e.g., https://www.example.com) and automatically validates that all URLs belong to this domain, removing mismatches with user confirmation—preventing accidental cross-domain redirects or processing external broken links. After stripping the root domain for cleaner comparison (comparing “/blog/post-title” instead of full URLs), the notebook runs four independent matching algorithms: Levenshtein Distance (character-level edit distance, good for typos or slight URL variations), RapidFuzz (optimized implementation of fuzzy ratio, very fast for large datasets), PolyFuzz with TF-IDF (document-level similarity, captures semantic path structure), and FuzzyWuzzy (token-based matching, handles word reordering well). Each algorithm produces similarity scores normalized to 0-1 scale, filters matches below 75% threshold (configurable), and generates Apache .htaccess format redirect rules: “Redirect 301 /old-broken-url https://example.com/new-live-url”. The example processes 14,268 broken URLs against 1,773 live URLs, generating four separate CSV files (one per algorithm) with redirect rules ready for implementation.
Use this for:
‧ Bulk 404 cleanup by automatically finding the best redirect target for hundreds or thousands of broken URLs without manual comparison
‧ Post-migration 404 fixes when URL structures change and old indexed URLs return 404s—matching them to new equivalents based on path similarity
‧ Link equity preservation by creating redirects from broken URLs to similar live content rather than serving 404s and losing link value
‧ Implementation confidence validation by comparing results across four algorithms—matches where all methods agree can be implemented immediately
‧ Quality control and prioritization using the similarity threshold (75%) to separate high-confidence redirects from edge cases requiring manual review
‧ .htaccess rule generation that outputs production-ready Apache redirect syntax for immediate implementation or bulk import
‧ Domain safety with automatic validation preventing cross-domain redirect errors or processing of external broken links
This is perfect for technical SEO professionals, site administrators, and webmasters managing large-scale 404 cleanup—particularly valuable after site migrations, content consolidations, URL structure changes, or when inheriting sites with hundreds of accumulated broken links that need systematic redirect mapping rather than manual one-by-one analysis, and when you need algorithmically-validated redirect targets based on URL path similarity to preserve link equity and user experience.
What’s Included
- Quadruple-algorithm validation using Levenshtein Distance, RapidFuzz, PolyFuzz (TF-IDF), and FuzzyWuzzy for comprehensive match confidence across different similarity calculation methods
- Domain-aware validation automatically checks that all URLs belong to specified root domain, removing mismatches with user confirmation to prevent cross-domain redirect errors
- Automatic .htaccess redirect rule generation outputs production-ready Apache "Redirect 301" syntax for immediate implementation—no manual formatting required
- Configurable similarity threshold (default 75%) filters low-confidence matches, generating four separate CSV files (one per algorithm) for comparison and prioritized implementation
Created by
Introduction to Machine Learning for SEOs
This resource is part of a comprehensive course. Access the full curriculum and learning path.
View Full CourseRelated Concepts
Available in Academy
This resource is available to academy members.
Access in Academy