Synthetic vs User Query Classification Notebook with Rule-Based and Machine Learning Detection
Distinguishing AI-generated synthetic queries from natural user searches is critical for keyword research quality. This Google Colab notebook provides three progressive classification approaches (feature extraction, rule-based classification, and machine learning) to automatically identify whether keywords originated from real user searches or from AI tools. Created for MLforSEO lesson practice, it lets SEO professionals filter out synthetic keywords that inflate research data but lack real search volume. It uses 12 linguistic features, including character count, stop word ratio, reading grade complexity, entity density, and technical vocabulary presence, to score queries on a USER vs SYNTHETIC scale.
We have a free-to-use, no-code version of this script as a tool in our Tool library – see the Synthetic vs User-generated Queries Classifier ✨
Section 1: Feature Extractor
The QueryFeatureExtractor class analyzes queries across four dimensions: length features (character count, word count, average word length), lexical features (stop word ratio, question words, unique word ratio), semantic features (entity count via spaCy, entity density, Flesch-Kincaid reading grade), and structural features (question marks, prepositions, proper nouns, capitalization). In the example outputs, a user query like “best running shoes” measures 18 characters and 3 words, while a synthetic query like “running shoes biomechanical support features” measures 44 characters and 5 words, with higher reading complexity.
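As a rough illustration of the length and lexical features described above, here is a minimal sketch in plain Python. It is not the notebook's QueryFeatureExtractor: the function name `extract_basic_features` and the small word lists are assumptions made for this example, and the spaCy entity features and the Flesch-Kincaid reading grade are left out so the snippet runs without extra dependencies.

```python
import re

# Small illustrative word lists (assumptions; the notebook likely uses fuller lists).
STOP_WORDS = {"a", "an", "the", "for", "to", "of", "in", "on", "near", "me", "is", "and", "with", "my"}
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


def extract_basic_features(query: str) -> dict:
    """Return length, lexical, and simple structural features for one query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    word_count = len(words)
    divisor = max(word_count, 1)  # avoid division by zero on empty queries
    return {
        "char_count": len(query),
        "word_count": word_count,
        "avg_word_length": sum(len(w) for w in words) / divisor,
        "stop_word_ratio": sum(w in STOP_WORDS for w in words) / divisor,
        "unique_word_ratio": len(set(words)) / divisor,
        "has_question_word": any(w in QUESTION_WORDS for w in words),
        "has_question_mark": "?" in query,
    }


for q in ["best running shoes", "running shoes biomechanical support features"]:
    features = extract_basic_features(q)
    print(q, "->", features["char_count"], "characters,", features["word_count"], "words")
```

The character and word counts for the two example queries match the figures quoted above (18 characters and 3 words, 44 characters and 5 words). The ratio features depend on the word lists used, so they may differ from the notebook's values.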
Section 2: Rule-Based Classification
The QueryClassifier implements seven weighted rules based on research findings: queries of 5+ words (synthetic indicator), low stop word ratio <0.2 (synthetic), high reading grade >10 (synthetic, weight: 2), high entity density >0.35 (synthetic, weight: 2), technical vocabulary presence (domain-specific terms like biomechanical, methodology, physiological), absence of question words in multi-word queries (synthetic), and multiple proper nouns (synthetic). Classification uses a threshold: a total score of ≥4 out of a maximum of 8.5 marks a query as SYNTHETIC, and every result is reported with a confidence value. In the example results, “best pizza near me” classifies as USER (0.06 confidence) while “neapolitan pizza dough fermentation techniques” classifies as SYNTHETIC (0.65 confidence).
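The weighted-rule scoring described above can be sketched roughly as follows. Only the thresholds, the two weights of 2, the ≥4 cutoff, and the 8.5 maximum come from the description; the remaining weights are assumptions chosen so that the total is 8.5, and computing confidence as score divided by 8.5 is an inference (it happens to reproduce 0.06 and 0.65 for scores of 0.5 and 5.5). The feature dictionaries are hand-written for illustration, not output from the notebook.

```python
MAX_SCORE = 8.5   # stated maximum
THRESHOLD = 4.0   # stated cutoff for SYNTHETIC

# (reason, weight, test on the feature dict). Weights other than the two 2.0
# values are assumptions chosen so the weights sum to MAX_SCORE.
RULES = [
    ("Long query", 1.0, lambda f: f["word_count"] >= 5),
    ("Low stop word usage", 1.0, lambda f: f["stop_word_ratio"] < 0.2),
    ("High reading complexity", 2.0, lambda f: f["reading_grade"] > 10),
    ("High entity density", 2.0, lambda f: f["entity_density"] > 0.35),
    ("Technical vocabulary", 1.0, lambda f: f["has_technical_vocab"]),
    ("No question words", 0.5,
     lambda f: f["word_count"] > 1 and not f["has_question_word"]),
    ("Multiple proper nouns", 1.0, lambda f: f["proper_noun_count"] >= 2),
]


def classify(features: dict) -> dict:
    """Sum the weights of all rules that fire and compare to the threshold."""
    fired = [(reason, weight) for reason, weight, test in RULES if test(features)]
    score = sum(weight for _, weight in fired)
    return {
        "label": "SYNTHETIC" if score >= THRESHOLD else "USER",
        "score": score,
        "confidence": round(score / MAX_SCORE, 2),  # assumed formula
        "reasons": "; ".join(reason for reason, _ in fired),
    }


# Hypothetical feature values for two queries (hand-written, not computed).
short_query = {"word_count": 4, "stop_word_ratio": 0.5, "reading_grade": 2.0,
               "entity_density": 0.0, "has_technical_vocab": False,
               "has_question_word": False, "proper_noun_count": 0}
long_query = {"word_count": 5, "stop_word_ratio": 0.0, "reading_grade": 14.0,
              "entity_density": 0.0, "has_technical_vocab": True,
              "has_question_word": False, "proper_noun_count": 0}

print(classify(short_query))
print(classify(long_query))
```

Keeping each rule as a (reason, weight, test) entry makes the explanation string fall out of the same data that drives the score, which is one simple way to get explainable output from a rule-based classifier.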
Section 3: Machine Learning Classification
The MLQueryClassifier uses RandomForestClassifier trained on labeled query datasets (0=USER, 1=SYNTHETIC). The implementation includes training data preparation converting feature dictionaries to arrays, model training with 80/20 train-test split, performance evaluation via classification reports and confusion matrices, and prediction with probability scores. Feature importance analysis reveals top predictive features: char_count (24.3%), stop_word_ratio (22.1%), avg_word_length (17.2%), and reading_grade (13.4%). The KeywordAnalyzer wrapper applies classification to keyword CSV files, generating insights showing USER vs SYNTHETIC distribution percentages and average search volumes for strategic filtering decisions.
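The training flow described above (feature dictionaries to arrays, an 80/20 split, evaluation, probability scores, feature importances) might look roughly like this with scikit-learn. This is a sketch, not the notebook's MLQueryClassifier: the helper names `to_matrix` and `toy_rows`, the four-feature subset, and the randomly generated placeholder data are all assumptions, so the printed metrics and importances say nothing about real queries.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# A subset of the features, used here only to keep the example short.
FEATURE_NAMES = ["char_count", "stop_word_ratio", "avg_word_length", "reading_grade"]


def to_matrix(feature_dicts):
    """Convert a list of feature dictionaries into a 2D float array."""
    return np.array([[d[name] for name in FEATURE_NAMES] for d in feature_dicts],
                    dtype=float)


rng = np.random.default_rng(0)


def toy_rows(n, low, high):
    """Generate n placeholder feature dicts, each feature uniform in [low, high)."""
    values = rng.uniform(low, high, size=(n, len(FEATURE_NAMES)))
    return [dict(zip(FEATURE_NAMES, row)) for row in values]


# Placeholder data only: random numbers standing in for labeled queries.
user_rows = toy_rows(20, [12, 0.20, 3.5, 1], [28, 0.60, 5.5, 8])
synthetic_rows = toy_rows(20, [35, 0.00, 6.0, 10], [60, 0.15, 9.0, 16])

X = to_matrix(user_rows + synthetic_rows)
y = np.array([0] * 20 + [1] * 20)  # 0 = USER, 1 = SYNTHETIC

# 80/20 train-test split, stratified so both classes appear in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
proba = model.predict_proba(X_test)  # columns: P(USER), P(SYNTHETIC)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

print(classification_report(y_test, y_pred, labels=[0, 1],
                            target_names=["USER", "SYNTHETIC"]))
print(cm)

# Feature importances, highest first.
for name, importance in sorted(zip(FEATURE_NAMES, model.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")
```

With real labeled queries, `user_rows` and `synthetic_rows` would be replaced by the extractor's output for each labeled keyword, and the full feature set would be used instead of this four-feature subset.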
Use this for:
‧ Keyword research quality control by filtering AI-generated synthetic queries that contaminate datasets with terms nobody actually searches
‧ Content gap validation ensuring identified opportunities reflect real user language patterns rather than AI-generated variations
‧ Search volume reliability assessment by separating authentic queries with measurable demand from synthetic expansions lacking real traffic
‧ Training data creation for custom classifiers by understanding which linguistic features distinguish natural searches from AI generations
‧ Competitive keyword analysis by identifying whether competitors target synthetic long-tail variations versus actual user queries
This is a good fit for SEO analysts, keyword researchers, and data scientists working with large keyword datasets (1000+ terms) who need automated synthetic query detection. It is particularly valuable when using AI-powered keyword expansion tools that generate variations lacking real search demand, when auditing keyword research quality before content planning, or when building custom classification models trained on industry-specific query patterns.
What’s Included
- Three progressive approaches from basic feature extraction through rule-based classification (7 weighted rules with confidence scoring) to supervised machine learning with Random Forest
- 12 linguistic features across length, lexical, semantic, and structural dimensions including stop word ratios, reading grade complexity, entity density, and technical vocabulary detection
- Rule-based threshold system scores queries 0-8.5 with classification at ≥4 for SYNTHETIC, providing explainable reasons like "Long query; Low stop word usage; High reading complexity"
- Batch keyword analysis workflow processes CSV files returning USER/SYNTHETIC classifications with confidence scores and aggregate insights showing distribution percentages for strategic filtering
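The batch workflow in the last item can be sketched with the standard library alone. Everything here is illustrative: the column names `keyword` and `search_volume`, the in-memory sample rows and their volumes, and the `placeholder_classify` stand-in (a word-count heuristic) are assumptions, not the notebook's KeywordAnalyzer or its data.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for an exported keyword CSV file.
SAMPLE_CSV = """keyword,search_volume
best pizza near me,1000
running shoes,500
neapolitan pizza dough fermentation techniques,10
running shoes biomechanical support features,0
"""


def placeholder_classify(keyword: str) -> str:
    """Stand-in for a real classifier: flags queries of 5+ words as SYNTHETIC."""
    return "SYNTHETIC" if len(keyword.split()) >= 5 else "USER"


def summarize(csv_file, classify) -> dict:
    """Classify each keyword, then report share and average volume per label."""
    volumes = defaultdict(list)
    for row in csv.DictReader(csv_file):
        volumes[classify(row["keyword"])].append(float(row["search_volume"]))
    total = sum(len(v) for v in volumes.values())
    if total == 0:
        return {}
    return {
        label: {
            "share_pct": round(100 * len(v) / total, 1),
            "avg_search_volume": sum(v) / len(v),
        }
        for label, v in volumes.items()
    }


summary = summarize(io.StringIO(SAMPLE_CSV), placeholder_classify)
for label, stats in summary.items():
    print(label, stats)
```

To run this against a real export, `io.StringIO(SAMPLE_CSV)` would be swapped for an open file handle and `placeholder_classify` for whichever classifier is in use. Comparing average search volume per label is what supports the filtering decision.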
Created by
Semantic ML-enabled Keyword Research
This resource is part of a comprehensive course. Access the full curriculum and learning path.