Keyword research doesn’t end when you have a large list of queries. In modern SEO, the real value comes from how that keyword data is structured, labelled, and analysed. Without categorisation, even the most comprehensive keyword universe stays hard to interpret, prioritise, and activate.
This post covers the intro-level approaches to organising keyword data using a combination of rule-based automation and machine-learning-enabled methods. The goal isn’t to build complex models — it’s to create meaningful labels that reveal patterns in search behaviour, uncover optimisation opportunities, and support smarter marketing decisions. The full implementation — the specific regex pattern libraries, the ML-enabled clustering workflows, the practical lab on organising a database and applying categorisation end-to-end at scale — is its own substantial body of work covered in depth elsewhere.
Why keyword categorisation matters in semantic SEO
Traditional keyword research often treats queries as isolated ranking opportunities. Semantic analysis takes a different view, treating keywords as signals of user needs, intent evolution, and contextual behaviour across sessions and platforms.
Categorising keywords makes large datasets easier to interpret and act on. Labelling helps surface patterns, identify opportunities to optimise content for specific audiences, tailor strategies to different stages of the buyer journey, and support prioritisation using metrics like search volume and competition. Most importantly, it translates raw keyword data into insights that inform content, UX, and growth decisions.
The goal isn’t to label everything perfectly from day one. It’s to build a system that improves over time. As labels are refined, unclassified queries are reduced, and new rules or models are introduced, the keyword universe becomes progressively clearer and more useful.
There’s a practical observation worth flagging up front: well-categorised keyword data compounds in value, while uncategorised keyword data depreciates. A 10,000-keyword universe with rich categorisation is useful for a year or more — every team can query it from their angle, every refresh adds incremental insight. A 10,000-keyword universe with no categorisation is read once, used in one report, and never opened again. The categorisation work is what makes the difference between a one-off deliverable and a long-term strategic asset.
Branded vs. non-branded keyword segmentation
The first and most fundamental layer of categorisation is the distinction between branded and non-branded keywords. This separation is the foundation of almost all meaningful keyword analysis — it clarifies whether search demand is driven by existing brand awareness or by broader, generic market interest.
Separating branded from non-branded queries lets you track brand equity in search performance more accurately. Branded searches reflect how visible and memorable a brand is to users. Changes in branded demand often signal shifts in brand awareness, trust, or offline activity rather than SEO performance per se.
This distinction also isolates high-value branded traffic from non-branded growth opportunities. Branded keywords convert differently and are usually easier to rank for. Non-branded queries reveal where a brand can expand reach and capture new demand. Without this separation, it’s easy to overestimate organic performance by conflating brand strength with genuine market growth.
Comparing brand-driven searches against generic market demand also enables more realistic benchmarking, clearer competitive analysis, and better-informed strategic decisions across content, acquisition, and measurement.
Rule-based classification in Google Sheets
A simple REGEXMATCH formula can automatically classify keywords by whether they contain identifiable brand signals. At its most basic, this means checking each query for your brand name, known brand variations, common misspellings, and competitor brand names or their variations.
Rule-based classification is fast, effective, and accessible — it works at scale directly in Google Sheets, even on large keyword datasets. Build the regex once, apply it to a column, get a labelled dataset.
This works particularly well combined with existing brand knowledge. Data from Google Search Console, historical keyword analysis, or internal naming conventions builds a more comprehensive list of brand variants. It won’t be perfect, but it gives you a strong foundation that can be refined with additional rules or manual adjustments later.
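To make the rule concrete: in Sheets this is typically something like `=IF(REGEXMATCH(LOWER(A2), "acme|rivalco"), "Branded", "Non-branded")`. The same logic ports directly to Python for larger exports. The sketch below uses hypothetical brand names (`acme`, `rivalco`, plus misspellings) — substitute your own brand and competitor variants.

```python
import re

# Hypothetical brand variants: your brand, common misspellings, and competitors.
BRAND_PATTERN = re.compile(
    r"\b(acme|acmee|ackme|rivalco|rival co)\b",
    flags=re.IGNORECASE,
)

def classify_brand(keyword: str) -> str:
    """Label a query as branded or non-branded based on brand-signal regex."""
    return "branded" if BRAND_PATTERN.search(keyword) else "non-branded"

keywords = ["acme running shoes", "best trail shoes", "rivalco discount code"]
labels = {kw: classify_brand(kw) for kw in keywords}
```

Because alternation is evaluated left to right with backtracking, misspellings like `acmee` still match correctly even though `acme` appears first in the pattern.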
Extracting brand names from keyword lists
Branded keyword classification often needs additional rules or manual adjustments — particularly when competitor brand names aren’t already known. In those cases, brand signals can be extracted directly from the keyword list itself using two complementary approaches.
- LLM-assisted extraction. For smaller datasets, prompting ChatGPT, Claude, or Gemini with a list of keywords and asking for a regex-ready list of recognisable brand names works well. Quick to implement and effective when the keyword universe is limited and easy to review.
- Entity-based extraction. For larger or more complex datasets, entity analysis is the more scalable solution. Run entity extraction on the keyword list, filter results by entity type (specifically Organisation), and generate a deduplicated list of brand names directly from the keyword universe. Join those entities into a single regex string and use it for branded vs. non-branded classification at scale.
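The join-into-a-regex step can be sketched as follows, assuming the entity extraction and Organisation-type filtering have already produced a deduplicated list (the names here are hypothetical). Sorting longest-first matters: it prevents a shorter brand name from shadowing a longer one that contains it.

```python
import re

# Hypothetical output of entity extraction, filtered to type "Organisation"
# and deduplicated.
extracted_brands = ["Acme", "RivalCo", "Example Corp"]

# Escape each name and join into a single alternation pattern,
# longest names first so multi-word brands match before their fragments.
pattern = re.compile(
    r"\b("
    + "|".join(
        re.escape(b.lower())
        for b in sorted(extracted_brands, key=len, reverse=True)
    )
    + r")\b",
    flags=re.IGNORECASE,
)

def is_branded(keyword: str) -> bool:
    """Branded vs. non-branded check against the generated pattern."""
    return bool(pattern.search(keyword.lower()))
```

The word boundaries (`\b`) keep partial matches out — `corporate pricing` does not count as a hit for `Example Corp`.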
Both approaches introduce errors. Some non-branded queries get classified as branded; some genuine branded searches get missed. Sub-brands or product lines often need to be added manually because they may represent branded intent without explicitly containing the main brand name.
Automation reduces manual effort significantly, but careful review is still essential at this stage to maintain accuracy.
Categorising keywords by search intent
Search intent categorisation ensures content aligns with user motivation. By understanding whether a query is informational, commercial, transactional, navigational, or localised, you can design content that matches user expectations and conversion potential.

Sources of intent data
Intent labels can come from:
- SEO tools that provide intent labels out of the box (Semrush, Ahrefs)
- API services that return intent at scale
- Rule-based classification using keyword patterns and SERP features
Tools like Semrush use a mixed approach combining keyword terms with observed SERP features. Their labels are decent starting points but often need refinement for vertical-specific nuance.
Rule-based intent classification
A basic Google Sheets formula assigns intent based on keyword modifiers:
- “how,” “why,” “guide,” “tips” → informational
- “best,” “compare,” “review” → commercial
- “buy,” “order,” “discount” → transactional
- “near me,” “address,” “hours” → localised or navigational
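The modifier rules above translate into a first-match classifier like the sketch below. The patterns are illustrative starting points, not a complete taxonomy — rule order matters, since a query like "best place to buy" stops at the first matching rule.

```python
import re

# Modifier patterns per intent, mirroring the rules above; extend per vertical.
INTENT_RULES = [
    ("informational", r"\b(how|why|what|guide|tips|tutorial)\b"),
    ("commercial", r"\b(best|compare|comparison|review|vs|top)\b"),
    ("transactional", r"\b(buy|order|discount|price|cheap|coupon)\b"),
    ("local/navigational", r"\b(near me|address|hours|directions|login)\b"),
]

def classify_intent(keyword: str) -> str:
    """Assign an intent label from the first matching modifier rule."""
    kw = keyword.lower()
    for label, pattern in INTENT_RULES:
        if re.search(pattern, kw):
            return label
    return "unclassified"  # feed these into the next manual-review pass
```

Everything that falls through to "unclassified" becomes the input for the review loop described below: inspect, find new patterns, add rules, re-run.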
For more on the full taxonomy of intent — including hybrid categories, implicit intent, and micro-intents — see the post on search intent in semantic keyword research.
Reducing unclassified queries
The objective of intent categorisation is to reduce unclassified keywords over time. Rather than aiming for perfect coverage from the outset, the focus should be on building a system that improves with each iteration as more patterns are discovered and incorporated.
Each pass through the data should involve manual review of remaining unclassified queries. This review surfaces recurring language patterns, modifiers, or structural similarities the initial rules didn’t capture. Translate those patterns into additional rules or more granular intent subcategories, and coverage gradually expands.
As the dataset matures, you can move beyond purely rule-based approaches. Building a custom classification model — using an LLM with few-shot examples, or training a Sentence-BERT classifier on labelled data — can automate intent detection at scale while accounting for more nuanced language use. Effective intent classification is iterative; accuracy and coverage improve over time rather than being achieved in a single pass.
Short-tail vs. long-tail keyword classification
Distinguishing between short-tail and long-tail keywords gives you a high-level overview of the keyword universe and supports strategic balance.
Short-tail keywords typically offer higher search volume, broader reach, and greater competition. Long-tail keywords offer lower competition, higher specificity, and stronger conversion potential.
The conventional rule: keywords with two terms or fewer are short-tail; longer queries are long-tail. With more conversational and AI-influenced search behaviour, there’s value in introducing additional categories — very long-tail or highly contextual queries — when the data allows. This becomes especially relevant in AI search, where users are interacting with much longer, more conversational queries than the traditional Google search box ever encouraged.
A simple Google Sheets formula counts the number of words in each keyword and assigns the label. Combined with search volume and intent data, this classification supports balanced content strategies — making sure you’re investing across the volume curve rather than only chasing short-tail head terms or only working long-tail conversion plays.
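In Sheets, one possible version is `=IF(COUNTA(SPLIT(A2, " "))<=2, "Short-tail", "Long-tail")`. The Python equivalent below adds the "very long-tail" bucket discussed above; the two-word and four-word thresholds are conventional choices, not fixed rules — tune them to your data.

```python
def tail_label(keyword: str, short_max: int = 2, long_max: int = 4) -> str:
    """Label by word count: <=2 short-tail, 3-4 long-tail, 5+ very long-tail."""
    n = len(keyword.split())
    if n <= short_max:
        return "short-tail"
    if n <= long_max:
        return "long-tail"
    return "very long-tail"
```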
Keyword clustering and n-gram analysis
Keyword clustering and n-gram analysis reveal thematic patterns within a keyword dataset. Rather than analysing queries in isolation, these techniques surface how terms relate to one another and which concepts consistently appear together across the universe.
Clustering identifies recurring semantic themes that are difficult to detect in a flat list. It highlights dominant keyword combinations and shows how related queries naturally group under shared concepts, topics, or entities.
These insights are especially useful for topic modelling and content planning. Understanding how keywords cluster lets teams avoid fragmented content strategies and make structured decisions about how topics should be organised, prioritised, and expanded.
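The n-gram side of this needs nothing beyond the standard library — a minimal sketch for surfacing dominant keyword combinations across a list:

```python
from collections import Counter
from itertools import islice

def ngrams(text: str, n: int):
    """Yield the n-grams of a keyword as tuples of lowercased words."""
    words = text.lower().split()
    return zip(*(islice(words, i, None) for i in range(n)))

def top_ngrams(keywords: list[str], n: int = 2, k: int = 10):
    """Count the most frequent n-grams across a keyword list."""
    counts = Counter(g for kw in keywords for g in ngrams(kw, n))
    return counts.most_common(k)
```

Run with `n=2` for bigrams and `n=3` for trigrams; the dominant combinations that surface are the shared concepts around which clusters naturally form.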
When to apply clustering
Clustering can be applied to:
- The full keyword universe
- Ranked keywords (your existing performance set)
- Competitor keyword sets
- Selected subsets for deeper analysis
The objective isn’t clustering for its own sake. It’s understanding which semantic combinations appear most frequently and where opportunities exist.
Topic-based keyword clustering
Topic-based keyword clustering provides a higher-level semantic structure by grouping keywords into broader subject areas. Instead of focusing on individual queries, this organises the universe around themes — making it easier to understand how different searches relate to one another within a wider topic space.
Both supervised and unsupervised methods work here, including string fuzzy matching, Sentence-BERT, k-means clustering, and BERTopic. When predefined topic labels already exist, supervised methods can map keywords directly to those topics. When no topic list is available, unsupervised approaches like BERTopic can automatically generate topic groupings from the data itself.
Sentence-BERT is particularly well-suited because it’s trained on short-form text, making it effective for keywords and search queries. Unlike string-based methods, it captures semantic meaning rather than surface-level similarity — letting it map keywords to broader or more generic topic labels accurately even when the keywords share no words with the label.
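The supervised mapping step reduces to a nearest-neighbour lookup in embedding space. The sketch below assumes the embeddings have already been computed (e.g., with a Sentence-BERT model via the sentence-transformers library) and shows only the cosine-similarity assignment; the toy vectors in the test are stand-ins for real embeddings.

```python
import numpy as np

def assign_topics(
    keyword_vecs: np.ndarray, topic_vecs: np.ndarray, topics: list[str]
) -> list[str]:
    """Map each keyword embedding to its nearest topic label by cosine similarity."""
    # L2-normalise both sets so the dot product equals cosine similarity.
    kw = keyword_vecs / np.linalg.norm(keyword_vecs, axis=1, keepdims=True)
    tp = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    sims = kw @ tp.T  # (n_keywords, n_topics) similarity matrix
    return [topics[i] for i in sims.argmax(axis=1)]
```

In practice you would also keep the similarity score itself and route low-confidence assignments (below a chosen threshold) into a manual-review or "unclassified" bucket rather than forcing every keyword onto its nearest topic.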
A practical note: choosing between supervised and unsupervised clustering depends on what you’re trying to do. Supervised works well when you already have a topic taxonomy that mirrors your site structure and want to map keywords onto it. Unsupervised works better when you don’t know what the topics are yet and want the data to surface them. Most mature keyword programmes use both, in sequence — unsupervised first to discover the natural topic structure of the data, then supervised to map subsequent batches of keywords against the discovered taxonomy.
Entity-based keyword categorisation
Entity-based keyword categorisation connects keywords to real-world concepts that search engines recognise. Moving beyond surface-level terms and focusing on entities aligns keyword analysis with how modern search systems interpret meaning and relationships.
Grouping keywords by entities enables several valuable forms of analysis. You can assess the relative importance of different entities across the dataset, identify which entities drive the greatest share of search demand, and understand which entities and attributes frequently co-occur within queries. This connects keyword analysis directly to knowledge graph concepts and semantic search behaviour.
Using entity extraction, each keyword is associated with a set of entities that appear within it. These divide into a core entity and one or more sub-entities. The core entity is identified by salience — the importance of the entity within the context of the keyword. The entity with the highest salience score is treated as the primary or core entity for that query.
This distinction enables deeper, more flexible analysis. A single keyword might include a content format as one entity (e.g., “guide”) and a user type as another (e.g., “beginner”). Understanding how entities interact within queries supports advanced segmentation — keywords analysed not just by topic or intent but by the real-world concepts they reference.
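The core/sub-entity split is a simple sort on salience once extraction has run. The record shape below is a simplified, hypothetical version of what entity extraction APIs typically return — real responses carry more fields.

```python
def core_entity(entities: list[dict]) -> tuple[str, list[str]]:
    """Split extracted entities into the core entity (highest salience) and sub-entities."""
    ranked = sorted(entities, key=lambda e: e["salience"], reverse=True)
    return ranked[0]["name"], [e["name"] for e in ranked[1:]]

# Hypothetical extraction result for "beginner guide to trail running"
entities = [
    {"name": "trail running", "salience": 0.62},
    {"name": "guide", "salience": 0.23},
    {"name": "beginner", "salience": 0.15},
]
```

Here "trail running" becomes the core entity, while "guide" (a content format) and "beginner" (a user type) become sub-entity attributes — exactly the kind of segmentation dimensions discussed next.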
SERP feature categorisation
SERP feature categorisation provides context about how search engines interpret and respond to user intent. SERP features — featured snippets, videos, image packs, local results — act as signals of what Google believes is the most appropriate way to satisfy a particular query.
Identifying which SERP features appear for different keywords lets you map content opportunities at scale. These features help infer preferred content formats and reveal patterns in how search engines present information — visual elements, short answers, richer interactive results.
SERP feature data typically comes from SEO tools or SERP scraping APIs (DataForSEO, SerpAPI). Once available, it can be incorporated into the keyword dataset as an additional categorisation layer, enabling more informed analysis and supporting content strategies that align with observed search behaviour.
In the AI search era, SERP feature categorisation has expanded to include AI Overviews. Tracking which queries trigger AI Overviews and which sources get cited within them is now a meaningful signal — it tells you both where Google is confident enough to synthesise an answer and which sources are winning the citation game in your space. Treating “triggers AI Overview” as a keyword tag in your universe lets you slice your data in ways that weren’t possible eighteen months ago — which clusters are most AI-mediated, which still drive clicks, where the visibility versus traffic trade-off is sharpest.
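Once SERP feature tags sit on each keyword record, slicing by cluster is straightforward. A minimal sketch, using hypothetical records and an `ai_overview` tag, of the "which clusters are most AI-mediated" question:

```python
from collections import defaultdict

# Hypothetical keyword records: each row carries its cluster and SERP feature tags.
rows = [
    {"keyword": "what is trail running", "cluster": "trail running",
     "features": {"ai_overview", "featured_snippet"}},
    {"keyword": "trail running shoes sale", "cluster": "trail running",
     "features": {"shopping"}},
    {"keyword": "marathon training plan", "cluster": "marathon",
     "features": {"ai_overview"}},
]

def ai_overview_share(rows):
    """Share of queries per cluster that trigger an AI Overview."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r["cluster"]] += 1
        hits[r["cluster"]] += "ai_overview" in r["features"]
    return {c: hits[c] / totals[c] for c in totals}
```

The same pattern extends to any feature tag — featured snippets, video packs, local results — giving you per-cluster format preferences from the same dataset.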
Content type, platform, and content depth
Beyond intent, keywords can be categorised by the type of content users expect. This clarifies not just why a user is searching but how they prefer that information delivered.
Content type and platform signals come from several sources. SERP features act as generic indicators of preferred content formats. Explicit query modifiers reveal whether users are looking for specific formats or platforms. Rule-based regex matching detects these signals at scale, letting you group keywords by recurring patterns in the query language.
Content type indicators include terms suggesting video content, short-form formats, platform-specific searches, or downloadable resources like PDFs and guides. Applying these labels aligns content strategy with both user preferences and organisational capabilities — making it easier to assess where opportunities exist and whether they’re feasible to pursue.
Keywords also indicate desired depth. Modifiers like “beginner,” “basic,” “introduction,” “advanced,” or “expert” signal the level of knowledge users expect content to address. Paired with other dimensions like personas or intent, depth labels help prioritise which educational levels to focus on and highlight gaps in existing coverage.
User persona categorisation
User personas can often be inferred directly from the language of search queries. The words, modifiers, and phrasing within a keyword signal underlying motivations, preferences, or constraints — making persona-based categorisation a useful extension of intent analysis.
Linking keywords to user personas uncovers alternative targeting opportunities and provides clearer insight into audience expectations. It also supports better alignment between messaging, offers, and content — making sure what’s created reflects the needs and priorities implied by the search behaviour.
Use rule-based matching for traits like affluent, budget-conscious, innovator, family-oriented, adventurer, DIY enthusiast. These categories are derived from observed query patterns and refined through manual review. Automation makes them scalable, but a solid understanding of the market is essential to make sure persona definitions are accurate, relevant, and meaningful.
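A sketch of the rule-based persona matcher — the patterns here are illustrative, and as noted above, real persona definitions have to come from market knowledge, not guesswork. Unlike intent, persona tagging is naturally multi-label: one query can signal several personas at once.

```python
import re

# Illustrative persona patterns; refine against observed query language.
PERSONA_RULES = {
    "budget-conscious": r"\b(cheap|affordable|budget|discount|under \$?\d+)\b",
    "DIY enthusiast": r"\b(diy|homemade|how to make|build your own)\b",
    "family-oriented": r"\b(kids|family|children|toddler)\b",
}

def tag_personas(keyword: str) -> list[str]:
    """Return every persona whose pattern matches the query (multi-label)."""
    kw = keyword.lower()
    return [p for p, pat in PERSONA_RULES.items() if re.search(pat, kw)]
```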
In an AI search context, persona signals have become more important — not less. Personalised query fan-out (which I’ve written about extensively on iPullRank) means AI systems are inferring user persona from limited signals and shaping their fan-out queries around those inferences. Content that’s clearly tagged for specific personas in your own analysis is content that’s more likely to align with the personas AI systems are inferring on the user’s side. The match between your persona tagging and the system’s persona inference is what determines whether your content surfaces for the right audience.
Visualising the categorised keyword universe
Once keywords are categorised, visualisation becomes possible — and that’s where patterns become decisions. When keyword data is structured with consistent labels, even simple dashboards can reveal insights difficult to detect in a raw list.
Using a straightforward Looker Studio dashboard, you can visualise:
- Branded vs. non-branded distribution
- Short-tail vs. long-tail
- Search intent breakdowns
- Content format and depth preferences
- User persona representation
- Relative importance of topics and entities
- N-gram and bigram dominance
These insights can be generated from a basic keyword list with search volume as the primary metric — demonstrating how much analytical value categorisation adds without requiring additional data sources.
How keyword categorisation enables smarter SEO decisions
Keyword categorisation transforms a flat list of queries into a structured semantic dataset. Layering labels — intent, topic, entity, persona, content format — makes the data multidimensional and far more useful for strategy. Instead of treating keywords as isolated terms, this approach reveals how dimensions of search demand intersect and where meaningful opportunities exist.
With simple formulas, light automation, and iterative refinement, you can prioritise opportunities, identify coverage gaps, and align keyword strategy with business goals. Even a basic keyword list, properly categorised, supports informed decisions across content, marketing, and growth.
Keyword categorisation isn’t a one-off task. As additional data sources are incorporated — user paths, query refinements, autocomplete data, AI Overview citation data — the keyword universe should evolve. Each iteration improves clarity, reduces ambiguity, and strengthens the connection between observed search behaviour and real outcomes. The teams that treat categorisation as ongoing operational work build keyword universes that stay sharp for years. The teams that treat it as a one-off setup end up with universes that drift out of relevance within months.
Continue your learning (MLforSEO)
This post covered the categorisation methods that turn a flat keyword list into a structured semantic universe. The full implementation — including the Google Sheets templates, regex pattern libraries, ML-enabled clustering workflows, AutoML approaches to building custom classifiers, the integration with SERP and entity data, and the practical lab on organising a database and applying categorisation end-to-end — is in the Semantic AI-Powered SEO Keyword Research course on MLforSEO. The course connects categorisation directly to the broader semantic concepts (intent, entities, query refinement, user journey mapping) that together turn keyword data into operational decisions.
Lazarina Stoy is a Digital Marketing Consultant with expertise in SEO, Machine Learning, and Data Science, and the founder of MLforSEO. Lazarina’s expertise lies in integrating marketing and technology to improve organic visibility strategies and implement process automation.
A University of Strathclyde alumna, her work spans sectors such as B2B, SaaS, and big tech, with notable projects for AWS, Extreme Networks, neo4j, Skyscanner, and other enterprises.
Lazarina champions marketing automation by creating resources for SEO professionals and speaking at industry events globally on the significance of automation and machine learning in digital marketing. Her contributions to the field are recognised in publications like Search Engine Land, Wix, and Moz, to name a few.
As a mentor on GrowthMentor and a guest lecturer at the University of Strathclyde, Lazarina dedicates her efforts to education and empowerment within the industry.