The process of splitting text into tokens (words or phrases) during preprocessing.
Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP) in which text is broken down into smaller, discrete units called tokens, typically individual words or phrases.
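As a minimal illustration (the sample sentence and the regular expression are purely illustrative; production pipelines normally use a library tokenizer), the sketch below splits a sentence into word tokens:

```python
import re

text = "Tokenization breaks text into smaller units called tokens."

# Lowercase the text and pull out runs of word characters as tokens.
tokens = re.findall(r"\w+", text.lower())
print(tokens)
# ['tokenization', 'breaks', 'text', 'into', 'smaller', 'units', 'called', 'tokens']
```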
In LDA (Latent Dirichlet Allocation), tokenization is performed while building the document corpus. In KeyBERT, a CountVectorizer is used as the tokenizer to split the input text into candidate keywords (n-grams) before they are embedded.
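As a rough sketch of that candidate-extraction step, scikit-learn's CountVectorizer can generate unigram and bigram candidates from a document; the sample text and parameter values here are illustrative rather than KeyBERT's defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = "Tokenization is a fundamental preprocessing step in natural language processing."

# Build unigram and bigram candidates, dropping common English stop words.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
vectorizer.fit([doc])

# Each feature name is a candidate keyword or keyphrase that could then be embedded.
candidates = vectorizer.get_feature_names_out()
print(candidates)
```

In KeyBERT itself, the candidate n-gram range can typically be controlled through the keyphrase_ngram_range argument of extract_keywords, or a custom CountVectorizer can be passed in directly.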