BERTopic is an unsupervised, embedding-based topic modeling algorithm that generates interpretable topics and performs dynamic clustering on large unstructured datasets. It embeds documents with a BERT-style transformer model, reduces the dimensionality of the embeddings with UMAP, and clusters the reduced vectors with HDBSCAN. It is noted for semantic coherence, minimal preprocessing, and automatic detection of the number of topics, and it is effective for short text.
BERTopic is used for topic modeling and for grouping related keywords into clusters. It is well suited to exploratory keyword analysis on large unstructured datasets where predefined topics are not available, offering an alternative to approaches such as Sentence-BERT similarity matching or rule-based fuzzy string matching.
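The embed, reduce, and cluster steps described above can be sketched with the `bertopic` Python package. This is a minimal sketch, not a tuned setup: the function names `fit_topics` and `outlier_share`, the embedding model name `all-MiniLM-L6-v2`, and every hyperparameter value are illustrative assumptions, and the `bertopic`, `umap-learn`, and `hdbscan` packages are assumed to be installed.

```python
from typing import List, Tuple


def fit_topics(docs: List[str], min_cluster_size: int = 10) -> Tuple[List[int], object]:
    """Fit BERTopic on `docs` and return (topic id per document, fitted model)."""
    # Imported lazily so this module still loads when the optional
    # third-party packages are not installed.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    # Step 2: reduce the embeddings to a few dimensions (values are assumptions).
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=42,  # fixed seed so repeated runs are comparable
    )

    # Step 3: density-based clustering; the number of clusters is not set in advance.
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )

    # Step 1 happens inside BERTopic: documents are embedded with the
    # sentence-transformers model named here (an assumed choice).
    topic_model = BERTopic(
        embedding_model="all-MiniLM-L6-v2",
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
    )

    topics, _probabilities = topic_model.fit_transform(docs)
    return list(topics), topic_model


def outlier_share(topics: List[int]) -> float:
    """Fraction of documents assigned to topic -1, the outlier topic."""
    if not topics:
        return 0.0
    return sum(1 for t in topics if t == -1) / len(topics)
```

Given a list of at least a few hundred strings, `topics, model = fit_topics(docs)` returns one topic id per document, and `model.get_topic_info()` lists each discovered topic with its size and top keywords. Documents that HDBSCAN leaves unclustered receive topic id `-1`, so `outlier_share(topics)` is a quick check on whether `min_cluster_size` is too strict for the dataset.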
