HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)

A hierarchical version of DBSCAN; used by BERTopic to identify dense clusters.

HDBSCAN is a density-based clustering algorithm that identifies dense regions of points in embedding space. It is data-driven in the sense that the number of clusters does not have to be specified in advance. In the BERTopic pipeline, HDBSCAN runs after dimensionality reduction with UMAP, and each dense cluster it finds corresponds to a potential topic.
A significant application of HDBSCAN is outlier detection. Points that do not belong to any dense cluster are treated as noise, which makes it possible to flag anomalies such as unusual traffic spikes, spammy backlinks, or irregular customer behavior.
Beyond outlier detection, HDBSCAN is also used to group data points by proximity in a low-dimensional space. Example projects include grouping competitors by shared web performance metrics (such as domain authority or organic traffic) and classifying backlink quality.
