BERT (Bidirectional Encoder Representations from Transformers)

The foundational language model used for transformer-based embeddings in BERTopic.

BERT is a transformer-based language model introduced by Google researchers in October 2018. Its fundamental purpose is to understand unstructured text and extract meaning from it. As a modern approach to neural embeddings, it improves topic quality and flexibility: BERT embeddings capture contextual meaning, grouping synonyms and related concepts so that the resulting topics align closely with how humans understand themes.
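The contextual nature of these embeddings can be shown with a short sketch. The example below assumes the Hugging Face transformers library and PyTorch; the bert-base-uncased checkpoint and the sample sentences are illustrative choices rather than anything prescribed here. It demonstrates that the same word receives different vectors depending on its context.

```python
# A minimal sketch of context-dependent BERT embeddings, assuming the
# Hugging Face transformers library and PyTorch. The "bert-base-uncased"
# checkpoint and the example sentences are illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]                   # vector for `word`

river = embed_word("She sat on the bank of the river.", "bank")
money = embed_word("She deposited cash at the bank.", "bank")
loan = embed_word("The bank approved the loan.", "bank")

cos = torch.nn.functional.cosine_similarity
# The two financial senses of "bank" should be closer to each other
# than either is to the river-bank sense.
print(cos(money, loan, dim=0), cos(money, river, dim=0))
```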
As a core component of advanced ML techniques, BERT (or a similar pre-trained transformer model) is used to create dense vector representations of texts before they are clustered. For instance, BERTopic, a topic modeling technique, leverages BERT embeddings to create dense clusters that yield easily interpretable topics. BERT embeddings are also central to keyword extraction techniques such as KeyBERT, which identifies the words and phrases most relevant to a document by calculating the cosine similarity between the document embedding and each candidate n-gram embedding.
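The cosine-similarity step can be sketched in a few lines. The snippet below is a simplified illustration rather than KeyBERT's actual implementation; it assumes the sentence-transformers and scikit-learn packages, and the all-MiniLM-L6-v2 model and sample document are arbitrary choices.

```python
# A minimal sketch of KeyBERT-style keyword extraction: rank candidate
# n-grams by cosine similarity to the document embedding. Assumes the
# sentence-transformers and scikit-learn packages; the model name is
# an illustrative choice.
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc = (
    "BERTopic leverages BERT embeddings to create dense clusters "
    "that produce easily interpretable topics."
)

# Extract candidate unigrams and bigrams from the document.
candidates = (
    CountVectorizer(ngram_range=(1, 2), stop_words="english")
    .fit([doc])
    .get_feature_names_out()
)

# Embed the document and the candidates with the same transformer model.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(list(candidates))

# Rank candidates by cosine similarity to the document embedding.
scores = cosine_similarity(doc_embedding, candidate_embeddings)[0]
top_keywords = [candidates[i] for i in scores.argsort()[::-1][:5]]
print(top_keywords)
```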
BERT serves as a drop-in replacement for earlier methods on general language-understanding tasks such as text classification. Models built on BERT, such as LinkBERT, are pre-trained to capture document and citation links, improving knowledge that spans multiple documents. The Multitask Unified Model (MUM), introduced by Google in 2021, is similar to BERT but uses the more powerful T5 text-to-text framework, so it can generate language as well as understand it; Google describes it as 1,000 times more powerful than BERT.
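As a sketch of the drop-in text-classification use mentioned above, the example below assumes the Hugging Face transformers library; the distilbert-base-uncased-finetuned-sst-2-english checkpoint is a BERT-derived sentiment model chosen purely for illustration.

```python
# A minimal sketch of using a BERT-based model as a drop-in text classifier,
# assuming the Hugging Face transformers library. The checkpoint shown is a
# fine-tuned sentiment-analysis example, not mandated by the text.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("BERT embeddings make the extracted topics easy to interpret."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```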
