Similarity Score

Evaluation & Metrics

A score quantifying how alike two strings are in fuzzy matching (for example, one derived from Levenshtein distance), typically ranging from 0 (no resemblance) to 1 (identical).

A similarity score is a metric used in machine learning and text processing, closely tied to the string matching problem. It measures how closely two input strings match, either directly or as the complement of a distance between them. The resulting score is then used to classify the pair as equivalent, similar, or distant.
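As a minimal sketch of the idea above, the following Python turns Levenshtein distance into a score between 0 and 1 and maps it to the three classes. Dividing by the longer string's length is one common normalization, not the only one, and the cut-offs in `classify` (1.0 and 0.7) are hypothetical values chosen purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance by the longer length so the score falls in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


def classify(score: float, equivalent_at: float = 1.0, similar_at: float = 0.7) -> str:
    """Map a score to a label; both thresholds are illustrative assumptions."""
    if score >= equivalent_at:
        return "equivalent"
    if score >= similar_at:
        return "similar"
    return "distant"


label = classify(similarity("color", "colour"))
```

In practice the thresholds would be tuned on labeled pairs for the task at hand rather than fixed in advance.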
Similarity scores are generated using various algorithms, such as Levenshtein distance, Jaccard similarity, or cosine similarity, depending on the specific application (e.g., ad copy analysis, competitor keyword matching). For example, in text analysis tasks, cosine similarity over TF-IDF vectors is a common choice for flexibly matching a query string against the values of an attribute.
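To make two of these measures concrete, here is a small standard-library sketch of Jaccard similarity over word sets and cosine similarity over TF-IDF vectors. The whitespace tokenizer, the smoothed IDF formula, and the sample attribute values are all assumptions made for illustration; a production system would more likely rely on an established library.

```python
import math
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Size of the shared word set divided by the size of the combined word set."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Sparse TF-IDF vectors using raw term counts and a smoothed IDF (an assumed variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical attribute values and query, used only to exercise the functions.
values = ["running shoes for men", "trail running shoes", "leather office shoes"]
query = "running shoes"

vectors = tfidf_vectors(values + [query])
query_vector = vectors[-1]
scores = [cosine(query_vector, vector) for vector in vectors[:-1]]
best_match = values[scores.index(max(scores))]
```

Because TF-IDF weights are non-negative, the cosine scores here stay between 0 and 1, which matches the range given in the definition.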
In embedding-based keyword extraction tools such as KeyBERT, similarity is calculated as the cosine similarity between the embedding of the document and the embeddings of the candidate keywords (n-grams). Similarly, in internal linking identification using LinkBERT, an embedding-based similarity measure is used, outputting a similarity score for each content pair. These scores often serve as the basis for thresholding to suggest potential matches, such as pairing content items whose similarity score exceeds a chosen threshold (e.g., 0.5).
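The thresholding step can be sketched as follows. This does not call KeyBERT or LinkBERT: the three-dimensional vectors and page names are made-up stand-ins for real model embeddings, and `suggest_pairs` is a hypothetical helper. Only the 0.5 threshold comes from the example above.

```python
import math
from itertools import combinations


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def suggest_pairs(embeddings: dict[str, list[float]], threshold: float = 0.5):
    """Return (name_a, name_b, score) for every pair scoring above the threshold."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score > threshold:
            pairs.append((name_a, name_b, score))
    # Highest-scoring suggestions first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Toy vectors standing in for embeddings produced by a real model.
toy_embeddings = {
    "guide-to-keyword-research": [0.9, 0.1, 0.2],
    "keyword-tools-compared": [0.8, 0.3, 0.1],
    "office-chair-reviews": [0.1, 0.9, 0.4],
}

suggested = suggest_pairs(toy_embeddings, threshold=0.5)
```

With real embeddings, cosine similarity can be negative, so a pipeline that expects scores between 0 and 1 may need to clip or rescale before applying the threshold.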