Divides documents into tokensets and calculates the c-TF-IDF similarity of each tokenset to each topic. For each outlier document, the tokenset similarity scores are summed by topic, and the outlier is redistributed to the topic with the highest summed similarity. This function only returns a new list of topics that can then be used to update the model. It does not change the model itself, so the topic classification the model outputs is the same after running it. Use the bt_update_topics function to apply the change to the model.
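The scoring step described above can be sketched in base R. This is a toy illustration only, not the package's implementation: the similarity values are made up, and the names tokenset_sims, topic_scores and best_topic are hypothetical.

```r
# Toy similarity scores for ONE outlier document (values are invented).
# Rows are tokensets, columns are topics.
tokenset_sims <- matrix(
  c(0.10, 0.40, 0.05,
    0.20, 0.35, 0.05,
    0.05, 0.30, 0.15),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    paste0("tokenset_", 1:3),
    c("topic_0", "topic_1", "topic_2")
  )
)

# Sum the tokenset scores by topic
topic_scores <- colSums(tokenset_sims)

# The topic with the highest summed score is the candidate new topic
best_topic <- names(topic_scores)[which.max(topic_scores)]
best_topic
```

In this toy case the second topic column has the largest column sum, so it would be the candidate for reassignment.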

Usage

bt_outliers_tokenset_similarity(
  fitted_model,
  documents,
  topics,
  ...,
  window = 4L,
  stride = 1L,
  threshold = 0.3
)

Arguments

fitted_model

Output of bt_fit_model() or another BERTopic topic model. The model must have been fitted to data.

documents

Documents to which the model was fit.

topics

Current topics associated with the documents.

...

Additional optional parameters passed to the approximate_distribution function, e.g. batch_size.

window

Size of the moving window, i.e. the number of tokens in each tokenset.

stride

How far the window moves at each step, i.e. the number of tokens to skip when moving to the next tokenset.

threshold

Minimum probability required for an outlier to be reassigned.
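To make the window, stride and threshold arguments concrete, here is a minimal base R sketch. It rests on several assumptions: tokens are split on whitespace (the package delegates tokenisation to the underlying model), partial windows at document edges are ignored, scores are normalised to proportions before the threshold is applied, and -1 marks an outlier as in BERTopic. The helpers make_tokensets and reassign_outlier are hypothetical and are not part of the package.

```r
# Hypothetical helper: slide a window of `window` tokens over the text,
# moving `stride` tokens at each step. Whitespace tokenisation is an
# assumption made for illustration.
make_tokensets <- function(text, window = 4L, stride = 1L) {
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(tokens) <= window) {
    return(paste(tokens, collapse = " "))
  }
  starts <- seq(1L, length(tokens) - window + 1L, by = stride)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + window - 1L)], collapse = " "),
    character(1)
  )
}

# Hypothetical helper: reassign only if the best topic's share of the
# summed scores reaches `threshold`; otherwise keep the outlier label.
# Normalising to proportions is an assumption of this sketch.
reassign_outlier <- function(topic_scores, threshold = 0.3,
                             outlier_label = -1L) {
  if (sum(topic_scores) == 0) {
    return(outlier_label)
  }
  probs <- topic_scores / sum(topic_scores)
  best <- which.max(probs)
  if (probs[[best]] >= threshold) {
    as.integer(names(probs)[best])
  } else {
    outlier_label
  }
}

tokensets_s1 <- make_tokensets("the quick brown fox jumps over", 4L, 1L)
tokensets_s2 <- make_tokensets("the quick brown fox jumps over", 4L, 2L)

clear_case <- reassign_outlier(c("0" = 0.35, "1" = 1.05, "2" = 0.25))
weak_case <- reassign_outlier(c("0" = 0.34, "1" = 0.33, "2" = 0.33),
                              threshold = 0.5)
```

A larger stride produces fewer tokensets for the same document, and a higher threshold leaves more documents labelled as outliers.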

Value

A data frame with the document, old topic, and new topic.
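A returned data frame of this shape can be inspected with base R to see how many outliers were redistributed. The sketch below builds a toy data frame by hand. The column name new_topics appears in the examples for this function, but document and current_topics are assumed names, so check names() on the real output first. The label -1 for outliers follows the BERTopic convention.

```r
# Toy stand-in for the returned data frame (values are invented).
# `document` and `current_topics` are ASSUMED column names.
outliers <- data.frame(
  document = c("doc a", "doc b", "doc c", "doc d"),
  current_topics = c(-1L, 0L, -1L, 2L),
  new_topics = c(1L, 0L, -1L, 2L)
)

# Documents that started as outliers (-1) and received a topic
was_outlier <- outliers$current_topics == -1L
reassigned <- was_outlier & outliers$new_topics != -1L

n_reassigned <- sum(reassigned)
outliers[reassigned, ]

# Topic sizes after redistribution
table(outliers$new_topics)
```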

Examples

if (FALSE) {
# Reduce the outliers identified by the original clustering model
outliers <- bt_outliers_tokenset_similarity(
  fitted_model = topic_model, documents = docs, topics = topic_model$topics_
)

# Chain strategies: build on the output of another reduction strategy.
# First, use embeddings to redistribute outliers
outliers_embed <- bt_outliers_embeddings(
  fitted_model = topic_model, documents = docs, topics = topic_model$topics_,
  embeddings = embeddings, threshold = 0.5
)

# Then apply tokenset similarity to the topics from the embeddings method
outliers_chain <- bt_outliers_tokenset_similarity(
  fitted_model = topic_model, documents = docs,
  topics = outliers_embed$new_topics, threshold = 0.2
)

}