This vignette shows you how to get a topic model up and running with BertopicR relatively quickly. Note that this approach significantly simplifies and generalises certain steps, and will rarely produce an optimised topic model. To get the most out of your topic modelling, you should refer to the Interacting with Individual Modules vignette.

Preparing the Data

First, we load the data we would like to fit the model to.

library(BertopicR)
library(dplyr)

sentences <- stringr::sentences

Compiling the Model

If you have read the Modular Approach vignette, you will have seen that we specified each individual component of our topic model (embedding_model, ctfidf_model, etc.) and fed those to bt_compile_model. With the same function, we can instead use entirely default parameters, or a combination of default parameters and specified components.

model <- bt_compile_model()
#> 
#> No embedding model provided, defaulting to 'all-mpnet-base-v2' model as embedder.
#> 
#> No reduction_model provided, using default 'bt_reducer_umap' parameters.
#> 
#> No clustering model provided, using hdbscan with default parameters.
#> 
#> No vectorising model provided, creating model with default parameters
#> 
#> No ctfidf model provided, creating model with default parameters
#> 
#> Model built & input model updated accordingly
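As a sketch of mixing defaults with specified components, we can pass a single component to bt_compile_model and let it fill in the rest with defaults. The component constructor name and its model argument below are assumed to match those described in the Modular Approach vignette:

```r
# Specify only the embedding model; all other components fall back to defaults
embedder <- bt_make_embedder_st("all-minilm-l6-v2")
custom_model <- bt_compile_model(embedding_model = embedder)
```

This is useful when you want to change one piece of the pipeline without configuring every module yourself.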

Fitting the Model

Now that we have created a model that uses all default parameters, we can simply use the bt_fit_model function to fit the model to our sentences data. It is important to note that, as we have not pre-computed document embeddings or reduced those embeddings, this will be done internally, which can be quite time-consuming if you choose to run the topic modelling process multiple times.

NOTE: The bertopic model you are working with is a pointer to a Python object in memory. This means that the input and output models cannot be distinguished unless you explicitly save the model before performing this operation. We do not need to assign the output of bt_fit_model, as the function changes the input model in place. See the Note under the Fit the Model section in the Interacting with Individual Modules vignette for more detail.

bt_fit_model(model, sentences)
#> UMAP(low_memory=False, min_dist=0.0, n_components=5, random_state=42, verbose=True)
#> Thu Sep 28 14:46:03 2023 Construct fuzzy simplicial set
#> Thu Sep 28 14:46:03 2023 Finding Nearest Neighbors
#> Thu Sep 28 14:46:04 2023 Finished Nearest Neighbor Search
#> Thu Sep 28 14:46:05 2023 Construct embedding
#> Thu Sep 28 14:46:06 2023 Finished embedding
#> 
#> Model is fitted
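If you intend to fit several models to the same documents, you can avoid repeating the expensive embedding step by computing the embeddings once and passing them in. This is a sketch assuming that bt_do_embedding and the embeddings argument of bt_fit_model behave as described in the Interacting with Individual Modules vignette:

```r
# Compute document embeddings once, up front
embedder <- bt_make_embedder_st("all-mpnet-base-v2")
embeddings <- bt_do_embedding(embedder, sentences)

# Reuse the stored embeddings on each fit, skipping the embedding step
bt_fit_model(model, sentences, embeddings = embeddings)
```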

model$get_topic_info() %>% select(-Representative_Docs, -Representation)
#>    Topic Count                             Name
#> 1     -1   300       -1_bright_good_serve_clean
#> 2      0    72          0_wind_sun_beach_breeze
#> 3      1    69          1_boy_short_plans_tales
#> 4      2    48          2_tea_ripe_taste_served
#> 5      3    45        3_book_chart_pencil_seven
#> 6      4    37           4_porch_rug_hatch_tent
#> 7      5    32          5_cord_hoist_fasten_cap
#> 8      6    21           6_gold_ring_dress_silk
#> 9      7    20       7_broke_fell_storm_cracked
#> 10     8    19 8_sat_home_means game_swing fans
#> 11     9    16       9_wood_cement_board_boards
#> 12    10    15        10_corn_spring_peas_grain
#> 13    11    13              11_cat_cats_pup_saw
#> 14    12    13          12_chair_desk_couch_tan

That’s it, you have a topic model up and running! If you want to adjust factors such as the minimum size of a topic or the number of topics produced, refer to the Interacting with Individual Modules vignette. You can also refer to the Manipulating the Model vignette to see how to interpret the topics and reduce the number of outliers identified (if using the default hdbscan clustering).