Creates a CountVectorizer object from Python's scikit-learn (sklearn) with the specified parameters. CountVectorizer performs feature extraction by converting text documents into vectors of token counts, which serve as model input. The resulting vectoriser is intended for use inside a BertopicR topic modelling pipeline.
Usage
bt_make_vectoriser(
...,
ngram_range = c(1L, 2L),
stop_words = "english",
min_frequency = 0.1,
max_features = NULL
)
Arguments
- ...
Additional parameters passed to sklearn's CountVectorizer
- ngram_range
A vector of length 2 (default c(1, 2)) giving the lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted as features. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
- stop_words
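String or character vector (default 'english'). If a string, it is passed to sklearn's _check_stop_list and the appropriate stop list is returned; 'english' is currently the only string value sklearn supports. A character vector is treated as a custom list of stop words.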
- min_frequency
Integer or float (default 0.1). When building the vocabulary, ignore terms that have a corpus frequency strictly lower than the given threshold. If min_frequency is explicitly defined as an integer (for example 2L), it is assumed to represent an absolute count. If it is not explicitly an integer, a value between 0 and 1 is assumed to represent a proportion of documents, and a whole number is assumed to represent an absolute count.
- max_features
Integer or NULL (default NULL). If not NULL, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
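The ngram_range and min_frequency descriptions above can be made concrete with a small base R sketch. It does not call BertopicR or sklearn; word_ngrams() and min_document_count() are hypothetical helpers written only for illustration, and the threshold logic is one reading of the min_frequency description rather than the package's implementation.

```r
# Hypothetical helper: list every word n-gram of a token vector for each n in
# ngram_range (min_n <= n <= max_n).
word_ngrams <- function(tokens, ngram_range = c(1L, 2L)) {
  out <- character(0)
  for (n in seq(ngram_range[1], ngram_range[2])) {
    if (length(tokens) < n) next  # too few tokens for an n-gram of this size
    starts <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Hypothetical helper: translate min_frequency into a minimum count, following
# the rules in the argument description (assumed, not taken from the package).
min_document_count <- function(min_frequency, n_docs) {
  is_whole <- is.integer(min_frequency) || min_frequency == round(min_frequency)
  if (is_whole) {
    min_frequency            # integers and whole numbers: absolute count
  } else if (min_frequency > 0 && min_frequency < 1) {
    min_frequency * n_docs   # values between 0 and 1: proportion of documents
  } else {
    stop("min_frequency must be a whole number or lie between 0 and 1")
  }
}

tokens <- c("topic", "models", "are", "useful")
word_ngrams(tokens, c(1L, 1L))  # unigrams only
word_ngrams(tokens, c(1L, 2L))  # unigrams and bigrams
word_ngrams(tokens, c(2L, 2L))  # bigrams only

min_document_count(0.1, n_docs = 50)  # proportion of 50 documents
min_document_count(2L, n_docs = 50)   # explicit integer, absolute count
```

Under this reading, the default min_frequency of 0.1 drops any term whose frequency falls below 10% of the documents, which can remove a large share of the vocabulary on small corpora.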
Examples
# vectoriser model that converts text docs to n-grams of 1 to 2 tokens
vectoriser <- bt_make_vectoriser(ngram_range = c(1, 2), stop_words = "english")
# vectoriser model that converts text docs to n-grams of 1 to 3 tokens
vectoriser <- bt_make_vectoriser(ngram_range = c(1, 3), stop_words = "english")
# You can implement custom stopwords or stopwords from other sources
if (FALSE) {
stopwords_cat <- tm::stopwords(kind = "catalan")
vectoriser <- bt_make_vectoriser(ngram_range = c(1, 3), stop_words = stopwords_cat)
}
custom_stopwords <- c("these", "words", "are", "not", "helpful")
vectoriser <- bt_make_vectoriser(ngram_range = c(1, 2), stop_words = custom_stopwords)
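To show what a custom stop word list does to the resulting count vectors, here is a minimal base R sketch. It is not how BertopicR or sklearn implement vectorisation: the two documents are made up, and whitespace tokenisation is a simplifying assumption.

```r
# Made-up documents for illustration only
docs <- c("these words are not helpful", "topic models find topic words")
custom_stopwords <- c("these", "words", "are", "not", "helpful")

# Lower-case, split on whitespace, then drop stop words
tokens <- lapply(
  strsplit(tolower(docs), "\\s+"),
  function(x) x[!x %in% custom_stopwords]
)

# The vocabulary is every remaining distinct token
vocabulary <- sort(unique(unlist(tokens)))

# One row per document, one column per vocabulary term, cells hold counts
counts <- t(vapply(
  tokens,
  function(x) as.integer(table(factor(x, levels = vocabulary))),
  integer(length(vocabulary))
))
colnames(counts) <- vocabulary
counts
```

The first document consists entirely of stop words, so its row is all zeros. This is worth keeping in mind when a custom list is aggressive.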