textTopics().
textTopics().
textEmbed() gets slightly different embeddings for longer texts, because a sliding window is used when there are too many tokens for the LLM.
textrpp-py 0.1.0 by default to set up the Python environment in a robust and reproducible way.
device argument to GPU when available, to take advantage of hardware acceleration.
remove_non_ascii parameter in textEmbed().
textTrainExamples() to textExamples() and improving the filter_word function.
textTrainExamples().
textTopics().
save_output = "no_plot" in textTrainRegression() for "logistic" and "multinomial" to reduce model size of saved objects.
word_embeddings and model requirements in the textPredict() function. This is controlled via the new check_matching_word_embeddings parameter, which validates compatibility of model type, layers, and aggregation settings.
textDimName() function, allowing users to specify or change the name suffix for word embedding dimensions.
dim_names = FALSE behavior in the textDimName() function to also ignore model-required dimension suffixes.
Now includes clearer and more informative warnings when dimension mismatches occur.
rsample:: function validation_split() to initial_validation_split(). However, this changes some results in textTrainRegression() and textTrainRandomForest().
textLBAM() to take construct_start parameter.
textTrainRegression() to reduce saved model sizes.
textEmbedRawLayers() (when using default -2, layer 11 was selected even for large models). This was never a problem in textEmbed().
dlatk_method to the textEmbed() function.
cv_method = "group_cv" in the textTrainRegression() function.
plot_n_word_random and legend_number_colour in textPlot.
nltk warning when running the functions requiring Python.
textProjection() function.
textProjection() function
textTrainExamples()
highest_parameter and lowest_parameter when parameters are tied.
textPredict(), textAssess() and textClassify().
textLBAM().
textClean() (removing common personal information).
textLBAM() returns the library as a dataframe
textPredict() detects model_type.
textFindNonASCII() function and feature in textEmbed() to warn and clean non-ASCII characters. This may change results slightly.
type parameter in textPredict() and instead giving both probability and class.
textClassify() is now called textClassifyPipe()
textPredict() is now called textPredictR()
textAssess(), textPredict() and textClassify() work the same, now taking the parameter method with the string "text" to use textPredict(), and "huggingface" to use textClassifyPipe().
hg_gated, hg_token, and trust_remote_code.
return_incorrect_results to force_return_results
function_to_apply = NULL instead of "none"; this is to mimic the huggingface default.
textWordPrediction since it is under development and not tested.
textTrainN() including subsets sampling (new: default change from random to subsets), use_same_penalty_mixture (new: default change from FALSE to TRUE) and std_err (new output).
textTrainPlot()
textPredict() functionality.
textTopics()
textTopics() trains a BERTopic model with different modules and returns the model, data, and topic_document distributions based on c-TF-IDF
textTopicsTest() can perform multiple tests (correlation, t-test, regression) between a BERTopic model from textTopics() and data
textTopicsWordcloud() can plot word clouds of topics tested with textTopicsTest()
textTopicsTree() prints out a tree structure of the hierarchical topic structure
textEmbed() now fully embeds one column at a time and reduces word_types for each column. This can break some code and produce different results in plots where word_types are based on several embedded columns.
textTrainN() and textTrainNPlot() evaluate prediction accuracy across number of cases.
textTrainRegression() and textTrainRandomForest() now take a tibble as input in strata.
textTrainRegression()
textPredictTest() can handle AUC
textEmbed() is faster (thanks to faster handling of aggregating layers)
sort parameter in textEmbedRawLayers().
Possibility to use GPU for macOS M1 and M2 chips using device = "mps" in textEmbed()
textFineTune() as an experimental function is implemented
max_length implemented in textTranslate()
textEmbedReduce() implemented
textEmbed(decontextualize=TRUE), which gave error.
textSimilarityTest() for version 1.0 because it needs more evaluations.
model, so that layers = -2 works in textEmbed().
set_verbosity.
sorting_xs_and_x_append from Dim to Dim0 when renaming x_appended variables.
first to append_first and made it an option in textTrainRegression() and textTrainRandomForest().
textEmbed() layers = 11:12 is now second_to_last.
textEmbedRawLayers default is now second_to_last.
textEmbedLayerAggregation() layers = 11:12 is now layers = "all".
textEmbed() and textEmbedRawLayers() x is now called texts.
textEmbedLayerAggregation() now uses layers = "all", aggregation_from_layers_to_tokens, aggregation_from_tokens_to_texts.
textZeroShot() is implemented.
textDistanceNorm() and textDistanceMatrix()
textDistance() can compute cosine distance.
textModelLayers() provides N layers for a given model
max_token_to_sentence in textEmbed()
aggregate_layers is now called aggregation_from_layers_to_tokens.
aggregate_tokens is now called aggregation_from_tokens_to_texts.
single_word_embeddings is now called word_types_embeddings
textEmbedLayersOutput() is now called textEmbedRawLayers()
textDimName()
textEmbed(): dim_name = TRUE
textEmbed(): single_context_embeddings = TRUE
textEmbed(): device = "gpu"
explore_words in textPlot()
x_append_target in textPredict() function
textClassify(), textGeneration(), textNER(), textSum(), textQA(), and textTranslate().
x_add to x_append across functions
set_seed to language analysis tasks
x' in training and prediction
textPredict does not take word_embeddings and x_append (not new_data)
textClassify() (under development)
textGeneration() (under development)
textNER() (under development)
textSum() (under development)
textQA() (under development)
textTranslate() (under development)
textSentiment(), from huggingface transformers models.
textEmbed(), textTrainRegression(), textTrainRandomForest() and textProjection().
dim_names to set unique dimension names in textEmbed() and textEmbedStatic().
textPredictAll() function that can take several models, word embeddings, and variables as input to provide multiple outputs.
textTrain() functions with x_append.
textPredict-related functions are located in their own file
text_version number
textEmbedLayersOutput and textEmbed can provide single_context_embeddings
return_tokens option from textEmbed (since it is only relevant for textEmbedLayersOutput)
$single_we when decontexts is FALSE.
Logistic regression is default for classification in textTrain.
model_max_length in textEmbed().
textModels() shows downloaded models.
textModelsRemove() deletes specified models.
textSimilarityTest() when uneven number of cases are tested.
textDistance() function with distance measures.
textSimilarity().
textSimilarity() in textSimilarityTest(), textProjection() and textCentrality() for plotting.
textTrainRegression() concatenates word embeddings when provided with a list of several word embeddings.
word_embeddings_4$singlewords_we.
In textCentrality(), words to be plotted are selected with word_data1_all$extremes_all_x >= 1 (rather than ==1).
textSimilarityMatrix() computes semantic similarity among all combinations in a given word embedding.
textDescriptives() gets options to remove NA and compute total scores.
textDescriptives()
textrpp_initiate()
tokenization is made with NLTK from Python.
textWordPredictions() (which has a trial period/not fully developed and might be removed in future versions); p-values are not yet implemented.
textPlot() for objects from both textProjection() and textWordPredictions()
textrpp_initiate() runs automatically in library(text) when default environment exists
textSimilarityTest().
stringr to stringi (and removed tokenizer) as imported package
textrpp_install() installs a conda environment with text required Python packages.
textrpp_install_virtualenv() installs a virtual environment with text required Python packages.
textrpp_initialize() initializes installed environment.
textrpp_uninstall() uninstalls conda environment.
textEmbed() and textEmbedLayersOutput() support the use of GPU using the device setting.
remove_words makes it possible to remove specific words from textProjectionPlot()
In textProjection() and textProjectionPlot() it is now possible to add points of the aggregated word embeddings in the plot
In textProjection() it is now possible to manually add words to the plot in order to explore them in the word embedding space.
In textProjection() it is possible to add color or remove words that are more frequent on the opposite "side" of its dot product projection.
In textProjection() with split == quartile, the comparison distribution is now based on the quartile data (rather than the data for mean)
textEmbed() with decontexts=TRUE.
textSimilarityTest() is not giving error when using method = unpaired, with unequal number of participants in each group.
textPredictTest() function to significance test correlations of different models. 0.9.11
This version is now on CRAN.
step_centre and step_scale in training.
textTrainRegression() and textTrainRandomForest() have two options: cv_folds and validation_split. (0.9.02)
NA in step_naomit in training.
DistilBert model works (0.9.03)
textProjectionPlot() plots words extreme in more than just one feature (i.e., words are now plotted that satisfy, for example, both plot_n_word_extreme and plot_n_word_frequency). (0.9.01)
textTrainRegression() and textTrainRandomForest() also have a function that selects the max evaluation measure results (before, only the minimum was selected all the time, which, e.g., was correct for rmse but not for r) (0.9.02)
id_nr in training and predict by using workflows (0.9.02).