The flexibility of the text package gives researchers considerable
freedom to choose among options. For example, a researcher can select
among many different layers (12 in BERT base and 24 in BERT large), and
these layers can be aggregated in different ways, including the mean,
minimum or maximum. It is also possible to use a different number of
PCA components (or no PCA at all) in training, and to select different
regression algorithms (including multiple linear regression or ridge).
All these options are great for learning more about these methods.
However, when testing hypotheses it is important not to fall prey to
researcher degrees of freedom and to avoid the risk of (unconsciously)
p-hacking (e.g., see Simmons, Nelson, & Simonsohn, 2011).
Researcher degrees of freedom refer to the inherent flexibility
involved in conducting research, including carrying out experiments as
well as analyzing the data. Researchers can choose among many ways of
analyzing their data, and these ways can, for example, be selected
arbitrarily or because certain ways result in more desirable outcomes,
such as a statistically significant result (Simmons, Nelson, &
Simonsohn, 2011). Put another way, the flexibility in the text package
is a double-edged sword, where abusing the options leads to p-hacking:
the analytic process of consciously or unconsciously trying several
types of analyses until achieving the desired results.
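To make the scale of this flexibility concrete, the following is a minimal sketch in Python (chosen only for illustration; it does not use the text package). The option sets are hypothetical, loosely based on the choices named above, and the calculation assumes that every analysis is an independent test at the same alpha level. Analyses run on the same data are correlated, so the real inflation is typically smaller than this figure, but it still exceeds the nominal alpha.

```python
from itertools import product

# Hypothetical option sets, loosely based on the choices discussed above.
options = {
    "layers": ["all", "11-12", "12"],
    "aggregation": ["mean", "minimum", "maximum"],
    "pca": ["no_pca", "with_pca"],
    "algorithm": ["multiple_linear_regression", "ridge"],
}

# Every combination of one choice per option is a distinct analysis pipeline.
pipelines = list(product(*options.values()))
n_pipelines = len(pipelines)


def chance_of_false_positive(k, alpha=0.05):
    """Probability that at least one of k independent tests is significant
    when the null hypothesis is true in all of them (an assumption)."""
    return 1 - (1 - alpha) ** k


print(f"{n_pipelines} possible pipelines")
print(f"Chance of at least one false positive: "
      f"{chance_of_false_positive(n_pipelines):.2f}")
```

Even this small set of choices multiplies into dozens of pipelines, which is why committing to one of them in advance matters.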
Specify the language model, which layers will be used, and how they will be aggregated.
Examples of aspects to consider in a pre-registration of hypothesis testing
This is not an exhaustive list; rather, think through your analyses as
carefully as possible and consider which decisions can appropriately be
made in advance. For example:
Type of model (e.g., BERT-base, BERT-large, multilingual BERT, RoBERTa, XLNet, etc.)
Which layers (e.g., all, or 11 and 12, etc.)
Layer aggregation method (e.g., mean, minimum, or maximum)
Exclusion of some tokens (e.g., [CLS] and [SEP])
Type of ML algorithm (e.g., ridge, random forest, etc.)
Number of cross validation folds in textTrain
Criteria for plotting (e.g., the number of words to significance test, the plots to produce, etc.)
Number of permutations (e.g., in textSimilarityTest, textProjection)
No change of random seed. It has recently been discussed in the computer science literature that different random seeds can give very different results (e.g., see Mosbach et al., 2020). So consider stating that seeds will not be changed, or commit to a specific seed.
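One way to record such decisions before seeing the data is sketched below in Python (chosen only for illustration; it does not call the text package). The class name, field names and all values are hypothetical placeholders, not recommendations. The idea is to store the choices in an immutable record and compute a fingerprint that can be quoted in the pre-registration, so that any later deviation is detectable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PreRegistration:
    # All field names and default values are illustrative assumptions.
    language_model: str = "bert-base-uncased"
    layers: tuple = (11, 12)
    layer_aggregation: str = "mean"
    excluded_tokens: tuple = ("[CLS]", "[SEP]")
    ml_algorithm: str = "ridge"
    cv_folds: int = 10
    n_permutations: int = 10000
    random_seed: int = 2020

    def fingerprint(self) -> str:
        """Return a SHA-256 hex digest of the settings in a stable key order."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


plan = PreRegistration()
registered_fingerprint = plan.fingerprint()

# Any later deviation, such as a different seed, yields a different fingerprint.
deviating_plan = replace(plan, random_seed=1)
plan_was_followed = deviating_plan.fingerprint() == registered_fingerprint
print("Plan followed:", plan_was_followed)
```

The fingerprint is only a convenience; the pre-registration itself should still state each decision in plain language.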
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science.
Mosbach, M., Andriushchenko, M., & Klakow, D. (2020). On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines.