The article investigates tokenization in language models, focusing on the identification of under-trained tokens using various indicators. Through an analysis of models that use the GPT-2 tokenizer, it highlights specific issues related to control characters and non-English tokens. The results demonstrate the effectiveness of these indicators in detecting under-trained tokens and show that certain under-trained tokens recur across multiple models. The findings reinforce the need for improved normalization in tokenization to enhance model performance, as certain tokens remain problematic, affecting multilingual capabilities and overall accuracy.
We confirm previous findings, with a significant number of tokens related to (fragments of) usernames (e.g. _TheNitrome, _RandomRedditor, StreamerBot). Although the model is aimed at English text, there are a few under-trained non-English tokens, including the Japanese token _ãµã¼ãã£. In addition to the 13 byte values that never occur in valid UTF-8, we find that all ASCII control characters in the 0-31 range, except for the newline character, appear untrained.
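The claim about the 0-31 range can be checked directly against the tokenizer, since GPT-2's byte-level BPE reserves a single-character token for every byte value. The following minimal sketch assumes the Hugging Face transformers package and the public gpt2 checkpoint (not necessarily the exact setup of this analysis) and simply enumerates the control-character byte tokens:

```python
# Minimal sketch: list GPT-2's byte-level tokens for the C0 control
# characters (0x00-0x1F). Assumes the Hugging Face `transformers` package
# and the public "gpt2" checkpoint.
from transformers import GPT2Tokenizer
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

tok = GPT2Tokenizer.from_pretrained("gpt2")
byte_to_char = bytes_to_unicode()   # byte value -> printable stand-in character
vocab = tok.get_vocab()             # token string -> token id

for b in range(0x00, 0x20):         # ASCII control characters 0-31
    token_str = byte_to_char[b]     # the single-character byte token for this value
    print(f"byte 0x{b:02X} -> token {token_str!r} (id {vocab[token_str]})")
```

The same 256 base byte tokens also cover the 13 values that cannot appear in valid UTF-8 (0xC0, 0xC1 and 0xF5-0xFF), which is why those tokens exist in the vocabulary despite never being produced by encoding real text.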
Most notably, the untrained control characters mean that the horizontal tab \t and the carriage return \r are out of distribution for these models. We also evaluated a few models that base their tokenizer on GPT-2, including Phi-2 and GPT-J 6B; these models share many of the same under-trained tokens.
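One way to make "out of distribution" concrete is an embedding-based indicator: tokens that rarely or never occur during training tend to stay near their initialisation, so their distance from the mean embedding is a rough signal. The sketch below is illustrative only and not necessarily the exact indicator used in this analysis; it again assumes transformers and the public gpt2 checkpoint, scores the \t and \r byte tokens, and lists the lowest-scoring tokens overall:

```python
# Illustrative under-training indicator: distance of each (tied) token
# embedding from the vocabulary mean. Small values suggest the token was
# rarely updated during training. Not necessarily the article's exact metric.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

emb = model.get_input_embeddings().weight.detach()  # (vocab_size, d_model), tied with lm_head
norms = (emb - emb.mean(dim=0, keepdim=True)).norm(dim=-1)

# Score the byte tokens for \t and \r specifically.
for ch in ("\t", "\r"):
    for token_id in tok.encode(ch):
        print(f"{ch!r} -> id {token_id}, centered norm {norms[token_id].item():.3f}")

# Tokens with the smallest centered norms are candidates for being under-trained.
lowest = norms.argsort()[:20].tolist()
print(tok.convert_ids_to_tokens(lowest))
```

The same sketch can be pointed at other models that reuse the GPT-2 tokenizer, such as Phi-2 or GPT-J 6B, by swapping the checkpoint name; for models that do not tie their input and output embeddings, the analogous check would be run on the output (unembedding) matrix instead.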