fastBERTtokens: Tokenizing for BERT in parallel

This function divides your text into batches and tokenizes them in parallel, which provides a significant speed-up.


Function to use the MATLAB BERT tokenizer in parallel
This function divides your text into batches and tokenizes them in parallel. Because the MATLAB tokenizer is very slow on large data when run on a single processor, this provides a significant speed-up. On an i7-10875H laptop with 8 logical units, tokenizing 76k sentences takes about 100 seconds.
Note that passing the MATLAB BERT model is important, because different BERT models use different encodings for the special BERT tokens such as [SEP].
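
The batching idea described above can be sketched as follows. This is only an illustrative sketch, not the source of fastBERTtokens: the helper name tokenizeInBatches, its arguments, and the tokenizeFcn function handle are all hypothetical, and the tokenizer is passed in as a handle so that no particular BERT tokenizer API is assumed. Note that parfor only runs in parallel when Parallel Computing Toolbox is available; otherwise the loop runs serially.

```matlab
function tokens = tokenizeInBatches(textData, tokenizeFcn, numBatches)
% TOKENIZEINBATCHES  Hypothetical sketch: split text into batches and
% tokenize each batch inside a parfor loop.
%   textData    - string array (or cellstr) of sentences
%   tokenizeFcn - handle mapping an N-by-1 string array to an N-by-1 cell
%                 array of tokenized sentences (assumed interface)
%   numBatches  - desired number of batches, e.g. the number of workers

textData = string(textData(:));             % force a column string array
n = numel(textData);
numBatches = max(1, min(numBatches, n));    % never more batches than sentences

% Batch boundaries: batch b covers rows edges(b)+1 .. edges(b+1).
edges = round(linspace(0, n, numBatches + 1));
batches = cell(numBatches, 1);
for b = 1:numBatches
    batches{b} = textData(edges(b) + 1 : edges(b + 1));
end

% Each iteration is independent, so the batches can be processed in parallel.
results = cell(numBatches, 1);
parfor b = 1:numBatches
    % feval avoids ambiguity between indexing and calling a handle in parfor.
    results{b} = feval(tokenizeFcn, batches{b});
end

% Concatenate per-batch results back into the original sentence order.
tokens = vertcat(results{:});
end
```

As a usage example with a stand-in whitespace tokenizer (again an assumption, used only so the sketch runs without any BERT model): calling tokenizeInBatches(["a b"; "c d e"; "f"], @(s) arrayfun(@(x) split(x), s, 'UniformOutput', false), 2) returns a 3-by-1 cell array with one entry per sentence, in the original order. With a real BERT tokenizer, the handle would wrap the model-specific encoding call, so that the special tokens match the model being used.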

Cite As

Ralf Elsas (2026). fastBERTtokens: Tokenizing for BERT in parallel (https://es.mathworks.com/matlabcentral/fileexchange/125295-fastberttokens-tokenizing-for-bert-in-parallel), MATLAB Central File Exchange. Retrieved .

Acknowledgements

Inspired by: Transformer Models

General Information

MATLAB Release Compatibility

  • Compatible with R2021a and later releases

Platform Compatibility

  • Windows
  • macOS
  • Linux
Version Published Release Notes
1.0.0