fastBERTtokens: Tokenizing for BERT in parallel
Version 1.0.0 (1.43 KB) by
Ralf Elsas
This function divides your text into batches and tokenizes the batches in parallel, which provides a significant speed-up.
Function to use the MATLAB BERT tokenizer in parallel
This function divides your text into batches and tokenizes the batches in parallel. The MATLAB tokenizer is very slow on large data when run on a single processor, so parallel tokenization provides a significant speed-up. On an i7-10875H laptop with 8 logical units, tokenizing 76k sentences takes about 100 seconds.
Note that you must provide the MATLAB BERT model, because different BERT models use different encodings for the special BERT tokens such as [SEP].
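The batching idea described above can be illustrated with a short, language-agnostic sketch. The Python below is not the author's MATLAB code. It only shows the general pattern: split the input into contiguous batches, hand each batch to a worker, and concatenate the results in the original order. All names (`split_into_batches`, `toy_tokenize`, `tokenize_in_parallel`) are hypothetical, and `toy_tokenize` is a stand-in whitespace tokenizer, not a real BERT tokenizer. The tokenizer is passed in explicitly, loosely mirroring the point that the model must be provided.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, n_batches):
    """Split items into at most n_batches contiguous, near-equal batches."""
    n_batches = max(1, min(n_batches, len(items)))
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches receive one additional item each.
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


def toy_tokenize(sentence):
    """Hypothetical stand-in tokenizer: lowercase, split on whitespace,
    and wrap the result in BERT-style special tokens."""
    return ["[CLS]"] + sentence.lower().split() + ["[SEP]"]


def tokenize_batch(batch, tokenizer):
    """Tokenize every sentence of one batch sequentially."""
    return [tokenizer(sentence) for sentence in batch]


def tokenize_in_parallel(sentences, tokenizer, n_workers=4):
    """Tokenize sentences batch-wise on a pool of workers.

    pool.map returns results in submission order, so the flattened
    output lines up one-to-one with the input sentences.
    """
    batches = split_into_batches(sentences, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda b: tokenize_batch(b, tokenizer), batches)
        return [tokens for batch in results for tokens in batch]


sentences = ["Hello world", "BERT is fun", "Parallel tokenizing"]
tokens = tokenize_in_parallel(sentences, toy_tokenize, n_workers=2)
```

Threads keep this sketch simple. In Python, a CPU-bound tokenizer would need process-based workers (for example `ProcessPoolExecutor`) to see a real speed-up.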
Cite As
Ralf Elsas (2024). fastBERTtokens: Tokenizing for BERT in parallel (https://www.mathworks.com/matlabcentral/fileexchange/125295-fastberttokens-tokenizing-for-bert-in-parallel), MATLAB Central File Exchange. Retrieved .
MATLAB Release Compatibility
Created with
R2022b
Compatible with R2021a and later releases
Platform Compatibility
Windows macOS Linux
Acknowledgements
Inspired by: Transformer Models
Version | Published | Release Notes | |
---|---|---|---|
1.0.0 |