BERT encoding is very slow - Help

Zzz on 7 May 2021
Answered: Ralf Elsas on 26 Feb 2023
I've been following this GitHub repository, which is the MATLAB implementation of BERT: https://github.com/matlab-deep-learning/transformer-models
While encoding my text with the tokenizer, following this script, I noticed that BERT encoding takes a very long time on my dataset.
My dataset contains 1000+ text entries, each of which is ~1000 in length, whereas the example CSV used in the repository contains only very short description text. My questions are: how can we perform text preprocessing using BERT encoding, and how can we speed up the encoding process?
Thanks!

Accepted Answer

Divya Gaddipati on 13 May 2021
Here are a few things that you can try to speed up the tokenizer, which were suggested by the GitHub repo author (you can also find this information here):
1. Remove the redundant whitespace tokenization in BasicTokenizer.
2. Convert the basic-tokenized tokens to UTF32 in one call in FullTokenizer, and modify WordPieceTokenizer to accept UTF32 as input.
3. Only call sub.string() once in WordPieceTokenizer.
4. Remove the input validation in WhitespaceTokenizer, which may be called many times.
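To illustrate why these changes help, here is a minimal sketch of a greedy WordPiece loop with the same restructuring applied. It is written in Python purely as a language-neutral illustration and is not the repository's MATLAB code; the names `wordpiece`, `encode_all`, and the toy vocabulary are hypothetical. Python strings are already indexed by code point, which stands in for the single UTF32 conversion in item 2.

```python
from typing import Iterable, List, Set


def wordpiece(token: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one whitespace-free token."""
    pieces: List[str] = []
    start, n = 0, len(token)
    while start < n:
        end, match = n, None
        while end > start:
            # Build the candidate substring once per iteration and reuse it
            # (compare item 3: call sub.string() only once).
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No vocabulary entry covers this position: the whole token is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def encode_all(texts: Iterable[str], vocab: Set[str]) -> List[List[str]]:
    """Tokenize every text, keeping per-call overhead out of the inner loops."""
    texts = list(texts)
    # Validate once at the public entry point rather than in a helper that
    # runs for every token (compare item 4).
    if not all(isinstance(t, str) for t in texts):
        raise TypeError("texts must be strings")
    encoded = []
    for text in texts:
        pieces: List[str] = []
        # Split on whitespace exactly once per text (compare item 1).
        for token in text.lower().split():
            pieces.extend(wordpiece(token, vocab))
        encoded.append(pieces)
    return encoded


toy_vocab = {"un", "##aff", "##able", "hello"}
print(encode_all(["unaffable hello"], toy_vocab))
# → [['un', '##aff', '##able', 'hello']]
```

The point of the sketch is that anything done per candidate substring or per token is multiplied by the length of the text, so with entries of around 1000 in length, hoisting conversions and validation out of those loops is where most of the saving would be expected to come from.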
If the issue persists, you could also open a new issue on the GitHub page itself.

More Answers (1)

Ralf Elsas on 26 Feb 2023
Hello! For everybody dealing with this issue: it can be easily solved with fastBERTtokens.
