speed up renamecats/categorical multiple columns

2 visualizaciones (últimos 30 días)
Peng Li
Peng Li el 12 de Mayo de 2020
Comentada: Peng Li el 9 de Oct. de 2020
I have a huge csv file of about 16GB which over 9k columns. Each column is initially filled with some codes (either integer or string), and I have a code book with code and meaning for each column. What I'm trying to do is to translate the table and finally have a table that has readable texts instead of codes.
I can use either categorical or renamecats to "translate" them, but the issue is that it takes substentially long time to loop through these columns. I'm thinking if there is a way to speed this up.
See below an example
tbl = table(["a1", "b2", "c3", "d4", "e5"]', ...
["123", "234", "345", "456", "567"]', ...
'VariableNames', {'A', 'B'});
dictionary.A = table(["a1", "b2", "c3", "d4", "e5"]', ...
["apple", "banana", "cat", "dog", "elephont"]', ...
'VariableNames', {'Code', 'Meaning'});
dictionary.B = table(["123", "234", "345", "456", "567"]', ...
["East", "West", "North", "South", "Middle"]', ...
'VariableNames', {'Code', 'Meaning'});
Vars = tbl.Properties.VariableNames;
for iC = 1:width(tbl)
tbl.(iC) = categorical(tbl.(iC), dictionary.(Vars{iC}).Code, ...
dictionary.(Vars{iC}).Meaning);
end
Is that possible to avoid this loop, or any suggestions to speed this up (considering that I have over 500k rows and 9k columns).
Thank you!

Respuestas (1)

Campion Loong
Campion Loong el 9 de Oct. de 2020
Hi Peng,
It seems you have the Dictionary code book to boot, and you already know which sets of code go wtih which field/name in the Dictionary (i.e. you can designate "VariableNames" in the first table(...) call).
In this case, why not create the table with categorical to begin with:
tbl = table(categorical(["a1"; "b2"; "c3"; "d4"; "e5"], dictionary.A.Code, dictionary.A.Meaning),...
categorical(["123"; "234"; "345"; "456"; "567"], dictionary.B.Code, dictionary.B.Meaning),...
'VariableNames', {'A', 'B'});
There is no loop, faster and much more readable.
  3 comentarios
Campion Loong
Campion Loong el 9 de Oct. de 2020
If you have thousands of columns, are you actually reading it from a file or a source somewhere? I struggle to imagine that could be manageable if you're making the first table call manually on thousands of columns.
If you are reading or importing, check out ImportOptions -- it gives you much more flexibility before actually reading the data in:
Peng Li
Peng Li el 9 de Oct. de 2020
Hi Campion, thanks again for you attention. I've actually tried different options -- tall array, datastore, transform a datastore, mapreduce, or readall in a server (over 380G ram) a while ago. This is easily handlable.
The issue is with this de-coding part. It is simply too slow to do a loop. And ImportOptions couldn't help with the decoding of the actual data I guess, as i have to load the data first and do the decoding.
I've tried a way using transform datastore. Basically in the transform function, I do the decoding, and then write the datastore to disk. It works, but slow too.
I have several workable solutions now but just no one gives me the best speed. The single file is around 20G in cvs format, with over half a million rows and almost 10 thunsands of columns. With my server this tasks takes over 24 hours so I guess i just need to be a bit patient to let the server work while i'm doing something else.

Iniciar sesión para comentar.

Categorías

Más información sobre Images en Help Center y File Exchange.

Productos


Versión

R2020a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by