CSV Variable Names truncated on 2 rows how to append

9 visualizaciones (últimos 30 días)
Umberto Voci
Umberto Voci el 13 de En. de 2020
Editada: Umberto Voci el 22 de En. de 2020
Hello!
I am trying to import a large number CSV files generated from another piece of software that has an unfortunate bug.
Instead of writing all the variable names (in this example 'Var') on one row, it will eventually truncate a variable name and continue on a new line. Example below:
Var1,Var2,Var3,V
ar4,Var5,Var6
Dat1,Dat2,Dat3,Dat4,Dat5,Dat6
Assuming I have no control over the data origin, how do I append/attach the 2nd row of variable names to end of the first to restore data continuity? (so that I may eventually use readtable or importdata without errors)
The row numbers for the trunkated variable names is always the same as they follow a header. It seems like for the files that I'm currently processing the truncation position is also consistant but I can't guarantee that it will stay like that forever
The end goal is to create a structure array X.Var1 ... VarN. A workaround is also acceptable.
Doing it manually is not an option as there are too many of them
Thank you!
  1 comentario
Umberto Voci
Umberto Voci el 13 de En. de 2020
Editada: Umberto Voci el 13 de En. de 2020
So far I have managed to fgetl to the lines I want to restore using this bit of code:
fid = fopen(filename);
m = 1;
for n = 1:X
line_1 = fgetl(fid);
m = m+1;
n = m;
end
line2= fgetl(fid);
So now that I have the two strings saved, I managed to concatenate them together using strcat
newline = strcat(line_1,line_2);
So all that´s left to do is delete the old lines and replace with the new one.. still trying to figure this out..
OR
since now I have a complete string of variable names, find a way to use this string as the headers without manipulating the files themselves.
delimiter = {';',','};
headernames = strsplit(newline,delimiter);

Iniciar sesión para comentar.

Respuesta aceptada

Umberto Voci
Umberto Voci el 22 de En. de 2020
Editada: Umberto Voci el 22 de En. de 2020
I´m going to submit a formal answer to my own question just in case someone else runs into this problem.
The problem is twofold here, getting the variable names and getting the data out.
Lets start with the names.
What I had to do at first was run detectimportoptions to get the object "opts" containing a lot of relevant information about the file
Then use fopen and fgetl to assign the fragmented variable names to two different variables, called line1 and line2.
opts = detectImportOptions('file');
fid = fopen('file');
%Go to line 5 (first line of variable names)
g = 1;
for f = 1:5
line1 = fgetl(fid);
g = g+1;
f = g;
end
%Go to line 6 (second line of variable names)
line2 = fgetl(fid);
fid = fclose(fid)
Now that the broken variable names have been detected and saved, it´s time to join them into one string. This being a CSV file delimited by a variable delimiter it is important to remove the delimiters from the end of the strings or when we will come to split them, the extra delmiters will cause issues.
This is where the opts object becomes useful as detectimportoptions will find the delimiter for you
Once we have defined the delimiter string we run a strfind to find the first delimiter in the series of delimiters that are at the end of the line. strfind returns the index of the 1st delimiter that matches the pattern. we remove 1 from that index to make sure we delete it.
% find the semicolons at the end of each line and remove them
delimiter = opts.Delimiter;
delimiters = [delimiter delimiter delimiter delimiter delimiter delimiter delimiter];
delimiters = strjoin(string(delimiters));
delimiters = strrep(delimiters, ' ', '');
linefind = strfind(line1,semicolons);
line1 = line1(1:(linefind(1)-1));
We do this for both lines.
linefind = strfind(line2,delimiters);
line2 = line2(1:linefind(1));
We then combine the strings of line1 and line2 then use strsplit to divide each name into cells and ~isempty to remove the blanks. We are left with an array of variable names.
%Combine strings of the two lines to what the correct line should be
newline = strcat(line1,line2);
%Generate header names
headernames = strsplit(newline,delimiter);
headernames = headernames(~cellfun('isempty',headernames));
Now for the data.
We can use the opts object to our advantage here by editing select properties where possible. For instance it was not possible (i dont know why) to actually set in opts what the variable names are by changing the content of opts.variablenames, but I had success in specifying the start of the datalines by changing opts.datalines. This will cause readtable to ignore every line above this specified line, while maintaining the import properties of each column. opts. datalines has 2 values for start and end of data. We only need to change one.
opts.DataLines(1) = 7;
F = readtable('file',opts);
From here we get a table of data that (hopefully) is the same length as headernames.
Now, Kissssssssssss
All we need to do from here is create a loop the same length as headernames to change each table variable name to the corresponding headername. Since we maintained the order of the names when we split the string, this is a simple 1 to 1 replacement.
for i = length(headernames)
F.Properties.VariableNames(i) = headernames(i);
end
That´s it.

Más respuestas (1)

Jyotsna Talluri
Jyotsna Talluri el 22 de En. de 2020
You can read the CSV files using readtable function and then you can modify the data as required
Refer to the below link
  1 comentario
Umberto Voci
Umberto Voci el 22 de En. de 2020
Editada: Umberto Voci el 22 de En. de 2020
Hi Jyotsna, as I explained in the question this won´t work due to the weird bug that I´m trying to correct. Using readtable results in garbled data at best, or errors in recognising the type of file at worst. using straight import also didnt work so I decided the only smooth way to do it was to tweak the file somehow.
I managed to work around this in the end by runnning detectimportoptions on the file that I want to import, and then tweaking the opts object to import only data without the headernames by manually changing the first line of data using the command opts.DataLines = X
Once I had a table containing ONLY data I used the fopen to open the csv file, fgetl commands to copy the lines and strcat to join them into 1 line and strsplit using opts.delimiter to split the string into variable names.
(See my comment)
Then it was a simple case to assign the names to the variables in a loop. It was a mindf*** but I got it done, a step at the time.

Iniciar sesión para comentar.

Categorías

Más información sobre Large Files and Big Data en Help Center y File Exchange.

Etiquetas

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by