How to ignore special characters and retrieve the data prior to the character

I have 40 years of data. Unfortunately, each text file contains the special characters # or *, which mark the highest or lowest temperature for that specific day and month. My code works (apart from regexp(minT_tbl,'#*','match') and its counterpart). However, the special characters are confusing the program and making the data wrong. Any help would be great!
close all;
clear all;
clc;
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
dataAll.Year = year(dataAll.Day);
dataAll.Month = month(dataAll.Day);
dataAll.DD = day(dataAll.Day)
%delete leap year
LY = (dataAll.Month(:)==2 & dataAll.DD(:)==29);
dataAll(LY,:) = [];
% Unstack variables
minT_tbl = unstack(dataAll,"MinT","Year","GroupingVariables", ["Month","DD"],"VariableNamingRule","preserve")
maxT_tbl = unstack(dataAll,"MaxT","Year","GroupingVariables", ["Month","DD"],"VariableNamingRule","preserve")
yrs =str2double(minT_tbl.Properties.VariableNames(3:end))';
%ignore special characters
regexp(minT_tbl,'#*','match')
regexp(maxT_tbl,'#*','match')
% find min
[Tmin,idxMn] = min(minT_tbl{:,3:end},[],2,'omitnan');
Tmin_yr = yrs(idxMn);
% find max
[Tmax,idxMx] = max(maxT_tbl{:,3:end},[],2,'omitnan');
Tmax_yr = yrs(idxMx);
% find low high
[lowTMax,idxMx] = min(maxT_tbl{:,3:end},[],2,'omitnan');
LowTMax_yr = yrs(idxMx);
% find high low
[highlowTMn,idxMn] = max(minT_tbl{:,3:end},[],2,'omitnan');
HighLowT_yr = yrs(idxMn);
% find avg high
AvgTMx = round(mean(table2array(maxT_tbl(:,3:end)),2,'omitnan'));
% find avg low
AvgTMn = round(mean(table2array(minT_tbl(:,3:end)),2,'omitnan'));
% Results
tempTbl = [maxT_tbl(:,["Month","DD"]), table(Tmax,Tmax_yr,AvgTMx,lowTMax,LowTMax_yr,Tmin,Tmin_yr,AvgTMn,highlowTMn,HighLowT_yr)]
tempTbl2 = splitvars(tempTbl)
FID = fopen('Meda 05 Temperature Climatology.txt','w');
report_date = datetime('now','format','yyyy-MM-dd HH:mm'); % mm = minutes (MM would print the month)
fprintf(FID,'Meda 05 Temperature Climatology at %s \n', report_date);
fprintf(FID,"Month DD Temp Max (°F) Tmax_yr AvgTMax (°F) lowTMax (°F) LowTMax_yr TempMin (°F) TMin_yr AvgTMin (°F) HighlowTMin (°F) HighlowT_yr \n");
fprintf(FID,'%3d %6d %7d %14d %11d %11d %15d %11d %13d %10d %13d %17d \n', tempTbl2{:,1:end}');
fclose(FID);
winopen('Meda 05 Temperature Climatology.txt')
function Tbl = readMonth(filename)
opts = detectImportOptions(filename)
opts.ConsecutiveDelimitersRule = 'join';
opts.MissingRule = 'omitvar';
opts = setvartype(opts,'double');
opts.VariableNames = ["Day","MaxT","MinT","AvgT"];
Tbl = readtable(filename,opts);
Tbl = standardizeMissing(Tbl,{999,'N/A'},"DataVariables",{'MaxT','MinT','AvgT'})
Tbl = standardizeMissing(Tbl,{-99,'N/A'},"DataVariables",{'MaxT','MinT','AvgT'})
[~,basename] = fileparts(filename);
nameparts = regexp(basename, '\.', 'split');
dateparts = regexp(nameparts{end}, '_','split');
year_str = dateparts{end}
d = str2double(extract(filename,digitsPattern));
Tbl.Day = datetime(d(3),d(2),Tbl.Day)
end
6 comments
Cris LaPierre on 7 Feb 2024
Test it out. It doesn't eliminate them because the month no longer equals 2 and the day no longer equals 29. They are now 3 and 1.
dataAll = table();
dataAll.Day = datetime(1981,2,29) % Feb 29, 1981, which is a non-leap year
dataAll = table
    Day
    ___________
    01-Mar-1981
dataAll.Month = month(dataAll.Day);
dataAll.DD = day(dataAll.Day)
dataAll = 1×3 table
    Day            Month    DD
    ___________    _____    __
    01-Mar-1981    3        1
% Remove all Feb 29 dates from the table
LY = (dataAll.Month(:)== 2 & dataAll.DD(:) == 29);
dataAll(LY,:) = [ ]
dataAll = 1×3 table
    Day            Month    DD
    ___________    _____    __
    01-Mar-1981    3        1
As you can see, the current LY code did not remove the data.
Jonathon Klepatzki on 7 Feb 2024
Well, I am stuck then, because I tried it several different ways and I continue to get the same result.
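One possible way around the rollover, sketched under assumptions: drop impossible day numbers before the datetime values are built, so that day 29 of a non-leap February never turns into 1 March. The names yr, mo, and dayNum below are hypothetical stand-ins for the year, the month, and the numeric Day column that readMonth already has; eomday is a standard MATLAB function that returns the last day of a given month.

```matlab
% Hypothetical stand-ins for the values readMonth parses from one file
yr = 1981;                 % a non-leap year
mo = 2;                    % February
dayNum = (27:29)';         % day-of-month column as read from the file

% Keep only days that exist in this month; eomday(1981,2) is 28
valid = dayNum <= eomday(yr, mo);
dayNum = dayNum(valid);

% No rollover now: every remaining day is a real calendar date
Day = datetime(yr, mo, dayNum);
```

Inside readMonth, the same mask would be applied to the table rows before the Tbl.Day = datetime(...) line, for example valid = Tbl.Day <= eomday(d(3), d(2)); Tbl = Tbl(valid,:);. The existing LY filter would then only need to handle 29 February in genuine leap years.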


Accepted Answer

Voss on 6 Feb 2024
Edited: Voss on 6 Feb 2024
The following code replaces any * or # characters in a text file with spaces (note that this replaces the existing file with a new file of the same name):
% read the file
fid = fopen(filename,'r');
str = fread(fid,[1 Inf],'*char');
fclose(fid);
% replace any * or # with a space (empty char vector should also work)
str = regexprep(str,'[*#]',' ');
% write the new file
fid = fopen(filename,'w');
fwrite(fid,str);
fclose(fid);
If you don't mind losing the original files that have the * and/or # characters in them, you can run this code for each of your text files before running your code or you can incorporate this code into your readMonth function.
If you want to preserve the original files, make a separate copy of them first, or modify the above code to write to a different file, e.g.:
% write the new file
[fp,fn,ext] = fileparts(filename);
fid = fopen(fullfile(fp,[fn '_modified' ext]),'w');
fwrite(fid,str);
fclose(fid);
and tell fileDatastore to use the modified files only, e.g.:
Datafiles = fileDatastore("temp_summary*_modified.txt","ReadFcn",@readMonth,"UniformRead",true);
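A third option, sketched here under assumptions, is to do the same replacement in memory inside the read function, so that no file is overwritten or copied. The function name readCleanMonth, the sample file layout, and the positions of the * and # flags are all hypothetical; the sketch also assumes one header line and exactly four numeric fields per data row.

```matlab
% Write a small sample file; the layout and flag positions are assumptions
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'w');
fprintf(fid, ' Day  Maximum Temp  Minimum Temp  Average Temp\n');
fprintf(fid, ' 01      66            28*           47.0\n');
fprintf(fid, ' 02      65#           29            47.0\n');
fclose(fid);

Tbl = readCleanMonth(sampleFile);
disp(Tbl)
delete(sampleFile);

function Tbl = readCleanMonth(filename)
    % Read the whole file as text; the file on disk is never modified
    raw = fileread(filename);
    % Replace the flag characters with spaces, as in the code above
    clean = regexprep(raw, '[*#]', ' ');
    % Split into lines and drop the single header line
    rows = splitlines(strtrim(clean));
    rows = rows(2:end);
    % Parse all remaining numbers and reshape to four columns per row
    nums = sscanf(strjoin(rows', ' '), '%f');
    data = reshape(nums, 4, []).';
    Tbl = array2table(data, 'VariableNames', {'Day','MaxT','MinT','AvgT'});
end
```

A function like this could be passed to fileDatastore as the ReadFcn in place of readMonth. Rows containing text such as N/A would break the sscanf step, so real files with missing-value markers would need a more tolerant parser.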
20 comments
Star Strider on 13 Feb 2024
I am lost. My code seems to work correctly when I run it, without any other modifications to it or to the tables or files it creates.
Mentioning me using ‘@’ flags me and I look to see what I need to attend to, if anything, since sometimes it’s just a reference.
Cris LaPierre on 13 Feb 2024
@Jonathon Klepatzki, you can specify the NumHeaderLines, VariableNamesLine, VariableUnitsLine, VariableDescriptionsLine, and the DataLines import arguments to correctly import a file that has non-data lines between the variable names and data.
However, because you are using a datastore to import your files, the same import options are used to read in all files. Therefore, all files must be formatted the same or you will get errors like the one you saw.
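As a rough illustration of the DataLines option mentioned above, the sketch below builds import options by hand and skips a non-data line that sits between the variable names and the data. The sample file layout is an assumption and has not been checked against the real files; the variable names are set explicitly rather than read from the file.

```matlab
% Sample file with a non-data line between the names and the data (assumed layout)
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'w');
fprintf(fid, 'Day MaxT MinT AvgT\n');
fprintf(fid, '--- ---- ---- ----\n');
fprintf(fid, '01 66 28 47.0\n');
fprintf(fid, '02 65 29 47.0\n');
fclose(fid);

% Build the import options explicitly instead of relying on detection
opts = delimitedTextImportOptions('NumVariables', 4);
opts.VariableNames = {'Day','MaxT','MinT','AvgT'};
opts.Delimiter = ' ';
opts.ConsecutiveDelimitersRule = 'join';
opts.LeadingDelimitersRule = 'ignore';
opts.DataLines = [3 Inf];          % data starts on line 3, so line 2 is skipped
opts = setvartype(opts, 'double');

T = readtable(sampleFile, opts);
delete(sampleFile);
```

Because a datastore applies one set of options to every file, a setup like this only helps if every file has the same number of lines before the data.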


More Answers (3)

Sulaymon Eshkabilov on 6 Feb 2024
Here is one possible solution, to get the data correctly from the data file:
% Open the data file for reading
FID = fopen('temp_summary.05.03_1998.txt', 'r');
% Initialize a cell array to store the cleaned data
C_Lines = {};
% Read the file line by line
N_line = fgetl(FID);
while ischar(N_line)
    % Remove '*' and '#' characters from the line
    C_Line = strrep(N_line, '*', '');
    C_Line = strrep(C_Line, '#', '');
    % Store the cleaned line if it is not empty
    if ~isempty(C_Line)
        C_Lines{end+1} = C_Line;
    end
    % Read the next line
    N_line = fgetl(FID);
end
% Close the file:
fclose(FID);
% Convert the cell array of cleaned lines to a character array:
C_Data = char(C_Lines)
C_Data = 32×49 char array
    ' Day Maximum Temp Minimum Temp Average Temp'
    ' 01 66 28 47.0 '
    ' 02 65 29 47.0 '
    ' 03 62 36 49.0 '
    ' 04 63 31 47.0 '
    ' 05 52 36 44.0 '
    ' 06 53 28 40.5 '
    ' 07 62 26 44.0 '
    ' 08 65 27 46.0 '
    ' 09 69 27 48.0 '
    ' 10 76 28 52.0 '
    ' 11 74 29 51.5 '
    ' 12 62 44 53.0 '
    ' 13 65 43 54.0 '
    ' 14 75 32 53.5 '
    ' 15 73 35 54.0 '
    ' 16 73 34 53.5 '
    ' 17 64 37 50.5 '
    ' 18 69 27 48.0 '
    ' 19 74 34 54.0 '
    ' 20 77 31 54.0 '
    ' 21 76 36 56.0 '
    ' 22 83 37 60.0 '
    ' 23 82 50 66.0 '
    ' 24 64 49 56.5 '
    ' 25 60 43 51.5 '
    ' 26 54 47 50.5 '
    ' 27 52 34 43.0 '
    ' 28 51 34 42.5 '
    ' 29 60 29 44.5 '
    ' 30 57 31 44.0 '
    ' 31 50 32 41.0 '

Cris LaPierre on 7 Feb 2024
Edited: Cris LaPierre on 8 Feb 2024
I think another rather straightforward approach is to treat * and # as delimiters.
I've simplified the read function for readability.
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
dataAll = 93×4 table
    Day    MaxT    MinT    AvgT
    ___    ____    ____    ____
     1      66      28      47
     2      65      29      47
     3      62      36      49
     4      63      31      47
     5      52      36      44
     6      53      28      40.5
     7      62      26      44
     8      65      27      46
     9      69      27      48
    10      76      28      52
    11      74      29      51.5
    12      62      44      53
    13      65      43      54
    14      75      32      53.5
    15      73      35      54
    16      73      34      53.5
function Tbl = readMonth(filename)
Tbl = readtable(filename,"ConsecutiveDelimitersRule","join","ReadVariableNames",false,...
"Delimiter",{' ','\t','*','#'},"LeadingDelimitersRule",'ignore',...
'EmptyLineRule','skip');
Tbl.Properties.VariableNames = {'Day' 'MaxT' 'MinT' 'AvgT'};
end
5 comments
Cris LaPierre on 7 Feb 2024
Moved: Cris LaPierre on 8 Feb 2024
Hmm. Works here. Have you shared the full error message (all the red text)?
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
dataAll = 124×4 table
    Day    MaxT    MinT    AvgT
    ___    ____    ____    ____
     1      65      12      38.5
     2      68      28      48
     3      65      17      41
     4      57      22      39.5
     5      46      24      35
     6      61      18      39.5
     7      62      25      43.5
     8      58      12      35
     9      64      11      37.5
    10      65      14      39.5
    11      54      22      38
    12      58      40      49
    13      64      27      45.5
    14      65      19      42
    15      59      19      39
    16      62      23      42.5
function Tbl = readMonth(filename)
Tbl = readtable(filename,"ConsecutiveDelimitersRule","join","ReadVariableNames",false,...
"Delimiter",{' ','\t','*','#'},"LeadingDelimitersRule",'ignore',...
'EmptyLineRule','skip');
Tbl.Properties.VariableNames = {'Day' 'MaxT' 'MinT' 'AvgT'};
end
Jonathon Klepatzki on 7 Feb 2024
Moved: Cris LaPierre on 8 Feb 2024
Everything that was said was provided.



Walter Roberson on 8 Feb 2024
To answer the original question:
An alternative way to read the files is to use FixedWidthImportOptions together with readtable() https://www.mathworks.com/help/matlab/ref/matlab.io.text.fixedwidthimportoptions.html
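A minimal sketch of that idea follows, assuming a hypothetical fixed-width layout in which each flag character occupies its own one-character column right after the temperature it marks. The column widths, the flag column names MaxFlag and MinFlag, and the sample file are illustrative assumptions, not the layout of the real files.

```matlab
% Write a sample file with an assumed fixed-width layout (24 characters per row)
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'w');
fprintf(fid, 'Day  MaxT   MinT   AvgT\n');
fprintf(fid, '%3d%6d%1s%6d%1s%7.1f\n', 1, 66, ' ', 28, '*', 47.0);
fprintf(fid, '%3d%6d%1s%6d%1s%7.1f\n', 2, 65, '#', 29, ' ', 47.0);
fclose(fid);

% Give each flag its own one-character field so it never touches the numbers
opts = fixedWidthImportOptions('NumVariables', 6, ...
    'VariableNames', {'Day','MaxT','MaxFlag','MinT','MinFlag','AvgT'}, ...
    'VariableWidths', [3 6 1 6 1 7], ...
    'VariableTypes', {'double','double','char','double','char','double'}, ...
    'DataLines', [2 Inf]);

T = readtable(sampleFile, opts);
% Drop the flag columns if only the temperatures are needed
T = removevars(T, {'MaxFlag','MinFlag'});
delete(sampleFile);
```

The flag columns could instead be kept, since they record which days hold the record highs and lows.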

Release

R2020b
