Converting unformatted text to formatted text

I asked this question before and neglected some info, so I want to start fresh to avoid confusion.
clear all;
close all
clc
projectdir = 'C:\Users\me\data.psr';
newdir = 'C:\Users\me\Desktop\Test1';
fid=fopen(projectdir,'r');
T=textscan(fid, '%s');
fclose(fid);
for i=8:107
a=T{1,1}{i,1};
b= a(30:48);
matrix(i).r = b(2);
matrix(i).c = b(5);
matrix(i).info = b(8:13);
end
A = zeros(9,9)
for j=8:107
A(matrix(j).r, matrix(j).c) = matrix(j).info;
end;
The error:
Assignment has more non-singleton rhs dimensions than non-singleton subscripts
Error in Untitled2 (line 23)
A(matrix(j).r, matrix(j).c) = matrix(j).info;
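A likely reading of this error, sketched under assumptions: b(2), b(5) and b(8:13) are characters, not numbers. The right-hand side is therefore a 1x6 char vector being assigned to a single element of A, and a char such as '3' used as a subscript is taken as its character code (51), not as 3. The sample string below is hypothetical, since the real file contents are not shown; converting with str2double before indexing may be all that is needed.

```matlab
% Hypothetical stand-in for a(30:48); the real file layout is an assumption.
b = '[3][5]:1.0000000000';

r   = str2double(b(2)) + 1;     % '3' -> 3, shifted to one-based indexing
c   = str2double(b(5)) + 1;     % '5' -> 5, shifted to one-based indexing
val = str2double(b(8:13));      % '1.0000' -> numeric scalar

A = zeros(9, 9);
A(r, c) = val;                  % scalar into a single element: no size mismatch
```

Fixed character positions only work while every index is a single digit; the delimiter-based parsing in the answers avoids that limitation.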
This answer by user Stephen Cobeldick might help, although it was written only to deal with the histogram. However, it gives an error when run.
str = fileread('temp.txt');
% identify digits:
rgx = '[A-Z]+\[(\d+)\]\[(\d+)\]:*(\d+)';
C = regexp(str,rgx,'tokens');
% convert digits to numeric:
M = cellfun(@str2double,vertcat(C{:}));
M(:,1:2) = 1+M(:,1:2);
% convert to linear indices:
out = nan(max(M(:,1)),max(M(:,2)));
idx = sub2ind(size(out),M(:,1),M(:,2));
% allocate values:
out(idx) = M(:,3)
Error using cellfun
Input #2 expected to be a cell array, was double instead.
Error in Untitled3 (line 12)
M = cellfun(@str2double,vertcat(C{:}));
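One plausible cause of this cellfun error, assuming the pattern found no matches in the file: regexp then returns an empty cell, vertcat(C{:}) collapses to an empty double, and cellfun is handed a double where it expects a cell. The sketch below assumes lines of the form `Name[i][j]: value` (the sample text and the adjusted pattern rgxNew are illustrative assumptions, not taken from the original file).

```matlab
% Assumed sample text; the real file is not shown in the question.
str = sprintf(['RainflowCycleCounterHistogram[0][0]: 1.0\n', ...
               'RainflowCycleCounterHistogram[1][2]: 4.0\n']);

% Original pattern: needs an upper-case letter directly before '[' and
% digits directly after ':', so it finds nothing in the text above.
rgxOld = '[A-Z]+\[(\d+)\]\[(\d+)\]:*(\d+)';
Cold = regexp(str, rgxOld, 'tokens');
noMatch = isempty(Cold);                 % true: this is the failing case

% Adjusted pattern (assumption): any word characters, optional blanks
% after the colon, and a decimal value.
rgxNew = '\w+\[(\d+)\]\[(\d+)\]:\s*([0-9.]+)';
C = regexp(str, rgxNew, 'tokens');
if isempty(C)
    error('No lines matched the pattern; check the file format.');
end
M = str2double(vertcat(C{:}));           % N-by-3 numeric array
M(:,1:2) = 1 + M(:,1:2);                 % zero-based -> one-based
out = nan(max(M(:,1)), max(M(:,2)));
out(sub2ind(size(out), M(:,1), M(:,2))) = M(:,3);
```

Checking isempty(C) before the conversion turns an obscure cellfun message into a clear hint that the pattern and the file disagree.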

7 Comments

Stephen23 on 25 Nov 2015
Edited: Stephen23 on 25 Nov 2015
@Ibro Tutic: you have edited your other question and removed all information from it, making our answers useless. Some might consider this to be extremely rude. On this forum it is certainly not considered to be helpful.
You have copied my code, but did not make the single small gesture of acknowledging this with a vote, or by accepting my answer, even though it perfectly resolved your original question.
Your comment that it "It gives an error when ran however" is incorrect: it works perfectly, without error, for the file that I had to create (because you did not provide any sample data).
Ibro Tutic on 25 Nov 2015
I removed the entire question and linked this one. I honestly don't know what your problem is. Yes your answer was correct, and I needed more. I will go ahead and give you best answer for the last question, but I obviously stated that I had asked this question before and needed to ask it again to make up for the mistakes I made with the last one. I gave you credit for your code and provided it so if anyone wanted to go off of your code, they could. I could have just posted your code and said here it doesn't work fix it, but I didn't. I legitimately tried to fix the mistakes in my last question by posting a new question to avoid the confusion that I know would have resulted from trying to add in information and change what I needed.
I am actually trying to learn how to use the program and not just take people's code. I apologize if you take it personally.
Stephen23 on 25 Nov 2015
I am just pointing out that you are not the only person involved in this forum, and yet when you unilaterally decide to delete your question text it affects everyone who was involved, especially those who volunteered their time to develop valid answers. Imagine if everyone decided to delete their questions after they got an answer: this forum would be useless as a store of information available for everyone to use. That is all.
It is not just you, other users do it too. You might be interested to read what other volunteers feel about this behavior:
per isakson on 25 Nov 2015
Stephen, Thanks for making me aware that I'm wasting my time!
Ibro Tutic on 25 Nov 2015
Edited: John Kelly on 10 Nov 2017
Yes, I probably should have left the question, but I was under the impression that if I ask the same question again you would feel that your answers weren't good enough. Adding in more and more info that I missed will confuse the person answering the question and me, as I probably don't remember what exactly I had posted before. This was the simplest solution to a small problem, now what I will do is rewrite the original question and put it in there to solve what seems to be the biggest issue in the history of this forum.
I completely understand where you are coming from, but I am not sure that you understand what I am trying to accomplish by posting this new question. Sorry, I guess? I am trying to rectify my mistake and it seems that people are more worried about the fact that I deleted a question rather than trying to "help" with my other questions, judging from what isakson just commented. I am legitimately trying to learn how to do this and people are making MASSIVE deals out of problems that shouldn't be that important (yes, if I just deleted the entire question I would understand, but I clearly stated my intentions). Like I said, yea, I probably messed up deleting the question, but I'm not sure if arguing about that rather than actually helping with the question is the mature thing to do.
It's not like I am consistently deleting every question that I get answered to cover up a trail or something. It was my first time doing it and now I realize that I screwed up in doing so. I remain respectful in every aspect of my questions, giving credit to people who wrote certain code, etc.
With that said, thanks for any help you/isakson/dpd provided.
Stephen23 on 25 Nov 2015
I hope that you get the help and information that you need, and have fun learning MATLAB! We do put a lot of effort in when people need it, so please come and ask more questions :)


Accepted Answer

per isakson on 24 Nov 2015
Edited: per isakson on 28 Nov 2015
I have assumed that the sizes of the resulting arrays are known
fid = fopen( 'c:\m\cssm\test4.txt' );
rows = textscan( fid, '%s', 'Delimiter', '\n' );
fclose( fid );
rows = rows{:};
str = 'RainflowCycleCounterHistogram'; % avoid magic number
len = length( str );
is_counter = strncmp( str, rows, len );
counter_rows = rows( is_counter );
%
str = 'RainflowCycleMeanBreakpoints';
len = length( str );
is_mean = strncmp( str, rows, len );
mean_rows = rows( is_mean );
%
str = 'RainflowCycleRangeBreakpoints';
len = length( str );
is_range = strncmp( str, rows, len );
range_rows = rows( is_range );
%
counter_matrix = nan( 10, 10 );
for jj = 1 : length( counter_rows )
%
cac = textscan( counter_rows{jj}, '%*s%d%d%f' ...
, 'Delimiter' , ' []:' ...
, 'MultipleDelimsAsOne', true );
%
counter_matrix( cac{1}+1, cac{2}+1 ) = cac{3}; % one based
end
mean_vector = nan( 1, 10 );
for jj = 1 : length( mean_rows )
%
cac = textscan( mean_rows{jj}, '%*s%d%f' ...
, 'Delimiter' , ' []:' ...
, 'MultipleDelimsAsOne', true );
%
mean_vector( 1, cac{1}+1 ) = cac{2}; % one based
end
range_vector = nan( 1, 10 );
for jj = 1 : length( range_rows )
%
cac = textscan( range_rows{jj}, '%*s%d%f' ...
, 'Delimiter' , ' []:' ...
, 'MultipleDelimsAsOne', true );
%
range_vector( 1, cac{1}+1 ) = cac{2}; % one based
end
or maybe better - no assumptions regarding sizes
fid = fopen( 'c:\m\cssm\test4.txt' );
rows = textscan( fid, '%s', 'Delimiter', '\n' );
fclose( fid );
rows = rows{:};
str = 'RainflowCycleCounterHistogram'; % avoid magic number
len = length( str );
is_counter = strncmp( str, rows, len );
counter_rows = rows( is_counter );
%
str = 'RainflowCycleMeanBreakpoints';
len = length( str );
is_mean = strncmp( str, rows, len );
mean_rows = rows( is_mean );
%
str = 'RainflowCycleRangeBreakpoints';
len = length( str );
is_range = strncmp( str, rows, len );
range_rows = rows( is_range );
%
CRS = permute( char( counter_rows ), [2,1] );
cac = textscan( CRS, '%*s%f%f%f' ...
, 'Delimiter' , '[]: '...
, 'MultipleDelimsAsOne' , true ...
, 'CollectOutput' , true );
num = cac{1};
%
sz1 = min( num(:,1:2), [], 1 );
sz2 = max( num(:,1:2), [], 1 );
sz = sz2-sz1+[1,1];
ix_linear = sub2ind( sz, num(:,1)+1, num(:,2)+1 ); % one based
counter_matrix( ix_linear ) = num(:,3);
counter_matrix = reshape( counter_matrix, sz );
MRS = permute( char( mean_rows ), [2,1] );
cac = textscan( MRS, '%*s%f%f' ...
, 'Delimiter' , '[]: '...
, 'MultipleDelimsAsOne' , true ...
, 'CollectOutput' , true );
num = cac{1};
%
mean_vector( num(:,1)+1 ) = num(:,2); % one based
RRS = permute( char( range_rows ), [2,1] );
cac = textscan( RRS, '%*s%f%f' ...
, 'Delimiter' , ' []:'...
, 'MultipleDelimsAsOne' , true ...
, 'CollectOutput' , true );
num = cac{1}; % take the range rows; without this, num still holds the mean values
%
range_vector( num(:,1)+1 ) = num(:,2); % one based
hope they return identical results :-)
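The scatter step used in the second version can be checked on a small example. The sketch below uses made-up zero-based index/value triples (the num array is hypothetical, not taken from test4.txt) and preallocates the result with nan, which is a small variation on the code above: the outcome then does not depend on the highest-index cell being present in the file.

```matlab
% Hypothetical parsed data: [zero-based row, zero-based column, value]
num = [0 0 1
       0 2 5
       1 1 7];

sz = max(num(:,1:2), [], 1) + 1;          % array size from the largest indices
M  = nan(sz);                             % cells missing from the file stay NaN

ix_linear = sub2ind(sz, num(:,1)+1, num(:,2)+1);   % one based
M(ix_linear) = num(:,3);
```

Here M is 2-by-3 with the three values placed at (1,1), (1,3) and (2,2); every other cell remains NaN.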
and another iteration
Comments:
  • A function is superior to a script. It doesn't mess with the base workspace. It's easier to debug and it's easier to call from a script or function.
  • This function is readable. It's fairly straightforward to add new keywords and row formats.
  • The switch case can be replaced by a feval construct. But why do that?
  • The subfunctions, f1, f2 and f3, have large parts of their code in common. That asks for further refactoring.
  • Allocating a separate sub-function to each type of row makes testing easier.
  • If speed becomes a problem analyze the code with the profiler.
>> S = cssm( 'c:\m\cssm\text4.txt' )
S =
RainflowCycleCounterHistogram: [10x10 double]
RainflowCycleMeanBreakpoints: [-111 100 300 330 360 380 390 400 410 420]
RainflowCycleRangeBreakpoints: [0 35 70 100 135 170 200 230 260 300]
RainflowCycleReversalTolerance: 20
PowerCylinderTemperature: 0
PowerCylinderTemperatureHistogram: [1x12 double]
PowerCylinderTemperatureHistogramBreakpoints: [0 150 175 200 220 250 300 320 350 370 400]
>>
where
function S = cssm( filespec )
fid = fopen( filespec );
rows = textscan( fid, '%s', 'Delimiter', '\n' );
fclose( fid );
rows = strtrim( rows{:} );
type_list = {
... format keyword
'f1', 'RainflowCycleCounterHistogram'
'f2', 'RainflowCycleMeanBreakpoints'
'f2', 'RainflowCycleRangeBreakpoints'
'f3', 'RainflowCycleReversalTolerance'
'f3', 'PowerCylinderTemperature'
'f2', 'PowerCylinderTemperatureHistogram'
'f2', 'PowerCylinderTemperatureHistogramBreakpoints'
};
for jj = 1 : size( type_list, 1 )
switch type_list{jj,1}
case 'f1'
S.(type_list{jj,2}) = f1( type_list{jj,2}, rows );
case 'f2'
S.(type_list{jj,2}) = f2( type_list{jj,2}, rows );
case 'f3'
S.(type_list{jj,2}) = f3( type_list{jj,2}, rows );
otherwise
error( 'The format, "%s", is not yet implemented', type_list{jj,1} )
end
end
end
function matrix = f1( keyword, rows )
ism = is_member( keyword, rows );
cur_rows = rows( ism );
%
str = permute( char( cur_rows ), [2,1] );
cac = textscan( str, '%*s%f%f%f' ...
, 'Delimiter' , '[]: '...
, 'MultipleDelimsAsOne' , true ...
, 'CollectOutput' , true );
num = cac{1};
%
sz1 = min( num(:,1:2), [], 1 );
sz2 = max( num(:,1:2), [], 1 );
sz = sz2-sz1+[1,1];
ix_linear = sub2ind( sz, num(:,1)+1, num(:,2)+1 ); % one based
matrix( ix_linear ) = num(:,3);
matrix = reshape( matrix, sz );
end
function matrix = f2( keyword, rows )
ism = is_member( keyword, rows );
cur_rows = rows( ism );
%
str = permute( char( cur_rows ), [2,1] );
cac = textscan( str, '%*s%f%f' ...
, 'Delimiter' , '[]: '...
, 'MultipleDelimsAsOne' , true ...
, 'CollectOutput' , true );
num = cac{1};
%
matrix( num(:,1)+1 ) = num(:,2); % one based
end
function matrix = f3( keyword, rows )
ism = is_member( keyword, rows );
cur_rows = rows( ism );
%
str = permute( char( cur_rows ), [2,1] );
cac = textscan( str, '%*s%f', 'Delimiter',':' );
matrix = cac{:};
end
function ism = is_member( keyword, rows )
% the keyword is followed by either ":" or "["
cac = regexp( rows, ['^',keyword,'(?=(:|\[))'], 'once' );
ism = not( cellfun( @isempty, cac ) );
end

12 Comments

Ibro Tutic on 24 Nov 2015
Cool, thanks a lot.
Ibro Tutic on 24 Nov 2015
Edited: Ibro Tutic on 25 Nov 2015
I do have another question, how do you know what formatspec to use? That is my main issue when I try to textscan any document really.
Also, when I try to format the temperature data, the PowerCylinderTemperatureHistogram data includes data from PowerCyl...HistogramBreakPoints. How can I exclude this?
dpb on 24 Nov 2015
Basically, there are two cases...the first is if the data are all numeric and regular with at most header lines, you can forget the format spec and use an empty string; textscan (and its red-haired stepchild cousin textread) will then return the same shape as the input file automagically.
Other than that, you basically have to know what the format of the input file is and, as mine and Per's answers show, use the "features" of the file structure to be able to parse specific formats. In this case, there were no blanks in the lines to be parsed, hence the delimiters could become the non-characters of interest, the square brackets and the colon.
Since, however, there were adjoining brackets ("][") that, if each is treated as a delimiter, indicate an empty field, and that isn't the way the record should be interpreted, the 'MultipleDelimsAsOne' parameter needs to be set so that the sequence is treated as only one delimiter. Other than that, it was '%*s' to skip the first string field up to the first bracket, followed by three numeric fields. I didn't differentiate between the integer and floating point fields; one can do so, but then textscan returns a separate cell for each type, which is generally more hassle to use.
All in all, it takes some "time in grade" to be able to figure out all the gyrations inherent in the C format parsing and there's much that is, simply put, essentially magic, particularly when it comes to fixed-width fields.
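As a small illustration of the format spec described above, the sketch below parses one line in the assumed `Name[i][j]: value` form. The sample row is made up; the textscan options are the ones already used in the answers.

```matlab
% Hypothetical sample row in the assumed file format
row = 'RainflowCycleCounterHistogram[2][3]: 5.0';

% '%*s' skips the keyword up to the first delimiter, then three numeric
% fields follow. Brackets, colon and blank all act as delimiters, and
% runs of them ("][", ": ") count as one.
cac = textscan(row, '%*s%f%f%f' ...
    , 'Delimiter'           , '[]: ' ...
    , 'MultipleDelimsAsOne' , true );

vals = [cac{:}];    % zero-based row index, zero-based column index, value
```

Reading all three fields as '%f' keeps them in one numeric type, so they can be concatenated directly.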
per isakson on 24 Nov 2015
Edited: per isakson on 25 Nov 2015
"how do you know what formatspec to use?"
  • In the best of all worlds a specification of the file format comes together with the file.
  • In practice one often has to rely on inspection of the file.
  • textscan requires that all rows have identical format, with the exception of a specified number of header lines. test4.txt doesn't meet this requirement.
  • AD HOC : for the particular end or case at hand without consideration of wider application [Merriam-Webster]. I tend to make a special piece of code for every different type of file. Practice helps to find solutions within a reasonable time.
"How can I exclude this?"
str = 'PowerCylinderTemperatureHistogram'; % or any other string
len = length( str );
is_power_cylinder = strncmp( str, rows, len );
%
rows( is_power_cylinder ) = [];
Ibro Tutic on 25 Nov 2015
Cool thanks for the explanations. And about my last question, I meant to say that PowerCylinderHistogram is including data from PowerCylinderHistogramBreakpoints, I guess it really doesn't matter but I would like to keep everything consistent with how the data is read and I would prefer it if everything was in its own array.
per isakson on 25 Nov 2015
The string "PowerCylinderHistogram" doesn't appear in the sample file, test4.txt. Isn't your recent comment the first time the string "PowerCylinderHistogram" appears in this thread? Thus, we are not in the best of all worlds!
Looking at test4.txt I guess
  • the files are not huge, because with huge (several GB) files those long names would be an enormous waste of memory.
  • there may be a large number of different row types, i.e. different names. Are all possible names known beforehand?
dpb on 25 Nov 2015
"PowerCylinderHistogram is including data from PowerCylinderHistogramBreakpoints, "
Don't use strncmp but strcmp for exact match including length. The test as written specifically will make a match of the two strings because it finds only the N characters that are in the match string in another string of that length or longer.
Read the doc's and the "See Also" sections and think about what you're after...
Ibro Tutic on 25 Nov 2015
Edited: Ibro Tutic on 25 Nov 2015
Yes, all possible names are known beforehand. The files are small because they are pulled manually from the ECU whenever we need them and they are specific to whatever we are looking for. dpb's answer to use strcmp instead of strncmp looks like it will solve my problem.
I had a file called test4 and text4, I attached the wrong one. I went ahead and attached the new file.
I tried to use strcmp to only look for the exact text, but it returns a value of 0 for every row (I assume this is because it is looking ONLY for that exact string and the fact that there are more characters after the string causes it to return a 0). I went ahead and attached a text file with ALL of the data I am looking at (not actual data, modified numbers) and the code I added to per isakson's original code. I am using the actual file to test the code against, so that will need to be changed to account for the text4.txt file.
clear all;
close all
clc
projectdir = 'C:\Users\it58528\Documents\Power Cylinder Temp and Rainflow Cycle Counter - After 16500 Cycles - 2015-10-12.prm';
fid = fopen(projectdir);
rows = textscan( fid, '%s', 'Delimiter', '\n' );
fclose( fid );
rows = rows{:};
str = 'RainflowCycleCounterHistogram'; % avoid magic number
len = length( str );
is_counter = strncmp( str, rows, len );
counter_rows = rows( is_counter );
%
str = 'RainflowCycleMeanBreakpoints';
len = length( str );
is_mean = strncmp( str, rows, len );
mean_rows = rows( is_mean );
%
str = 'RainflowCycleRangeBreakpoints';
len = length( str );
is_range = strncmp( str, rows, len );
range_rows = rows( is_range );
%
str = 'PowerCylinderTemperatureHistogram';
len = length (str);
is_temp = strcmp ( str, rows );
temp_rows = rows ( is_temp );
%
str = 'PowerCylinderTemperatureHistogramBreakpoints';
len = length (str);
is_break = strncmp ( str, rows, len );
break_rows = rows ( is_break);
counter_matrix = nan( 10, 10 );
for jj = 1 : length( counter_rows )
%
cac = textscan( counter_rows{jj}, '%*s%d%d%f' ...
, 'Delimiter' , ' []:' ...
, 'MultipleDelimsAsOne', true );
%
counter_matrix( cac{1}+1, cac{2}+1 ) = cac{3}; % one based
end
mean_vector = nan( 1, 10 );
for jj = 1 : length( mean_rows )
%
cac = textscan( mean_rows{jj}, '%*s%d%f' ...
, 'Delimiter' , ' []:' ...
, 'MultipleDelimsAsOne', true );
%
mean_vector( 1, cac{1}+1 ) = cac{2}; % one based
end
range_vector = nan( 1, 10 );
for jj = 1 : length( range_rows )
%
cac = textscan( range_rows{jj}, '%*s%d%f' ...
, 'Delimiter' , ' []:' ...
, 'MultipleDelimsAsOne', true );
%
range_vector( 1, cac{1}+1 ) = cac{2}; % one based
end
temp_matrix = nan ( 1, 12 );
for jj = 1 : length ( temp_rows )
%
cac = textscan( temp_rows{jj}, '%*s%d%f' ...
, 'Delimiter' , ' []:' ...
, 'MultipleDelimsAsOne', true );
%
temp_matrix( 1, cac{1}+1 ) = cac{2}; %one based
end
temp_vector = nan ( 1, 11 );
for jj = 1 : length ( break_rows )
%
cac = textscan( break_rows{jj}, '%*s%d%f' ...
, 'Delimiter' , ' []:' ...
, 'MultipleDelimsAsOne', true );
%
temp_vector( 1, cac{1}+1 ) = cac{2};
end
per isakson on 25 Nov 2015
Edited: per isakson on 25 Nov 2015
Long time ago I was told "the longer you keep away from the keyboard the better a program". Modern computers invite experimentation. However, some up-front planning is useful.
I use strncmp on the complete rows, before the rows are parsed. strcmp cannot be used for that purpose (regexp or strfind can be used with the full keyword on the full row). On the other hand, once the rows are parsed, strcmp is the natural choice. "PowerCylinderHistogram" and "PowerCylinderHistogramBreakpoints", one being part of the other, pose a problem to my solution, which obviously is based on too little information on the problem.
IMO: one should start with a full list of possible keywords together with the related format strings.
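The prefix problem discussed here can be reproduced on three made-up rows (the row contents are hypothetical). A plain prefix test with strncmp selects both the histogram and its breakpoints, whereas an anchored regexp that requires ":" or "[" right after the keyword, the same idea as is_member in the answer above, selects only the intended rows.

```matlab
% Hypothetical rows; only the keywords come from the thread
rows = {
    'PowerCylinderTemperatureHistogram[0]: 1'
    'PowerCylinderTemperatureHistogramBreakpoints[0]: 150'
    'PowerCylinderTemperature: 0'
    };
keyword = 'PowerCylinderTemperatureHistogram';

% Prefix test: also true for the longer "...Breakpoints" keyword
is_prefix = strncmp(keyword, rows, length(keyword));

% Anchored test: the keyword must be followed directly by ':' or '['
hit      = regexp(rows, ['^', keyword, '(?=(:|\[))'], 'once');
is_exact = ~cellfun(@isempty, hit);
```

With these rows is_prefix flags the first two entries and is_exact flags only the first.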
Ibro Tutic on 25 Nov 2015
Sounds good, I'll see what I can figure out, thanks!
dpb on 25 Nov 2015
Edited: dpb on 25 Nov 2015
What is the desired output again? I'd approach it a little more generically, although I'm not sure precisely what you want to do with the end result. I'll note that from your file one can do the following:
>> S=textread('test4.txt','%s','delimiter','\n','whitespace','','headerlines',3); % read into cell array of strings
>> tok=cellfun(@(x) tokens(x,'[]:'),S,'uniformoutput',0); % find tokens each line
>> whos tok
Name Size Bytes Class Attributes
tok 52x1 13660 cell
>> tok{1} % sample what looks like
ans =
RainflowCycleCounterHistogram
0
0
1.0000000000
>> ntok=cellfun(@(x) size(x,1),tok); % number in each row
>> [min(ntok) max(ntok)] % range overall in file
ans =
2 4
>> for n=min(ntok):max(ntok) % build specific format string
fmt=['%s' repmat('[%d]',1,n-2) ':%f']
end
fmt =
%s:%f
fmt =
%s[%d]:%f
fmt =
%s[%d][%d]:%f
>> [u,iu]=unique(cellfun(@(x) x(1,:),tok,'uniform',0),'stable') % what's in file and where???
u =
'RainflowCycleCounterHistogram'
'RainflowCycleMeanBreakpoints'
'RainflowCycleRangeBreakpoints'
'RainflowCycleReversalTolerance'
'PowerCylinderTemperature'
'PowerCylinderTemperatureHistogram'
'PowerCylinderTemperatureHistogramBreakpoints'
iu =
1
8
18
28
29
30
42
>>
From the above pieces one can write a general parser for each possible data line format as long as they follow the form of
String[Index1][Index2]: Value
where the number of indices can be 0, 1 or 2. The above will actually handle N-dimensional arrays; it's just that 2 is the largest seen to date.
With the above it's simple enough to write a routine that loops over the elements in the u array, builds the proper format string, and selects and parses the given lines without any specific testing for matching strings at all. Only when a user asks for a given one or set do those need to be picked out of the general result.
But, you don't need to parse the individual lines at all; simply convert the fields within the token array for the ones of choice from the corollary tok array; ntok gives the info on how many elements there are corresponding to the fields.
function tok = tokens(s,d)
% Simple string parser returns tokens in input string s
%
% T=TOKENS(S) returns the tokens in the string S delimited
% by "white space". Any leading white space characters are ignored.
%
% TOKENS(S,D) returns tokens delimited by one of the
% characters in D. Any leading delimiter characters are ignored.
% DPBozarth (Rev 1 1998)
% Get initial token and set up for rest
if nargin==1
[tok,r] = strtok(s);
while ~isempty(r)
[t,r] = strtok(r);
tok = strvcat(tok,t);
end
else
[tok,r] = strtok(s,d);
while ~isempty(r)
[t,r] = strtok(r,d);
tok = strvcat(tok,t);
end
end
Also, of course, regexp can return tokens if one's got the patience to figure out the proper expression needed...
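For completeness, here is one way the regexp route might look: a minimal sketch, assuming lines of the form `String[Index1][Index2]: Value` with zero or more bracketed indices. The sample text and the pattern are illustrative assumptions, not code from the thread.

```matlab
% Hypothetical sample line in the assumed format
txt = 'RainflowCycleCounterHistogram[2][3]: 5.0';

% Token 1: keyword, token 2: all bracketed indices as one string,
% token 3: the value after the colon.
pat = '^(\w+)((?:\[\d+\])*):\s*(\S+)';
tok = regexp(txt, pat, 'tokens', 'once');

name  = tok{1};
idx   = str2double(regexp(tok{2}, '\d+', 'match')) + 1;   % one based
value = str2double(tok{3});
```

The number of elements in idx tells how many indices the line carried, which plays the same role as ntok above.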
per isakson on 25 Nov 2015
Now I added a new piece of code to the answer.


More Answers (1)

dpb on 24 Nov 2015
>> fmt='%*s%f%f%f';
>> fid=fopen('test4.txt');
>> c=cell2mat(textscan(fid,fmt,'headerlines',3,'delimiter','[]:','collectoutput',1,'multipledelimsAsOne',1));
>> v(sub2ind(sz,c(:,1)+1,c(:,2)+1))=c(:,3)
v =
Columns 1 through 10
1 0 1 1000 0 0 0 1 0 0
Columns 11 through 20
0 0 0 1 0 0 0 0 0 0
>> fid=fclose(fid);


Asked: 24 Nov 2015

Commented: 26 Dec 2020
