How do I use textscan to extract only the numbers that match a certain pattern (not all the numbers) from a sentence in a text file?

I am new to MATLAB and really struggling with this. I am trying to extract numbers at certain positions, or matching certain patterns, from the sentences in a text file, but I have no clue how to filter out the other numbers; I only know how to extract all of them. For example, I have the following data in a text file named ABC.txt:
TJX was in top-20 3 times and got higher 2 times within 1 day(s), 66.67%. It went 6.23% higher on average
TJX was in top-100 32 times and got higher 22 times within 1 day(s), 68.75%. It went 2.80% higher on average
TJX was in top-200 56 times and got higher 43 times within 1 day(s), 76.79%. It went 2.63% higher on average
Your choice on 2021-03-19: TJX(-) 1599/1962
Your choice on 2021-03-18: TJX(-) 1365/2029
Your choice on 2021-03-17: TJX(+) 497/1898
Your choice on 2021-03-16: TJX(-) 1721/1973
Your choice on 2021-03-15: TJX(+) 369/2039
Your choice: AMT since 2020-01-14
AMT was in top-20 1 times and got higher 0 times within 1 day(s), 0.00%. It went 0.00% higher on average
AMT was in top-100 11 times and got higher 8 times within 1 day(s), 72.73%. It went 1.31% higher on average
AMT was in top-200 20 times and got higher 16 times within 1 day(s), 80.00%. It went 2.03% higher on average
Your choice on 2021-03-19: AMT(+) 437/1962
Your choice on 2021-03-18: AMT(N) 1818/2029
Your choice on 2021-03-17: AMT(-) 1738/1898
Your choice on 2021-03-16: AMT(-) 1807/1973
Your choice on 2021-03-15: AMT(N) 259/2039
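I want to extract the information that I underlined by hand in the original post (the ticker name, the top-N limit, the number of times in the top N, the first percentage, and the month, day and sign of each choice) and write the sorted data to a text file named ABC_Reduced.txt, like this: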
TJX.. 20:03/67% 100:32/69% 200:56/77% 0319(-)0318(-)0317(+)0316(-)0315(+)
AMT.. 20:01/00% 100:11/73% 200:20/80% 0319(+)0318(N)0317(-)0316(-)0315(N)
Any help or hint would be appreciated.
Thanks,
Esther

Accepted Answer

Mathieu NOE on 22 Mar 2021
Hello Esther,
That was not my easiest code of the day, but I finally managed to get it done!
My input and output text files are attached.
Hope it helps!
clc
clearvars
Filename_in = 'dataABC.txt';
Filename_out= 'dataABC_reduced.txt';
[Names,str_all] = extract_data(Filename_in)
% export to text file
writecell(str_all, Filename_out, "FileType", "text");
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [Names,str_all] = extract_data(Filename)
    fid = fopen(Filename);
    if fid == -1
        error('Could not open file %s.', Filename);
    end
    tline = fgetl(fid);
    % initialization
    k = 0;
    Name = {};
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % 1st loop to collect all the names
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    while ischar(tline)
        if contains(tline,'was in top')
            k = k+1; % count of matching lines
            % text before the phrase, with trailing blanks removed
            Name{k} = deblank(extractBefore(tline,'was in top'));
        end
        tline = fgetl(fid); % read the next line (fgetl returns -1 at end of file)
    end
    fclose(fid); % close the handle before the file is reopened below
    Names = unique(Name,'stable'); % unique names, in order of first appearance
    Names = Names';
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % 2nd loop to do the hard work
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % initialization
    k = 0;
    q = 0;
    fid = fopen(Filename);
    tline = fgetl(fid);
    str1 = [];
    str2 = [];
    Name_old ='bbb';
    row = 1;
    while ischar(tline)
        % statistics line: "<name> was in top-N ..."
        if contains(tline,'was in top')
            k = k+1; % count of statistics lines
            Name = deblank(extractBefore(tline,'was in top'));
            if k>1 && ~strcmp(Name,Name_old)
                % name changed: store the finished row (the last row is stored after the loop)
                str_all{row} = [Names{row} '..' str1 ' ' str2];
                str1 = []; % reset
                str2 = []; % reset
                row = row+1; % increment
            end
            % retrieve all numerical contents as text, in order of appearance:
            % {N, times in top-N, times higher, days, percentage, average gain}
            A = regexp(tline, '\d+(?:\.\d+)?', 'match');
            % zero-pad the count and the rounded percentage to two digits,
            % as in the output format requested in the question
            str1 = [str1 sprintf(' %s:%02d/%02d%%', A{1}, str2double(A{2}), round(str2double(A{5})))];
        end
        % choice line: "Your choice on yyyy-mm-dd: NAME(sign) ..."
        if contains(tline,'Your choice on ')
            q = q+1; % count of choice lines
            % names end in "Str" to avoid shadowing the built-in date and sign functions
            dateStr = extractBetween(tline,'Your choice on',':');
            monthStr = extractBetween(dateStr,'-','-');
            tmp = extractAfter(dateStr,'-');
            dayStr = extractAfter(tmp,'-');
            signStr = extractBetween(tline,'(',')');
            str2 = [str2 char(monthStr) char(dayStr) '(' char(signStr) ')'];
        end
        Name_old = Name; % for the check of name change (increment row index)
        tline = fgetl(fid); % read the next line
    end
    % last and final concatenation
    str_all{row} = [Names{row} '..' str1 ' ' str2];
    str_all = str_all';
    fclose(fid);
end
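
To check the two per-line extraction steps in isolation, here is a minimal sketch that runs them on two sample lines copied from the question. It is not part of the function above: it uses regexp with the 'match' and 'tokens' options rather than textscan or extractBetween, the variable names are only illustrative, and the two-digit zero padding is an assumption taken from the output format shown in the question.

```matlab
% Sketch only: sample lines are copied from the question;
% all variable names here are illustrative.
statLine   = 'TJX was in top-20 3 times and got higher 2 times within 1 day(s), 66.67%. It went 6.23% higher on average';
choiceLine = 'Your choice on 2021-03-19: TJX(-) 1599/1962';

% 'match' returns every integer or decimal in the line as a char vector.
% The non-capturing group (?:...) keeps the optional decimal part
% from being reported as a separate token.
nums = regexp(statLine, '\d+(?:\.\d+)?', 'match');
% nums is {'20','3','2','1','66.67','6.23'}

% Build "N:count/percent%" with the count and the rounded
% percentage zero-padded to two digits.
statField = sprintf('%s:%02d/%02d%%', nums{1}, ...
    str2double(nums{2}), round(str2double(nums{5})));

% 'tokens' with 'once' returns the month, the day and the sign
% character of the first (and only) match as a cell array.
tok = regexp(choiceLine, ...
    'Your choice on \d{4}-(\d{2})-(\d{2}): \w+\((.)\)', 'tokens', 'once');
choiceField = sprintf('%s%s(%s)', tok{1}, tok{2}, tok{3});

disp(statField)    % 20:03/67%
disp(choiceField)  % 0319(-)
```

If a line does not have the expected layout, nums has fewer than five elements or tok is empty, so real input should be checked (for example with numel or isempty) before indexing.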
Shiyu Yang on 23 Mar 2021
This is exactly what I needed, and it helps a lot! Thank you so much!
I didn't expect it to be this difficult, though.
You saved my day!

Release

R2020a
