How to extract numerical values from a text in an array or table?

43 visualizaciones (últimos 30 días)
I have a data table, with many rows of values. Some of these values contain useful numerical information, alongside text. I want to extract only the numerical information, and turn this into a new table.
The trouble is that the data is messy, and ideally I would write some expressions that neatly extract the important information, without requiring manual processing.
  1. In some columns, the values are string like '15 units' or '15-25 units'. I want to be able to create a new column with only numbers. In the former, this should obviously convert to 15. In the latter I want this to also be 15 (i.e. only extract the first number before the hyphen). Any ideas on how to do this?
  2. In other columns, I have long messy strings that sometimes contain valuable numbers, and sometimes don't. When the numbers are valuable, they are usually followed by either " or 'secs'. Is there a way to extract only the numbers in these strings that immediately precede this?
Thanks!
  4 comentarios
Walter Roberson
Walter Roberson el 12 de Nov. de 2024
Are the numbers only ever integers? If not, then are the numbers ever in floating point format? If there are numbers with abs() less than 0, then do the number start with a leading 0 before the decimal point, or do they omit the leading 0? 0.3 compared to .3
Rahul
Rahul el 13 de Nov. de 2024
Editada: Rahul el 13 de Nov. de 2024
Definitely appreciate that questions like this are difficult to solve resolutely. I can try to attach something representative if this does start to look interminable. For now:
>Are there cases where you need to extract more than one number from the same string?
Never, I only ever want a single final number.
>If not, then are there cases where the number you need to extract is not the first number in the string?
In the case of the columns that sometimes has value ranges (e.g. 15-20) then the number I need to extract is the first number in the string.
However, in the other columns, the number I want to extract is only identifiable by the fact it is followed by either a " or the word 'secs'. It is not always the first number in the string. Sometimes there is space in between the number and 'secs'.
>Are the numbers only ever integers?
Yes
>If there are numbers with abs() less than 0
There are no numbers with an absolute value less than 1 (I presume this is what you mean).

Iniciar sesión para comentar.

Respuesta aceptada

Stephen23
Stephen23 el 13 de Nov. de 2024
Editada: Stephen23 el 13 de Nov. de 2024
No need for slow superfluous CELLFUN. Simpler and more efficient:
C = {'xxx'; 'yyy 15 units'; '15-25 units'; 'zzz'};
D = str2double(regexp(C,'\d+','match','once'))
D = 4×1
NaN 15 15 NaN
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
C = {'xxx'; '20 yyy 30 secs'; '45"'; 'zzz 99'};
D = str2double(regexp(C, '\d+(?=\s*("|secs))', 'match','once'))
D = 4×1
NaN 30 45 NaN
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
  2 comentarios
Rahul
Rahul el 13 de Nov. de 2024
Thanks Stephen. This works perfectly, and helps me get started with learning regular expressions more broadly. Very helpful.
Stephen23
Stephen23 el 13 de Nov. de 2024
"...helps me get started with learning regular expressions more broadly"
There is no shortcut, you just have to read the documentation again and again and again and again... :)
If it helps:
\d+(?=\s*("|secs))
^^^ match one or more digit characters
^^^ ^ lookahead
^^^ match zero or more whitespace characters
^ ^ group
^ match literal double quote
^ or
^^^^ match literal "secs"
You might also like to download this tool:

Iniciar sesión para comentar.

Más respuestas (1)

Umar
Umar el 13 de Nov. de 2024

Hi @Rahul ,

To achieve the extraction of numerical values from your data table in MATLAB, you can utilize regular expressions along with the cellfun function for efficient processing. Below is a concise approach to handle both scenarios you described.

Extracting Numbers from '15 units' or '15-25 units': You can use the following code snippet to create a new column with only the first number:

% Sample data
data = {'15 units'; '15-25 units'; '20 units'; 'No data'};
% Extracting the first number
extractedNumbers = cellfun(@(x) regexp(x, '(\d+)', 'match', 'once'), data, 
'UniformOutput', false);
% Convert to numeric
numericValues = cellfun(@str2double, extractedNumbers);

Extracting Numbers Preceding " or 'secs": For this, you can modify the regular expression to capture numbers that precede the specified suffixes:

% Sample data
data2 = {'30 secs'; '45"'; 'No value'; '20 mins'};
% Extracting numbers before " or 'secs'
extractedSecs = cellfun(@(x) regexp(x, '(\d+)(?=\s*["'']|secs)', 'match', 'once'),     
data2, 'UniformOutput', false);
% Convert to numeric
numericSecs = cellfun(@str2double, extractedSecs);

Please see attached.

By applying these methods, you can efficiently clean your data and create new columns with the desired numerical values, minimizing manual processing.

  2 comentarios
Rahul
Rahul el 13 de Nov. de 2024
Thanks Umar. This broadly seems to work. The only problem is, that the second piece of code that you suggest does not extract the numbers that appear before the string 'secs' but only the numbers that appear before ".
You can see this in your example: numericsecs should contain the value '30', but it misses this from that cell.
Any ideas?
Umar
Umar el 13 de Nov. de 2024

Hi @Rahul,

After reviewing your comments, I comprehend your concern. In my previous code, I noticed that the original regular expression (\d+)(?=\s*["]|secs) is designed to match one or more digits (\d+) that are followed by either a double quote or the word "secs". However, the use of the alternation operator (|) within the lookahead assertion was not correctly capturing both cases. So, I adjusted the regular expression to ensure it captures numbers followed by either a double quote or the word "secs". Here is the complete and updated MATLAB code that addresses the extraction of numeric values from the specified strings:

% Sample data
data2 = {'30 secs'; '45"'; 'No value'; '20 mins'; '50 secs'};
% Extracting numbers before " or 'secs'
extractedSecs = cellfun(@(x) regexp(x, '(\d+)(?=\s*["]|(?=\s*secs))', 'match', 
'once'), data2, 'UniformOutput', false);
% Convert to numeric
numericSecs = cellfun(@str2double, extractedSecs);
% Display results
disp(numericSecs);

Please see attached.

In the above modified code, the variable data2 contains a cell array of strings, some of which include numeric values followed by "secs" or a double quote.The cellfun function applies the regexp function to each element of data2. The updated regular expression captures the desired numeric values.The extracted string values are converted to numeric format using str2double. This function will return NaN for any non-numeric strings.Finally, the results are displayed using disp, which will show the numeric values extracted from the strings.

Hope this should help resolve your problem.

Iniciar sesión para comentar.

Categorías

Más información sobre Data Type Conversion en Help Center y File Exchange.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by