Clean Messy and Missing Data
This example shows how to find, clean, and delete observations with missing data in a dataset array.
Load sample data.
Import the data from the spreadsheet messy.xlsx
.
messyData = dataset('XLSFile',fullfile(matlabroot,'help/toolbox/stats/examples','messy.xlsx'))
messyData = var1 var2 var3 var4 var5 'afe1' '3' 'yes' '3' 3 'egh3' '.' 'no' '7' 7 'wth4' '3' 'yes' '3' 3 'atn2' '23' 'no' '23' 23 'arg1' '5' 'yes' '5' 5 'jre3' '34.6' 'yes' '34.6' 34.6 'wen9' '234' 'yes' '234' 234 'ple2' '2' 'no' '2' 2 'dbo8' '5' 'no' '5' 5 'oii4' '5' 'yes' '5' 5 'wnk3' '245' 'yes' '245' 245 'abk6' '563' '' '563' 563 'pnj5' '463' 'no' '463' 463 'wnn3' '6' 'no' '6' 6 'oks9' '23' 'yes' '23' 23 'wba3' '' 'yes' 'NaN' 14 'pkn4' '2' 'no' '2' 2 'adw3' '22' 'no' '22' 22 'poj2' '-99' 'yes' '-99' -99 'bas8' '23' 'no' '23' 23 'gry5' 'NA' 'yes' 'NaN' 21
When you import data from a spreadsheet, dataset
reads any
variables with nonnumeric elements as a cell array of character vectors. This is
why the variable var2
is a cell array of character vectors.
When importing data from a text file, you have more flexibility to specify which
nonnumeric expressions to treat as missing using the option
TreatAsEmpty
.
There are many different missing data indicators in
messy.xlsx
, such as:
Empty cells
A period (
.
)NA
NaN
-99
Find observations with missing values.
Display the subset of observations that have at least one missing
value using ismissing
.
ix = ismissing(messyData,'NumericTreatAsMissing',-99,... 'StringTreatAsMissing',{'NaN','.','NA'}); messyData(any(ix,2),:)
ans = var1 var2 var3 var4 var5 'egh3' '.' 'no' '7' 7 'abk6' '563' '' '563' 563 'wba3' '' 'yes' 'NaN' 14 'poj2' '-99' 'yes' '-99' -99 'gry5' 'NA' 'yes' 'NaN' 21
By default, ismissing
recognizes the following
missing value indicators:
NaN
for numeric arrays''
for character arrays<undefined>
for categorical arrays
Use the NumericTreatAsMissing
and StringTreatAsMissing
options
to specify other values to treat as missing.
Convert character arrays to double arrays.
You can convert the char
variables that should
be numeric using str2double
.
messyData.var2 = str2double(messyData.var2); messyData.var4 = str2double(messyData.var4)
messyData = var1 var2 var3 var4 var5 'afe1' 3 'yes' 3 3 'egh3' NaN 'no' 7 7 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'abk6' 563 '' 563 563 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'wba3' NaN 'yes' NaN 14 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'poj2' -99 'yes' -99 -99 'bas8' 23 'no' 23 23 'gry5' NaN 'yes' NaN 21
var2
and var4
are
numeric arrays. During the conversion, str2double
replaces
the nonnumeric elements of the variables var2
and var4
with
the value NaN
. However, there are no changes to
the numeric missing value indicator, -99
.When applying the same function to many dataset array variables,
it can sometimes be more convenient to use datasetfun
.
For example, to convert both var2
and var4
to
numeric arrays simultaneously, you can use:
messyData(:,[2,4]) = datasetfun(@str2double,messyData, ... 'DataVars',[2,4],'DatasetOutput',true);
Replace missing value indicators.
Clean the data so that the missing values indicated by the code -99
have
the standard MATLAB® numeric missing value indicator, NaN
.
messyData = replaceWithMissing(messyData,'NumericValues',-99)
messyData = var1 var2 var3 var4 var5 'afe1' 3 'yes' 3 3 'egh3' NaN 'no' 7 7 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'abk6' 563 '' 563 563 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'wba3' NaN 'yes' NaN 14 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'poj2' NaN 'yes' NaN NaN 'bas8' 23 'no' 23 23 'gry5' NaN 'yes' NaN 21
Create a dataset array with complete observations.
Create a new dataset array that contains only the complete observations—those without missing data.
ix = ismissing(messyData); completeData = messyData(~any(ix,2),:)
completeData = var1 var2 var3 var4 var5 'afe1' 3 'yes' 3 3 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'bas8' 23 'no' 23 23
See Also
dataset
| ismissing
| replaceWithMissing