MATLAB Answers

Replacing -999 in a table to NaN regardless of the type of the column

20 views (last 30 days)
Wesso
Wesso on 29 Jan 2021
Commented: Siddharth Bhutiya on 29 Jan 2021
Hi,
I have table Final with -999 and -9999 representing missing values (Data attached). I would like to replace these -999 with NaN regardless of the type of the column (double or string).I tried
Final= convertvars(Final, @isnumeric, @no999);
function x = no999(x)
x(x==-999) = nan;
end
But still it shows -999 values. Any hint about how to do it?
  3 Comments
dpb
dpb on 29 Jan 2021
I'd strongly suggest to first clean up the categorical variables to normalize the categories, eliminating the instances of misspellings and spacing as well as inconsistency in encoding...for example, the "I don't know" category for variable org04b looks the same but
>> sort(categories(Final.org04b))
ans =
46×1 cell array
{'-999' }
{'-9999' }
{'150 million USD or above' }
{'150 million usd or above' }
{'Betweeen 1 million and 4.9 million USD' }
{'Between 1 million USD and 4.9 million USD'}
{'Between 1 million and 4 million usd' }
{'Between 1 million and 4.9 million USD' }
{'Between 1 million and 4.9 million uSD' }
{'Between 1 million and 4.9 million usd' }
{'Between 10 million and 24 million usd' }
{'Between 10 million and 24.9 million USD' }
{'Between 10 million and 24.9 million usd' }
{'Between 10 million and 249.9 million USD' }
{'Between 100 000 USD and 249 999 USD' }
{'Between 100 000 usd and 249 999 USD' }
{'Between 100 USD and 249 999 USD' }
{'Between 100 million and 149 million usd' }
{'Between 100 million and 149.9 million USD'}
{'Between 100,000 usd and 249,000 usd' }
{'Between 25 million and 49 million usd' }
{'Between 25 million and 49.9 million USD' }
{'Between 25 million and 49.9 million usd' }
{'Between 250 000 USD and 499 999 USD' }
{'Between 250 000 usd and 499 999 USD' }
{'Between 250,000 usd and 499,000 usd' }
{'Between 250,000 usd and 499,999 usd' }
{'Between 5 million and 9 million usd' }
{'Between 5 million and 9.9 million USD' }
{'Between 5 million and 9.9 million usd' }
{'Between 50 000 USD and 99 999 USD' }
{'Between 50 000 usd and 99 999 USD' }
{'Between 50 million and 99 million usd' }
{'Between 50 million and 99.9 million USD' }
{'Between 50 million and 99.9 million usd' }
{'Between 50,000 usd and 99,000 usd' }
{'Between 50,000 usd and 99,999 usd' }
{'Between 500 000 USD and 999 999 USD' }
{'Between 500 000 usd and 999 999 USD' }
{'Between 500,000 usd and 999,000 usd' }
{'Between 500,000 usd and 999,999 usd' }
{'I don't know' }
{'I don’t know' }
{'Under 50 000 USD' }
{'Under 50,000 usd' }
{'org04a' }
>> tmp=ans(42:43)
tmp =
2×1 cell array
{'I don't know'}
{'I don’t know'}
>> strcmp(tmp(1),tmp(2))
ans =
logical
0
>> [double(tmp{1});double(tmp{2})]
ans =
73 32 100 111 110 39 116 32 107 110 111 119
73 32 100 111 110 8217 116 32 107 110 111 119
>>
I'd create them first from the numerical data and have the matching to the missing value indicators taken care of there. Then the ismissing function will work.
As this is, you'll play the devil in trying to retrieve all elements of a given level because the categories aren't unique for what are supposed to be the same thing.

Sign in to comment.

Accepted Answer

Steven Lord
Steven Lord on 29 Jan 2021
Let's take some sample data.
x = randi([-10 10], 1, 10);
Change three locations to -999. I also want to make a copy of x to keep around so I can show two techniques and have the original data for comparison.
x(randperm(10, 3)) = -999
x = 1×10
-999 -999 -9 -9 4 6 8 7 0 -999
xcopy = x;
You could use logical indexing.
x(x == -999) = NaN
x = 1×10
NaN NaN -9 -9 4 6 8 7 0 NaN
You could also use standardizeMissing.
y = standardizeMissing(xcopy, -999)
y = 1×10
NaN NaN -9 -9 4 6 8 7 0 NaN
Now compare the original data and the two modified copies.
[xcopy; x; y]
ans = 3×10
-999 -999 -9 -9 4 6 8 7 0 -999 NaN NaN -9 -9 4 6 8 7 0 NaN NaN NaN -9 -9 4 6 8 7 0 NaN
  3 Comments
Siddharth Bhutiya
Siddharth Bhutiya on 29 Jan 2021
You would have to pass the indicator for all type. When you display the table a double (-999), a string ("-999") and a categorical (-999) would display the same but they are different types. So you will have to pass all three as your indicator. (String "-999" will also match the categorical so you only need "-999" for both string and categorical types)
standardizeMissing(t,{-999,"-999"})
Also note that, you cannot have a table variable that is a categorical but one of the values is a double NaN, you would have to put the missing value corresponding to categorical which is <undefined>. If you use standardizeMissing, then it will take care of this for you and fill each type with its corresponding missing value.

Sign in to comment.

More Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by