Detecting gaps in time series data and replacing with NaN

Hi there! I recently asked a question about detecting gaps in time series data and producing the start and end times of the gaps. My newest task is detecting these gaps and replacing them with NaN.
The time series information is stored as a [nx1 double] array in a cell array called neptune. The corresponding data for each time point is also stored as a separate [nx1 double] array in the neptune cell array.
What I would like to do is find gaps in the time data and replace them with NaN. The time stamps are stored as, for example, 734556.750208611 (serial date numbers?), but gaps are not recorded as anything; the series simply skips to the next date for which data was recorded.
The code I have so far is:
sp=getSamplePeriod; %each instrument has a unique sampling period (in this example sp=0.25 seconds)
t=datenum(neptune.time);
t=t*86400;
idx=find(diff(t)>(2*sp))'+1 %to detect gaps greater than twice the sampling period
This gives me back (for my current example):
idx =
830 8949
This is correct so far: the gaps lie between (idx-1) and (idx). From here, however, I am stuck. I don't know how to insert NaN into these gaps, or how to put NaN in the same locations in the neptune.data [nx1 double] array. If anyone could offer any suggestions it would be very much appreciated! Additionally, I'm not sure how easy it would be to insert a NaN at every 0.25 s interval within these gaps so that the data is represented more accurately. If this is too complicated, I will be satisfied with inserting a single NaN into each gap. Thanks!
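For the simpler of the two requests (a single NaN per gap), one possible sketch is shown here. It reuses the diff test from the question; the vectors t and y and the names shift, newpos, tgap and ygap are made up for illustration and are not part of the original data.

```matlab
% Sketch only: t and y are made-up column vectors standing in for the
% time (in seconds) and data arrays; sp is the sample period in seconds.
sp = 0.25;
t = [0; 0.25; 0.5; 2.0; 2.25; 5.0];
y = [10; 11; 12; 13; 14; 15];

% Index of the first sample after each gap (same test as in the question)
idx = find(diff(t) > 2*sp) + 1;

% New position of every original sample in the lengthened output: each
% sample moves down by the number of gaps that end at or before it
shift = zeros(size(t));
shift(idx) = 1;
newpos = (1:numel(t))' + cumsum(shift);

% Pre-fill with NaN, then drop the original samples into their new slots
tgap = nan(numel(t) + numel(idx), 1);
ygap = tgap;
tgap(newpos) = t;
ygap(newpos) = y;
```

Because the same newpos is used for both vectors, the NaN markers end up in the same rows of the time and data arrays.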

Answers (3)

I assume that your measurement times are not necessarily exact, so I've built in some tolerance for error in the example below. But I'm assuming the time measurement error is smaller than the time measurement interval.
% Expected time spacing and number of data points
sp = 0.25;
n = 20;
% Expected measurement times
dntoday = datenum(date);
texact = dntoday + (0:n-1).*sp;
% Some fake measured data
maxerr = (1/24)*.5; % +/- 30 minutes
tmeasured = texact + 2*maxerr*(rand(size(texact))) - maxerr;
ismissing = rand(size(tmeasured)) > .6;
tmeasured = tmeasured(~ismissing);
ymeasured = sin(tmeasured);
%---------------
% Important part [Edited 7/22/2011 with additional comments]
%---------------
% Construct new timeseries, with NaN for missing
% Round each measured time to nearest sp
tround = round(tmeasured/sp)*sp;
% loc tells me the index of texact that matches
% each element of tround
[tf, loc] = ismember(tround, texact);
% Create a vector of NaNs the same size as texact
% for both y and t
tfinal = nan(size(texact));
yfinal = tfinal;
% Fill in the measured values where they matched an exact value
tfinal(loc) = tmeasured;
yfinal(loc) = ymeasured;

5 comments

Please forgive me if I ask a lot of silly questions. I'm very new to Matlab and have never had to manipulate data that I didn't generate myself, nor have I read much of other people's code. This makes it a challenge for me to understand exactly what your code does. I really want to understand the code I am writing, so I hope that I am not burdening you with my questions.
I understand the method you've taken to find texact (essentially taking the start date of my data and adding intervals of sp). I can also adapt the maxerr code to my specifics. However, I am confused by the next line of code, for tmeasured. Why is it "+ 2*maxerr*(rand(size(texact))) + maxerr"? Could it not simply be texact + 2*maxerr? When I try running this code with size(texact) instead of (rand(size(texact))), I get an error saying that the matrix dimensions must agree.
If I code it so that tmeasured=texact+2*maxerr; followed by ismissing=size(tmeasured)>2*sp Matlab gives me:
ismissing =
1 1
I guess I am just confused about what you are asking Matlab to do at each step. ismissing should be identifying points where the time gap is larger than the threshold, correct? What is ymeasured representing? Sorry for all these questions and if you don't have time to elaborate on the code you provided for me I thank you for the time you've already put into helping me!
Cheers
First, sorry, I had a typo in there (I meant to have - maxerr instead of + maxerr), which I've now corrected above. That little section of code has nothing to do with analyzing the data; it just constructs some fake data that may look similar to yours. The tmeasured line adds a random error between -maxerr and +maxerr to your intended measurement times. The expression "a + (b-a).*rand(n)" generates random numbers on the interval from a to b; that's all I've done here, where a is texact-maxerr and b is texact+maxerr. In the ismissing line, I'm just throwing away about 40% of the data to simulate the gaps in the timeseries.
The important part of the code above is the last part. I assume you already have tmeasured and ymeasured; that's the data in your neptune structure. You also have sp and, I assume, a start time, so you can construct texact as I did. I've added comments to the last part of the code so you can see exactly what each line does.
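As a small illustration of the "a + (b-a).*rand(n)" expression mentioned above, here is a sketch with made-up exact times. The variable names a and b and the values in texact are assumptions for this example only.

```matlab
% Sketch: uniform noise between -maxerr and +maxerr added to exact times
maxerr = (1/24)*0.5;                 % half an hour, expressed in days
texact = 0:0.25:1;                   % hypothetical exact times, in days
a = texact - maxerr;                 % lower bound for each time
b = texact + maxerr;                 % upper bound for each time

% rand(size(texact)) returns one random number per element of texact,
% so every time gets its own error. size(texact) alone is just the
% two-element vector [1 5], which is why it cannot be added to texact.
tmeasured = a + (b - a).*rand(size(texact));

% Same distribution, written the way the answer does it:
% tmeasured = texact + 2*maxerr*rand(size(texact)) - maxerr;
```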
Thanks so much for elaborating on what you did. The only problem I am running into now is that, for the last part of the code, I keep getting an error message from Matlab saying "Subscript indices must either be real positive integers or logicals".
I don't know why this message is occurring because I'm fairly certain that the values I have for tmeasured (neptune.time values) and ymeasured (neptune.dat values) are all real positive integers.
Here is the code I have written, perhaps you can see where my error is:
t=datenum(neptune.time);
t=t*86400;
sp=0.25;
startdate=datenum('22-Feb-2011 06:00:00 PM');
n=length(neptune.time);
texact=startdate+(0:n-1).*sp;
tmeasured=t;
ymeasured=neptune.dat;
tround = round(tmeasured/sp)*sp;
[tf, loc] = ismember(tround, texact);
tfinal = nan(size(texact));
yfinal = tfinal;
tfinal(loc) = tmeasured;
yfinal(loc) = ymeasured;
Thanks again for all of your help.
The error is probably occurring on the second to last line; if any of your tmeasured don't match a texact (i.e. if ~(all(tf))), then loc will contain some 0's, which can't work as indices.
I think you're mixing up units. Because you added in the
t = t*86400
line, your texact timeseries is in days, while your tmeasured timeseries is in seconds. I assume your spacing interval is in days, not seconds, so get rid of that line (or fix accordingly, if I assume incorrectly).
Also, you'll probably need a larger value of n when you construct texact. Remember that neptune.time has data gaps, so your final series will need to be longer than that. I was assuming you knew off the top of your head how many points to expect, but if you don't, this should get you close:
n = ceil((max(neptune.time) - startdate)/sp) + 1; % +1 so the last grid point reaches the final sample
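One possible way to guard against the zero indices described above is to use the tf output of ismember, so that only matched samples are assigned. This is a sketch with made-up values (startdate, tmeasured and ymeasured are hypothetical), with all times in days.

```matlab
% Sketch with made-up values; times are in days, as datenum returns them
sp = 0.25;                                        % assumed spacing, in days
startdate = 734556;                               % hypothetical start time
tmeasured = startdate + [0 0.26 0.49 1.24 3.0];   % last sample is off the grid
ymeasured = [1 2 3 4 5];
texact = startdate + (0:7).*sp;                   % grid that is too short on purpose

% Round to the nearest multiple of sp and look each value up in texact
tround = round(tmeasured/sp)*sp;
[tf, loc] = ismember(tround, texact);             % loc is 0 where tf is false

% Assign only the matched samples; unmatched ones are skipped
tfinal = nan(size(texact));
yfinal = tfinal;
tfinal(loc(tf)) = tmeasured(tf);
yfinal(loc(tf)) = ymeasured(tf);
```

Checking all(tf) afterwards shows whether any sample was skipped. Note that ismember compares floating-point values exactly; that works here because 0.25 is exactly representable in binary, but it may not for a spacing such as 0.25/86400.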
My units were definitely mixed up; that was the problem. The interval I was using is 0.25 seconds, so I needed to change everything into seconds. Thank you for all of your help and patience!


Your time stamps are always increasing and the smallest gap is sp, so it might be worth taking a different approach.
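%% Raw Data
% Example time stamps; one element of the struct array per sample
neptune = struct('time',{'20110222T180326.761'
'20110222T180327.011'
'20110222T180844.239'
'20110222T180844.444'
'20110222T180844.665'
'20110222T180944.665'});
% Convert the time stamp strings to serial date numbers (days)
Temp=datenum({neptune.time}, 'yyyymmddTHHMMSS.FFF');
% Sample period of 0.25 s expressed in days
sp=0.25/86400;
% Slot number of each sample on a regular grid starting at the first sample
TimeIndex=round((Temp-Temp(1))/sp)+1;
CompleteIndex=1:max(TimeIndex);
% Struct array with one NaN entry per slot
NewNeptune=struct('time',repmat({nan},max(TimeIndex),1));
% Logical mask of the slots that have a recorded sample
SearchIndex=ismember(CompleteIndex,TimeIndex);
% Copy the recorded time stamps into their slots; the rest stay NaN
[NewNeptune(SearchIndex).time]=neptune.time;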

9 comments

Hi again! Everything works great in this code until I reach the point where I am assigning NewNeptuneTime(SearchIndex)=neptune.time.
I get an error message telling me that "In an assignment A(I) = B, the number of elements in B and I must be the same."
As I understand it, this line should replace every index that is a member of both CompleteIndex and TimeIndex with the actual neptune.time values that I have, and the missing values will be filled with NaN, correct? I'm not sure why this isn't working or how I can fix it. Maybe you can help! You've been a lot of help already, thank you!
The problem is that in your real data, your neptune is a structure. You have neptune(1).time, neptune(2).time, etc. till neptune(6).time. In the code above though, it is neptune.time(1), neptune.time(2) till neptune.time(6). I will update the code to see if it can mimic your real data structure.
I am still having trouble with the last line of code [NewNeptune(SearchIndex).time]=neptune.time;
Matlab says there is an insufficient number of outputs from right hand side of equal sign to satisfy assignment. Could this be because the neptune.time values are not EXACTLY every 0.25 seconds? Sometimes they are every 0.249 or 0.251 seconds for example.
I'm not sure but there also seems to be a problem changing neptune.time to datenum. Using the code Temp=datenum({neptune.time}, 'yyyymmddTHHMMSS.FFF'); doesn't work so I have simply used t=datenum(neptune.time). Is this ok?
My code looks like so,
neptune = struct('time',{neptune.time}) %my neptune.time values are in serial date format I believe
t=datenum(neptune.time); %if I have {} brackets I get an error saying the input to DATENUM was not an array of strings.
t=t*86400; %changes t to seconds
sp=0.25; %sample period is every 0.25 seconds
TimeIndex=round((t-t(1))/sp)+1;
CompleteIndex=1:max(TimeIndex);
NewNeptune=struct('time',repmat({nan},max(TimeIndex),1));
SearchIndex=ismember(CompleteIndex,TimeIndex);
[NewNeptune(SearchIndex).time]=neptune.time
Is your neptune data still a 1xn structure, like you said in your last question?
neptune =
1x4 struct array with fields:
sensorID
units
sensorType
name
code
dat
time
Yes it is still the same data I was using in my previous question.
Don't use this: neptune = struct('time',{neptune.time})
It's going to over-write your variable neptune.
If your neptune looks like the result of the first struct() command, the rest of the code should work.
What is the result of the following?
class(neptune(1).time)
size(neptune)
neptune(1).time
ans =
double
ans =
1 1
ans =
7.3456
7.3456
7.3456
7.3456 etc.
[NewNeptune(SearchIndex).time]=t says that there are too many output arguments, and [NewNeptune(SearchIndex).time]=neptune.time says that there are insufficient output arguments.
You are confusing me. In one comment, you confirmed that your neptune is 1 by 4 structure. Now you said the size of neptune is 1 by 1. Tell me the data structure of neptune before you do any of the above processing, your original data structure.
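If neptune does turn out to be a 1x1 struct whose time and dat fields are numeric column vectors, as the output above suggests, the same slot-index idea could be applied to plain vectors instead of a struct array. This is only a sketch under that assumption; the sample values and the names t0, newTime and newDat are made up.

```matlab
% Sketch with made-up data: a 1x1 struct holding numeric column vectors
% (time as serial date numbers in days, dat as the matching samples)
sp = 0.25/86400;                                  % 0.25 s expressed in days
t0 = 734556.75;                                   % hypothetical start time
neptune.time = t0 + [0; 0.249; 0.501; 1.500; 1.751]/86400;
neptune.dat  = [1; 2; 3; 4; 5];

% Slot number of each sample on a regular 0.25 s grid (1-based);
% rounding absorbs small timing jitter such as 0.249 s or 0.251 s
TimeIndex = round((neptune.time - neptune.time(1))/sp) + 1;

% NaN-filled outputs covering every slot, then fill the recorded ones
newTime = nan(max(TimeIndex), 1);
newDat  = newTime;
newTime(TimeIndex) = neptune.time;
newDat(TimeIndex)  = neptune.dat;
```

With this layout every missing 0.25 s slot inside a gap holds NaN in both vectors, which is the more detailed of the two outcomes asked for in the question. It assumes no two samples round to the same slot.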


Chris Miller on 15 Sep 2011
By now you probably have a solution. I came across the same need when trying to visualize data, and have submitted my insertNaN.m function to the 'File Exchange' if you haven't built your own solution. File ID: #32897


Asked: 21 Jul 2011
