Generate hourly timeseries data from "START,END,VALUE" data

2 views (last 30 days)
Lukas
Lukas on 9 May 2022
Commented: dpb on 10 May 2022
Hello everyone,
I got the attached data file of events with the following structure:
  • START_DATE (dd.mm.yyyy)
  • START_TIME (hh:mm:ss) [24h]
  • END_DATE (dd.mm.yyyy) can be > than START_DATE (event goes form one day to another)
  • END_TIME (hh:mm:ss) [24h]
  • VALUE for the event [< 0]
I assume that the Value is equal distributed during the time. I would like the transform the Data in hourly data and have one sum of value for every hour. Maybe this makes it a little bit more clear.
The duration of the events can be very short (not less than one minute) and long (more than one day overnight).
Has anyone here any Idea how to do that with matlab? I tried to generate a timeseries with every possible hour and then sum if the Data points fits in that timeslot but this doesn’t really work out. I’m really looking forward to your answers.
Best wishes,
Lukas
  2 Comments
dpb
dpb on 9 May 2022
There is more time on an hourly basis that doesn't have an event than there is that does so your graph above is somewhat misleading -- the following is the beginning of the above idea for the first eight (8) records in the data file (I just used head(ttT) as a working dataset after creating a time table from each row by using the start/end times and duplicating the VALUE.
tT=readtable('exemple_Data_points.xlsx'); % read as a table
tmp=head(tT); % just save a few for testing ideas
tmp.RT1=tmp.START_DATE+days(tmp.START_TIME); % got to turn into full datetimes
tmp.RT2=tmp.END_DATE+days(tmp.END_TIME);
for i=1:height(tmp) % turn each record into a timetable of two times
ttT{i}=table2timetable( ...
table([tmp.RT1(i);tmp.RT2(i)],repmat(tmp.VALUE(i),2,1), ...
'VariableNames',{'Time','Value'}),'RowTimes','Time');
end
ttTT=ttT{1}; % now start with the first to combine/synchronize
for i=2:numel(ttT) % add all the others at their times; catenates VALUE horizontally
ttTT=synchronize(ttTT,ttT{i});
end
ttTT.Properties.VariableNames=compose('V%d',1:numel(ttTT)); % just some short variable names
stackedplot(ttTT,'Color','k','LineWidth',3) % show what the first eight look like
Which shows the sizable gaps between events...
So, that raises Q? of what to do in between when we retime (which does work, btw, to fill in between the start/stop times) -- the default filler is missing value or NaN which is what shows up as empty space on above plot; a zero would give a solid line at zero for each set.
Also, how much granularity is needed? The original Q? asked for hourly; that will truncate items of less length than that with the normal synchronize method I believe although the shortest in the above data is 2 hr
>> tmp.DURATION=tmp.RT2-tmp.RT1
tmp =
8×8 table
START_DATE START_TIME END_DATE END_TIME VALUE RT1 RT2 DURATION
___________ __________ ___________ ________ _____ ____________________ ____________________ ________
02-Apr-2013 0.54167 02-Apr-2013 0.66667 800 02-Apr-2013 13:00:00 02-Apr-2013 16:00:00 03:00:00
04-Apr-2013 0.10417 04-Apr-2013 0.33333 485 04-Apr-2013 02:30:00 04-Apr-2013 08:00:00 05:30:00
04-Apr-2013 0.10417 04-Apr-2013 0.33333 495 04-Apr-2013 02:30:00 04-Apr-2013 08:00:00 05:30:00
05-Apr-2013 0.5625 05-Apr-2013 0.80208 288 05-Apr-2013 13:30:00 05-Apr-2013 19:15:00 05:45:00
06-Apr-2013 0.083333 06-Apr-2013 0.33333 475 06-Apr-2013 02:00:00 06-Apr-2013 08:00:00 06:00:00
06-Apr-2013 0.11458 06-Apr-2013 0.25 325 06-Apr-2013 02:45:00 06-Apr-2013 06:00:00 03:15:00
06-Apr-2013 0.25 06-Apr-2013 0.33333 100 06-Apr-2013 06:00:00 06-Apr-2013 08:00:00 02:00:00
07-Apr-2013 0.072917 07-Apr-2013 0.20833 16 07-Apr-2013 01:45:00 07-Apr-2013 05:00:00 03:15:00
>>
is the created initial table from which the timetables were created.

Sign in to comment.

Accepted Answer

dpb
dpb on 9 May 2022
Well, in the end it turned out not so bad after all -- and, somewhat surprisingly, it didn't seen to be too bad on performance...I thought it might really bog down but didn't seem to...
Here's the script I developed; as always, "salt to suit!"
tT=readtable('exemple_Data_points.xlsx'); % read as a table
tT.RT1=tT.START_DATE+days(tT.START_TIME); % got to turn into full datetimes
tT.RT2=tT.END_DATE+days(tT.END_TIME);
for i=1:height(tT) % turn each record into a timetable of two times
ttT{i}=table2timetable( ...
table([tT.RT1(i);tT.RT2(i)],repmat(tT.VALUE(i),2,1), ...
'VariableNames',{'Time','Value'}),'RowTimes','Time');
end
ttTT=ttT{1}; % now start with the first to combine/synchronize
for i=2:numel(ttT) % add all the others at their times; catenates VALUE horizontally
ttTT=synchronize(ttTT,ttT{i});
end
ttTT.Properties.VariableNames=compose('V%d',1:numel(ttT)); % just some short variable names
ttTT=retime(ttTT,'hourly',"previous",'EndValues',0);
I did the plot then with
stairs(ttTT.Time,ttTT.Value)
ylabel('Cumulative Value')
title('Hourly Cumulative Value Events')
I didn't do exhaustive testing, but the looking I did do all looked to be reasonable things happening...I did not do the search for the short time frames to see what may have happened/been in the data overall, so that's something can look for/into.
This as you can see with the retime call above, did the fill with zeros in order to do the sums; otherwise NaN reigns supreme. nansum woud end up doing the same, however, if want to keep the missing values.
  2 Comments
dpb
dpb on 10 May 2022
I thought it might be something like that...my consulting career was supporting the US electric utilities, starting out as nuclear engineer for one of the reactor vendors...
As far as performance of this way to go at, I don't really have any further ideas other than except that it is possible to inspect individual event data by the use of the first timetable (ttT) cell array which might be of some interest in certain cases, one could avoid the memory overhead of keeping those and wrap their creation into the call to synchronize instead of having two separate loops and keeping the duplicate data.
As coded, it will end up with a column in the final timetable for every event so indeed, 50K events will become quite large and possibly end up with memory limitations as is. I didn't look into it, but with some logic one could merge the non-overlapping events into a single column after each synchronization step and remove that last column to reduce the growth somewhat. Not sure what percentage that might be overall, I didn't delve into the patterns all that much although as the first few showed, those initial 8-10 events didn't overlap at all so that would be a sizable reduction on the small sample so may be worth looking into.
Alternatively, if you were to retime at each iteration, then you could build the overall summation as you went and keep only it. That would be most efficient memory-wise, but adds the computational overhead so may not be a wise choice. Perhaps one could synchronize some number of events and then retime every 100 or 500 or so...
Good luck! -- it will be interesting to see what you get in the end, keep us posted...

Sign in to comment.

More Answers (0)

Products


Release

R2019b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by