Computing average absolute time from date strings without years

Hi there, I have a very large table with a column of (12 hr) times saved as strings in a cell array, and a column of folder names also saved as strings in a cell array. So for example, say my table was called 'data', then if I type:
K>> data.folder
ans =
1150×1 cell array
{'160115/f1/f2/f3'}
I can access a string directly by typing:
K>> data.folder{1}
ans =
'160115/f1/f2/f3'
The first part of the folder string has the date in YYMMDD format. Another column in the table has the time in a string (12 hour clock):
K>> data.end_time{1}
ans =
'11:24:17 AM'
The actual table is enormous. I have some index for each row which fits into some supercategory (eg taken on some date), and I want to average the absolute time for each row in that category. How can I do this?
I have a brute force solution for the first part which works but I don't like how it's a loop. I would prefer to do it in a more compact way. Anyways, here it is:
exact_times = datetime(ramsey_data.end_time);
date = ramsey_data.folder;
for i=1:numel(exact_times) ymd = date{i};
ymd = ymd(1:6);
year = ['20' ymd(1:2)];
month = ymd(3:4);
day = ymd(5:6);
exact_times(i).Year = str2num(year);
exact_times(i).Month = str2num(month);
exact_times(i).Day = str2num(day);
end
I still need to find a way to average the resulting values in exact_times properly. I want to select only some of the rows in the table to average. If I could use accumarray I would use something like: average_time = accumarray(subs, exact_times, [], @mean); but I can't use that on datetime variables

4 comentarios

"I want to average the absolute time for each row in that category."
I'm not sure I understand what this means...can you show an example of some inputs and the desired output for the same input?
Also, it would help immensely if you would attach a (small but) representative part of the dataset that folks can play with.
It strikes me that probably the thing would be to convert the time to durations, but need more problem definition to be certain. Alternatively, parse the date from field 1 and combine with time from field 2 to get the absolute date/time for each entry.
Rearranging the input to not have combined data in the fields could make the job much simpler if you have any control at all on how this is created.
supernoob
supernoob el 21 de Mzo. de 2018
Editada: supernoob el 21 de Mzo. de 2018
Hi, I've attached an example table. To clarify: by "absolute time" I mean something like time in seconds since the bigbang; eg UNIX time. So for example I would like to compute the average absolute time for the rows in the attached table.
supernoob
supernoob el 21 de Mzo. de 2018
Editada: dpb el 21 de Mzo. de 2018
I have a brute force solution for the first part which works but I don't like how it's a loop. I would prefer to do it in a more compact way. Anyways, here it is:
exact_times = datetime(ramsey_data.end_time);
date = ramsey_data.folder;
%%%BRUTE FORCE%%%%really sorrryyyyy
for i=1:numel(exact_times)
ymd = date{i};
ymd = ymd(1:6);
year = ['20' ymd(1:2)];
month = ymd(3:4);
day = ymd(5:6);
exact_times(i).Year = str2num(year);
exact_times(i).Month = str2num(month);
exact_times(i).Day = str2num(day);
end
I still need to find a way to average the resulting values in exact_times properly. I want to select only some of the rows in the table to average. If I could use accumarray I would use something like:
average_time = accumarray(subs, exact_times, [], @mean);
but I can't use that on datetime variables
What's wrong with the posted solution? Just turn whatever is the selection criterion into a logical addressing vector. ismember or whatever works on datetime.

Iniciar sesión para comentar.

 Respuesta aceptada

It's really just three lines to do this. I've shown scalar values, but it works just as well on string or cellstr arrays:
>> date = '160115/f1/f2/f3';
>> time = '11:24:17 AM';
>> date = extractBefore(date,'/')
date =
'160115'
>> date = datetime(date,'InputFormat','yyMMdd')
date =
datetime
15-Jan-2016
>> time = timeofday(datetime(time,'Format','hh:mm:ss aa'))
time =
duration
11:24:17
>> dt = date + time
dt =
datetime
15-Jan-2016 11:24:17

11 comentarios

dpb
dpb el 23 de Mzo. de 2018
Editada: dpb el 23 de Mzo. de 2018
DOH! slap!!! :) Why didn't I just use the 'InputFormat' parameter to save the text substrings??? I'd just finished Answer'ing another respondents similar query and used it for parsing an unusual data format...senility, I guess. :(
I'd never heard of ExtractXXX family, though...that's cute.
Awesome! This is much better than my brute force approach!
Indeed although I'll note I kept pointing you this way other than the above note about the input format eliminating the substring selection.
I'm still curious about why you have time strings associated with the days and aren't apparently using those? If that's the case; I don't see so much the point in averaging N of the same value.
dpb (and anyone), extractBefore and a bunch of its siblings came along with the new string data type. There are a BUNCH of new functions to do common things with text. They appeared as part of introducing string. They also work for the most part with cellstrs, although of course adopting string for new work is a really good idea that brings a lot of benefits.
I saw that extractBefore returned a string object when subsequently playing with it here, Peter. Nice additional "syntax sugar" touch...
Thanks so much Peter, this is all really useful!
dpb
dpb el 28 de Mzo. de 2018
Editada: dpb el 29 de Mzo. de 2018
Peter, you clearly have far better grasp on what can be done and how things should be done with the new(ish) time-related objects; one subset of which I've run into a couple of times and don't find a good solution without the datenum implementation wherein one can separate off the date from the time by separating the whole number from the fraction.
AFAICT, there's no way to store just a generic time without an associated date in a datetime object; if one tries one gets the current data as well and irregardless of the presentation format, of course, the actual date is still part of the value.
An example of the issue is the plotting of combinations of times/dates as in <How plot vs season> where need just the time-of-day independent of the day itself. There was another just a short time ago that I can't remember enough context on now but same kind of issue in that the built-in date got in the way with anything I could figure out to do without reverting back to datenum as did here.
ADDENDUM Thought perhaps duration or in this case calendarDuration might work but "Nope!". They suffer similar issues; there's no way to format the duration for higher granularity than hours; PLOT() is not implemented for the calendarDuration and its output format isn't all that flexible, anyway.
In the end for the subject purpose linked to above it appears that datenum for both the date within year and the observation time vector are best handled as datenum and subsequently fixup the x-axis format with datetick despite its many warts.
Seems like a missing functionality with the new classes unless there's some other way to use them than I'm seeing.
dpb, I agree that there is a missing data type, but I'm not entirely sure you've put your finger on it. If the timestamps are all within one day, I would argue that duration is a good way to represent them. You might argue that durations are about elapsed time, and so they can be larger than 24 hours, and they have no notion of AM vs. PM. Still, if you are willing to live with a 24hr clock, and don't expect 22:00:00 + 4hr to be 02:00:00, duration does the trick
>> dt = datetime
dt =
datetime
02-Apr-2018 18:42:57
>> tod = timeofday(dt)
tod =
duration
18:42:57
If you want AM/PM, and want the time to wrap around at 24hr, one possibility is to use datetime, with a format that only shows the day. To me that seems like a mistake waiting to happen, but it could be fine in some cases.
You say, "no way to format the duration for higher granularity than hours", and I'm not sure what you mean. duration has two different kinds of display formats ("timer" and "single unit") at a variety of precisions:
>> tod.Format = 'hh:mm:ss.SSS'
tod =
duration
18:42:57.174
>> tod.Format = 'd'
tod =
duration
0.77983 days
>> tod.Format = 's'
tod =
duration
67377 sec
Are any of those what you are looking for?
It may be that what you are asking for is what might be described as "unanchored time of day", and a duration is the closest thing there is right now.
dpb
dpb el 2 de Abr. de 2018
Editada: dpb el 3 de Abr. de 2018
Thank you for commenting, Peter. I think your last phrase could be the best term we've (collectively) coined yet, agreed; it does describe the problem.
I agree if it is only the span that is significant then duration solves the problem; in the subject problem one needs both the duration and the time of day at which the event began which is, perhaps rather than an "unanchored time of day", an "anchored duration" instead.
I was trying to see if I could somehow use the builtin datetime-aware plot function to avoid the occasionally distracting side issues of the datenum/datetick pair for a newer user in preparing the desired display of the duration occurring at a particular time of day and day of year and met up with the limitation that can't be done with either the duration nor calendarDuration objects as presently implemented.
The comment about "granularity" of the duration has to do with the limited formatting options for the format string; it is one of 'y,d,h,m,s'; 'MMM' or other calendar-like abbreviations aren't allowed as it is considered a calendar-agnostic time span in which a year is the average year length rather than a specific 365/366 day year. That's ok for what it is; wasn't what was looking for here, however.
I then tried calendarDuration in that I thought perhaps if I turned the datetime time axis of time over an arbitrary year (arbitrary as the datetime time can't exist without a specific year yet the data displayed were actually from multiple years to be displayed as times within a generic calendar year)* to the calendarDuration of 1:365 days and formatted it as MMM I could get the effect without having to choose an explicit year. But, besides plot not being able to use a calendarDuration array as input, the format string won't accept anything but a single 'm' character as numeric display instead of the desired letter short month name and it won't accept just the month by itself, anyway...
>> cd=calendarDuration(0,0,1:365,'Format','m').';
Warning: The format character vector 'm' is not valid. A calendarDuration display
format must include the identifiers 'm', 'd', and 't'.
Setting format character vector to 'mdt'.
>>
So, this experiment also didn't work for multiple reasons, most fundamentally that PLOT() wouldn't let me use it as the x-axis variable anyway, regardless of the formatting.
I think this sidebar has resolved the questions I had of whether there were some clever way to utilize the existing objects I just wasn't recognizing as being "no; not as presently implemented; that takes another set of features".
Thanks again for the feedback; several times you have shown me I just wasn't being clever-enough in visualizing how to make use of what was builtin or how the functions were intended to be used-- retime in a couple of instances comes to mind as one example.
(*) Oh, one last issue I discovered regarding formatting a datetime ruler object--While I could display the data correctly on a plot using datetime for an arbitrary reference year and only display months as the tick labels, I could find no way to hide the reference year that was placed at the RHS of the time axis which had no meaning for the specific case as the data were a compilation of some ten years' observations displayed at the time of year of each series of observations within a calendar year. Hence I was eventually forced back to the venerable datenum/datetick pair to draw the desired axis without the irrelevant year being shown. <Answer_312320> illustrates that "feature".
That's a lot to digest. I confess I'm not entirely following the points you've made, so lemme take a shot at what I think you are talking about. This is kind of long, but perhaps helpful to someone.
Here's a plot of some hoked-up data that follows a diurnal pattern, over five days:
dt = datetime(2018,4,4) + days(sort(5*rand(1,100)));
y = sin(2*pi*hour(dt)/24) + .1*randn(size(dt));
plot(dt,y,'o')
If I get the time-of-day as a duration, I can plot all of the values vs. that. I can even mess with the ticks and their labels to get the look of a 12hr clock:
tod = timeofday(dt);
plot(tod,y,'o')
set(gca,'XLim',hours([0 24]),'XTick',hours(0:6:24),'XTickLabel',{'12:00am' '6:00am' '12:00pm' '6:00pm' '12:00am'});
For the purposes of a plot, that seems to capture what you are asking for. I might want color to indicate different days, but that can be done with scatter. I could also have left the automatically-chosen tick locations alone but set their tick label format to something like 'hh:mm'.
For the purposes of calculations, I could imagine wanting to have addition wrap at 0 and 24hrs. In many cases one should be doing datetime arithmetic rather than wrapped time-of-day arithmetic, but the wrapping without dates involved is what I meant by an "unanchored time of day".
I'd be interested to hear if I've just missed your point. In particular, I'm not following, "needs both the duration and the time of day at which the event began". Those seem like almost identical things that can both be represented with durations.
The analogous thing for a cyclic annual plot is a bit trickier. Here's some hoked-up data that follow an annual pattern, over five years.
dt = datetime(2018,1,1) + years(sort(5*rand(1,100)));
y = sin(2*pi*month(dt)/12) + .1*randn(size(dt));
plot(dt,y,'o')
There is no "timeofyear" function. There are however month and day functions that can return month-of-year and day-of-year. There's also the possibility of getting the "elapsed time within year" as a combination of dateshift and subtraction. The issue with plotting values vs. day-of-year or time-within-year is leap years. I don't know that there's standard way to make those plots, at least not as one continuous 1-Jan to 31-Dec time line. I'm no expert. One could imagine adding 29-Feb and 3/4ths of the data would have a gap. I've also seen plots with 12 separate day-of-month axes. For simplicity, let's stick with a plot of values vs. month-of-year.
moy = month(dt);
plot(moy,y,'o')
monthNames = month(datetime(2018,1:12,1),'shortname');
set(gca,'XTick',1:12,'XTickLabel',monthNames,'XTickLabelRotation',-45);
This plot looks a bit different than the time-of-day plot, because the x values are discrete. It's also a few more lines of code, because while duration serves pretty well as an unanchored time of day, there's no data type for unanchored months of year. So get the month number, and force the tick labels.
Again, I'd be interested to hear if I've just missed your point.
A couple of more specific responses:
1) There's no way to plot vs. calendarDurations, because in general, they can't even be ordered unless they are expressable in only one unit (months or days, usually). And if that's true, one could just as easily plot vs. the numeric values, e.g. vs. 1:3 instead of vs. caldays(1:3).
2) You are correct in saying that duration is, "a calendar-agnostic time span in which a year is the average year length rather than a specific 365/366 day year". I would just add the small clarification that "years" in the sense of durations is just so the scale of the values doesn't become completely unfamiliar for large spans. Everyone knows how long 10 years is; how long is 315569520 seconds? So something like 10.051 yrs is not something one can pin down on a calendar, it's just giving a more familiar scale.
3) You already know this, but it would not make sense to have MMM as a calendarDuration format -- calendarDurations are lengths of time. 3 months is very different than "March". Which leads to ...
4) ... What's missing in the ecosystem are unanchored time periods. "March", as opposed to "1-March-2018". I think that's what you were looking for when trying to use a calendarDuration as the x axis of a plot. Not being able to make the "reference year" disappear in the datetime plot reflects that difference. For now, I think my example above is your best bet. I'd stay away from datenums if possible.
I hope this helps clarify things, and thanks for the useful conversation.
dpb
dpb el 5 de Abr. de 2018
Editada: dpb el 6 de Abr. de 2018
You captured the gist of the issue, Peter. The annual plot on day-of-year is the one, specifically. I'm no expert in the field of the OP who asked the question, either; his proposed "solution" to the leap year problem was to shift the dates by +1 for leap year dates after February. My contention was, looking at the published paper from which he got the form of the plot he was trying to duplicate is that the plot axis is labelled as and represents calendar time and so the actual date is the calendar time for the year in which it occurs but I gave him the option to do the shift if he so chose to do so, anyway.
In the end, given the the need for the time-within-year vis-a-vis your example by month, I found the datenum representation of the integer fraction holding the day more convenient in that then could use datetick to do the labeling instead of writing the tick labels. That's a choice probably as much from being an old fogey who's well-versed with datenum and its ilk in many other incarnations in Fortran libraries, etc., as well as Matlab. As noted, I started down datetime/duration path as the query came from a new user and I really didn't want to revert to deprecated features but thought having to write the tick labels was at least as confusing if not more so for him so ended up with what I thought the lesser evil.
The only additional complication over your example is that the y-axis is also a duration rather than a single value in the subject case; I ended up drawing broken lines over the extent of each duration by introducing NaT/NaN between each observation period of a given year to control color by year.
Again, I sincerely appreciate your willingness to discourse over these issues; I gain some appreciation for the thought processes behind some of the implementations I can subsequently pass on...and, I'll continue to ponder and perhaps play with your implementation a little and see if I can make that direction more straightforward but yours with dateshift/subtraction seems inherently more awkward owing to the internal representation in datetime object not having a 1:1 correlation to DOY as does the integer portion of a datenum. Writing a "dayofyear" wrapper could be the trick; I was trying to minimize user functions for the OP in the question response as a secondary objective as well.

Iniciar sesión para comentar.

Más respuestas (2)

dpb
dpb el 21 de Mzo. de 2018
Editada: dpb el 22 de Mzo. de 2018
t=readtable('example.csv'); % read the data
t.folder=char(t.folder); % to char array
t.date=datetime(t.folder(:,1:6),'inputformat','yyMMdd'); % convert conglomeration to time
[~,~,~,h,m,s]=datevec(t.end_time); % get wanted h,m,s
[Y,M,D]=datevec(t.end_time); % and corresponding date
dt=datetime(Y,M,D,h,m,s) % build the composite date/time
dt =
10×1 datetime array
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:25:36
15-Jan-2016 11:25:36
>> mean(dt)
ans =
datetime
15-Jan-2016 11:24:32
>>
Could also parse the string data first and read the time as string and piece them together as text input to datetime. Will leave as "exercise for the student" :)
ADDENDUM
Per sidebar comments using your list of G:
>> splitapply(@mean,dt,G)
ans =
4×1 datetime array
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:25:09
>>
ADDENDUM 2
If the issue you're having is the nested cell array; there's another path besides that above avoiding the intermediary of the table --
>> c=textscan(fid,'%s%s','delimiter',',','headerlines',1);
>> times=char(c{1}); folder=char(c{2});
>> dt=datetime([folder(:,1:6) blanks(10).' times],'inputformat','yyMMdd hh:mm:ss a')
dt =
10×1 datetime array
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:24:17
15-Jan-2016 11:25:36
15-Jan-2016 11:25:36
>>
splitapply with the grouping vector as above from here.
Unfortunately, Matlab doesn't have the syntax to be able to use sub-indexing of doubly-nested cellstr arrays without the intermediary variable to dereference them to singly-nested as above; there's probably the place where you're having addressing issues. (Of course, that I inadvertently deleted the conversion step in initial posting didn't help...sorry about that :P)
You don't show how you're getting the data into the app; as above textscan has the penchant to put everything into that two-level structure that is convenient for variables returned from it being succinct but not necessarily so much when comes to using the results.

6 comentarios

supernoob
supernoob el 21 de Mzo. de 2018
Editada: supernoob el 21 de Mzo. de 2018
Thanks this is a good start, but it doesn't solve how to selectively average only some of the rows, (and the given rows that I average will change every time I run the analysis). I have a vector (subs) which has a given number for each corresponding row in 'exact_times' which groups the times so that they can be averaged. For example, with the table given above, subs could look like (1,1,1,2,3,4,1,2,4,4), I want to average all the rows in 'exact_times' with a '1' together (rows 1,2,3,7) then average the rows which correspond to the '2' together (rows 4,8), etc.
Also the version of Matlab I am using doesn't recognize t.Date (or t.date) as a valid property, so I still need to do the year, month and day separately. Finally t.folder and t.date are cell arrays so this indexing doesn't work. (I noted that in my original question). I can't 'export' a cell array to a csv however. Thanks for the first part though, it will be nice to clean up my more onerous version.
dpb
dpb el 21 de Mzo. de 2018
Editada: dpb el 21 de Mzo. de 2018
Use grouping variables and splitapply...I thought the problem was the combination of year with time.
You didn't explain how to get which given rows you want to average; the sample vector above seems totally random given the list of times. If that vector is generated by some other logic, then just use it...see amplified Answer on that assumption.
Sorry I was having trouble phrasing the question clearly, it probably wasn't clear in my original question.
dpb
dpb el 21 de Mzo. de 2018
Editada: dpb el 22 de Mzo. de 2018
So, does not the amplified Answer then solve all the issues? BTW, you seem to misinterpret what t is; it's the table object created from the .csv file; folder is the auto-generated field name based on the file header content...
>> type example.csv % the file content used...
end_time,folder
11:24:17 AM,160115/baseline1/block1/ramsey1
11:24:17 AM,160115/baseline1/block1/ramsey1
11:24:17 AM,160115/baseline1/block1/ramsey1
11:24:17 AM,160115/baseline1/block1/ramsey1
11:24:17 AM,160115/baseline1/block1/ramsey1
11:24:17 AM,160115/baseline1/block1/ramsey1
11:24:17 AM,160115/baseline1/block1/ramsey1
11:24:17 AM,160115/baseline1/block1/ramsey1
11:25:36 AM,160115/baseline1/block1/ramsey1
11:25:36 AM,160115/baseline1/block1/ramsey1
>>
>> t=readtable('example.csv') % read as table...
t =
end_time folder
___________ _________________________________
11:24:17 AM '160115/baseline1/block1/ramsey1'
11:24:17 AM '160115/baseline1/block1/ramsey1'
11:24:17 AM '160115/baseline1/block1/ramsey1'
11:24:17 AM '160115/baseline1/block1/ramsey1'
11:24:17 AM '160115/baseline1/block1/ramsey1'
11:24:17 AM '160115/baseline1/block1/ramsey1'
11:24:17 AM '160115/baseline1/block1/ramsey1'
11:24:17 AM '160115/baseline1/block1/ramsey1'
11:25:36 AM '160115/baseline1/block1/ramsey1'
11:25:36 AM '160115/baseline1/block1/ramsey1'
>>
Everything in the Answer proceeds directly from there; no magic needed; any release from R2013b on is adequate.
No it does not solve the issues. I understand t is from the table object created from the csv. The actual table in my code consists of cell arrays, so none of the indexing you have posted will work. This is noted in my original question and in a previous comment to you. The answer I posed and accepted solves all of my issues. Thank you very much for your help, I'm sorry if my question was not clear but I have solved it now and posted the answer.
dpb
dpb el 22 de Mzo. de 2018
Editada: dpb el 22 de Mzo. de 2018
Oh, I see when I pasted from workspace and cleaned up the superfluous outputs I inadvertently deleted one line..converting the cell array to char()...
t.folder=char(t.folder);
before the
t.date=...
step....

Iniciar sesión para comentar.

Ok, I've got it all working, though it isn't quite as compact as I hoped. Here is how it goes:
for i=1:numel(exact_times)
ymd = date{i};
ymd = ymd(1:6);
year = ['20' ymd(1:2)];
month = ymd(3:4);
day = ymd(5:6);
exact_times(i).Year = str2num(year);
exact_times(i).Month = str2num(month);
exact_times(i).Day = str2num(day);
end
%get the average absolute time for rows which match indices in subs
average_time = accumarray(subs, datenum(exact_times), [], @mean);
avg_data.average_time = datetime(average_time, 'ConvertFrom', 'datenum');

1 comentario

dpb
dpb el 22 de Mzo. de 2018
Editada: dpb el 22 de Mzo. de 2018
What happened to the time strings; this only looks at the dates?
Going back to your original where you show you have
K>> data.folder
ans =
1150×1 cell array
{'160115/f1/f2/f3'}
K>> data.end_time{1}
ans =
'11:24:17 AM'
then
folder=char(data.folder{:});
times=char(data.end_time{:});
would get you to the point at which I started after reading the csv file into the table I named t excepting w/ standalone variables as in the latter example instead of inside the table.

Iniciar sesión para comentar.

Categorías

Productos

Preguntada:

el 20 de Mzo. de 2018

Editada:

dpb
el 6 de Abr. de 2018

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by