Efficient ways to match observations

5 visualizaciones (últimos 30 días)
Daniel Pinto
Daniel Pinto el 1 de Abr. de 2021
Respondida: Nipun el 16 de Mayo de 2024
Suppose I have a table "FILE A" that has two variables "Name,ID, Date".
Name is the name of an individual, e.g. "John The Cowboy".
ID identifies an individual. Eg. 'AVD4345'.
Date is a datetiime field with a date. There is a data point for each ID for each month, hence, each month-year only appears once for each ID.
A given individual,with a give name, might have one ID in one year and another ID in another. IDs are not reused, though.Hence, ID "AVD4345" always refers to "John The Cowboy" but "John The Cowboy" might have other IDs for some other dates.
Then I have another table "FILE B" that has three variables "ID, Date1,Date2,VariableX".
ID is the same as above. VariableX is a variable that I want to assign to FileA from FileB (can be double, character, etc).
Date1 and Date2 are datetime variables that indicate, respectively, the beginning and end of the date range for which a given ID is active. For instance, 'AVD4345' might be active from the January 5th 1983 (Date1) until March 24th 1987 (Date2) with a given value of VariableX, and then be active from March 25th 1987 (Date1) until November 28th 2010 (Date2) with a new value of VariableX
I want to find the match of each observation in File A (an ID-date combination) in File B to retrieve the correct value of VariableX.
It can happen that there is no match because File B is incomplete, for instance.
One approach is to loop through File A, and for each ID (in each row), find all the ID's
FileA.VariableX = repmat({'[]'},size(FileA,1),1); % pre-allocates
for kk = 1 : size(FileA,1)
FileBTemp = FileB(ismember(FileB.ID,FileA.ID(kk)),:);
FileBTemp = FileBTemp(FileA.Date(kk) >= FileBTemp.Date1 & FileA.Date(kk) <= FileBTemp.Date2,:);
if size(FileBTemp,1)==1 % if match unique, then assign value
FileA.VariableX(kk) = FileBTemp.VariableX;
elseif isempty(FileBTemp) % do nothing (i.e., leave {'[]'})
else
error('error') % size(FileBTemp,1)>1 should not happen, so if it does I'd have to check what's going on in the database
end
end
This is in principle fine. However, it is not efficient for a large value of
size(FileA,1)
What is a more efficient way of doing this?
One option would be to just transform "File B" so that an ID that is valid from January 5th 1983 (Date1) until March 24th 1987 (Date2) is repeated several times from January 1983 until March 1987 for each month, and then try to apply ismember() or intersection() on ID-month. However, this generates the problem that some ID-month in File A would have more than one match in File B (when variable X changes within the month in File B). Another option that would work would be to transform file B and repeat dates on a day by day basis (e.g., from January 5th 1983 (Date1) until March 24th 1987 I would repeat that row in File B once for each day in the data range and just change the date to account for all possible days). The issue with this is that it would generate a horribly tall array. I could in principle work with tall arrays, of course. Another way is to loop through File B, which is smaller. But I am wondering whether there's an overall more straighforward way of solving this kind of issue efficiently without loops and / or tall arrays.

Respuestas (1)

Nipun
Nipun el 16 de Mayo de 2024
Hi Daniel,
I understand that you are trying to efficiently match and retrieve values for each ID-date combination in "File A" from "File B" based on overlapping date ranges, without resorting to inefficient loops or creating excessively large arrays. A straightforward and efficient approach to achieve this in MATLAB involves using table operations and logical indexing. Here is how you can do it:
Step 1: Perform an Inner Join on ID
First, merge "File A" and "File B" based on the 'ID' field. This operation aligns rows from both tables that have matching IDs.
% Assuming FileA and FileB are MATLAB tables
mergedTable = innerjoin(FileA, FileB, 'Keys', 'ID');
Step 2: Filter Rows Based on Date Range
Next, filter the merged table to keep only those rows where 'Date' from "File A" falls within the 'Date1' and 'Date2' range from "File B".
% Logical indexing to filter rows
isWithinRange = mergedTable.Date >= mergedTable.Date1 & mergedTable.Date <= mergedTable.Date2;
filteredTable = mergedTable(isWithinRange, :);
Step 3: Extract and Assign 'VariableX' to FileA
For each ID-date combination in "File A" that matches the criteria, assign the corresponding 'VariableX' value from "File B". Assuming each ID-date combination in "File A" uniquely matches a single row within the date range in "File B", you can directly assign 'VariableX'.
% Assuming 'VariableX' needs to be added to FileA and is initially not present
FileA.VariableX = NaN(size(FileA, 1), 1); % Initialize VariableX in FileA with NaN or another placeholder
% Map 'VariableX' from filteredTable back to FileA
% This step assumes you have a way to uniquely identify each row in FileA to a row in filteredTable, possibly via an index or by ensuring the rows are in the same order.
% For demonstration, if rows are aligned and each ID-date from FileA has a unique match in filteredTable:
FileA.VariableX = filteredTable.VariableX; % Direct assignment if rows are aligned and uniquely matched
This approach leverages MATLAB's powerful table operations to efficiently match and retrieve the necessary data without using explicit for-loops or creating large, unwieldy arrays. Remember to adjust the final step based on how you can uniquely identify and map each row from "File A" to its corresponding row in the filtered table, which might involve additional steps if the rows are not directly aligned.
Hope this helps.
Regards,
Nipun

Categorías

Más información sobre Variables en Help Center y File Exchange.

Etiquetas

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by