quantifying the similarity between data sets

Daniel Mella on 14 Jul 2017
Commented: Kafayat Olayinka on 29 May 2020
Hi, I implemented an algorithm that tracks a particle in space and time. I applied it to two experiments and got two data sets, A=[X,Y] and B=[X,Y], of 8399 coordinate points each. The experiments were exactly the same. I plotted A and B and there are clear differences between them, but overall the points are within similar limits. Of course, they are never going to be exactly the same due to errors in the tracking algorithm. Still, given a certain criterion, is there any method that quantifies the difference between data sets so that I can say "ok, they are close enough" or "no, there is too much difference between them"?
P.S. I attached the data set I am currently analysing. Thank you.

Answers (2)

Image Analyst on 14 Jul 2017
  2 Comments
Daniel Mella on 16 Jul 2017
Thanks for your answer.
I tried it but it is not what I am looking for. I need a way to quantify how similar or different my plots are.
I have been thinking of estimating the power spectra of A and B with the pwelch function and then calculating the cross-correlation between the spectra. I think that will give me the similarity in X and Y.
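A minimal sketch of that idea, for the X coordinate only. It assumes the Signal Processing Toolbox is available and uses synthetic random-walk tracks as stand-ins for A(:,1) and B(:,1); the names xA, xB and specSim are illustrative. Instead of a full cross-correlation it uses the zero-lag correlation coefficient between the log spectra, which is a simplification.

```matlab
% Synthetic stand-ins for the X columns of A and B (assumption: replace
% with A(:,1) and B(:,1) from the real data sets).
rng(0);                                  % reproducible example
n  = 8399;
xA = cumsum(randn(n,1));                 % random-walk track, experiment A
xB = cumsum(randn(n,1));                 % random-walk track, experiment B

% Welch power spectral density estimates with default window and overlap.
% The mean is removed first so the DC bin does not dominate.
[PA, f] = pwelch(xA - mean(xA));
[PB, ~] = pwelch(xB - mean(xB));

% Zero-lag correlation coefficient between the log spectra.
R = corrcoef(log10(PA), log10(PB));
specSim = R(1,2);                        % near 1 => similar spectral shape
fprintf('Spectral correlation (X): %.3f\n', specSim);
```

The same steps would be repeated for the Y coordinate. A high value only says the spectral shapes agree; what counts as "close enough" still has to be chosen.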
Image Analyst on 16 Jul 2017
Methods like SIFT and SURF first identify a set of "salient points" and then use point-matching algorithms to find subsets of points that align fairly well. If you don't like the ones in the Computer Vision System Toolbox, you can use some other one: https://www.google.com/#q=point+matching+algorithm
Or look into how "optical flow" (also in the CVSToolbox) works.


Star Strider on 16 Jul 2017
I can’t find anything online that addresses your problem, and there may be no consensus. Some exploration of your data reveals that the x-coordinates in both are (essentially) identically distributed, and the y-coordinates in both are (essentially) identically distributed. The x- and y-coordinates have different distributions, and none of them are normally distributed.
One approach therefore could be to do a Wilcoxon rank sum (Mann-Whitney U) test separately on the x-coordinates of the two data sets and on the y-coordinates of the two data sets. This tests the null hypothesis that the medians are the same, against the alternative hypothesis that they are different.
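AB = load('data_sets.mat');                  % attached data file
A = AB.A;                                    % columns: [X, Y]
B = AB.B;
% Rank sum test on each coordinate: p-value, decision flag h, test statistics
[p1,h1,stats1] = ranksum(A(:,1),B(:,1));     % x-coordinates
[p2,h2,stats2] = ranksum(A(:,2),B(:,2));     % y-coordinates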
These results indicate that the medians do not differ significantly for either the x- or the y-coordinates.
To demonstrate that the distributions of the x- and y-coordinates are not different would require a different test, such as a chi-square goodness-of-fit test of one x-coordinate distribution against the other, and similarly for the y-coordinates. (Use histogram or histcounts to generate the distributions.) You would have to write that code yourself, and then use the appropriate chi-square distribution function to calculate the p-values from your calculated chi-square statistics and degrees of freedom.
Since a definitive discussion on this does not seem to exist, or at least has evaded my search for it, this is the best I can come up with.
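As a complement to the median test, a two-sample Kolmogorov-Smirnov test compares the whole empirical distributions. This is a different test from the chi-square approach described above, shown only as a hedged sketch. It assumes the Statistics and Machine Learning Toolbox (kstest2) and uses synthetic stand-ins for A(:,1) and B(:,1); xA and xB are illustrative names.

```matlab
% Synthetic stand-ins for the X columns (assumption: replace with
% A(:,1) and B(:,1) from the real data sets).
rng(1);
xA = randn(8399,1);
xB = randn(8399,1);

% Two-sample Kolmogorov-Smirnov test at the default 5% level.
% hKS = 1 rejects the null hypothesis that both samples come from the
% same continuous distribution; ksStat is the maximum distance between
% the two empirical CDFs.
[hKS, pKS, ksStat] = kstest2(xA, xB);
fprintf('KS statistic = %.4f, p = %.4g, reject = %d\n', ksStat, pKS, hKS);
```

Like the rank sum test, this assumes independent samples. Consecutive points along a particle track are usually correlated, so the p-value is best read as a rough indicator. The statistic ksStat itself (between 0 and 1) can serve as a distance between the two data sets.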
  3 Comments
Star Strider on 17 Jul 2017
My pleasure.
I experimented with the chi-square idea in the interim:
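% Common bin edges (20 edges, 19 bins) spanning both data sets
Xedges = linspace(min([A(:,1);B(:,1)]),max([A(:,1);B(:,1)]), 20);
Yedges = linspace(min([A(:,2);B(:,2)]),max([A(:,2);B(:,2)]), 20);
% Histogram counts of each coordinate on the shared edges
[HXA,edgesx] = histcounts(A(:,1),Xedges);
[HXB,edgesx] = histcounts(B(:,1),Xedges);
[HYA,edgesy] = histcounts(A(:,2),Yedges);
[HYB,edgesy] = histcounts(B(:,2),Yedges);
% Relative frequencies; the sqrt(eps) offset avoids division by zero in empty bins
FXA = HXA/sum(HXA)+sqrt(eps);
FXB = HXB/sum(HXB)+sqrt(eps);
FYA = HYA/sum(HYA)+sqrt(eps);
FYB = HYB/sum(HYB)+sqrt(eps);
% Per-bin contributions (X only) and the summed chi-square statistics
QX = (FXA(:)-FXB(:)).^2./FXA(:);
Chi2_X = sum((FXA(:)-FXB(:)).^2./FXA(:));
Chi2_Y = sum((FYA(:)-FYB(:)).^2./FYA(:));
df = size(FXA(:),1)-1;                       % number of bins minus one
% chi2cdf(x,df) returns the cumulative (lower-tail) probability
P1 = chi2cdf(Chi2_X, df);
P2 = chi2cdf(Chi2_Y, df);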
I believe this is correct. I’ve not written code to calculate chi-square statistics in a while. Adding ‘sqrt(eps)’ prevents Inf and NaN values in the chi-square calculations, since some of the bins have zero values.
Unfortunately, the p-values are vanishingly small, meaning that the distributions are different (the probability of their being the same is essentially zero).
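One point worth checking in the calculation above: chi2cdf(x,df) returns the lower-tail probability, and the statistic there is built from relative frequencies rather than counts, so the values are hard to interpret as p-values. A minimal sketch of a two-sample chi-square homogeneity test on raw counts, with the upper-tail p-value, might look like the following. It assumes the Statistics and Machine Learning Toolbox and uses synthetic stand-ins for A(:,1) and B(:,1); all variable names are illustrative.

```matlab
% Synthetic stand-ins for the X columns (assumption: replace with
% A(:,1) and B(:,1) from the real data sets).
rng(2);
xA = randn(8399,1);
xB = randn(8399,1);

% Shared bin edges (19 bins) and raw counts for each sample.
edges = linspace(min([xA; xB]), max([xA; xB]), 20);
cA = histcounts(xA, edges);
cB = histcounts(xB, edges);

% Drop bins that are empty in both samples instead of adding an offset.
keep = (cA + cB) > 0;
cA = cA(keep);
cB = cB(keep);

% Expected counts under the null hypothesis of a common distribution.
NA = sum(cA);
NB = sum(cB);
pooled = (cA + cB) / (NA + NB);          % pooled bin probabilities
EA = NA * pooled;
EB = NB * pooled;

% Pearson chi-square statistic for the 2-by-k table, df = k - 1.
chi2stat = sum((cA - EA).^2 ./ EA) + sum((cB - EB).^2 ./ EB);
df = numel(cA) - 1;
pval = chi2cdf(chi2stat, df, 'upper');   % upper-tail p-value
fprintf('chi2 = %.3f, df = %d, p = %.4g\n', chi2stat, df, pval);
```

A small p-value here would suggest the two distributions differ. Sparse tail bins with very small expected counts can distort the statistic, so merging them into neighbouring bins may be preferable.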
I would be hesitant to use pwelch on random spatial data. You might want to experiment with the fft2 function instead, and the image processing functions.
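One way to experiment with fft2 is to bin the points into 2-D occupancy images first. The sketch below is only one possible reading of that suggestion: it uses synthetic point clouds as stand-ins for the real A and B, an arbitrary 64-by-64 grid, and a correlation coefficient between the magnitude spectra as the similarity measure. All of these choices are assumptions.

```matlab
% Synthetic stand-ins for the two [X,Y] data sets (assumption: replace
% with the real A and B).
rng(3);
A = randn(8399,2);
B = randn(8399,2);

% Shared 64-by-64 grid covering both data sets.
xe = linspace(min([A(:,1);B(:,1)]), max([A(:,1);B(:,1)]), 65);
ye = linspace(min([A(:,2);B(:,2)]), max([A(:,2);B(:,2)]), 65);

% Occupancy images: number of points falling in each grid cell.
IA = histcounts2(A(:,1), A(:,2), xe, ye);
IB = histcounts2(B(:,1), B(:,2), xe, ye);

% Magnitude spectra of the mean-removed images.
SA = abs(fft2(IA - mean(IA(:))));
SB = abs(fft2(IB - mean(IB(:))));

% Correlation coefficient between the two magnitude spectra.
R = corrcoef(SA(:), SB(:));
simFFT = R(1,2);
fprintf('2-D spectral correlation: %.3f\n', simFFT);
```

The occupancy images IA and IB can also be compared directly, for example with corrcoef(IA(:), IB(:)), before moving to the frequency domain.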
Yours appears to be a relatively new problem. I am not certain how to approach it, and the literature search I did turned up no relevant results.
Kafayat Olayinka on 29 May 2020
Can you show us how to plot this and what it'll look like? Thanks
