AUC between different size curves

Question

Suzuki el 7 de Sept. de 2021

0
Enlazar

Enlace directo a esta pregunta

https://es.mathworks.com/matlabcentral/answers/1448174-auc-between-different-size-curves

Editada: Suzuki el 11 de Sept. de 2021

Respuesta aceptada: Julius Muschaweck

How can I calculate the area between 2 curves of unequal data size, as in the photo?

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Julius Muschaweck el 7 de Sept. de 2021

0
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/1448174-auc-between-different-size-curves#answer_782364

Abrir en MATLAB Online

I would use interp1 to interpolate both functions so they have the same high resolution x support. You can easily replace the 0.01 step by a 1e-6 step size in the code below.

Look for interp1 options, to control the interpolation algorithm (you might prefer 'linear' to avoid overshoots), and to control extrapolation (in your probability distribution case, you might want to set the value for extrapolation to 1 if one of your distributions has shorter x range).

Then use trapz to compute the signed integral of the difference or unsigned area between the curves. See code.

x1 = 0:0.2:1.4;

y1 = x1.^2;

x2 = 0:0.14:1.4;

y2 = sqrt(x2);

figure();

hold on;

plot(x1,y1,'-x');

plot(x2,y2,'-x');

xq = 0:0.01:1.4;

y1q = interp1(x1,y1,xq);

y2q = interp1(x2,y2,xq);

my_signed_integral = trapz(xq,y2q-y1q)

my_signed_integral = 0.1701

my_unsigned_area = trapz(xq,abs(y2q-y1q))

my_unsigned_area = 0.4631

test_signed = trapz(xq,sqrt(xq) - xq.^2)

test_signed = 0.1894

test_unsigned= trapz(xq,abs(sqrt(xq) - xq.^2))

test_unsigned = 0.4768

11 comentarios
Mostrar 9 comentarios más antiguosOcultar 9 comentarios más antiguos

Julius Muschaweck el 8 de Sept. de 2021

Abrir en MATLAB Online

What you need for final area depends on what question that final area should answer.

The code approximates this underlying math: You have functions

and

, both defined on an interval

. The signed area

is just the integral of the difference between the functions,

This is very useful in physics. For example, if f is the force of someone pushing a vehicle , g is the force of friction, and x the location, then

is the change of kinetic energy of the vehicle after it has been moved from a to b.

The unsigned area

is the integral of the absolute value of the difference between the functions,

If you want to know how "different" two functions are, then the unsigned area might be what you want: It is the metric on

, induced by the 1-norm of the vector space of absolute integrable functions on

. See https://en.wikipedia.org/wiki/Lp_space.

However, your original plot image shows "cumulative distribution". If you want to measure how different the two underlying probability density distributions are (whose integrals are the cumulative distributions), you might want to look at the Kolmogorov-Smirnov test, which uses not the area between the curves, but the maximum absolute vertical distance. See https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

As an example, think of two normal probability distributions, both with mean zero, but with different standard deviations.

x = linspace(-3,3,1000);

dx = x(2) - x(1);

p = @(x,mu,sigma) 1 / (sigma * sqrt(2 * pi)) * exp( -0.5 * ((x - mu) / sigma).^2 );

p1 = p(x,0, 0.5); % mean 0, sigma 0.5

p2 = p(x,0, 0.25); % mean 0, sigma 0.25

figure();

clf; hold on;

plot(x,p1,'--r');

plot(x,p2,'--b');

% the cumulative distributions

plot(x, cumsum(p1) * dx,'r');

plot(x, cumsum(p2) * dx,'b');

Now, the "signed area" between the two solid curves is zero, whereas the unsigned area is, well, the sum of the two symmetric areas between the curves.

Walter Roberson el 10 de Sept. de 2021

As a style note:

When you choose to interp1() using a fixed step size, then you run into a couple of possibilities:

The original data is closer together than the fixed step size, and you end up effectively discarding information;
The original data is further apart than the fixed step size, and you end up interpolating multiple times between original datapoints, leading to getting mislead by artificats created by the interpolation method

Because of this, what I tend to do is take the set of points that is the union() of the two sets of points, and interpolate from there. At each datapoint in the union that occurs exactly in of the sets, interp1() will reproduce that exact value, and each datapoint in the union that is not in a particular set, then it is because it is in the other set, and interp1() is interpolating because there is a specific reason to do so at that location.

However, in any of these approaches, you run into the possibility that the two sets of points do not completely overlap, such as if one set of data is at [1 2 3 4 5] and the other is at [1.1 2.1 3.1 4.1 5.1] then the second function has no data for the 1 from the first function, and the first function has no data for the 5.1 from the second function. You need to decide whether to restrict your interpolation to the parts they have in common, or if you wish instead to do extrapolation. You need to specifically tell interp1() that you want extrapolation for the most common interpolation method that it uses by default.

The choice of interpolation points can certainly affect the function values.

In the case where you need to find the distance enclosed by two curves, you have the difficulty that you need to figure out where the curves intersect in order to find the endpoints for interpolation. If you use all of the data for the curves, then you would be including data that is outside the enclosed area.

Julius Muschaweck el 10 de Sept. de 2021

Thank you Walter. For a general solution, the "interweaving" you describe is 100% the way to go. It's exactly what I did in my Spectrum library, https://uk.mathworks.com/matlabcentral/fileexchange/98044-jmo_spectrum which I use to do colorimetry with LED spectra.

However, when all you want is a quick result for such an area, and you know that the data has some discretization error anyway (and Suzuki's noisy cumulative distribution certainly looks like that), splitting two x arrays with 100 values each into, say, 10,000 intervals is getting you there in no time -- that's all I tried to suggest here.

When you're looking at the details as closely as you do, another question pops up. So we have samples of a function, at tabulated values. What exactly do these samples mean, even in the absence of noise? I tend to say "to be linearly interpolated", because that safely avoids things like negative interpolation overshoots of an inherently positive quantity, and because, frankly, I tend to not know any better. 'makima' has a lot of esthetic appeal, but do I really know anything that justifies the use of one interpolation method over another? And then, what happens if I integrate a sampled function? Assuming the sampled function is to be linearly interpolated, then the integral function consists of quadratic segments, analytically -- but do I really want to model it that way? I have used Matlab's mkpp, ppval and associated functions, analytically integrating and differentiating the piecewise polynomials, but I do that only when I know the problem at hand justifies the effort. Otherwise, 'linear' is my friend.

Julius Muschaweck el 11 de Sept. de 2021

@Walter Roberson: @Cleve Moler has written this very nice piece about makima interpolation: https://blogs.mathworks.com/cleve/2019/04/29/makima-piecewise-cubic-interpolation/

I am using analytic differentiation using the coefficients in piecewise polynomials in my spectrum library. The so-called Planck locus curve is given by the color points of blackbody radiation of varying temperature. The Planck locus curve is very smooth, but there is no analytic formula for it, and it's expensive to compute directly: A smooth polynomial fit to the Planck locus makes sense. For computing the color temperature of some white light spectrum, whose color point is near the Planck locus, we need perpendicular projection of that color point onto the Planck locus curve -- that is where differentiating the smoothly interpolated curve comes in.

If I had to implement myself the error function, as an example of a non-elementary function whose derivative is easy to compute, analytically integrating a piecewise polynomial fit of the derivative would be the first thing I'd try: constant time, fewer multiplications but still better accuracy than a single interpolating function.

Julius Muschaweck el 11 de Sept. de 2021

Abrir en MATLAB Online

@Suzuki:

The trouble in your example is that you extrapolate y1i linearly to the right, which makes the extrapolated values grow to almost 4. This is not what you want. You have cumulative distributions, which are inherently zero on the left, all the way to minus infinity, and one to the right, all the way to plus infinity. So you should extrapolate to the right with constant values of 1, and to the left with constant values of zero.

You run into this problem no matter if you use my initial idea or Walter's actually superior suggestion of using union().

Like this:

clear;
load("bins.mat")
load("cdist.mat")
load("Zbins.mat")
load("Zcdist.mat")
%%
% I would try to control the step size in your xq array
dx = 0.01; % some small step size to decrease discretization error
xmin = min([Zbins';bins']);
xmax = max([Zbins';bins']);
nsteps = ceil((xmax-xmin) / dx) + 1;
xq = linspace(xmin, xmax, nsteps);
% but now that we're at it, let's try also what Walter Roberson suggested:
xq = union(Zbins, bins, "sorted");
% xq = linspace(min([Zbins';bins']), max([Zbins';bins']))'; % New %x  Vector For Interpolation
 % i check max and min in both and get the max and min among the 2 sets, so I thought this covers the
 % whole data range
% extrapolation does a bad job for you here. Look at the values in y1i: They go up to almost 3.9 !
%y1i = interp1(Zbins', Zcdist', xq, 'linear','extrap');   % Interpolate To %xqT
%y2i = interp1(bins', cdist', xq, 'linear','extrap');   % Interpolate To %xq%
%my_unsigned_area = trapz(xq,abs(y2i-y1i))
% without 'extrap', interp1 "extrapolates" to NaN (not a number)
y1i = interp1(Zbins', Zcdist', xq, 'linear');   % Interpolate To %xqT
y2i = interp1(bins', cdist', xq, 'linear');   % Interpolate To %xq%
%now set "outliers" on the left to 0, and on the right to 1
% logical indexing is the non-for-loop way to do this:
idx1_left = xq < Zbins(1); % assuming Zbins and bins are strictly ascending
y1i(idx1_left) = 0;
idx1_right = xq > Zbins(end);
y1i(idx1_right) = 1;
idx2_left = xq < bins(1); % assuming Zbins and bins are strictly ascending
y2i(idx2_left) = 0;
idx2_right = xq > bins(end);
y2i(idx2_right) = 1;
% check if we caught all outliers
assert(~any(isnan(y1i)));
assert(~any(isnan(y2i)));
my_unsigned_area = trapz(xq,abs(y2i-y1i))
my_unsigned_area = 17.8458

Suzuki el 11 de Sept. de 2021

Editada: Suzuki el 11 de Sept. de 2021

@Julius Muschaweck @Walter Roberson

thank you very much for the kind and detailed explanation.

@Julius Muschaweck I really appreciate your help, I would have not done it without your generous help.

Iniciar sesión para comentar.

AUC between different size curves

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

11 comentarios
Mostrar 9 comentarios más antiguosOcultar 9 comentarios más antiguos

Más respuestas (0)

Ver también

Categorías

Etiquetas

Community Treasure Hunt

AUC between different size curves

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

11 comentarios Mostrar 9 comentarios más antiguosOcultar 9 comentarios más antiguos

Más respuestas (0)

Ver también

Categorías

Etiquetas

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

11 comentarios
Mostrar 9 comentarios más antiguosOcultar 9 comentarios más antiguos