**You are now following this question**

- You will see updates in your followed content feed.
- You may receive emails, depending on your communication preferences.

# AUC between different size curves

4 views (last 30 days)

Show older comments

How can I calculate the area between 2 curves of unequal data size, as in the photo?

##### 0 Comments

### Accepted Answer

Julius Muschaweck
on 7 Sep 2021

I would use interp1 to interpolate both functions so they have the same high resolution x support. You can easily replace the 0.01 step by a 1e-6 step size in the code below.

Look for interp1 options, to control the interpolation algorithm (you might prefer 'linear' to avoid overshoots), and to control extrapolation (in your probability distribution case, you might want to set the value for extrapolation to 1 if one of your distributions has shorter x range).

Then use trapz to compute the signed integral of the difference or unsigned area between the curves. See code.

x1 = 0:0.2:1.4;

y1 = x1.^2;

x2 = 0:0.14:1.4;

y2 = sqrt(x2);

figure();

hold on;

plot(x1,y1,'-x');

plot(x2,y2,'-x');

xq = 0:0.01:1.4;

y1q = interp1(x1,y1,xq);

y2q = interp1(x2,y2,xq);

my_signed_integral = trapz(xq,y2q-y1q)

my_unsigned_area = trapz(xq,abs(y2q-y1q))

test_signed = trapz(xq,sqrt(xq) - xq.^2)

test_unsigned= trapz(xq,abs(sqrt(xq) - xq.^2))

##### 11 Comments

Suzuki
on 8 Sep 2021

Thanks for the quick reply.

So, I replaced xq and used --> xq = linspace(min([x1';y1']), max([x2';y2']))';

and then used linear and extrap with interp1, seems to be working.

But the final area that I need is "my_signed_integral", isn't it ??

Thank you vey much.

Julius Muschaweck
on 8 Sep 2021

What you need for final area depends on what question that final area should answer.

The code approximates this underlying math: You have functions and , both defined on an interval . The signed area is just the integral of the difference between the functions,

This is very useful in physics. For example, if f is the force of someone pushing a vehicle , g is the force of friction, and x the location, then is the change of kinetic energy of the vehicle after it has been moved from a to b.

The unsigned area is the integral of the absolute value of the difference between the functions,

If you want to know how "different" two functions are, then the unsigned area might be what you want: It is the metric on , induced by the 1-norm of the vector space of absolute integrable functions on . See https://en.wikipedia.org/wiki/Lp_space.

However, your original plot image shows "cumulative distribution". If you want to measure how different the two underlying probability density distributions are (whose integrals are the cumulative distributions), you might want to look at the Kolmogorov-Smirnov test, which uses not the area between the curves, but the maximum absolute vertical distance. See https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

As an example, think of two normal probability distributions, both with mean zero, but with different standard deviations.

x = linspace(-3,3,1000);

dx = x(2) - x(1);

p = @(x,mu,sigma) 1 / (sigma * sqrt(2 * pi)) * exp( -0.5 * ((x - mu) / sigma).^2 );

p1 = p(x,0, 0.5); % mean 0, sigma 0.5

p2 = p(x,0, 0.25); % mean 0, sigma 0.25

figure();

clf; hold on;

plot(x,p1,'--r');

plot(x,p2,'--b');

% the cumulative distributions

plot(x, cumsum(p1) * dx,'r');

plot(x, cumsum(p2) * dx,'b');

Now, the "signed area" between the two solid curves is zero, whereas the unsigned area is, well, the sum of the two symmetric areas between the curves.

Suzuki
on 9 Sep 2021

I see. I understand what you describe here, however, when I applied that to my data, I got 70.406 and 461.32 for the following curves (attached photo), dose that make sense?! because the 1st area seems bigger but with numbers, it not the case.

Any advice?!

Julius Muschaweck
on 10 Sep 2021

Looking at the picture, both answers are clearly wrong (70 is too high, and 461 is way too high). It's a little hard to say more without your code and data.

However, I think the problem might be in the x range. In the right image, the dark blue curve ranges only from about 100 to 270. You need to extend the x range such that it covers the full range, in a way that the y values are zero to the left and one on the right.

Walter Roberson
on 10 Sep 2021

As a style note:

When you choose to interp1() using a fixed step size, then you run into a couple of possibilities:

- The original data is closer together than the fixed step size, and you end up effectively discarding information;
- The original data is further apart than the fixed step size, and you end up interpolating multiple times between original datapoints, leading to getting mislead by artificats created by the interpolation method

Because of this, what I tend to do is take the set of points that is the union() of the two sets of points, and interpolate from there. At each datapoint in the union that occurs exactly in of the sets, interp1() will reproduce that exact value, and each datapoint in the union that is not in a particular set, then it is because it is in the other set, and interp1() is interpolating because there is a specific reason to do so at that location.

However, in any of these approaches, you run into the possibility that the two sets of points do not completely overlap, such as if one set of data is at [1 2 3 4 5] and the other is at [1.1 2.1 3.1 4.1 5.1] then the second function has no data for the 1 from the first function, and the first function has no data for the 5.1 from the second function. You need to decide whether to restrict your interpolation to the parts they have in common, or if you wish instead to do extrapolation. You need to specifically tell interp1() that you want extrapolation for the most common interpolation method that it uses by default.

The choice of interpolation points can certainly affect the function values.

In the case where you need to find the distance enclosed by two curves, you have the difficulty that you need to figure out where the curves intersect in order to find the endpoints for interpolation. If you use all of the data for the curves, then you would be including data that is outside the enclosed area.

Julius Muschaweck
on 10 Sep 2021

Thank you Walter. For a general solution, the "interweaving" you describe is 100% the way to go. It's exactly what I did in my Spectrum library, https://uk.mathworks.com/matlabcentral/fileexchange/98044-jmo_spectrum which I use to do colorimetry with LED spectra.

However, when all you want is a quick result for such an area, and you know that the data has some discretization error anyway (and Suzuki's noisy cumulative distribution certainly looks like that), splitting two x arrays with 100 values each into, say, 10,000 intervals is getting you there in no time -- that's all I tried to suggest here.

When you're looking at the details as closely as you do, another question pops up. So we have samples of a function, at tabulated values. What exactly do these samples mean, even in the absence of noise? I tend to say "to be linearly interpolated", because that safely avoids things like negative interpolation overshoots of an inherently positive quantity, and because, frankly, I tend to not know any better. 'makima' has a lot of esthetic appeal, but do I really know anything that justifies the use of one interpolation method over another? And then, what happens if I integrate a sampled function? Assuming the sampled function is to be linearly interpolated, then the integral function consists of quadratic segments, analytically -- but do I really want to model it that way? I have used Matlab's mkpp, ppval and associated functions, analytically integrating and differentiating the piecewise polynomials, but I do that only when I know the problem at hand justifies the effort. Otherwise, 'linear' is my friend.

Walter Roberson
on 11 Sep 2021

I encounter a number of people wanting to use linear interpolation of coefficients inside ode functions to be used with variable-step solvers such as ode45() .

Mathematically that does not work: the step sizes and step weights used by Runge-Kutta methods depend upon the first two derivatives of the function being continuous.

The mathematical continuity conditions are met for ode*() variable-solver purposes if spline fitting is used instead of linear. But exactly as you note, spline fitting and other polynomial fitting can overshoot into regions known to be inappropriate, especially if sample points are close together.

(I posted a diagram a few months ago showing how that can happen, but I am not certain I could locate that posting now. If I recall correctly, the context was a surface fit that was generating some spikes near boundaries.)

I have not investigated makima interpolation; I just know the name.

Suzuki
on 11 Sep 2021

thanks for the detailed explanations. I'm not a mathematician, so it's a bit hard to fully understand, but here I attach my code and one figure data (the one with area = 461)

xq = linspace(min([Zbins';bins']), max([Zbins';bins']))'; % New ‘x’ Vector For Interpolation

% i check max and min in both and get the max and min among the 2 sets, so I thought this covers the

% whole data range

y1i = interp1(Zbins', Zcdist', xq, 'linear','extrap'); % Interpolate To ‘xq’

y2i = interp1(bins', cdist', xq, 'linear','extrap'); % Interpolate To ‘xq’

my_unsigned_area = trapz(xq,abs(y2i-y1i))

Julius Muschaweck
on 11 Sep 2021

@Walter Roberson: @Cleve Moler has written this very nice piece about makima interpolation: https://blogs.mathworks.com/cleve/2019/04/29/makima-piecewise-cubic-interpolation/

I am using analytic differentiation using the coefficients in piecewise polynomials in my spectrum library. The so-called Planck locus curve is given by the color points of blackbody radiation of varying temperature. The Planck locus curve is very smooth, but there is no analytic formula for it, and it's expensive to compute directly: A smooth polynomial fit to the Planck locus makes sense. For computing the color temperature of some white light spectrum, whose color point is near the Planck locus, we need perpendicular projection of that color point onto the Planck locus curve -- that is where differentiating the smoothly interpolated curve comes in.

If I had to implement myself the error function, as an example of a non-elementary function whose derivative is easy to compute, analytically integrating a piecewise polynomial fit of the derivative would be the first thing I'd try: constant time, fewer multiplications but still better accuracy than a single interpolating function.

Julius Muschaweck
on 11 Sep 2021

The trouble in your example is that you extrapolate y1i linearly to the right, which makes the extrapolated values grow to almost 4. This is not what you want. You have cumulative distributions, which are inherently zero on the left, all the way to minus infinity, and one to the right, all the way to plus infinity. So you should extrapolate to the right with constant values of 1, and to the left with constant values of zero.

You run into this problem no matter if you use my initial idea or Walter's actually superior suggestion of using union().

Like this:

clear;

load("bins.mat")

load("cdist.mat")

load("Zbins.mat")

load("Zcdist.mat")

%%

% I would try to control the step size in your xq array

dx = 0.01; % some small step size to decrease discretization error

xmin = min([Zbins';bins']);

xmax = max([Zbins';bins']);

nsteps = ceil((xmax-xmin) / dx) + 1;

xq = linspace(xmin, xmax, nsteps);

% but now that we're at it, let's try also what Walter Roberson suggested:

xq = union(Zbins, bins, "sorted");

% xq = linspace(min([Zbins';bins']), max([Zbins';bins']))'; % New %x Vector For Interpolation

% i check max and min in both and get the max and min among the 2 sets, so I thought this covers the

% whole data range

% extrapolation does a bad job for you here. Look at the values in y1i: They go up to almost 3.9 !

%y1i = interp1(Zbins', Zcdist', xq, 'linear','extrap'); % Interpolate To %xqT

%y2i = interp1(bins', cdist', xq, 'linear','extrap'); % Interpolate To %xq%

%my_unsigned_area = trapz(xq,abs(y2i-y1i))

% without 'extrap', interp1 "extrapolates" to NaN (not a number)

y1i = interp1(Zbins', Zcdist', xq, 'linear'); % Interpolate To %xqT

y2i = interp1(bins', cdist', xq, 'linear'); % Interpolate To %xq%

%now set "outliers" on the left to 0, and on the right to 1

% logical indexing is the non-for-loop way to do this:

idx1_left = xq < Zbins(1); % assuming Zbins and bins are strictly ascending

y1i(idx1_left) = 0;

idx1_right = xq > Zbins(end);

y1i(idx1_right) = 1;

idx2_left = xq < bins(1); % assuming Zbins and bins are strictly ascending

y2i(idx2_left) = 0;

idx2_right = xq > bins(end);

y2i(idx2_right) = 1;

% check if we caught all outliers

assert(~any(isnan(y1i)));

assert(~any(isnan(y2i)));

my_unsigned_area = trapz(xq,abs(y2i-y1i))

my_unsigned_area = 17.8458

Suzuki
on 11 Sep 2021

thank you very much for the kind and detailed explanation.

@Julius Muschaweck I really appreciate your help, I would have not done it without your generous help.

### More Answers (0)

### See Also

### Categories

### Tags

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!**An Error Occurred**

Unable to complete the action because of changes made to the page. Reload the page to see its updated state.

Select a Web Site

Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .

You can also select a web site from the following list:

## How to Get Best Site Performance

Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.

### Americas

- América Latina (Español)
- Canada (English)
- United States (English)

### Europe

- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)

- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)

### Asia Pacific

- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)