Goodness of fit for a Zipf distribution

Question

z8080 el 30 de Mzo. de 2023

0
Enlazar

Enlace directo a esta pregunta

https://es.mathworks.com/matlabcentral/answers/1938269-goodness-of-fit-for-a-zipf-distribution

Editada: Torsten el 4 de Abr. de 2023

I have several vectors of numbers and would, for each one, like to fit a Zipf distribution, and estimate the goodness of fit relative to some standard benchmark.

From here, I got the formula for the PMF/PDF of the Zipf distribution, which I used as the third argument for fit in the following example:

x = [1:6]';
y = [80 40 20 10 5 2]';
plot(x,y,'or'), hold on % plot data points
[f, gof_info] = fit(x,y,'x.^(-(rho + 1)) ./ zeta(rho + 1)'); 
plot(f,'k') % plot model curve

However, this returns

rho = 28.05 (-Inf, Inf)

and therefore no model curve is plotted, and I'm not sure how to then continue with the goodness of fit test.

My questions:

1) Assuming the Zipf formula above is correct and rho is the only model parameter, why is this estimated with infinite confidence intervals?

2) This suggests using a Kolmogorov-Smirnov test (with bootstrapping to check significance) to compare the data to the theoretical Zipf's law distribution. However, the Matlab function kstest seems to refer to normal distributions only, with no option to specify another distribution type (e.g. Zipf). Others suggest other tests such as Anderson-Darling, but that too refers only to normal distributions.

3) Which of the goodness of fit parameters in gof_info output (SSE, R square etc.) needs to be passed to the significance test (Kolmogorov-Smirnov or Anderson-Darling)?

Many thanks for any help!

2 comentarios
Mostrar NingunoOcultar Ninguno

Torsten el 30 de Mzo. de 2023

I don't understand why you use curve fitting when you have to do distribution fitting.

Read about the difference and the adequate functions:

https://de.mathworks.com/help/stats/curve-fitting-and-distribution-fitting.html

z8080 el 31 de Mzo. de 2023

Thanks for pointing this out: indeed, seems I was oblivious about the difference between those two, and it's really distribution fitting that I need, since my x axis is merely rank.

The documentation you pointed to suggests to me it's fitdist that I should use, however that doesn't have the distribution I need (Zipf) among its possible distname values. Any more advice?

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Torsten el 31 de Mzo. de 2023

1
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/1938269-goodness-of-fit-for-a-zipf-distribution#answer_1205984

Editada: Torsten el 2 de Abr. de 2023

Abrir en MATLAB Online

I set up the maximum likelihood function for your data and came up with an estimated value of s = 2.116 for the distribution parameter s of the Zeta distribution with p_s(k) = 1/k^s / zeta(s) (k=1,2,3...). I don't know if this is the discrete probability density function we are talking about or if the distribution you have in mind has finite support and is the Zipf's distribution (i.e. p_s(k) = 1/k^s / sum_{i=1}^{i=N} 1/i^s (k=1,2,...,N)):

https://en.wikipedia.org/wiki/Zeta_distribution

https://en.wikipedia.org/wiki/Zipf%27s_law

If you want the version with finite support (Zipf's distribution) (e.g. N = 6 in your case), replace

fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / zeta(smax);

by

fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / sum(1./x.^(smax));

x = [1:6]';

y = [80 40 20 10 5 2]';

X = [];

for i=1:numel(x)

X = [X,x(i)*ones(1,y(i))];

end

hold on

histogram(X,'Normalization','pdf')

%fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);

fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);

s = 1.5:0.001:3;

fs = fun(s);

[~,index] = max(fs);

smax = s(index)

smax = 2.1160

zipf = 1./x.^smax / zeta(smax);

%zipf = 1./x.^smax / sum(1./x.^(smax));

plot(x,zipf)

hold off

grid on

11 comentarios
Mostrar 9 comentarios más antiguosOcultar 9 comentarios más antiguos

Torsten el 2 de Abr. de 2023

Editada: Torsten el 2 de Abr. de 2023

1) My data points are discrete, but I'm still not sure which of the two expressions you suggested (Zipf of Zeta) I'd need to choose. Using one or other when running the code, their respective plots look very similar to me.

If all of the positive integers can be taken as values for the rank, choose zeta, if only ranks up to a maximum value can be taken, use Zipf with N being this maximal rank.

2) As I change values in my example x vector, smax stays the same regardless of how good a fit the Zipf curve is. Which variable/parameter tells me how well the Zipf distribution fits the values in x?

If you change x, you also have to change the maximum likelihood function I gave you.

There is no direct parameter that can tell you how well the fitting was done. You might want to make a graph of the empirical and the theoretical cumulative distribution functions to get at least a visual impression.

3) Can fitdist not help here?

How should fitdist help if the probability distribution you want cannot be selected ? Maybe your distribution is a special case of one of the distributions that can be chosen - I did not look into it more deeply. But fitdist does the same as I did: Define the maximum likelihood function with your data and determine the parameter(s) for which it is maximized. fitdist also does not return a parameter on how well the fitting was done, as far as I know. Maybe confidence intervals for the parameters - I'm not sure.

Torsten el 2 de Abr. de 2023

Editada: Torsten el 2 de Abr. de 2023

I think I understand how it works in principle, but I'm still not sure (other than by attempting some hacky & probably buggy code with lots of for loops) how to obtain that expression given an input vector x that will have a different number of elements each time.

Yes, make an attempt with hacky and probably buggy code. You know what function should come out from the example above. If you have problems, include your code, and I will try to unhack and debug.

I wonder if there is some other (higher-level) way of obtaining the parameters of the fitted distribution (`s`, if that's the only one).

This is the usual way. Often the logarithm is taken of the maximum likelihood function in order to make numerics easier. Look it up under Wikipedia.

Also, ultimately I need to numerically assess how well Zipf fits to each of my vectors, as those numbers are then used further downstream. Assessing goodness of fit visually via the plot for each one, as you suggested, is unfortunately not an option. s merely tells us the shape of the Zipf if I understand well, but not how well it fits the data per se.

The visualization of your empirical distribution together with the theoretical approximation should be the starting point. I understand that this cannot be used in computations, but if one has problems interpreting theoretical results, a plot often helps.

But for theoretically estimating goodness of fit for distributions I reach my statistical limits. You mentionned Anderson-Darling - this could be a point to start.

You may find the discussion and the links given here of interest for your application:

https://de.mathworks.com/matlabcentral/answers/1913185-how-to-calculate-r-2-using-1-ssr-sst-for-normal-fit-distribution?s_tid=srchtitle

z8080 el 4 de Abr. de 2023

Editada: z8080 el 4 de Abr. de 2023

Abrir en MATLAB Online

Here's my attempt so far (for some reason code highlighting doesn't work):

% Define some empirical frequency distribution

x = 1:10;

freq = 100 ./ x; % textbook zipf!

% Define the Zipf distribution

alpha = 1.5; % Shape parameter, 1.5 is apparently a good all-round value to start with

N = sum(freq); % Total number of observations

k = 1:length(x); % Rank of each observation

zipf_dist = N ./ (k.^alpha); % Compute the Zipf distribution

% Plot our empirical frequency distribution alongside the Zipf distribution

figure;

bar(x, freq); % or freq\N

hold on;

plot(x, zipf_dist, 'r--');

xlabel('Rank');

ylabel('Frequency');

legend('Observed', 'Zipf');

% Compute the goodness of fit using the chi-squared test

expected_freq = zipf_dist .* N;

chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);

dof = length(freq) - 1;

p_value = 1 - chi2cdf(chi_squared, dof);

% Display the results

fprintf('Chi-squared statistic = %.4f\n', chi_squared);

Chi-squared statistic = 170591.8168

fprintf('p-value = %.4f\n', p_value);

p-value = 0.0000

if p_value < 0.05

fprintf('Conclusion: The data is not from a Zipf distribution.\n');

else

fprintf('Conclusion: The data is from a Zipf distribution.\n');

end

Conclusion: The data is not from a Zipf distribution.

As a sanity check, I used a textbook-Zipf rank distribution, which unfortunately doesn't pass the statistical test. If that doesn't, nothing will. Clearly something is wrong, but I don't know what. Using the Kolmogorov-Smirnoff test, or the Anderson-Darling test with a custom-built (non-normal) distribution in place of the chi-squared test does not change this.

THoughts?

Torsten el 4 de Abr. de 2023

Editada: Torsten el 4 de Abr. de 2023

Abrir en MATLAB Online

Why is "freq" non-integer for x = 3,6,7,8,9 ? I thought it is the absolute frequency that x is chosen.

To compare with the zipf distribution, you must calculate the empirical pdf from your (x,freq) data. It's given as epdf = freq/sum(freq).

After this, zipf must be defined as

zipf = 1./x.^alpha / sum(1./x.^alpha);

% Define some empirical frequency distribution

x = 1:10;

freq = 100 ./ x; % textbook zipf!

epdf = freq/sum(freq);

% Define the Zipf distribution

alpha = 1.0; % Shape parameter, 1.5 is apparently a good all-round value to start with

zipf_dist = 1./x.^alpha ./ sum(1./x.^alpha); % Compute the Zipf distribution

% Plot our empirical frequency distribution alongside the Zipf distribution

figure(1);

bar(x, epdf); % or freq\N

hold on;

plot(x, zipf_dist, 'r--');

xlabel('Rank');

ylabel('Probability');

legend('Observed', 'Zipf');

hold off

fun = @(alpha)prod(1./x.^(alpha*freq))./((sum(1./x.^alpha))^(sum(freq)));

alpha = 0.5:0.001:3;

falpha = arrayfun(@(alpha)fun(alpha),alpha);

figure(2)

plot(alpha,falpha)

[~,index] = max(falpha);

alphamax = alpha(index)

alphamax = 1

% Compute the goodness of fit using the chi-squared test

expected_freq = zipf_dist .* sum(freq);

chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);

dof = length(freq) - 1;

p_value = 1 - chi2cdf(chi_squared, dof);

% Display the results

fprintf('Chi-squared statistic = %.4f\n', chi_squared);

Chi-squared statistic = 0.0000

fprintf('p-value = %.4f\n', p_value);

p-value = 1.0000

if p_value < 0.05

fprintf('Conclusion: The data is not from a Zipf distribution.\n');

else

fprintf('Conclusion: The data is from a Zipf distribution.\n');

end

Conclusion: The data is from a Zipf distribution.

Iniciar sesión para comentar.

Answer 2

Walter Roberson el 30 de Mzo. de 2023

0
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/1938269-goodness-of-fit-for-a-zipf-distribution#answer_1205209

The zeta() function is 0 for all even negative integers, so zeta(rho+1) is 0 for all odd negative integers smaller than -1. You divide by that, so your function oscillates infinitely when rho is negative.

4 comentarios
Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos

z8080 el 31 de Mzo. de 2023

Editada: z8080 el 31 de Mzo. de 2023

Thanks, but since the task I'm aiming for is fairly simple (see how well a Zipf distribution fits a bunch of numbers), surely there's a straightforward way to achieve that in Matlab? Edge cases such as the one you demonstrated suggest, if anything, that what I want cannot be achieved, which I doubt.

In sum, if restricting rho to positives doesn't get the job done, what does get it done?

Walter Roberson el 31 de Mzo. de 2023

One possibility is that the particular example numbers you posted is not well fit by a Zipf distribution.

Iniciar sesión para comentar.

Goodness of fit for a Zipf distribution

2 comentarios
Mostrar NingunoOcultar Ninguno

Respuesta aceptada

11 comentarios
Mostrar 9 comentarios más antiguosOcultar 9 comentarios más antiguos

Más respuestas (1)

4 comentarios
Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

Goodness of fit for a Zipf distribution

2 comentarios Mostrar NingunoOcultar Ninguno

Respuesta aceptada

11 comentarios Mostrar 9 comentarios más antiguosOcultar 9 comentarios más antiguos

Más respuestas (1)

4 comentarios Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

2 comentarios
Mostrar NingunoOcultar Ninguno

11 comentarios
Mostrar 9 comentarios más antiguosOcultar 9 comentarios más antiguos

4 comentarios
Mostrar 2 comentarios más antiguosOcultar 2 comentarios más antiguos