Borrar filtros
Borrar filtros

Goodness of fit for a Zipf distribution

9 visualizaciones (últimos 30 días)
z8080
z8080 el 30 de Mzo. de 2023
Editada: Torsten el 4 de Abr. de 2023
I have several vectors of numbers and would, for each one, like to fit a Zipf distribution, and estimate the goodness of fit relative to some standard benchmark.
From here, I got the formula for the PMF/PDF of the Zipf distribution, which I used as the third argument for fit in the following example:
x = [1:6]';
y = [80 40 20 10 5 2]';
plot(x,y,'or'), hold on % plot data points
[f, gof_info] = fit(x,y,'x.^(-(rho + 1)) ./ zeta(rho + 1)');
plot(f,'k') % plot model curve
However, this returns
rho = 28.05 (-Inf, Inf)
and therefore no model curve is plotted, and I'm not sure how to then continue with the goodness of fit test.
My questions:
1) Assuming the Zipf formula above is correct and rho is the only model parameter, why is this estimated with infinite confidence intervals?
2) This suggests using a Kolmogorov-Smirnov test (with bootstrapping to check significance) to compare the data to the theoretical Zipf's law distribution. However, the Matlab function kstest seems to refer to normal distributions only, with no option to specify another distribution type (e.g. Zipf). Others suggest other tests such as Anderson-Darling, but that too refers only to normal distributions.
3) Which of the goodness of fit parameters in gof_info output (SSE, R square etc.) needs to be passed to the significance test (Kolmogorov-Smirnov or Anderson-Darling)?
Many thanks for any help!
  2 comentarios
Torsten
Torsten el 30 de Mzo. de 2023
I don't understand why you use curve fitting when you have to do distribution fitting.
Read about the difference and the adequate functions:
z8080
z8080 el 31 de Mzo. de 2023
Thanks for pointing this out: indeed, seems I was oblivious about the difference between those two, and it's really distribution fitting that I need, since my x axis is merely rank.
The documentation you pointed to suggests to me it's fitdist that I should use, however that doesn't have the distribution I need (Zipf) among its possible distname values. Any more advice?

Iniciar sesión para comentar.

Respuesta aceptada

Torsten
Torsten el 31 de Mzo. de 2023
Editada: Torsten el 2 de Abr. de 2023
I set up the maximum likelihood function for your data and came up with an estimated value of s = 2.116 for the distribution parameter s of the Zeta distribution with p_s(k) = 1/k^s / zeta(s) (k=1,2,3...). I don't know if this is the discrete probability density function we are talking about or if the distribution you have in mind has finite support and is the Zipf's distribution (i.e. p_s(k) = 1/k^s / sum_{i=1}^{i=N} 1/i^s (k=1,2,...,N)):
If you want the version with finite support (Zipf's distribution) (e.g. N = 6 in your case), replace
fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / zeta(smax);
by
fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / sum(1./x.^(smax));
x = [1:6]';
y = [80 40 20 10 5 2]';
X = [];
for i=1:numel(x)
X = [X,x(i)*ones(1,y(i))];
end
hold on
histogram(X,'Normalization','pdf')
%fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
s = 1.5:0.001:3;
fs = fun(s);
[~,index] = max(fs);
smax = s(index)
smax = 2.1160
zipf = 1./x.^smax / zeta(smax);
%zipf = 1./x.^smax / sum(1./x.^(smax));
plot(x,zipf)
hold off
grid on
  11 comentarios
z8080
z8080 el 4 de Abr. de 2023
Editada: z8080 el 4 de Abr. de 2023
Here's my attempt so far (for some reason code highlighting doesn't work):
% Define some empirical frequency distribution
x = 1:10;
freq = 100 ./ x; % textbook zipf!
% Define the Zipf distribution
alpha = 1.5; % Shape parameter, 1.5 is apparently a good all-round value to start with
N = sum(freq); % Total number of observations
k = 1:length(x); % Rank of each observation
zipf_dist = N ./ (k.^alpha); % Compute the Zipf distribution
% Plot our empirical frequency distribution alongside the Zipf distribution
figure;
bar(x, freq); % or freq\N
hold on;
plot(x, zipf_dist, 'r--');
xlabel('Rank');
ylabel('Frequency');
legend('Observed', 'Zipf');
% Compute the goodness of fit using the chi-squared test
expected_freq = zipf_dist .* N;
chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);
dof = length(freq) - 1;
p_value = 1 - chi2cdf(chi_squared, dof);
% Display the results
fprintf('Chi-squared statistic = %.4f\n', chi_squared);
Chi-squared statistic = 170591.8168
fprintf('p-value = %.4f\n', p_value);
p-value = 0.0000
if p_value < 0.05
fprintf('Conclusion: The data is not from a Zipf distribution.\n');
else
fprintf('Conclusion: The data is from a Zipf distribution.\n');
end
Conclusion: The data is not from a Zipf distribution.
As a sanity check, I used a textbook-Zipf rank distribution, which unfortunately doesn't pass the statistical test. If that doesn't, nothing will. Clearly something is wrong, but I don't know what. Using the Kolmogorov-Smirnoff test, or the Anderson-Darling test with a custom-built (non-normal) distribution in place of the chi-squared test does not change this.
THoughts?
Torsten
Torsten el 4 de Abr. de 2023
Editada: Torsten el 4 de Abr. de 2023
Why is "freq" non-integer for x = 3,6,7,8,9 ? I thought it is the absolute frequency that x is chosen.
To compare with the zipf distribution, you must calculate the empirical pdf from your (x,freq) data. It's given as epdf = freq/sum(freq).
After this, zipf must be defined as
zipf = 1./x.^alpha / sum(1./x.^alpha);
% Define some empirical frequency distribution
x = 1:10;
freq = 100 ./ x; % textbook zipf!
epdf = freq/sum(freq);
% Define the Zipf distribution
alpha = 1.0; % Shape parameter, 1.5 is apparently a good all-round value to start with
zipf_dist = 1./x.^alpha ./ sum(1./x.^alpha); % Compute the Zipf distribution
% Plot our empirical frequency distribution alongside the Zipf distribution
figure(1);
bar(x, epdf); % or freq\N
hold on;
plot(x, zipf_dist, 'r--');
xlabel('Rank');
ylabel('Probability');
legend('Observed', 'Zipf');
hold off
fun = @(alpha)prod(1./x.^(alpha*freq))./((sum(1./x.^alpha))^(sum(freq)));
alpha = 0.5:0.001:3;
falpha = arrayfun(@(alpha)fun(alpha),alpha);
figure(2)
plot(alpha,falpha)
[~,index] = max(falpha);
alphamax = alpha(index)
alphamax = 1
% Compute the goodness of fit using the chi-squared test
expected_freq = zipf_dist .* sum(freq);
chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);
dof = length(freq) - 1;
p_value = 1 - chi2cdf(chi_squared, dof);
% Display the results
fprintf('Chi-squared statistic = %.4f\n', chi_squared);
Chi-squared statistic = 0.0000
fprintf('p-value = %.4f\n', p_value);
p-value = 1.0000
if p_value < 0.05
fprintf('Conclusion: The data is not from a Zipf distribution.\n');
else
fprintf('Conclusion: The data is from a Zipf distribution.\n');
end
Conclusion: The data is from a Zipf distribution.

Iniciar sesión para comentar.

Más respuestas (1)

Walter Roberson
Walter Roberson el 30 de Mzo. de 2023
The zeta() function is 0 for all even negative integers, so zeta(rho+1) is 0 for all odd negative integers smaller than -1. You divide by that, so your function oscillates infinitely when rho is negative.
  4 comentarios
z8080
z8080 el 31 de Mzo. de 2023
Editada: z8080 el 31 de Mzo. de 2023
Thanks, but since the task I'm aiming for is fairly simple (see how well a Zipf distribution fits a bunch of numbers), surely there's a straightforward way to achieve that in Matlab? Edge cases such as the one you demonstrated suggest, if anything, that what I want cannot be achieved, which I doubt.
In sum, if restricting rho to positives doesn't get the job done, what does get it done?
Walter Roberson
Walter Roberson el 31 de Mzo. de 2023
One possibility is that the particular example numbers you posted is not well fit by a Zipf distribution.

Iniciar sesión para comentar.

Categorías

Más información sobre Hypothesis Tests en Help Center y File Exchange.

Etiquetas

Productos


Versión

R2023a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by