Cannot use 'histogram' to compute entropy

I'd like to compute the entropy of various vectors. I was going to use something like:
X = randn(1,100);
h1 = histogram(X, 'Normalization', 'Probability');
probabilities = h1.Values;
entropy = -sum(probabilities .* log2(probabilities ))
The second command however gives the error:
Undefined function 'c:\Program Files\MATLAB\R2019b\toolbox\matlab\specgraph\histogram.m' for input arguments of type 'double'.
But surely that's exactly what the standard Matlab function 'histogram' expects?! Doing a
which histogram
indeed returns
C:\Program Files\MATLAB\R2019b\toolbox\matlab\specgraph\histogram.m
which is the newest file (by modified date) from several of that name that (sadly) exist in my Matlab folder. I believe this should be the standard Matlab function 'histogram'.
If on the other hand in the above example I use 'hist' instead of 'histogram', I get the scalar value for entropy that I expect. However, I know 'hist' is not recommended, not least because with it one cannot specify the normalization type.
So, my question is: is using 'hist' for computing probabilities ok, or should I try something else to be able to use 'histogram' instead?

13 comments

Please show the output of
dbtype histogram 1:5
Question: does your code just happen to assign a value to a variable named histogram at some point?
The output is:
1 function h = histogram(varargin)
2 %HISTOGRAM Plots a histogram.
3 % HISTOGRAM(X) plots a histogram of X. HISTOGRAM uses an automatic binning
4 % algorithm that returns bins with a uniform width, chosen to cover the
5 % range of elements in X and reveal the underlying shape of the distribution.
Your intuition is right, I did end up with a variable unhappily named histogram at some point in my Workspace. I deleted that, but still the example code behaved weirdly. Now it is the histogram option that returns the scalar value expected,
h1 = histogram(X, 'Normalization', 'Probability');
probabilities = h1.Values;
entropy = -sum(probabilities .* log2(probabilities ))
entropy =
1.0409
whereas the hist output is NaN:
h1 = hist(X);
probabilities = h1;
entropy = -sum(probabilities .* log2(probabilities ))
entropy =
NaN
Bjorn Gustavsson
Bjorn Gustavsson on 9 Sep 2021
That you get a NaN in the second variant is most likely because one or more of your probabilities is zero.
z8080
z8080 on 9 Sep 2021
That is indeed the case, there's one zero in the probabilities vector. But I get this for most input X vectors, and it's strange not to be able to compute X's entropy because of that.
Walter Roberson
Walter Roberson on 9 Sep 2021
Right, you have to filter out the items with count 0.
z8080
z8080 on 9 Sep 2021
Edited: z8080 on 9 Sep 2021
By "items with count 0" you mean items of the probabilities vector that are equal to 0?
Actually, for some reason all probabilities vectors that I get include at least one 0; however, the corresponding entropy is only NaN when the 0 is mid-vector, not when the 0s are at the end (due to padding when creating matrix rows of various lengths). It's not obvious which input signals lead to which situation.
For instance this code:
signal = [2.06623794644167,1.90137810157315,5.79024624899804,-10.0872171266770,3.72029696914678,1.88888495632052,-0.899652931772044,-1.01593746059118,6.02304846348934,-1.47602178714163,-1.26585426828863,-1.23193945906322,-1.58972503058801,-1.19316666394375,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
% signal(isnan(signal)) = [];
h1 = histogram(signal, 'Normalization', 'Probability');
probabilities = h1.Values
entropy = -sum(probabilities .* log2(probabilities )) % this returns a NaN
signal = [-6.30156202700424,2.31276368504634,2.53575767657180,3.85380766080873,-0.788693245197791,-3.14860106509044,5.94561552053542,-4.88433510447800,-0.536192076206886,4.93030202134006,-4.09979662991061,-0.597866984143998,-1.21151323711087,4.31812745523269,-1.74917050858314,-1.59171474633157,-0.845918488184140,NaN,NaN,NaN,NaN]
% signal(isnan(signal)) = [];
h1 = histogram(signal, 'Normalization', 'Probability');
probabilities = h1.Values
entropy = -sum(probabilities .* log2(probabilities )) % this returns a non-NaN!
returns:
signal =
Columns 1 through 17
2.0662 1.9014 5.7902 -10.0872 3.7203 1.8889 -0.8997 -1.0159 6.0230 -1.4760 -1.2659 -1.2319 -1.5897 -1.1932 NaN NaN NaN
Columns 18 through 21
NaN NaN NaN NaN
probabilities =
0.0476 0 0.3333 0.1905 0.0952
entropy =
NaN
signal =
Columns 1 through 17
-6.3016 2.3128 2.5358 3.8538 -0.7887 -3.1486 5.9456 -4.8843 -0.5362 4.9303 -4.0998 -0.5979 -1.2115 4.3181 -1.7492 -1.5917 -0.8459
Columns 18 through 21
NaN NaN NaN NaN
probabilities =
0.0476 0.4762 0.2381 0.0476
entropy =
1.4210
Removing the NaNs from the signal vectors themselves makes no difference.
Thus, my questions:
1) I don't understand why a 0 in the middle leads to a NaN entropy, but a 0 at the end is acceptable.
2) Would it actually be correct to manually remove those mid-way 0s and compute entropy from the 'cleaned-up' probabilities vector?
Bjorn Gustavsson
Bjorn Gustavsson on 9 Sep 2021
Edited: Bjorn Gustavsson on 9 Sep 2021
It is the zeros in the probabilities that lead to NaNs - you get terms of the form 0*log(0) in the entropy calculation, which evaluates to NaN in floating point. Note that there are no zeros in the probabilities vector in your second example.
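A minimal sketch of the failure mode and the usual information-theoretic convention (terms with p = 0 are treated as contributing 0 to the entropy sum), using the probabilities from the first example above:

```matlab
% 0*log2(0) is 0*(-Inf) in IEEE floating point, which is NaN
bad = 0 * log2(0)                        % NaN

% Convention: keep only the nonzero probabilities in the sum
p = [0.0476 0 0.3333 0.1905 0.0952];     % probabilities from the first example
H = -sum(p(p > 0) .* log2(p(p > 0)))     % finite entropy, no NaN
```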
z8080
z8080 on 9 Sep 2021
Edited: z8080 on 9 Sep 2021
You're right, now I get it. But I still have no idea when a (non-padding-related) 0 appears in the probabilities vector, so back to my question 2: would it actually be correct to manually remove those mid-way 0s and compute entropy from the 'cleaned-up' probabilities?
It should be simple enough to remove those zero-probability-bins:
probs = probabilities;
entropy = -sum(probs(probs(:)>0) .* log2(probs(probs(:)>0) ))
z8080
z8080 on 9 Sep 2021
It is indeed simple enough – what I wonder is whether it's a good idea to just manually remove those zeros, as long as I don't know what causes a probability to be zero in the first place
You have a finite sample of a distribution, and you are not specifying bin edges or the number of bins.
Under those circumstances, histogram() is documented as using the data to create bins of uniform width that represent the shape of the histogram. However, there is no documentation as to the algorithm it uses to select the bin widths (number of bins), and the relevant code is inside a .p file so we cannot look at it.
So you let histogram choose uniform bins in your finite distribution of data, using an unknown algorithm to select the bin widths, and some of the bins come up zero counts.
syms N positive
p = 1/10;
thresh = 1/100;
n = solve((1-p)^N == thresh);
vpa(n)
ans =
43.708690653565665125154703017494
This calculates that if you have a bin with 10% probability, you would have to take more than 43 samples before the probability that the bin was empty dropped below 1/100. So, with finite samples, probability happens.
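The symbolic result can be sanity-checked numerically without the Symbolic Math Toolbox, since (1-p)^N = thresh solves to N = log(thresh)/log(1-p):

```matlab
% Numeric check of the symbolic solve: (1-p)^N == thresh  =>  N = log(thresh)/log(1-p)
p = 0.10;        % probability of landing in the bin on one draw
thresh = 0.01;   % tolerated chance that the bin is still empty
N = log(thresh) / log(1 - p)   % about 43.7 samples
```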
z8080
z8080 on 10 Sep 2021
Edited: z8080 on 10 Sep 2021
Thanks a lot for this excellent answer and derivation. To answer my own question, then: I guess it is acceptable to manually remove all bins with a count of 0, so that entropy can be computed from the non-0 bins. This is in fact what you had answered me from the very beginning :)
Thanks again!
Walter Roberson
Walter Roberson on 10 Sep 2021
Depending on your knowledge of the distribution, it might make sense to ask for the counts and take max(1,counts) to substitute a nominal hit for each bin, and then calculate the probabilities from that, as adjusted_counts ./ sum(adjusted_counts).
The fewer samples you have, the more that distorts the probabilities; the more samples you have, the less likely you are to need it.
But I do recommend figuring out the number of bins yourself somehow, or else you are going to continue to be at the mercy of histogram's undocumented method of selecting the number of bins.
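A sketch combining both suggestions: use histcounts (the non-plotting counterpart of histogram) with a bin count you choose yourself, and apply the max(1,counts) adjustment before normalizing so no bin has zero probability. The choice of 10 bins here is an arbitrary example, not a recommendation:

```matlab
X = randn(1, 100);
nbins = 10;                                 % bin count chosen by you, not by histogram
counts = histcounts(X, nbins);              % raw counts, no figure is created
adjusted = max(1, counts);                  % nominal hit for every bin, so no zeros
probabilities = adjusted ./ sum(adjusted);  % renormalize to sum to 1
entropy = -sum(probabilities .* log2(probabilities))
```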
