Cannot use 'histogram' to compute entropy

I'd like to compute the entropy of various vectors. I was going to use something like:
X = randn(1,100);
h1 = histogram(X, 'Normalization', 'Probability');
probabilities = h1.Values;
entropy = -sum(probabilities .* log2(probabilities ))
The second command however gives the error:
Undefined function 'c:\Program Files\MATLAB\R2019b\toolbox\matlab\specgraph\histogram.m' for input arguments of type 'double'.
But surely that's exactly what the standard Matlab function 'histogram' expects?! Doing a
which histogram
indeed returns
C:\Program Files\MATLAB\R2019b\toolbox\matlab\specgraph\histogram.m
which is the newest file (by modified date) from several of that name that (sadly) exist in my Matlab folder. I believe this should be the standard Matlab function 'histogram'.
If on the other hand in the above example I use 'hist' instead of 'histogram', I get the scalar value for entropy that I expect. However, I know 'hist' is not recommended, not least because with it one cannot specify the normalization type.
So, my question is: is using 'hist' for computing probabilities ok, or should I try something else to be able to use 'histogram' instead?

13 comments

Please show the output of
dbtype histogram 1:5
Question: does your code just happen to assign a value to a variable named histogram at some point?
The output is:
1 function h = histogram(varargin)
2 %HISTOGRAM Plots a histogram.
3 % HISTOGRAM(X) plots a histogram of X. HISTOGRAM uses an automatic binning
4 % algorithm that returns bins with a uniform width, chosen to cover the
5 % range of elements in X and reveal the underlying shape of the distribution.
Your intuition is right, I did end up with a variable unhappily named histogram at some point in my Workspace. I deleted that, but still the example code behaved weirdly. Now it is the histogram option that returns the scalar value expected,
h1 = histogram(X, 'Normalization', 'Probability');
probabilities = h1.Values;
entropy = -sum(probabilities .* log2(probabilities ))
entropy =
1.0409
whereas the hist output is NaN:
h1 = hist(X);
probabilities = h1;
entropy = -sum(probabilities .* log2(probabilities ))
entropy =
NaN
Bjorn Gustavsson
Bjorn Gustavsson on 9 Sep 2021
That you get a NaN in the second variant is most likely because one or more of your probabilities is zero.
z8080
z8080 on 9 Sep 2021
That is indeed the case, there's one zero in the probabilities vector. But I get this for most input X vectors, and it's strange not to be able to compute X's entropy because of that.
Walter Roberson
Walter Roberson on 9 Sep 2021
Right, you have to filter out the items with count 0.
z8080
z8080 on 9 Sep 2021
Edited: z8080 on 9 Sep 2021
By "items with count 0" you mean items of the probabilities vector that are equal to 0?
Actually, for some reason all probabilities vectors that I get include at least one 0; however, the corresponding entropy is only NaN when the 0 is mid-vector, not when the 0s are at the end (due to padding when creating matrix rows of various lengths). It's not obvious which input signals lead to which situation.
For instance this code:
signal = [2.06623794644167,1.90137810157315,5.79024624899804,-10.0872171266770,3.72029696914678,1.88888495632052,-0.899652931772044,-1.01593746059118,6.02304846348934,-1.47602178714163,-1.26585426828863,-1.23193945906322,-1.58972503058801,-1.19316666394375,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
% signal(isnan(signal)) = [];
h1 = histogram(signal, 'Normalization', 'Probability');
probabilities = h1.Values
entropy = -sum(probabilities .* log2(probabilities )) % this returns a NaN
signal = [-6.30156202700424,2.31276368504634,2.53575767657180,3.85380766080873,-0.788693245197791,-3.14860106509044,5.94561552053542,-4.88433510447800,-0.536192076206886,4.93030202134006,-4.09979662991061,-0.597866984143998,-1.21151323711087,4.31812745523269,-1.74917050858314,-1.59171474633157,-0.845918488184140,NaN,NaN,NaN,NaN]
% signal(isnan(signal)) = [];
h1 = histogram(signal, 'Normalization', 'Probability');
probabilities = h1.Values
entropy = -sum(probabilities .* log2(probabilities )) % this returns a non-NaN!
returns:
signal =
Columns 1 through 17
2.0662 1.9014 5.7902 -10.0872 3.7203 1.8889 -0.8997 -1.0159 6.0230 -1.4760 -1.2659 -1.2319 -1.5897 -1.1932 NaN NaN NaN
Columns 18 through 21
NaN NaN NaN NaN
probabilities =
0.0476 0 0.3333 0.1905 0.0952
entropy =
NaN
signal =
Columns 1 through 17
-6.3016 2.3128 2.5358 3.8538 -0.7887 -3.1486 5.9456 -4.8843 -0.5362 4.9303 -4.0998 -0.5979 -1.2115 4.3181 -1.7492 -1.5917 -0.8459
Columns 18 through 21
NaN NaN NaN NaN
probabilities =
0.0476 0.4762 0.2381 0.0476
entropy =
1.4210
Removing the NaNs from the signal vectors themselves makes no difference.
Thus, my questions:
1) I don't understand why a 0 in the middle leads to a NaN entropy, but a 0 at the end is acceptable.
2) Would it actually be correct to manually remove those mid-way 0s and compute entropy from the 'cleaned-up' probabilities vector?
Bjorn Gustavsson
Bjorn Gustavsson on 9 Sep 2021
Edited: Bjorn Gustavsson on 9 Sep 2021
It is the zeros in the probabilities that lead to NaNs - you get terms of the form 0*log(0) in the entropy calculation, which evaluates to NaN in floating point. Note that there are no zeros in the probabilities vector in your second example.
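A minimal sketch of the failure mode and the usual information-theoretic convention (terms with p = 0 are treated as contributing 0 to the entropy sum), using the probabilities from the first example above:

```matlab
% 0*log2(0) is 0*(-Inf) in IEEE floating point, which is NaN
bad = 0 * log2(0)                        % NaN

% Convention: keep only the nonzero probabilities in the sum
p = [0.0476 0 0.3333 0.1905 0.0952];     % probabilities from the first example
H = -sum(p(p > 0) .* log2(p(p > 0)))     % finite entropy, no NaN
```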
z8080
z8080 on 9 Sep 2021
Edited: z8080 on 9 Sep 2021
You're right, now I get it. But I still have no idea when a (non-padding-related) 0 appears in the probabilities vector, so back to my question 2: would it actually be correct to manually remove those mid-way 0s and compute entropy from the 'cleaned-up' probabilities?
It should be simple enough to remove those zero-probability-bins:
probs = probabilities;
entropy = -sum(probs(probs(:)>0) .* log2(probs(probs(:)>0) ))
z8080
z8080 on 9 Sep 2021
It is indeed simple enough – what I wonder is whether it's a good idea to just manually remove those zeros, as long as I don't know what causes a probability to be zero in the first place
You have a finite sample of a distribution, and you are not specifying bin edges or the number of bins.
Under those circumstances, histogram() is documented as using the data to create bins of uniform width that represent the shape of the histogram. However, there is no documentation as to the algorithm it uses to select the bin widths (number of bins), and the relevant code is inside a .p file so we cannot look at it.
So you let histogram choose uniform bins in your finite distribution of data, using an unknown algorithm to select the bin widths, and some of the bins come up zero counts.
syms N positive
p = 1/10;
thresh = 1/100;
n = solve((1-p)^N == thresh);
vpa(n)
ans =
43.708690653565665125154703017494
This calculates that if you have a bin with 10% probability, you would have to take more than 43 samples before the probability that the bin was empty dropped below 1/100. So, with finite samples, probability happens.
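The symbolic result can be sanity-checked numerically without the Symbolic Math Toolbox, since (1-p)^N = thresh solves to N = log(thresh)/log(1-p):

```matlab
% Numeric check of the symbolic solve: (1-p)^N == thresh  =>  N = log(thresh)/log(1-p)
p = 0.10;        % probability of landing in the bin on one draw
thresh = 0.01;   % tolerated chance that the bin is still empty
N = log(thresh) / log(1 - p)   % about 43.7 samples
```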
z8080
z8080 on 10 Sep 2021
Edited: z8080 on 10 Sep 2021
Thanks a lot for this excellent answer and derivation. To answer my own question, then: I guess it is acceptable to manually remove all bins with a count of 0, so that entropy can be computed from the non-0 bins. This is in fact what you had answered me from the very beginning :)
Thanks again!
Walter Roberson
Walter Roberson on 10 Sep 2021
Depending on your knowledge of the distribution, it might make sense to ask for the counts and take max(1,counts) to substitute a nominal hit for each bin, and then calculate the probabilities from that, as adjusted_counts ./ sum(adjusted_counts).
The fewer samples you have, the more that distorts the probabilities; the more samples you have, the less likely you are to need it.
But I do recommend figuring out the number of bins yourself somehow, or else you are going to continue to be at the mercy of histogram's undocumented method of selecting the number of bins.
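A sketch combining both suggestions: use histcounts (the non-plotting counterpart of histogram) with a bin count you choose yourself, and apply the max(1,counts) adjustment before normalizing so no bin has zero probability. The choice of 10 bins here is an arbitrary example, not a recommendation:

```matlab
X = randn(1, 100);
nbins = 10;                                 % bin count chosen by you, not by histogram
counts = histcounts(X, nbins);              % raw counts, no figure is created
adjusted = max(1, counts);                  % nominal hit for every bin, so no zeros
probabilities = adjusted ./ sum(adjusted);  % renormalize to sum to 1
entropy = -sum(probabilities .* log2(probabilities))
```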
