Confusion regarding statistical tests for given distribution

Question

Morten Nissov on 8 Jun 2021

https://es.mathworks.com/matlabcentral/answers/850710-confusino-regarding-statistical-tests-for-given-distribution

Edited: dpb on 8 Jun 2021

temp.mat

I am trying to describe some data with a distribution but am seeing some strange behavior.

I have a dataset that I would describe as "obviously normal"; see the probplot and histfit output for that.

Some tests agree that it is normal (a result of 0 means the null hypothesis of normality is not rejected), but one does not:

>> lillietest(data)
ans =
     0
>> pd = fitdist(data, 'normal'); kstest(data, 'cdf', pd)
ans =
  logical
   0
>> pd = fitdist(data, 'normal'); chi2gof(data, 'cdf', pd)
ans =
     1

I am having a hard time understanding why the data fail the chi-square goodness-of-fit test. I have more than 7000 data points and simply cannot see how they could fail to be Gaussian.

Attached is the data.


Answer 1

John D'Errico on 8 Jun 2021

https://es.mathworks.com/matlabcentral/answers/850710-confusino-regarding-statistical-tests-for-given-distribution#answer_719755

Edited: John D'Errico on 8 Jun 2021

It looks FAIRLY normal. But you have a whole crapload of data, and with that much data a test can pick up small departures, so it needs to look more normal than that. When you have a lot of data, it had better be darned tootin normal. Said differently, what those tests did not tell you is how badly it fails. Ask for the p-value as well:

[H,P] = chi2gof(data, 'cdf', pd)
H =
     1
P =
     0.032397

The default significance level for chi2gof is 0.05, so with p = 0.032 your data was just over the line. By the way, if I look at that probplot, it has a few squiggles that apparently were just enough to push it over.
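To see how close each test is to the threshold, rather than only the 0/1 decision, the p-value can be requested from all three tests. This is a minimal sketch, assuming the Statistics and Machine Learning Toolbox is available; the synthetic `data` vector is only a stand-in for the attached temp.mat, so its numbers will differ from those quoted in this thread:

```matlab
% Stand-in sample (assumption): replace with the real data vector.
rng(0);                          % fix the seed so the run is repeatable
data = 0.1 * randn(7200, 1);

pd = fitdist(data, 'Normal');    % normal fit, parameters estimated from data

[hL, pL] = lillietest(data);             % Lilliefors test
[hK, pK] = kstest(data, 'CDF', pd);      % one-sample Kolmogorov-Smirnov test
[hC, pC] = chi2gof(data, 'CDF', pd);     % chi-square goodness-of-fit test

% h = 1 means "reject normality at the 5% level"; p shows by how much.
fprintf('lillietest: h = %d, p = %.4f\n', hL, pL);
fprintf('kstest:     h = %d, p = %.4f\n', hK, pK);
fprintf('chi2gof:    h = %d, p = %.4f\n', hC, pC);
```

A p-value just below 0.05 and one near 0.5 both collapse to a bare h, which is why the three results in the question look more contradictory than they are.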


Answer 2

dpb on 8 Jun 2021

https://es.mathworks.com/matlabcentral/answers/850710-confusino-regarding-statistical-tests-for-given-distribution#answer_719795

Edited: dpb on 8 Jun 2021

You need to examine what the function returns with its default parameters:

>> [h,p,stats]=chi2gof(data)
h =
          1.00
p =
          0.03
stats = 
  struct with fields:
    chi2stat: 13.76
          df: 6.00
       edges: [-0.47 -0.28 -0.19 -0.10 -0.01 0.08 0.17 0.26 0.36 0.45]
           O: [50.00 211.00 833.00 1856.00 2133.00 1461.00 554.00 89.00 13.00]
           E: [35.88 225.21 855.97 1818.58 2162.62 1439.93 536.40 111.61 13.80]
>>

NB: there are only 6 DOF in the output statistic: 9 bins, minus 1, minus the 2 parameters (mean and standard deviation) estimated from the data.
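The statistic and its degrees of freedom can be checked by hand from the printed counts. The following is a sketch under the assumption that the rounded `O` and `E` values shown above are close enough to the unrounded ones; the variable names are illustrative only:

```matlab
% Observed and expected bin counts as printed above (rounded to 2 decimals).
O = [50 211 833 1856 2133 1461 554 89 13];
E = [35.88 225.21 855.97 1818.58 2162.62 1439.93 536.40 111.61 13.80];

contrib  = (O - E).^2 ./ E;      % each bin's contribution to the statistic
chi2stat = sum(contrib);         % Pearson chi-square statistic

nParams = 2;                     % mean and standard deviation were estimated
df = numel(O) - 1 - nParams;     % 9 bins - 1 - 2 = 6

p = chi2cdf(chi2stat, df, 'upper');   % upper-tail probability

fprintf('chi2stat = %.2f, df = %d, p = %.3f\n', chi2stat, df, p);
disp(round(contrib, 2));         % per-bin contributions
```

With the rounded counts this reproduces a statistic of about 13.76 on 6 degrees of freedom and p of about 0.03. The first and eighth bins contribute roughly 5.6 and 4.6, which is most of the total.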

However, it does show to some degree what the normality plot shows: the LH tail is "heavy", with more observations at the lower extreme than expected (50 observed against 35.88 expected in the first bin). With the coarse binning, this is enough to reject the hypothesis at the default level of significance. Correspondingly, it is a little light on the RH end (89 against 111.61 in the eighth bin).

>> [h,p,stats]=chi2gof(data,"NBins",40)
h =
             0
p =
          0.57
stats = 
  struct with fields:
    chi2stat: 29.04
          df: 31.00
       edges: [1×35 double]
           O: [1×34 double]
           E: [1×34 double]
>> 

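Let's look at how well that guess at the bin count worked; one should have at least 5 observations in a bin (*). chi2gof pooled the 40 requested bins down to 34 to respect its default minimum expected count of 5.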

>> [min(stats.O), max(stats.O);min(stats.E), max(stats.E)]
ans =
          5.00        550.00
          5.18        559.11
>> 

As the NIST handbook notes, one of the weaknesses of the chi-square test is that there is no optimal binning algorithm, so the results can be sensitive to the choice made.
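One way to gauge that sensitivity is to repeat the test over several bin counts and compare the p-values. This is a rough sketch, assuming the Statistics and Machine Learning Toolbox; the synthetic `data` and the list of bin counts are arbitrary stand-ins, not the values behind the results above:

```matlab
% Stand-in sample (assumption): replace with the real data vector.
rng(0);
data = 0.1 * randn(7200, 1);

nbinsList = [10 20 40 80];               % requested bin counts to try
pvals = zeros(size(nbinsList));
nUsed = zeros(size(nbinsList));

for k = 1:numel(nbinsList)
    [~, pvals(k), stats] = chi2gof(data, 'NBins', nbinsList(k));
    nUsed(k) = numel(stats.O);           % bins left after pooling sparse tails
end

% One row per setting: requested bins, bins actually used, p-value.
disp(table(nbinsList(:), nUsed(:), pvals(:), ...
    'VariableNames', {'Requested', 'Used', 'pValue'}));
```

If the decision flips between settings, as it did here between the default and 40 bins, the data are close to the rejection boundary rather than clearly on one side of it.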

John D'Errico has a very valid point that with lots of data the test can reject more easily. Depending on the use of the data, these deviations from normality probably will not matter much, unless, of course, you're doing something like estimating from the tails, where a normal approximation will likely underestimate the observed frequency somewhat in the left tail and overestimate it in the right tail.
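The size of that tail effect can be read off the default-binning counts as a ratio of observed to expected frequency. This is a small sketch using the rounded `O` and `E` values printed earlier in this answer; treating bins 1 and 8 as "the tails" is an illustrative choice:

```matlab
% Observed and expected counts from the default 9-bin chi2gof output.
O = [50 211 833 1856 2133 1461 554 89 13];
E = [35.88 225.21 855.97 1818.58 2162.62 1439.93 536.40 111.61 13.80];

ratio = O ./ E;   % > 1: the normal fit underestimates the observed frequency

fprintf('bin 1 (left tail):  observed/expected = %.2f\n', ratio(1));
fprintf('bin 8 (right side): observed/expected = %.2f\n', ratio(8));
```

The ratios come out near 1.39 and 0.80, so a normal model would be off by some tens of percent in those bins while staying within a few percent in the central ones.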

(*) My memory refresher is always the NIST handbook; what it says about Chi-Square GOF is at https://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm

I tend to rely upon the Shapiro-Wilk test, which I don't believe TMW has implemented; I have a homebrew version I coded 40 years ago...

https://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm
