Confusion regarding statistical tests for a given distribution
I am attempting to describe some data I have with a distribution but am experiencing some strange behavior.
I have a dataset which I would describe as "obviously normal"; see the probplot and histfit output for that.
When running some tests, I can confirm it is normal as well:
>> pd = fitdist(data, 'normal'); kstest(data, 'cdf', pd)
>> pd = fitdist(data, 'normal'); chi2gof(data, 'cdf', pd)
I am just having a hard time understanding why this is failing the chi square goodness of fit test. I have >7000 data points and I simply cannot see how it can't be Gaussian.
Attached is the data.
John D'Errico on 8 Jun 2021
Edited: John D'Errico on 8 Jun 2021
It looks FAIRLY normal. But you have a whole crapload of data. It needs to look more normal than that. When you have a lot of data, it had better be darned tootin normal. Said differently, what those tests did not tell you is how badly it fails.
[H,P] = chi2gof(data, 'cdf', pd)
The default significance level for chi2gof is 0.05, so your data was just over the line. By the way, if I look at that probplot, it has a few squiggles that apparently were just enough to push it over the line.
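To see how far each test is from its threshold, you can ask both functions for the p-value and not only the 0/1 decision. The following is only a sketch: the `data` vector here is a synthetic stand-in (an assumption, since the attached file is not reproduced in this thread), and it needs the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical stand-in for the attached data (NOT the original dataset)
rng(0);
data = 0.05 + 0.12*randn(7000, 1);

pd = fitdist(data, 'normal');

% Second output is the p-value; h is 1 when p falls below the 0.05 default
[hKS,  pKS]         = kstest(data, 'CDF', pd);
[hChi, pChi, stats] = chi2gof(data, 'CDF', pd);

fprintf('kstest : h = %d, p = %.4f\n', hKS,  pKS);
fprintf('chi2gof: h = %d, p = %.4f (chi2 = %.2f, df = %d)\n', ...
        hChi, pChi, stats.chi2stat, stats.df);
```

A p-value only slightly below 0.05 is a much weaker rejection than one near zero, which the bare h = 1 does not show.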
dpb on 8 Jun 2021
Edited: dpb on 8 Jun 2021
You need to examine what the default parameters of the function return --
  struct with fields:
    edges: [-0.47 -0.28 -0.19 -0.10 -0.01 0.08 0.17 0.26 0.36 0.45]
        O: [50.00 211.00 833.00 1856.00 2133.00 1461.00 554.00 89.00 13.00]
        E: [35.88 225.21 855.97 1818.58 2162.62 1439.93 536.40 111.61 13.80]
NB: there are only 6 DOF in the output statistic --
However, it does show what the normality plot shows to some degree, the LH tail is "heavy" with more observations at the lower extreme than expected. With the coarse binning, this is enough to reject the hypothesis at the default level of significance. (And, correspondingly, it is a little light on the RH end).
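As a rough check, the statistic can be recomputed by hand from the O and E counts printed above. This is only a sketch: it takes the 6 degrees of freedom noted above as given, the variable names are mine, and it uses base MATLAB's gammainc for the upper tail of the chi-square distribution (which should agree with chi2cdf(x, df, 'upper')). Since E is printed to two decimals, the result may differ slightly from what chi2gof itself reports.

```matlab
% Observed and expected bin counts, copied from the stats struct above
O = [50 211 833 1856 2133 1461 554 89 13];
E = [35.88 225.21 855.97 1818.58 2162.62 1439.93 536.40 111.61 13.80];

% Per-bin contributions to the Pearson chi-square statistic
contrib  = (O - E).^2 ./ E;
chi2stat = sum(contrib);

df = 6;   % 9 bins - 1 - 2 estimated parameters (mu, sigma)

% Upper-tail chi-square probability via the regularized incomplete gamma
p = gammainc(chi2stat/2, df/2, 'upper');
fprintf('chi2 = %.2f, df = %d, p = %.3f\n', chi2stat, df, p);

% Which bins contribute most to the statistic?
[~, worst] = sort(contrib, 'descend');
disp(worst(1:2))
```

Run as written, this should give a statistic of roughly 13.8 with p a little over 0.03, and the first and eighth bins (the two tails just inside the extremes) contribute most of it.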
  struct with fields:
    edges: [1×35 double]
        O: [1×34 double]
        E: [1×34 double]
Let's look at how well the guess worked; one should have at least 5 observations in a bin(*)
>> [min(stats.O), max(stats.O);min(stats.E), max(stats.E)]
As the NIST handbook notes, one of the weaknesses of the Chi-Square is there is no optimal binning algorithm so the results can be sensitive to the choice made.
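One way to get a feel for that sensitivity is to rerun the test over a few bin counts and compare the p-values. This is a sketch under assumptions: `data` is a synthetic stand-in for the attached file, the list of bin counts is arbitrary, and it relies on the 'NBins' option of chi2gof (Statistics and Machine Learning Toolbox).

```matlab
% Hypothetical stand-in for the attached data (NOT the original dataset)
rng(1);
data = 0.05 + 0.12*randn(7200, 1);
pd = fitdist(data, 'normal');

nbinsList = [10 20 35 50];      % arbitrary choices for illustration
p = zeros(size(nbinsList));

for i = 1:numel(nbinsList)
    % chi2gof pools bins whose expected count is below 5 (the 'EMin' default),
    % so the number of bins actually used can be smaller than requested
    [~, p(i), stats] = chi2gof(data, 'CDF', pd, 'NBins', nbinsList(i));
    fprintf('NBins = %2d: %2d bins after pooling, df = %2d, p = %.3f\n', ...
            nbinsList(i), numel(stats.O), stats.df, p(i));
end
```

If the decision flips between bin counts, that is a sign the data sit close to the threshold, not strong evidence either way.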
John D'Errico has a very valid point that with lots of data the test can reject more easily. Depending on the use of the data, these deviations from normality will probably not be very significant, unless, of course, you're doing something like estimating from the tails. There, a normal approximation will likely underestimate the observed frequency somewhat in the left tail and overestimate it in the right tail.
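The sample-size effect can be illustrated with the 9-bin counts from the first stats output. For a fixed relative deviation between observed and expected counts, the chi-square statistic grows in proportion to the number of observations. The sketch below is an idealization under stated assumptions: it scales both O and E by a hypothetical factor k = 0.1, keeps df = 6, and ignores that the tail bins would then fall below 5 expected counts and get pooled in practice.

```matlab
% Bin counts copied from the 9-bin stats output (n = 7200)
O = [50 211 833 1856 2133 1461 554 89 13];
E = [35.88 225.21 855.97 1818.58 2162.62 1439.93 536.40 111.61 13.80];
df = 6;

chi2full = sum((O - E).^2 ./ E);

% Hypothetical: same relative deviations, but only one tenth of the data
k = 0.1;
chi2small = sum((k*O - k*E).^2 ./ (k*E));   % algebraically k * chi2full

% Upper-tail chi-square probabilities (base MATLAB, no toolbox needed)
pFull  = gammainc(chi2full/2,  df/2, 'upper');
pSmall = gammainc(chi2small/2, df/2, 'upper');

fprintf('n = %4d: chi2 = %5.2f, p = %.3f\n', sum(O), chi2full, pFull);
fprintf('n = %4d: chi2 = %5.2f, p = %.3f\n', round(k*sum(O)), chi2small, pSmall);
```

The same shape of deviation that is rejected at the 0.05 level with 7200 points would not come close to rejection with 720.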
(*) My memory refresher is always the NIST handbook; what it says about Chi-Square GOF is at https://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm
I tend to rely upon the Shapiro-Wilk test which I don't believe TMW has implemented; I've a homebrew version I coded 40 years ago...