Generating PDF and Overplotting from Subset of Data Using Gaussian Mixture Model

3 visualizaciones (últimos 30 días)
Assume there is a data set that contains n-measurments (In this case n=2) and that overlap. I wish to discriminate a subset of the total set by identifying the greater proportion and then applying a probability density function fit onto the distribution. I tried doing just that by applying the fitgmdist function onto the data set, knowing n=2 and then chose the higher "componentproportion" as the true subset I wish to keep the fit for. I thought what I could do was apply makedist using the mu and sigma from the distirbution I chose (TDist), then create a probability density function using pdf so I can overplot it on top of the histogram data.
clc; clear;
DataSize = 10000;
linLength = 1000;
Data1 = normrnd(-2,1,[1,DataSize]);
Data2 = normrnd(3,2,[1,DataSize]);
Data = [Data1,Data2]';
Dist = fitgmdist(Data,2); %Create fits for both distributions
DistNdex = find(Dist.ComponentProportion==max(Dist.ComponentProportion)); %Find distribuition with greater contribution
TDist.Mu = Dist.Mu(DistNdex); %Average
Unrecognized method, property, or field 'Mu' for class 'gmdistribution'.

Error in indexing (line 22)
[varargout{1:nargout}] = builtin('subsref',this,s);
TDist.Sigma = Dist.Sigma(DistNdex); %Std. Dev
PD = makedist('Normal','mu',TDist.Mu,'sigma',TDist.Sigma); %Create normal distribution using mu/sigma
xPD = linspace(TDist.Mu - 3*TDist.Sigma,TDist.Mu + 3*TDist.Sigma,linLength); %Create linspace that spans 3 sigma
pdfValues = pdf('Normal',xPD,TDist.Mu,TDist.Sigma); %Create non-normalized pdf over defined linspace
NormPdfValues = normpdf(xPD,TDist.Mu,TDist.Sigma); %Create normalized pdf over defined linspace
%Plotting
figure
histogram(Data)
hold on
plot(xPD,pdfValues,'r','LineWidth',5)
plot(xPD,NormPdfValues,'g','LineWidth',5)
hold off
But my issue here is that the max y-value for the fit is incredibly small wrt the data (Regardless of whether it is normalized or not). Why is this and how do I specify what the max y-value should be for it's associated component distribution? I'm thinking I'm lacking a piece of stats knowledge here rather than having a Matlab issue.
PS - I know this code gives an error when I ran it within the browser. Not sure why it doesn't recognize the substructure "Mu" but it works just fine and runs on my local without issues.

Respuesta aceptada

Chris
Chris el 1 de Dic. de 2022
Editada: Chris el 1 de Dic. de 2022
I believe the area under the curve of these PDFs is 1.
One way to work around that (though probably not the most correct way) would be to scale the pdf by its max value, to the maximum of the histogram.
DataSize = 10000;
linLength = 1000;
Data1 = normrnd(-2,1,[1,DataSize]);
Data2 = normrnd(3,2,[1,DataSize]);
Data = [Data1,Data2]';
Dist = fitgmdist(Data,2); %Create fits for both distributions
DistNdex = find(Dist.ComponentProportion==max(Dist.ComponentProportion)); %Find distribuition with greater contribution
TDist.Mu = Dist.mu(DistNdex); %Average
TDist.Sigma = Dist.Sigma(DistNdex); %Std. Dev
PD = makedist('Normal','mu',TDist.Mu,'sigma',TDist.Sigma); %Create normal distribution using mu/sigma
xPD = linspace(TDist.Mu - 3*TDist.Sigma,TDist.Mu + 3*TDist.Sigma,linLength); %Create linspace that spans 3 sigma
pdfValues = pdf('Normal',xPD,TDist.Mu,TDist.Sigma); %Create non-normalized pdf over defined linspace
% NormPdfValues = normpdf(xPD,TDist.Mu,TDist.Sigma); %Create normalized pdf over defined linspace
%Plotting
figure
h = histogram(Data);
hold on
mult = max(h.BinCounts)/max(pdfValues);
plot(xPD, mult*pdfValues,'k--','LineWidth',2)
  2 comentarios
John
John el 1 de Dic. de 2022
Editada: John el 1 de Dic. de 2022
So I tried using h.BinCounts but I get the error:
Invalid or deleted object.
Error in matlab.graphics.chart.primitive.Histogram/get.BinCounts
May be the same reason as before: My version has different, recognized syntaxes. Also, so I could scale it like that for sure, but I think I read some places online that there is a way to scale the y-values for probability density functions based on data correctly. I'm just not sure what that is.
John
John el 1 de Dic. de 2022
Nevermind, the scaling does not matter and is left up to the user because regardless, none of them are normalized curves and can be normalized later on. The reason I was initially confused was because when one runs the histfit() function, the overplotted fit Matlab produces is also not normalized but I assumed there was some, consistent statistical method I was missing. I guess whoever programmed the scaling into the function just decided it based on their own experience? Matlab's output:
(Ignore the fact that it's applying a fit to the total data. This is just exemplifying the scaling)
Your "mlt" scaling method:
Close enough! One can at least see the fit now.

Iniciar sesión para comentar.

Más respuestas (0)

Categorías

Más información sobre Startup and Shutdown en Help Center y File Exchange.

Productos


Versión

R2020b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by