Draw a curve defining the bounds of a scatter plot to detect anomalous (sparse or low density) points
26 visualizaciones (últimos 30 días)
Simone A. el 9 de Sept. de 2023
Dear all, I have thousands of x,y data paris of observed and theoretical temperatures in a given region, where I need to identify some anomalous (for the purpose of my work) data points. I have tried PCA analysys (i.e., DBSCAN), statistical thresholds, curve fittings, etc., but none of them did work. I then tried a KDE-based approach, which seems to 'better' work for my data. To achieve the results shown below I used https://it.mathworks.com/matlabcentral/fileexchange/8430-flow-cytometry-data-reader-and-visualization as:
[hAxes,col,ctrs1,ctrs2,F] = dscatter(X,Y,'BINS',[250,250]);
% Please note, I modified the code to export hAxes,col,ctrs1,ctrs2 and plot
% the contour line as:
contour(ctrs1,ctrs2,F,0.0015,'k-'); % or 0.015 ---> See figure below
By using this apporach I was able to draw a contour line around the main cluster of my data, and I was mostly able to draw the line between 'main cluster' and 'anomalous' points. However, if I have too many 'anomalous' data points (see xy4 in the figure below), the method, using a fixed threshold, fails. Beside, I have to adjust the threshold based on the region of each x,y pair and I have no idea on how to find the correct threshold level (see the figure below - For the same region, a threshold level of 0.0015 seems to work for situations with a few anomalous points, but a threshold of 0.015 is needed when more spreaded points occurred).
What I would really like to do is to draw a curve around the main cluster of my scatterplot, so to devide the anomalous data points. I fully understand this may be a challenging task, but I hope you may provide some good alternatives and/or solutions.
Another solution, as it seems to work, may be defining automatically the density threshold level, but I don't really know where to start from.
Below, you can see 4 examples (xy 1 to 4 are also attached as .mat files). To the left is the simple x,y scatterplot. In the middle, the ideal curve (in red) defining the boundary of my scatterplot (manually sketched). To the right, the KDE-based approach discussed above. Please note the last figure where a threshold of 0.0015 fails, and a threshld of 0.015 is needed instead.
Any help is grately appreciated!
Any other approach to identify the points outside the red boundaries is more than wellcome!
Thanks a lot in advance
Matt J el 13 de Sept. de 2023
Editada: Matt J el 13 de Sept. de 2023
Sets=["xy1" "xy2" "xy3" "xy4"];
Más respuestas (1)
Mathieu NOE el 13 de Sept. de 2023
maybe this ?
I simply decided that I wanted the contour line corresponding to a level 5% of F range above the min value of F
[hAxes,col,ctrs1,ctrs2,F] = dscatter(xy4(:,1),xy4(:,2),'BINS',[250,250]);
lb = min(F,,'all');
ub = max(F,,'all');
threshold = lb+0.05*(ub-lb); % threshold is 5% of data range above min value
contour(ctrs1,ctrs2,F,[threshold threshold],'k-'); %