How to Cluster Dataset and remove outlier in MATLAB

I understand that you want to cluster the 4-feature dataset and remove the outliers from the dataset. This task can be carried out using the following workflow:

Determine the optimal number of clusters: The elbow method involves plotting the within-cluster sum of squares (WCSS) against the number of clusters and looking for the "elbow" point where the rate of decrease sharply changes. This point is often considered a good choice for the number of clusters.
Perform K-means clustering: After determining the optimal number of clusters, perform k-means clustering.
Remove outliers: Outliers can be detected and removed based on their distance from the centroid of their assigned cluster. A common approach is to remove points whose distance from the centroid exceeds a chosen threshold.

The following code snippet illustrates the above workflow:

data = Dataset;
wcss = zeros(1, 10); % Preallocate WCSS for k = 1..10
for k = 1:10 % Test up to 10 clusters
    [idx, C, sumd] = kmeans(data, k, 'Replicates', 10);
    wcss(k) = sum(sumd);
end
plot(1:10, wcss);
xlabel('Number of clusters');
ylabel('WCSS');
title('Elbow Method');
optimalK = 3; % Placeholder value: replace with the optimal number of clusters you determined from the elbow plot
[idx, C, sumd] = kmeans(data, optimalK, 'Replicates', 10);
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);
    centroid = C(i, :);
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
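
As a possible shortcut for the distance step above, `kmeans` can also return point-to-centroid distances as a fourth output, which makes the loop unnecessary. This is a minimal sketch, assuming the default `'sqeuclidean'` distance (so the returned values are squared and need a square root). The random matrix `X` and `k = 3` are stand-ins for illustration, not the original dataset.

```matlab
rng(0);                                   % reproducible stand-in data (assumption)
X = [randn(100, 4); randn(100, 4) + 5; randn(100, 4) - 5];
k = 3;                                    % placeholder cluster count

% Fourth output D is n-by-k: distance from every point to every centroid.
[idx, C, ~, D] = kmeans(X, k, 'Replicates', 10);

% For each row, pick the column that belongs to the point's own cluster.
n = size(X, 1);
ownD = D(sub2ind(size(D), (1:n)', idx));
distances = sqrt(ownD);                   % default metric is squared Euclidean

threshold = prctile(distances, 95);       % same 95th-percentile rule as above
outliers = distances > threshold;
Xclean = X(~outliers, :);
idxClean = idx(~outliers);
```

Note that a percentile threshold always flags roughly the same fraction of points (here about 5%), whether or not the data actually contains outliers, so the cutoff is worth checking against a plot of `distances`.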

Hope it helps!

Med Future on 23 Apr 2024

Edited: Walter Roberson on 24 Apr 2024

Question.mat

@Sai Pavan

I have implemented the code you shared in my own code, but I still get the error "Arrays have incompatible sizes for this operation." I have attached the dataset and the code below. Please modify the code to fix this. As I know the ground truth, there should be only 1 cluster; the remaining points are noise, based on the distance calculation.

load Question
dataset1=data(:,[2 4]);
% Step 1: Identify and remove outliers
freq_outliers = isoutlier(dataset1(:, 1));
pw_outliers = isoutlier(dataset1(:, 2));
outliers = freq_outliers | pw_outliers;
% Step 2: Remove rows with outliers from all columns
dataset1_no_outliers = dataset1(~outliers, :);
pdw_no_outliers = data(~outliers, :);
% Now, continue with your existing code using 'dataset1_no_outliers'
eva = evalclusters(dataset1_no_outliers, 'kmeans', 'silhouette', 'KList', [1:8]);
%eva = evalclusters(dataset1, 'kmeans', 'silhouette', 'KList', [1:8]);
K = eva.OptimalK;
[idx,C,sumdist] = kmeans(dataset1,K);
dataset=data;
dataset_idx=zeros(length(dataset),5);
dataset_idx=dataset(:,1:5);
dataset_idx(:,6)=idx;
clusters = cell(K,1);
for i = 1:K
   clusters{i} = dataset_idx(dataset_idx(:,6) == i,:);
end
cluster_assignments=idx;
optimalK=K
optimalK = 4
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);
    centroid = C(i, :);
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
Arrays have incompatible sizes for this operation.
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);

Med Future on 24 Apr 2024

@Image Analyst @Walter Roberson Can you please take a look at how to solve this issue?

Answer 2

Walter Roberson on 24 Apr 2024

Direct link to this answer:

https://es.mathworks.com/matlabcentral/answers/1894680-how-to-cluster-dataset-and-remove-outlier-in-matlab#answer_1447371

Moved: Walter Roberson on 24 Apr 2024

Question.mat

load Question
dataset1=data(:,[2 4]);

dataset1 is created from 2 columns of data

% Step 1: Identify and remove outliers
freq_outliers = isoutlier(dataset1(:, 1));
pw_outliers = isoutlier(dataset1(:, 2));
outliers = freq_outliers | pw_outliers;
% Step 2: Remove rows with outliers from all columns
dataset1_no_outliers = dataset1(~outliers, :);
pdw_no_outliers = data(~outliers, :);
% Now, continue with your existing code using 'dataset1_no_outliers'
eva = evalclusters(dataset1_no_outliers, 'kmeans', 'silhouette', 'KList', [1:8]);
%eva = evalclusters(dataset1, 'kmeans', 'silhouette', 'KList', [1:8]);
K = eva.OptimalK;
[idx,C,sumdist] = kmeans(dataset1,K);

C is created from dataset1 so it has two columns

dataset=data;
dataset_idx=zeros(length(dataset),5);
dataset_idx=dataset(:,1:5);
dataset_idx(:,6)=idx;
clusters = cell(K,1);
for i = 1:K
   clusters{i} = dataset_idx(dataset_idx(:,6) == i,:);
end
cluster_assignments=idx;
optimalK=K
optimalK = 4
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);

data has 6 columns, so clusterPoints has 6 columns

    centroid = C(i, :);

centroid is created from C so it has two columns

    whos clusterPoints centroid
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));

You are trying to subtract something with 2 columns from something with 6 columns, which is an error

end
  Name                 Size            Bytes  Class     Attributes

  centroid             1x2                16  double              
  clusterPoints      177x6              8496  double              
Arrays have incompatible sizes for this operation.
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
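
One way to resolve the size mismatch described above is sketched here, under the assumption that the distances should be measured in the same two columns that `kmeans` was run on. Inside the loop, index `dataset1` (2 columns) instead of `data` (6 columns), then use the resulting logical mask to filter the rows of `data`. The synthetic `data` below is only a stand-in for `Question.mat`, and `K = 2` is a placeholder for the value the thread takes from `evalclusters`.

```matlab
rng(1);                                   % stand-in for "load Question" (assumption)
data = [randn(200, 6); randn(20, 6) * 4 + 10];
dataset1 = data(:, [2 4]);                % same two columns as in the thread
K = 2;                                    % placeholder cluster count

[idx, C] = kmeans(dataset1, K, 'Replicates', 5);

% Distances are computed in the 2-column space that C lives in.
distances = zeros(size(dataset1, 1), 1);
for i = 1:K
    clusterPoints = dataset1(idx == i, :);    % 2 columns, matches C(i, :)
    centroid = C(i, :);
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end

threshold = prctile(distances, 95);
outliers = distances > threshold;
dataCleaned = data(~outliers, :);         % all 6 columns kept, outlier rows dropped
idxCleaned = idx(~outliers);
```

Because `dataset1` and `data` have the same number of rows, the mask computed from the 2-column distances can be applied directly to the full 6-column matrix.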

Med Future on 25 Apr 2024

@Walter Roberson Thank you for the detailed explanation. Basically, the problem is to reassign the clusters that K-means has already produced; that is, I want to remove the outliers. As you can see, recalculating the distance from each centroid to its cluster points is what triggers the error. Can you please help me solve this problem?
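
For the case mentioned earlier in the thread, where the ground truth is a single cluster and everything else is noise, one possible approach skips `kmeans` entirely. It measures each point's distance from a robust center and flags the unusually distant points. This is a swapped-in technique, not the method from the answers above. It is a minimal sketch, assuming two features and synthetic stand-in data; the variable names are hypothetical.

```matlab
rng(2);                                                      % reproducible stand-in data
signal = [5 + 0.1*randn(300, 1), 2 + 0.05*randn(300, 1)];   % one tight cluster
noise  = [10*rand(30, 1), 4*rand(30, 1)];                    % scattered noise points
X = [signal; noise];

center = median(X, 1);                    % robust estimate of the cluster center
scale  = median(abs(X - center), 1);      % per-column median absolute deviation
Z = (X - center) ./ scale;                % put both features on a comparable scale
dist = sqrt(sum(Z.^2, 2));                % distance of each point from the center

isNoise = isoutlier(dist);                % default rule: more than 3 scaled MADs from the median
cluster = X(~isNoise, :);                 % points kept as the single cluster
```

The median and MAD are used instead of the mean and standard deviation so that the noise points have less influence on the center and scale they are measured against. If a column is constant, `scale` would be zero and the division would need a guard.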
