Separating 1D data into clusters and counting the amount of elements in each

Question

Menno van Dam el 20 de Ag. de 2021

0
Enlazar

Enlace directo a esta pregunta

https://es.mathworks.com/matlabcentral/answers/1437724-separating-1d-data-into-clusters-and-counting-the-amount-of-elements-in-each

Comentada: Image Analyst el 22 de Ag. de 2021

Hello,

I have a 1-dimensional array with index values of extremes I found in a much larger dataset, ordered from lowest to highest. I wish to seek out if there is any data clustering within this array. What's the simplest way to do this? The total amount of clusters is unknown beforehand. A cluster can be defined as any values that are within a certain predefined distance from each other. I'd like to know the amount of clusters found, and an array with the number of values found within each cluster.

I made a simple illustration of what I'd like in the figure below. The crosses indicate values in the array on the number line. Values close together (removed less than a given distance), are grouped together and counted.

So, in this case, the output would be;

5 (number of clusters found)

[3, 2, 1, 1, 5] (size of each cluster)

Thanks in advance!

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Kelly Kearney el 21 de Ag. de 2021

1
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/1437724-separating-1d-data-into-clusters-and-counting-the-amount-of-elements-in-each#answer_771659

Abrir en MATLAB Online

If the cluster tolerance defines how close points need to be to their nearest neighbor in order to be called a cluster, then you just need to check for where the gaps are larger than that tolerance:

% The data
 
tol = 0.1;
x = [1 1 1 2 2 3 4 5 5 5 5 5];
x = x + rand(size(x))*tol-tol/2;
 
% Find clusters
 
dx = diff(x);
isnew = dx > tol;
 
idx = cumsum([1 isnew]);
ncluster = max(idx)
ncluster = 5
nper = splitapply(@length, idx, idx)
nper = 1×5
     3     2     1     1     5

If instead your tolerance sets the maximum distance a point can be from all points in its cluster, then things get a bit trickier.

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Image Analyst el 22 de Ag. de 2021

Abrir en MATLAB Online

@Menno van Dam, be aware that this solution, unlike dbscan() used in my solution, requires that you sort x in advance. So if you had a pair of 1's at the end, not near the original 1's, they would not be counted

x = [1 1 1 2 2 3 4 5 5 5 5 5 1 1];

It says:

nper =

3 2 1 1 7

so notice that the final 2 1's get grouped in with the group of five 5's to give a count of three 1's and seven 5's, instead of five 1's and five 5's.

The fix is to sort the x before adding random numbers to it:

x = sort([1 1 1 2 2 3 4 5 5 5 5 5 1 1], 'ascend');

Then you get

nper =

5 2 1 1 5

which I think is what you intend.

I'll post another answer below that uses findgroups() and histcounts() to get the answer regardless if it's sorted or not.

Menno van Dam el 22 de Ag. de 2021

@Image Analyst Thank you for the notifying me about that. Luckily, my data already happened to be sorted, so this wasn't an issue for me. Still, it is useful to keep that in mind.

Iniciar sesión para comentar.

Answer 2

Image Analyst el 21 de Ag. de 2021

1
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/1437724-separating-1d-data-into-clusters-and-counting-the-amount-of-elements-in-each#answer_771664

Editada: Image Analyst el 21 de Ag. de 2021

Abrir en MATLAB Online

dbscan_demo.m

If you have the Statistics and Machine Learning Toolbox, there is a function that does this. It's called dbscan() after the clustering algorithm of the same name (which should probably be more famous than it is.)

See attached dbscan demo, and Wikipedia description:

https://en.wikipedia.org/wiki/DBSCAN

Basically it finds all points within a specified distance from other points by daisy chaining its way along. All those points that can be connected by a daisy chain length less than what you specified belong to one cluster. Then it moves on to other unclassified points to find more clusters. Points that are farther away from any point than your distance are not clusters, or in other words they're a cluster of only 1 point.

Here's a very well commented demo that is more tailored to your data:

% First define the data.
x = [1 1 1 2 2 3 4 5 5 5 5 5];
y = zeros(1, length(x));
% Plot it.
plot(x, y, 'rx', 'MarkerSize', 20, 'LineWidth', 3);
ax = gca;
ax.XAxisLocation = 'origin';
grid on;
% Now find clusters.
minDistance = 0.5; % Need to be closer than this to be in the same cluster.
minPointsPerCluster = 2; % A cluster must have at least 2 points to be a valid cluster.
indexes = dbscan([x',y'], minDistance, minPointsPerCluster)
% The value of indexes say what cluster number that point belongs to.
% If the value is -1, it does not belong to a cluster and is a single, isolated point.
% Find the number (count) in each cluster
pointCounts = hist(indexes, -1 : max(indexes))
% It says there are 2 points not in a cluster (single isolated points),and
% then the counts are 3, 2, and 5 for the 3 clusters it found.

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Answer 3

Image Analyst el 22 de Ag. de 2021

1
Enlazar

Enlace directo a esta respuesta

https://es.mathworks.com/matlabcentral/answers/1437724-separating-1d-data-into-clusters-and-counting-the-amount-of-elements-in-each#answer_772314

Abrir en MATLAB Online

You can do this in a single line of code with a call to findgroups() and histcounts():

% First define the data.
x = [1 1 1 2 2 3 4 5 5 5 5 5];
% Now find groups.
g1 = findgroups(x)
% Count the number in each group.
counts1 = histcounts(g1)
% Now with another group of 1's on the end, separated from the first group of 1's.
x = [1 1 1 2 2 3 4 5 5 5 5 5 1 1];
counts2 = histcounts(findgroups(x)) % Combining 2 statements into 1.

You see:

counts1 =

3 2 1 1 5

counts2 =

5 2 1 1 5

as you should.

2 comentarios
Mostrar NingunoOcultar Ninguno

Menno van Dam el 22 de Ag. de 2021

Thank you for your answer. This doesn't use a tolerance for finding the clusters, right? It works well if the array contains nonunique values. My data however consists of only unique values, some just happen to be very close to each other (like a group of 22000, 22001 and 22003, which could be seen as a single cluster of 3 datapoints). This is why I wanted to use a predefined tolerance/distance variable for identifying clusters.

Image Analyst el 22 de Ag. de 2021

findgroups() doesn't appear to have a tolerance and appears to expect discrete values, so 22000 and 22001 would be in two different groups. Your original array seemed like it was integers with some repeated so that's why I suggested findgroups(). But if they are slightly different, and you have some tolerance, I believe most experts would recommend you use dbscan(), that is, if you have the Statistics and Machine Learning Toolbox.

Iniciar sesión para comentar.

Separating 1D data into clusters and counting the amount of elements in each

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Más respuestas (2)

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

2 comentarios
Mostrar NingunoOcultar Ninguno

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

Separating 1D data into clusters and counting the amount of elements in each

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

3 comentarios Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

Más respuestas (2)

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

2 comentarios Mostrar NingunoOcultar Ninguno

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

3 comentarios
Mostrar 1 comentario más antiguoOcultar 1 comentario más antiguo

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

2 comentarios
Mostrar NingunoOcultar Ninguno