evalclusters function gives different results on each run

Hi,
I am using the evalclusters function to evaluate the number of clusters for k-means clustering, like this:
eval = evalclusters(data, 'kmeans', 'gap', 'klist', [1:10],'B', 50, 'SearchMethod', 'firstMaxSE');
However, each time I run the function, it returns a different number of clusters. I'm quite confused about this.
Could you please explain why this happens? Also, which parameter settings (klist, the number of reference data sets B, the search method, the reference distribution, and so on) are most suitable if I want to use the gap criterion, for instance?
Thank you!
Cheers,
Ni

1 Comment

I am getting different results each time too, but not too different (K=8, 9 or 10).


Answers (1)

Walter Roberson on 28 Feb 2018
From the kmeans documentation: "Start: Method for choosing initial cluster centroid positions (or seeds), specified as the comma-separated pair consisting of 'Start' and 'cluster', 'plus', 'sample', 'uniform', a numeric matrix, or a numeric array. This table summarizes the available options for choosing seeds."
Notice that all of the options in the table except the numeric matrix or numeric array involve random selection, which is going to have results that depend upon the state of the random number generator.
You have two choices:
  1. You can provide the Start option with a numeric matrix or numeric array of exact initial cluster positions; or
  2. You can set the random number generator to the same state before each run.

2 Comments

The 'Start' parameter is not valid in the evalclusters command. How can I set the random number generator to a fixed value?
evalclusters permits you to pass a function handle as the method, which could be a call to kmeans with the Start parameter set.
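As a rough sketch of the function-handle approach, assuming the fisheriris sample data and a hypothetical pool of fixed starting centroids (neither is from the original question): the handle must have the form C = clustfun(DATA, K). I use 'CalinskiHarabasz' here because the function-handle form is documented for that criterion; note that 'gap' also draws random reference data sets, so a fixed Start alone would not remove all randomness there.

```matlab
load fisheriris                 % sample data from the toolbox: meas is 150-by-4
X = meas;

% Hypothetical choice: six fixed rows of X serve as the pool of starting centroids.
pool = X(1:25:150, :);          % rows 1, 26, 51, 76, 101, 126

% Clustering function of the form C = clustfun(DATA, K).
% For each K the first K rows of the pool are used, so no random seeding occurs.
fixedStart = @(data, K) kmeans(data, K, 'Start', pool(1:K, :));

eva1 = evalclusters(X, fixedStart, 'CalinskiHarabasz', 'KList', 2:6);
eva2 = evalclusters(X, fixedStart, 'CalinskiHarabasz', 'KList', 2:6);

% Compare the two runs; with fixed starts they should agree.
sameK = isequal(eva1.OptimalK, eva2.OptimalK)
```

The pool only needs as many rows as the largest K in KList, and its column count must match the data.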
The documentation for evalclusters shows an example of looping running kmeans for different cluster sizes and passing the results into evalclusters for analysis.
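A minimal sketch of that looping pattern, again assuming the fisheriris sample data and a hypothetical set of fixed starting centroids (my choices, not the poster's). Each column of the matrix holds the cluster indices for one candidate number of clusters, and the matrix is passed in place of the algorithm name. As far as I know, the matrix form is meant for criteria such as 'CalinskiHarabasz', 'DaviesBouldin', or 'silhouette' rather than 'gap'.

```matlab
load fisheriris                 % sample data from the toolbox: meas is 150-by-4
X = meas;
klist = 2:6;
pool = X(1:25:150, :);          % hypothetical fixed starting centroids, 6-by-4

% One column of cluster indices per candidate number of clusters.
solutions = zeros(size(X, 1), numel(klist));
for i = 1:numel(klist)
    k = klist(i);
    solutions(:, i) = kmeans(X, k, 'Start', pool(1:k, :));
end

% Pass the precomputed solutions instead of a clustering algorithm name.
eva = evalclusters(X, solutions, 'CalinskiHarabasz');
disp(eva.OptimalK)
```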
If you do either of the above two, then you can set the same starting point for each of the kmeans runs, and so you are able to directly compare the effects of using different numbers of clusters for the same configuration.
Or you could use
rng(655321)
before each call to evalclusters(). If you do this then you will be able to replicate the evalclusters() results, but each of the individual kmeans calls will use a different set of starting centroids, which makes it more difficult to directly compare the effects of using a different number of clusters versus differences due to different starting points.
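For illustration, a small sketch of the rng approach using the fisheriris sample data (an assumption on my part; substitute your own data) and a smaller 'B' just to keep it quick. Resetting the generator to the same seed before each call makes the random reference data sets and the kmeans seeding repeat exactly, so the two evaluations are expected to match.

```matlab
load fisheriris                 % sample data from the toolbox: meas is 150-by-4
X = meas;

rng(655321)                     % reset the generator before the first call
eva1 = evalclusters(X, 'kmeans', 'gap', 'KList', 1:6, ...
    'B', 20, 'SearchMethod', 'firstMaxSE');

rng(655321)                     % same seed again before the second call
eva2 = evalclusters(X, 'kmeans', 'gap', 'KList', 1:6, ...
    'B', 20, 'SearchMethod', 'firstMaxSE');

% Both the chosen K and the gap values should be identical across the two runs.
sameK    = isequal(eva1.OptimalK, eva2.OptimalK)
sameGaps = isequal(eva1.CriterionValues, eva2.CriterionValues)
```

Without the second rng call, the generator state would have advanced during the first evaluation, which is why repeated runs can disagree.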



Asked: 28 Jan 2015

Commented: 1 Feb 2020
