Adapt k-means to only use datapoints as centroids

Moritz H on 16 Aug 2016
Edited: John D'Errico on 8 Sep 2016
Hey!
I am looking for a way to adapt/extend the kmeans function so that it only returns points of the original input data as centroids, while still minimizing the error of the global problem.
I quickly hacked this in:
line 756:
case 'sqeuclidean'
C(nidx,:) = C(nidx,:) + (X(moved,:) - C(nidx,:)) / m(nidx);
C(oidx,:) = C(oidx,:) - (X(moved,:) - C(oidx,:)) / m(oidx);
% edit: make centroid the nearest member of X (only for 2 dimensions!)
dist_vec = bsxfun(@hypot, X(:,1)-C(nidx,1), X(:,2)-C(nidx,2));
C(nidx,:) = X(find(dist_vec == min(dist_vec),1),:);
dist_vec = bsxfun(@hypot, X(:,1)-C(oidx,1), X(:,2)-C(oidx,2));
C(oidx,:) = X(find(dist_vec == min(dist_vec),1),:);
function gcentroids (line 863):
case 'sqeuclidean'
centroids(i,:) = sum(X(members,:),1) / counts(i);
% edit: make centroid the nearest member of X (only for 2 dimensions!)
dist_vec = bsxfun(@hypot, X(:,1)-centroids(i,1), X(:,2)-centroids(i,2));
centroids(i,:) = X(find(dist_vec == min(dist_vec),1),:);
Obviously I only handle 'sqeuclidean' and 2-D data.
With this I get pretty good results, even though the problem fails to converge within 100 steps in any of the 5 replicates (which takes a long time).
I am wondering if there is a better/faster way to do this.
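For reference, the "snap the centroid to its nearest member of X" step does not need the 2-D restriction. Here is a minimal sketch in Python (not MATLAB) using plain squared Euclidean distance in any dimension; all names are illustrative, not from the code above:

```python
# Sketch of the "snap centroid to nearest member of X" step, without the
# 2-D restriction. Python rather than MATLAB; names are illustrative.

def snap_to_nearest(points, centroid):
    """Return the point in `points` with the smallest squared
    Euclidean distance to `centroid` (works in any dimension)."""
    def sqdist(p):
        return sum((a - b) ** 2 for a, b in zip(p, centroid))
    return min(points, key=sqdist)

X = [(0.0, 0.0), (1.0, 1.0), (4.0, 5.0)]
print(snap_to_nearest(X, (0.9, 0.8)))  # → (1.0, 1.0)
```

In MATLAB the same idea vectorizes as a single `sum((X - C(nidx,:)).^2, 2)` followed by `min`, which avoids the hypot/2-D special case.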
Thank you
  1 comment
Moritz H on 8 Sep 2016
Does someone have a take on this?
Also, it is not clear to me that the best solution is actually picked after the 5 replicates finish (or are aborted), as you can see from this console output: the reported 'Best total sum of distances = 746573' does not match any replicate.
Replicate 1, 100 iterations, total sum of distances = 703298.
Warning: Failed to converge in 100 iterations during replicate 1.
> In kmeans_constraint/loopBody (line 440)
In internal.stats.parallel.smartForReduce (line 136)
In kmeans_constraint (line 316)
In KDEFindLines (line 11)
In GUI>btnStartInput_Callback (line 220)
In gui_mainfcn (line 95)
In GUI (line 42)
In @(hObject,eventdata)GUI('btnStartInput_Callback',hObject,eventdata,guidata(hObject))
Replicate 2, 100 iterations, total sum of distances = 677004.
Warning: Failed to converge in 100 iterations during replicate 2.
> In kmeans_constraint/loopBody (line 440)
In internal.stats.parallel.smartForReduce (line 136)
In kmeans_constraint (line 316)
In KDEFindLines (line 11)
In GUI>btnStartInput_Callback (line 220)
In gui_mainfcn (line 95)
In GUI (line 42)
In @(hObject,eventdata)GUI('btnStartInput_Callback',hObject,eventdata,guidata(hObject))
Replicate 3, 100 iterations, total sum of distances = 661592.
Warning: Failed to converge in 100 iterations during replicate 3.
> In kmeans_constraint/loopBody (line 440)
In internal.stats.parallel.smartForReduce (line 136)
In kmeans_constraint (line 316)
In KDEFindLines (line 11)
In GUI>btnStartInput_Callback (line 220)
In gui_mainfcn (line 95)
In GUI (line 42)
In @(hObject,eventdata)GUI('btnStartInput_Callback',hObject,eventdata,guidata(hObject))
Replicate 4, 100 iterations, total sum of distances = 687313.
Warning: Failed to converge in 100 iterations during replicate 4.
> In kmeans_constraint/loopBody (line 440)
In internal.stats.parallel.smartForReduce (line 136)
In kmeans_constraint (line 316)
In KDEFindLines (line 11)
In GUI>btnStartInput_Callback (line 220)
In gui_mainfcn (line 95)
In GUI (line 42)
In @(hObject,eventdata)GUI('btnStartInput_Callback',hObject,eventdata,guidata(hObject))
Replicate 5, 100 iterations, total sum of distances = 728279.
Warning: Failed to converge in 100 iterations during replicate 5.
> In kmeans_constraint/loopBody (line 440)
In internal.stats.parallel.smartForReduce (line 136)
In kmeans_constraint (line 316)
In KDEFindLines (line 11)
In GUI>btnStartInput_Callback (line 220)
In gui_mainfcn (line 95)
In GUI (line 42)
In @(hObject,eventdata)GUI('btnStartInput_Callback',hObject,eventdata,guidata(hObject))
Best total sum of distances = 746573


Answers (1)

John D'Errico on 8 Sep 2016
Edited: John D'Errico on 8 Sep 2016
Hacking code that you did not write tends to lead to unexpected consequences. In fact, if you were going to write a k-means code that used only the data points as centers, you might even choose some other basic scheme.
Anyway, a k-means code is not that tough to write. So start from scratch and write it yourself. Or expect to have to live with a buggy, hacked code. You may well find that it is faster to write new code from scratch anyway, compared to the time you will spend debugging problems created by your hacks.
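To make the "write it yourself" suggestion concrete: constraining centers to be actual data points is essentially the k-medoids problem, and a bare-bones version of it is short. Below is an assumed outline in Python (not MATLAB, and not the answerer's code); the naive initialization and O(n²) update step are for clarity only:

```python
# Minimal k-medoids-style sketch: alternate an assignment step with a
# "pick the best member as the new center" step. Illustrative only.

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmedoids(X, k, iters=100):
    medoids = list(X[:k])  # naive init: first k points
    for _ in range(iters):
        # Assignment: each point joins its nearest medoid's cluster.
        clusters = [[] for _ in range(k)]
        for p in X:
            j = min(range(k), key=lambda j: sqdist(p, medoids[j]))
            clusters[j].append(p)
        # Update: new medoid = member minimizing within-cluster cost.
        new_medoids = [
            min(c, key=lambda m: sum(sqdist(p, m) for p in c)) if c else medoids[j]
            for j, c in enumerate(clusters)
        ]
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids
```

Because the centers are always actual data points, no snapping step is needed, and since neither the assignment nor the update step can increase the total cost, the loop converges instead of oscillating the way the hacked kmeans can.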
