How to improve K-means clustering with TF-IDF?
Geovane Gomes
on 7 Oct 2024
Commented: Christopher Creutzig
on 22 Oct 2024
Hi all,
I’m currently working on a project where I need to classify company segments based on their activity descriptions.
I’ve implemented K-means clustering using TF-IDF for feature extraction from text data. However, the current clustering results aren’t entirely accurate, especially when it comes to grouping semantically similar segments (e.g., "cars" and "vehicles" are placed into separate clusters). Is it possible to optimise this, or should I use another approach instead of TF-IDF?
See cluster 13. More than 50% of the items were assigned to this cluster. I also tried using other distance parameters, but the results didn't improve.
Here is my code:
clear
close
% load and preprocess
d = readtable("segmentos95Translated.xlsx");
t = d.TRANSLATED;
for i = 1:height(t)
    str = t{i};
    splitStr = strsplit(str, 'EXCEPT');
    t{i} = strtrim(splitStr{1});
end
for i = 1:height(t)
    str = t{i};
    splitStr = strsplit(str, 'WITHOUT PREDOMINANCE');
    t{i} = strtrim(splitStr{1});
end
% tokenization
t = lower(t);
t = tokenizedDocument(t);
t = removeStopWords(t);
t = normalizeWords(t);
customStopWords = ["manufactur","activ",",","rental","(",")","*","exempt", ...
    "commerci","repres","agent","trade","product","retail","sale","waiv","special","wholesal"];
t = removeWords(t,customStopWords);
% bag of words and TF-IDF
bag = bagOfWords(t);
tfidfMatrix = tfidf(bag);
X = full(tfidfMatrix);
% kmeans
rng(1)
numClusters = 25; % about 10%
[idx, C, sumd, D] = kmeans(X, numClusters);
d.clusters = idx;
% display results
for i = 1:numClusters
    fprintf('Cluster %d:\n', i);
    disp(d.TRANSLATED(idx == i));
end
sortrows(groupcounts(d,"clusters"),"Percent","descend")
Accepted Answer
Sandeep Mishra
on 8 Oct 2024
Hi Geovane,
I can observe that you are trying to enhance the accuracy of your K-means clustering implementation.
The current implementation using 'TF-IDF' treats every distinct word as its own feature, so it cannot capture semantic relationships between words. As a result, synonyms and related terms such as "cars" and "vehicles" are treated as unrelated.
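To make that concrete, here is a minimal sketch in plain MATLAB. The two strings are toy examples made up for illustration, not taken from the spreadsheet. It builds term-count vectors for two descriptions that mean similar things but share no words, and computes their cosine similarity.

```matlab
% Two toy descriptions (hypothetical) with related meaning but no shared words
docA = "cars dealer";
docB = "vehicles rental";

% Shared vocabulary: one dimension per distinct word
vocab = unique([split(docA); split(docB)]);

% Term-count vector of each document over that vocabulary
countA = sum(split(docA)' == vocab, 2);
countB = sum(split(docB)' == vocab, 2);

% Cosine similarity between the two count vectors
cosSim = dot(countA, countB) / (norm(countA) * norm(countB));
disp(cosSim)   % 0: the documents have no word in common
```

IDF weighting only rescales each word's dimension, so it leaves this zero unchanged. From the point of view of K-means, the two descriptions are as far apart as any other pair with no common words.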
To resolve this, you can use word embeddings such as 'fastText' which represent words in a continuous vector space, capturing semantic meanings.
You can leverage the 'Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding' add-on in MATLAB to implement 'fastText' word embedding.
Consider the following implementation:
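% Convert each tokenized document to a single string.
% Note: t was stemmed by normalizeWords above; stems such as "manufactur"
% may not be in the embedding vocabulary, so unstemmed tokens may work better here.
textData = arrayfun(@(doc) joinWords(doc), t, 'UniformOutput', false);
% Load the fastText word embedding (requires the add-on)
emb = fastTextWordEmbedding;
% Represent each document as the mean of its in-vocabulary word vectors
X = zeros(numel(textData), emb.Dimension);
for i = 1:numel(textData)
    words = split(textData{i});
    validWords = words(isVocabularyWord(emb, words));
    if ~isempty(validWords)
        vecs = word2vec(emb, validWords);
        X(i, :) = mean(vecs, 1);
    end % documents with no in-vocabulary words keep an all-zero row
end
% Cluster the document vectors
[idx, C] = kmeans(X, numClusters);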
For more information about the ‘Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding’ add-on, see its File Exchange page: https://www.mathworks.com/matlabcentral/fileexchange/66229-text-analytics-toolbox-model-for-fasttext-english-16-billion-token-word-embedding
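One thing to watch with the averaged embeddings: documents with no in-vocabulary words keep an all-zero row in X, and those rows will tend to land in one cluster together. A possible variation, offered as a sketch and not as part of the code above, is to set those rows aside and cluster the rest with cosine distance, which compares direction and ignores vector length. The random matrix below is only a stand-in for the real X, and the cluster count is arbitrary.

```matlab
% Stand-in for the averaged embedding matrix X (documents x dimensions);
% the values here are random and purely illustrative
rng(1)
X = randn(40, 10);
X([5 17], :) = 0;               % mimic documents with no in-vocabulary words

% Keep only documents that actually received a vector
hasVector = any(X ~= 0, 2);
Xnz = X(hasVector, :);

% Cosine distance compares direction only; Replicates reruns kmeans from
% several random starts and keeps the best solution
numClusters = 4;                % arbitrary for this toy example
idxNz = kmeans(Xnz, numClusters, 'Distance', 'cosine', 'Replicates', 5);

% Map labels back to the original rows; NaN marks documents without a vector
idx = nan(size(X, 1), 1);
idx(hasVector) = idxNz;
```

The NaN rows can then be reviewed by hand or assigned with a fallback such as the TF-IDF features.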
I hope this helps.
4 comments
Christopher Creutzig
on 22 Oct 2024
Also worth checking out are documentEmbedding and, for a different workflow with “soft clustering,” fitlda.
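A minimal sketch of the fitlda route, assuming Text Analytics Toolbox is installed. The six descriptions and the choice of two topics are made up for illustration; with the real data you would pass the existing bag-of-words model instead. Each document gets a probability for every topic instead of a single hard label.

```matlab
% Toy activity descriptions (hypothetical stand-ins for the real data)
rng(1)
descriptions = [ ...
    "car and truck dealer"
    "vehicle and car repair"
    "truck parts and vehicle repair"
    "bread and cake bakery"
    "bakery and pastry shop"
    "cake and pastry catering"];

docs = tokenizedDocument(descriptions);
docs = removeStopWords(docs);
bag = bagOfWords(docs);

% Fit an LDA topic model; numTopics plays the role of the cluster count
numTopics = 2;
mdl = fitlda(bag, numTopics, 'Verbose', 0);

% One row per document, one column per topic, each row sums to 1
topicMix = mdl.DocumentTopicProbabilities;

% If a hard assignment is still needed, take the most probable topic
[~, dominantTopic] = max(topicMix, [], 2);

% Inspect the highest-probability words of the first topic
topkwords(mdl, 3, 1)
```

The topic mixtures show when a description sits between two segments, which a single K-means label hides.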