Cleveland heart disease dataset - how to improve test accuracy?

HOCK WENG on 10 May 2024
Answered: Drew on 28 Jun 2024
I'm using a multilayer perceptron with backpropagation to predict heart disease, and I am using the dataset linked here: https://drive.google.com/file/d/1ZuVXGbE6UVQFJ5ab5m1k4LzvDNTtLqYQ/view?usp=sharing
It has 303 records and 5 output classes (0, 1, 2, 3, 4) that represent the severity of heart disease on a scale of 0 to 4.
There's missing data in the dataset that I took care of by replacing missing values with the mean of the respective feature.
Here are the parameters that I set to train the model:
  • Number of hidden layer neurons = 100 with single hidden layer
  • Number of output layer neurons = 5
  • activation function of hidden layer = logsig
  • activation function of output layer = softmax
  • training function = trainlm
  • learning rate = 0.001
  • Maximum validation failures = 10
  • Maximum epochs = 5000
  • Minimum gradient = 1.99e-8
But no matter what I do (adjusting the learning rate, the number of hidden layers, etc.), the test accuracy stays between 55% and 60%, while the training accuracy can exceed 80%. I need the test accuracy to be above 80% too.
How can I achieve my target? Please help me solve this problem, thank you.
Here is what I have so far, for reference:
% Load the dataset
data = readtable('C:\Users\User\Desktop\processed.cleveland.csv');
% Convert table to matrix
X = table2array(data(:, 1:13)); % Assuming the first 13 columns are input features
y = table2array(data(:, 14)); % Assuming the last column is the output label
% Handle missing values (if any)
% Replace missing values with the mean of the respective feature
X = fillmissing(X, 'constant', mean(X, 1, 'omitnan'));
% Z-score normalization (equivalent to X = normalize(X))
mu = mean(X, 1, 'omitnan'); % Mean of each feature, ignoring NaNs
sigma = std(X, 0, 1, 'omitnan'); % Standard deviation of each feature, ignoring NaNs
X = (X - mu) ./ sigma; % Perform Z-score normalization
% Split the data into training and testing sets
cv = cvpartition(size(X,1),'HoldOut',0.5); % Hold out 50% of the data for testing
idxTrain = training(cv); % Indices for training set
idxTest = test(cv); % Indices for testing set
X_train = X(idxTrain,:);
y_train = y(idxTrain,:);
X_test = X(idxTest,:);
y_test = y(idxTest,:);
% Define the MLP architecture
hiddenLayerSize = 100; % Single hidden layer with 100 neurons
% Choose activation function for the output layer
outputLayerActivation = 'softmax'; % Softmax for multi-class classification
% Create the MLP model
net = patternnet(hiddenLayerSize);
% Set activation functions for hidden layers
for i = 1:numel(hiddenLayerSize)
    net.layers{i}.transferFcn = 'logsig'; % Log-sigmoid activation
    % Note: userdata is only user storage, so this line does not enable dropout
    net.layers{i}.userdata.dropoutFraction = 0.5;
end
% Set activation function for output layer
net.layers{end}.transferFcn = outputLayerActivation;
% Set training function
net.trainFcn = 'trainlm';
% Set training options
net.trainParam.lr = 0.001; % Learning rate (trainlm does not use lr; its step size is governed by mu)
net.trainParam.max_fail = 10; % Maximum validation failures
net.trainParam.epochs = 5000; % Maximum epochs
net.trainParam.min_grad = 1.99e-8; % Minimum gradient
% Train the MLP model using training data
net = train(net, X_train', ind2vec(y_train'+1)); % '+1' to convert labels to 1-based indexing
% Test the trained model using testing data
y_pred = net(X_test');

Answers (1)

Drew on 28 Jun 2024
Main ideas:
The answer rests on three points:
(1) Collapse the target classes to just two classes, namely, presence or absence of heart disease. As seen at https://archive.ics.uci.edu/dataset/45/heart+disease, "Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0)." So, collapse target values 1,2,3,4 to just "1".
(2) Categorical variables should be properly encoded for use with Neural Network classifiers. For example, use one hot encoding on the categorical variables. Again, info at https://archive.ics.uci.edu/dataset/45/heart+disease indicates which variables are categorical.
(3) This is a small dataset, so the choice of the validation and test data will affect the bias and variance of the observed accuracy. Using k-fold cross-validation, it is easy to observe accuracies over 80% for the two-class problem of presence vs absence of heart disease.
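To put a rough number on point (3): if each test prediction is treated as an independent Bernoulli trial, the standard error of an accuracy p measured on n samples is approximately sqrt(p(1-p)/n). The sketch below applies that approximation to a dataset of this size. The 85% accuracy is only an illustrative assumption, and the independence assumption is a simplification.

```matlab
% Rough binomial standard error of an accuracy estimate.
% Assumption: predictions are independent Bernoulli trials (a simplification).
nTotal = 303;                         % records in the dataset (from the question)
k      = 10;                          % number of cross-validation folds
p      = 0.85;                        % illustrative accuracy, assumed for this sketch
nFold  = floor(nTotal / k);           % about 30 samples in one held-out fold
seFold = sqrt(p * (1 - p) / nFold);   % spread of a single-fold accuracy
seAll  = sqrt(p * (1 - p) / nTotal);  % spread when all folds are pooled
fprintf('single fold: +/- %.1f%%, pooled: +/- %.1f%%\n', 100*seFold, 100*seAll);
```

Under these assumptions, an accuracy measured on one fold of about 30 samples has a standard error of roughly 6.5 percentage points, against roughly 2 points for the estimate pooled over all folds. That is why a single small test split can look much better or worse than the cross-validated figure.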
Implementation with (1) fitcnet & Classification Learner app, OR (2) patternnet
For classification of this tabular data, either fitcnet or patternnet could be used. See the accepted answer at https://www.mathworks.com/matlabcentral/answers/834428-difference-between-fitcnet-and-patternnet-functions for some similarities and differences. As mentioned in that answer, "Finally, note that fitcnet is available in the Classification Learner app, which facilitates easy comparison of multiple machine learning models for tabular classification problems."
(1) fitcnet and Classification Learner app
Let's first try easy comparison of multiple machine learning models using Classification Learner. First, prepare the data for loading into the Classification Learner app. This little script starts from the data you attached, adds variable names and categorical variable designation, imputes missing values using the mode, and collapses the target to just two classes.
% Load data
data = readtable('processed.cleveland.csv');
% Add Variable Names, info from https://archive.ics.uci.edu/dataset/45/heart+disease
data.Properties.VariableNames = {'age', 'sex', 'cp', 'trestbps', 'chol', ...
'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'};
% Convert variables to categorical
iscat=[0 1 1 0 0 1 1 0 1 0 1 0 1 0]; % Leave target as double for now
for i=1:width(data)
    if (iscat(i)==1)
        data.(i) = categorical(data.(i));
    end
end
% Replace missing with the mode of that variable.
for i=1:width(data)
    if (sum(ismissing(data.(i)))) % If true, then this var has some missing values
        % replace missing with mode of the variable
        data(:,i) = fillmissing(data(:,i),'constant',table2array(mode(data(:,i))));
    end
end
% Collapse the targets to 0 or 1, and convert to categorical.
data.(14) = categorical( double( data.(14)>=1 ) );
head(data)
age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  num
___  ___  __  ________  ____  ___  _______  _______  _____  _______  _____  __  ____  ___
63   1    1   145       233   1    2        150      0      2.3      3      0   6     0
67   1    4   160       286   0    2        108      1      1.5      2      3   3     1
67   1    4   120       229   0    2        129      1      2.6      2      2   7     1
37   1    3   130       250   0    0        187      0      3.5      3      0   3     0
41   0    2   130       204   0    2        172      0      1.4      1      0   3     0
56   1    2   120       236   0    0        178      0      0.8      1      0   3     0
62   0    4   140       268   0    2        160      0      3.6      3      2   3     1
57   0    4   120       354   0    0        163      1      0.6      1      0   3     0
Now, load the data into the Classification Learner app, and choose 10-fold cross-validation.
Once in the app, choose "All" models, "Optimizable Neural Network", and "Optimizable Ensemble" from the models gallery. After training those models, many of them achieve over 80% accuracy. The exact results will vary, depending on the cross-validation partition and optimization results. In my run, the Optimizable Neural Network achieved 85.5% accuracy on the validation data (this is 10-fold cross-validation accuracy). In this case, it turns out that the optimization process chose a neural network with just one layer, and one node in that layer. So, a very simple neural network can do pretty well for this data.
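For anyone who prefers the command line to the app, a minimal sketch of a comparable 10-fold run with fitcnet might look like the following. The helper name cvNetAccuracy is hypothetical (it is not a toolbox function). It uses fitcnet's default architecture, not the optimized one found by the app, so the accuracy it reports will not match the 85.5% above exactly.

```matlab
function acc = cvNetAccuracy(tbl, responseName, k)
% cvNetAccuracy  k-fold cross-validated accuracy of a default fitcnet model.
%   Hypothetical helper for illustration; not part of any toolbox.
%   tbl          - table of predictors plus the response variable
%   responseName - name of the response variable in tbl
%   k            - number of cross-validation folds
    cvMdl = fitcnet(tbl, responseName, ...
        "Standardize", true, ...   % z-score the numeric predictors
        "KFold", k);               % returns a cross-validated (partitioned) model
    acc = 1 - kfoldLoss(cvMdl);    % default kfoldLoss is the misclassification rate
end
```

With the table prepared by the script above, a call such as rng(0); acc = cvNetAccuracy(data, "num", 10) would give one cross-validated estimate. Changing the seed changes the partition and therefore the result.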
Next, export the best performing neural network to the workspace, using the "Export Model" option. Inside the model, the expanded predictor names can be seen (look at trainedModel.ClassificationNeuralNetwork.ExpandedPredictorNames) indicating that fitcnet has automatically done the one hot encoding, based on which variables are categorical.
>> trainedModel.ClassificationNeuralNetwork.ExpandedPredictorNames
ans =
1×25 cell array
Columns 1 through 8
{'age'} {'sex == 0'} {'sex == 1'} {'cp == 1'} {'cp == 2'} {'cp == 3'} {'cp == 4'} {'trestbps'}
Columns 9 through 15
{'chol'} {'fbs == 0'} {'fbs == 1'} {'restecg == 0'} {'restecg == 1'} {'restecg == 2'} {'thalach'}
Columns 16 through 22
{'exang == 0'} {'exang == 1'} {'oldpeak'} {'slope == 1'} {'slope == 2'} {'slope == 3'} {'ca'}
Columns 23 through 25
{'thal == 3'} {'thal == 6'} {'thal == 7'}
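Workflows that need a plain numeric matrix (patternnet, for instance) do not get this expansion automatically, but it can be reproduced by hand. The sketch below is one way to do it, using onehotencode from Deep Learning Toolbox. The helper name prepareBinaryOneHot is hypothetical, and it assumes the response is still numeric (0 to 4) and the categorical predictors have already been converted with categorical().

```matlab
function [X, y] = prepareBinaryOneHot(tbl, responseName)
% prepareBinaryOneHot  Collapse the target to two classes and one-hot encode
%   the categorical predictors. Hypothetical helper for illustration.
%   Assumes tbl.(responseName) is numeric, with 0 meaning "no disease".
    % Collapse severity levels 1-4 into a single "disease present" class.
    y = categorical(double(tbl.(responseName) >= 1));
    % Expand each categorical predictor into 0/1 indicator columns;
    % numeric predictors are copied through unchanged.
    predictors = removevars(tbl, responseName);
    X = zeros(height(tbl), 0);
    for i = 1:width(predictors)
        v = predictors.(i);
        if iscategorical(v)
            X = [X, onehotencode(v, 2)]; %#ok<AGROW>
        else
            X = [X, double(v)]; %#ok<AGROW>
        end
    end
end
```

Applied to the table as it stands before the target is collapsed in the script above, this should yield the same 25 predictor columns as the ExpandedPredictorNames listing, provided no missing values remain in the categorical variables.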
(2) patternnet
Similar results can be obtained using patternnet, but there will be some differences from fitcnet due to the different training algorithm. Remember to collapse the target classes to just two, and one-hot encode the categorical variables. Also, given the train/validation/test split used by patternnet training, one will generally be looking at the test accuracy, which is roughly similar to looking at the accuracy of one fold in a cross-validation scheme. Due to the smaller sample size, the per-fold accuracy will have much higher variance than the k-fold cross-validation accuracy which is averaged across all folds. I observed "per-fold" test accuracies ranging from a low around 73% to a high around 90%, with the average around 82-83% using a simple patternnet and the default training algorithm (no hyperparameter optimization). In the fitcnet case above, the app doesn't report the per-fold validation accuracy, but the per-fold accuracy will similarly be in a relatively wide range, with the average across all folds being around 85% after hyperparameter optimization (we observed 85.5% above) for a simple neural network.
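One way to see the per-fold spread described above is to wrap patternnet in an explicit k-fold loop. The following is only a sketch under stated assumptions: the function name patternnetKFoldAccuracy is hypothetical, X is assumed to be a numeric matrix with the categorical variables already one-hot encoded, and the 85/15 train/validation split inside each fold is an arbitrary choice that keeps early stopping available while reserving the held-out fold purely for testing.

```matlab
function acc = patternnetKFoldAccuracy(X, labels, k, hiddenSize)
% patternnetKFoldAccuracy  Per-fold test accuracy of a simple patternnet.
%   Hypothetical helper for illustration; not part of any toolbox.
%   X          - N-by-P numeric predictors (categoricals already one-hot encoded)
%   labels     - N-by-1 categorical class labels
%   k          - number of folds
%   hiddenSize - hidden layer size(s) passed to patternnet
    T   = onehotencode(labels, 2)';          % classes-by-N target matrix
    cvp = cvpartition(labels, 'KFold', k);   % stratified folds
    acc = zeros(k, 1);
    for f = 1:k
        trIdx = training(cvp, f);
        teIdx = test(cvp, f);
        net = patternnet(hiddenSize);        % default training function
        net.trainParam.showWindow = false;   % suppress the training GUI
        % Split the training folds into train/validation only; the held-out
        % fold is never seen during training.
        net.divideParam.trainRatio = 0.85;
        net.divideParam.valRatio   = 0.15;
        net.divideParam.testRatio  = 0;
        net = train(net, X(trIdx, :)', T(:, trIdx));
        scores = net(X(teIdx, :)');
        [~, pred]  = max(scores, [], 1);       % predicted class index
        [~, truth] = max(T(:, teIdx), [], 1);  % true class index
        acc(f) = mean(pred == truth);
    end
end
```

Calling acc = patternnetKFoldAccuracy(X, y, 10, 10) returns one accuracy per fold, so min(acc), max(acc) and mean(acc) can be compared directly with the range quoted above. The numbers will vary from run to run unless the random seed is fixed with rng.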
If this answer helps you, please remember to accept the answer.

Release

R2022a