Training a neural net on the entire dataset after model selection with K-fold cross-validation: how can I overcome overfitting if I don't have a validation and test set?

Hi everyone,
I am working on artificial neural networks for application in movement analysis. I started using neural networks this year and, following courses and posts on MATLAB Answers and the MATLAB community, I tried to implement a K-fold CV procedure to develop a model for movement classification.
SOME CONSIDERATIONS: My dataset is composed of 19 subjects repeating a movement pattern 20 times each. The movement pattern consists of 5 sequential phases, divided into 100 ms observations from 6 sensors. To obtain 3 independent TRAINING, VALIDATION and TEST SETS, I have to place all observations from a given subject inside a single group.
I implemented the overall procedure, which I include at the end of this post. Now I have 2 questions:
1 - Looking at the useful examples from Prof. Greg Heath, I saw that R^2 is often used as a performance measure to evaluate models. However, I also read that it is typically recommended for regression problems. Can it also be used for classification?
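For classification, a complementary measure is the error rate, which can be computed directly from the one-hot targets and the network outputs. A minimal sketch (using `net`, `x` and `targets` as defined in the code below):

```matlab
% Sketch: percent classification error from one-hot targets and net outputs
y = net(x);                               % network outputs [O x N]
[~, predicted] = max(y, [], 1);           % predicted class per observation
[~, actual]    = max(targets, [], 1);     % true class per observation
pctErr = 100 * mean(predicted ~= actual); % percent misclassified
```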
2 - After I get the results from my 10x10 iteration over initial weights and hidden-neuron models, should I use the collected information to train the 'optimal' model on the entire dataset? Or should I simply take the best model found, even though it was trained without the N°val+N°tst samples? I ask because I already tried to train the selected optimal model on all my data, but of course if I don't specify a validation set, early stopping does not work and I fall into overfitting.
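To make the second question concrete, here is a minimal sketch of retraining on all the data; `Hbest` and `epochsBest` are hypothetical placeholders (not variables from the code below) standing for the hidden-layer size and a training length selected during cross-validation:

```matlab
% Sketch: retrain the selected model on the whole dataset.
% With divideFcn = 'dividetrain' all samples go to training, so the
% validation-based early stopping cannot trigger and the net overfits.
net = patternnet(Hbest, 'trainlm');
net.performFcn = 'mse';
net.divideFcn  = 'dividetrain';       % no validation/test split
% One conceivable workaround: cap the epochs at a value observed
% during cross-validation (epochsBest is hypothetical here).
net.trainParam.epochs = epochsBest;
net = train(net, x, targets);
```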
Thanks in advance for every possible help.
Mirko
%% MEMORY CLEAN
clear all; close all; clc
%% LOAD DATASET
datasetfolder='DATA/CLASSIFIER/Table_Classifier';
load(fullfile(cd,datasetfolder,'Table_Classifier.mat'));% ------------------- Load Dataset
x=table2array(DataSET(:,1:end-1))';% ---------------------------------------- Input [IxN°obs.] 252x42563
tc=table2array(DataSET(:,end));% -------------------------------------------- Label Cell Array [1xN°obs.]
targets=[strcmp(tc,'Phase1'),...% ------------------------------------------- Targets [OxN°obs.] 5x42563
strcmp(tc,'Phase2'),...
strcmp(tc,'Phase3'),...
strcmp(tc,'Phase4'),...
strcmp(tc,'Phase5')]';
%% DIMENSIONALITY OF THE DATASET
[I,N]=size(x);
[O,~]=size(targets);
%% DEFINITION OF FOLDS FOR XVALIDATION
% In my case each fold should include all observations from all exercises of a specific subject; DIVISOR is a
% label that indicates the subject of each observation.
Sbj=unique(DIVISOR);
loop=0;
% Choice of the type of validation
while loop==0
flag=input(['What validation model you would like to implement?\n',...
' 1 - 5 folds\n 2 - 10 folds\n 3 - LOSOCV\n\n']);
switch flag
case 1
folds = 6;
loop = 1;
case 2
folds = 11;
loop = 1;
case 3
folds = length(Sbj);% ------------------------------------------------------- fixed: the variable is Sbj, not SBJ
loop = 1;
otherwise
loop = 0;
end
end
Based on the number of folds defined above, I created a cell array 'subgroup' (1 x folds) containing the subject labels randomized into that many groups. Note that if I choose 5-fold cross-validation, subgroup will have 5+1 elements (one element is held out as the test set), e.g.:
  • Subgroup {1}: Sbj1, Sbj7, Sbj5
  • Subgroup {2}: Sbj2, Sbj4
  • Subgroup {3}: Sbj3, Sbj6
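A minimal sketch of how such a partition could be built (the grouping shown above is purely illustrative; `Sbj` and `folds` are as defined in the code):

```matlab
% Randomly partition the subject labels into 'folds' groups of
% (approximately) equal size
idx   = randperm(length(Sbj));                  % shuffle the subjects
edges = round(linspace(0, length(Sbj), folds+1));
subgroup = cell(1, folds);
for g = 1:folds
    subgroup{g} = Sbj(idx(edges(g)+1:edges(g+1)));
end
```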
Starting from Prof. Greg Heath's double-loop approach, I implemented an expanded approach in which:
  1. each element of subgroup (i.e., each fold) is in turn considered as the test set;
  2. the remaining elements are used for k-fold cross-validation;
  3. the validation loop is iterated over 10 random initializations of the weights and 10 candidate numbers of hidden neurons.
%% IDENTIFICATION OF THE AVERAGE NTRN
% Changing the folds used for test and validation implicitly changes the number of training samples
% used to calculate the N° of hidden neurons, so I evaluate the average N° of training samples over all possible selections.
Ntr_av=0;%------------------------------------------------------------------- Average N°trn
for t=1:folds%--------------------------------------------------------------- For each test choice
logicalindext=cellfun(@(x)contains(DIVISOR,x),...
subgroup{t},'un',0);
for v=1:folds%----------------------------------------------------------- For each validation choice
if t~=v
logicalindexv=cellfun(@(x)contains(DIVISOR,x),subgroup{v},'un',0);
TrainSET=find(~any([any(...%------------------------------------- Train indices
horzcat(logicalindext{:}),2),any(...
horzcat(logicalindexv{:}),2)],2)==1);
Ntr_av=Ntr_av+length(TrainSET);
end
end
end
Ntr_av=Ntr_av/((folds-1)*folds);%-------------------------------------------- Average N°trn
Hmin=10;%-------------------------------------------------------------------- Minimum Hidden nodes number
Hub_av=(Ntr_av*O-O)/(I+O+1);%------------------------------------------------ Upper limit for N° Hidden neuron
Hmax_av = round(Hub_av/10);%------------------------------------------------- Max N° hidden neurons (<<<Hub_av for robust training)
dn=floor((Hmax_av-Hmin)/9);%------------------------------------------------- Step dn
Neurons=(0:9).*dn+Hmin;%----------------------------------------------------- 10 candidate hidden-layer sizes, spaced by dn
% Hidden neurons
MSE00 = mean(var(targets',1));%---------------------------------------------- Naive Constant model reference on all dataset
%% NEURAL NETWORK MODEL
for t=1:folds%--------------------------------------------------------------- For each fold t
logicalindext=cellfun(@(x)contains(DIVISOR,x),...%----------------------- I define the current fold as TEST SET, finding all the indices corresponding
subgroup{t},'un',0); % to the labels in subgroup{t}
ITST=find(any(horzcat(logicalindext{:}),2)==1);
MSE00tst = mean(var(targets(:,ITST)',1));%------------------------------- Naive Constant model reference on the Test SET
IVAL=cell(1,folds-1);%--------------------------------------------------- Declaration of folds-1 pairs of possible training
ITRN=cell(1,folds-1);%--------------------------------------------------- and validation indices and their respective MSE00
MSE00val=zeros(1,folds-1);
MSE00trn=zeros(1,folds-1);
count=1;
for v=1:folds%----------------------------------------------------------- For each fold
if t~=v%------------------------------------------------------------- different from Test SET t
logicalindexv=cellfun(@(x)contains(DIVISOR,x),subgroup{v},'un',0);
IVAL{1,count}=find(any(...%-------------------------------------- I identify the indices of validation and training
horzcat(logicalindexv{:}),2)==1);
ITRN{1,count}=find(~any([any(...
horzcat(logicalindext{:}),2),any(...
horzcat(logicalindexv{:}),2)],2)==1);
MSE00val(1,count)=mean(var(targets(:,IVAL{1,count})',1));%------- And I calculate the MSE00 references (note:
MSE00trn(1,count)=mean(var(targets(:,ITRN{1,count})',1)); % val uses IVAL and trn uses ITRN; they were swapped)
count=count+1;
end
end
S=cell(1,10);%----------------------------------------------------------- Across the validation loops I have to use the same initial weights
rng(0);%----------------------------------------------------------------- Default random state
for s=1:10
S{s}=rng;%----------------------------------------------------------- I save 10 different random states to be restored across the
rand; % validation loops (one per initial-weight iteration)
end
rng(0);%----------------------------------------------------------------- Default random state
% Performance measures
perf_xentrval=zeros(10,10);
perf_xentrtrn=zeros(10,10);
perf_xentrtst=zeros(10,10);
perf_mseval=zeros(10,10);
perf_msetrn=zeros(10,10);
perf_msetst=zeros(10,10);
perf_R2=zeros(10,10);
perf_R2trn=zeros(10,10);
perf_R2tst=zeros(10,10);
perf_R2val=zeros(10,10);
for n=1:10%-------------------------------------------------------------- For each model of hidden neurons
H=Neurons(n);%------------------------------------------------------- I use the model defined previously
parfor i=1:10%------------------------------------------------------- For each iteration of initial random weight
fprintf(['Validation for Model with: ',num2str(H),' neurons and randomization ',num2str(i),'\n']);
tic
[val_xentrval,val_xentrtrn,val_xentrtst,val_mseval,val_msetrn,val_msetst,val_R2,val_R2trn,val_R2val,val_R2tst]=ValidationLoops...
(S{i},MSE00,MSE00trn,MSE00tst,MSE00val,folds,x,targets,H,ITRN,IVAL,ITST);
toc
The function ValidationLoops was created to overcome parfor problems and errors in multiprocessing commands:
function [val_xentrval,val_xentrtrn,val_xentrtst,val_mseval,val_msetrn,val_msetst,val_R2,val_R2trn,val_R2val,val_R2tst]...
=ValidationLoops(S,MSE00,MSE00trn,MSE00tst,MSE00val,folds,x,targets,H,ITRN,IVAL,ITST)
% Validation performance Variables
val_xentrval = zeros(1,folds-1);
val_xentrtrn = zeros(1,folds-1);
val_xentrtst = zeros(1,folds-1);
val_mseval = zeros(1,folds-1);
val_msetrn = zeros(1,folds-1);
val_msetst = zeros(1,folds-1);
val_R2 = zeros(1,folds-1);
val_R2trn = zeros(1,folds-1);
val_R2val = zeros(1,folds-1);
val_R2tst = zeros(1,folds-1);
for v=1:folds-1%---------------------------------------------- For each validation fold
net=patternnet(H,'trainlm');%----------------------------- Define the net
net.performFcn = 'mse';%---------------------------------- Loss function
net.divideFcn='divideind';%------------------------------- Setting TRAINING TEST AND VALIDATION
net.divideParam.trainInd=ITRN{v}; % TrainingSET
net.divideParam.valInd=IVAL{v}; % ValidationSET
net.divideParam.testInd=ITST; % TestSET
rng(S); % Reset initial weights: across the validation loops I evaluate the SAME MODEL in terms
% of neurons and initial weights
net=configure(net,x,targets);
[net,tr,y,e]=train(net,x,targets);
% Save Performance variables
val_xentrval(v) = crossentropy(net,targets(:,IVAL{v}),...%------- Crossentropy
y(:,IVAL{v}));
val_xentrtrn(v) = crossentropy(net,targets(:,ITRN{v}),...
y(:,ITRN{v}));
val_xentrtst(v) = crossentropy(net,targets(:,ITST),...
y(:,ITST));
val_mseval(v) = tr.best_vperf;%---------------------------------- MSE
val_msetrn(v) = tr.best_perf;
val_msetst(v) = tr.best_tperf;
val_R2(v) = 1 - mse(e)/MSE00;%----------------------------------- R2
val_R2trn(v) = 1 - tr.best_perf/MSE00trn(v);
val_R2val(v) = 1 - tr.best_vperf/MSE00val(v);
val_R2tst(v) = 1 - tr.best_tperf/MSE00tst;
end
After the validation, I save the results of the model with the n-th hidden-neuron count and the i-th random initialization of the initial weights as the mean of the results obtained over the validation loops:
perf_xentrval(n,i)=mean(val_xentrval);
perf_xentrtrn(n,i)=mean(val_xentrtrn);
perf_xentrtst(n,i)=mean(val_xentrtst);
perf_mseval(n,i)=mean(val_mseval);
perf_msetrn(n,i)=mean(val_msetrn);
perf_msetst(n,i)=mean(val_msetst);
perf_R2(n,i)=mean(val_R2);
perf_R2trn(n,i)=mean(val_R2trn);
perf_R2val(n,i)=mean(val_R2val);
perf_R2tst(n,i)=mean(val_R2tst);
end
end
% This process is repeated for each choice of different Test Set
% A struct array indexed by t stores the same results without building
% variable names through eval
Test_model(t).data.xentrval = perf_xentrval;
Test_model(t).data.xentrtrn = perf_xentrtrn;
Test_model(t).data.xentrtst = perf_xentrtst;
Test_model(t).data.mseval   = perf_mseval;
Test_model(t).data.msetrn   = perf_msetrn;
Test_model(t).data.msetst   = perf_msetst;
Test_model(t).data.R2       = perf_R2;
Test_model(t).data.R2val    = perf_R2val;
Test_model(t).data.R2trn    = perf_R2trn;
Test_model(t).data.R2tst    = perf_R2tst;
Test_model(t).HiddenNeurons = Neurons;
Test_model(t).SET.Sbj       = subgroup{t};
Test_model(t).SET.Ind       = ITST;
end
delete(gcp('nocreate'))
