Training a neural net on the entire dataset after model selection with K-fold cross-validation: how can I overcome overfitting if I don't have a validation and test set?

Hi everyone,
I am working on artificial neural networks for application in movement analysis. I started using neural networks this year and, following courses and posts on MATLAB Answers and the MATLAB community, I tried to implement a K-fold CV procedure to develop a model for movement classification.
SOME CONSIDERATIONS: My dataset is composed of 19 subjects repeating a movement pattern 20 times each. The movement pattern consists of 5 sequential phases, divided into 100 ms observations from 6 sensors. To obtain 3 independent TRAINING, VALIDATION and TEST SETS, I have to place all observations from a given subject inside a single group.
I implemented the overall procedure, which I include at the end of this post. Now I have 2 questions:
1 - Looking at the useful examples from Prof. Greg Heath, I saw that R^2 is often used as a performance measure to evaluate models. However, I also read that it is typically recommended for regression problems. Can it also be used for classification?
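For classification, a complementary measure is the error rate, which can be computed directly from the one-hot targets and the network outputs. A minimal sketch (using `net`, `x` and `targets` as defined in the code below):

```matlab
% Sketch: percent classification error from one-hot targets and net outputs
y = net(x);                               % network outputs [O x N]
[~, predicted] = max(y, [], 1);           % predicted class per observation
[~, actual]    = max(targets, [], 1);     % true class per observation
pctErr = 100 * mean(predicted ~= actual); % percent misclassified
```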
2 - After I get the results from my 10x10 iteration over initial weights and hidden-neuron models, should I use the collected information to train the 'optimal' model on the entire dataset? Or should I simply take the best model found, even though it was trained without the N°val+N°tst samples? I ask because I already tried to train the selected optimal model on all my data, but of course if I don't specify a validation set, early stopping does not work and I fall into overfitting.
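To make the second question concrete, here is a minimal sketch of retraining on all the data; `Hbest` and `epochsBest` are hypothetical placeholders (not variables from the code below) standing for the hidden-layer size and a training length selected during cross-validation:

```matlab
% Sketch: retrain the selected model on the whole dataset.
% With divideFcn = 'dividetrain' all samples go to training, so the
% validation-based early stopping cannot trigger and the net overfits.
net = patternnet(Hbest, 'trainlm');
net.performFcn = 'mse';
net.divideFcn  = 'dividetrain';       % no validation/test split
% One conceivable workaround: cap the epochs at a value observed
% during cross-validation (epochsBest is hypothetical here).
net.trainParam.epochs = epochsBest;
net = train(net, x, targets);
```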
Thanks in advance for every possible help.
Mirko
%% MEMORY CLEAN
clear all; close all; clc
%% LOAD DATASET
datasetfolder='DATA/CLASSIFIER/Table_Classifier';
load(fullfile(cd,datasetfolder,'Table_Classifier.mat'));% ------------------- Load Dataset
x=table2array(DataSET(:,1:end-1))';% ---------------------------------------- Input [IxN°obs.] 252x42563
tc=table2array(DataSET(:,end));% -------------------------------------------- Label Cell Array [1xN°obs.]
targets=[strcmp(tc,'Phase1'),...% ------------------------------------------- Targets [OxN°obs.] 5x42563
strcmp(tc,'Phase2'),...
strcmp(tc,'Phase3'),...
strcmp(tc,'Phase4'),...
strcmp(tc,'Phase5')]';
%% DIMENSIONALITY OF THE DATASET
[I,N]=size(x);
[O,~]=size(targets);
%% DEFINITION OF FOLDS FOR XVALIDATION
% In my case each fold should include all observations from all exercises of a specific subject; DIVISOR is a
% label that indicates the subject of each observation.
Sbj=unique(DIVISOR);
loop=0;
% Choice of the type of validation
while loop==0
flag=input(['What validation model you would like to implement?\n',...
' 1 - 5 folds\n 2 - 10 folds\n 3 - LOSOCV\n\n']);
switch flag
case 1
folds = 6;
loop = 1;
case 2
folds = 11;
loop = 1;
case 3
folds = length(Sbj);% ------------------------------------------------------- fixed: the variable is Sbj, not SBJ
loop = 1;
otherwise
loop = 0;
end
end
Based on the number of folds defined above, I created a cell array 'subgroup' (1 x folds) containing the subject labels randomized into that many groups. Note that if I choose 5-fold cross-validation, subgroup will have 5+1 elements (one element is held out as the test set), e.g.:
  • Subgroup {1}: Sbj1, Sbj7, Sbj5
  • Subgroup {2}: Sbj2, Sbj4
  • Subgroup {3}: Sbj3, Sbj6
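A minimal sketch of how such a partition could be built (the grouping shown above is purely illustrative; `Sbj` and `folds` are as defined in the code):

```matlab
% Randomly partition the subject labels into 'folds' groups of
% (approximately) equal size
idx   = randperm(length(Sbj));                  % shuffle the subjects
edges = round(linspace(0, length(Sbj), folds+1));
subgroup = cell(1, folds);
for g = 1:folds
    subgroup{g} = Sbj(idx(edges(g)+1:edges(g+1)));
end
```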
Starting from Prof. Greg Heath's double-loop approach, I implemented an expanded approach in which:
  1. each element of subgroup (i.e., each fold) is in turn considered as the test set;
  2. the remaining elements are used for k-fold cross-validation;
  3. the validation loop is iterated over 10 random initializations of the weights and 10 candidate numbers of hidden neurons.
%% IDENTIFICATION OF THE AVERAGE NTRN
% Changing the folds used for test and validation implicitly changes the number of training samples
% used to calculate the N° of hidden neurons, so I evaluate the average N° of training samples over all possible selections.
Ntr_av=0;%------------------------------------------------------------------- Average N°trn
for t=1:folds%--------------------------------------------------------------- For each test choice
logicalindext=cellfun(@(x)contains(DIVISOR,x),...
subgroup{t},'un',0);
for v=1:folds%----------------------------------------------------------- For each validation choice
if t~=v
logicalindexv=cellfun(@(x)contains(DIVISOR,x),subgroup{v},'un',0);
TrainSET=find(~any([any(...%------------------------------------- Train indices
horzcat(logicalindext{:}),2),any(...
horzcat(logicalindexv{:}),2)],2)==1);
Ntr_av=Ntr_av+length(TrainSET);
end
end
end
Ntr_av=Ntr_av/((folds-1)*folds);%-------------------------------------------- Average N°trn
Hmin=10;%-------------------------------------------------------------------- Minimum Hidden nodes number
Hub_av=(Ntr_av*O-O)/(I+O+1);%------------------------------------------------ Upper limit for N° Hidden neuron
Hmax_av = round(Hub_av/10);%------------------------------------------------- Max N° hidden neurons (<<<Hub_av for robust training)
dn=floor((Hmax_av-Hmin)/9);%------------------------------------------------- Step dn
Neurons=(0:9).*dn+Hmin;%----------------------------------------------------- 10 candidate hidden-layer sizes, spaced by dn
% Hidden neurons
MSE00 = mean(var(targets',1));%---------------------------------------------- Naive Constant model reference on all dataset
%% NEURAL NETWORK MODEL
for t=1:folds%--------------------------------------------------------------- For each fold t
logicalindext=cellfun(@(x)contains(DIVISOR,x),...%----------------------- I define the current fold as TEST SET, finding all the indices corresponding
subgroup{t},'un',0); % to the labels in subgroup{t}
ITST=find(any(horzcat(logicalindext{:}),2)==1);
MSE00tst = mean(var(targets(:,ITST)',1));%------------------------------- Naive Constant model reference on the Test SET
IVAL=cell(1,folds-1);%--------------------------------------------------- Declaration of folds-1 pairs of possible training
ITRN=cell(1,folds-1);%--------------------------------------------------- and validation indices and their respective MSE00
MSE00val=zeros(1,folds-1);
MSE00trn=zeros(1,folds-1);
count=1;
for v=1:folds%----------------------------------------------------------- For each fold
if t~=v%------------------------------------------------------------- different from Test SET t
logicalindexv=cellfun(@(x)contains(DIVISOR,x),subgroup{v},'un',0);
IVAL{1,count}=find(any(...%-------------------------------------- I identify the indices of validation and training
horzcat(logicalindexv{:}),2)==1);
ITRN{1,count}=find(~any([any(...
horzcat(logicalindext{:}),2),any(...
horzcat(logicalindexv{:}),2)],2)==1);
MSE00val(1,count)=mean(var(targets(:,IVAL{1,count})',1));%------- And I calculate the MSE00 references (note:
MSE00trn(1,count)=mean(var(targets(:,ITRN{1,count})',1)); % val uses IVAL and trn uses ITRN; they were swapped)
count=count+1;
end
end
S=cell(1,10);%----------------------------------------------------------- Across the validation loops I have to use the same initial weights
rng(0);%----------------------------------------------------------------- Default random state
for s=1:10
S{s}=rng;%----------------------------------------------------------- I save 10 different random states to be restored across the
rand; % validation loops (one per initial-weight iteration)
end
rng(0);%----------------------------------------------------------------- Default random state
% Performance measures
perf_xentrval=zeros(10,10);
perf_xentrtrn=zeros(10,10);
perf_xentrtst=zeros(10,10);
perf_mseval=zeros(10,10);
perf_msetrn=zeros(10,10);
perf_msetst=zeros(10,10);
perf_R2=zeros(10,10);
perf_R2trn=zeros(10,10);
perf_R2tst=zeros(10,10);
perf_R2val=zeros(10,10);
for n=1:10%-------------------------------------------------------------- For each model of hidden neurons
H=Neurons(n);%------------------------------------------------------- I use the model defined previously
parfor i=1:10%------------------------------------------------------- For each iteration of initial random weight
fprintf(['Validation for Model with: ',num2str(H),' neurons and randomization ',num2str(i),'\n']);
tic
[val_xentrval,val_xentrtrn,val_xentrtst,val_mseval,val_msetrn,val_msetst,val_R2,val_R2trn,val_R2val,val_R2tst]=ValidationLoops...
(S{i},MSE00,MSE00trn,MSE00tst,MSE00val,folds,x,targets,H,ITRN,IVAL,ITST);
toc
The function ValidationLoops was created to overcome parfor problems and errors in multiprocessing commands:
function [val_xentrval,val_xentrtrn,val_xentrtst,val_mseval,val_msetrn,val_msetst,val_R2,val_R2trn,val_R2val,val_R2tst]...
=ValidationLoops(S,MSE00,MSE00trn,MSE00tst,MSE00val,folds,x,targets,H,ITRN,IVAL,ITST)
% Validation performance Variables
val_xentrval = zeros(1,folds-1);
val_xentrtrn = zeros(1,folds-1);
val_xentrtst = zeros(1,folds-1);
val_mseval = zeros(1,folds-1);
val_msetrn = zeros(1,folds-1);
val_msetst = zeros(1,folds-1);
val_R2 = zeros(1,folds-1);
val_R2trn = zeros(1,folds-1);
val_R2val = zeros(1,folds-1);
val_R2tst = zeros(1,folds-1);
for v=1:folds-1%---------------------------------------------- For each validation fold
net=patternnet(H,'trainlm');%----------------------------- Define the net
net.performFcn = 'mse';%---------------------------------- Loss function
net.divideFcn='divideind';%------------------------------- Setting TRAINING TEST AND VALIDATION
net.divideParam.trainInd=ITRN{v}; % TrainingSET
net.divideParam.valInd=IVAL{v}; % ValidationSET
net.divideParam.testInd=ITST; % TestSET
rng(S); % Reset initial weights: across the validation loops I evaluate the SAME MODEL in terms
% of neurons and initial weights
net=configure(net,x,targets);
[net,tr,y,e]=train(net,x,targets);
% Save Performance variables
val_xentrval(v) = crossentropy(net,targets(:,IVAL{v}),...%------- Crossentropy
y(:,IVAL{v}));
val_xentrtrn(v) = crossentropy(net,targets(:,ITRN{v}),...
y(:,ITRN{v}));
val_xentrtst(v) = crossentropy(net,targets(:,ITST),...
y(:,ITST));
val_mseval(v) = tr.best_vperf;%---------------------------------- MSE
val_msetrn(v) = tr.best_perf;
val_msetst(v) = tr.best_tperf;
val_R2(v) = 1 - mse(e)/MSE00;%----------------------------------- R2
val_R2trn(v) = 1 - tr.best_perf/MSE00trn(v);
val_R2val(v) = 1 - tr.best_vperf/MSE00val(v);
val_R2tst(v) = 1 - tr.best_tperf/MSE00tst;
end
After the validation, I save the results of the model with the n-th hidden-neuron count and the i-th random initialization of the initial weights as the mean of the results obtained over the validation loops:
perf_xentrval(n,i)=mean(val_xentrval);
perf_xentrtrn(n,i)=mean(val_xentrtrn);
perf_xentrtst(n,i)=mean(val_xentrtst);
perf_mseval(n,i)=mean(val_mseval);
perf_msetrn(n,i)=mean(val_msetrn);
perf_msetst(n,i)=mean(val_msetst);
perf_R2(n,i)=mean(val_R2);
perf_R2trn(n,i)=mean(val_R2trn);
perf_R2val(n,i)=mean(val_R2val);
perf_R2tst(n,i)=mean(val_R2tst);
end
end
% This process is repeated for each choice of different Test Set
% A struct array indexed by t stores the same results without building
% variable names through eval
Test_model(t).data.xentrval = perf_xentrval;
Test_model(t).data.xentrtrn = perf_xentrtrn;
Test_model(t).data.xentrtst = perf_xentrtst;
Test_model(t).data.mseval   = perf_mseval;
Test_model(t).data.msetrn   = perf_msetrn;
Test_model(t).data.msetst   = perf_msetst;
Test_model(t).data.R2       = perf_R2;
Test_model(t).data.R2val    = perf_R2val;
Test_model(t).data.R2trn    = perf_R2trn;
Test_model(t).data.R2tst    = perf_R2tst;
Test_model(t).HiddenNeurons = Neurons;
Test_model(t).SET.Sbj       = subgroup{t};
Test_model(t).SET.Ind       = ITST;
end
delete(gcp('nocreate'))
