rmoutliers and graph of the cleaned data
1 visualización (últimos 30 días)
Mostrar comentarios más antiguos
Hi everybody , I am having a problem with my data analysis since after the running of rmoutliers function I don't get the right graph.
I am working data of pressures (at the input and output of a pipe) and flowrate.
Find attached a txt file compressed
It contains a table of 5 columns
column 1: pressure at pipe start [bars]
column 2: acoustic noise at start [standard deviation]
column 3: pressure at pipe end (10km from start) [bars]
column 4: acoustic noise at pipe end[standard deviation]
column 5: flow rate [m3/hour]
sample distance: 10s
This is the code I wrote.
readtable("flodata.txt");
load('flowdata.mat');
summary(flowdata)
% Vectors of Pressures
pressureIN = flowdata.pressureAtPipeStart_bars_;
pressureOUT= flowdata.pressureAtPipeEnd_10kmFromStart__bars_;
% Vectors of Acoustic Noises at IN and at OUT
aNoiseIN=flowdata.acousticNoiseAtStart_standardDeviation_;
aNoiseOUT=flowdata.acousticNoiseAtPipeEnd_standardDeviation_;
% Vector of flowrate
flowrate = flowdata.flowRate_m3_hour_;
% Table of flowRate
flowRate = table(flowrate);
rawData = [pressureIN aNoiseIN pressureOUT aNoiseOUT flowrate];
% t = (linspace(0, length(cleanData)*10, length(cleanData)))/86400;
sample_distance = 10;
%t = (linspace(0, 7948790, 794879))/86400;
t = (linspace(0, length(rawData)*sample_distance, length(rawData)))/86400;
% FLOWRATE vs TIME (nessuna pulizia ancora fatta)
figure('Name','raw FLOWRATE vs TIME (1)')
stairs(t, flowrate, '-b');
axis tight;
title('Uncleaned Flowrate over time');
legend('Flowrate');
xlabel('Tempo [Days]');
ylabel('Flowrate [m3/hour]');
grid on;
% Istogramma portata
figure('Name','raw FLOWRATE HISTOGRAM (1)')
histogram(flowrate)
%axis tight;
title('Uncleaned Flowrate Histogram');
legend('counts');
xlabel('Flowrate [m3/hour]');
ylabel('counts');
grid on;
% PRESSURE HEAD LOSS (APPROX _no fattore di attrito, flusso del fluido e
% densità del fluido)
PHeadLoss = pressureIN - pressureOUT;
figure('Name','raw PRESSURE HEAD LOSS vs TIME (2)')
plot(t, PHeadLoss, '-k');
title('Uncleaned Pressure Head Loss over time');
legend('Pressure Head Loss');
xlabel('Tempo [Days]');
ylabel('Pressure Head Loss [bar]');
grid on;
% Istogramma perdita di carico
figure('Name','raw PRESSURE HEAD LOSS HISTOGRAM (1)')
histogram(PHeadLoss)
%axis tight;
title('Uncleaned Pressure Head Loss Histogram');
legend('counts');
xlabel('Pressure Head Loss [bar]');
ylabel('counts');
grid on;
% INPUT PRESSURE vs TIME
figure('Name','raw INPUT PRESSURE vs TIME (3)')
plot(t, pressureIN, '-g');
axis tight;
title('Input Pressure over time');
legend('Input Pressure');
xlabel('Tempo [Days]');
ylabel('Input Pressure [bar]');
grid on;
% Istogramma pressione d'ingresso
figure('Name','raw INPUT PRESSURE HISTOGRAM (1)')
histogram(pressureIN)
%axis tight;
title('Uncleaned Input Pressure Histogram');
legend('counts');
xlabel('Input Pressure [bar]');
ylabel('counts');
grid on;
% OUTPUT PRESSURE vs TIME
figure('Name','raw OUTPUT PRESSURE vs TIME (4)')
plot(t, pressureOUT, '-r');
axis tight;
title('Output Pressure over time');
legend('Output Pressure');
xlabel('Tempo [Days]');
ylabel('Output Pressure [bar]');
grid on;
% Istogramma pressione d'uscita
figure('Name','raw OUTPUT PRESSURE HISTOGRAM (1)')
histogram(pressureOUT)
%axis tight;
title('Uncleaned Output Pressure Histogram');
legend('counts');
xlabel('Output Pressure [bar]');
ylabel('counts');
grid on;
% INPUT PRESSURE vs FLOWRATE
figure('Name','raw INPUT PRESSURE vs raw FLOWRATE (5)')
plot(flowrate, pressureIN, 'g.' );
axis tight;
title('Input Pressure over Flowrate');
legend('Input Pressure vs Flowrate');
xlabel('Flowrate [m3/hour]');
ylabel('Input Pressure [bar]');
grid on;
% OUTPUT PRESSURE vs FLOWRATE
figure('Name','raw OUTPUT PRESSURE vs raw FLOWRATE (6)')
plot(flowrate, pressureOUT, 'r.' );
axis tight;
title('Output Pressure over Flowrate');
legend('Output Pressure vs Flowrate');
xlabel('Flowrate [m3/hour]');
ylabel('Output Pressure [bar]');
grid on;
% INPUT & OUTPUT PRESSURE vs FLOWRATE
figure('Name','raw INPUT & OUTPUT PRESSURE vs raw FLOWRATE (7)')
plot(flowrate, pressureIN, 'g.' );
axis tight;
grid on;
hold on;
plot(flowrate, pressureOUT, 'r.' );
axis tight;
title('Input & Output Pressure over Flowrate');
legend('Input Pressure vs Flowrate','Output Pressure vs Flowrate');
xlabel('Flowrate[m3/hour]');
ylabel('Input & Output Pressure [bar]');
grid on;
% OUTPUT PRESSURE vs INPUT PRESSURE
figure('Name',' raw OUTPUT PRESSURE vs raw INPUT PRESSURE (8)')
plot(pressureIN, pressureOUT, 'k.' );
axis tight;
title('Output Pressure over Input Pressure');
legend('Output Pressure vs Input Pressure');
xlabel('Input Pressure [bar]');
ylabel('Output Pressure [bar]');
grid on;
% FIRST CLEANING
% ELIMINO TUTTE LE RIGHE CHE CONTENGONO UN NAN o UN NUMERO NEGATIVO
% isnan is a function that determines which array elements are NaN.
% isnan returns a logical array containing 1 (true) where the elements of A
% are NaN, and 0 (false) where they are not.
% any is a function that determines if any array elements are nonzero
% B = any(A,dim) tests elements along dimension dim. The dim input is a positive integer scalar.
% Creo un array di indici logici che indica quali righe contengono almeno un NaN
idx_nan = any(isnan(rawData), 2);
% Creo un array di indici logici che indica quali righe hanno almeno un valore negativo
idx_neg = any(rawData < 0, 2);
% Combino gli array di indici logici
idx = idx_nan | idx_neg;
cleanData = rawData;
% Elimino le righe selezionate
cleanData(idx, :) = [];
%dati che ho perso con la pulzia eliminando i valori di pressioni negative
% e i NaN >>>>>= 794879-709847=85032
% 85032/794879 = 0.1069 circa il 10.7 % dei dati totali.
lostData = size(rawData, 1) - size(cleanData, 1);
lostDataRatio = lostData/size(rawData, 1)
percentageLostData = lostDataRatio*100
pressureINc = cleanData(:,1);
pressureOUTc = cleanData(:,3);
aNoiseINc = cleanData(:, 2);
aNoiseOUTc = cleanData(:, 4);
flowrateC = cleanData(:,5);
% sample distance: 10s
% tempo in giorni
t = (linspace(0, length(cleanData)*10, length(cleanData)))/86400;
% FLOWRATE vs TIME
figure('Name','FLOWRATEc vs TIME (9)')
plot(t, flowrateC, '-b');
axis tight;
title('cleaned Flowrate over time');
legend('counts');
xlabel('Flowrate [m3/hour]');
ylabel('counts');
grid on;
% Istogramma portata
figure('Name','cleaned FLOWRATE HISTOGRAM (1)')
histogram(flowrateC)
%axis tight;
title('cleaned Flowrate Histogram');
legend('counts');
xlabel('Flowrate [m3/hour]');
ylabel('counts');
grid on;
% PRESSURE HEAD LOSS (APPROX _no fattore di attrito, flusso del fluido e
% densità del fluido)
PHeadLossc = pressureINc - pressureOUTc;
figure('Name',' PRESSURE HEAD LOSS 1c vs TIME (10)')
plot(t, PHeadLossc, '-k');
title('Pressure Head Loss over time');
legend('Pressure Head Loss');
xlabel('Tempo [Days]');
ylabel('Pressure Head Loss [bar]');
grid on;
% ISTOGRAMMA PRESSIONE DI CARICO
figure('Name','cleaned PRESSURE HEAD LOSS HISTOGRAM (1)')
histogram(PHeadLossc)
%axis tight;
title('cleaned Pressure Head Loss Histogram');
legend('counts');
xlabel('Pressure Head Loss [bar]');
ylabel('counts');
grid on;
% INPUT PRESSURE vs TIME
figure('Name','INPUT PRESSURE 1c vs TIME (11)')
plot(t, pressureINc, '-g');
axis tight;
title('Input Pressure over time');
legend('Input Pressure');
xlabel('Tempo [Days]');
ylabel('Input Pressure [bar]');
grid on;
% ISTOGRAMMA PRESSIONE D'INGRESSO
figure('Name','cleaned INPUT PRESSURE HISTOGRAM (1)')
histogram(pressureINc)
%axis tight;
title('cleaned Input Pressure Histogram');
legend('counts');
xlabel('Input Pressure [bar]');
ylabel('counts');
grid on;
% OUTPUT PRESSURE vs TIME
figure('Name',' OUTPUT PRESSURE 1c vs TIME (12)')
plot(t, pressureOUTc, '-r');
axis tight;
title('Output Pressure over time');
legend('Output Pressure');
xlabel('Tempo [Days]');
ylabel('Output Pressure [bar]');
grid on;
% Istogramma pressione d'uscita
figure('Name','cleaned OUTPUT PRESSURE HISTOGRAM (1)')
histogram(pressureOUTc)
%axis tight;
title('cleaned Output Pressure Histogram');
legend('counts');
xlabel('Output Pressure [bar]');
ylabel('counts');
grid on;
% INPUT PRESSURE vs FLOWRATE
figure('Name',' INPUT PRESSURE 1c vs FLOWRATE 1c (13)')
plot(flowrateC, pressureINc, 'g.' );
axis tight;
title('Input Pressure over Flowrate');
legend('Input Pressure vs Flowrate');
xlabel('Flowrate [m3/hour]');
ylabel('Input Pressure [bar]');
grid on;
% OUTPUT PRESSURE vs FLOWRATE
figure('Name',' OUTPUT PRESSURE 1c vs FLOWRATE 1c (14)')
plot(flowrateC, pressureOUTc, 'r.' );
axis tight;
title('Output Pressure over Flowrate');
legend('Output Pressure vs Flowrate');
xlabel('Flowrate [m3/hour]');
ylabel('Output Pressure [bar]');
grid on;
% INPUT & OUTPUT PRESSURE vs FLOWRATE
figure('Name','INPUT & OUTPUT PRESSURE 1c vs FLOWRATE 1c (15)')
plot(flowrateC, pressureINc, 'g.' );
axis tight;
grid on;
hold on;
plot(flowrateC, pressureOUTc, 'r.' );
axis tight;
title('Input & Output Pressure over Flowrate');
legend('Input Pressure vs Flowrate','Output Pressure vs Flowrate');
xlabel('Flowrate[m3/hour]');
ylabel('Input & Output Pressure [bar]');
grid on;
% OUTPUT PRESSURE vs INPUT PRESSURE
figure('Name','OUTPUT PRESSURE 1c vs INPUT PRESSURE 1c (16)')
plot(pressureINc, pressureOUTc, 'k.' );
axis tight;
title('Output Pressure over Input Pressure');
legend('Output Pressure vs Input Pressure');
xlabel('Input Pressure [bar]');
ylabel('Output Pressure [bar]');
grid on;
During the second cleaning I use quartiles as method.
The following code is the code of the second cleaning ... but... I should get the following image
but I get this graph:
As you can see, it seems I have removed too much data. But I used the same IQR OUTLIER REMOVAL, why this happens? Thanks in advance...
%%% SECOND CLEANING by QUARTILES
%
%cleanDatac2 = rmoutliers(cleanData(:,[1 3]), "quartiles"); % QUARTILES
%cleanDatac2 = rmoutliers(cleanData(:,[1 3 5]), "percentile", [24 74]); % QUARTILES
cleanDatac2 = rmoutliers(cleanData(:,[1 3 5])); % IQR
% 709847-576736 = 133111
% 133111/709847 = 0.1875 circa il 18.75 % dei dati totali.
lostDatac2 = size(cleanData, 1) - size(cleanDatac2, 1);
lostDataRatioc2 = lostDatac2/size(cleanData, 1)
percentageLostDatac2 = lostDataRatioc2*100
pressureINc2 = cleanDatac2(:,1);
pressureOUTc2 = cleanDatac2(:,2);
%aNoiseINc2 = cleanDatac2(:, 2);
%aNoiseOUTc2 = cleanDatac2(:, 4);
flowrateC2 = cleanDatac2(:,3);
% tempo in giorni
t = (linspace(0, length(cleanDatac2)*10, length(cleanDatac2)))/86400;
% FLOWRATE vs TIME
figure('Name','FLOWRATEc2 vs TIME (17)')
plot(t, flowrateC2, '-b');
axis tight;
title('2cleaned Flowrate over time');
legend('Flowrate');
xlabel('Tempo [Days]');
ylabel('Flowrate [m3/hour]');
grid on;
% PRESSURE HEAD LOSS (APPROX _no fattore di attrito, flusso del fluido e
% densità del fluido)
PHeadLossc2 = pressureINc2 - pressureOUTc2;
figure('Name',' PRESSURE HEAD LOSS 2c vs TIME (18)')
plot(t, PHeadLossc2, '-k');
title('Pressure Head Loss c2 over time');
legend('Pressure Head Loss');
xlabel('Tempo [Days]');
ylabel('Pressure Head Loss [bar]');
grid on;
% INPUT PRESSURE vs TIME
figure('Name','INPUT PRESSURE 2c vs TIME (19)')
plot(t, pressureINc2, '-g');
axis tight;
title('Input Pressure c2 over time');
legend('Input Pressure');
xlabel('Tempo [Days]');
ylabel('Input Pressure [bar]');
grid on;
% OUTPUT PRESSURE vs TIME
figure('Name',' OUTPUT PRESSURE 2c vs TIME (20)')
plot(t, pressureOUTc2, '-r');
axis tight;
title('Output Pressure c2 over time');
legend('Output Pressure');
xlabel('Tempo [Days]');
ylabel('Output Pressure [bar]');
grid on;
% INPUT PRESSURE vs FLOWRATE
figure('Name',' INPUT PRESSURE 2c vs FLOWRATE 2c (21)')
plot(flowrateC2, pressureINc2, 'g.' );
axis tight;
title('Input Pressure c2 over Flowrate c2');
legend('Input Pressure vs Flowrate');
xlabel('Flowrate [m3/hour]');
ylabel('Input Pressure [bar]');
grid on;
% OUTPUT PRESSURE vs FLOWRATE
figure('Name',' OUTPUT PRESSURE 2c vs FLOWRATE 2c (22)')
plot(flowrateC2, pressureOUTc2, 'r.' );
axis tight;
title('Output Pressure over Flowrate');
legend('Output Pressure vs Flowrate');
xlabel('Flowrate [m3/hour]');
ylabel('Output Pressure [bar]');
grid on;
% INPUT & OUTPUT PRESSURE vs FLOWRATE
figure('Name','INPUT & OUTPUT PRESSURE 2c vs FLOWRATE 2c (23)')
plot(flowrateC2, pressureINc2, 'g.' );
axis tight;
grid on;
hold on;
plot(flowrateC2, pressureOUTc2, 'r.' );
axis tight;
title('Input & Output Pressure c2 over Flowrate c2');
legend('Input Pressure vs Flowrate','Output Pressure vs Flowrate');
xlabel('Flowrate[m3/hour]');
ylabel('Input & Output Pressure [bar]');
grid on;
% OUTPUT PRESSURE vs INPUT PRESSURE
figure('Name','OUTPUT PRESSURE 2c vs INPUT PRESSURE 2c (24)')
plot(pressureINc2, pressureOUTc2, 'k.' );
axis tight;
title('Output Pressure over Input Pressure');
legend('Output Pressure vs Input Pressure');
xlabel('Input Pressure [bar]');
ylabel('Output Pressure [bar]');
% grid on;
6 comentarios
dpb
el 24 de En. de 2024
Well, then you'll have to delve into precise manipulations the other author used in order to replicate those results; clearly they did at least one step differently.
Again, we have no way to know what that might be without the original paper and it's not the purpose of the Answers forum to be a general-purpose no-fee consulting service. If you need to reproduce the dataset, that's part of the task of your thesis research. And, of course, it wouldn't necessarily be the first time a paper has been published with incomplete descriptions of all the steps performed in producing the results presented....
Respuestas (2)
dpb
el 24 de En. de 2024
Movida: dpb
el 24 de En. de 2024
One additional comment -- you used
cleanDatac2 = rmoutliers(cleanData(:,[1 3 5])); % IQR
which treated each response variable globally over the entire data set -- that does not seem at all like what one would want to do to try to find outliers in what appears must be a time series of a dataset that must have had some changes in other variables going on during the time or one would expect a steady-state solution,
Instead, the reference figure you present, while pretty noisy, shows definite bands/patterns with something like five or so more-or-less distinct groupings. However, if you consider all the data as a single composite sample as you did with rmoutliers, it's not terribly surprising you have eliminated a major fraction of the lower-valued response points -- the largest concentration of data are in the top section; hence a great deal of the data on the low side is going to look like it's not part of that distribution of points, being identified as "outliers" on the low side. But, almost all of those are likely not actually outliers but representative of the other flow regimes that existed during the test, clearly for much shorter time periods.
It would seem that if there are outliers during a given flow regime, what they really are is likely the transients recorded between the various flow regimes. Without anything else to go on, it would appear that whatever exclusion process was used in the original paper must have been applied over segments of the full time span reflective of the various periods of given flow regimes.
Your result, while not what you were expecting, does illustrate a similar phenomenon, there are again identifiable regimes of a given flow pattern; you've taken the uppermost "big blob" and broken it down into a number of smaller regimes; that they exist is somewhat visible in the original other than they're confounded by the scale of the plot encompassing the whole range instead of just a very small section.
0 comentarios
Mathieu NOE
el 25 de En. de 2024
hello @Giuseppe Zumbo
for the time being I have not used any fancy outlier removal technique , it's simply field engineer's brain processing
first thing, I overlaid the 3 main data (input / output pressure and flow rate) on the same plot to see where there could be invalid segments of data
first , I am sceptical about what happens during the first 15 days : lots of peaks are seen on the output pressure side that are not reflected / correlated to the input side , so I prefer to discard the first 20 days (yes it's brutal !)
then i also don't like those sudden drops to zero , because to me its like dividing zero by zero , when you want to make a relationship between two variables , I avoid using background noise data , it will blurr the result.
so after ding some manual removing stuff , I finally got this picture that is not too bad looking IMHO. You can now probably make a fit of that if that is your final goal
still someone has to find an explanation for the outliers (red rectangle) - could it be some sensor reliability issue ?
hope it helps !!
code :
flowdata = readtable("flodata.txt");
% column 1: pressure at pipe start [bars]
% column 2: acoustic noise at start [standard deviation]
% column 3: pressure at pipe end (10km from start) [bars]
% column 4: acoustic noise at pipe end[standard deviation]
% column 5: flow rate [m3/hour]
% summary(flowdata)
% Vectors of Pressures
pressureIN = flowdata.pressureAtPipeStart_bars_;
pressureOUT= flowdata.pressureAtPipeEnd_10kmFromStart__bars_;
% Vectors of Acoustic Noises at IN and at OUT
aNoiseIN=flowdata.acousticNoiseAtStart_standardDeviation_;
aNoiseOUT=flowdata.acousticNoiseAtPipeEnd_standardDeviation_;
% Vector of flowrate
flowrate = flowdata.flowRate_m3_hour_;
% % Table of flowRate
% flowRate = table(flowrate);
% rawData = [pressureIN aNoiseIN pressureOUT aNoiseOUT flowrate];
% t = (linspace(0, length(cleanData)*10, length(cleanData)))/86400;
sample_distance = 10;
%t = (linspace(0, 7948790, 794879))/86400;
[samples,channels] = size(flowdata);
t = linspace(0, samples*sample_distance, samples)/86400;
% all main curves overlaid
figure()
plot(t, pressureIN, '-g',t, pressureOUT, '-r',t, flowrate, '-b');
axis tight;
title(' Input , Output Pressure , FR vs time');
legend('Input Pressure','Output Pressure','Flow Rate');
xlabel('Tempo [Days]');
ylabel('all');
% large variations on the output pressure can be seen at days 1 to 15 , not related to input pression
% we should not use that portion of the data
ind = t>20;
t = t(ind);
pressureIN = pressureIN(ind);
pressureOUT = pressureOUT(ind);
flowrate = flowrate(ind);
figure()
plot(t, pressureIN, '-g',t, pressureOUT, '-r',t, flowrate, '-b');
axis tight;
title(' Input , Output Pressure , FR vs time');
legend('Input Pressure','Output Pressure','Flow Rate');
xlabel('Tempo [Days]');
ylabel('all');
% keep data (valid) for gradient below a certain threshold (and non zero
% FR too btw)
pressureIN_grad = abs(gradient(pressureIN));
pressureOUT_grad = abs(gradient(pressureOUT));
flowrate_grad = abs(gradient(flowrate));
threshold = 0.001;
ind1 = pressureIN_grad<threshold*max(pressureIN_grad);
ind2 = pressureOUT_grad<threshold*max(pressureOUT_grad);
ind3 = flowrate_grad<threshold*max(flowrate_grad);
ind4 = flowrate> 0;
ind = ind1 & ind2 & ind3 & ind4;
t = t(ind);
pressureIN = pressureIN(ind);
pressureOUT = pressureOUT(ind);
flowrate = flowrate(ind);
figure()
plot(t, pressureIN, '.g',t, pressureOUT, '.r',t, flowrate, '.b');
axis tight;
title(' Input , Output Pressure , FR vs time');
legend('Input Pressure','Output Pressure','Flow Rate');
xlabel('Tempo [Days]');
ylabel('all');
% PRESSURE HEAD LOSS (APPROX _no fattore di attrito, flusso del fluido e
% densità del fluido)
PHeadLoss = pressureIN - pressureOUT;
% consider only positive values
ind = PHeadLoss> 0;
t = t(ind);
pressureIN = pressureIN(ind);
pressureOUT = pressureOUT(ind);
flowrate = flowrate(ind);
PHeadLoss = pressureIN - pressureOUT;
figure('Name','FLOW RATE vs PRESSURE HEAD LOSS')
plot(PHeadLoss, flowrate, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('P Head Loss');
ylabel('flow rate');
3 comentarios
Mathieu NOE
el 29 de En. de 2024
hello again
I am not sure to follow you
do you mean that the data we are analysing is the combination of 3 pumps ?
I am not sure how I could distinguish 3 pumps in this plot (for me it was only 1 pump)
beside that I don't have th StatisticsToolbox, so I can't not do much on how to use fitgmdist on your data (even though I have downloaded Fex submissions about clustering - that could be an alternative)
dpb
el 31 de En. de 2024
From above
% column 1: pressure at pipe start [bars]
% column 2: acoustic noise at start [standard deviation]
% column 3: pressure at pipe end (10km from start) [bars]
% column 4: acoustic noise at pipe end[standard deviation]
% column 5: flow rate [m3/hour]
Nothing there identifies anything about a pump ID.
Can't analyze what isn't recorded, sorry, but is a general principle for variables as well as changing levels of a given variable.
Again, there's absolutely nothing anybody here can do about comparing what you have posted to some other unknown reference paper.
Ver también
Categorías
Más información sobre Data Preprocessing en Help Center y File Exchange.
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!