removing outliers

Question

joseph Frank on 26 Mar 2011


https://es.mathworks.com/matlabcentral/answers/4040-removing-outliers

Commented: Anirudh Thatipelli on 24 May 2018

Hi,

I have event-based data for n companies (not time series data). Visually, I can see that there are outliers, but I don't know which method to use to remove them in MATLAB. Any help is appreciated.


Answer 1

Richard Willey on 1 Apr 2011


https://es.mathworks.com/matlabcentral/answers/4040-removing-outliers#answer_6305

Automatically detecting outliers is tricky stuff.

You normally need fairly precise information regarding your data as well as the model that you are fitting to your data.

Here's a relatively simple technique that will work for many types of linear models. The methodology is based on a statistic called "Cook's Distance" that you can extract from regstats.

Cook's Distance for a given data point measures the extent to which a regression model would change if this data point were excluded from the regression. Cook's Distance is sometimes used to suggest whether a given data point might be an outlier.

Here's a simple example illustrating how this works:

% Create a vector of X values
X = 1:100;
X = X';
% Create a noise vector
noise = randn(100,1);
% Create a second noise value where sigma is much larger
noise2 = 10*randn(100,1);
% Substitute noise2 for noise at obs# (11, 31, 51, 71, 91)
% Many of these points will have an undue influence on the model
noise(11:20:91) = noise2(11:20:91);
% Specify Y = F(X)
Y = 3*X + 2 + noise;
% Cook's Distance for a given data point measures the extent to 
% which a regression model would change if this data point 
% were excluded from the regression. Cook's Distance is 
% sometimes used to suggest whether a given data point might be an outlier.
% Use regstats to calculate Cook's Distance
stats = regstats(Y,X,'linear');
% Cook's Distance > 4/n is a typical threshold that is used to suggest
% the presence of an outlier
potential_outlier = stats.cookd > 4/length(X);
% Display the index of potential outliers and graph the results
X(potential_outlier)
scatter(X,Y, 'b.')
hold on
scatter(X(potential_outlier),Y(potential_outlier), 'r.')
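The example above flags points but stops short of removing them. A minimal sketch of that last step, assuming the same simulated data and the same 4/n rule: drop the flagged rows and refit. The names keep, Xclean, Yclean, coef_all, and coef_clean are illustrative and not part of the original answer, and regstats still requires the Statistics Toolbox.

```matlab
% Rebuild the simulated data from the example above
X = (1:100)';
noise = randn(100,1);
noise(11:20:91) = 10*randn(5,1);     % five observations with much larger noise
Y = 3*X + 2 + noise;

% Flag points with the same 4/n rule on Cook's Distance
stats = regstats(Y,X,'linear');      % requires the Statistics Toolbox
potential_outlier = stats.cookd > 4/length(X);

% Keep only the points that were not flagged
keep   = ~potential_outlier;
Xclean = X(keep);
Yclean = Y(keep);

% Fit a straight line with and without the flagged points
coef_all   = polyfit(X,Y,1);             % [slope intercept], all points
coef_clean = polyfit(Xclean,Yclean,1);   % [slope intercept], flagged points removed
```

Comparing coef_all with coef_clean shows how much the flagged points moved the fit. A flagged point is a candidate for inspection, not proof of bad data, so it is worth looking at the removed rows before discarding them.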

2 comments

Mark Shore on 1 Apr 2011

Looks interesting but unfortunately requires the Statistics Toolbox.
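For readers without the Statistics Toolbox, Cook's Distance for an ordinary least-squares fit can be computed with base MATLAB functions alone. The sketch below assumes the same straight-line model as the answer above and uses the standard formula D_i = e_i^2/(p*s^2) * h_ii/(1-h_ii)^2, where e_i is the residual, h_ii the leverage, p the number of fitted coefficients, and s^2 the mean squared error. All variable names are illustrative.

```matlab
% Simulated data, as in the answer above
X = (1:100)';
noise = randn(100,1);
noise(11:20:91) = 10*randn(5,1);     % five observations with much larger noise
Y = 3*X + 2 + noise;

n = numel(X);
A = [ones(n,1) X];                   % design matrix: intercept and slope columns
p = size(A,2);                       % number of fitted coefficients

beta = A \ Y;                        % least-squares coefficients
e    = Y - A*beta;                   % residuals
mse  = sum(e.^2) / (n - p);          % mean squared error

% Leverages are the diagonal of the hat matrix A*inv(A'*A)*A'.
% With an economy-size QR factorization, they are the row sums of Q.^2.
[Q,~] = qr(A,0);
h = sum(Q.^2,2);

% Cook's Distance for every observation, then the 4/n rule
cookd = (e.^2 ./ (p*mse)) .* (h ./ (1 - h).^2);
potential_outlier = cookd > 4/n;
```

The logical vector potential_outlier can then be used for indexing in the same way as in the regstats version.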

Anirudh Thatipelli on 24 May 2018

Thanks for referring to Cook's distance @Richard Willey. It has been a great help for me in removing outliers.


Answer 2

Matt Fig on 26 Mar 2011


https://es.mathworks.com/matlabcentral/answers/4040-removing-outliers#answer_5755

What form is the data in? You might be able to use logical indexing. For example:

% x is some data with outliers 99 and -70.  We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10);  % Take those values less than 10
x = x(x>0);  % Take those values greater than zero.


You could also do this in one shot, as below.

% x is some data with outliers 99 and -70.  We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10 & x>0)

1 comment

joseph Frank on 27 Mar 2011

The data is in % stock returns. It will be difficult to set a subjective cutoff point. I am wondering if there is another way to determine what is an outlier and what is not.
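One data-driven alternative to a hand-picked cutoff is a rule based on the median and the median absolute deviation (MAD), which are much less affected by the outliers themselves than the mean and standard deviation. This is a swapped-in technique, not something proposed in the thread. The factor 1.4826 (which makes the MAD comparable to a standard deviation for normally distributed data) and the cutoff of 3 are conventional choices, and the variable names are illustrative. The sketch uses only base MATLAB and reuses the toy vector from the answer above.

```matlab
% Toy data with two obvious outliers, 99 and -70
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];

med          = median(x);                  % robust estimate of the center
mad_x        = median(abs(x - med));       % median absolute deviation
robust_sigma = 1.4826 * mad_x;             % MAD rescaled to match sigma for normal data

cutoff     = 3;                            % assumed threshold, in robust sigmas
is_outlier = abs(x - med) > cutoff * robust_sigma;

x_clean = x(~is_outlier);                  % data with flagged values removed
```

On this toy vector the rule drops 99 and -70 and keeps the other twelve values. Stock returns tend to be heavy-tailed, so the cutoff will probably need tuning, and if more than half of the values are identical the MAD is zero and every other value gets flagged.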


Answer 3

Walter Roberson on 27 Mar 2011


https://es.mathworks.com/matlabcentral/answers/4040-removing-outliers#answer_5784

"outlier" is mathematically a matter of interpretation.

What is the outlier in this data?

1 2 3 1 2 3

Answer: 2, because the underlying process is believed to create 2 only 1 time in 1000 compared to 1 or 3, so for 2 to show up twice is unusual for this data.

But if you only had the data, how would you know that?

Thus, in order for a program to determine what is an "outlier" or not, you need to encode a model about what is "typical" data and what is not.

5 comments (3 older comments not shown)

joseph Frank on 1 Apr 2011

I removed points more than 3 standard deviations away from the mean and I still have some outliers. I have no clue how to compute the 1st derivative. If you have any instructions, I will follow them to compute it.

Walter Roberson on 1 Apr 2011

Sometimes it is more effective to compute deviations with a "leave one out" method: if this point was not already part of the dataset, how many deviations away from the mean would it be of the (smaller) dataset?

For normally distributed data, three standard deviations covers about 99.7% of values; possibly for your purposes a looser test, such as 2.5 standard deviations, is warranted.
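The "leave one out" idea can be sketched in base MATLAB as follows: for each point, compute the mean and standard deviation of all the other points, then ask how many of those standard deviations the point lies from that mean. The toy vector is borrowed from another answer in this thread, the 2.5 cutoff is the looser test suggested above, and the variable names are illustrative.

```matlab
% Toy data with two obvious outliers, 99 and -70
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
n = numel(x);

% Ordinary z-scores, for comparison: every point contributes to mean and std
z_plain = (x - mean(x)) / std(x);

% Leave-one-out z-scores: each point is scored against the remaining n-1 points
z_loo = zeros(size(x));
for i = 1:n
    others   = x([1:i-1, i+1:n]);                    % all points except the i-th
    z_loo(i) = (x(i) - mean(others)) / std(others);
end

k          = 2.5;                    % cutoff in standard deviations
is_outlier = abs(z_loo) > k;
x_clean    = x(~is_outlier);
```

On this vector the ordinary three-sigma rule flags nothing, because 99 and -70 inflate the standard deviation they are measured against. The leave-one-out scores with the 2.5 cutoff flag both. This may be why some outliers survive a plain three-sigma filter.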

