How to estimate Y using a linear regression and an uncertain X value.

38 visualizaciones (últimos 30 días)
Samuel
Samuel el 10 de Nov. de 2023
Comentada: Star Strider el 18 de Nov. de 2023
I have a training dataset that I'd like to perform regression upon so that I can use the resulting regression to estimate values of Y for other samples for which I only have X. This is standard, using polyfit and polyval, however I have uncertainties in my new X value, and polyval doesn't seem to be able to handle uncertainties in X.
More detail: In my training set I have 100 paired observations of X and Y (in my case, these are observations of Aluminium content and Calcium content of 100 rock samples). These samples don't have any uncertainty, for ease. A linear regression with uncertainty is simple enough:
%Synthetic data for example
X_data = 1:0.25:25;
Y_data = -0.3*X_data + 2*randn(1,100);
%Create linear regression on data
[p, S] = polyfit(X_data, Y_data, 1);
If I now wanted to estimate the Y value in a new sample, if I had it's X value, I would use polyval. However, my new X value (Xnew) has uncertainty, such that Xnew has a mean of 12 and an stddev of 3. I cannot see any way to incorporate the Xnew uncertainty into polyval.
How can I evaluate Y whilst accounting for both the uncertainty in the linear regression and the uncertainty in my new X value?

Respuestas (2)

Star Strider
Star Strider el 10 de Nov. de 2023
If I understand corectly what yoiu want to do, the best approach might be to use fitlm and then use the appropriate predict function (I beleive that is the correct function, however MATLAB will automatically choose the correct one, so no worries) since it will incorporate the uncertainties in the parameters and produce a ‘Y’ value and an associated confidence interval for each value of ‘X’ that you provide. (There are ways to calculate the parameter uncertainties for polyfit — I wrote one and need to post the update — as well as a way of calculating the prediction confidence intervals, one of which I also wrote and does not need to be updated.) Those will work, however fitlm and its predict function are easier in this instance.
As for accounting for the uncertainties in your new ‘X’ value, I would just use a range of each of the ‘X’ values (either the ±95% coinfidence interval extremes as well as the mean, or just the extremes, or something else that makes sense in the context of your calculations). Use those as your arguments to predict.
  3 comentarios
Samuel
Samuel el 17 de Nov. de 2023
I decided to write a code to do what I needed. I might be missing a trick however. I'll put my thought process and then the code below:
  1. Polyval can only run with deterministic values of X. So, randomly sample from uncertain X value I want to evaluate (a normal distribution with X_mu and X_sigma). Run polyval on all the X values.
  2. For each X value evaluated with Polyval, polyval returns a Y_mean and Y_sigma. I randomly sample some number of counts from each of these Y distributions.
  3. I fit a normal distribution to the combined set of all random samples. If in step 1 I randomly sample from X 200 times, and then in step 2, for each 200 distributions of Y, sample 4000 times, the set of samples I am fitting a distribution to in step 3 has 200*4000 samples, capturing both the variability that X can take and the variability of each value of Y.
This returns a mean and 1sigma uncertainty for Y that accounts for both the uncertainty in X and in the linear regression.
My function is below.
function[mu_final, sigma_final] = polyval_wxerr2(p, S, xcal_mu, xcal_sigma, numreps, numsamples)
% p = 1st output of polyfit (linear regression)
% S = 2nd output of polyfit (linear regression)
% xcal_mu = Mean of X to calibrate
% xcal_sigma = Sigma of X to Calibrate
% numreps = How many times to sample from X distribution (good value is
% 500-1000)
% numsamples = How many samples from each Y distribution to contribute to
% final (good value is 10,000)
%% 1. Run Polyval numreps times, each time using a sampled X value
%Randomly sample from the x dist
x_i = normrnd(xcal_mu, xcal_sigma, numreps, 1);
%Return the y distributions for each x value
[y_i, delta_i] = polyval(p,x_i, S);
%% 2. Take Random Samples (#Samples = numsamples) from Each Y distribution (#Y distributions = numreps)
pd_randsamps = zeros(numreps, numsamples);
for i = 1:numreps
pd_randsamps(i,:) = y_i(i) + delta_i(i).*randn(1,numsamples);
end
%% 3. Fit a distribution to all the resulting samples (Final # samples = numreps*numsamples)
pd_final = fitdist(pd_randsamps(:), 'Normal');
mu_final = pd_final.mu;
sigma_final = pd_final.sigma;
end
Star Strider
Star Strider el 18 de Nov. de 2023
That seems similar to the bootstrp approach. If you have the Statistics and Machine Learning Toolbox, you might consider it. Otherwise, polyfit can return a covariance matrix as one of its ‘S’ outputs, although I am not certain how that fits with what you want to do.

Iniciar sesión para comentar.


the cyclist
the cyclist el 11 de Nov. de 2023
This is not exactly what you have asked for, but you could also look into doing the first regression using an errors-in-variables model, e.g. total least squares, which will take into account uncertainty in both explanatory and response variables.
If your real application has just one explanatory variable, then this reduces to a Deming regression. To my knowledge, there is no native MATLAB implementation of this model, but there is at least one user-contributed File Exchange submission, e.g. this one.

Etiquetas

Productos


Versión

R2023b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by