Subset of Regressors Approximation for GPR Models

The subset of regressors (SR) approximation method replaces the kernel function $k(x,x_r|\theta)$ in the exact GPR method with its approximation $\hat{k}_{SR}(x,x_r|\theta,A)$, given the active set $A \subset N = \{1,2,\ldots,n\}$. You can specify the SR method for parameter estimation by using the 'FitMethod','sr' name-value pair argument in the call to fitrgp. For prediction using SR, you can use the 'PredictMethod','sr' name-value pair argument in the call to fitrgp.
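
For example, the following call fits a GPR model using SR for both parameter estimation and prediction. The synthetic data and the active set size of 50 are illustrative choices, not requirements.

rng(0);                                          % for reproducibility
X = linspace(0,10,500)';                         % training inputs
y = sin(X) + 0.2*randn(500,1);                   % noisy responses
gprMdl = fitrgp(X,y,'KernelFunction','squaredexponential', ...
    'FitMethod','sr','PredictMethod','sr','ActiveSetSize',50);
ypred = resubPredict(gprMdl);                    % SR predictions on the training data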

Approximating the Kernel Function

For the exact GPR model, the expected prediction depends on the set of $n$ functions $S_N = \{k(x,x_i|\theta),\ i = 1,2,\ldots,n\}$, where $N = \{1,2,\ldots,n\}$ is the set of indices of all observations and $n$ is the total number of observations. The idea is to approximate the span of these functions by a smaller set of functions $S_A$, where $A \subset N$ is the subset of indices of points selected to be in the active set. Consider $S_A = \{k(x,x_j|\theta),\ j \in A\}$. The aim is to approximate the elements of $S_N$ as linear combinations of the elements of $S_A$.

Suppose the approximation to $k(x,x_r|\theta)$ using the functions in $S_A$ is as follows:

$$\hat{k}(x,x_r|\theta) = \sum_{j \in A} c_{jr}\, k(x,x_j|\theta),$$

where $c_{jr}$ are the coefficients of the linear combination for approximating $k(x,x_r|\theta)$. Suppose $C$ is the matrix that contains all the coefficients $c_{jr}$. Then $C$ is a $|A| \times n$ matrix such that $C(j,r) = c_{jr}$. The software finds the best approximation to the elements of $S_N$ using the active set $A \subset N$ by minimizing the error function

$$E(A,C) = \sum_{r=1}^{n} \left\lVert k(x,x_r|\theta) - \hat{k}(x,x_r|\theta) \right\rVert_{\mathcal{H}}^{2},$$

where $\mathcal{H}$ is the Reproducing Kernel Hilbert Space (RKHS) associated with the kernel function $k$ [1], [2].

The coefficient matrix that minimizes E(A,C) is

$$\hat{C}_A = K(X_A,X_A|\theta)^{-1}\, K(X_A,X|\theta),$$

and an approximation to the kernel function using the elements in the active set $A \subset N$ is

$$\hat{k}(x,x_r|\theta) = \sum_{j \in A} c_{jr}\, k(x,x_j|\theta) = K(x^T,X_A|\theta)\, C(:,r).$$

The SR approximation to the kernel function using the active set $A \subset N$ is defined as:

$$\hat{k}_{SR}(x,x_r|\theta,A) = K(x^T,X_A|\theta)\,\hat{C}_A(:,r) = K(x^T,X_A|\theta)\, K(X_A,X_A|\theta)^{-1}\, K(X_A,x_r^T|\theta),$$

and the SR approximation to $K(X,X|\theta)$ is:

$$\hat{K}_{SR}(X,X|\theta,A) = K(X,X_A|\theta)\, K(X_A,X_A|\theta)^{-1}\, K(X_A,X|\theta).$$
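
The following sketch illustrates this approximation numerically, including the coefficient matrix $\hat{C}_A$ from above. The squared exponential kernel, the evenly spaced active set, and the added jitter are illustrative assumptions, not part of the SR method itself.

% Build the SR (Nystrom-type) approximation K(X,XA)*K(XA,XA)^-1*K(XA,X)
rng(0);
n = 200; m = 20;                                 % n observations, |A| = m
X = linspace(0,10,n)';
XA = X(round(linspace(1,n,m))',:);               % a simple evenly spaced active set
kfcn = @(U,V) exp(-pdist2(U,V).^2/2);            % squared exponential kernel
Kaa = kfcn(XA,XA) + 1e-10*eye(m);                % jitter for numerical stability
Kxa = kfcn(X,XA);
C = Kaa \ kfcn(XA,X);                            % coefficient matrix C_hat_A (m-by-n)
Ksr = Kxa * C;                                   % SR approximation to K(X,X)
relErr = norm(kfcn(X,X)-Ksr,'fro')/norm(kfcn(X,X),'fro')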

Parameter Estimation

Replacing $K(X,X|\theta)$ by $\hat{K}_{SR}(X,X|\theta,A)$ in the marginal log likelihood function produces its SR approximation:

$$\log P_{SR}(y|X,\beta,\theta,\sigma^2,A) = -\frac{1}{2}(y-H\beta)^T\left[\hat{K}_{SR}(X,X|\theta,A)+\sigma^2 I_n\right]^{-1}(y-H\beta) - \frac{n}{2}\log 2\pi - \frac{1}{2}\log\left|\hat{K}_{SR}(X,X|\theta,A)+\sigma^2 I_n\right|.$$

As in the exact method, the software estimates the parameters by first computing $\hat{\beta}(\theta,\sigma^2)$, the optimal estimate of $\beta$ given $\theta$ and $\sigma^2$. Then it estimates $\theta$ and $\sigma^2$ using the $\beta$-profiled marginal log likelihood. The SR estimate of $\beta$ for given $\theta$ and $\sigma^2$ is:

$$\hat{\beta}_{SR}(\theta,\sigma^2,A) = \Big[\underbrace{H^T\big[\hat{K}_{SR}(X,X|\theta,A)+\sigma^2 I_n\big]^{-1}H}_{*}\Big]^{-1}\underbrace{H^T\big[\hat{K}_{SR}(X,X|\theta,A)+\sigma^2 I_n\big]^{-1}y}_{**},$$

where

$$\big[\hat{K}_{SR}(X,X|\theta,A)+\sigma^2 I_n\big]^{-1} = \frac{I_n}{\sigma^2} - \frac{K(X,X_A|\theta)}{\sigma^2}\, A_A^{-1}\, \frac{K(X_A,X|\theta)}{\sigma^2},$$

$$A_A = K(X_A,X_A|\theta) + \frac{K(X_A,X|\theta)\,K(X,X_A|\theta)}{\sigma^2},$$

$$* = \frac{H^T H}{\sigma^2} - \frac{H^T K(X,X_A|\theta)}{\sigma^2}\, A_A^{-1}\, \frac{K(X_A,X|\theta)\,H}{\sigma^2},$$

$$** = \frac{H^T y}{\sigma^2} - \frac{H^T K(X,X_A|\theta)}{\sigma^2}\, A_A^{-1}\, \frac{K(X_A,X|\theta)\,y}{\sigma^2}.$$
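
The practical value of this identity is that it requires inverting only the $|A| \times |A|$ matrix $A_A$ rather than an $n \times n$ matrix. A minimal sketch, under the same illustrative kernel and active set assumptions as above:

rng(0);
n = 200; m = 20; sigma2 = 0.05;
X = linspace(0,10,n)';
XA = X(round(linspace(1,n,m))',:);
kfcn = @(U,V) exp(-pdist2(U,V).^2/2);
Kxa = kfcn(X,XA); Kaa = kfcn(XA,XA) + 1e-10*eye(m);

AA = Kaa + (Kxa'*Kxa)/sigma2;                    % |A|-by-|A| matrix A_A
invApply = @(v) v/sigma2 - Kxa*(AA\(Kxa'*v))/sigma2^2;  % O(n*m^2) inverse apply

% Check against the dense n-by-n inverse
Ksr = Kxa*(Kaa\Kxa');
v = randn(n,1);
err = norm(invApply(v) - (Ksr + sigma2*eye(n))\v)  % ~ machine precision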

The SR approximation to the $\beta$-profiled marginal log likelihood is:

$$\log P_{SR}\!\left(y|X,\hat{\beta}_{SR}(\theta,\sigma^2,A),\theta,\sigma^2,A\right) = -\frac{1}{2}\left(y-H\hat{\beta}_{SR}(\theta,\sigma^2,A)\right)^T\left[\hat{K}_{SR}(X,X|\theta,A)+\sigma^2 I_n\right]^{-1}\left(y-H\hat{\beta}_{SR}(\theta,\sigma^2,A)\right) - \frac{n}{2}\log 2\pi - \frac{1}{2}\log\left|\hat{K}_{SR}(X,X|\theta,A)+\sigma^2 I_n\right|.$$
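
As a concrete sketch of these two estimation formulas, the following code computes $\hat{\beta}_{SR}$ and the profiled log likelihood for a constant basis $H$ (a column of ones); the kernel, noise variance, and active set are again illustrative assumptions.

rng(0);
n = 200; m = 20; sigma2 = 0.05;
X = linspace(0,10,n)';
y = sin(X) + 2 + sqrt(sigma2)*randn(n,1);        % data with a constant offset
H = ones(n,1);                                   % constant basis
XA = X(round(linspace(1,n,m))',:);
kfcn = @(U,V) exp(-pdist2(U,V).^2/2);
Kxa = kfcn(X,XA); Kaa = kfcn(XA,XA) + 1e-10*eye(m);
AA = Kaa + (Kxa'*Kxa)/sigma2;
invApply = @(v) v/sigma2 - Kxa*(AA\(Kxa'*v))/sigma2^2;  % Woodbury form from above

betaSR = (H'*invApply(H)) \ (H'*invApply(y));    % beta_SR, close to 2 here
r = y - H*betaSR;
L = chol(Kxa*(Kaa\Kxa') + sigma2*eye(n),'lower'); % dense, for the log determinant only
logP = -0.5*(r'*invApply(r)) - n/2*log(2*pi) - sum(log(diag(L)))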

Prediction

The SR approximation to the distribution of $y_{new}$ given $y$, $X$, and $x_{new}$ is

$$P(y_{new}|y,X,x_{new}) = N\!\left(y_{new}\,\big|\,h(x_{new})^T\beta + \mu_{SR},\ \sigma_{new}^2 + \Sigma_{SR}\right),$$

where $\mu_{SR}$ and $\Sigma_{SR}$ are the SR approximations to $\mu$ and $\Sigma$ shown in prediction using the exact GPR method. They are obtained by replacing $k(x,x_r|\theta)$ with its SR approximation $\hat{k}_{SR}(x,x_r|\theta,A)$ in $\mu$ and $\Sigma$, respectively.

That is,

$$\mu_{SR} = \underbrace{\hat{K}_{SR}(x_{new}^T,X|\theta,A)}_{(1)}\ \underbrace{\left(\hat{K}_{SR}(X,X|\theta,A)+\sigma^2 I_n\right)^{-1}}_{(2)}\ (y-H\beta).$$

Since

$$(1) = K(x_{new}^T,X_A|\theta)\, K(X_A,X_A|\theta)^{-1}\, K(X_A,X|\theta),$$

$$(2) = \frac{I_n}{\sigma^2} - \frac{K(X,X_A|\theta)}{\sigma^2}\left[K(X_A,X_A|\theta) + \frac{K(X_A,X|\theta)\,K(X,X_A|\theta)}{\sigma^2}\right]^{-1}\frac{K(X_A,X|\theta)}{\sigma^2},$$

and from the fact that $I_n - B(A+B)^{-1} = A(A+B)^{-1}$, $\mu_{SR}$ can be written as

$$\mu_{SR} = K(x_{new}^T,X_A|\theta)\left[K(X_A,X_A|\theta)+\frac{K(X_A,X|\theta)\,K(X,X_A|\theta)}{\sigma^2}\right]^{-1}\frac{K(X_A,X|\theta)}{\sigma^2}\,(y-H\beta).$$
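
A minimal sketch of this efficient form of $\mu_{SR}$, with $\beta$ taken as $0$ (no explicit basis) and the same illustrative kernel and active set as above:

rng(0);
n = 200; m = 20; sigma2 = 0.05;
X = linspace(0,10,n)';
y = sin(X) + sqrt(sigma2)*randn(n,1);
XA = X(round(linspace(1,n,m))',:);
kfcn = @(U,V) exp(-pdist2(U,V).^2/2);
Kxa = kfcn(X,XA); Kaa = kfcn(XA,XA) + 1e-10*eye(m);
AA = Kaa + (Kxa'*Kxa)/sigma2;                    % bracketed matrix A_A above

xnew = 5.3;
muSR = kfcn(xnew,XA) * (AA \ (Kxa'*y/sigma2))    % assuming H*beta = 0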

Similarly, $\Sigma_{SR}$ is derived as follows:

$$\Sigma_{SR} = \underbrace{\hat{k}_{SR}(x_{new},x_{new}|\theta,A)}_{*} - \underbrace{\hat{K}_{SR}(x_{new}^T,X|\theta,A)}_{**}\ \underbrace{\left(\hat{K}_{SR}(X,X|\theta,A)+\sigma^2 I_n\right)^{-1}}_{***}\ \underbrace{\hat{K}_{SR}(X,x_{new}^T|\theta,A)}_{****}.$$

Because

$$* = K(x_{new}^T,X_A|\theta)\, K(X_A,X_A|\theta)^{-1}\, K(X_A,x_{new}^T|\theta),$$

$$** = K(x_{new}^T,X_A|\theta)\, K(X_A,X_A|\theta)^{-1}\, K(X_A,X|\theta),$$

and $*** = (2)$ in the equation for $\mu_{SR}$,

$$**** = K(X,X_A|\theta)\, K(X_A,X_A|\theta)^{-1}\, K(X_A,x_{new}^T|\theta),$$

$\Sigma_{SR}$ is found as follows:

$$\Sigma_{SR} = K(x_{new}^T,X_A|\theta)\left[K(X_A,X_A|\theta)+\frac{K(X_A,X|\theta)\,K(X,X_A|\theta)}{\sigma^2}\right]^{-1}K(X_A,x_{new}^T|\theta).$$
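
The same illustrative setup gives $\Sigma_{SR}$ directly, because the bracketed matrix is again $A_A$:

rng(0);
n = 200; m = 20; sigma2 = 0.05;
X = linspace(0,10,n)';
XA = X(round(linspace(1,n,m))',:);
kfcn = @(U,V) exp(-pdist2(U,V).^2/2);
Kxa = kfcn(X,XA); Kaa = kfcn(XA,XA) + 1e-10*eye(m);
AA = Kaa + (Kxa'*Kxa)/sigma2;

xnew = 5.3;
Kna = kfcn(xnew,XA);
SigmaSR = Kna * (AA \ Kna')                      % SR predictive variance (noise-free part)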

Predictive Variance Problem

One of the disadvantages of the SR method is that it can give unreasonably small predictive variances when making predictions in a region far away from the chosen active set $A \subset N = \{1,2,\ldots,n\}$. Consider making a prediction at a new point $x_{new}$ that is far away from the training set $X$. In other words, assume that $K(x_{new}^T,X|\theta) \approx 0$.

For exact GPR, the posterior distribution of $f_{new}$ given $y$, $X$, and $x_{new}$ would be Normal with mean $\mu = 0$ and variance $\Sigma = k(x_{new},x_{new}|\theta)$. This value is correct in the sense that, if $x_{new}$ is far from $X$, then the data $(X,y)$ does not supply any new information about $f_{new}$, so the posterior distribution of $f_{new}$ given $y$, $X$, and $x_{new}$ should reduce to the prior distribution of $f_{new}$ given $x_{new}$, which is a Normal distribution with mean $0$ and variance $k(x_{new},x_{new}|\theta)$.

For the SR approximation, if $x_{new}$ is far away from $X$ (and hence also far away from $X_A$), then $\mu_{SR} \approx 0$ and $\Sigma_{SR} \approx 0$. Thus in this extreme case, $\mu_{SR}$ agrees with $\mu$ from exact GPR, but $\Sigma_{SR}$ is unreasonably small compared to $\Sigma$ from exact GPR.
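
The collapse is easy to reproduce with the illustrative setup used above; for a test point far from the data, $\Sigma_{SR}$ vanishes while the exact prior variance does not:

rng(0);
n = 200; m = 20; sigma2 = 0.05;
X = linspace(0,10,n)';
XA = X(round(linspace(1,n,m))',:);
kfcn = @(U,V) exp(-pdist2(U,V).^2/2);
Kxa = kfcn(X,XA); Kaa = kfcn(XA,XA) + 1e-10*eye(m);
AA = Kaa + (Kxa'*Kxa)/sigma2;

xfar = 100;                                      % far from X, so K(xfar,X) ~ 0
Kna = kfcn(xfar,XA);
SigmaSR = Kna * (AA \ Kna')                      % ~ 0, unreasonably small
SigmaExact = kfcn(xfar,xfar)                     % = 1, the correct prior variance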

The fully independent conditional (FIC) approximation method can help avoid this problem.
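
In fitrgp, you can keep SR for parameter estimation and switch to FIC for prediction, for example (synthetic data for illustration):

rng(0);
X = linspace(0,10,200)';
y = sin(X) + 0.2*randn(200,1);
gprMdl = fitrgp(X,y,'FitMethod','sr','PredictMethod','fic', ...
    'ActiveSetSize',20);
[ypred,ysd] = predict(gprMdl,(90:100)');         % sensible variances far from X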

References

[1] Rasmussen, C. E. and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press. Cambridge, Massachusetts, 2006.

[2] Smola, A. J. and B. Schölkopf. "Sparse Greedy Matrix Approximation for Machine Learning." In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.
