What is a Regression Model? Example: Honolulu tide

Transcription

What is a Regression Model?
Statistical model for:
- Y (random variable) ←− x (non-random variable), i.e. Y depending on x

Aim: understanding the effect of x on the random quantity Y
General formulation:
Y | x ∼ Distribution(g(x))
Statistical Problem: Estimate (learn) g(·) from data {(xi, yi)}_{i=1}^n. Use for:
- Inference
- Description
- Prediction
- Data compression (parsimonious representations)
- ...
Example: Honolulu tide
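The original slide shows a plot of Honolulu tide-gauge measurements, which is not reproduced in this transcription. As a stand-in, here is a minimal Python sketch (with purely illustrative period and noise values, not the real tide record) of data generated according to the general formulation Y | x ∼ Distribution(g(x)), with a sinusoidal g and Gaussian noise:

    import numpy as np

    rng = np.random.default_rng(0)

    # x: observation times in hours; g(x): assumed smooth mean tide level (illustrative values)
    x = np.linspace(0, 72, 300)                      # three days on a regular grid
    g = 0.5 * np.sin(2 * np.pi * x / 12.42)          # dominant semi-diurnal component (assumed)

    # Y | x ~ N(g(x), sigma^2): random tide measurements around the deterministic mean
    sigma = 0.1
    y = g + rng.normal(scale=sigma, size=x.shape)
    # (x, y) now plays the role of {(x_i, y_i)}_{i=1}^n in the slides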
Great Variety of Models
Remember the general model:
Y | x ∼ Distribution(g(x))
x can be:
- continuous, discrete, categorical, vector, ...
Distribution can be:
- Gaussian (Normal), Laplace, general exponential family, ...
Function g(·) can be:
- g(x) = β0 + β1x,  g(x) = Σ_{k=−K}^{K} βk e^{−ikx},  cubic spline, ...
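To make the variety of choices for g(·) concrete, the sketch below builds least-squares fits for two of them: the linear g(x) = β0 + β1x and a real-valued truncated Fourier basis standing in for Σ_{k=−K}^{K} βk e^{−ikx}. The data are the synthetic tide-like series from the previous sketch, and the choices K = 2 and T = 12.42 are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 72, 300)
    y = 0.5 * np.sin(2 * np.pi * x / 12.42) + rng.normal(scale=0.1, size=x.shape)

    # Linear model: design matrix with columns [1, x]
    X_lin = np.column_stack([np.ones_like(x), x])

    # Truncated Fourier basis (real form of the complex exponential basis), K = 2, period T assumed known
    K, T = 2, 12.42
    cols = [np.ones_like(x)]
    for k in range(1, K + 1):
        cols += [np.cos(2 * np.pi * k * x / T), np.sin(2 * np.pi * k * x / T)]
    X_fourier = np.column_stack(cols)

    # Least-squares estimates of the coefficients for each choice of g
    beta_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
    beta_fourier, *_ = np.linalg.lstsq(X_fourier, y, rcond=None)

    # Compare residual sums of squares of the two fitted mean functions
    print(np.sum((y - X_lin @ beta_lin) ** 2), np.sum((y - X_fourier @ beta_fourier) ** 2))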
Fundamental Case: Normal Linear Regression
- Example: Professor's Van
Y, x ∈ ℝ, g(x) = β0 + β1x, Distribution = Gaussian:
Y | x ∼ N(β0 + β1x, σ²)
⇕
Y = β0 + β1x + ε,   ε ∼ N(0, σ²)
Also, x could be a vector (Y, β0 ∈ ℝ, x ∈ ℝᵖ, β ∈ ℝᵖ):
Y | x ∼ N(β0 + βᵀx, σ²)
⇕
Y = β0 + βᵀx + ε,   ε ∼ N(0, σ²)
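A minimal sketch of the fundamental case: simulate Y = β0 + β1x + ε with Gaussian noise and recover (β0, β1) by least squares. The coefficient values are arbitrary illustrations; this is not the Professor's Van data referenced on the slide.

    import numpy as np

    rng = np.random.default_rng(1)
    n, beta0, beta1, sigma = 200, 2.0, -0.7, 0.5              # arbitrary illustrative values

    x = rng.uniform(0, 10, size=n)
    y = beta0 + beta1 * x + rng.normal(scale=sigma, size=n)   # Y | x ~ N(beta0 + beta1 x, sigma^2)

    X = np.column_stack([np.ones(n), x])                      # design matrix with intercept column
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)                                           # should be close to (2.0, -0.7)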
The tools of the trade...
Start from the Normal linear model → gradually generalise...
Important features of the Normal linear model:
- Gaussian distribution
- Linearity
These two mix well together and give geometric insights to solve the estimation problem. Thus we need to revise some linear algebra and probability...
Will mostly concentrate on the Gaussian assumption, but gradually consider:
- linear Gaussian regression
- nonlinear Gaussian regression
- nonparametric Gaussian regression
Reminder: Subspaces, Spectra, Projections.
Recall that if Q is an n × p real matrix, we define the column space (or range) of Q to be the set spanned by its columns,
M(Q) = {y ∈ ℝⁿ : ∃β ∈ ℝᵖ, y = Qβ}.
- Recall that M(Q) is a subspace of ℝⁿ.
- The columns of Q provide a coordinate system for the subspace M(Q).
- If Q is of full column rank (p), then the coordinates β corresponding to a y ∈ M(Q) are unique.
- This allows an interpretation of the system of linear equations Qβ = y:
  [existence of a solution ↔ is y an element of M(Q)?]
  [uniqueness of the solution ↔ is there a unique coordinate vector β?]
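A quick numerical illustration of the existence/uniqueness interpretation (the matrix and vectors are arbitrary examples): y ∈ M(Q) exactly when appending y to Q does not increase the rank, and with full column rank the coordinates β are unique.

    import numpy as np

    Q = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])                    # n = 3, p = 2, full column rank

    y_in  = Q @ np.array([2.0, -1.0])             # lies in M(Q) by construction
    y_out = np.array([1.0, 0.0, 0.0])             # not in M(Q)

    def in_column_space(Q, y):
        # y in M(Q)  <=>  rank([Q | y]) == rank(Q)
        return np.linalg.matrix_rank(np.column_stack([Q, y])) == np.linalg.matrix_rank(Q)

    print(in_column_space(Q, y_in), in_column_space(Q, y_out))   # True False

    # Full column rank => the coordinates beta of y_in are unique:
    beta, *_ = np.linalg.lstsq(Q, y_in, rcond=None)
    print(beta)                                   # recovers (2, -1)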
Reminder: Subspaces, Spectra, Projections.
(The Spectral Theorem)
A p × p matrix Q is symmetric if and only if there exists a p × p orthogonal matrix U and a diagonal matrix Λ such that
Q = UΛUᵀ.
In particular,
1. The columns of U = (u1 ... up) are eigenvectors of Q, i.e. there exist λj such that Quj = λj uj, j = 1, ..., p.
2. The entries of Λ are the corresponding eigenvalues of Q, Λ = diag{λ1, ..., λp}, which are real.
3. The rank of Q is the number of non-zero eigenvalues.
Note: If the eigenvalues are distinct, the (unit-norm) eigenvectors are unique up to sign and re-ordering.
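The spectral theorem can be checked numerically with numpy.linalg.eigh, which returns real eigenvalues and an orthogonal U for a symmetric input; the example matrix below is arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(4, 4))
    Q = A + A.T                                   # symmetric 4 x 4 matrix

    lam, U = np.linalg.eigh(Q)                    # real eigenvalues and orthonormal eigenvectors

    print(np.allclose(U @ np.diag(lam) @ U.T, Q))                      # Q = U Lambda U^T
    print(np.allclose(U.T @ U, np.eye(4)))                             # U is orthogonal
    print(np.linalg.matrix_rank(Q) == np.sum(~np.isclose(lam, 0)))     # rank = number of non-zero eigenvalues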
Reminder: Subspaces, Spectra, Projections.
Recall two further important subspaces associated with a real n × p matrix Q:
- The nullspace (or kernel), ker(Q), of Q is the subspace defined as
  ker(Q) = {y ∈ ℝᵖ : Qy = 0}.
- The orthogonal complement of M(Q), M⊥(Q), is the subspace defined as
  M⊥(Q) = {y ∈ ℝⁿ : yᵀQx = 0, ∀x ∈ ℝᵖ} = {y ∈ ℝⁿ : yᵀv = 0, ∀v ∈ M(Q)}.
Obviously, the orthogonal complement is defined for arbitrary subspaces (see the second equality).
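A small numerical illustration of the two subspaces, using an arbitrary rank-deficient example matrix:

    import numpy as np

    Q = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [0.0, 0.0]])                     # n = 3, p = 2, rank 1

    y_ker = np.array([2.0, -1.0])                  # Q y_ker = 0, so y_ker is in ker(Q), a subspace of R^2
    print(np.allclose(Q @ y_ker, 0))

    y_perp = np.array([0.0, 0.0, 1.0])             # orthogonal to both columns, so y_perp is in M⊥(Q), a subspace of R^3
    print(np.allclose(Q.T @ y_perp, 0))            # y_perpᵀ Q x = (Qᵀ y_perp)ᵀ x = 0 for every x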
Reminder: Subspaces, Spectra, Projections.
(Singular Value Decomposition)
Any n × p real matrix can be factorised as
Q = U Σ Vᵀ,
where U (n × n) and V (p × p) are orthogonal, with columns called left singular vectors and right singular vectors respectively, and Σ (n × p) is diagonal with real entries called singular values.
1. The left singular vectors are eigenvectors of QQᵀ.
2. The right singular vectors are eigenvectors of QᵀQ.
3. The squares of the singular values are eigenvalues of both QQᵀ and QᵀQ.
4. The left singular vectors corresponding to non-zero singular values form an orthonormal basis for M(Q).
5. The left singular vectors corresponding to zero singular values form an orthonormal basis for M⊥(Q).
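A sketch verifying some of the listed facts with numpy.linalg.svd: the squared singular values match the eigenvalues of QᵀQ, and the left singular vectors with non-zero singular values span M(Q). The matrix is an arbitrary rank-deficient example.

    import numpy as np

    rng = np.random.default_rng(3)
    B = rng.normal(size=(5, 2))
    Q = B @ rng.normal(size=(2, 3))               # 5 x 3 matrix of rank 2

    U, s, Vt = np.linalg.svd(Q)                   # full SVD: Q = U diag(s) V^T

    # Squared singular values = eigenvalues of Q^T Q (both sorted in decreasing order)
    eig_QtQ = np.sort(np.linalg.eigvalsh(Q.T @ Q))[::-1]
    print(np.allclose(s**2, eig_QtQ))

    # Left singular vectors with non-zero singular values form an orthonormal basis of M(Q)
    r = np.sum(s > 1e-10)
    basis = U[:, :r]
    y = Q @ rng.normal(size=3)                    # an arbitrary element of M(Q)
    print(np.allclose(basis @ (basis.T @ y), y))  # projecting onto span(basis) leaves y unchanged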
Reminder: Subspaces, Spectra, Projections.
Recall that a matrix Q is called idempotent if Q² = Q.
An orthogonal projection (henceforth projection) onto a subspace V is a symmetric idempotent matrix H such that M(H) = V.

Proposition
The only possible eigenvalues of a projection matrix are 0 and 1.

Proposition
If P and Q are projection matrices onto a subspace V, then P = Q.

Proposition
If x1, ..., xp are linearly independent and are such that span(x1, ..., xp) = V, then the projection onto V can be represented as
H = X(XᵀX)⁻¹Xᵀ,
where X is the matrix with columns x1, ..., xp.

Reminder: Subspaces, Spectra, Projections.
Proposition
Let V be a subspace and H be a projection onto V. Then I − H is the projection matrix onto V⊥.

Proof.
(I − H)ᵀ = I − Hᵀ = I − H since H is symmetric. Furthermore, (I − H)² = I² − 2H + H² = I − H since H is idempotent. Thus I − H is a projection matrix. It remains to identify the column space of I − H. Let H = UΛUᵀ be the spectral decomposition of H. Then
I − H = UUᵀ − UΛUᵀ = U(I − Λ)Uᵀ.
Hence the column space of I − H is spanned by the eigenvectors of H corresponding to zero eigenvalues of H, and this coincides with M⊥(H) = V⊥.
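The explicit representation H = X(XᵀX)⁻¹Xᵀ can be checked numerically: H is symmetric and idempotent, its eigenvalues are 0 or 1, and I − H annihilates M(X). The matrix X below is an arbitrary full-column-rank example.

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(6, 2))                   # 6 x 2, full column rank with probability 1

    H = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto M(X)

    print(np.allclose(H, H.T))                    # symmetric
    print(np.allclose(H @ H, H))                  # idempotent
    print(np.allclose(np.sort(np.linalg.eigvalsh(H)), [0, 0, 0, 0, 1, 1]))   # eigenvalues 0 or 1

    # I - H projects onto the orthogonal complement of M(X):
    v = X @ rng.normal(size=2)                    # arbitrary element of M(X)
    print(np.allclose((np.eye(6) - H) @ v, 0))    # annihilates M(X)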
Reminder: Subspaces, Spectra, Projections.
Proposition
Let V be a subspace of ℝⁿ and H be a projection onto V. Then
‖x − Hx‖ ≤ ‖x − v‖,   ∀v ∈ V.

Proof. Let H = UΛUᵀ be the spectral decomposition of H, U = (u1 ... un) and Λ = diag{λ1, ..., λn}. Letting p = dim(V),
1. λ1 = ... = λp = 1 and λp+1 = ... = λn = 0
2. u1, ..., un is an orthonormal basis of ℝⁿ
3. u1, ..., up is an orthonormal basis of V.

Reminder: Subspaces, Spectra, Projections.
(proof continued).
Notice that, for v ∈ V, vᵀui = 0 whenever i > p (since such ui are orthogonal to V). Hence
‖x − Hx‖² = Σ_{i=1}^n (xᵀui − (Hx)ᵀui)²                        [orthonormal basis]
          = Σ_{i=1}^n (xᵀui − xᵀHui)²                           [H is symmetric]
          = Σ_{i=1}^n (xᵀui − λi xᵀui)²                         [u's are eigenvectors of H]
          = 0 + Σ_{i=p+1}^n (xᵀui)²                             [eigenvalues 0 or 1]
          ≤ Σ_{i=1}^p (xᵀui − vᵀui)² + Σ_{i=p+1}^n (xᵀui − vᵀui)²   [vᵀui = 0 for i > p]
          = ‖x − v‖².
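A tiny numerical check of the minimising property, with an arbitrary X, x, and randomly sampled competitors v ∈ V:

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(6, 2))
    H = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto V = M(X)

    x = rng.normal(size=6)
    best = np.linalg.norm(x - H @ x)
    others = [np.linalg.norm(x - X @ rng.normal(size=2)) for _ in range(1000)]
    print(best <= min(others))                    # True: ||x - Hx|| <= ||x - v|| for every sampled v in V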
Positive-Definite Matrices
A p × p real symmetric matrix Ω is called non-negative definite (denoted Ω ⪰ 0) if and only if xᵀΩx ≥ 0 for all x ∈ ℝᵖ. If xᵀΩx > 0 for all x ∈ ℝᵖ \ {0}, then we call Ω positive definite (denoted Ω ≻ 0).
An equivalent definition is as follows:
A p × p real symmetric matrix Ω is called non-negative definite (denoted Ω ⪰ 0) if and only if the eigenvalues of Ω are non-negative. If the eigenvalues of Ω are strictly positive, then Ω is called positive definite (denoted Ω ≻ 0).
Lemma (Exercise)
Prove that the two definitions are equivalent.

Covariance Matrices
Let Y = (Y1, ..., Yn)ᵀ be a random n × 1 vector such that E‖Y‖² < ∞. The covariance matrix of Y, say Ω, is the n × n symmetric matrix with entries
Ωij = Cov(Yi, Yj) = E[(Yi − E[Yi])(Yj − E[Yj])],   1 ≤ i ≤ j ≤ n.
That is, the covariance matrix encodes the variances of the coordinates of Y (on the diagonal) and the covariances between the coordinates of Y (off the diagonal). If we write
µ = E[Y] = (E[Y1], ..., E[Yn])ᵀ
for the mean vector of Y, then the covariance matrix of Y can be written as
E[(Y − µ)(Y − µ)ᵀ] = E[Y Yᵀ] − µµᵀ.
Whenever Y is a random vector, we will write Cov(Y) for the covariance matrix of Y.
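A numerical illustration of the Lemma above (the equivalence of the two definitions of non-negative definiteness), using arbitrary example matrices:

    import numpy as np

    rng = np.random.default_rng(6)

    A = rng.normal(size=(3, 3))
    Omega_nnd = A @ A.T                           # A A^T is always non-negative definite
    print(np.all(np.linalg.eigvalsh(Omega_nnd) >= -1e-10))                    # eigenvalues non-negative
    xs = rng.normal(size=(1000, 3))
    print(np.all(np.einsum('ij,jk,ik->i', xs, Omega_nnd, xs) >= -1e-10))      # x^T Omega x >= 0 for sampled x

    Omega_indef = np.diag([1.0, -0.5, 2.0])       # has a negative eigenvalue
    lam, U = np.linalg.eigh(Omega_indef)
    u = U[:, 0]                                   # eigenvector of the smallest (negative) eigenvalue
    print(u @ Omega_indef @ u < 0)                # the quadratic form is negative along it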
Covariance Matrices
Lemma
Let Y be a random d × 1 vector such that E‖Y‖² < ∞. Let µ be the mean vector and Ω be the covariance matrix of Y. If A is a p × d real matrix, the mean vector and covariance matrix of AY are given by Aµ and AΩAᵀ, respectively.
Corollary (Covariance of Projections)
Let Y be a random d × 1 vector such that E‖Y‖² < ∞. Let β, γ ∈ ℝᵈ be fixed vectors. If Ω denotes the covariance matrix of Y,
- The variance of βᵀY is given by βᵀΩβ;
- The covariance of βᵀY with γᵀY is given by γᵀΩβ.
Proof.
Exercise.

Non-negative Matrices ≡ Covariance Matrices
Proposition (Non-Negative and Covariance Matrices)
Let Ω be a real symmetric matrix. Then Ω is non-negative definite if and only if Ω is the covariance matrix of some random vector Y.
Proof.
Exercise.
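The Lemma and Corollary can be checked by simulation: for a random vector Y with covariance Ω (Gaussianity is not needed), the empirical covariance of AY is close to AΩAᵀ and the empirical variance of βᵀY is close to βᵀΩβ. The distribution of Y, the matrix A, and the vector β below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(7)
    n, d = 200_000, 3
    Omega = np.array([[2.0, 0.5, 0.0],
                      [0.5, 1.0, 0.3],
                      [0.0, 0.3, 1.5]])
    L = np.linalg.cholesky(Omega)
    Y = (L @ rng.standard_normal((d, n))).T       # rows: iid draws with covariance Omega (Gaussian for convenience)

    A = np.array([[1.0, -1.0, 0.0],
                  [0.0,  2.0, 1.0]])
    beta = np.array([1.0, 0.0, -2.0])

    print(np.allclose(np.cov(Y @ A.T, rowvar=False), A @ Omega @ A.T, atol=0.1))         # Cov(AY) ≈ A Omega A^T
    print(np.isclose(np.var(Y @ beta, ddof=1), beta @ Omega @ beta, atol=0.1))           # Var(beta^T Y) ≈ beta^T Omega beta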
Gaussian Vectors and Affine Transformations
Definition: Multivariate Gaussian Distribution
A random vector Y in ℝᵈ has the multivariate normal distribution if and only if βᵀY has the univariate normal distribution ∀ β ∈ ℝᵈ.
(recall the Cramér–Wold device, which says that the distribution of a random vector is completely determined by the distribution of all its one-dimensional projections).
How can we use this definition to determine basic properties?
Recall that the moment generating function (MGF) of a random vector W in ℝᵈ is defined as
M_W(θ) = E[e^{θᵀW}],   θ ∈ ℝᵈ,
provided the expectation exists. When the MGF exists it characterises the distribution of the random vector. Furthermore, two random vectors are independent if and only if their joint MGF is the product of their marginal MGFs.

Gaussian Vectors and Affine Transformations
Useful facts:
1. Moment generating function of Y ∼ N(µ, Ω): M_Y(u) = exp(µᵀu + ½ uᵀΩu)
2. If Y ∼ N(µ_{p×1}, Ω_{p×p}) and B_{n×p}, θ_{n×1} are given, then BY + θ ∼ N(θ + Bµ, BΩBᵀ)
3. N(µ, Ω) density, assuming Ω nonsingular:
   f_Y(y) = (2π)^{−p/2} |Ω|^{−1/2} exp(−½ (y − µ)ᵀΩ⁻¹(y − µ))
4. Constant-density isosurfaces are ellipsoidal
5. Marginals of a Gaussian are Gaussian (converse NOT true)
6. Ω diagonal ⇔ independent coordinates
7. If Y ∼ N(µ_{p×1}, Ω_{p×p}), then AY independent of BY ⇐⇒ AΩBᵀ = 0.
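The caveat in fact 5 (Gaussian marginals do not imply joint Gaussianity) can be seen with a standard counterexample, sketched below: X ∼ N(0, 1) and W = SX with S an independent random sign. Both marginals are N(0, 1), but X + W has an atom at 0, so (X, W) cannot be bivariate normal (a linear combination of a Gaussian vector is Gaussian and has no point mass).

    import numpy as np

    rng = np.random.default_rng(8)
    n = 100_000

    X = rng.standard_normal(n)
    S = rng.choice([-1.0, 1.0], size=n)           # random signs, independent of X
    W = S * X                                     # W is also N(0, 1) by symmetry

    print(np.mean(W), np.var(W))                  # approx 0 and 1: the marginal looks standard normal
    print(np.mean(X + W == 0))                    # approx 0.5: X + W has an atom at 0, so (X, W) is not jointly Gaussian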
Proposition (Property 1: Moment Generating Function)
The moment generating function of Y ∼ N(µ, Ω) is
M_Y(u) = exp(µᵀu + ½ uᵀΩu).

Proof.
Let θ ∈ ℝᵈ be an arbitrary vector. Then θᵀY is Gaussian with mean θᵀµ and variance θᵀΩθ. Hence it has moment generating function
M_{θᵀY}(t) = E[e^{tθᵀY}] = exp(t(θᵀµ) + ½ t² (θᵀΩθ)).
Now take t = 1 and observe that
M_{θᵀY}(1) = E[e^{θᵀY}] = M_Y(θ).
Combining the two, we conclude that
M_Y(θ) = exp(µᵀθ + ½ θᵀΩθ).

Proposition (Property 2: Affine Transformation)
For Y ∼ N(µ_{p×1}, Ω_{p×p}) and given B_{n×p} and θ_{n×1}, we have
BY + θ ∼ N(θ + Bµ, BΩBᵀ).

Proof.
M_{BY+θ}(u) = E[exp{uᵀ(BY + θ)}] = exp(uᵀθ) E[exp{(Bᵀu)ᵀY}]
            = exp(uᵀθ) M_Y(Bᵀu)
            = exp(uᵀθ) exp((Bᵀu)ᵀµ + ½ uᵀBΩBᵀu)
            = exp(uᵀθ + uᵀ(Bµ) + ½ uᵀBΩBᵀu)
            = exp(uᵀ(θ + Bµ) + ½ uᵀBΩBᵀu).
And this last expression is the MGF of a N(θ + Bµ, BΩBᵀ) distribution.
Proposition (Property 3: Density Function)
Let Ω_{p×p} be nonsingular. The density of N(µ_{p×1}, Ω_{p×p}) is
f_Y(y) = (2π)^{−p/2} |Ω|^{−1/2} exp(−½ (y − µ)ᵀΩ⁻¹(y − µ)).

Proof.
Let Z = (Z1, ..., Zp)ᵀ be a vector of iid N(0, 1) random variables. Then, because of independence,
(a) The density of Z is given by
f_Z(z) = Π_{i=1}^p f_{Zi}(zi) = Π_{i=1}^p (2π)^{−1/2} exp(−zi²/2) = (2π)^{−p/2} exp(−½ zᵀz).
(b) The MGF of Z is given by
M_Z(u) = E[exp(Σ_{i=1}^p ui Zi)] = Π_{i=1}^p E[exp(ui Zi)] = exp{uᵀu/2},
which is the MGF of a p-variate N(0, I) distribution.

proof continued.
(a) + (b) ⇒ the N(0, I) density is f_Z(z) = (2π)^{−p/2} exp(−½ zᵀz).
By the spectral theorem, Ω admits a square root, Ω^{1/2}. Furthermore, since Ω is nonsingular, so is Ω^{1/2}. Now observe that, from our Property 2, we have Y = Ω^{1/2}Z + µ ∼ N(µ, Ω). By the change of variables formula,
f_Y(y) = f_{Ω^{1/2}Z+µ}(y) = |Ω^{−1/2}| f_Z(Ω^{−1/2}(y − µ)) = (2π)^{−p/2} |Ω|^{−1/2} exp(−½ (y − µ)ᵀΩ⁻¹(y − µ)).
[recall that to obtain the density of W = g(X) at w, we need to evaluate f_X at g⁻¹(w) but also multiply by the Jacobian determinant of g⁻¹ at w]
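The closed-form density can be compared against scipy.stats.multivariate_normal, which implements the same formula; the µ, Ω, and evaluation point below are arbitrary.

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([1.0, -2.0])
    Omega = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
    y = np.array([0.5, -1.0])

    p = len(mu)
    quad = (y - mu) @ np.linalg.inv(Omega) @ (y - mu)
    density_formula = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.linalg.det(Omega) ** 0.5)

    print(np.isclose(density_formula, multivariate_normal(mean=mu, cov=Omega).pdf(y)))   # True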
Proposition (Property 4: Isosurfaces)
The isosurfaces of a N(µ_{p×1}, Ω_{p×p}) density are (p − 1)-dimensional ellipsoids centred at µ, with principal axes given by the eigenvectors of Ω and with anisotropies given by the ratios of the square roots of the corresponding eigenvalues of Ω.

Proof.
Exercise: Use Property 3, and the spectral theorem.

Proposition (Property 5: Coordinate Distributions)
Let Y = (Y1, ..., Yp)ᵀ ∼ N(µ_{p×1}, Ω_{p×p}). Then Yj ∼ N(µj, Ωjj).

Proof.
Observe that Yj = (0, 0, ..., 1, ..., 0, 0)Y, with the 1 in the jth position, and use Property 2.

Proposition (Property 6: Diagonal Ω ⇐⇒ Independence)
Let Y = (Y1, ..., Yp)ᵀ ∼ N(µ_{p×1}, Ω_{p×p}). Then the Yi are mutually independent if and only if Ω is diagonal.

Proof.
Suppose that the Yj are independent. Property 5 yields Yj ∼ N(µj, σj²) for some σj > 0. Thus the density of Y is
f_Y(y) = Π_{j=1}^p f_{Yj}(yj) = Π_{j=1}^p (σj√(2π))⁻¹ exp(−(yj − µj)²/(2σj²))
       = (2π)^{−p/2} |diag{σ1², ..., σp²}|^{−1/2} exp(−½ (y − µ)ᵀ diag{σ1⁻², ..., σp⁻²} (y − µ)).
Hence Y ∼ N(µ, diag{σ1², ..., σp²}), i.e. the covariance Ω is diagonal.
Conversely, assume Ω is diagonal, say Ω = diag{σ1², ..., σp²}. Then we can reverse the steps of the first part to see that the joint density f_Y(y) can be written as a product of the marginal densities f_{Yj}(yj), thus proving independence.
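A numerical sketch of Property 4 for p = 2 and an arbitrary Ω (centred at µ = 0): mapping the unit circle through U diag(√λ) traces the contour {y : yᵀΩ⁻¹y = 1}, whose principal axes are the eigenvectors of Ω with lengths proportional to the square roots of its eigenvalues.

    import numpy as np

    Omega = np.array([[3.0, 1.0],
                      [1.0, 2.0]])

    lam, U = np.linalg.eigh(Omega)                # eigenvalues (ascending); columns of U are the principal axes

    # Points on the isodensity contour {y : y^T Omega^{-1} y = 1}:
    theta = np.linspace(0, 2 * np.pi, 200)
    circle = np.vstack([np.cos(theta), np.sin(theta)])
    ellipse = U @ np.diag(np.sqrt(lam)) @ circle  # unit circle mapped to the ellipse

    # Check: every point of the ellipse satisfies the quadratic-form equation
    q = np.einsum('ij,jk,ki->i', ellipse.T, np.linalg.inv(Omega), ellipse)
    print(np.allclose(q, 1.0))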
Proposition (Property 7: AY, BY indep ⇐⇒ AΩBᵀ = 0)
Let Y ∼ N(µ_{p×1}, Ω_{p×p}) and let A_{m×p}, B_{d×p} be real matrices. Then
AY independent of BY ⇐⇒ AΩBᵀ = 0.

Proof.
It suffices to prove the result assuming µ = 0. First assume AΩBᵀ = 0. Let W be the (d + m) × 1 vector obtained by stacking BY on top of AY, and θ = (uᵀ, vᵀ)ᵀ with u ∈ ℝᵈ, v ∈ ℝᵐ. Then
M_W(θ) = E[exp{Wᵀθ}] = E[exp{YᵀBᵀu + YᵀAᵀv}] = E[exp{Yᵀ(Bᵀu + Aᵀv)}] = M_Y(Bᵀu + Aᵀv)
       = exp(½ (Bᵀu + Aᵀv)ᵀΩ(Bᵀu + Aᵀv))
       = exp(½ [uᵀBΩBᵀu + vᵀAΩAᵀv + uᵀBΩAᵀv + vᵀAΩBᵀu])   [the cross terms vanish since AΩBᵀ = 0]
       = exp(½ uᵀBΩBᵀu) exp(½ vᵀAΩAᵀv)
       = M_{BY}(u) M_{AY}(v),
i.e. the joint MGF is the product of the marginal MGFs, proving independence.

proof continued.
For the converse, assume that AY and BY are independent. Then, ∀ u, v,
M_W(θ) = M_{BY}(u) M_{AY}(v)
⇒ exp(½ [uᵀBΩBᵀu + vᵀAΩAᵀv + uᵀBΩAᵀv + vᵀAΩBᵀu]) = exp(½ uᵀBΩBᵀu) exp(½ vᵀAΩAᵀv)
⇒ exp(½ · 2 vᵀAΩBᵀu) = 1,   ∀ u, v   [since the scalar uᵀBΩAᵀv equals its transpose vᵀAΩBᵀu]
⇒ vᵀAΩBᵀu = 0,   ∀ u, v
⇒ AΩBᵀ = 0.
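A simulation sketch of Property 7 with arbitrary choices: Ω = I, A = (1, 1), B = (1, −1), so AΩBᵀ = 0 and AY, BY should be independent; empirically their correlation is near zero.

    import numpy as np

    rng = np.random.default_rng(9)
    n, p = 100_000, 2

    Omega = np.eye(p)
    Y = rng.standard_normal((n, p))               # rows iid N(0, I)

    A = np.array([[1.0, 1.0]])
    B = np.array([[1.0, -1.0]])
    print(A @ Omega @ B.T)                        # [[0.]]  =>  AY and BY independent by Property 7

    AY = (Y @ A.T).ravel()
    BY = (Y @ B.T).ravel()
    print(abs(np.corrcoef(AY, BY)[0, 1]) < 0.01)  # empirical correlation close to 0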
Gaussian Quadratic Forms and the χ² Distribution
Definition (χ² distribution)
Let Z ∼ N(0, I_{p×p}). Then ‖Z‖² is said to have the χ² distribution with p degrees of freedom.
[i.e. χ²_p is the distribution of the sum of squares of p real independent standard Gaussian r.v.'s]
Definition (F distribution)
Let V ∼ χ²_p and W ∼ χ²_q be independent random variables. Then (V/p)/(W/q) is said to have the F distribution with p and q degrees of freedom.
Gaussian Quadratic Forms and the χ² Distribution
Proposition (Gaussian Quadratic Forms)
1. If Z ∼ N(0_{p×1}, I_{p×p}) and H is a projection of rank r ≤ p, then ZᵀHZ ∼ χ²_r.
2. If Y ∼ N(µ_{p×1}, Ω_{p×p}) with Ω nonsingular, then (Y − µ)ᵀΩ⁻¹(Y − µ) ∼ χ²_p.
Exercise: Prove (1) and (2).
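A simulation sketch of claim 1 (an illustration, not a proof): project a standard Gaussian vector with a rank-r projection H and compare ZᵀHZ with the χ²_r distribution via its mean, variance, and a Kolmogorov–Smirnov statistic. The dimensions are arbitrary.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(10)
    p, r, n = 6, 2, 50_000

    X = rng.normal(size=(p, r))
    H = X @ np.linalg.inv(X.T @ X) @ X.T          # projection of rank r

    Z = rng.standard_normal((n, p))
    q = np.einsum('ij,jk,ik->i', Z, H, Z)         # Z^T H Z for each simulated Z

    print(q.mean(), q.var())                      # approx r and 2r, the mean and variance of chi^2_r
    print(stats.kstest(q, stats.chi2(df=r).cdf).pvalue)   # typically large: consistent with chi^2_r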