What is a Regression Model? Example: Honolulu tide
Transcription
What is a Regression Model?

Statistical model for Y depending on x:
- Y (random variable) ←− x (non-random variable)
Aim: understanding the effect of x on the random quantity Y.
General formulation: Y | x ∼ Distribution(g(x)).
Statistical problem: estimate (learn) g(·) from data {(x_i, y_i)}_{i=1}^n.
Use for:
- Inference
- Description
- Prediction
- Data compression (parsimonious representations)
- ...

Example: Honolulu tide (figure not included in the transcription)

Great Variety of Models

Remember the general model: Y | x ∼ Distribution(g(x)).
x can be:
- continuous, discrete, categorical, vector, ...
Distribution can be:
- Gaussian (Normal), Laplace, general exponential family, ...
The function g(·) can be:
- g(x) = β0 + β1 x,  g(x) = Σ_{k=−K}^{K} βk e^{−ikx},  a cubic spline, ...

Fundamental Case: Normal Linear Regression

Y, x ∈ R, g(x) = β0 + β1 x, Distribution = Gaussian:
  Y | x ∼ N(β0 + β1 x, σ²),
or equivalently,
  Y = β0 + β1 x + ε,  ε ∼ N(0, σ²).
Also, x could be a vector (Y, β0 ∈ R, x ∈ R^p, β ∈ R^p):
  Y | x ∼ N(β0 + β⊤x, σ²),
or equivalently,
  Y = β0 + β⊤x + ε,  ε ∼ N(0, σ²).

Example: Professor's Van (figure not included in the transcription)

The tools of the trade...

Start from the Normal linear model −→ gradually generalise...
Important features of the Normal linear model:
- Gaussian distribution
- Linearity
These two mix well together and give geometric insights to solve the estimation problem. Thus we need to revise some linear algebra and probability...
Will mostly concentrate on the Gaussian assumption, but gradually consider:
- linear Gaussian regression
- nonlinear Gaussian regression
- nonparametric Gaussian regression
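Before turning to those reminders, here is a minimal numerical sketch of the fundamental case above: simulate data from Y = β0 + β1 x + ε and estimate (β0, β1) by least squares with numpy. The simulated data, seed, and parameter values are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from Y = beta0 + beta1 * x + eps, eps ~ N(0, sigma^2)
n, beta0, beta1, sigma = 100, 1.0, 2.5, 0.7
x = rng.uniform(0, 10, size=n)
y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)

# Least-squares estimate of (beta0, beta1) via the design matrix X = [1, x]
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual-based estimate of sigma^2 (dividing by n - 2, the residual degrees of freedom)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)

print(beta_hat, sigma2_hat)   # close to (1.0, 2.5) and 0.49
```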
Reminder: Subspaces, Spectra, Projections

Recall that if Q is an n × p real matrix, we define the column space (or range) of Q to be the set spanned by its columns,
  M(Q) = {y ∈ R^n : ∃β ∈ R^p, y = Qβ}.
- Recall that M(Q) is a subspace of R^n.
- The columns of Q provide a coordinate system for the subspace M(Q).
- If Q is of full column rank (p), then the coordinates β corresponding to a y ∈ M(Q) are unique.
- This allows an interpretation of the system of linear equations Qβ = y:
  [existence of a solution ↔ is y an element of M(Q)?]
  [uniqueness of the solution ↔ is there a unique coordinate vector β?]

Theorem (The Spectral Theorem). A p × p real matrix Q is symmetric if and only if there exist a p × p orthogonal matrix U and a diagonal matrix Λ such that Q = UΛU⊤. In particular:
1. The columns of U = (u1 ... up) are eigenvectors of Q, i.e. there exist λj such that Quj = λj uj, j = 1, ..., p.
2. The entries of Λ are the corresponding eigenvalues of Q, Λ = diag{λ1, ..., λp}, which are real.
3. The rank of Q is the number of non-zero eigenvalues.
Note: if the eigenvalues are distinct, the eigenvectors are unique (up to re-ordering and sign).

Recall two further important subspaces associated with a real n × p matrix Q:
- The nullspace (or kernel) of Q is the subspace ker(Q) = {y ∈ R^p : Qy = 0}.
- The orthogonal complement of M(Q) is the subspace
  M⊥(Q) = {y ∈ R^n : y⊤Qx = 0, ∀x ∈ R^p} = {y ∈ R^n : y⊤v = 0, ∀v ∈ M(Q)}.
Obviously, the orthogonal complement is defined for arbitrary subspaces (see the second equality).

Theorem (Singular Value Decomposition). Any n × p real matrix Q can be factorised as
  Q_{n×p} = U_{n×n} Σ_{n×p} V⊤_{p×p},
where U and V are orthogonal, with columns called left singular vectors and right singular vectors respectively, and Σ is diagonal with real entries called singular values. In particular:
1. The left singular vectors are eigenvectors of QQ⊤.
2. The right singular vectors are eigenvectors of Q⊤Q.
3. The squares of the singular values are eigenvalues of both QQ⊤ and Q⊤Q.
4. The left singular vectors corresponding to non-zero singular values form an orthonormal basis for M(Q).
5. The left singular vectors corresponding to zero singular values form an orthonormal basis for M⊥(Q).

Recall that a matrix Q is called idempotent if Q² = Q. An orthogonal projection (henceforth: projection) onto a subspace V is a symmetric idempotent matrix H such that M(H) = V.

Proposition. The only possible eigenvalues of a projection matrix are 0 and 1.

Proposition. If P and Q are projection matrices onto the same subspace V, then P = Q.

Proposition. Let V be a subspace and H be a projection onto V. Then I − H is the projection matrix onto V⊥.

Proof. (I − H)⊤ = I − H⊤ = I − H since H is symmetric. Furthermore, (I − H)² = I² − 2H + H² = I − H since H is idempotent. Thus I − H is a projection matrix. It remains to identify the column space of I − H. Let H = UΛU⊤ be the spectral decomposition of H. Then
  I − H = UU⊤ − UΛU⊤ = U(I − Λ)U⊤.
Hence the column space of I − H is spanned by the eigenvectors of H corresponding to zero eigenvalues of H, and this coincides with M⊥(H) = V⊥.

Proposition. If x1, ..., xp are linearly independent and such that span(x1, ..., xp) = V, then the projection onto V can be represented as
  H = X(X⊤X)⁻¹X⊤,
where X is the matrix with columns x1, ..., xp.

Proposition. Let V be a subspace of R^n and H be a projection onto V. Then
  ‖x − Hx‖ ≤ ‖x − v‖,  ∀v ∈ V.

Proof. Let H = UΛU⊤ be the spectral decomposition of H, with U = (u1 ... un) and Λ = diag{λ1, ..., λn}. Letting p = dim(V):
1. λ1 = ... = λp = 1 and λ_{p+1} = ... = λn = 0;
2. u1, ..., un is an orthonormal basis of R^n;
3. u1, ..., up is an orthonormal basis of V.
Notice that for v ∈ V we have v⊤ui = 0 for i > p, and therefore
  ‖x − Hx‖² = Σ_{i=1}^n (x⊤ui − (Hx)⊤ui)²      [orthonormal basis]
            = Σ_{i=1}^n (x⊤ui − x⊤Hui)²        [H is symmetric]
            = Σ_{i=1}^n (x⊤ui − λi x⊤ui)²      [u's are eigenvectors of H]
            = 0 + Σ_{i=p+1}^n (x⊤ui)²          [eigenvalues 0 or 1]
            ≤ Σ_{i=1}^p (x⊤ui − v⊤ui)² + Σ_{i=p+1}^n (x⊤ui)²
            = Σ_{i=1}^n (x⊤ui − v⊤ui)²         [v⊤ui = 0 for i > p]
            = ‖x − v‖².
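A small numerical check of the projection facts just stated, sketched with numpy; the random matrix X and vector x are illustrative choices, not from the lecture. It builds H = X(X⊤X)⁻¹X⊤ and verifies symmetry, idempotence, the 0/1 eigenvalues, and the closest-point property.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 3

# Columns of X: p linearly independent vectors spanning V = M(X) (illustrative random choice)
X = rng.normal(size=(n, p))
H = X @ np.linalg.solve(X.T @ X, X.T)        # H = X (X^T X)^{-1} X^T

print(np.allclose(H, H.T))                    # symmetric
print(np.allclose(H @ H, H))                  # idempotent
print(np.round(np.linalg.eigvalsh(H), 6))     # eigenvalues are 0 or 1; exactly p of them equal 1

# Closest-point property: ||x - Hx|| <= ||x - v|| for any v in V = M(X)
x = rng.normal(size=n)
dists = [np.linalg.norm(x - X @ rng.normal(size=p)) for _ in range(1000)]
print(np.linalg.norm(x - H @ x) <= min(dists))   # True
```

As a side note, forming (X⊤X)⁻¹ explicitly is done here only for transparency; in practice a QR factorisation or a least-squares solver is the numerically preferable route to Hx.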
Positive-Definite Matrices

A p × p real symmetric matrix Ω is called non-negative definite (denoted Ω ⪰ 0) if and only if x⊤Ωx ≥ 0 for all x ∈ R^p. If x⊤Ωx > 0 for all x ∈ R^p \ {0}, then we call Ω positive definite (denoted Ω ≻ 0).

An equivalent definition is as follows: a p × p real symmetric matrix Ω is called non-negative definite if and only if the eigenvalues of Ω are non-negative. If the eigenvalues of Ω are strictly positive, then Ω is called positive definite.

Lemma (Exercise). Prove that the two definitions are equivalent.

Covariance Matrices

Let Y = (Y1, ..., Yn)⊤ be a random n × 1 vector such that E‖Y‖² < ∞. The covariance matrix of Y, say Ω, is the n × n symmetric matrix with entries
  Ωij = Cov(Yi, Yj) = E[(Yi − E[Yi])(Yj − E[Yj])],  1 ≤ i ≤ j ≤ n.
That is, the covariance matrix encodes the variances of the coordinates of Y (on the diagonal) and the covariances between the coordinates of Y (off the diagonal).
If we write µ = E[Y] = (E[Y1], ..., E[Yn])⊤ for the mean vector of Y, then the covariance matrix of Y can be written as
  E[(Y − µ)(Y − µ)⊤] = E[YY⊤] − µµ⊤.
Whenever Y is a random vector, we will write Cov(Y) for the covariance matrix of Y.

Lemma. Let Y be a random d × 1 vector such that E‖Y‖² < ∞. Let µ be the mean vector and Ω be the covariance matrix of Y. If A is a p × d real matrix, the mean vector and covariance matrix of AY are given by Aµ and AΩA⊤, respectively.

Proof. Exercise.

Corollary (Covariance of Projections). Let Y be a random d × 1 vector such that E‖Y‖² < ∞. Let β, γ ∈ R^d be fixed vectors. If Ω denotes the covariance matrix of Y,
- the variance of β⊤Y is given by β⊤Ωβ;
- the covariance of β⊤Y with γ⊤Y is given by γ⊤Ωβ.

Proof. Exercise.

Non-negative Matrices ≡ Covariance Matrices

Proposition (Non-Negative and Covariance Matrices). Let Ω be a real symmetric matrix. Then Ω is non-negative definite if and only if Ω is the covariance matrix of some random vector Y.

Proof. Exercise.

Gaussian Vectors and Affine Transformations

Definition (Multivariate Gaussian Distribution). A random vector Y in R^d has the multivariate normal distribution if and only if β⊤Y has the univariate normal distribution for all β ∈ R^d. (Recall the Cramér–Wold device, which says that the distribution of a random vector is completely determined by the distribution of all its one-dimensional projections.)

How can we use this definition to determine basic properties? Recall that the moment generating function (MGF) of a random vector W in R^d is defined as
  M_W(θ) = E[e^{θ⊤W}],  θ ∈ R^d,
provided the expectation exists. When the MGF exists it characterises the distribution of the random vector. Furthermore, two random vectors are independent if and only if their joint MGF is the product of their marginal MGFs.

Useful facts:
1. Moment generating function of Y ∼ N(µ, Ω): M_Y(u) = exp(µ⊤u + ½ u⊤Ωu).
2. If Y ∼ N(µ_{p×1}, Ω_{p×p}) and B_{n×p}, θ_{n×1} are given, then BY + θ ∼ N(θ + Bµ, BΩB⊤).
3. The N(µ, Ω) density, assuming Ω nonsingular, is f_Y(y) = (2π)^{−p/2}|Ω|^{−1/2} exp(−½ (y − µ)⊤Ω⁻¹(y − µ)).
4. Constant-density isosurfaces are ellipsoidal.
5. Marginals of a Gaussian are Gaussian (the converse is NOT true).
6. Ω diagonal ⇔ independent coordinates.
7. If Y ∼ N(µ_{p×1}, Ω_{p×p}), then AY is independent of BY ⇐⇒ AΩB⊤ = 0.

Proposition (Property 1: Moment Generating Function). The moment generating function of Y ∼ N(µ, Ω) is
  M_Y(u) = exp(µ⊤u + ½ u⊤Ωu).

Proof. Let θ ∈ R^d be an arbitrary vector. Then θ⊤Y is Gaussian with mean θ⊤µ and variance θ⊤Ωθ. Hence it has moment generating function
  M_{θ⊤Y}(t) = E[e^{tθ⊤Y}] = exp(t(θ⊤µ) + (t²/2)(θ⊤Ωθ)).
Now take t = 1 and observe that
  M_{θ⊤Y}(1) = E[e^{θ⊤Y}] = M_Y(θ).
Combining the two, we conclude that
  M_Y(θ) = exp(µ⊤θ + ½ θ⊤Ωθ).

Proposition (Property 2: Affine Transformation). For Y ∼ N(µ_{p×1}, Ω_{p×p}) and given B_{n×p} and θ_{n×1}, we have BY + θ ∼ N(θ + Bµ, BΩB⊤).

Proof.
  M_{BY+θ}(u) = E[exp{u⊤(BY + θ)}]
              = exp{u⊤θ} E[exp{(B⊤u)⊤Y}]
              = exp{u⊤θ} M_Y(B⊤u)
              = exp{u⊤θ} exp{(B⊤u)⊤µ + ½ u⊤BΩB⊤u}
              = exp{u⊤θ + u⊤(Bµ) + ½ u⊤BΩB⊤u}
              = exp{u⊤(θ + Bµ) + ½ u⊤BΩB⊤u},
and this last expression is the MGF of a N(θ + Bµ, BΩB⊤) distribution.
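A Monte Carlo sanity check of Property 2 (and of the lemma Cov(AY) = AΩA⊤), sketched with numpy under illustrative choices of µ, Ω, B and θ: the empirical mean and covariance of BY + θ should be close to θ + Bµ and BΩB⊤.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n_sim = 3, 200_000

# An illustrative mean vector and positive definite covariance matrix
mu = np.array([1.0, -2.0, 0.5])
M = rng.normal(size=(p, p))
Omega = M @ M.T + np.eye(p)

# An arbitrary affine map BY + theta used for the check
B = rng.normal(size=(2, p))
theta = np.array([10.0, -5.0])

Y = rng.multivariate_normal(mu, Omega, size=n_sim)
W = Y @ B.T + theta                                  # rows are samples of BY + theta

print(np.round(W.mean(axis=0), 2), np.round(theta + B @ mu, 2))   # empirical vs. theoretical mean
print(np.round(np.cov(W, rowvar=False), 2))                        # empirical covariance ...
print(np.round(B @ Omega @ B.T, 2))                                # ... vs. B Omega B^T
```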
Proposition (Property 3: Density Function). Let Ω_{p×p} be nonsingular. The density of N(µ_{p×1}, Ω_{p×p}) is
  f_Y(y) = (2π)^{−p/2}|Ω|^{−1/2} exp(−½ (y − µ)⊤Ω⁻¹(y − µ)).

Proof. Let Z = (Z1, ..., Zp)⊤ be a vector of iid N(0, 1) random variables. Then, because of independence:
(a) The density of Z is given by
  f_Z(z) = ∏_{i=1}^p f_{Zi}(zi) = ∏_{i=1}^p (1/√(2π)) exp(−½ zi²) = (2π)^{−p/2} exp(−½ z⊤z).
(b) The MGF of Z is given by
  M_Z(u) = E[exp(Σ_{i=1}^p ui Zi)] = ∏_{i=1}^p E[exp(ui Zi)] = exp{u⊤u/2},
which is the MGF of a p-variate N(0, I) distribution.
(a) + (b) ⟹ the N(0, I) density is f_Z(z) = (2π)^{−p/2} exp(−½ z⊤z).
By the spectral theorem, Ω admits a square root, Ω^{1/2}. Furthermore, since Ω is non-singular, so is Ω^{1/2}. Now observe that, from Property 2, we have Y = Ω^{1/2}Z + µ ∼ N(µ, Ω). By the change of variables formula,
  f_Y(y) = f_{Ω^{1/2}Z+µ}(y) = |Ω^{−1/2}| f_Z(Ω^{−1/2}(y − µ)) = (2π)^{−p/2}|Ω|^{−1/2} exp(−½ (y − µ)⊤Ω⁻¹(y − µ)).
[Recall that to obtain the density of W = g(X) at w, we need to evaluate f_X at g⁻¹(w) but also multiply by the Jacobian determinant of g⁻¹ at w.]

Proposition (Property 4: Isosurfaces). The isosurfaces of a N(µ_{p×1}, Ω_{p×p}) density are (p − 1)-dimensional ellipsoids centred at µ, with principal axes given by the eigenvectors of Ω and with anisotropies given by the ratios of the square roots of the corresponding eigenvalues of Ω.

Proof. Exercise: use Property 3 and the spectral theorem.

Proposition (Property 5: Coordinate Distributions). Let Y = (Y1, ..., Yp)⊤ ∼ N(µ_{p×1}, Ω_{p×p}). Then Yj ∼ N(µj, Ωjj).

Proof. Observe that Yj = (0, 0, ..., 1, ..., 0, 0)Y, with the 1 in the jth position, and use Property 2.

Proposition (Property 6: Diagonal Ω ⇔ Independence). Let Y = (Y1, ..., Yp)⊤ ∼ N(µ_{p×1}, Ω_{p×p}). Then the Yi are mutually independent if and only if Ω is diagonal.

Proof. Suppose that the Yj are independent. Property 5 yields Yj ∼ N(µj, σj²) for some σj > 0. Thus the density of Y is
  f_Y(y) = ∏_{j=1}^p f_{Yj}(yj) = ∏_{j=1}^p (1/(σj√(2π))) exp(−(yj − µj)²/(2σj²))
         = (2π)^{−p/2} |diag{σ1², ..., σp²}|^{−1/2} exp(−½ (y − µ)⊤ diag{σ1⁻², ..., σp⁻²}(y − µ)).
Hence Y ∼ N(µ, diag{σ1², ..., σp²}), i.e. the covariance Ω is diagonal. Conversely, assume Ω is diagonal, say Ω = diag{σ1², ..., σp²}. Then we can reverse the steps of the first part to see that the joint density f_Y(y) can be written as the product of the marginal densities f_{Yj}(yj), thus proving independence.

Proposition (Property 7: AY, BY independent ⇐⇒ AΩB⊤ = 0). Let Y ∼ N(µ_{p×1}, Ω_{p×p}) and let A_{m×p}, B_{d×p} be real matrices. Then AY is independent of BY ⇐⇒ AΩB⊤ = 0.

Proof. It suffices to prove the result assuming µ = 0. First assume AΩB⊤ = 0. Let W_{(m+d)×1} = ((BY)⊤, (AY)⊤)⊤ and θ_{(m+d)×1} = (u⊤, v⊤)⊤, partitioned conformably. Then
  M_W(θ) = E[exp{W⊤θ}] = E[exp{Y⊤B⊤u + Y⊤A⊤v}]
         = E[exp{Y⊤(B⊤u + A⊤v)}] = M_Y(B⊤u + A⊤v)
         = exp(½ (B⊤u + A⊤v)⊤Ω(B⊤u + A⊤v))
         = exp(½ [u⊤BΩB⊤u + v⊤AΩA⊤v + u⊤BΩA⊤v + v⊤AΩB⊤u]).
The two cross terms vanish, since AΩB⊤ = 0 and hence BΩA⊤ = (AΩB⊤)⊤ = 0, so
  M_W(θ) = exp(½ u⊤BΩB⊤u) exp(½ v⊤AΩA⊤v) = M_{BY}(u) M_{AY}(v),
i.e. the joint MGF is the product of the marginal MGFs, proving independence.
For the converse, assume that AY and BY are independent. Then, for all u, v,
  M_W(θ) = M_{BY}(u) M_{AY}(v)
  ⟹ exp(½ [u⊤BΩB⊤u + v⊤AΩA⊤v + u⊤BΩA⊤v + v⊤AΩB⊤u]) = exp(½ u⊤BΩB⊤u) exp(½ v⊤AΩA⊤v)
  ⟹ exp(½ · 2 v⊤AΩB⊤u) = 1   [using u⊤BΩA⊤v = v⊤AΩB⊤u]
  ⟹ v⊤AΩB⊤u = 0, ∀ u, v
  ⟹ AΩB⊤ = 0.
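A sketch of Property 7 in action, assuming numpy and illustrative choices of Ω and B: the matrix A is constructed (via an SVD null-space computation) so that AΩB⊤ = 0, and the sample cross-covariance of AY and BY should then be near zero.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3

# An illustrative positive definite covariance matrix and zero mean (as in the proof)
M = rng.normal(size=(p, p))
Omega = M @ M.T + p * np.eye(p)
mu = np.zeros(p)

B = rng.normal(size=(2, p))              # arbitrary B (2 x p)
# Choose the rows of A orthogonal to the columns of Omega B^T, i.e. in the null space of B Omega
U, s, Vt = np.linalg.svd(B @ Omega)
A = Vt[2:]                               # right singular vectors beyond the rank span that null space

print(np.allclose(A @ Omega @ B.T, 0))   # True by construction

# Monte Carlo: sample Y ~ N(mu, Omega) and check that AY and BY are empirically uncorrelated
Y = rng.multivariate_normal(mu, Omega, size=200_000)      # shape (n, p)
AY, BY = Y @ A.T, Y @ B.T
cross_cov = (AY - AY.mean(0)).T @ (BY - BY.mean(0)) / (len(Y) - 1)
print(np.round(cross_cov, 2))            # close to the zero matrix
```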
Gaussian Quadratic Forms and the χ² Distribution

Definition (χ² distribution). Let Z ∼ N(0, I_{p×p}). Then ‖Z‖² is said to have the χ² distribution with p degrees of freedom. [That is, χ²_p is the distribution of the sum of squares of p independent real standard Gaussian random variables.]

Definition (F distribution). Let V ∼ χ²_p and W ∼ χ²_q be independent random variables. Then (V/p)/(W/q) is said to have the F distribution with p and q degrees of freedom.

Proposition (Gaussian Quadratic Forms).
1. If Z ∼ N(0_{p×1}, I_{p×p}) and H is a projection of rank r ≤ p, then Z⊤HZ ∼ χ²_r.
2. If Y ∼ N(µ_{p×1}, Ω_{p×p}) with Ω nonsingular, then (Y − µ)⊤Ω⁻¹(Y − µ) ∼ χ²_p.

Exercise: prove (1) and (2).
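A simulation sketch (not a proof) of the two quadratic-form claims, with numpy and illustrative choices of H, Ω and µ: the empirical mean and variance of Z⊤HZ should be close to r and 2r (the χ²_r moments), and those of (Y − µ)⊤Ω⁻¹(Y − µ) close to p and 2p.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sim, p, r = 100_000, 5, 2

# A rank-r projection: H = X (X^T X)^{-1} X^T for an arbitrary full-rank p x r matrix X
X = rng.normal(size=(p, r))
H = X @ np.linalg.solve(X.T @ X, X.T)

# Claim 1: Z ~ N(0, I_p)  =>  Z^T H Z ~ chi^2_r
Z = rng.standard_normal((n_sim, p))
q1 = np.einsum('ij,jk,ik->i', Z, H, Z)          # Z_i^T H Z_i for each simulated Z_i
print(q1.mean(), q1.var())                       # approx r and 2r

# Claim 2: Y ~ N(mu, Omega), Omega nonsingular  =>  (Y - mu)^T Omega^{-1} (Y - mu) ~ chi^2_p
M = rng.normal(size=(p, p))
Omega = M @ M.T + np.eye(p)
mu = rng.normal(size=p)
Y = rng.multivariate_normal(mu, Omega, size=n_sim)
q2 = np.einsum('ij,jk,ik->i', Y - mu, np.linalg.inv(Omega), Y - mu)
print(q2.mean(), q2.var())                       # approx p and 2p
```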