Course compendium, Applied matrix analysis November 14, 2014 Christopher Engstr¨
Transcription
Course compendium, Applied matrix analysis November 14, 2014 Christopher Engstr¨
Course compendium, Applied matrix analysis Christopher Engstr¨ om Karl Lundeng˚ ard Sergei Silvestrov November 14, 2014 Contents 1 Review of basic linear algebra 1.1 Notation and elementary matrix operations 1.2 Linear equation systems . . . . . . . . . . . 1.2.1 Determinants . . . . . . . . . . . . . 1.3 Eigenvalues and eigenvectors . . . . . . . . 1.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 . 6 . 8 . 10 . 13 . 14 2 Matrix factorization and canonical forms 2.1 Important types of matrices and some useful properties . 2.1.1 Triangular and Hessenberg matrices . . . . . . . . 2.1.2 Hermitian matrices . . . . . . . . . . . . . . . . . . 2.1.3 Unitary matrices . . . . . . . . . . . . . . . . . . . 2.1.4 Positive definite matrices . . . . . . . . . . . . . . 2.2 Factorization . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Spectral factorization . . . . . . . . . . . . . . . . 2.2.2 Rank factorization . . . . . . . . . . . . . . . . . . 2.2.3 LU and Cholesky factorization . . . . . . . . . . . 2.2.4 QR factorization . . . . . . . . . . . . . . . . . . . 2.2.5 Canonical forms . . . . . . . . . . . . . . . . . . . 2.2.6 Reduced row echelon form . . . . . . . . . . . . . . 2.2.7 Jordan normal form . . . . . . . . . . . . . . . . . 2.2.8 Singular value decomposition (SVD) . . . . . . . . 2.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Portfolio evaluation using Monte Carlo simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 17 17 18 18 19 19 20 20 21 22 22 23 23 24 25 25 3 Non-negative matrices and graphs 3.1 Introduction to graphs . . . . . . . . . 3.2 Connectivity and irreducible matrices 3.3 The Laplacian matrix . . . . . . . . . 3.4 Exercises . . . . . . . . . . . . . . . . 3.5 Applications . . . . . . . . . . . . . . . 3.5.1 Shortest path . . . . . . . . . . 3.5.2 Resistance distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 27 30 33 35 35 35 36 . . . . 39 39 42 43 43 . . . . . . . . . . . . . . . . . . . . . 4 Perron-Frobenius theory 4.1 Perron-Frobenius for non negative matrices 4.2 Exercises . . . . . . . . . . . . . . . . . . . 4.3 Applications . . . . . . . . . . . . . . . . . . 4.3.1 Leontief input-output model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Stochastic matrices and Perron-Frobenius 46 5.1 Stochastic matrices and Markov chains . . . . . . . . . . . . . . . . . . . . 46 1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 Irreducible, primitive stochastic matrices . . . . Irreducible, imprimitive stochastic matrices . . Reducible, stochastic matrices . . . . . . . . . . Hitting times and hitting probabilities . . . . . A short look at continuous time Markov chains Exercises . . . . . . . . . . . . . . . . . . . . . Applications . . . . . . . . . . . . . . . . . . . . 5.8.1 Return to the Leontief model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 50 51 53 55 57 58 58 6 Linear spaces and projections 6.1 Linear spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Inner product spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4.1 Fourier approximation . . . . . . . . . . . . . . . . . . . . . . . 6.4.2 Finding a QR decomposition using the Gram-Schmidt process 6.4.3 Principal component analysis (PCA) and dimension reduction . . . . . . . . . . . . . . . 60 60 66 70 74 74 75 77 7 Linear transformations 7.1 Linear transformations in geometry . . . . . . . . . . . . . . . . . . . . . 7.2 surjection,injection and combined transformations . . . . . . . . . . . . 7.3 Application: Transformations in computer graphics and homogeneous coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4 Kernel and image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5 Isomorphisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Least Square Method (LSM) 8.1 Finding the coefficients . . . . . . . . . . 8.2 Generalized LSM . . . . . . . . . . . . . 8.3 Weighted Least Squares Method (WLS) 8.4 Pseudoinverses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 . 82 . 84 . 85 . 88 . 91 . . . . 93 95 96 97 98 9 Matrix functions 100 9.1 Matrix power function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 9.1.1 Calculating An . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 9.2 Matrix root function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 9.2.1 Root of a diagonal matrix . . . . . . . . . . . . . . . . . . . . . . . 103 9.2.2 Finding the square root of a matrix using the Jordan canonical form104 9.3 Matrix polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 9.3.1 The Cayley-Hamilton theorem . . . . . . . . . . . . . . . . . . . . 106 9.3.2 Calculating matrix polynomials . . . . . . . . . . . . . . . . . . . . 109 9.3.3 Companion matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 110 9.4 Exponential matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 9.4.1 Solving matrix differential equations . . . . . . . . . . . . . . . . . 114 2 9.5 General matrix functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 10 Matrix equations 116 10.1 Tensor products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 10.2 Solving matrix equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 11 Numerical methods for computing eigenvalues and 11.1 The Power method . . . . . . . . . . . . . . . . 11.1.1 Inverse Power method . . . . . . . . . . 11.2 The QR-method . . . . . . . . . . . . . . . . . 11.3 Applications . . . . . . . . . . . . . . . . . . . . 11.3.1 PageRank . . . . . . . . . . . . . . . . . Index eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 . 121 . 124 . 126 . 129 . 129 133 3 4 Notation v vector (usually represented by a row or column matrix) |v| length (Euclidean norm) of a vector kvk norm of a vector M matrix M > transpose M −1 inverse M ∗ pseudoinverse M complex conjugate M H Hermitian transpose M k the k:th power of matrix M, see section 9.1 mi,j The element on the i:th row and j:th column of M (k) mi,j The element on the i:th row and j:th column of Mk Mi. i:th row of M M.j j:th column of M ⊗ Kroenecker (tensor) product a scalar |a| absolute value of a scalar h·, ·i Z + inner product (scalar product) set of integers Z set of positive integers R set of real numbers C set of complex numbers Mm×n (K) set of matrices with dimension m × n and elements from the set K 5 1 Review of basic linear algebra The reader should already be familiar with the concepts and content in this chapter, instead it serves more as a reminder and overview of what we already know. We will start by giving some notation used throughout the book as well as reminding ourselves of the most basic matrix operations. We will then take a look at linear systems, eigenvalues and eigenvectors and related concepts as seen in elementary courses in linear algebra. While the methods presented here works, they are often very slow or unstable when used on a computer. Much of our later work will relate to how to find better ways to find these or similar quantities or give other examples of where and why they are useful. 1.1 Notation and elementary matrix operations We will here go through the most elementary matrix operations as well as give the notation which will be used throughout the book. We remind ourselves that a matrix is a rectangular array of numbers, symbols or expressions. Some example of matrices can be seen below: 1 0 A= 0 4 1 −2 C = −2 3 3 1 B= 0 1−i 1 0 0 D= 0 3 0 We note that matrices need not be square as in C and D, we also note that although matrices can be complex valued, in most of our examples or practical applications we need only to work with real valued matrices. • We denote every element of A as ai,j , where i is it’s row number and j is it’s column number. For example looking at C we have c3,2 = 1. • The size of a matrix is the number of rows and columns (or the index of the bottom right element): We say that C above is a 3 × 2 matrix. • A matrix with only one row or column is called a row vector or column vector respectively. We usually denote vectors with an arrow v and we assume all vectors are column vectors unless stated otherwise. • The diagonal of a matrix is the elements diagonally from the top left corner towards the bottom left (a1,1 , a2,2 , . . . an,n ). A matrix where all elements except those on the diagonal is zero is called a diagonal matrix . For example A is a diagonal matrix, but so is D although we will usually only consider square diagonal matrices. • The trace of a matrix is the sum of the elements on the diagonal. 6 Matrix addition Add every element in the first matrix with correspondign element in the second matrix. Both matrices need to be of the same size for addition to be defined. 1 0 1 −2 1+1 0−2 2 −2 + = = 0 2 2 −1 0+2 2−1 2 1 Although addition between a scalar and a matrix is undefined, sometimes authors write it as such in which case they usually means to add the scalar to every element of the matrix as if the scalar was a matrix of the same size with every element equal to the constant. The general definition of matrix addition is: Let A = B + C where all three matrices have the number of rows and columns. Then ai,j = bi,j + ci,j for every element in A. Matrix multiplication Given the two matrices A and B we get the product AB as the matrix whose elements ei,j are found by multiplying the elements in row i of A with the elements of column j of B and adding the results. 11 10 7 14 3 2 0 1 3 2 0 1 1 2 3 4 1 2 3 4 5 9 16 24 0 1 2 1 0 1 2 1 = 1 4 8 9 0 1 5 10 0 0 1 3 0 0 1 3 For the colored element we have: 1 · 3 + 2 · 1 + 3 · 0 + 4 · 0 = 5 We note that we need the number of columns in A to be the same as the number of rows in B but A can have any number of rows and B can have any number of columns. Also we have: • Generally AB 6= BA, why? • The size of AB is the number of rows of A times the number of columns of B. • The Identity matrix I is the matrix with ones on the diagonal and zero elsewere. For the Identity matrix we have: AI = IA = A. Multiplying a scalar with a matrix is done by multiplying the scalar with every element of the matrix. 1 0 3·1 3·0 3 0 3 = = 0 2 3·0 3·2 0 6 The general definition of matrix multiplication is: Let A = BC then j X ai,j = bi,k ck,j k=1 for each element in A. 7 Matrix transpose The transpose of a matrix is obtained by flipping the rows and columns such that the elements of the first row is now the elements of the first column. For example given matrix A below: 1 −2 A = −1 0 3 1 We get A transpose, denoted A> as: A> = 1 −1 3 −2 0 1 The general definition of the transpose is: Let B = A> then bi,j = aj,i for all elements in B. For complex valued matrices we usually instead talk about the Hermitian transpose where we first take the transpose and then take the complex conjugate (change sign of imaginary part). We usually denote this by AH (sometimes this is denoted by A∗ but here we will reserve that notation for pseudoinverses, see section 8.4). The general definition of the Hermitian transpose (also known as Hermitian conjugate or conjugate transpose) is: Let B = A> then bi,j = a∗j,i , where ∗ denotes the complex conjugate, for all elements in B. A real matrix where A = A> is called a symmetric matrix and a complex matrix where AH = A we call a hermitian matrix . 1.2 Linear equation systems Very often in applications we end up with a system of linear equations which we want to solve or find an as good solution as possible. Either repeatedly for the same or similar systems or only once. Matrices presents presents powerful tools which we can use to solve these problems, in fact even if we have a nonlinear system, we very often approximate it with a linear system and try to solve that instead. Linear systems will be treated repeatedly throughout the book, but here we take a short look at the problem itself, when we can solve it, and how from what we should be familiar with from courses in linear algebra. Given a linear equation system a1,1 x1 + a1,2 x2 + . . . a2,1 x1 + a2,2 x2 + . . . .. .. .. . . . am,1 x1 + am,2 x2 + . . . 8 + + a1,n xn a2,n xn .. . = = b1 , b2 , .. . + am,n xn = bm , we can represent it using a matrix and column vectors b1 x1 a1,1 a1,2 . . . a1,n a2,1 a2,2 . . . a2,n x2 b2 .. .. .. .. = .. . . . . . . . . bm xm am,1 am,2 . . . am,n Or in a more compact form as Ax = b. Solving this system of equations is the same as finding an inverse matrix A−1 such that A−1 A = I and multiply both expressions from the left with this matrix. The reader should be familiar with using Gaussian elimination to write the system in a easy to solve form, or going a step further and finding the inverse A−1 using Gauss-Jordan elimination. We remember for Gaussian elimination using elementary row operations (adding and multiplying rows) trying to find pivot elements in order to change our matrix to diagonal/row echelon form . F F F F F F F 0 0 0 F F F F F 0 0 0 0 F F F F 0 0 0 0 0 F 0 F F 0 0 0 0 0 0 F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 After we have done this we can easily solve the system as long as it is solvable. Not all systems can be solved exactly, for example the system below is obviously unsolvable: 1 2 x1 1 = 1 2 x2 2 For square matrices we can go one step further by using the Gauss-Jordan elimination reducing the elements above the diagonal to zero as well in the same manner. Similarly some systems have more than one solution. For example will any point on the line y = b−1 (c − ax) solve the linear equation system 4 −1 x1 2 = −8 2 x2 −4 We are often interested in if there is a single unique solution to a linear equation system. Theorem 1.1. The following statements are equivalent for A ∈ Mn×n (K): 1. AX = B has a unique solution for all B ∈ Mn×m where X ∈ Mn×m . 2. AX = 0 only has the trivial solution (X = 0). 3. A has linearly independent row/column vectors. 9 4. A is invertible/non-degenerate/non-singular. 5. det(A) 6= 0. 6. A has maximum rank (rank(A) = n). 7. A’s row/column space has dimension n. 8. A’s null space has dimension 0. 9. A’s row/column vectors span K Most of these statements should be familiar already, some will be looked at more in depth later, especially when we look at linear spaces in chapter 6. 1.2.1 Determinants One of the ways to guarantee that a linear system Ax = b has a unique solution is that the determinant is non-zero det(A) 6= 0. We remember the determinant as the product of the eigenvalues. For a small matrix we can easily calculate the determinant by hand, we had the formulas for 2 × 2 and 3 × 3 matrices as: a b c d = ad − bc a b c d e f = aei − af h − bdi + bf g + cdh − ceg g h i And for larger matrices we could write it as a sum of determinants of smaller matrices: a b c d e f = a e f − b d f + c d e h i g i g h g h i And in general we get the determinant function det : Mn×n (K) 7→ K det(A) = det(A.1 , A.2 , . . . , A.k , . . . , A.n ) Some properties of the determinant: • det A> = det A • det AB = det A det B but det(A + B) 6= det(A) + det(B). • det cA = cn det A • det is multilinear 10 • det is alternating • det(I) = 1 With multilinear we mean that it’s linear in each argument det(A.1 , A.2 , . . . , a · A.k + b · B, . . . , A.n ) = = a det(A.1 , A.2 , . . . , A.k , . . . , A.n ) + b det(A.1 , A.2 , . . . B, . . . A.n ) For example if A is a 3 × 3 matrix and B, C are 3 × 1 matrices. Let M = [A.1 A.2 + B A.3 + C] then det(M) = det(A.1 , A.2 + B, A.3 + C) = = det(A.1 , A.2 , A.3 + C) + det(A.1 , B, A.3 + C) = = det(A) + det(A.1 , B, A.3 ) + det(A.1 , A.2 , C) + det(A.1 , B, C) Since the determinant is alternating, switching two columns (or rows) of A switches the sign of det(A) det(B) = det(A.1 , A.2 , . . . , A.k+1 , A.k , . . . , A.n ) = − det(A) But more important is what happens when we have two linear dependent vectors: det(A.1 , A.2 , . . . , B, B, . . . , A.n ) = 0 Then the multilinear and alternating property gives: linear dependence between columns (or rows) in A ⇔ det(A) = 0 Apart from using column or row expansion to calculate the determinant there is an alternative formula for the determinant X aσ1 ,1 aσ2 ,2 . . . aσn ,n sign(σ) (1) det(A) = σ∈Sn σ is a permutation of columns (rows), Sn is the set of all possible permutations, aσi ,i is the element that ai,i is switched with. sign is a function that is equal to 1 if an even number of rows have been switched and -1 if an odd number of rows have been switched. For example for a 2 × 2 matrix A a1,1 a1,2 A= a2,1 a2,2 We get two possible permutations: σ a do not switch places of columns, aσ1a ,1 = a1,1 , aσ2a ,2 = a2,2 . σ b switch places of columns, aσb ,1 = a1,2 , aσb ,2 = a2,1 . 1 2 And we get determinant det(A) = aσ1a ,1 aσ2a ,2 sign(σ a ) + aσb ,1 aσb ,2 sign(σ b ) = a1,1 a2,2 − 1 2 a2,1 a1,2 11 The basic idea for the formula is as follows. We start by rewriting columns as sums of columns of the identity matrix n X A.k = aσk ,k I.σk σk =1 Using the multilinearity of the determinant function we get det( n X aσ1 ,1 I.σ1 , A.2 , . . . A.n ) = σ1 =1 = n X aσ1 ,1 det(I.σ1 , A.2 , . . . A.n ) = σ1 =1 = n X aσ1 ,1 σ1 =1 = n X aσ1 ,1 σ1 =1 n X aσ2 ,2 det(I.σ1 , I.σ2 , . . . A.n ) = σ2 =1 n X n X aσ2 ,2 . . . σ2 =1 aσn ,n det(I.σ1 , I.σ2 , . . . I.σn ) σn =1 Terms where σi = σk for some i 6= k, 1 ≤ i, k ≤ n will be equal to zero. det(A) = X aσ1 ,1 aσ2 ,2 . . . aσn ,n det(I.σ1 , I.σ2 , . . . I.σn ) σ∈Sn The (sign) function captures the alternating behaviour. Q 1≤i<j≤n sign(σ) = Q (σj − σi ) 1≤i<j≤n (j − i) = ±1 While the function given here i correct, it’s extremely slow for large matrices, for an n × n matrix there is n! permutations. Instead we mostly we calculate it by column (row) expansion as we are used to for 3 × 3 matrices: column i row i z }| { z }| { n n X X det(A) = ak,i Aki = ai,k Aik k=1 (2) k=1 ˜ ki ) where A ˜ ki is equal to A with the k:th row and where Aki is equal to (−1)k+i det(A i:th column removed. Aki is called the ki-cofactor of A. 12 1.3 Eigenvalues and eigenvectors We are often interested in finding eigenvalues or eigenvectors of a matrix. Definition 1.1 If Av = λv where A ∈ Mn×n (K), λ ∈ K, v ∈ Mn×1 (K) then λ is a eigenvalue of A and v is a (right) eigenvector if v 6= 0. We call an eigenvalue λ with corresponding eigenvector v an eigenpair (λ, v). A matrix can have (and usually do) have multiple eigenvalues and eigenvectors. A single eigenvalue can also be associated with multiple eigenvectors. Often we are interested in the ”dominant” eigenvalue or the spectral radius of a matrix A. Definition 1.2 The set of eigenvalues for a matrix A is called the spectrum of A and is denoted by Sp(A). The absolute value of the largest eigenvalue is called the spectral radius ρ(A) = max |λ| λ∈Sp(A) The eigenvalues whith the largest absolute value (if there is only one) is often called the dominant eigenvalue. This and it’s corresponding eigenvector is often of particular interest as we will see for example while working with stochastic matrices in chapter 5. We can find the eigenvalues by solving det(A − λI) = 0 which is an nth degree polynomial in λ. pA (λ) = det(A − λI) is called the characteristic polynomial and pA (λ) = 0 is called the characteristic equation. Some properties of eigenvalues are: • The trace of A is the sum of the eigenvalues of A: tr(A) = n X i=1 • det(A) = n Y ai,i = n X λi i=1 λi i=1 • If A ∈ Mn×n (K) have n different eigenvalues then A is diagonalizable. • If A have no eigenvalue λ = 0 then the determinant is non-zero and the linear system Ax = b have a single unique solution. Often when interested in finding the eigenvalues or eigenvectors we first find a ”similar” matrix with the same unknown eigenvalues and then solve the problem for this new matrix instead. 13 Definition 1.3 Given two n × n matrices A, B we say that they are similar if there is an invertible n × n matrix P such that B = P−1 AP Similar matrices share not only eigenvalues but also many other properties because of that. Theorem 1.2. Let matrices A, B be similar, then A, B have the same • Eigenvalues (but generally not the same eigenvectors) • Determinant, det(A) = det(B). • Rank • Trace We will look much more at eigenvalues and eigenvectors for special types of matrices in later chapters, as well as how to compute them efficiently in chapter 11. 1.4 Exercises a) Show that this linear equation system has one single −x + 2y − z = −x + y = 2x − y + 2z = b) Show that this linear equation system has infinitely x + y − z = 2x + y + 2z = 3x + 2y − z = solution. 2 3 1 many solutions. 1 2 3 c) Show that this linear equation system has no solution. x − 2y + z = 3 x + 3y − 2z = 2 2x + y − z = 1 Calculate these determinants 2 3 1 5 4 , c = 2 −1 5 a = 5 , b = 3 2 5 0 −1 14 1.4.1 Let 3 0 A= 0 0 3 1 0 0 4 2 3 0 4 3 2 4 ,B = 2 3 3 2 3 1 1 1 4 2 3 3 4 3 2 4 ,C = 2 3 3 3 3 1 1 2 4 2 3 1 4 2 , det(C) = −26 3 3 The following determinants should all be very simple to calculate. Calculate them. a) det(A) b) det(B) c) det(C> ) 1.4.2 If the linear equation systems in 1.4 were written on the form Ax = y, for which systems would A−1 exist? 1.4.3 4 1 Find the eigenvalues and eigenvectors of M = 3 2 1.4.4 Given 1 0 0 1 1 2 1 2 1 A = −1 1 0 , B = 0 3 2 , P = 0 0 1 0 1 0 −1 0 1 0 2 3 Show that A and B are ( similar). 15 2 Matrix factorization and canonical forms Matrix factorization, also known as matrix decomposition, are methods in which we write one matrix as a product of several matrices. The aim is typically to change the problem into another easier problem through the choice of these factor matrices. Matrix factorization is very often used in order to speed up and/or stabilize calculations in numerical algorithms such as when finding eigenvalues of a matrix or solving a linear system. We will give some examples of applications but primary we will introduce different factorizations which we will then use in later chapters. One example we should already be familiar with is the diagonalization of a diagonalizable matrix A by elementary row (or column) operations. Since elementary row consist of multiplying rows with non-zero scalars and adding them together the application of a series of elementary row operations on a linear equation system, Ax = y, can be written as multiplication by a matrix, S, from the left. Elementary row operations:Ax = y → SAx = Sy The matrix S is always square and since it is easy to revert any elementary row operations it will also be invertible. Thus it is always allowed to write ˆ=y ˆ, x ˆ = Sx, y ˆ = Sy SASS−1 x = Sy ⇔ SAS−1 x Suppose that the matrix A was diagonalized in this manner. SAS−1 = D, D diagonal matrix This relation makes the following definition of a diagonalizable matrix quite natural Definition 2.1 A matrix A is diagonalizable if there exists an invertible matrix S such that A = S−1 DS Where D is a diagonal matrix. This if course an interesting property since a matrix equation Ax = y has a unique solution if A is diagonalizable. But there are many other important factorizations such as: • Spectral decomposition QΛQ−1 • LU-factorization • QR-factorization • Rank factorization CF • Jordan canonical form S−1 JS 16 • Singular value factorization UΣVH • Cholesky factorization GGH Before we look at how to find these factorizations and why they are useful we will start by looking at a couple of important types of matrices with some useful properties. 2.1 Important types of matrices and some useful properties We have already looked extensively at non-negative matrices in the earlier chapters, however there are many other types of matrices which are useful when trying to find a suitable matrix factorization. 2.1.1 Triangular and Hessenberg matrices A triangular matrix as we should already know is a matrix with all zeros above or below the diagonal. A matrix where all elements above the diagonal are zero we call lower triangular and a matrix where all elements below the diagonal are zero such as the one below, we call upper triangular. F F ... F 0 F . . . F A=. .. . . .. .. . . . 0 0 ... F We note that a triangular matrix does not need to be square although that is when they are the most useful. For a non-square matrix the diagonal is always the one starting in the upper left corner. To understand why triangular matrices are desirable we will consider to important properties. First of all we can very easily solve a linear system Ax = b if A is triangular since the last row is a equation of the form axn = b, Solving this we can substitute this answer into the row above that and get a new equation in just one variable. A second important property of triangular matrices us that the eigenvalues of a triangular matrix is the same as the elements on the diagonal. You can easily check this by looking at what happens with the characteristic polynomial det(A − Iλ) when A is diagonal. This gives rise to one important question: Is it possible to rewrite a matrix as a product of a triangular matrix and another matrix which doesn’t change the eigenvalues when multiplied with this matrix and if possible, can it be done fast? Related to triangular matrices are the Hessenberg matrices. These are matrices that are ’almost’ triangular, instead of demanding that all elements below the diagonal is zeros as for an upper triangular matrix, a matrix that is an upper Hessenberg matrix if 17 we also allow the elements immediately below the diagonal to be non zero. This gives an upper Hessenberg matrix A the form: F F F ··· F F F F F F · · · F F F 0 F F · · · F F F A = 0 0 F · · · F F F .. .. .. . . .. .. .. . . . . . . . 0 0 0 · · · F F F 0 0 0 ··· 0 F F While not as immediately useful as a triangular matrix, Hessenberg matrices are useful just from the fact that they are nearly triangular. The diagonal doesn’t hold the eigenvalues as in a triangular matrix, but they often give a good approximation or initial guess for iterative methods. Multiplication of two (both upper or lower) Hessenberg matrices also result in a new Hessenberg matrix. We will see examples of how and why Hessenberg matrices are useful later when we look at the QR-method in section 11.2. 2.1.2 Hermitian matrices We start by defining the Hermitian conjugate of a matrix: Definition 2.2 The Hermitian conjugate of a matrix A is denoted AH and is defined by (AH )i,j = (A)j,i . In other words we transpose it and change the sign of the imaginary parts. So if A is real valued, the hermitian conjugate is the same as the transpose. Definition 2.3 A matrix is said to be Hermitian (or self-adjoint) if AH = A Notice the similarities with a symmetric matrix A> = A, a Hermitian matrix can be seen as the complex extension of real symmetric matrices. One property of Hermitian (and real symmetric) are that they are normal and thus diagonalizable. 2.1.3 Unitary matrices Matrix factorizations using unitary matrices are often used over other factorization techniques because of their stability properties for numerical methods. Definition 2.4 A matrix, A, is said to be unitary if AH = A−1 . We will look more at unitary matrices when we look at linear spaces and projections in chapter 6.2. Unitary matrices have a large number of ”good” properties. Theorem 2.1. Let U be a unitary matrix, then 18 a) U is always invertible. b) U−1 is also unitary. c) | det(U)| = 1 d) (UV)H = (UV)−1 if V is also unitary. e) For any λ that is an eigenvalue of U, λ = eiω , 0 ≤ ω ≤ 2π. f ) Let v be a vector, then |Uv| = |v| (for any vector norm). H g) The rows/columns of U are orthonormal, that is Ui. Uj.H = 0, i 6= j, Uk. Uk. = 1. One of the important things we can find from this is that if we multiply a matrix A with a unitary matrix A, then UA will have the same eigenvalues as A. We will meet unitary matrices again in later chapters when we will need them and easier can illustrate why they are useful. One example of a unitary matrix is the C matrix below which rotates a vector by the angle θ around the x-axis 1 0 0 C = 0 cos(θ) − sin(θ) 0 sin(θ) cos(θ) 2.1.4 Positive definite matrices Positive definite matrices often come up in practical applications such as covariance matrices in statistics. Positive definite matrices have many properties which makes them easier to work with than general matrices. Definition 2.5 We consider a square symmetric real valued n × n matrix A, then: • A is positive definite if x> Ax is positive for all non-zero vectors x. • A is positive semidefinite if x> Ax is non-negative for all non-zero vectors x. We define negative definite and negative semidefinite in the same way except demanding that the resulting vector is negative or non-positive instead. Similarly we can define it in the same way by replacing the transpose with the hermitian for matrices with complex valued elements. 2.2 Factorization As was mentioned in the introduction factorization (also called decomposition) is taking a matrix and rewriting it as a product of several matrices. This can make it easier to solve certain problems, similarly to how factorization of numbers can made division easier. 19 A well known and previously visited example of factorization is rewriting a diagonalizable matrix as a product of an invertible matrix and a diagonal matrix A = S−1 DS, D is diagonal 2.2.1 Spectral factorization Spectral factorization is a special version of diagonal factorization. It is sometimes referred to as eigendecomposition. Theorem 2.2. Let A be an square (n × n) matrix with linearly independent rows. Then A = QΛQ−1 where Q is a square n × n matrix whose columns are the eigenvectors of A and Λ is a diagonal matrix with the corresponding eigenvalues. The spectral factorization can be thought of as a ’standardized’ version of diagonal factorization. Also relates to projection, see section ??. 2.2.2 Rank factorization Theorem 2.3. Let A be an m × n matrix with rank(A) = r (A has r independent rows/columns). Then there exist a factorization such that A = CF where C ∈ Mm×r and F ∈ Mr×n We take a short look at how to find such a factorization, We start by rewriting the matrix on reduced row echelon form 1 F 0 0 0 F 0 0 0 0 1 0 0 F 0 0 0 0 0 1 0 F 0 0 0 0 0 0 1 0 F 0 B= 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 We then create C by removing all columns in A that corresponds to a non-pivot column in B. In this example we get: C = A.2 A.4 A.5 A.6 A.8 We then create F by removing all zero rows in B. In this example we get: F = B1. B2. B3. B4. B5. This gives the Rank factorization of A. 20 2.2.3 LU and Cholesky factorization The LU factorization is a factorization of the form: A = LU Where L is lower triangular and U is upper triangular. If we can find such a factorization we can for example easily find the solution to a linear system Ax = L(Ux) = b by first solving Ly = b and then solve Ux = y. Both these systems are easy to solve since L and U are both triangular. However not every matrix A have a LU factorization, not even every square invertible matrix. Instead we often talk about LUP factorization instead: Theorem 2.4. If A is a square matrix, then there exist a factorization such that: PA = LU Where P is a permutation matrix, U is a upper triangular matrix and L is a lower triangular matrix. P−1 LU is called the LUP factorization of A. We can still, provided that we found such a factorization for example easily solve the equation system Ax = b by instead solving PAx = Pb. The LU factorization can be seen as a variation of the Gaussian elimination which we often use when attempting to solve a linear system by hand. LU decomposition can also be used to calculate the determinant of A = P−1 LU. We get the determinant as: det(A) = det(P −1 ) det(L) det(U ) = det(P −1 ) n Y i=1 ! li,i n Y ! ui,i i=1 This since we remember that the eigenvalues of a triangular matrix is the values on the diagonal. The determinant of a permutation matrix can be found to be det(P −1 ) = (−1)S where S is the number of row exchanges. If A is Hermitian and positive-definite however then we can find another related factorization, namely a Cholesky factorization: Theorem 2.5. If A is a Hermitian, positive-definite matrix, then there exist a factorization such that: A = LLH Where L is a lower triangular matrix. LLH is called the Cholesky factorization of A. The Cholesky factorization is mainly used instead of the LUP factorization when possible to solve linear systems. Some example where such matrices naturally occur is. 21 • Any covariance matrix is positive-definite and any covariance matrix based on measured data is going to be symmetric and real-valued. This means it’s hermitian and we can use the Cholesky factorization. • When simulating multiple correlated variables using Monte Carlo simulation the correlation matrix can be decomposed using the Cholesky factorization. L can then be used to simulate the dependent samples. 2.2.4 QR factorization The QR-factorization is one of the most useful factorizations. While we will only show what it is and give one example of what it can be used for, section 11 is dedicated to how we can find the QR-factorization efficiently as well as it’s use in the QR-method, one of the most widely used method to find the eigenvalues of a general square matrix. Theorem 2.6. Every n × m matrix A have a matrix decomposition A = QR where • R is a n × m upper triangular matrix. • Q is a n × n unitary matrix. If A is a square real matrix then Q is real and therefore an orthogonal matrix Q> = Q. We can find a QR-factorization can be computed using for example Householder transformations or the Gram-Schmidt process. We will look more into this in ??. The QR-factorization have uses in for example: • Given a QR-factorization we can solve a linear system Ax = b by solving Rx = Q−1 b = QH b. Which is can be done fast since R is a triangular matrix. • QR-factorization can also used in solving the linear least square problem. • It plays an important role in the QR-method used to calculate eigenvalues of a matrix. 2.2.5 Canonical forms Some factorizations are referred to as canonical forms. A canonical form is a standard way of describing an object. A single object can have several different canonical forms and which one we use might depend on what we want to find or which canonical form we can easily calculate. We have already seen one example of a canonical form which we used when looking at reducible stochastic matrices. Some other examples of canonical forms for matrices are: • Diagonal form (for diagonalizable matrices) 22 • Reduced row echelon form (for all matrices) • Jordan canonical form (for square matrices) • Singular value factorization form (for all matrices) 2.2.6 Reduced row echelon form Definition 2.6 A matrix is written on reduced row echelon form when they are written on echelon form and their pivot elements are all equal to one and all other elements in a pivot column are zero. B= 0 0 0 0 0 0 0 0 1 F 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 F F F F 0 0 0 0 0 0 0 0 1 0 0 0 Theorem 2.7. All matrices are similar to some reduced row echelon matrix. We recognize this as the result of applying Gauss-Jordan elimination to the original matrix. 2.2.7 Jordan normal form The Jordan normal form, while not especially useful in practical applications because of its sensitivity to rounding errors in numerical calculations is still good to be familiar with since it is often usefor for developing mathematical theory. We start by defining the building blocks which which we build the Jordan normal form. Definition 2.7 (Jordan block) A Jordan block is a square matrix of the form λ 0 Jm (λ) = ... 0 0 1 0 ... λ 1 ... .. . . . . . . . 0 ... λ 0 ... 0 0 0 .. . 1 λ We call a matrix made up of Jordan blocks a Jordan matrix: 23 Definition 2.8 (Jordan matrix) A Jordan matrix is a square matrix of the form 0 Jm1 (λ1 ) 0 J m 2 (λ2 ) J= .. .. . . 0 0 ... ... .. . 0 0 .. . . . . Jm1 (λk ) Theorem 2.8. All square matrices are similar to a Jordan matrix. The Jordan matrix is unique except for the order of the Jordan blocks. This Jordan matrix is called the Jordan normal form of the matrix. We remember that two matrices A, B are similar if there is a unitary matrix S such that A = S−1 BS. This means that if we have found a Jordan matrix J similar to A, they have the same eigenvalues. And since Jordan matrices are upper triangular, we can easily find the eigenvalues along the diagonal of J. Theorem 2.9 (Some other interesting properties of the Jordan normal form). Let A = S−1 JS a) The eigenvalues of J is the same as the diagonal elements of J. b) J has one eigenvector per Jordan block. c) The rank of J is equal to the number of Jordan blocks. d) The normal form is sensitive to perturbations. This means that a small change in the normal form can mean a large change in the A matrix and vice versa. 2.2.8 Singular value decomposition (SVD) Singular value decomposition is a canonical form used for example in statistics and information processing in dimension reducing methods such as Principal component analysis or calculating pseudoinverses (see section 8.4). Theorem 2.10. All A ∈ Mm×n can be factorized as A = UΣVH where U and V are unitary matrices and Sr 0 Σ= 0 0 where Sr is a diagonal matrix with r = rank(A). The diagonal elements of Sr are called the singular values. The singular values are uniquely determined by the matrix A (but not necessarily their order). 24 The singular value decomposition is related to the spectral decomposition. Given a singular value decomposition A = UΣVH we have: • The columns of U are the eigenvectors of AAH • The columns of V are the eigenvectors of AH A • The singular values on the diagonal of Σ are the square root of the non-zero eigenvalues of both AAH and AH A. Having the singular value decomposition it’s easy to solve a lot of problems found in many practical applications, for example if we have a singular values decomposition A = UΣVH we can: • Calculate pseudoinverse A+ = VΣ+ UH where Σ+ is easily found by taking the reciprocal (x−1 ) of the non-zero elements of the diagonal and transposing the resulting matrix. • It is also used in Principal component analysis (PCA) as we will see later. 2.3 Applications 2.3.1 Portfolio evaluation using Monte Carlo simulation This example uses a large amount of staistics and terminology not otherwise described in this book. For the interested reader we refer to books or courses in Monte Carlo methods and stochastic processes for the details. Suppose we have two stocks A, B with prices at time t as S a (t), S b (t). Initially we start with na stocks of A and nb stocks of B. This gives initial wealth: W0 = na S a (0) + nb S b (0) We let the stocks be for a time T after which we are interested in the new wealth of the portfolio or more specifically how sure we can be that there have not occurred any great losses. We assume the stock pricing can be modelled as geometric brownian motions (GBM) with parameters S a GBM(µa , σa ) and S b GBM(µb , σb ). We have estimated correlation ρ between the return on A and B. To evaluate our portfolio we are interested in the probability that the total wealth of our portfolio WT at the end of the period T have dropped by more than 10%. We can write this as: P WT ≤ 0.9 W0 25 Finding this probability analytically is rather hard, especially if we work with not 2 stock but say 10 dependent stocks. Instead we use Monte Carlo simulation to simulate the process a large amount of times. Every time checking at the end if WT ≤ 0.9W0 . The value of stock A at time T can be written: STa = S0a exp((µa − σa2 /2)T + √ T Va ) Where Va , Vb are multivariate normal distributed variables Va , Vb N (0, Σ). 2 σa σa σb ρ Σ= σa σb ρ σb2 So in order to generate one sample of STa , STa we need to simulate one sample from N (0, Σ). We can usually rather easily simulate independent normal distributed variables N (0, I). Our goal is then to use these independent samples to generate the dependent samples N (0, Σ). Normal distribution have the property that a linear combination of normal distributed variables is itself normal. In fact if we have normal distributed variables with mean zero and variance one Z N (0, 1) then C> Z is distributed as follows. C> Z N (0, C> C) In other words, if we can find a Cholesky decomposition C> C = Σ then we need only to calculate the product C> Z to get samples from the desired distribution. Σ is obviously real valued and symmetric, but it’s also at least positive semi definite. In fact the only time it isn’t strictly positive definite is when one one of the variables is set to be an exact linear combination of the others. Since we can be almost sure Σ is a real, symmetric, positive definite matrix we can use Cholesky factorization to find a decomposition C> C = Σ. We thus generate a large amount of samples C> C and check how often WT ≤ 0.9W0 to find the probability P (WT ≤ 0.9W0 ). Some things to think about: • Can you think of another situation where you would like to take samples from another multivariate normal distribution? • Can you think of why the covariance matrix Σ is always a real, symmetric, positive (semi)definite matrix? 26 3 Non-negative matrices and graphs Non-negative matrices are common in many applications such as when working with stochastic processes and Markov chains as well as when working with things we can represent with a graph, such as electrical networks, water flows, random walks, and so on. Non-negative and positive matrices (not to be confused with positive (semi) definite matrices) have many known properties, especially concerning their eigenvalues and eigenvectors which makes them easier to work with than general matrices. This and the next to chapters are very much related in that we will start by looking at some graph theory, how to represent a graph with a matrix and some things which we can calculate here. After which we will take a look at the properties of non-negative matrices themselves in chapter 4 and then last look at a special type of non-negative matrix related to random walks on graphs in chapter 5. 3.1 Introduction to graphs A graph is a representation of a set of objects (vertices) where some pair of objects have links (edges) connecting them. The vertices can represent a state in a Markov chain, a point on a grid, a crossing in a road network among other things. Likewise the edges could in the same examples represent the possible transitions in the Markov chain, the lines on the grid or the actual roads in a road network. Sometimes in litterature vertices are called nodes and edges are called links. An example of two graphs can be seen in Figure 1. B A B C D A C D Figure 1: A directed graph (left) and a undirected graph (right) We will concern ourselves with mainly two types of graphs, directed graphs as in the one to the left where every edge connecting two vertices have a direction. And undirected graphs as in the one to the right where the edges doesn’t have a direction but instead only denote that the two vertices are connected. Sometimes we will allow edges which goes from a vertice back to itself (so called loops) and sometimes we won’t, generally loops will be allows for directed graphs but often not when working with undirected 27 graphs. In both cases we can also assign weights to the edges as we will see later. We can represent graphs such as the ones in Figure 1 with a matrix by defining the adjacency matrix for a graph. Definition 3.1 The adjacency matrix A of a graph G with n vertices (|V | = n) is a square n × n matrix with elements ai,j such that: 1, if there is an edge from vertice i to vertice j ai,j = 0, otherwise If the graph is undirected we consider every edge between two vertices as linking in both directions. We note one important thing, we don’t define in which order we take the vertices. Sometimes it’s natural to order the vertices in a certain way, for example depending on vertice labels, however any order is valid as long as you know what order you choose. B A C 0 0 A= 0 1 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 D E Figure 2: A directed graph and corresponding adjacency matrix with vertices ordered {A, B, C, D, E} If the graph instead was undirected we would get the graph and adjacency matrix in ??. We note that we don’t need to order the vertices in that order, for example the matrix: 0 1 0 0 1 1 1 1 1 0 A= 0 1 0 1 0 0 1 1 0 0 1 0 0 0 0 Is also a adjacency matrix of the graph in Figure 3 where we instead have ordered the vertices as {C, B, A, D, E}. 28 B A C 0 1 A= 0 1 0 1 1 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 D E Figure 3: A undirected graph and corresponding adjacency matrix with vertices ordered {A, B, C, D, E} One common addition to graphs is that we also assign weights to the edges, usually we only allow the weights to be positive real values. The weights can for example be used to represent transition probabilities in a Markov chain, distance of a part of road in a road network, time needed to travel between two vertices or the capacity in a flow network. If we have a graph with weighted edges we often talk about the distance matrix rather than the adjacency matrix. Definition 3.2 The distance matrix D of a graph G with n vertices and weighted edges E is a square n × n matrix with elements di,j such that: Ei,j , if there is an edge from vertice i to vertice j di,j = 0, otherwise Where Ei,j is the weight of the edge from vertice i to vertice j. As with the adjacency matrix, if the graph is undirected we consider every edge between two vertices as linking in both directions. An example of a weighted directed graph and corresponding distance matrix can be seen in 3.2. As in the case with the adjacency matrix we don’t have a defined order of the vertices, so we can use any as long as we know what order we use. While there would be no problem using negative or non real weights on the edges in the matrix representation of the graph, in practice we still usually consider only the cases with only positive weights. This is because we have a lot of usefull theory concerning specifically nonnegative matrices (which we get if we allow only positive weights). In most real world examples where we are interested in representing the problem with a graph we also naturally end up with only positive weights, such as when the weights represent for example distance, time, probability, capacity or a flow. 29 2 B 1 1 2 A C 1 2 1 0 0 D= 0 1 0 1 2 1 0 0 0 0 0 0 0 1 2 0 0 0 0 0 2 0 0 D E Figure 4: A weighted directed graph and corresponding distance matrix with vertices ordered {A, B, C, D, E} 3.2 Connectivity and irreducible matrices It’s often interesting to see not only if there is an edge between two vertices but also if there is a connection between two vertices if we are allowed to traverse more than one edge. This is interesting both in the case of directed graphs as well as undirected ones: can we get from A to B by following the edges? And if so in how many steps? Or what is the shortest distance between A and B? We take a look at so called paths in a graph. Definition 3.3 Given a weighted or unweighted graph. • A path in a graph is a sequence of vertices e1 , e2, . . . , en such that for every vertice in the sequence there is an edge to the next vertice in the sequence. • If the first vertice in a path is the same as the last vertice we call it a cycle. • The length of a path is the number of edges in the path (counting multiple uses of the same edge). We take a look at the adjacency matrix although the same is true for the distance matrix as long as we only allow positive weights. (k) Theorem 3.1. If element ai,j of the adjacency matrix A to the power of k is non zero, there is a path from vertice i to vertice j of exactly length k. In other words: (k) ai,j ∈ Ak ⇒ There is a path of length k from vertice i to vertice j 30 The proof is left as an exercise. The problem of finding a path or the existence of a path between two vertices is an important problem in many applications regarding graphs. For example in finding the shortest path in for example a road network as well as finding if and how likely it’s to get from one state to another in a Markov chain. We divide the graph’s in two different groups depending on if we can find a path from all vertices to all other vertices. Definition 3.4 Given a weighted or unweighted graph. • Two vertices is said to be connected if there exists a path from either vertice to the other in the undirected graph. • A graph is said to be connected if it is connected for all pair of vertices in the undirected graph. • If it is also connected for all pairs in the directed graph we say that it’s strongly connected. We see that if a graph is connected or strongly connected in the case of directed graphs, we have a guarantee that we have a path from any vertex to another. Before continuing to we take a short look at connected parts of a graph. Definition 3.5 Given a (possible weighted) graph: • A connected component in a undirected graph is a maximal part of the graph where all vertices are connected with eachother. • A strongly connected component is a part of the directed graph which is strongly connected. Example of connected components in an undirected graph can be seen in Figure 5 below. We see that the the connected components can themselves be considered connected graphs. A example of strongly connected components in a directed graph can be seen in Figure 6. How many edges would you need to make the graph strongly connected? We are now ready to move over to the matrix side of things. We look at general square non-negative matrices, for which both the adjacency matrix as well as the distance matrix belong if we only allow positive weights . Theorem 3.2. A non-negative square n × n matrix A is said to be Irreducible if any of the following equivalent properties is true: 1. The graph corresponding to A is strongly connected. 2. For every pair i, j there exists a natural number m such that am i,j > 0. 31 Figure 5: A undirected graph with 3 connected components Figure 6: A directed graph with 3 strongly connected components 3. (I + A)n−1 is a positive matrix. 4. A cannot be conjugated into a block upper triangular matrix by a permutation matrix P. We leave the last property for later and give a short motivation of why the first 3 are equivalent. It’s clear that since the graph is strongly connected (1) there exist a path from any vertice to any other vertice. However if the length of the path between vertice i and vertice j is m we know that am i,j > 0. So if (1) is fulfilled so is (2) and if two is fulfilled we know there is a path between all vertices so the graph needs to be strongly connected as well (1). For the third we can take the binomial expansion and get: (I + A) n−1 n X k k = A n k=0 k Since is positive the matrix is positive if there is at least one 0 ≤ k ≤ n for every n 32 pair i, j such that aki,j > 0. However since the longest possible path without repeating any vertices is n − 1 (go through all vertices in the graph) we see that this is true only if (2) is true since we need not check any m larger than n − 1. The question regarding if a non-negative matrix is irreducible or not is a very important property which we will use repeatedly in the comming chapters both with regards to graphs as well as generall non-negative matrices. Having a clear understanding of what is and isn’t an irreducible matrix and how they relate to graphs and their distance or adjacency matrix will give a much easier understanding when we talk about Perron-Frobenius theory and Markov chains later. 3.3 The Laplacian matrix The Laplacian matrix is another matrix representation of graphs with no self loops (so called simple graphs). It can be used to among others to find the resistance between nodes in an electrical network, estimate the robustness of or find if there are any ”bottlenecks” in a network. It’s also usefull in finding some other properties of the graph as well as it appearing naturally in for example the stiffness matrix of simple spring networks. Before defining the Laplacian matrix we need to define another matrix, namely the degree matrix of a graph. Definition 3.6 The degree matrix of a undirected graph is a diagonal matrix where the elements on the diagonal is the number of edges connected with that vertex. In other words given a graph G with n vertices the degree matrix D is a n × n matrix with elements: deg(i), i = j di,j := 0, i 6= j Where deg(i) is the number of edges connected with vertice vi . • If there are loops present (vertices linking to itself) we count those twice for the degree. • If the graph is directed we use either the in-degree(number of edges pointing towards the vertice) or out-degree (number of edges pointing out from the vertice). Which one we use depend on application. • With a directed graph we also only count loops once rather than twice. You can see a undirected graph and corresponding degree matrix in Figure 7 and Figure 8 Using this and the adjacency matrix we can define the laplacian matrix: 33 2 1 4 3 0 D= 0 0 0 4 0 0 0 0 1 0 0 0 0 2 3 Figure 7: Undirected graph and corresponding degree matrix. 2 1 4 2 0 D= 0 0 0 3 0 0 0 0 1 0 0 0 0 0 3 Figure 8: Directed graph and corresponding degree matrix using in-degree. Definition 3.7 The Laplacian matrix L of a undirected graph with no self loops is defined as: L := D − A where D is the degree matrix and A is the adjacency matrix. This gives elements li,j ∈ L: i=j deg(i), −1, i 6= j, vi links to vj li,j := 0, otherwise Where deg(i) is the number of edges connected with vertice vi . An example of a graph and corresponding Laplacian matrix can be seen in Figure 9. Some properties of the Laplacian matrix L and it’s eigenvalues λ1 ≤ λ2 ≤ . . . ≤ λn . • L is positive semidefinite (x> Lx ≥ 0). • λ1 = 0 and the number of connected components is it’s algebraic multiplicity. 34 2 3 −1 −1 −1 −1 2 0 −1 L= −1 0 1 0 −1 −1 0 2 1 4 3 Figure 9: Simple undirected graph and corresponding Laplacian matrix • The smallest non zero eigenvalue is called the spectral gap. • The second smallest eigenvalue of L is the algebraic connectivity of the graph. We immediately see that it seems like the eigenvalues of the Laplacian gives us a lot of information about the graph, this is especially important since it’s positive semidefinite making more effective calculations of the eigenvalues possible, which might be important if we need to work with very large graphs. We will look more into how we can use the Laplacian matrix in the application part. 3.4 Exercises 3.1. 3.5 Applications 3.5.1 Shortest path A common problem when working with networks such as a road network or in routing on the internet is the shortest path problem. You could for example want to find the best order in which to visit a number of places (the traveling salesman problem) or maybe you want to find the shortest path between two places such as in the case of GPS in cars. There are a number of related problems and solutions depending on if we need the shortest distance between all pair of vertices or if it’s enough with a subset such as a group of vertices or a single pair of vertices. One common algorithm for finding the shortests path from one vertex to all others is Djikstras algorithm which most other similar methods are based upon. The method works by assigning a distance from the initial vertex to every other vertex and then traveling through the graph always choosing the next vertex among the unvisited vertices that minimizes the distance. At every we we update the distance to every vertex linked 35 to by the current one. More exact the method works through the following steps using the distance matrix of the resulting graph: 1. Assing distance zero to initial vertex and infinity to all other verticess. 2. Mark all verticess as unvisited and set the initial vertex as current vertex. 3. For the current vertex, calculate the distance to all it’s unvisited neighbors as the current distance plus the distance to the neighbors. Update these vertices distance if the newly found distance is lower. 4. Mark current vertex as visited. 5. If the smallest distance in among the unvisited vertices is infinity: stop! Otherwise set the unvisited vertex with the smallest distance as current vertex and go to step 3. If we are only interested in the shortest distance from the initial vertex to one specific vertex we can stop as soon as the distance to the target vertex is less than the distance to the next current vertex. However for the method to work we need to make a few assumptions, there are also a couple of properties of the graph itself that can say something about the expected result of the algorithm. Take a few minutes to think through the following questions and how they affect the result or what assumptions we need. • What could happen if there are negative weight’s in the graph? • What if the graph is directed? Should a transportation network be considered directed or undirected? • If we know the graph is or isn’t irreducible, can we say something about the result beforehand? 3.5.2 Resistance distance When planning a large electrical network its important to try and get as low resistance between vertices as possible. It is however not an easy task to use the common rules for combining resistors when the network gets large and we might want to know how to expand the network in order to lower the resistance between not just two vertices but lower it overall. The resistance distance between two vertices vi , vj in a graph G is defined as the effective resistance between them when each edge is replaced with a 1 Ω resistor. We define 1 Γ=L+ n Where L is the laplacian matrix, n is the number of vertices in G and 1 is a n × n matrix with elements equal to one. 36 The elements of the resistance distance matrix (Ω) is then: −1 −1 (Ω)i,j = Γ−1 i,i + Γj,j − 2Γi,j . For example if we have the graph below and corresponding Laplacian matrix in Figure 10 below: B 3 −1 0 −1 −1 −1 3 −1 0 −1 L= 0 −1 3 −1 −1 −1 0 −1 3 −1 −1 −1 −1 −1 4 A E C D Figure 10: Undirected graph and corresponding Laplacian matrix We get Γ = L + 1/5 and it’s inverse: Γ−1 0.4267 0.1600 = 0.0933 0.1600 0.1600 0.1600 0.4267 0.1600 0.0933 0.1600 0.0933 0.1600 0.4267 0.1600 0.1600 0.1600 0.0933 0.1600 0.4267 0.1600 0.1600 0.1600 0.1600 0.1600 0.3600 −1 −1 We get the resistance distance matrix by computing (Ω)i,j = Γ−1 i,i + Γj,j − 2Γi,j . for every element: 0 0.5333 (Ω) = 0.6667 0.5333 0.4667 0.5333 0 0.5333 0.6667 0.4667 0.6667 0.5333 0 0.5333 0.4667 0.5333 0.6667 0.5333 0 0.4667 0.4667 0.4667 0.4667 0.4667 0 And we have found the effective resistance between vertices as the value of the elements in the matrix. We can see that we have the largest resistance between the vertices A, C and B, D, which is quite natural since they are the ones furthest from eachother. Between opposite vertices we can also easily check the result since we can see it as 2 ressistors made up of 3 parallel connected resistors each in a series. This would give resistance: RA,D = R4 R5 R6 1 1 2 R1 R2 R3 + = + = R1 R2 + R1 R3 + R2 R3 R4 R5 + R4 R6 + R5 R6 1+1+1 1+1+1 3 37 Since all the resistances is 1 from the definition of resistance distance. Checking the result for the resistance between for example A and E is harder using the usual computation rules though. 38 4 Perron-Frobenius theory 4.1 Perron-Frobenius for non negative matrices Perron-Frobenius theory gives information about many of the most essential properties of non-negative square matrices. We will start by formulating item 4.1 which is essential to this chapter. While it is not immediately easy to see why these properties are useful, we will see throughout the chapter why they are important and how and when we can use them. Be sure to review what we mean by irreducible matrices if your not sure since it’s one of the recurring themes of this chapter. Theorem 4.1. Let A be a square non-negative irreducible matrix with spectral radius ρ(A) = r. Then we have: 1. r, called the Perron-Frobenius eigenvalue, is an eigenvalue of A. r is a positive real number. 2. The (right)eigenvector x belonging to r, Ax = rx, called the (right)Perron vector, is positive ( x > 0). 3. The left eigenvector y belonging to r called the left Perron vector, is positive (y> A = ry> , y > 0). 4. No other eigenvector except those belonging to r have only positive elements. 5. The Perron-Frobenius eigenvalue r has algebraic and geometric multiplicity 1. X X 6. The Perron-Frobenius eigenvalue r satisfies: min ai,j ≤ r ≤ max ai,j . i j i j We remember the spectral radius being the largest (in absolute value) eigenvalue. The theorem says a lot about the ”largest” eigenvalue as well as the corresponding eigenvector. This is also often called the dominant eigenvalue and its value and eigenvector are in many applications one of the most important things to know about the system as can be seen in the application part. We give a short motivation for the correctness of the theorem. Given (2) we can easily prove that the eigenvalue r is positive: If x is a positive vector and A is non-negative, then Ax > 0 since A is irreducible. Since Ax = rx if r is an eigenvalue belonging to the eigenvector x we have r > 0. If r was negative, zero or complex rx would be as well, but Ax is a positive vector. We also prove (4) that the only positive eigenvectors are the ones belonging to r: let A be a non-negative irreducible square matrix, set ρ(A) = r and suppose y > 0 is the left Perron vector y> A = ry> . Assume we have an (right) eigenpair (λ, v), v ≥ 0. This gives: ry> v = y> Av = y> λv ⇒ r = λ 39 We give no further proof, but leave some of the other statements as exercises. Initially the theorem was stated for positive matrices only and then later generalised for irreducible non-negative matrices. It’s easy to see that if A is positive it’s also non-negative and irreducible. Theorem 4.2. If A is positive, in addition to all properties for irreducible non-negative matrices, the following is true as well: • The Perron-Frobenius eigenvalue r has the strictly largest absolute value of all eigenvalues. I.e., for any eigenvalue λ 6= r we have |λ| < r. This last property ,that there is only one eigenvalue on the spectral radius, is a very useful property as well. In the section on Markov chains, we will see that it determines convergence properties. However this was for positive matrices and we see that it isn’t true for general non-negative irreducible matrices as in the example below. We look at the permutation matrix: 0 1 P = 1 0 It’s obviously non-negative and irreducible. From Perron-Frobenius we get that the spectral radius ρ(A) = r = 1 since the sum of both rows are one. However the eigenvalues can easily be found to be ±1 which are both on the unit circle. While there are irreducible matrices for which there are more than one eigenvalue on the spectral radius, there are some irreducible matrices for which there is only one. We let this last property of having or not having only one eigenvalue on the spectral radius divide the non-negative irreducible matrices in two different classes. We say that a non-negative irreducible matrix is primitive if it has only one eigenvalue with absolute value equal to the spectral radius. If a non-negative irreducible matrix is not primitive, it is imprimitive. Theorem 4.3. We let A be a square non-negative irreducible matrix, then: • A is primitive if and only if there is only one eigenvalue r = ρ(A) on the spectral radius. • A is primitive if and only if lim (A/r)k exists. k→∞ xy> > 0. k→∞ y> x where x, y are the right and left Perron vectors respectively. For the limit we have: lim (A/r)k = • A is imprimitive if and only if there are h > 1 eigenvalues on the spectral radius. h is called it’s index of imprimitivity. 40 We note the easily missed part that for a matrix to be either primitive or imprimitive it needs to first be non-negative and irreducible. We also see that that only one eigenvalue on the spectral radius is equivalent to the limit lim (A/r)k existing. k→∞ We return to the permutation matrix: 0 1 P = 1 0 We consider the limit lim (A/r)k , given r = 1 which we got earlier. That means that k→∞ if lim (A)k exists then P is primitive. However P2 = I and the sequence A, A2 , A3 , . . . k→∞ alternate between P and I. Since it alternates between two matrices it obviously doesn’t converge and is therefore imprimitive. In fact all irreducible permutation matrices are imprimitive since for all of them there is a k such that Pk = I (why?) and r = 1 since every row sum to one. There are two common ways to tell if a non-negative irreducible matrix is primitive: Theorem 4.4. A non-negative irreducible matrix A is primitive if: 1. If any diagonal element ai,i > 0 A is primitive (but A could still be primitive even if there are all zeroes on the diagonal). 2. A is primitive if and only if Am > 0 for some m > 0. 3. A is primitive if and only if A is aperiodic. Proof. • If (1) is true it’s obvious that there is a m > 0 such that Am > 0. This since if we look at the graph: irreducibility guarantees there is a path from all vertices to all others and Am > 0 means there is a path of exactly length m from all vertices to all others. However since we have one diagonal element ai,i > 0 we can for all paths between two vertices going through ai,i increase the length of the path by one simply by looping within ai,i once. • If Am > 0 and A had a non positive eigenvalue λ = ρ(A). Then λm should be an eigenvalue on the spectral radius of Am . However since Am is positive the only eigenvalue on ρ(Am ) is positive and simple, hence A is primitive. Conversely, if A is primitive then the limit } lim Am /rm exists and is positive. But for this to be m→∞ possible we must that Am is positive for large enough m. Especially the first statement gives a way to often immediately conclude that a primitive matrix is in fact primitive. When the first can not be used you are left with trying the second, however you might often need to try quite large m before you can find if a 41 matrix is primitive or not. To show that a matrix is imprimitive it’s often easiest to show that it’s periodic as we did with the permutation matrix. The second statement can also be used however since it can be shown that if A is primitive, the smallest possible m which gives a positive matrix can be bounded by: m ≤ n2 − 2n + 2. So if you try a m at least as large as this and the matrix still isn’t positive, it’s imprimitive. We give a short note on imprimitive matrices as well: Theorem 4.5. If A is a imprimitive matrix with h eigenvalues λ1 , λ2 , . . . , λh on the spectral radius ρ(A). Then the following are true: • alg mult(λn ) = 1, n = 1, 2, . . . , h. • λ1 , λ2 , . . . , λh are the h−th roots of ρ(A), such that: {λ1 , λ2 , . . . , λh } = {r, rω, rω 2 , . . . , rω h−1 }, ω = ei2π/h We will not look at the imprimitive matrices more than this, except noting that the eigenvalues on the spectral radius for imprimitive matrices are actually very special. Not only do they all have algebraic multiplicity 1, they are evenly spaced around the spectral radius. 4.2 Exercises 4.1. Consider the matrices: 1 1/2 0 1/2 0 0 0 1 0 1 0 1/2 1/2 0 0 0 0 1 A= 0 1 0 0 B = 1/3 1/3 0 1/3 C = 2 3 0 0 0 1 1 0 0 0 1 3 1 2 2 1 1 3 3 2 1 1 For each matrix consider: • Is it non-negative? • Is it irreducible? • Is it primitive? • What is the maximum and minimum possible value of the Perron-Frobenius eigenvalue? • What can you say about the other eigenvalues? 4.2. Show that for all irreducible permutation matrices there is a k such that Pk = I. Prove that all irreducible permutation matrices are imprimitive. 42 4.3. Consider the matrix: 1 1 2 A = 1 1 1 2 2 1 • Calculate the eigenvalues of this matrix. • Calculate the (right) eigenvector to the Perron-Frobenius eigenvalue. 4.3 Applications 4.3.1 Leontief input-output model Suppose we have a closed economic system made up of n large industries, each producing one commodity. Let a J − unit be what industry j produces that sells for one unit of money. We note that this could be a part of one commodity, for example a car would typically sell for many K − units. We create a vector S with elements sj containing the produced commodities and a matrix A with consumption of other commodities needed to create the commodities. We have: • Elements sj ∈ S, 0 ≤ sj = number of J − units produced by industry j in one year. • Elements ai,j ∈ A,0 ≤ ai,j = number of I − units needed to produce one J − unit. • ai,j sj = number of I − units consumed by industry J in one year. • If we sum over all j we can then find the total I − units that are availible to to public (not consumed by the industries themselves), we call this di . • di = si − n X ai,j sj . j=1 • S − AS is then a vector with the total availible to the public of all commodities (elements di ). Suppose the public demands a certain amount of every commodity, we create a demand vector d = (d1 , d2 , . . . , dn ) ≥ 0. We want to find the minimal supply vector S = [s1 , s2 , . . . , sn ]> ≥ 0 that satisfy the demand. We are also interested to know if we can guarantee that there actually is a solution to the system or do we need to make further assumptions? We get the linear system: S − AS = d ⇒ S = (I − A)−1 d If (I − A) is invertible we can solve and find a solution relatively easily. However we don’t know if there is one or if the solution could have negative values in S (which would be equivalent to producing a negative amount of a commodity). 43 • We use the fact that if the Neumann series ∞ X ∞ X An converges, then (I − A)−1 = n=0 An . n=0 • If we assume ρ(A) < 1 we get lim Ak = 0, Which is the same as saying the k→∞ Neumann series converge. • Since A is non-negative, ∞ X An is as well and we have guaranteed not only a n=0 solution, but also that S ≥ 0. • The question is what assumptions we need to make in order for ρ(A) < 1? We look at the column sums X ai,j , this can be seen as the total number of units X required to create one I − unit. It seems reasonable to assume ai,j ≤ 1 with strict i i inequality for at least one industry. Since otherwise the industries with go with a loss, and we can assume at least one goes with a profit. Then there is a matrix E ≥ 0, E 6= 0 with elements ei,j such that X ai,j + ei,j = 1, i (every column sum of A + E equal to one).If u is the unit vector with all elements equal to 1, we get: u> = u> (A + E) We assume ρ(A) ≥ 1 and use the Perron vector x of A to get: u> x = u> (A + E)x ⇔ x = Ax + Ex This can’t be true since Ex > 0 (E non-negative, non-zero) and Ax ≥ x (since ρ(A)x = Ax, ρ(A) ≥ 1). So if we assume the column sums of A to be less than or equal to one with at least one less than one, we get that ρ(A) < 1 and we are finished. ∞ X We also note that if A is irreducible, then An is even positive and not only nonn=0 negative. In other words if we increase the demand of any commodity, we will need to increase the supply of all commodities and not only the one we increased the demand for. Take a few minutes to think through the following questions regarding the new assumptions of the system. Apart from assuming the industries aren’t making a loss, and at 44 least one makes a profit, we also assumed A to be irreducible. This corresponds to every commodity being directly or indirectly needed to produce every commodity. This doesn’t seem unreasonable considering we work with large industries. 45 5 Stochastic matrices and Perron-Frobenius 5.1 Stochastic matrices and Markov chains A stochastic matrix, sometimes also called transition matrix or probability matrix, is used to describe the transitions between states in a Markov chain. Markov chains are used in applications in a large number of fields, some example are: • Markov chain Monte Carlo for generating numbers from very complex probability distributions in statistics. • Modeling of market behaviour, such as market crashes or periods of high or low volatility in economics. • Population processes in for example biology, queuing theory and finance. • Other less obvious places like in speech recognition, error correction, thermodynamics and PageRank from the previous chapter. Our aim here is to get an understanding of how and why Markov chains behave as they do using what we know of Perron-Frobenius theory. We start by defining stochastic matrices and Markov chains. Definition 5.1 A square non-negative matrix is • Row stochastic if every row sums to one. • Column stochastic if every column sums to one. • If it is both row and column stochastic we say that it is doubly stochastic. We will from now only say stochastic matrix when we in fact mean row stochastic matrix. We also notice that stochastic matrices need not be irreducible, however when they are we can immediately use Perron-Frobenius to find a lot of information regarding the matrix. A common example of doubly stochastic matrices are the permutation matrices since only one element in every row and column is non-zero and that element is one. Every stochastic matrix corresponds to one Markov chain: Definition 5.2 A Markov chain is a random process, with usually a finite number of states, where the probability of transitions between states only depend on the current state and not any previous state. A Markov chain is often said to be memoryless. • Formal expression of the fact that a discrete time Markov chain is memoryless: P (Xn+1 = x|X1 = x1 , X2 = x2 , . . . , Xn = xn ) = P (Xn+1 = x|Xn = xn ) 46 • Every stochastic matrix defines a Markov chain and every Markov chain defines a stochastic matrix. You can see a discrete time Markov chain as a sequence of states X0 , X1 , X2 , . . . happening one after another in discrete time and the state in the next step depends only on the current state. So if our Markov chain has 3 different states, then this corresponds to a 3 × 3 stochastic matrix with every element ai,j corresponding to the transition probability from state xi to state xj . I.e., ai,j = P (Xn+1 = j|Xn = i)). Another way to represent a Markov chain is with a directed graph by considering the stochastic matrix as the graph’s distance matrix, where the distance is now a probability rather than a true distance. In this case, with the edges representing transition probabilities, it’s natural to only consider graphs with positive weights. As an example of a Markov chain, let us consider this simple weather model: • We know the weather on day zero to be sunny. • If it’s sunny one day it will be sunny the next with probability 0.7, cloudy with probability 0.2 and rainy with probability 0.1. • If it’s cloudy it will be sunny with probability 0.3, cloudy with probability 0.4 and rainy with probability 0.3. • If it’s rainy it will be sunny with probability 0.2, cloudy with probability 0.3 and rainy with probability 0.5. Seeing this in text makes it quite hard to follow, especially if we would like to add more states. However since the weather tomorrow only depend on the weather today, we can describe the model with a Markov chain and corresponding stochastic matrix. We order the states in the stochastic matrix from sunny to rainy and put in the transition probabilities to get the stochastic matrix P and it’s corresponding graph below. The vector X0 represents the probability distribution of the weather on day zero. sunny 1 0.7 0.2 0.1 cloudy X0 = 0 P = 0.3 0.4 0.3 rainy 0 0.2 0.3 0.5 0.2 0.7 sunny cloudy 0.4 0.3 0.1 0.3 0.2 0.3 rainy 0.5 47 When we worked with the distance matrix and adjacency matrix of graphs we saw that we could take the powers of the matrix to find out if there is a path between vertices. In the same way we can take the powers of the transition matrix P and if a element pi,j ∈ Pk 6= 0 then there it’s possible to get from state i to state j in k steps in the corresponding Markov chain. However we can say even more in the case of stochastic matrices, namely we can tell not only if it’s possible to be in a state after k steps, we can tell how likely it is as well. We can easily see that if we multiply the initial state (or distribution) X> 0 with P from the right, we get the probability to be in all the states in the next step X> 1 . Generally since we work with probabilities the initial distribution should be non-negative as well as sum to one. 0.7 0.2 0.1 > 0.3 0.4 0.3 = 0.7 0.2 0.1 X> 1 = X0 P = 1 0 0 0.2 0.3 0.5 And we see there is only a probability 0.1 that it will rain tomorrow provided it is sunny today. Now we can of course multiply this new vector X1 in the same way to get the probabilities for the weather on the day after tomorrow. 0.7 0.2 0.1 > 2 > 0.3 0.4 0.3 = 0.57 0.25 0.18 X> 2 = X0 P = X1 P = 0.7 0.2 0.1 0.2 0.3 0.5 And we get the probability for rain two days from now as 0.18 if it’s sunny today. So to get the probability distribution π k of the weather on day k we can take the transition matrix to the power of k and multiply with the initial distribution X> 0 to the left. > k X> k = X0 P Sometimes the transposed transition matrix is used instead Xk = (P> )k X0 , both obviously work equally well. • One important question in the theory of Markov chains is whether a stationary distribution π > = π > P exist. Or, in other words, if there is a distribution for which the distribution in the next step is the same as the distribution in the current step. k > • We are also interested to know if the limit π = ( lim X> 0 P ) exists. If it exists, k→∞ it means that as long as we iterate the Markov chain for long enough, we will eventually arrive at a stationary distribution π > = π > P • The stationary distribution can also be seen as the distribution of the time the Markov chain spend in every state if it is iterated a large number of times. • Last, when the limit exists, we are also interested to know wheather it converges towards the same stationary distribution π regardless of initial state X0 . 48 In order to answer these questions we divide the stochastic matrices in 4 different classes: 1. Irreducible with lim (Pk ) existing. (P is primitive) k→∞ 2. Irreducible with lim (Pk ) not existing. (P is imprimitive) k→∞ 3. Reducible and lim (Pk ) existing. k→∞ 4. Reducible and lim (Pk ) not existing. k→∞ 5.2 Irreducible, primitive stochastic matrices If P is irreducible we can use Perron-Frobenius for irreducible non-negative matrices. From this we get that there is one eigenvector with only positive values, namely the Perron- Frobenius eigenvector with eigenvalue r. From the same theorem we also have that r = 1 since r is bounded by the minimum and maximum row sum, both of which are equal to one since P is a stochastic matrix. The stationary distribution is then the same as the left Perron vector since we have: y> = y> P, if r = 1. So if P is irreducible, a stationary distribution exist, and it is unique. We remember the requirement for an irreducible matrix to be primitive is for the limit lim (P/r)k exist. Since r = ρ(P) = 1 the limit lim (Pk ) exist if P is primitive, and thus k→∞ k→∞ k > π = ( lim X> 0 P ) must exist as well. k→∞ Using Perron-Frobenius for primitive matrices, for P we get the (right) Perron vector (x = Px) as x = e/n, where e is the vector with every element equal to one and n is the number of states . You can easily verify this since r = 1 and every row sums to one. If we let π = (π1 , π2 , . . . , πn ) be the left Perron vector (π > = π > P) we get the limit: lim Pk = k→∞ (e/n)π > = eπ > > 0 π > (e/n) Calculating the probability distribution of the limit gives: k > > > lim X> 0 (P ) = X0 eπ = π k→∞ And we see that if P is irreducible and primitive, not only does the stationary distribution π exist and is unique, we also get that the Markov chain converge towards this distribution regardless of initial distribution X0 . Additionally Perron-Frobenius gives us an easy to calculate way to find the stationary distribution since it’s equal to the left Perron vector of P. We give a short example using the earlier weather model. 49 We had the transition matrix: 0.7 0.2 0.1 P = 0.3 0.4 0.3 0.2 0.3 0.5 Since P is positive, it’s obviously irreducible and primitive as well and as such we know that there is a stationary distribution and the distribution of the Markov chain will converge towards it. For the limit we get: k > > > lim X> 0 P = X0 eπ = π k→∞ To find the stationary distribution we calculate the left Perron vector π which yields π = [0.46 0.28 0.26]> which is our stationary distribution. From this we can say that on average it will be sunny 46% of the time, cloudy 28% of the time and rainy 26% of the time. 5.3 Irreducible, imprimitive stochastic matrices Since P is irreducible we know that there at least exists a stationary distribution and that it’s unique. However since it’s imprimitive the limit lim Pk doesn’t exist and the k→∞ Markov chain therefore doesn’t necessarily converge towards the stationary distribution. However we can easily see that the (right) Perron vector is the same as in the primitive case (since every row sums to one). Similarly finding the stationary distribution can also be done in the same way for imprimitive matrices as with primitive ones as the left Perron vector. While we in the primitive case would also have been able to find the stationary distribution simply by iterating the chain a large number of times, this is not the case when working with imprimitive matrices. Even though the Markov chain doesn’t converge, we are still interested in seeing how the Markov chain behaves if we let it iterate for a long time.This is useful for examle when we want to sample from a distribution using an imprimitive Markov chain in Markov chain Monte Carlo methods. While the limit lim Pk doesn’t exist, what we are k→∞ interested isn’t necessarily the limit itself but the distribution of time the Markov chain spends in the different states if we let the Markov chain iterate a large number of times. For primitive matrices this is the same as the limiting distribution. In the imprimitive case we can calculate the Ces´aro sum instead: k 1X j P k→∞ k lim j=0 This converges and we get: k 1X j P = eπ > > 0 k→∞ k lim j=0 50 We multiply the limit with the initial vector X0 and get just like in the primitive case: X0 k 1 X j > > P = X> 0 eπ = π k→∞ k lim j=0 In other words the stationary distribution π exist and is the same regardless of the choice of initial state X0 . The Markov chain doesn’t converge to it but rather ’oscillates’ around it. More on why the Ces´ aro sum always exist for these matrices can be read about in [?] We give a short example using our previous 0 P = 1 small permutation matrix: 1 0 We already saw that it’s imprimitive and therefore the limit doesn’t exist. However limit of the Ces´ aro sum does exist and we still get the stationary distribution π > as the Perron vector. This gives the stationary distribution π > = [0.5 0.5]> . So while the Markov chain won’t converge it will on average spend half the time in state 1 and half the time in state 2. In fact we can easily see that it spends exactly every second step in state 1 and the other in state 2. 5.4 Reducible, stochastic matrices Reducible matrices are not as common to work with in practice as with irreducible matrices, often because the model itself is constructed in such a way as to ensure it produces an irreducible matrix. However sometimes it’s desirable to use a reducible chain, for example using an absorbing Markov chain (one or more states which we can’t leave after the Markov chain ends up in it) in risk analysis. Since we can’t immedietly apply Perron Frobenius to reducible Markov chains, we for example don’t know if there exists any stationary distribution or maybe there even exist many. What we can do is try to get as close as possible to the irreducible case. We recall that a reducible matrix A could be permuted into upper triangular form by a permutation matrix P: X Y > P AP = 0 Z If X or Z is reducible as well we can permutate the original matrix further untill we end up with: X1,1 X1,2 · · · X1,k 0 X2,2 · · · X2k > P AP = . .. .. .. .. . . . 0 0 0 Xk,k 51 Where Xi,i is either irreducible or a single zero element. One thing we can immediately see already is that if we leave for example block X2,2 we can never come back since none of the states in any block Xi,i have any connection to any of the previous blocks Xj,j , j < i. Last we look for blocks where only the diagonal block Xi,i contains non-zero elements and permutate those to the bottom so we end up with: > P AP = X1,1 0 .. . X1,2 X2,2 .. . ··· ··· .. . X1,j X2j .. . X1,j+1 X2,j+1 .. . X1,j+2 X2,j+2 .. . 0 0 0 0 0 ··· Xj,j 0 Xj,j+1 Xj+1,j+1 Xj,j+2 0 0 .. . 0 .. . ··· 0 .. . 0 .. . Xj+2,j+2 .. . 0 0 0 0 0 ··· ··· ··· ··· ··· ··· ··· .. . .. . ··· X1,k X2,k .. . Xj,k 0 0 .. . Xk,k This is called the canonical form of the stochastic matrix and we notice a couple of things. First we remind ourselves that this is still the same Markov chain, we have just relabeled the vertices by permutating the matrix using a permutation matrix and we remind ourselves that the blocks Xi,j are matrices and not necessarily single elements. We also note that this form is generally not unique since we could swap the position of two different blocks in the bottom right part or sometimes in the top left as well. Using this form we can say some things about the behaviour of the Markov chain which might not be as apparent otherwise. We call the states in the upper left the transient states. Whenever we leave one of the diagonal blocks Xi,i , i ≤ j we can never get back to the same or any previous block since all elements below the diagonal blocks are zeroes. Also all the blocks in the top left Xi,i at least on of the blocks Xi,i+1 , . . . , Xi,k are non zero (or they would belong to the bottom right part). This means that if we start in any state in the top left part, we will eventually end up in one of the blocks in the bottom right. We call the states in the bottom right the ergodic states or the absorbing states, since once you enter any of the blocks in the bottom right part you can never leave it. Either because it’s a single vertex only linking back to itself, or because it’s in itself an irreducible stochastic matrix. The first thing to note is that the initial state obviously influences the behaviour of reducible Markov chains, for example if we start in one of the ergodic states we obviously get a different sequence than if we start in another ergodic state. So even if there would exist a stationary distribution we can expect it to depend on the initial state. Often with reducible Markov chains we aren’t as interested in finding where the chain is after a long time (especially if there is only one absorbing state), but we are often more 52 interested in seeing how long time we expect it to take in order to reach a absorbing state or the probability to end up in a certain absorbind state when there are more than one. One example would be when modelling the failure rates of a machine, we know it will fail eventually, but for how long can we expect it to run? Assuming a reducible stochastic matrix A is written in canonical form. X Y A= 0 Z We can then find the limit lim Ak if all blocks in Z are primitive matrices or the k→∞ corresponding limit of the Ces´ aro sum if at least one of them are imprimitive. We look at the case where all blocks in Z is primitive: The limit can then be written as: 0 (I − X)−1 YE lim A = 0 E k→∞ k > eπ j+1 0 ··· .. 0 . eπ > j+2 E= . . .. .. ··· 0 ··· 0 0 0 .. . eπ > k Obviously we don’t find a unique stationary distribution (since every block in Z is absorbing), however the limit is still very useful. First of all we see that E contains the individual Perron vectors for the blocks in Z. So if we start in one of the absorbing blocks we find a stationary distribution unique to that block. In fact in that case we can obviously disregard the rest of the Markov chain and only concern ourselves with that one block if we have the transition matrix in canonical form. Secondly the elements ap,q of (I − X)−1 Y holds the probability that starting in p we leave the transient states and eventually hit state q in one of the absorbing blocks. If q is a single absorbing state ap,q is thus the probability to end up in q when starting in p. This is commonly called the hitting probability to hit q starting in p. The vector (I − X)−1 e is of particular interest as well. It holds the average number of steps needed to leave the transient state, we call this the hitting times for reaching the absorbing states. We take a short look at hitting times and hitting probabilities in the next section from a slightly different perspective to understand why this is correct. 5.5 Hitting times and hitting probabilities When we have a reducible Markov chain (with a finite number of states) we know that we will eventually reach one absorbing state. We consider a stochastic matrix P with 53 transient states D and two absorbing states A, B. We want to find the hitting probability hi that we end up in A starting in d ∈ D. We illustrate how to find such a hitting probability with a short example where we consider the Markov chain describes by the graph below. 2/3 1/3 1 1/3 2 3 4 2/3 We ask ourselves: what is the probability to end up in 1 starting in 2? We write this probability as hi = Pi (hit 1). After one step we are either in state 1 with probability 1/3 or we are in 3 with probability 2/3. In other words we could write h2 as: 1 2 h2 = P2,1 h1 + P2,3 h3 = h1 + h3 3 3 But h1 is the probability to hit 1 starting in 1, since we are already in 1 this probability is obviously 1. We then continue and find the expression for h3 which can be written: 1 2 h3 = P3,4 h4 + P3,2 h2 = h4 + h2 3 3 Since state 4 is absorbing so the probability h4 to hit 1 from 4 is zero and we get the equation system: 1 2 h2 = + h3 3 3 h3 = 2 h2 3 Solving this gives the hitting probability h2 = 3/5. While this method can be done by hand with a low number of states, it is mostly used to find a recurrence relation when we have an infinite number of states such as a random walk on the integers. If we have a finite number of states and thus can write out our transition matrix P we can find the hitting probability using the following: Theorem 5.1. Consider the Markov chain with transition matrix P and n states. Let A be a subset of states. The vector of hitting probabilities hA i = Pi (hit A) is the minimal non-negative solution to the linear system: A i∈A hi = 1,n X A pi,j hA i∈ /A j hi = j=1 With minimal solution we mean that for any other solution x to the system we have xi ≥ hi . 54 By writing P in canonical form we can verify that this corresponds to the result in the previous section. The minimality is needed in order to get hA i = 0 in other absorbing states than those in the absorbing block we seek. When we solve the system we can set hA i = 0, if i is any absorbing state not in A. We note that it does not matter whether A consists of absorbing states, since we are not interested in what happens after we have entered A. In terms of the previous example we could use the theorem to find the probability that we reach state 3 at some point before reaching an absorbing state. Next we continue with our example but instead ask ourselves: What is the expected number of steps starting in 2 until we hit either of the absorbing states? We define the hitting time: ki = Ei (time to hit {1,4} ) As with the hitting probability we look at the next step but now also add a 1 for the time it takes to reach the next step: 1 2 2 k2 = P2,1 k1 + P2,3 k3 + 1 = k1 + k3 + 1 = k3 + 1 3 3 3 Where we noted that k1 = 0 since we then have already reach one of the absorbing states. Next we continue with k3 and get: 2 2 1 k3 = P3,4 k4 + P3,2 k2 + 1 = k4 + k2 + 1 = k2 + 1 3 3 3 Solving this system gives the expected number of steps until absorption as k2 = 3. In the same way as for hitting time we get in general: Theorem 5.2. Consider the Markov chain with transition matrix P, n states and absorbing states A. The vector of hitting times kiA = Pi (hit A) is the minimal non-negative solution to the linear system: A i∈A ki = 0, n X A pi,j kjA i ∈ /A ki = 1 + j=1 Once again we can write the system in canonical form to verify that this corresponds to the result in the previous section. We do note however that care needs to be taken so that we don’t have other absorbing states outside of A. This since we would then get the relation for a single absorbing state outside of A as kj = 1 + kj . 5.6 A short look at continuous time Markov chains Here we will seek to give a short introduction of continuous time Markov chains and their relation to what we already know about discrete time chains without going to much into probability theory. This is not meant to give a full understanding of continuous time 55 chains, but rather highlight similarities and relations with discrete time chains and previous chapters. A continuous time Markov chain can be seen as a discrete time chain, but instead of jumping to a new state at fixed intervals 1, 2, 3, . . . we instead jump to a new state randomly in time. With randomly in time we mean that the jump from a state is exponential distributed with intensity λ (which might be different between vertices). Exponential distributed jumps mean two important things, one if we assume there is only one jump T in the interval T ∈ (0, t) then the jump is uniformly distributed on this interval P (T = x|x ∈ (0, t)). Second the jumps are memoryless which means that if there was no jump in (0, s) then the probability of a jump at a point in time after s is independent of s: P (T ∈ (0, t)) = P (T ∈ (s, s + t)|T ≥ s). We will only concern ourselves with chains with a finite state space. With infinite state space there are more things that needs to be taken into consideration such as chances for the chain to ”explode” or when the chance to eventually get back to where we started is less then one (even if it’s irreducible). For a continuous time chain with a finite state space we can make a graph of the chain. Now however we no longer have the probabilities to jump to another state as weights on the edges, but rather the intensity λ for the jump. We also naturally have no self-loops since this is handled by the intensity of the jumps to other states. An example of what it could look like can be seen in subsection 5.6 below. 1 1 2 2 1 1 1 3 4 2 In continuous time we talk about two different matrices. First we have the Q matrix which is the negative Laplacian matrix where we take into consideration the weight’s of the edges. This means that we get elements qi,j = λi,j , i 6= j where λi,j is the intensity of jumps from state i to state j. The diagonal elements we choose such that the sum of every row is zero. For the Markov chain described by the graph above we get Q matrix: −3 1 2 0 0 −1 0 1 Q= 1 0 −2 1 0 0 2 −2 If Q is irreducible and finite then there exist a stationary distribution π which we can find as π > Q = 0. In continuous time we can’t have any periodicities and irreducibility 56 is actually enough for a finite chain to guarantee convergence towards this distribution. While we can solve this system to find the stationary distribution there is also another more familiar matrix we can work with. Namely the transition matrix P(t) = etQ . The element pi,j (t) in P (t) describes the probability of being in state j at time t if the chain started in state i at time 0. We will look more at the exponential of a matrix and how to calculate it later in ??. For the moment it is enough to know that we get : P(t) = etQ = ∞ X (tQ)k n=0 k! P(t) is then a stochastic matrix for all t ≥ 0 and we can find the stationary distribution (using any t > 0) in the same way as we did for a discrete time Markov chain. For Q above we get using t = 1 0.1654 0.2075 0.3895 0.2376 0.0569 0.3841 0.2533 0.3057 P(1) = 0.1663 0.0982 0.4711 0.2645 0.1394 0.0569 0.4720 0.3317 And we can also find the stationary distribution by solving the linear system π > P(1) = π by for example finding the Perron vector π. It’s easy to see that π = [0.1429 0.1429 0.4286 0.2857]> is a stationary distribution and that it satisfies both π > P(t) = π and π > Q = 0. > 5.7 Exercises 5.1. Consider the stochastic matrix 0 0.5 0.5 0 1 P= 0 0.5 0 0.5 • Show that P is irreducible. • If we start in state 1, what is the probability to be in state 3 after 3 steps (P (X3 = s3 |X0 = s1 )). • What is the stationary distribution of this Markov chain? 5.2. Consider the stochastic matrix 0 0.5 0 0.5 1 0 0 0 T= 0 0 1 0 0.25 0.25 0.25 0.25 57 • Write T in canonical form. • Calculate the hitting times of this Markov chain. What is the expected number of steps needed to reach the absorbing state if we start in the first state? 5.8 Applications 5.8.1 Return to the Leontief model We return to the Leontief Input-Output model in subsubsection 4.3.1 for modeling a economic system composed of n industries.. We remember we got a linear system S − AS = d. We found that if A was irreducible (every industry directly or indirectly depend on all others) and every column sum to one or less, with at least one column with sum less than one (at least one goes with a profit and none goes with a loss), we could guarantee that we could satisfy any given demand d. We are now interested to answer some more questions about the system, such as what is the gross national product (GNP) of the system? To answer this we will look at the system as a Markov chain with transition matrix Q = A> . We note that we take the transpose of the previous system matrix such that every row now sums to one or less rather than column and elements pi,j now corresponds to the amount of J−units required by industry I to create one I−unit. To account for the demand vector, which essentially takes units of produced commodities out of the system we introduce a new state d in the transition matrix. Since any produced commodity that enters the ”public” Xcan’t come back into the system we let d be an absorbing state such that Qi,d = 1 − qi,j . j In other words we modify the transition matrix by adding a elements to the column corresponding to the demand d such that every row of A> sum to one. We get a transition matrix of the form: q1,1 . . . q1,n q1,d .. .. .. .. . . . Q= . qn,1 . . . qn,n qn,d 0 ... 0 1 Since the top left n × n part is irreducible, and at least one of qi,d , . . . qn,d > 0 since at least one goes with a profit we see that we can reach the absorbing state d from any other state. Our transition matrix therefore describes an absorbing Markov chain it’s also easy to see that any initial supply S leads to supplying some demand d. GNP is the total supply S produced to both satisfy the demand as well as the internal demand of the system. If we look at the mean number of steps required to reach the absorbing state from the transient states: T = (I − A> )−1 u. • We can see the elements ti as the expected production in units to create one product unit of industry i. 58 • Multiplying the demand vector d with T: (T> d) gives the expected total produced units. i.e. the GNP. 59 6 Linear spaces and projections In this chapter, which will be a bit more theoretical, we will take a look at linear spaces and their relation to matrices. This in order to get a better understanding of some of the underlying theory as well as some commonly used methods or concepts used later. We will start by defining linear spaces and inner product spaces and later in the last part take a closer look at projections and how they can be used to for example find a basis of a matrix. 6.1 Linear spaces We start by defining a linear space: Definition 6.1 A linear space or vector space over a field F is a set V together with two binary operations (+, ∗) defined such that: u, v ∈ V ⇒ u + v ∈ V, and u ∈ V, a ∈ F ⇒ a ∗ u ∈ V With the following being true, where we let u, v, w be vectors in V and a, b be scalars in F . For vector addition (+) we have: • Associativity u + (v + w) = (u + v) + w. • Commutativity: u + v = v + u. • Existence of identity element 0 ∈ V such that v + 0 = v for all v ∈ V . • Existence of a inverse element −v ∈ V such that v + (−v) = 0. For scalar multiplication (∗) we have: • Distributivity with respect to vector addition: a ∗ (u + v) = a ∗ u + a ∗ v. • Distributivity with respect to field addition: (a + b) ∗ v = a ∗ v + b ∗ v • Compatibility with field multiplication: a ∗ (b ∗ v) = (ab) ∗ v. • Existence of identity element 1 ∈ F such that 1 ∗ v = v for all v ∈ V . Usually we omit the sign (∗) for multiplication as we are used to. In order to not go to much into algebraic structures we consider a field a set with two operations (+, ·) following the ”usual” rules such as associativity and commutativity. To read more about fields and other algebraic structures we refer to a book on algebra such as [?]. Two examples of fields are the real and complex numbers R, C. 60 We take a closer look at the definition of a linear space and the requirements on the two operations (+, ∗) for V to be a linear space. First of all we see that + is a operation between two vectors in V such that the result is a new vector also in V . We also need it to be associative and commutative such that we can add in whatever order we want and/or swap place of the two vectors. There also exist a zero(identity) and inverse element. These are all things we are used to from the real numbers. Second we have another operation ∗ between a scalar in F and a vector in V such that the result is a vector in V . Similarly to + we have a few requirements on the operation such as distributivity with respect to both vector and field addition. This means means for example we can either add two vectors and then multiply or multiply with both vectors and then add the result. We also have Compatibility with field multiplication as well as the existence of a multiplicative identity element. Some examples of linear spaces using the common addition and multiplication operations are: • Rn . • Cn . • Matrices of size n × n with elements in F : Mn×n (F ). • {polynomials} In the future, unless otherwise noted we will assume the common addition and multiplication operation is used. We take a look at R2 with vectors u = (u1 , u2 ), v = (v1 , v2 ), w = (w1 , w2 ) ∈ V and scalars a, b ∈ F . If we add two vectors we get: (u1 , u2 ) + (v1 , v2 ) = (u1 + v1 , u2 + v2 ), since the addition of two real number (for example u1 + v1 ) is a new real number this obviously belong to R2 as well. Both assiciativity and commutativity immediately follows from addition in R as well. The identity element with respect to addition is obviously the zero vector 0 = (0, 0) since (u1 , u2 ) + (0, 0) = (u1 + 0, u2 + 0) = (u1 , u2 ). Similarly we find inverse elements simply by multiplying every element of the vector with −1. For scalar multiplication a ∗ (v1 , v2 ) = (av1 , av2 ) we can show that it fullfills the requirements in a similar way and as such is left to the reader. We now introduce something called subspaces of linear spaces, these can be seen as another linear space contained within a linear space. Definition 6.2 If V is a vector space, a subset U ⊆ V is called a subspace of V if U is itself a vector space with the same operations (+, ∗) as V . We also get a way to test if a subset U is a subspace of a linear space V by checking a just a couple of different things rather than all the requirements in the original definition for a linear space. 61 Theorem 6.1. The subset U ⊆ V is a subspace of V if and only if it satisfies the following 3 conditions. • 0 lies in U , where 0 is the zero vector (additive identity) of V . • If u, v ∈ U then u + v ∈ U . • If u ∈ U and a ∈ F then au ∈ U . For example we look at the linear space R2 and want to see if the points on the line y = 2x, x, y ∈ R is a subspace of R2 . We can write the points on the line as: U = {a(1, 2)|a ∈ R} U is obviously a set in R2 so we only need to show the three requirements above. If we choose a = 0 it’s clear that 0 ∈ V also lies in U . Adding two vectors in U gives a(1, 2) + b(1, 2) = (a + b)(1, 2) which is obviously in U since a + b ∈ R if a, b ∈ R. Last we multiply with a scalar b ∗ a(1, 2) = (ba)(1, 2) which also obviously belong to U since ba ∈ R. Definition 6.3 A vector v such that: v = c1 v1 + c2 v2 + . . . + cm vm = m X cj vj j=1 is called a linear combination of vectors {v1 , v2 , . . . , vn } ∈ V with coefficients in F if {c1 , c2 , . . . , cm } ∈ F . We illustrate it with an example: Considering the two vectors u = (1, −2, 1) and v = (1, 1, 2). Then the vector (3, −3, 4) is a linear combination of u, v since 2u + v = 2(1, −2, 1) + (1, 1, 2) = (3, −3, 4). However the vector (1, 0, 1) is not a linear combination of these vectors since if we write it as an equation system su + tv = (1, 0, 1) we get: 1=s+t 0 = −2s + t 1 = s + 2t The only way to satisfy the first and third equation is if s = 1, t = 0 however then the middle equation is not fulfilled. We often talk about not one linear combination of vectors but all linear combinations of a set of vectors: Definition 6.4 If {v1 , v2 , . . . , vm } is a set of vectors in V . Then the set of all linear combinations of these vectors is called their span denoted: span(v1 , v2 , . . . , vm ) 62 • If V = span(v1 , v2 , . . . , vm ) then we call these vectors a spanning set for V . For example given the two vectors u = (1, −2, 1) and v = (1, 1, 2) above we get their spanning set: span(u, v) = {su + tv|s, t ∈ R} = (s + t, −2s + t, s + 2t) So the spanning set is all vectors which we can write in this way. We recognize that the spanning set above actually describes a plane in R3 . In fact all spanning sets in a vector space V are subspaces of V . Theorem 6.2. For a span U = span(v1 , v2 , . . . , vm ) in a vector space V we have: • U is a subspace of V containing each of v1 , v2 , . . . , vm . • U is the smallest subspace containing v1 , v2 , . . . , vm , in that all subspaces containing v1 , v2 , . . . , vm must also contain U . We prove the first statement: Proof. We use the test described earlier to see if this subset is also a subspace. • U obviously contains 0 = 0v1 + 0v2 + . . . + 0vm . • Given u = a1 v1 + a2 v2 + . . . + am vm and v = b1 v1 + b2 v2 + . . . + bm vm and c ∈ F . • u + v = (a1 + b1 )v1 + (a2 + b2 )v2 + . . . + (am + bm )vm ∈ U • cu = (ca1 )v1 + (ca2 )v2 + . . . + (cam )vm ∈ U . • And we have proven that U is a subspace of V . We take a look at the spanning set span{(1, 1, 1), (1, 0, 1), (0, 1, 1)} and want to find if this spans the whole of R3 ? span{(1, 0, 1), (1, 1, 0), (0, 1, 1)} is obviously contained in R3 and it’s clear that R3 = span{(1, 0, 0), (0, 1, 0), (0, 0, 1)}. We can then prove that (1, 0, 0), (0, 1, 0), (0, 0, 1) is contained in our spanning set and use Theorem 6.2 to prove that it contains R3 . We do this by writing (1, 0, 0), (0, 1, 0), (0, 0, 1) as a linear combination of our vectors: (1, 0, 0) = (1, 1, 1) − (0, 1, 1) (0, 1, 0) = (1, 1, 1) − (1, 0, 1) Using the two first found vectors we find the last. (0, 0, 1) = (1, 1, 1) − (1, 0, 0) − (0, 1, 0) 63 It’s clear that there are generally more than one way to write a linear space as a spanning set. We will now move on to look at two different types of spanning sets. We compare two spanning sets in R2 : span((1, −1), (1, 1), (1, 2)) = R2 span((1, 2), (2, 3)) = R2 In the first spanning set we can represent the vector (3, 2) as a linear combination of the spanning vectors in two ways: (3, 2) = 1(1, −1) + 1(1, 1) + 1(1, 2) (3, 2) = 0(1, −1) + 4(1, 1) − 1(1, 2) However in the second spanning set we get: (3, 2) = s(1, 2) + t(2, 3) = (s + 2t, 2s + 3t) ⇒ s = −5 t=4 In fact, any vector (a, b) have only one unique representation (a, b) = s(1, 2) + t(2, 3). We introduce the notion of linearly dependent or independent sets of vectors. Definition 6.5 A set of vectors {v1 , v2 , . . . , vm } is called linearly indepependent if the following is true: a1 v1 + a2 v2 + . . . + am vm = 0 ⇒ a1 = a2 = . . . = am = 0 In other words, a set of vectors {v1 , v2 , . . . , vm } are linearly independent if the only way to represent 0 in these vectors is with all coefficient zero: 0 = 0v1 + 0v2 + . . . + 0vm One important consequence of this is that it also means that any other vector v can only be representes as a linear combination of {v1 , v2 , . . . , vm } in one way as well. Otherwise we could find two representations of 0 by subtracting v − v using two different representations of v. If a set of vectors isn’t linearly independent they are linearly dependent. Definition 6.6 A set of vectors {v1 , v2 , . . . , vm } is called linearly depependent iff some vj is a linear combination of the others. We consider three vectors {(1 + x), (2x + x2 ), (1 + x + x2 )} in (P2 ) and want to find if they are linearly independent or not by finding a linear combination of them that result in the zero vector. 0 = s(1 + x) + t(2x + x2 ) + u(1 + x + x2 ) 64 We get a set of linear equations: s s + 2t t + + + u u u =0 =0 =0 It’s clear that the only solution to the system is s = t = u = 0 and we see that the vectors are linearly independent. Considering this relation between solving an equation system and linear independence we can easily proove the following usefull theorem. Theorem 6.3. If a vector space V can be spanned by n vectors. Then for any set {v1 , v2 , . . . , vm } of linear independent vectors in V , m ≤ n. In the proof which we leave to the reader we show that the m vectors are linear depenm X dent if m > n by setting up the equation system xj vj = 0 We represent the m linear j=1 independent vectors as a linear combination of a span of n vectors. This will result in a system of n equations with m unknown variables, which has a nontrivial solution. In practise the theorem means that every set of linear dependent set of vectors that span a vector space V have the same number of vectors, we call this a basis. Definition 6.7 A set of vectors {v1 , v2 , . . . , vm } is called a basis in V if the following is true: • {v1 , v2 , . . . , vm } is linear independent. • V = span{v1 , v2 , . . . , vm }. We summarize our findings of bases in a linear space. • If {v1 , v2 , . . . , vm } and {w1 , w2 , . . . , wn } are two bases in V , then m = n. • In other words, all bases in a vector space contain the same number of vectors. • If V have a basis of n vectors. We say that V have the dimension n: dim V = n • We say that a vector space is finite dimensional if either V = 0 or dim V < ∞. In fact not only for all bases in a linear space V is it true that they all have the the same amount of vectors, any set of that same amount of linear independent vectors in V is a basis in V . Theorem 6.4. If dim V = n < ∞ and {v1 , v2 , . . . , vn } is a set of linear independent vectors, then {v1 , v2 , . . . , vn } is a basis in V . 65 Proof. It’s already linear independent so we then need to prove that {v1 , v2 , . . . , vn } spans V . If we assume {v1 , v2 , . . . , vn } does not span V , we could then choose a new vector vn+1 outside span{v1 , v2 , . . . , vn }. Then {v1 , v2 , . . . , vn , vn+1 } is linear independent. But we already know there is a basis {w1 , w2 , . . . , wn } which spans V since dim V = n, and for any set of linear independent vectors {v1 , v2 , . . . , vm }, we have m ≤ n. So we get a contradiction and {v1 , v2 , . . . , vn } must span V . One common solution to problems is to fit a new more appropriate basis to the problem and use that instead, this guarantee that any set of same amount of linear independent vectors is also a basis making it unnecessary to check if our new proposed basis is actually a basis or not. In essence it makes it easy to find a new basis if we already have one basis to start with. For example changeing to spherical coordinates, finding eigenvectors of a matrix, when finding many matrix decompositions, or making a fourier approximation of a function can all be seen as more or less as finding a new basis to the problem. Before continuing we give a short relation to matrices and how they can and are often used to represent a basis. Theorem 6.5. For a (real) n × n matrix M the following properties are equivalent: • The rows of M are linearly independent in Rn • The columns of M are linearly independent in Rn • The rows of M span Rn . • The columns of M span Rn . • M is invertible. 6.2 Inner product spaces We note that this far we have only seen vectors as a set of elements in F . However when we talk about vectors we are often talking about length of or angle between vectors. To be able to do this we need to add something to our linear space, namely we add another operation called a inner product and call it a inner product space. Definition 6.8 An inner product on a vector space V over F is a map h·, ·i that takes two vectors u, v ∈ V and maps to a scalar s ∈ F . We denote the inner product of two vectors u, v: hu, vi for which the following statements must be true: • hu, vi = hv, ui, (symmetric in Rn ). • hau, vi = ahu, vi. 66 • hu + v, wi = hu, wi + hv, wi. • hu, ui > 0 for all u 6= 0 in V . • A vector space V with an inner product h·, ·i we call a inner product space. The most commonly used inner product is the dot product sometimes called scalar product: u · v = u1 v1 + u2 v2 + . . . un vn ∈ Rn Using the definition above we can show that the dot product is in fact a inner product. • For the first property we get u · v = u1 v1 + u2 v2 + . . . un vn = v1 u1 + v2 u2 + . . . vn un . Which is obviously true since xy = yx in R. • For the second we get: au·v = au1 v1 +au2 v2 +. . . aun vn = a(v1 u1 +v2 u2 +. . . vn un ). Which also immediately follows from multiplication in R. • Next we get (u + v) · w = (u1 + v1 )w1 + (u2 + v2 )w2 + . . . (un + vn )wn = (v1 w1 + v2 w2 + . . . vn wn ) + (u1 w1 + u2 w2 + . . . un wn ) = (v · w) + (u · w) and the third statement is fulfilled as well. • Last we note that hu, ui is the sum of the square of all elements. Since the square of a real number r is positive if r 6= 0 the only way for the dot product to equal zero is if u = 0. But although the dot product is the most commonly used inner product, there are also other inner products and such as for example the following on the polynomials of order n Pn : Z hp, qi = 1 p(x)q(x) dx 0 Where p, q are polynomials in Pn . It’s easy to show that this is a inner product from the definition as well for the interested reader. √ You should be familiar with the Euclidean norm |v| = the norm of a inner product space in a similar way. v> v of a vector, we define Definition 6.9 If h·, ·i is an inner product in the vector space V , the norm (sometimes called length) kvk of a vector v in V is defined as: p kvk = hv, vi We immediately see that when using the dot product we get the Euclidian norm. The norm in a inner product space share some properties with those of a vector norm: 67 • kvk ≥ 0, kvk = 0 if and only if v = 0. • kcvk = |c| kvk. • kv + wk ≤ kvk + kwk. But we also have one additional important property of norms in a inner product space called the Schwarz inequality: • If u, v are vectors in V , then: hu, vi2 ≤ kuk2 kvk2 • With equality if and only if one of u, v is a scalar multiple of the other. Now that we have a notion of length (norm) in our inner product space we might also want the distance between vectors. Once again we start by looking at how we calculate distance using the dot product and the common geometric distance (in R3 ). Using the dot product we get the distance between two vectors d(u, v) as: d(u, v) = p (u1 − v1 )2 + (u2 − v2 )2 + (u3 − v3 )2 p = (u − v) · (u − v) Which we recognize as the Euclidean norm |u − v|. We define the distance between vectors in a general inner product space in the same way: Definition 6.10 The distance between two vectors u, v in an inner product space V is defined as: d(u, v) = ku − vk For the distance d(u, v) we have the following properties for vectors u, v, w in the inner product space V . • d(u, v) ≥ 0 with equality iff u = v. • d(u, v) = d(v, u). • d(u, w) ≤ d(u, v) + d(v, w). While we can now calculate the length of a vector or the distance between two vectors we still have no idea how to find out if for example two vectors are parallel or orthogonal. To do this we need to define the angle between vectors. Using Schwarz inequality we define the angle between two vectors u, v. Definition 6.11 If 0 ≤ θ ≤ π is the angle between two vectors we have: cos θ = 68 hu, vi kuk kvk We remember Schwarz inequality saying: hu, vi2 ≤ kuk2 kvk2 This gives: ⇔ hu, vi kuk kvk 2 ≤ 1 ⇔ −1 ≤ hu, vi ≤1 kuk kvk Which we then use to define the angle between two vectors. One very important case is when the inner product is zero (hu, vi = 0) and therefor the angle π/2 radians between them. This is not only interesting geometrically but for general inner product spaces as well. Definition 6.12 Two vectors u, v in a inner product space V are said to be orthogonal if: hu, vi = 0 More important is if we have a whole set of vectors, all which are orthogonal to eachother: Definition 6.13 A set of vectors {v1 , v2 , . . . , vm } is called an orthogonal set of vectors if: • vi 6= 0 for all vi . • hvi , vj i = 0 for all i 6= j. • If kvi k = 1 for all i as well, we call it an orthonormal set. If we have a orthogonal set {v1 , v2 , . . . , vm } and want a orthonormal set we can easily find a orthonormal set simply by normalizing every vector. v1 v2 vm , ,..., kv1 k kv2 k kvm k Given a orthogonal set we have a couple of important properties, first of all it’s also linear independent. Theorem 6.6. If {v1 , v2 , . . . , vm } is an orthogonal set of vectors then it is linear independent. We now have another way to find a basis in a inner product space V , rather than trying to find a set of linear independent vectors, we could instead look for a orthogonal set of vectors. The found orthogonal set will then automatically be linearly independent as well. We will look more into how to do this in later when we take a closer look at projections. Additionally we have one additional property concerning the norm and orthogonal vectors. 69 Theorem 6.7. If {v1 , v2 , . . . , vm } is an orthogonal set of vectors then: kv1 + v2 + . . . + vm k2 = kv1 k2 + kv2 k2 + . . . + kvm k2 We give a short motivation showing it for two vectors, the proof can then easily be extended for n vectors as well. Given two orthogonal vectors v, u we get: kv + uk2 = hv + u, v + ui = hv, vi + hu, ui + 2hv, ui Since v, u are orthogonal hv, ui = 0 and we get: kv + uk2 = hv, vi + hu, ui = kvk2 + kuk2 A matrix whose rows or columns form a basis of orthonormal vectors is called a unitary matrix. We remember that for a unitary matrix we have: Theorem 6.8. For a n × n matrix U the following conditions are equivalent. • U is invertible and U−1 = UH . • The rows of U are orthonormal with respect to the inner product. • The columns of U are orthonormal with respect to the inner product. • U is a unitary matrix. If U is a real matrix using the ordinary dot product we say that U i an orthogonal matrix if it satisfies the conditions above. Worth noting is that if U is real the hermitian of a matrix is the same as the transpose and we get U−1 = UT . One example of unitary matrices are the permutation matrices, in fact we can see the unitary matrices as a sort of generalization. In the same way as multiplying with a permutation matrix PAPT can be seen as changing from one basis to another by reordering the basis vectors (relabel a graph). Multiplication with a unitary matrix UAUH can also be seen as a change of basis. This is often used in practice by given M find a suitable matrix decomposition M = UAUH and then continue working with the presumably easier A instead. 6.3 Projections We will now look closer at why we usually want to work with a orthogonal basis and how to construct such a basis. We still haven’t given a reason why we would want to find a new orthogonal basis given that we usually start with the standard orthonormal basis represented by the identity matrix. This however wil mainly have to wait until the examples and applications. One of the large advantages of orthogonal bases is in the ease of finding a representation of any vector in this basis. We have already seen that the representation is unique since it’s basis vectors are linearly independent and we therefor find the new representation by solving a linear system. However if we have an orthogonal basis {e1 , e2 , . . . , em } of a inner product space V . Then for any vector v ∈ V : 70 • v= hv, e1 i hv, e2 i hv, em i e1 + e2 + . . . + em ke1 k2 ke2 k2 kem k2 • We call this the expansion of v as a linear combination of the basis vectors {e1 , e2 , . . . , em }. This gives us an easy way to express v given an orthogonal basis without the need to solve a linear equation system. Even better, if the basis is orthonormal it’s even easier since then all kei k2 = 1 and we simply need to take the inner product of the vector v and all the basis vectors. The next step is in constructing an orthogonal basis, we will take a look at projections and how to use these to create a basis. While we have already constructed bases many times before, for example every time you use Gauss-elimination, we are now interested in how we can ensure that we create a orthogonal basis and rather than only a linear independent one. One method to do this is the Gram-Schmidt process, however before looking at the method we must first show that such a basis actually exist. Theorem 6.9. In every inner product space V with dim V < ∞ there exist an orthogonal basis. Proof. We let V be an inner product space with dimension dim V = n and show it using induction. • If n = 1 then any basis is orthogonal. • We then assume every inner product space of dimension n have an orthogonal basis. Then if dim V = n+1 and {v1 , v2 , . . . , vn+1 } be any basis of V . If U = span(v1 , v2 , . . . , vn ) and {e1 , e2 , . . . , en } is an orthogonal basis of U . It’s then enough to find any vector v such that hv, ei i = 0 for all i. since then {e1 , e2 , . . . , en , v} will be an orthogonal basis in V . We consider: v = vn+1 − t1 e1 − t2 e2 − . . . − tn en Where we need to find all ti . Since vn+1 is not in U , v 6= 0 for all choices of ti . So if we can find ti for all i such that hv, ei i = 0 we are finished. For the orthogonality with ei we get hv, ei i = hvn+1 , ei i − t1 he1 , ei i − . . . − ti hei , ei i − . . . − tn hen , ei i = hvn+1 , ei i − 0 − . . . − ti kei k2 − . . . − 0 = hvn+1 , ei i − ti kei k2 And we get hv, ei i = 0 if ti = hvn+1 , ei i . kei k2 71 We noice that the proof not only shows that there is an orthogonal basis, but also gives us a way to find one given any other basis. It’s is recommended that you take some time to really understand the proof and why it works. We use the results here in order to create the Gram-Schmidt process used in order to find an orthogonal basis. We summarize the previous finding in Theorem 6.10. Let {e1 , e2 , . . . , en } be an orthogonal set of vectors in a inner product space V and v be a vector not in span(e1 , e2 , . . . , en ). • Then if: en+1 = v − hv, e1 i hv, e2 i hv, en i e1 − e2 − . . . − en 2 2 ke1 k ke2 k ken k2 • {e1 , e2 , . . . , en , en+1 } is an orthogonal set of vectors. Using this we construct the Gram-Schmidt process: • Let V be an inner product space, and {v1 , v2 , . . . vn } be a basis of V . • We then construct the vectors e1 , e2 , . . . en of an orthogonal basis in V succesively using: • e1 = v1 . • e2 = v2 − hv2 , e1 i e1 . ke1 k2 • e3 = v3 − hv3 , e1 i hv3 , e2 i e1 − e2 . 2 ke1 k ke2 k2 • ... • en = vn − hvn , e1 i hvn , en−1 i e1 − . . . − en−1 . 2 ke1 k ken−1 k2 We note that to get a orthonormal basis (and orthogonal matrix) we need to also normalize the resulting vectors. We can do that either at the end when we have found an orthogonal basis or during the algorithm itself. While this works in theory, when implemented on a computer there can be some stability problems. When computing the vector ek , there might be some small error in precision resulting in vk being not completely orthogonal. This is then compounded when we further orthogonalize against this vector and resulting vectors repeatedly. Luckily we can make a small adjustment of the method in order to handle these problems. Modified Gram-Schmidt works like Gram-Schmidt with a small modification. • u1 = v 1 . (1) • uk = v k − 72 hvk , u1 i u1 . ku1 k2 (1) (2) (1) • uk = uk − huk , u2 i u2 . ku2 k2 • ... (n−2) • (n−1) uk = (n−2) uk hu , un−1 i − k un−1 . kun−1 k2 (j) • In every step j we orthogonalize all vectors uk , k > j with uj . We note that while we assume that {v1 , v2 , . . . vn } it doesn’t actually need to be linearly independent. Since if we apply it to a set of linear dependent vectors then for some i, ui = 0 and we can discard them and not orthogonalize further vectors with them. In this case the number of resulting vectors will be the same as the dimension of the space spanned by the original vectors. One example of where we can use Gram-Schmidt is in finding a QR-decomposition of a matrix, we look more at that in the applications part here and the QR-method later. While we have already used projections, we haven’t actually defined what a projection is in respect to a inner product space. Definition 6.14 If v = v1 + v2 with v1 ∈ U and v2 ∈ U ⊥ , where U ⊥ is orthogonal to U . • The vector v1 is called the projection of v on U , denoted: projU (v). • The vector v2 is called the component of v orthogonal to U , given by: v2 = v − projU (v). Theorem 6.11. If V is a inner product space and U is a subspace of V . Then every vector v ∈ V can be written uniquely as: v = projU (v) + (v − projU (v)) • If e1 , e2 , . . . , em is any orthogonal basis of U , then: projU (v) = hv, e1 i hv, e2 i hv, em i e1 + e2 + . . . + em ke1 k2 ke2 k2 kem k2 • Additionally, for the subspaces U, U ⊥ we have: dim V = dim U + dimU ⊥ The projection of a vector v in V on a subspace U can be seen as the vector in U most similar to v in U . Looking at the Gram-Schmidt process earlier we also see that another way to see the algorithm is by adding new vectors ei by taking the reminder after subtracting the projection of ei on the subspace spanned by the previous vectors. ei = vi − projU (v), U = span(e1 , . . . , ei−1 ) 73 Theorem 6.12. If V is a inner product space and U is a subspace of V . Then projU (v) is the vector in U closest to v ∈ V . kv − projU (v)k ≤ kv − uk for all u ∈ U . Proof. For v − u we get: v − u = (v − projU (v)) + (projU (v) − u) Since the first lies in U ⊥ and the second lies in U they are orthogonal and we can use pythagoras: kv − uk2 = (v − projU (v))2 + (projU (v) − u)2 ≥ (v − projU (v))2 We look at an example in R3 where we want to find the point on a plane closest to a point in space. We consider the plane given by the equation x − y − 2z = 0 and find the point on the plane closest to the point v = (3, 1, −2). • For the plane we get x = y + 2z and the subspace U = {s + 2t, s, t|s, t ∈ R} = span((1, 1, 0), (2, 0, 1)) • Using Gram-Schmidt we get: {e1 = (1, 1, 0), e2 = (2, 0, 1)− h(2, 0, 1), (1, 1, 0)i (1, 1, 0). k(1, 1, 0)k2 • e2 = (2, 0, 1) − (1, 1, 0) = (1, −1, 1). • Then we find the closest point by finding the projection of v on U : projU (v) = hv, e1 i hv, e2 i e1 + e2 ke1 k2 ke2 k2 • = 2e1 + 0e2 = (2, 2, 0) Since we have found an orthogonal base for the plane we can now easily also find the point on the plane closest to any general point v = (x, y, z)in space as: projU (v) = hv, e1 i hv, e2 i x+y x−y+z e1 + e2 = (1, 1, 0) + (1, −1, 1) 2 2 ke1 k ke2 k 2 3 6.4 Applications 6.4.1 Fourier approximation We inspect the space C[−π, π] of continuous functions on the interval [−π, π] using the inner product: Z π hf, gi = f (x)g(x) dx π . 74 • Then {1, sin x, cos x, sin 2x, cos 2x, . . .} is an orthogonal set. • We can then approximate a function f by projecting it on a finite subspace of this set: Tn = span({1, sin x, cos x, sin 2x, cos 2x, . . . , sin mx, cos mx}) This gives f (x) ≈ tnf (x) = a0 +a1 cos x+b1 sin x+a2 cos 2xb2 sin 2x+. . .+am cos mx+ bm sin mx. • The coefficients ak , bk we get as: Z π hf (x), 1i 1 • a0 = = f (x) dx. k1k2 2π π Z hf (x), cos(kx)i 1 π • ak = f (x) cos(kx) dx. = k cos(kx)k2 π π Z hf (x), sin(kx)i 1 π • ak = = f (x) sin(kx) dx. k sin(kx)k2 π π • It’s obvious that as we take more coefficients we get a better approximation of f since T1 ⊆ T2 ⊆ T3 ⊆ . . .. • If we take an infinite number of coefficients we call this the Fourier series of f . 6.4.2 Finding a QR decomposition using the Gram-Schmidt process We remember that given a matrix M we can decompose it in two matrices M = QR such that Q is a unitary matrix and R is a upper triangular matrix. The QR-decomposition can then in turn be used to solve or find a least square solution to the linear system defined by M . It can also be used to more effectively find the eigenvalues and eigenvectors of M . Both of these uses of the QR decomposition is handled elsewhere in the book however one problem first need to be solved. If we want to use QR decomposition to speed up calculations, we need to be able to find the QR decomposition itself as fast (and stable) as possible. If we look at the columns of A as a (possibly linearly dependent) set of basis vectors. Then QR can be seen as a base change Q with coefficients in R. If we look at the Gram-Schmidt process used on the column vectors v of A we get (before normalization): ei = vi − hvi , e1 i hvi , ei−1 i e1 − . . . − ei−1 2 ke1 k kei−1 k2 ⇔ vi = ei + hvi , e1 i hvi , ei−1 i e1 + . . . + ei−1 2 ke1 k kei−1 k2 So we write column i using i basis vectors. If we collect the coefficient in R and the basis vectors in Q we get a QR decomposition. You can easily check this by considering 75 one column in A as we did above. In practice we usually first calculate Q and then find R using multiplication R = QH A since Q is unitary. We give an example considering the matrix A Schmidt process: 1 −1 2 A = 2 0 2 2 2 3 below using the unmodified Gram 3 1 2 This gives : 1 −4 A.1 1 1 2 , A0.2 = A.2 − QH −2 Q.1 = = A Q = .2 .1 .1 kA.1 k 3 3 2 4 2 6 1 1 0 0 −2 , A.4 = −3 A.3 = 3 3 1 0 Continuing with the next column we get: : −2 2 0 1 1 A.2 00 0 H 0 −1 , A.3 = A.3 − Q.2 A.3 Q.2 = −2 = Q.2 = kA0.2 k 3 3 2 1 4 1 A00.4 = −4 3 2 2 A00.3 1 −2 Q.3 = = kA00.3 k 3 1 We now have a full basis and can stop since Q.4 must be zero. For the QR-decomposition A = QR we then get the matrices: 1 −2 2 1 Q = (Q.1 , Q.2 , Q.3 ) = 2 −1 −2 3 2 2 1 And 9 3 12 9 1 R = QH A = 0 6 0 −3 3 0 0 3 6 Here are a couple of things to consider and think about: • Can you recognize the Gram-Schmidt process when written in this form? Where are we doing the normalization? 76 • Can you find the QR-decomposition using the modified algorithm? • Using that Q is hermitian we can write a linear system Ax = b as QRx = b ⇒ Rx = QH b which is very easy to solve since R is triangular. How does this compare to other standard methods such as Gauss elimination? 6.4.3 Principal component analysis (PCA) and dimension reduction Let us consider n sets of mesurements at time t = 1, 2, 3, . . . , m. For example the value of n stocks, consumption of n commoditys, rainfall at n locations, etc. While making our analysis of the data might be easy if n is small, as n grows the time to make the analysis typically grows faster (for example in n2 ). As n grows large enough we reach a point where we cannot do our analysis in a reasonable time anymore. However let us assume the measurements are dependent, this seems like a rather reasonable assumption. Our goal is to find a new basis in which we can represent the n sets using fewer basis vectors. If one measurement is a linear combination of the others we can use for example Gram-Schmidts to find a new basis. However in practice this is rarely the case, especially considering as in this case typical stochastic processes. Since Gram-Schmidt is guaranteed to find a basis with the minimum number of basis vectors since it gives an orthogonal basis, findign a basis with less basis vectors that captures all the data is impossible. However what if we can find a new basis which captures most of the data? For example we might have points in 3 dimensions which all lie on or close to a plane. In this case projecting the points to this plane should give a good approximation of the data, as well as reducing the dimension of the problem. One method used to reduce the dimension of a problem is principal component analysis (PCA): Our first step is to subtract the mean of every individual data set. We use this to construct a data matrix by inserting the different sets of zero mean measurements as row vectors Xn×m with mean zero of every row. Next we compute the singular value decomposition (SVD): X = WΣVT We remember that W is a unitary matrix of the eigenvectors of XXT , Σ is a diagonal matrix with the singular values on the diagonal and V is a unitary matrix which contains the eigenvectors of XT X. The singular values Σi,i are the square roots of the eigenvalues of XXT . The eigenvalues are proportional to the variance of the data captured by corresponding eigenvector. Essentially we can see the first eigenvector corresponding to the largest eigenvalue as the direction in which our original data have the largest variance. The second eigenvector corresponds to the direction orthogonal to the first which which captures the most of the remaining variance, the third is orthogonal to the first two and 77 so on. If we would project our data onto the space projected by the first L eigenvectors we would thus get the L dimensional space that captures the most variance in the data. If the last principal components are small they can be disregarded without a large loss of information. If we let Yn×m be the projection on the space described by the eigenvectors of W we get: Y = WT X = WT WΣVT = ΣVT If we only want the space described by the first L eigenvectors we get: T YL = WL X = ΣL VT T Where WL is a L × n matrix taken as the first L rows of W, and ΣL is a L × m matrix composed of the first L rows of Σ. We can now work with our L × m matrix YL rather than X with little loss of information if the remaining singular values are small. We note that it is also possible to work with the covariance matrix C = XXT since W is the eigenvectors of XXT And as such, if we have the covariance matrix XXT we can calculate it’s eigendecomposition rather than the singular value decomposition of X. However it is recommended to use SVD decomposition since constructing the covariance matrix XXT can give a loss of precision using the covariance method. While PCA gives the ”best” orthogonal basis, there are other methods as well as other variations of PCA to reduce the dimension of a dataset. Some things to think about concerning the use of PCA is: • PCA assumes tha data are scaled equally. If the data is of temperatures for example, what happens if one dataset is in Fahrenheit while all others are in Celsius? Similar problems arise whenever one dataset have much larger/smaller variance than the others. • How does outliers (datapoints far from the others often the result of error in measurements) effect the results of PCA? Can you think of any way to handle them better? • What happens if the datasets are independent or close to independent? Why can’t we use PCA now? Is there something else you could possible do in this case? • Can you think of any examples in your main field where PCA or similar methods could prove useful? 78 7 Linear transformations A linear transformations also called a linear map is a function mapping elements from one set to another (possibly same) set fulfilling certain conditions. Namely if the two sets are linear spaces then the map preserves vector addition and scalar multiplication. In this chapter we will take a look at linear transformations, what properties makes them important and some examples of where we can use them in applications. This as with the chapter before this will have a heavier focus on theory than other chapters. The definition of a linear transformation is as follows: Definition 7.1 Given two vector spaces V, W over the same field F , then a function T : V → W is a linear transformation if: • T (v1 + v2 ) = T (v1 ) + T (v2 ), v1 , v2 ∈ V . • T (rv) = rT (v), v ∈ V, r ∈ F . We see that the first condition is equivalent to saying that T needs to preserve vector addition and the second that it preserves scalar multiplication. We note that the two vector spaces V, W need not be the same vector space, or even have the same number of elements. This also means that the addition operation need not be the same since v1 + v2 is done in V and T (v1 ) + T (v2 ) is done in W . From the definition follows immediately a couple of properties, it’s recommended to stop and think about why if it isn’t immediately clear. • T (0) = 0 • T (−v) = −T (v), for all v ∈ V . • T (r1 v1 + r2 v2 + . . . rn vn ) = T (r1 v1 ) + T (r2 v2 ) + . . . + T (rn vn ) for all v ∈ V and all r ∈ F . Many important transformations are in fact linear, some important examples are: • The identity map x → x and the zero map x → 0. • Rotation, translation and projection in geometry. • Differentiation. • Change of one basis to another. • The expected value of a random variable. The importance of many linear transformations gives a motivation why we should study them further and try to find properties of them that we can use. In fact, just by knowing a transformation is linear we already know two useful properties from the 79 definition. For example it doesn’t matter if we first add two vectors and then change their basis or if we first change both their basis and then add them. Often it’s possible to check if a transformation is linear using only the definition as in the example below. We define a function T : R2 → R3 such that: x+y x T = 2x − y y y We then want to show that it’s a linear transformation. (x1 + y1 ) + (x2 + y2 ) x1 x T + 2 = (2x1 − y1 ) + (2x2 − y2 ) y1 y2 y1 + y2 (x1 + y1 ) (x2 + y2 ) x1 x = (2x1 − y1 ) + (2x2 − y2 ) = T +T 2 y1 y2 y1 y2 And the first requirement is true. For the second requirement T (rv) = rT (v) we get: (rx + ry) rx x = (r2x − ry) =T T r ry y ry (x + y) x = r (2x − y) = rT y y And the second requirement is true as well. Next we look at how to linear transformations act on basis vectors and how this can be used in order to create a linear transformation. Theorem 7.1. We let V, W be vector spaces over the field F . Given a basis e1 , e2 , . . . , en of V and any set of vectors {w1 , w2 , . . . , wn } ∈ W . There exists a unique linear transformation T : V → W such that T (ei ) = wi . • Additionally we have: T (v1 e1 + v2 e2 + . . . + vn en ) = v1 w1 + v2 w2 + . . . vn wn Proof. We start by looking at the uniqueness. If two transformations S, T exist such that T (ei ) = S(ei ) = wi . Then they must be the same since e1 , e2 , . . . , en spans V . You can easily see that they must be the same 80 transformation by writing any S(v) = T (v) and expanding v in the basis vectors. And we can conclude that the transformation is unique if it exists. Next we want to show that such a linear transformation always exists. Given v = v1 e1 + v2 e2 + . . . + vn en ∈ V , where v1 , v2 , . . . , vn are uniquely determined since e1 , e2 , . . . , en is a basis of V . We define a transformation T : V → W as: T (v) = T (v1 e1 + v2 e2 + . . . + vn en ) = v1 w1 + v2 w2 + . . . + vn wn For all v ∈ V . We clearly have that T (ei ) = wi and showing that T is linear can be done simply by trying: T (av1 + bv2 ) = aT (v1 ) + bT (v2 ) Using this we see that it’s quite easy to define a linear transformation, since we only need to specify where to take the basis vectors. Then everything else is handled by the linearity. This result also gives an easy way to find if two linear transformations are equal. Simply check if they give the same result when applied to the basis vectors. If they do, they need to be equal because of the uniqueness. We continue by looking at matrices multiplied with vectors acting as transformations. Consider the m × n matrix A and define a function TA : Rn → Rm such that TA (v) = Av for all (column vectors) v ∈ Rn . Show that TA is a linear transformation. We use the definition and immediately get: TA (v1 + v2 ) = A(v1 + v2 ) = Av1 + Av2 = TA (v1 ) + TA (v2 ) TA (rv) = A(rv) = r(Av) = rTA (v) And TA must be a linear transformation. We see that every m × n matrix A defines a linear transformation TA : Rn → Rm . This gives us a new perspective in which to view matrices, namely we can view them as linear transformations on vector spaces. But even more importantly we will also see that every linear transformation T : Rn → Rm can be written in this way using a m × n matrix. Theorem 7.2. If T : Rn → Rm is a linear transformation, then there exists a m×n matrix A such that T (v) = Av for all column vectors v ∈ Rn . Additionally if e1 , e2 , . . . , en is the standard basis in Rn , then A can be written as: A = [T (e1 ) T (e2 ) . . . T (en )] Where T (ei ) are the columns of A. The matrix A is called the standard matrix of the linear transformation T . 81 7.1 Linear transformations in geometry The matrix representation of linear transformations is especially useful in geometry and 3d graphics. Especially since most commonly used transformations such as rotation and scaling are in fact linear. If we can find the standard matrix for a certain transformation we can then easily apply that transformation on a large number of vectors with a simple matrix-vector multiplication. This is especially important in for example computer graphics where we often work with a huge number of points even for very primitive objects, which need to be moved whenever the object or camera (view point) moves. Here we will give some examples of linear transformations in geometry and show a couple of methods which we can use to find the standard matrix. Using the previous result we find the standard matrix for a couple of common transformaitons in Rn → Rm . We start in R2 → R2 by finding the standard matrix for rotation clockwise about the origin. By making the unit circle we can easily see that if we rotate the first basis vector [1 0]> by θ degrees we end up at [cos θ sin θ]> And if we rotate the second one [0 1]> we end up at [− sin θ cos θ]> Our standard matrix is thus cos θ − sin θ A= sin θ cos θ And we can use this to rotate any point θ degrees about the origin. For example remembering that the angle between two orthogonal vectors is θ = 90◦ , given a vector v = [x y]> we can find an orthogonal vector in 2d by multiplying with: A= cos 90 − sin 90 0 −1 = sin 90 cos 90 1 0 Giving Av = [−y x]> , which can easily seen to give zero when taking the dot product together with v. But what if we now instead consider this rotation in the xy−plane in R3 ? When looking at rotation around the z axis we get the matrix: cos θ − sin θ 0 A = sin θ cos θ 0 0 0 1 Since for the first two basis vectors we rotate in the plane we had in two dimensions, and the last basis-vector in the direction of the z−axis should remain unchanged. We can find the rotation matrices around the other axes in the same way. 82 Next we take a look at projection in R2 → R2 where we want to find the standard matrix for projection on the line y = ax. We let d = [1 a]> be the direction vector for the line. Then using the projection formula with the basis vectors we get: projd (1, 0) = 1 h(1, 0), (1, a)i d= (1, a) 2 k(1, a)k 1 + a2 projd (0, 1) = h(0, 1), (1, a)i a d= (1, a) 2 k(1, a)k 1 + a2 Which gives the standard matrix 1 1 a A= 1 + a2 a a2 But what if we instead want to find the standard matrix for projection on the plane described by x − z = 0 in R3 ? We introduce a slightly different method more suitable for higher dimensions. We have an orthogonal basis U = {(1, 0, 1), (0, 1, 0)} for points in the plane. Then instead of finding the projection of each basis vector on the plane individually, we use the projection formula to find the projection of a general point onto this plane. Doing this yields: projU (x, y, z) = h(x, y, z), (1, 0, 1)i (1, 0, 1) k(1, 0, 1)k2 h(x, y, z), (0, 1, 0)i + (0, 1, 0) = k(0, 1, 0)k2 We then split ut up in the part belonging to basis vectors: 1/2 A= 0 1/2 x+z x+z , y, 2 2 x, y, z to get the columns for the individual 0 1/2 1 0 0 1/2 While we probably could have found the standard matrix for this projection regardless, we can use the same methodology for more complex subspaces in higher dimensions. • First find an orthogonal basis for the subspace. • Then find the projection of the basis vectors in the original space on the subspace. We then get a matrix for this transformation such that we can calculate this projection of any points onto the subspace with a simple matrix-vector multiplication. Before continuing with some more theory we return to R2 and look at the reflection in the line y = ax • The reflection of v is equal to 2projd (v) − v (make a picture). 83 This gives: 2 1 a 1 0 − A= 0 1 1 + a2 a a2 1 1 − a2 2a = 2a a2 − 1 1 + a2 We note that the methodology presented here is possibly since the transformations in question are all linear. While most needed transformations in this case are linear, we will see later that there are at least one critical non-linear transformation we need to be able to do in many applications. Can you think about which one? 7.2 surjection,injection and combined transformations While nice that we can for example rotate a point, what if we want to combine transformations. For example maybe we want to rotate around a point which itself rotate around another point. We will also give a way in which to make certain non-linear transformations into linear ones, a method essential to many applications in computer graphics. We start by defining what we mean by a surjection or injection. Definition 7.2 A linear transformation (or more generally a function) F : V → U is a: • Surjection, or onto if every element w ∈ W can be written w = T (v at least one vector v ∈ V . • Injection, or one-to-one function if every element in U is mapped to by at most one element in V . • If it’s both surjective and injective it’s called a bijection. One important property is that for F to be invertible it needs to be an injection since otherwise F −1 (u), u ∈ U would map to at least two elements in V for some u. Theorem 7.3. Let F : V → W and G : W → U be two linear transformations. Then the combined transformation G ◦ F : V → U is linear. If F is invertible, then the inverse F −1 : W → V is linear. Proof. For the first part we get: G ◦ F (au + bv) = G(F (au + bv)) = G(F (au)) + G(F (bv)) = aG(F (u)) + bG(F (v)) For the second part we get: F (F −1 (au + bv)) = au + bv = aF (F −1 (u)) + bF (F −1 (v)) = F (aF −1 (u) + bF −1 (v)) 84 Since F is linear and injective. From this follows that: F −1 (au + bv) = aF −1 (u) + bF −1 (v) But if the composition of two linear transformations is itself a linear transformation, how do we represent the combined linear transformation G ◦ F ? If we look at the matrix representations of the transformations we get: AF = [F (e1 ) F (e2 ) . . . F (en )] AG = [G(e1 ) G(e2 ) . . . G(en )] Then: AG◦F = [G(F (e1 )) G(F (e2 )) . . . G(F (en ))] = AG AF In other words if we can combine two linear transformations simply by multiplying their respective standard matrices. We note that since matrix multiplication is not commutative, the order in which we apply the linear transformations does impact the resulting transformation. The transformation we want to apply ”first” should be to the right in the matrix multiplication. 7.3 Application: Transformations in computer graphics and homogeneous coordinates We want to make a simple computer graphics of our solar system containing the sun, planets as well as their moons. For simplicity we assume all bodies are spherical and all bodies move in circular paths around something else. To work with we have the points of a unit sphere located in origo. This is a good example of how we can combine relatively simple transformations in order to create very complex paths such as the movement of someone on the Moons surface relative the sun. We start by assuming the sun is stationary positioned in the middle of the solar system (origo) and everything else revolves around it. Then we only need to scale our unit sphere to appropriate size as well as rotate all the points on the sphere around some axis. Since both rotation and scaling are linear transformations we can apply this transformation without further problems. We let t be a point in time, Rsun be the rotation of the sun around it’s own axis and Ssun be the standard matrix for the scaling needed to get the appropriate size, we then get standard matrix for the points on the unit sphere representing the sun as. Asun = Rsun Ssun Lets say we choose to let the sun rotate around the z−axis, we then get the rotation matrix as before and we get transformation matrix in time t as: cos θs t − sin θs t 0 ssun 0 0 ssun 0 Asun = sin θs t cos θs t 0 0 0 0 1 0 0 ssun 85 ssun cos θs t −ssun sin θs t 0 0 = ssun sin θs t ssun cos θs t 0 0 ssun We are now finished with the sun, in every frame we multiply all the points on the unit sphere with corresponding matrix to get their position in our animation. However as soon as we want to animate our first planet we immediately move into problems, moving a point a fixed amount in any direction (translation) is not a linear translation. You can easily realize this if you let T be translation in R: T (x) = x + a → T (0 + 0) = a 6= T (0) + T (0) = 2a. The same is obviously true for higher dimensions. While we could start by adding a vector to every point and then rotate around origo, we need to take great care in order to get the rotation around it’s own axis right(rotate points, then add vector, then finally rotate around origo) as well as it now requiring 2 matrix multiplications and one vector addition. Additionally as soon as we want to add a moon or want to make more complex objects, the problem becomes more and more complicated. Luckily there is a way to handle these transformations, a trick you could say in which we represent our n dimensional vector in n + 1 dimensions. We represent a vector v = (v1 , v2 , . . . , vn )in n dimensions as a vector v = (v1 , v2 , . . . , vn , 1) in n + 1 dimensions. For two dimensions this can be seen as representing the points in the xy−plane by the points in the plane 1 unit above the xy−plane in three dimensions. Now we can represent translation x0 = x + tx , y 0 = y + ty , z 0 = z + tz (in three dimensions) using matrix multiplication: 0 1 0 0 tx x x y 0 0 1 0 ty y 0 = z 0 0 1 tz z 1 1 0 0 0 1 This is commonly referred to as homogeneous coordinates. We note that the last element of every vector is always 1 and the last row of the standard matrix is always the same as well. So by using one extra dimension even if the value of every vector in this extra dimension is always one, we can make translation into a linear transformation. • We note that origo is no longer represented by the zero vector (0, 0, 0) in homogeneous coordinates since it’s there represented by (0, 0, 0, 1). • It’s also easy to write any other previous already linear transformation in homogeneous coordinates, we simply add another row and column with all zeros except for the bottom right corner which we set to one. We are now ready to return to our animation of the solar system using homogeneous coordinates instead. • We start by rewriting the transformation matrix for the points on the suns surface simply by adding the additional row and column: 86 Asun ssun cos θt −ssun sin θt 0 ssun sin θt ssun cos θt 0 = 0 0 ssun 0 0 0 0 0 0 1 Next we want to find the transformation matrix for the points on the earth’s surface from another unit sphere in the origin. A good guideline is to start by applying all transformations of the object in reference to itself, and then apply the transformation with respect to the ”closest” reference point, in this case the position of the sun. • We start by applying the scaling SE and rotation around it’s own axis RE just like we did with the sun. • Then we need to transpose the planet away from the sun in the origin. We apply the translation matrix TES so that it’s appropriately far away. • We then apply another rotation on this to rotate all of these points around the sun (origin) RES representing the rotation of earth around the sun. • Applying all of the transformations in order gives transformation matrix AE = RES TES RE SE . Especially we note that the first two RES TES denoting the centre position of the earth relative to the sun and the last two REE SE is the position of the points on the earths surface relative a unit sphere positioned at earth’s core. We can use the first (already calculated) half of the transformation when finding the transformation matrix for the moon. In the same way we get the transformation SM , RM to apply the scaling and rotation around the moons own axis. We consider the position of the earth as the origin and translate and rotate around that giving: TM E , RM E . We notice that we now have the position of the moon relative the centre of earth, we apply the transformations ARES TES previously applied on the earth’s centre to get the moons coordinates relative the origin. This gives transformation matrix: AM = RES TES RM E TM E RM SM We notice that we can interpret this as: (from right to left). • RM SM -Position of points on the moons surface relative a unit sphere at the centre of the moon, times: • RM E TM E -Position of the centre of the moon relative the centre of earth, times: • RES TES -Position of the centre of earth relative the centre of the sun (origin). This methodology is used extensively in computer graphics and robotics in order to describe very complex movements using relatively simple transformations and hierarchies. 87 7.4 Kernel and image To every linear transformation T : V → W we associate two linear spaces. One of them a subspace of V and the other a subspace of W . Definition 7.3 Given the linear transformation T . • R(T ) = {w|w = T (v), v ∈ V } ⊂ W , is called the image of T . • In other words the subspace of W which is mapped to from V • N (T ) = {v|T (v) = 0} ⊂ V . is called the Kernel of T or the Nullspace of T . • In other words the subspace of V for which the linear transformation maps to the zero-vector in W If we look at the transformation in terms of the standard matrix we see that the image R(T ) is the solutions to the matrix equation Ax = b. While the Kernel N (T )l is the solutions to Ax = 0. Theorem 7.4. Given the linear transformation T : V → W , the kernel N (T ) is a subspace of V and the image R(T ) is a subspace of W . Proof. Since T (0) = 0 they both contain the zero vector. For the kernel we get: T (av1 + bv2 ) = aT (v1 ) + bT (v2 ) = a0 + b0 = 0, v1 , v2 ∈ N (T ) And av1 + bv2 lies in N (T ) since it satisfies the condition. So N (T ) is a subspace of V . For the image we use: w1 = T (v1 ), w2 = T (v2 ) v1 , v2 ∈ V This gives: aw1 + bw2 = aT (v1 ) + bT (v2 ) = T (av1 ) + T (bv2 ) = T (av1 + bv2 ) And aw1 + bw2 lies in R(T ). So R(T ) is a subspace of W . We can use this to easily check if a linear transformation is injective,surjective and bijective (and invertible). Theorem 7.5. A linear transformation T : V → W is: • surjective ⇔ R(T ) = W • injective ⇔ N (T ) = {0} • bijective (and invertible) ⇔ N (T ) = {0} and R(T ) = W . 88 If the two vector spaces have the same dimension (square standard matrix) it’s actually enough to show only that it’s injective or surjective and it will always be bijective as well. Theorem 7.6. Assume a linear transformation T : V → W , where dim V = dim W . • If N (T ) = {0} or R(T ) = W then T is bijective. In fact we can show even more, we can show something about the combined dimension of the kernel and image: Theorem 7.7. Assume we have a linear transformation T : V → W , where V is finite dimensional, then: • dim V = dim N (T ) + dim R(T ) We move on to a short example showing how we can use what we have learned here. Hermite interpolation is a method in which to find a curve through a given number of points. This is often needed in computer graphics as well as in construction for calculating strength of materials. We will look at one subproblem where we want to fit a curve through two points in the plane where the slope is specified in both points. The easiest possible interpolation function we can think of in this case is a third order polynomial: y = a + bx + cx2 + dx3 Which we want to fit to the points such that the curve goes through the points with the given slope. In order to do this we get a system of four unknowns and four equations: y(x1 ) = a + bx1 + cx21 + dx31 = y1 y 0 (x1 ) = b + 2cx1 + 3dx21 = s1 y(x2 ) = a + bx2 + cx22 + dx32 = y2 y 0 (x2 ) = b + 2cx2 + 3dx22 = s2 Writing it in matrix form we get a linear system in the four unknowns (a, b, c, d) with matrix A: 1 x1 x21 x31 0 1 2x1 3x21 A= 1 x2 x22 x32 0 1 2x2 3x22 While we know how to solve this problem, we ask ourselves, is it always possible to solve this system? Obviously we could simply solve it and see, but doing so (especially by hand) is rather tedious work even for this relatively small matrix. • You could also calculate the determinant, less work than solving the system but still alot of work. 89 • Instead we look at the transformation a + bx1 + cx21 + dx31 y(x1 ) y 0 (x1 ) b + 2cx1 + 3dx21 T : y → T (y) = y(x2 ) = a + bx2 + cx22 + dx32 y 0 (x2 ) b + 2cx2 + 3dx22 As the transformation from V = {y|y(x) ∈ P3 } to W = R4 . Using the basis {1, x, x2 , x3 } in V this is obviously a linear transformation with matrix A above. We take a look at the Kernel of A. For a polynomial in N (T ) we have y(x1 ) = y(x2 ) = 0, y 0 (x1 ) = y 0 (x2 ) = 0 So it has a root in x1 , x2 both with multiplicity two. However y is a third order polynomial and can only have 3 roots (counting multiplicity) except for the zero polynomial. So N (T ) = {0} and T is injective. Since it’s injective and dim P3 = dim R4 T is bijective and invertible so the system have a unique solution. We remember the definition of rank for a square matrix: Definition 7.4 For A ∈ Mn×n (F) the rank(A) is equivalent to: • The number of linearly independent rows or columns. • The dimension of the column space of A. But we can give even more definitions of rank, we now move on to define the rank of a matrix Mm×n (F) using the kernel and image of A. Definition 7.5 Given a linear transformation T : V → W with standard matrix A. • Rank or the column rank r(A) is the dimension of the image of A: dim R(A), with elements Ax, x ∈ F n . • Row rank is the dimension of the image of A> : dim R(A> ), with elements A> y, y ∈ F m. • Nullity or the right nullity v(A) is the dimension of the kernel of A: dim N (A), with elements Ax = 0, x ∈ F n . • Left nullity is the dimension of the kernel of A> : dim N (A> ), with elements Ay = 0, y ∈ F m . From this immediately follows: • Nullity zero: v(A) = 0 ⇔ A have a left inverse. 90 • rank m: r(A) = m ⇔ A have a right inverse. We can of course do the same for A> , however we will see that they are related in such a way that knowing only one of these four values (R(A), R(A> ), N (A), N (A> )), we can immediately find the others. Theorem 7.8. For every matrix A we have: dim R(A) = dim R(A> ) In other words, for non square as with the square matrices the column rank and row rank are the same. As with a square matrix you can show it by writing the matrix in echelon form. We end this part with a short example: Let A be a n × m matrix with rank r. We want to show that the space of all solutions to Ax = 0 of n homogeneous equations in m variables has dimension m − r. This space is the kernel of N (T ) with T : Rn → Rm such that T (v) = Av. We use dim V = dim N (T ) + dim R(T ) We already have the dimension of the image R(T ), which is the rank r. The dimension of V is m And we immediately get the dimension of the kernel as dim N (T ) = m − r 7.5 Isomorphisms Definition 7.6 A linear transformation T : V → W is called an Isomorphism if it’s both surjective and injective (bijective). • Two vector spaces V and W are called isomorphic if there exists an isomorphism T : V → W. • We denote this by writing V ∼ = W. We see that for two vector spaces that are isomorphic, we have a pairing v → T (v), v ∈ V, T (v) ∈ W . For this pairing we preserve the vector addition and scalar multiplication in respective space. Because addition and multiplication in one space is completely determined by the same operation in the other all vector space properties of one can be determined by those of the other. In essence you could say the two vector spaces are identical apart from notation. Theorem 7.9. For two finite dimensional spaces V, W and a linear transformation T : V → W the following are equivalent: • T is an isomorphism. • If {e1 , e2 , . . . , en } is a basis of V then {T (e1 ), T (e2 ), . . . , T (en )} is a basis of W . 91 • There exists a basis {e1 , e2 , . . . , en } of V such that {T (e1 ), T (e2 ), . . . , T (en )} is a basis of W . • There exist a inverse linear transformation T −1 : W → V such that T −1 T = 1V and T T −1 = 1W Here we see one of the important relations between isomorphic vector spaces, if we have a basis in one we can given a linear isomorphic transformation also find a basis in the other. If both vector spaces have a finite dimension we can say even more. Theorem 7.10. For the finite dimensional spaces V, W the following are equivalent. • V ∼ = W , (V and W isomorphic). • dim V = dim W In other words, the dimension of a vector space decides which spaces it’s isomorphic to. For example we can immediately conclude that the linear space R3 is isomorphic to the polynomials of order two: P2 (Polynomials that can be written on the form a + bx + cx2 ). Some examples of important isomorphisms are: • The Laplace transform, mapping (hard) differential equations into algebraic equations. • Graph isomorphisms T : V → W such that if there is an edge between edges u, v in V there is an edge between T (u), T (v) in W . • Others include isomorphisms on groups and rings (such as fields). This also give a little motivation of why the study of isomorphisms are useful, if we have a problematic vector space where something we want to do is very complicated, if we can find another isomorphic vector space where the corresponding operation is easier. We can formulate our problem in this new vector space instead, do our now easier calculations and then change back, effectively sidetracking the whole issue. 92 8 Least Square Method (LSM) Many of the properties and ideas we have examined apply to square matrices, a good example of this is theorem 1.1 which list nine conditions that are all equivalent to a square matrix being invertible. If a square matrix A is invertible then the linear equation system Ax = y is solvable. In this section we will discuss a scenario where we have a system of linear equations where the A-matrix has more rows than columns. This kind of system is called an overdetermined system (we have more equations than unknowns) and if A is a m × n matrix with m > n it is said to have m − n redundant equations. An overdetermined system can be solvable if the elements of vector y represents points on a straight line. Having 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0 2 4 6 8 (a) A set of points on a single line. 10 0 0 2 4 6 8 10 (b) A set of points near but not on a single line. a number of points near but not on a straight line is a very common situation. One of the simplest ways to model a relation between two quantities is to use a straight line, l(x) = a + bx, and often experiments are designed such that this kind of relation is expected or the quantities measured are chosen such that the relation should appear. In reality any system, be it the temperature of a piece of machinery, the amplitude of a signal or the price of a bond, will be subject to some disturbance, often referred to as noise, that will move any move the measured values of the ideal line. In this situation the two most common problems are: • Hypothesis testing: does this data fit our theory? • Parameter estimation: how well can we fit this model to our data? In both of these cases it is important to somehow define the ’best-fitting’ line. Usually this is done in what is called the least-square sense. Our problem can be described in 93 the following way: Ac = y y1 1 x1 y2 1 x2 a , y= . A = . . , c = . . b .. . . yn 1 xn Where x = [x1 x2 . . . xn ]> and y is the data and c is the vector of coefficients that we want to find. To define what we mean by least-square sense we first note that for each x there is a corresponding spot on the line, a + bx, and a corresponding measurement y. We can create a vector that describes these distances for each measurement called the residual vector or the error vector . e = Ax − y, is the residual vector We can then measure the normal (Euclidean) length of this vector. v u n √ uX |e| = t |ei |2 = e> e i=1 Minimizing this length gives us the best-fitting line in the least-square sense. This can also easily be expanded into other kinds of functions, not just straight lines. Suppose we want to fit a parabola (second degree polynomial instead, p(x) = a+bx+cx2 ) then the problem can be written: Ac = y 1 x1 x21 y1 a 1 x2 x2 y2 2 A = . . .. , c = b , y = .. .. .. . . c 2 yn 1 xn xn Here the residual vector can easily be defined in the same way as for the straight line. This can be extended in the same way for a polynomial of any degree smaller than n − 1 where n is the number of measurements. If a polynomial of degree n − 1 is chosen we get the following situation: 1 1 A = . .. x1 x2 .. . 1 xm 94 x21 x22 .. . x2m Ac = y . . . xm−1 c1 y1 1 m−1 . . . x2 c2 y2 , c = , y = . . .. .. .. .. . . yn cn . . . xm−1 m Here the matrix A will be square and the system will be solvable as long as all the x-values are different. This means that for every set of n measurements there is a polynomial of degree n − 1 that passes through each point in the measurement. This is often referred to as interpolation of the data set (and in this case we it is equivalent to Lagrange’s method of interpolation). The matrix A here is known as a Vandermonde matrix. It and similar matrices appear in many different applications and it is also known for having a simple formula for calculating its determinant. 8.1 Finding the coefficients How can we actually find the coefficients that minimize the length of the residual vector? First we can define the square of the length of the residual vector as a function > s(e) = e e = n X |ei |2 = (Ax − y)> (Ax − y) i=1 This kind of function is a positive second degree polynomial with no mixed terms and ∂s thus has a global minima where = 0 for all 1 ≤ i ≤ n. We can find the global ∂ei minima by looking at the derivative of the function, ei is determined by ci and ∂ei = Ai,j ∂ci thus n n i=1 i=1 X X ∂s ∂ei = 2ei = 2(Ai. c − yi )Ai,j = 0 ⇔ A> Ac = A> y ∂ci ∂cj The linear equation system A> Ac = A> y is called the normal equations. The normal equations are not overdetermined and will be solvable if the x-values are different. Finding the solution the the normal equations gives the coefficients that minimizes the error function s. The reason that A> Ac = A> y are called the normal equations is that B = A> A is a normal matrix. Definition 8.1 A normal matrix is a matrix for which AH A = AAH . An alternate definition would be A = UDUH where U is a unitary matrix and D is a diagonal matrix. Proof that the definitions are equivalent. Here we can use a type of factorization that we have not discussed previously in the course called Schur factorization: A = URUH 95 where U is unitary and R triangular. AH A = UUH UH URUH = URH RUH = URRH UH = AAH if and only if R is diagonal (check this yourself using matrix multiplication). AH A = UDH UH UDUH = UDH DUH = UDDH UH = AH A since diagonal matrices always commute. Normal matrices are always diagonalizable which means they are always invertible if D has full rank. If B = A> A and A has full rank (equivalent to rank(A) = n) then B = UDUH with rank(D) = n. Thus: if rank(A) = n then x = (A> A)−1 A> y minimizes the error function or ’solves’ Ax = y in the least square sense. Sometimes the normal equations can become easier to solve if the A-matrix is factorized. One useful example of a factorization that can simplify the calculation is the QR-factorization that you will learn how to do in section 11.2. Using QR-factorization we get A = QR with Q being an orthogonal matrix, Q> = Q−1 , and R an upper triangular matrix. The normal equations can then be rewritten in this way: (A> A)c = (QR)> QR c = = R> Q> QR c = R> R c = = A> y = (QR)> y = R> Q> y ⇔ ⇔ Rc = Q> y 8.2 Generalized LSM LSM can be used to fit polynomials, as shown previously, but also other functions that can be written on the form X f (x) = gi (x) i For example, if we want to fit a trigonometric function, f (x), instead of a polynomial to some data and f (x) looks like this f (x) = a sin(x) + b sin(2x) + c cos(x) then the fitting problem can be described in this way Ac = y sin(x1 ) sin(2x1 ) cos(x1 ) y1 a sin(x2 ) sin(2x2 ) cos(x2 ) y2 A= . .. .. , c = b , y = .. .. . . . c yn sin(xn ) sin(2xn ) cos(xn ) 96 For this kind of problems we do not have the same simple guarantees that the A matrix is invertible though. There are also many functions that can be rewritten on the form X f (x) = gi (x) i for example, suppose we have a trigonometric function with a polynomial argument f (x) = tan(a + bx + cx2 ) ⇒ arctan(f (x)) = a + bx + cx2 And even though we have a non-linear model we have turned it into the linear problem Ac = y 2 1 x1 x1 arctan y1 a 1 x2 x2 arctan y2 2 A = . . , c = b , y = .. . .. .. .. . c 2 arctan yn 1 xn xn that can be solved with the regular least square method. Note thought that when doing this we no longer fit the original function but instead the rewritten function. In some applications this is problematic and in some it is not. We will not concern ourselves further with this in this course but in a real application this sort of rewrite should not be used indiscriminately. 8.3 Weighted Least Squares Method (WLS) It was previously described that the least square method was useful ion situations where some data had been collected and we wanted to analyze the data or adapt a specific model to the data.In both these cases certain parts of the data can be more relevant than the other. The data can be a compilation of data from many different sources, it can be measured by a tool that is more efficient for certain x or there is a certain area where it is more important that the model fits the data better than everywhere else, for example if we describe a system that is intended to spend most time in a state corresponding to value x the it probably more important that our model fits well near that x than far away from that x. This can be taken into consideration by using the weighted least square method . When we use the weighted least square method we add a a weight in the error function. s(e) = (e)> W(e) = n X Wi,i |ei |2 i=1 Then we get a new set of normal equations A> WAx = A> Wy 97 How to choose W? For physical experiments W is dictated by physical properties of the measured system or physical properties of the measurement tools. In statistics the 1 and correct weights for independent variables are the reciprocal of the variance wi = σi for variables with the correlation matrix Σ the weights are decided by W> W = Σ−1 to get the best linear unbiased estimate. If the different variables in x are independent then W is diagonal but if there is some correlation between them then there will be off-diagonal elements as well. 8.4 Pseudoinverses For square matrices we could describe the solution of a linear equation system using the inverse matrix Ax = y ⇔ x = A−1 y Not all matrices have an inverse, diagonalizable matrices where one eigenvalue is zero for example, but the idea behind the inverse can be generalized to and made to extend to these matrices as well. For the regular inverse we have AA−1 = A−1 A = I In this section we are going to discuss pseudoinverses, which are denoted by the following rules A∗ AA∗ = A∗ , AA∗ A = A ∗ and obey The inverse of a matrix is unique but a matrix can have many different pseudo inverses, for example, if: 0 1 A= 0 0 then all of the following are pseudo-inverses 1 1 0 0 1 0 + + + , A3 = , A2 = A1 = 1 1 1 1 1 0 In engineering there is a very important pseudoinverse called the Moore-Penrose pseudoinverse. Definition 8.2 The Moore-Penrose pseudoinverse of a matrix A is defined as follows A∗ = (A> A)A> The Moore-Penrose pseudoinverse is interesting because it is closely connected to the least square method and writing Ax = y ⇔ x = A∗ y 98 is the same as solving Ax = y using the least square method. Another good property of the Moore-Penrose pseudoinverse is that if the matrix A is invertible then A∗ = A−1 . In practice calculating the Moore-Penrose pseudoinverse is often done using the singular value factorization (SVD) or QR-factorization. Showing how to calculate the Moore-Penrose pseudoinverse using QR-factorization is very similar to showing how it is done for the least square method. −1 A∗ = (A> A)−1 A> = (QR)> QR (QR)> = −1 −1 = R> Q> QR R> Q> = R> R R> Q> = = R−1 (R> )−1 R> Q> = R−1 Q> 99 9 Matrix functions Functions are a very useful mathematical concept in many different areas. Power function: f (x) = xn , x ∈ C, n ∈ Z Root function: f (x) = Polynomials: p(x) = √ n x, x ∈ R, n ∈ Z+ n X ak xk , a, x ∈ C, n ∈ Z+ k=0 Exponential function: f (x) = ex , x ∈ C Natural logarithm: f (x) = ln(x), x ∈ C Table 1: Some common functions on the real and complex numbers. Each of the functions in table ?? are based on some fundamental idea that makes them useful for solving problems involving numbers of different kinds. If new functions that have corresponding or similar properties to these functions but acts on matrices instead of numbers could be defined then there might be ways to use them to solve problems that involve matrices of some kind. This can actually be done to some extent and this section will explore this further. 9.1 Matrix power function The normal power function with positive integer exponents looks like this f (x) = xn = |x · x ·{z . . . · x}, x ∈ C, n ∈ Z+ n Inspired by this it is quite intuitive to define a corresponding matrix version of the power function Definition 9.1 For any square matrix A ∈ Mk×k , k ∈ Z+ the matrix power function is defined as An = AAA . . . A}, n ∈ Z+ | {z n + for any n ∈ Z . The regular power function is not only defined for positive integers, but also for 0 and negative integers. How can definition 9.1 be extended to let n be any integer, not just a positive one? First consider how that x0 = 1 for any complex number x. For matrices this corresponds to 100 Definition 9.2 For any square matrix A ∈ Mk×k , k ∈ Z+ A0 = I where I is the identity matrix. For negative exponents it is well known that 1 , n ∈ Z+ xn Here it is not as intuitive and simple to find a corresponding definition for matrices. But consider that x−n = x· 1 = xx−1 = 1 x 1 can be viewed as the inverse of x. This means that there is a x natural way to define A−n , n ∈ Z+ for any invertible matrix A. This means that Definition 9.3 For any square and invertible matrix A ∈ Mk×k , k ∈ Z+ + −1 −1 −1 −1 A−n = (A−1 )n = A | A A{z . . . A }, n ∈ Z n where A−1 is the inverse of A. Now a matrix version of the power function An with integer exponent n ≥ 0 has been defined for any square matrix A. If A is also invertible An is defined for all integer n. When discussing (real or complex) numbers, x, the function xn is not limited to integer n. In fact, n can be any (real or complex number). In the next section the definition of An is extended to fractional n and in section 9.5 we will show how to define matrix versions of more general functions, including An for any (real or complex) number n. 9.1.1 Calculating An Calculating the matrix An can be done in a very straightforward manner by multiplying A with itself n times (or the inverse of A with itself −n times if n is negative). But the calculations can sometimes be simplified further. Previously, see page 13 and definition 2.1, the concept of a diagonalizable matrix has been examined. If A is a diagonalizable matrix, A = S−1 DS, then the power function (for positive integer exponents) becomes −1 −1 n An = AA . . . A = S−1 D |SS{z−1} D |SS{z−1} . . . SS | {z } DS = S D S =I =I =I n where D of course becomes 101 n d1,1 0 . . . 0 dn . . . 2,2 Dn = . .. .. . . . . 0 0 ... 0 0 .. . dnm,m Thus if a matrix A is simple to diagonalize then An becomes much less tedious to calculate for high n. The same idea can be generalized to similar matrices. Consider two matrices, A and B, that are similar, in the mathematical sense given in definition 1.3. A = P−1 BP In this case the power function of A can be calculated as follows −1 −1 −1 −1 n An = AA . . . A = P−1 B |PP {z } B |PP {z } . . . PP | {z } BP = P B P =I =I =I In some cases, for example when B is a diagonal matrix and n is large, it can be easier to calculate An in this way. In theorem 2.8 we saw that even if a matrix was not diagonalizable it could be rewritten as a matrix which is almost diagonal Jm1 (λ1 ) 0 ... 0 0 Jm2 (λ2 ) . . . 0 A = P−1 JP = P−1 P .. .. . . . ... . . 0 0 . . . Jmk (λk ) Here Jmi , 1 ≤ i ≤ k are all Jordan blocks, that is matrices with the value λi on the diagonal and 1 in the element above the diagonal, see definition 2.7. k is the number of different eigenvalues of A. This means that for any square matrix A the power function can be written like this Jm1 (λ1 )n 0 ... 0 0 Jm2 (λ2 )n . . . 0 n −1 n −1 A =P J P=P P .. .. . . . . . ... 0 0 . . . Jmk (λk )n Since all the Jordan blocks, Jmi (λi ), are either small or contains a lot of zeros each Jmi (λi )n is simple to calculate and thus Jn will often be easier to calculate than An . One kind of matrix for which it can be easier to calculate An are nilpotent matrices. Definition 9.4 A square matrix An is said to be a nilpotent matrix if there is some positive integer, n ∈ Z+ , such that An = 0 From this it follows that Ak , k ≥ n. 102 For nilpotent matrices it obviously easy to calculate An for sufficiently high numbers. One example of a nilpotent matrix is a strictly triangular matrix (a triangular matrix with zeros on the diagonal). Remember any square matrix is similar to a matrix written on the Jordan canonical form, which is an upper triangular with the matrix eigenvalues along the diagonal. This means that if all the eigenvalues for a matrix are equal to zero then the matrix will be similar to a strictly triangular matrix which meant that it will be nilpotent. 9.2 Matrix root function For (real or complex numbers) √ we have the following relation between the power function xn and the root function n x y = xn ⇔ x = √ n 1 y = yn For any positive integer, n ∈ Z+ , it is easy to do the corresponding definition for matrices Definition 9.5 If the following relation holds for two square matrices A, B ∈ Mk×k A = BBB . . . B} | {z n then B is said to be the nth root of A. This is annotated B = √ n 1 A = An √ n Simply defining the function is not very useful though. In order to make A useful a few questions need to be answered √ n • How do you find A? √ n • For how many different B is B = A √ √ 2 In this course we will only consider the square root of a matrix A = A . 9.2.1 Root of a diagonal matrix It is not obvious how to find the square root of a matrix, it is not even clear whether there exist any roots at all for a given matrix. For this reason it is suitable to first consider a specific kind of matrix for which the roots exist and are easy to find. Diagonal matrices (see page 6) are generally simple to work with. It is also easy to find their roots. Consider a diagonal matrix D ∈ Mm×m . d1,1 0 . . . 0 0 d2,2 . . . 0 D= . .. .. .. .. . . . 0 0 . . . dm,m 103 Then we can define a matrix C in this way p C= d1,1 p0 ... 0 d1,1 0 . . . 0 0 d2,2 . . . 0 0 d2,2 . . . 0 2 ⇒ CC = C = . . .. .. .. . . .. . . . . . . . . . . . p . 0 0 . . . dm,m 0 0 ... dm,m (3) From the definition it follows that C is a root of D. It is not the only root though. Each of the diagonal elements di,i has two roots, one positive and one negative and regardless which root that is chosen in C equation (3) will still hold. This means that there are 2n different Cs that are roots of D. These are not the only possible roots though. For an example of how many roots a matrix can have, see the following theorem. 1 0 Theorem 9.1. All of the following matrices are roots to I2 = 0 1 1 ∓s ∓r 1 ±s ∓r ±1 0 1 0 , , , 0 1 0 ±1 t ∓r ±s t ∓r ∓s if r2 + s2 = t2 . Thus the simple matrix I2 has infinitely many roots. Clearly roots are more complicated for matrices than they are for numbers. Fortunately though it is not always necessary to find all the roots of a matrix, often a single root can be enough, and there are several ways to make it easier to construct. For a diagonal matrix D it is very simple to find a root, simply take the matrix that had the positive root of all the elements in the diagonal of D in its diagonal. This is often referred to as the principal root. 9.2.2 Finding the square root of a matrix using the Jordan canonical form The technique for finding a root for a diagonal matrix can be extended to diagonalizable matrices. Consider a diagonalizable matrix A √ √ A = S−1 DS = S−1 D DS = √ √ √ = S−1 DSS−1 DS = S−1 DS ⇒ √ √ ⇒ A = S−1 DS By an analogous argument you can find a similar relation between any square matrix and its Jordan canonical form √ √ A = P−1 JP Finding the square root for J is not as simple as it is for a diagonal matrix, but there are some similarities 104 p Jm1 (λ1 ) p 0 0 Jm2 (λ2 ) √ √ . .. A = P−1 JP = P−1 .. . 0 0 ... ... .. . 0 0 P . . . q ... Jmk (λk ) √ This means that if a square root of the Jordan blocks are known then A can be calculated. Let J be a Jordan block √ λ 1 0 ... 0 1 λ √0 . . . 0 0 λ 1 . . . 0 0 1 λ . . . 0 1 . . . 0 J = 0 0 λ . . . 0 = λ 0 0 = λ(I + K) .. .. .. . . . .. . . . . .. .. . . .. . . . .. . . 0 0 0 ... λ 0 0 0 ... 1 Consider the function f (x) = f (x) = √ 1 + x, this function can be written as a Taylor series √ f 00 (a) 2 f (3) (0) 3 f (4) (0) 4 f 0 (0) x+ x + x + x = 1 + x = f (0) + 1! 2! 3! 4! 1 1 5 4 1 x + ... = 1 + x − x2 + x3 − 2 8 16 128 It turns out that if you switch 1 for I and x for K the equality still holds (see page 115). K is also a strictly upper-triangular matrix and is therefore a nilpotent matrix (see definition 9.4). Thus the following formula will always converge √ 1 1 1 5 4 I + K = 1 + K − K2 + K3 − K + ... 2 8 16 128 which makes it possible to find a square root for any square matrix. 9.3 Matrix polynomials A polynomial of degree n over a set K (for example the real or complex numbers) is a function p(x) that can be written as p(x) = n X ak xk , ak ∈ K k=0 where n is a finite non-negative integer and x is a variable taken from some set such that xn and addition is defined in a sensible way (more specifically x can only take values from a ring). Examples of sets that the variable x can belong to are real numbers, complex numbers and matrices. 105 Definition 9.6 A matrix polynomial (with complex coefficients) is a function p : Mn×n 7→ Mn×n that can be written as follows p(A) = n X ak Ak , ak ∈ C, A ∈ Mn×n k=0 where n is a non-negative integer. Matrix polynomials has a number of convenient properties, one of them can be said to be inherited from the power function Theorem 9.2. For two similar matrices, A and B, it is true for any polynomial p that p(A) = p(P−1 BP) = P−1 p(B)P Proof. This theorem is easily proved by replacing A with P−1 BP in the expression describing the polynomial and using An = P−1 Bn P. 9.3.1 The Cayley-Hamilton theorem A very useful and important theorem for matrix polynomials is the Cayley-Hamilton theorem. Theorem 9.3 (Cayley-Hamilton theorem). Let pA (λ) = det(λI − A) then pA (A) = 0 The Cayley-Hamilton theorem can be useful both for simplifying the calculations of matrix polynomials and for solving regular polynomial equations. This will be demonstrated in the following sections. The rest of this section focuses on proving the CayleyHamilton theorem. Proving the Cayley-Hamilton theorem becomes easier using the following formula. Theorem 9.4 (The adjugate formula). For any square matrix A the following equality holds A adj(A) = adj(A)A = det(A)I (4) where adj(A) is a square matrix such that adj(A)ki = Aki where Aki is the ki-cofactor of A, (see equation (2)) Proof of the adjugate formula. Using equation (2) gives det(A) = n X ai,k Ai,k k=1 If we collect all the cofactors, Aki , in a matrix 106 (adj(A))ki = Aik Note the transpose. It can then easily be verified that n X n X a1,k A1k k=1 n X a2,k A1k A adj(A) = k=1 .. n . X a A n,k k=1 n X k=1 n X 1k k=1 k=1 n X det(A) n X a2,k A1k = k=1 .. n . X a A n,k a1,k A2k a2,k A2k .. . an,k A2k a1,k A2k det(A) .. . n X k=1 a1,k Ank k=1 n X ... a2,k Ank = k=1 .. .. . . n X ... an,k Ank k=1 k=1 1k ... n X an,k A2k ... n X a1,k Ank ... a2,k Ank k=1 . .. .. . ... det(A) k=1 n X (5) k=1 In order to get equation (4) we need to show that all off-diagonal element are equal to zero. To do this, consider a square matrix B. From this matrix we construct a new matrix, C, by replacing the i:th row with the j:th row (i 6= j). This means that in C row i and row j are identical cik = cj,k = bj,k , 1 ≤ k ≤ n The matrices B and C will also share any cofactors created by removing row i Cik = Bik Using (2) on C gives det(C) = n X k=1 ci,k Cik = n X cj,k Cik , i 6= j = k=1 n X bj,k Bik k=1 Since C has two identical rows the rows are not linearly independent which means det(C) = 0 Thus, for any square matrix B it is true that 107 n X bj,k Bik = 0 k=1 This means that all elements in matrix (5) except the diagonal elements are equal to zero. This gives the adjugate formula A adj(A) = det(AI Checking that the same holds for adj(A)A can be done in an analogous way. Now the adjugate formula can be used to prove the Cayley-Hamilton theorem Proof of the Cayley-Hamilton theorem. Set pA (λ) = det(λI − A) and let B = adj(λI − A) then the adjugate formula gives pA (λ)I = (λI − A)B Every element in B is a polynomial of degree n − 1 or lower, since each element in B is the determinant of a (n − 1) × (n − 1) matrix. Thus B can be rewritten as a matrix polynomial n−1 X λk Kk B= k=0 where Bk is a matrix containing the coefficients of the polynomial terms with degree k in B. Combining the to formulas gives (λI − A)B = (λI − A) n−1 X λk Bk = k=0 = λn Bn−1 + λn−1 (Bn−2 − ABn−1 ) + . . . + λ(B0 − AB1 ) + (−AB0 ) = = λn I + λn−1 an−1 I + . . . + a1 λ + a0 = pA (λ)I If this expression holds for all λ then the following three conditions must be fulfilled ak I = Bk−1 − ABk for 1 ≤ k ≤ n − 1 I = Bn−1 (7) a0 I = −AB0 (8) Next consider the polynomial pA (A) pA (A) = An + An−1 an−1 + . . . + Aa1 + Ia0 Combining this with condition (6)-(8) results in 108 (6) pA (A) = An + An−1 (Bn−2 − ABn−1 ) + . . . + A(B0 − AB1 ) + (−AB0 ) = = An (I − Bn−1 ) +An−1 (Bn−2 − Bn−2 ) + . . . + A (B0 − B0 ) = 0 | | {z } | {z } {z } =0 =0 =0 9.3.2 Calculating matrix polynomials In section 9.1.1 a number of different techniques and properties worth considering when calculating the power function was described. Naturally all of this is applicable to matrix polynomials as well. But there is other techniques that can be used to simplify the calculations of a matrix polynomial. On page 106 it was mentioned that the Cayley-Hamilton theorem could be used to calculate matrix polynomials. It is very simple to see hoe the following theorem follows from Cayley-Hamilton. Theorem 9.5. Let pA (λ) = det(λI − A) where A is a square matrix. Then the following formula holds pA (A) = 0 c · pA (A) = 0 where c ∈ C q(A) · pA (A) = 0 where q is a polynomial Many things that can be done with polynomials of numbers can also be with matrix polynomials. Factorization is one such thing. If p(x) = (x − λ1 )(x − λ2 ) . . . (x − λn ) then p(A) = (A − λ1 I)(A − λ2 I) . . . (A − λn I) This means that if a polynomial on normal numbers can be simplified by factorization, the corresponding matrix polynomial can be simplified in a corresponding way. It also means that polynomial division can be used for matrix polynomials as well as for normal polynomials. That is, any matrix polynomial, p(A), with degree n can be rewritten as p(A) = q(A)π(A) + r(A) where q(A), π(A) and r(A) are all polynomials of degree deg(q) = n − k, deg(π) = k and deg(r) ≤ n with 1 ≤ k ≤ n. If π(A) is chosen as one of the polynomials in theorem 9.5 then π(A) = 0 and p(A) = r(A) with potentially lowers the degree of p(A) making it easier to calculate. One especially interesting consequence of this is the following theorem Theorem 9.6. For two polynomials, a and b, with different degree, deg(a) ≥ deg(b), such that a(A) = 0 and b(A) = 0 then the following relation must apply a(A) = c(A)b(A) where c(A) is a polynomial with degree deg(c) = deg(a) − deg(b). Proof. Dividing a and b gives a(A) = c(A)b(A) + r(A). From b(A) = 0 it follows that a(A) = r(A) but a(A) = 0 thus r(A) = 0. 109 9.3.3 Companion matrices If λ is an eigenvalue of a square matrix A then det(λI − A) = 0. Since an n × n matrix has n eigenvalues (some of which can be the same) and since pA (λ) = det(λI − A) is a polynomial of λ with degree n this means that each eigenvalue corresponds to a root of the polynomial pA (λ). This can be expressed as a formula pA (λ) = det(λI − A) = (λ − λ1 )(λ − λ2 ) . . . (λ − λn ) where λi for 1 ≤ i ≤ n are eigenvalues of A. This means that if we have found the eigenvalues of some matrix we have also found the roots of some polynomial and vice versa. That solving polynomials can be used to find eigenvalues of matrices is something that we have seen before, the standard method of solving det(λI − A) = 0 is exactly this. But there are other ways of finding the eigenvalues of a matrix, in section 11 a number of different ways of calculating the roots numerically will be shown for example. There is also a simple way of constructing a matrix that has the roots of a given polynomial as eigenvalues. This matrix is called the companion matrix of a polynomial. The origin of this kind of matrix is linear differential equations. Consider a linear differential equation of order n y (n) + an−1 y (n−1) + . . . + a1 y (1) + a0 = f where y (k) is the k:th derivative of y and f is a function. When solving this kind of differential equation it is often helpful to find the roots of the characteristic polynomial p(x) = xn + an−1 xn−1 + . . . + a1 x + a0 To get the companion matrix we rewrite the differential equation using state variables x0 = y x1 = y (1) .. . xn−1 = y (n−1) xn = y (n) Using the state variables x1 , . . . xn the differential equation can be written as a system of first order differential equations. If we denote a vector containing all xi with x and a vector containing all x0i with x0 then the differential equation can be written as a matrix equation 110 x0 = C(p)> x ⇔ x00 = x1 x01 = x2 .. . x0n−1 = xn x0n = −a0 x0 − a1 x1 − . . . − an−1 xn + f With C(p) defined as follows. Definition 9.7 Let p(x) = a0 + a1 x + a2 x2 + . . . + an−1 xn−1 + xn 0 0 ... 0 1 0 . . . 0 C(p) = 0 1 . . . 0 .. .. . . .. . . . . then the companion matrix of p is −a0 −a1 −a2 .. . 0 0 . . . 0 −an−1 In some literature C(p)> is referred to as the companion matrix instead of C(p). Theorem 9.7. If C(p) is a companion matrix to the polynomial p then the eigenvalues of C(p) are roots of p. Proof. Let ei denote vectors in the standard basis, that is column vectors with a 1 in element i and 0 everywhere else. For these vectors multiplication with C(p) will result in the following C(p)ei = ei+1 for 0 ≤ i < n n X C(p)en = −ai ei (9) (10) i=1 From (9) it can be easily shown that ei = C(p)i e1 for 0 ≤ i < n n en = C(p) e1 (11) (12) Combining equation (9)-(12) gives C(p)n e1 = n−1 X −ai C(p)i e1 ⇔ i=0 n n−1 ⇔ C(p) + an−1 C(p) + . . . + a1 C(p) + a0 I = 0 ⇔ p(C(p)) = 0 111 By the Cayley-Hamilton theorem pC (C(p)) = 0 with pC (λ) = det(λI − C(p)). Thus we have two polynomials that are zero for the companion matrix. p(C(p)) = 0 pC (C(p)) = 0 Then we can use theorem (9.6) to see that that p(A) = c · pC (A) (since p and pC have the same degree). This means that p(λ) and pC (λ) must have the same roots. The eigenvalues of C(p) are roots of pC (λ) and thus the theorem is proved. Calculating roots of polynomials using companion matrices can be quite efficient, especially since a polynomial of high degree gives a sparse matrix , that is a matrix with only a few non-zero elements. Sparse matrices can be used quite efficiently by computers and many algorithms can be specifically modified to work much faster on sparse matrices than non-sparse matrices. The power method described in section 11.1 can be used to efficiently find numerical eigenvalues of sparse matrices. Many computer programs that finds roots of polynomials uses companion matrices, for example the roots command in MATLAB (see [5]). 9.4 Exponential matrix So far we have constructed matrix functions by taking the definition of a function for numbers and replaced numbers with matrices when it was possible. Now we are going to construct a matrix version of a function in a different way, we are going to identify the most important properties of a function and then construct a matrix function that has the same properties. The function that we are going to construct a matrix version of the exponential function. A common definition for the exponential function is x n ) n The exponential function is useful in many ways but most of its usefulness comes from the following three properties ex = lim (1 + n→∞ d at (e ) = aeat = eat a dt (eat )−1 = e−at ea · eb = ea+b = eb+a = eb · ea (13) (14) (15) Here a, b and t are all real or complex numbers. Multiplication of numbers is usually commutative (a · b = b · a) but matrix multiplication is often not commutative. In equation (13)-(15) the various places where commutativity comes in has been written out explicitly since we want to have commutativity in the same way for the matrix version. 112 The most complex function we have defined so far is the matrix polynomial. Can we construct a polynomial such that it fulfills (13)-(13)? Consider what happens to a general polynomial of At when we take the first time derivative p(At) = n X k=0 n n−1 k=1 l=0 X X d ak (At) ⇔ (p(At)) = ak kAk tk−1 = A al+1 (l + 1)Al tl dt k The last equality is achieved by simply taking l = k − 1. Compare this to Ap(At) = A n X al Al tl l=0 If the polynomial coefficients are chosen such that al+1 (l + 1) = al these expressions would be identical except for the final term. Let us choose the correct coefficients al+1 (l + 1) = al ⇔ al = 1 l! and simply avoid the problem of the last term not matching up by making the polynomial have infinite degree. Definition 9.8 The matrix polynomial function is defined as e At = ∞ k X t k=0 k! Ak , A ∈ Mn×n It is simple to create eA from this by setting t = 1. An important thing to note here is that since the sum in the definition is infinite we should check for convergence. As it turns out the exponential function will always converge. This follows from a more general theorem presented in section 9.5. Theorem 9.8. For the matrix exponential function the following is true d At (e ) = AeAt = eAt A dt (eAt )−1 = e−At A A e ·e =e A+B =e B+A (16) (17) B A = e · e if AB = BA (18) where A and B are square matrices. This definition of eAt has all the properties we want, with the exception that we need the extra condition on (18) compared to (15). There is no way around this unfortunately. That the definition gives a function that has property (16) is clear from the construction. Property (17) and (18) can be shown by direct calculations using the definition. Calculating the exponential matrix can be challenging. First of all we should note that the exponential function for numbers can be written like this 113 ex = ∞ X xk k=0 k! using a Taylor series. Using this we can see how to calculate the matrix exponential for a diagonal matrix D ∞ X dk1,1 0 ... 0 k! k=0 ∞ ed1,1 0 ... 0 k X d 2,2 0 ... 0 ed2,2 . . . 0 0 D k! e = = . . .. . k=0 . . . . . . . .. .. .. .. . . . . dm,m 0 0 ... e ∞ X dkn,n 0 0 ... k! k=0 Since the exponential matrix is essentially a polynomial it should not be surprising that the following relation holds. A = S−1 BS ⇒ eA = S−1 eB S On page 102 it was discussed that nilpotent matrices could make it easier to calculate the matrix power function. This is of course also true for the matrix exponential function, since it turn the infinites sum into a finite sum and thus turns the exponential function into a finite polynomial. There are also many ways of computing the exponential matrix function numerically. A nice collection can be found in [10]. 9.4.1 Solving matrix differential equations One of the most common applications of the exponential function for numbers is solving differential equations. df = kf (x) ⇒ f (x) = c · ekx dx The matrix exponential function can be used in the same way. Theorem 9.9. The exponential matrix exA is the solution to the initial value problem dU = AU dx U (0) = I Proof. From the construction of exA we already know that if fulfills the differential equation. Checking that the initial condition is fulfilled can be done simply by setting x = 0 in the definition. 114 9.5 General matrix functions There is a systematic way of creating a matrix version of a function. When the matrix exponential function was defined in the previous section (see page 113) we found a matrix polynomial of infinite degree that created the matrix version. One way of doing this is given below Definition 9.9 (Matrix function via Hermite interpolation) If a function f (λ) is defined on all λ that are eigenvalues of a square matrix A then the matrix function f (A) is defined as f (A) = h sX i −1 X f (k) (λi ) i=1 k=0 k! φik (A) where φik are the Hermite interpolation polynomials (defined below), λi are the different eigenvalues and si is the multiplicity of each eigenvalue. Definition 9.10 Let λi , 1 ≤ i ≤ n be complex numbers and si , 1 ≤ i ≤ n, be positive integers. Let s be the sum of all si . The Hermitian interpolation polynomials are a set of polynomials with degree lower than s that fulfills the conditions ( (l) 0 if i 6= j or k 6= l φik (λj ) = l! 1 if i = j and k = l Using this kind of definition to create a matrix function works as long as the function f is defined on the eigenvalues of the matrix. There is another definition af a matrix function that is equivalent to definition 9.9 but is only applicable to analytical functions, that is functions that can be written as a power series. Definition 9.11 (Matrix function via Cauchy Integral) Let C be a curve that encloses all eigenvalues of the square matrix A. If the function f is analytical (can be written as a power series) on C and inside C then Z 1 f (A) = f (z)(zI − A)−1 dz 2πi C The matrix RA (z) = (zI − A)−1 is called the resolvent matrix of A. Proofs that these two definitions makes sense and are equivalent can be found in [?] on page 1-8. The formula in definition 9.11 is a matrix version of Cauchy’s integral formula which can be used to derive the formula for a Taylor series. As a consequence of this is that we can construct the matrix version of any analytical function can by replacing the number variable in a Taylor series expansion (or other convergent power series) with a square matrix. For example of this see the formula for the matrix square root (page 105) and compare the definition of the matrix exponential function with the Taylor series for the exponential function for numbers. 115 10 Matrix equations A matrix equation is an equation that involves matrices and vectors instead of numbers. A simple example is Ax = y where A is a matrix, y is a known vector and x is an unknown vector. You are already familiar with some ways of solving this kind of equations, for example Gaussian elimination and matrix inverses. There are of course many other kinds of matrix equations. One example that we have seen before are the first order differential equation that was solved using the exponential matrix function, see page 114. There are many other kinds of matrix equations. A simple example would be exchanging the x and y vectors in the typical matrix equation for a pair of square matrices, X and Y. There are two variations of this kind of equation AX = Y XA = Y Another example of a matrix equation that is important in many applications (first and foremost in control theory) is Sylvester’s equation AX + XB = C There is also a version of Sylvester’s equation that is especially important known as the continous Liapunov1 equation. AX + XAH = −C There area many methods for solving specific matrix equations, but here we will take a look at a basic tool for examining certain properties of the solutions of matrix equations and turning the more exotic kind of matrix equations into the typical matrix-vector equation that we already know how to solve. For this a new tool will be introduced. 10.1 Tensor products The tensor product, usually denoted by ⊗ is an interesting tool that can be defined for many kinds of mathematical objects, not only matrices. It is sometimes called the outer product or direct product. A ⊗ B is usually pronounced ’A tensor B’. Definition 10.1 The tensor product for matrices is a bilinear function of matrices that has two properties: 1. Bilinearity: (µA + ηB) ⊗ C = µA ⊗ C + ηB ⊗ C A ⊗ (µB + ηC) = µA ⊗ B + ηA ⊗ C 2. Associativity: (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C) 1 sometimes spelled Lyapunov or Ljapunow 116 It can be possible to find different function that fullfil these condition for at set of matrices but the most general (and most common) tensor product is the Kroenecker product. Definition 10.2 The Kroenecker product ⊗ between two matrices, as a1,1 B a1,2 B a2,1 B a2,2 B A⊗B= . .. .. . A ∈ Mm×n and B ∈ Mp×q , is defined . . . a1,n B . . . a2,n B .. .. . . am,1 B am,2 B . . . am,n B or Ab1,1 Ab2,1 A⊗B= . .. Ab1,2 Ab2,2 .. . ... ... .. . Ab1,n Ab2,n .. . Abm,1 Abm,2 . . . Abm,n these two definitions are equivalent but not equal, one is a permutation of the other. The Kroenecker product is quite simple and intuitive to work with, even if it creates large matrices. Theorem 10.1. For the Kroenecker product the following is true 1. (µA + ηB) ⊗ C = µA ⊗ C + ηB ⊗ C 2. A ⊗ (µB + ηC) = µA ⊗ B + ηA ⊗ C 3. (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C) 4. (A ⊗ B)> = A> ⊗ B> 5. A ⊗ B = A ⊗ B (complex conjugate) 6. (A ⊗ B)H = AH ⊗ BH 7. (A ⊗ B)−1 = A−1 ⊗ B−1 , for all invertible A and B 8. det(A ⊗ B) = det(A)k det(B)n , with A ∈ Mn×n , B ∈ Mk×k 9. (A ⊗ B)(C ⊗ D) = (AC ⊗ BD), A ∈ Mm×n , B ∈ Mn×k , C ∈ Mp×q , D ∈ Mq×r 10. A ⊗ B = (A ⊗ Ik×k )(In×n × B), A ∈ Mn×n , B ∈ Mk×k The first 7 properties listed here shows that the Kroenecker product is bilinear, associative and react in a very intuitive way to transpose, complex conjugate, Hermitian conjugate and inverse operations. Number 8 shows that determinant are almost as intuitive but not quite (note that the determinant of A has been raised to the number of rows 117 in B and vice versa). Properties 9 and 10 are examples of how matrix multiplication and Kroenecker products relate to one another. Taking the Kroenecker product of two matrices gives a new matrix. Many of the properties if the old matrices relate to the properties of the old matrix in a fairly uncomplicated manner. One property that this applies to are the eigenvalues Theorem 10.2. Let {λ} be the eigenvalues of A and {µ} be the eigenvalues of B. Then the following is true: {λµ} are the eigenvalues of A ⊗ B {λ + µ} are the eigenvalues of A ⊗ B Sketch of proof. a) Use Shurs lemma A = S−1 RA S, B = T−1 RB T where RA and RB are triangular matrices with eigenvalues along the diagonal. Then A ⊗ B = (S−1 RA S) ⊗ (T−1 RB T) = = (S−1 RA ⊗ T−1 RB )(S ⊗ T) = = (S ⊗ T)−1 (RA ⊗ RB )(S ⊗ T) RA ⊗RB will be triangular and will have the elements ai,i bj,j = λi µj on the diagonal, thus λi µj is an eigenvalue of A⊗B since similar matrices have the same eigenvalues. b) Same argument as above gives A ⊗ I (B ⊗ I) have the same eigenvalues as A (B) adding the two terms together gives λ + µ is an eigenvalue. 10.2 Solving matrix equations The tensor product can be used to make it easy to solve equations like these AX = Y XA = Y Taking the tensor product between the identity matrix and some square matrix results in the following theorem Theorem 10.3. e=Y e AX = Y ⇔ (I ⊗ A)X e=Y e XA = Y ⇔ (A> ⊗ I)X where B.1 B.2 e= . . . B.n , B .. . B = B.1 B.2 B.n 118 This turns an equation consisting of n × n matrices into an equation with one n2 × n2 matrix and two n2 vectors. This equation can be solved the usual way. Solving the matrix this way is equivalent to assigning each element in the unknown matrix a variable and the solve the resulting linear equation system. Using the Kroenecker product can make subsequent manipulations of the matrices involved easier though, thanks to the properties shown in theorem 10.1. Often it is interesting to know if there exists a solution to a given equation and if that solution is unique. Using the Kroenecker product we can see for which matrices A and B that there is a unique solution to Sylvester’s equation. Theorem 10.4. Sylvester’s equation AX + XB = C has a unique solution if and only if A and −B have no common eigenvalues. Proof. Rewrite the equation using the Kroenecker product e=C e AX + XB = C ⇔ (I ⊗ A + B> ⊗ I) X | {z } K This is a normal matrix equation which is solvable if λ = 0 is not an eigenvalue of K. The eigenvalues of K are λ + µ where λ is an eigenvalue of A and µ is an eigenvalue of B. Thus if A and −B have no common eigenvalues the eigenvalues of K will 6= 0. As mentioned previously Sylvester’s equation is important in control theory especially the version of it that is called the continuous Liapunov equation. This will be illustrated using two theorems (neither theorem will be proved). A stable system is a system that will in some sense ’autocorrect’ itself. If left alone it will approach some equilibrium state and not increase indefinitely or behave chaotically. If the system is described by a first order differential matrix equation then the following condition gives stability. Definition 10.3 A matrix, A, is stable when all solutions to dx = Ax(t) dt converges to 0, x(t) → 0 when t → ∞. ⇔ All the eigenvalues of A have negative real component, Re(λ) < 0 for all λ that is an eigenvalue to A. Theorem 10.5. If A and B are stable matrices then AX + XB = −C has the unique solution Z X= ∞ etA CetB dt 0 119 Theorem 10.6. Let P be a positive definite matrix. If and only if there is a positive definite solution to AH X + CA = −P is A a stable matrix. 120 11 Numerical methods for computing eigenvalues and eigenvectors In practise we can rarely use a analytical solution to finding eigenvalues or eigenvectors of a system. Either because the system is two large making Gausselimination, by solving the characteristic polynomial or similar methods to slow or maybe we only need one or a few eigenvalues or eigenvectors or maybe a rough estimate is enough. You have probably seen by now how time consuming it is to find eigenvalues and eigenvectors analytically. Since there are a huge number of different problems where we want to find eigenvalues or eigenvectors there are also a large number of different methods with different strengths and weaknesses depending on the size and structure of the problem. We will look at two different related methods, the Power method and the QR-method and a couple of variations of both. While this chapter gives some insight into how to use or implement some methods, it also serves to help in showing that there is no generall method that somehow always works. If a general method doesn’t work, there might be another specified method for your type of problem. It’s then important to know what to look for both in our matrix as well as what we are actually trying to find. It also serves to show how we can start with a very primitive algorithm and then carefully build upon it in order to remove some or most of it’s limitations while still keeping it’s strengths. Some things to take into consideration when choosing which method to use is: • The size of the matrix, especially very large matrices which we maybe can’t even fit in memory. • Density of non-zero elements, Is the matrix a full matrix (few zeroes) or a sparse matrix (few non-zero elements). • Does the matrix have any kind of useful structure we could use, for example if it’s positive definite, symmetric, Hessian or a bandmatrix. • What do we seek? Do we need all eigenvalues and eigenvectors or just one or a few? Do we neeed exact result or is a good approximation enough? Or maybe we don’t need bth eigenvalues and eigenvectors? 11.1 The Power method The Power method is a very simple algorithm for finding the eigenvalue λ1 with the largest absolute value and it’s corresponding eigenvector. While the method have a couple of disadvantages making it unsuitable for general matrices, it proves it’s worth in a couple of specialized problems unsuitable for more general methods such as the QR-method which we will look at later. The Power method also acts as an introduction to many of the thoughts behind the QR-method and it’s variations. 121 The Power method in its basic form is a very simple and easy to implement method. We consider the matrix A for which we want to find the ”largest” eigenvalue and corresponding eigenvector. • We assume there is only one eigenvalue on the spectral radius and that A is diagonalizable. • The method works by iterating xk+1 = Axk . kAxk k Axk approaches a vector parallel to the eigenvector of the eigenkAxk k value with largest absolute value λ1 . • If λ1 6= 0 then • We get the eigenvalue λ1 from the relation Ak x ≈ λ1 Ak−1 x when k is large. It’s easy to see that if the Matrix A> is a stochastic matrix, running the Power method can be seen as iterating the Markov chain. This also motivates why we need there to be only one eigenvalue on the spectral radius since otherwise the Markov chain and therefor neither the Power method would converge in those cases. We will not look into how fast the convergence is here, but be content by noting that the relative error can be shown to be in the order of |λ2 /λ1 |k . In other words it’s not only other eigenvalues on the spectral radius that’s problematic, if there are eigenvalues close to the spectral radius they will slow down the convergence as well. We give a short example using the following matrix: −2 1 1 A = 3 −2 0 1 3 1 We start with x0 = [1 1 1]> and iterate: 0 1 −2 1 1 1 = 1 Ax0 = 3 −2 0 1 5 1 3 1 Next we calculate the norm of this vector: kAx0 k = √ 26 And last we use the norm kAx0 k in order to normalize Ax0 : x1 = √ Ax0 = [0, 1, 5]/ 26 kAx0 k Continuing for the next iteration we −2 1 Ax1 = 3 −2 1 3 122 get: 1 0 6 1 1 0 1 √ = −2 √ 26 26 1 5 8 x2 = √ Ax1 = [6, −2, 8]/ 104 kAx1 k And if we continue for a couple of more iterations we get: • x2 = [0.59, −0.17, 0.78]> . • x3 = [−0.25, 0.91, 0.33]> . • x4 = [0.41, −0.61, 0.67]> . • x7 = [−0.28, 0.85, −0.44]> . • x10 = [0.28, −0.79, 0.55]> . • For reference the true eigenvector is xtrue = [0.267, −0.802, 0.525]. We see that it does seem to converge, but note that it could take quite some time. However since every step gives an a little better approximation, if we only need a rough estimate we could stop considerably earlier. Because of the assumptions regarding the system as well as the problem with convergence if λ1 ≈ λ2 this simple version of the algorithm is rarely used unless we already have a good understanding of the system. In short a couple of things to consider when using the power method are: • If there is more than one eigenvalue on the spectral radius or if it’s multiplicity is larger than one, the method won’t converge. • Since the error is of the order |λ2 /λ1 |k , if λ2 is close to λ1 the convergence can be very slow. • We only get the (in absolute value) largest eigenvalue and corresponding eigenvector. While these are some considerable disadvantages (we can’t use it on all matrices and the convergence speed is uncertain), there are a couple of things we can do to handle these problems. However while the method does have a couple of negatives, it does have a couple of very strong points as well. • In every step we make only a single vector-matrix multiplication, making it very fast if we can ensure a fast convergence. • If A is a sparse matrix (few non-zero elements) the iterations will be fast even for extremely large matrices. • Since we do no matrix decomposition (as we do in most other common methods) we do not ”destroy” the sparcity of the matrix when calculating the matrix decomposition. 123 Especially we see the advantage of the method on sparse matrices since most other commonly used methods uses a matrix decomposition of the matrix, usually destroying the sparsity of the matrix in the process. This means that while the method might not be as reliable or as fast as other methods such as the QR-method in the general case, if we have a very large sparse matrix we might not even be able to use many other methods since they would be to slow to be usable. In some cases the matrix itself might not even fit in memory anymore after matrix decomposition further increasing the time needed. 11.1.1 Inverse Power method In order to handle some of the problems found earlier we will take a look at something called the inverse Power method. While this fixes some of the problems it doesn’t fix all of them, and there is a cost in that the method is no longer as suitible for sparse matrices as before. We consider the invertible matrix A with one eigenpair λ, x. From the definition of eigenvalues and eigenvectors we then have Ax = λx. This gives: Ax = λx ⇔ (A − µI)x = (λ − µ)x ⇔ x = (A − µI)−1 (λ − µ)x ⇔ (λ − µ)−1 x = (A − µI)−1 x And we see that ((λ − µ)−1 , x) is an eigenpair of (A − µI)−1 If we instead iterate using this new matrix (A − µI)−1 the method becomes: xk+1 = (A − µI)−1 xk k(A − µI)−1 xk k And just like the ordinary Power method this will converge towards the eigenvector of the dominant eigenvalue of (A − µI)−1 . The dominant eigenvalue of (A − µI)−1 is not necesarily the same as the dominant eigenvalue of A. Rather we get the dominant eigenvalue (λ − µI)−1 of (A − µI)−1 for the λ closest to µ. If we take a look at the error we get: (λ −1 k − µ) (second closest to µ) (λ(closest to µ) − µ)−1 (λ(closest to µ) − µ) k = (λ − µ) (second closest to µ) We can easy see that if µ is a good approximation of an eigenvalue of A then (λ(closest to µ) ≈ 0 and the error will be close to zero as well, this especially improves the convergence with multiple close eigenvalues. With a good approximation of λ the method will converge very fast, often in only one or a couple of iterations. Especially in the case where we know the eigenvalue exactly beforehand such as when working with irreducible stochastic matrices we get the error equal to zero and the method should converge in only 124 one step. We also see that if we change µ we can find the eigenvectors of not only the dominant eigenvalue, but the eigenvectors to multiple eigenvalues. However we still have problems with eigenvalues with multiplicity larger than one, or when more than one have the same absolute value. The inverse Power method also requires us to solve a linear system or calculate the inverse of the matrix (A − µI). This makes the method unsuitable for large sparse matrices where the basic Power method shows it’s greatest strength, since calculating the inverse matrix usually destroys the sparcity of the matrix. We give a short example using the inverse the dominant eigenvalue of: 3 3 1 1 A= 1 0 1 3 power method to find the eigenvector to 1 0 1 0 0 3 3 2 Since A is non-negative and irreducible (why?) we can use Perron Frobenius Theorem 4.1 to find bounds on the dominant eigenvalue. We remember this means the dominant eigenvalue is at least equal to the lowest row sum and at most equal to the largest row sum. Since at least one of the diagonal elements are non-zero we know that A is primitive as well so that we know that there is only one eigenvalue on the spectral radius. This ensures the convergence (when seeking the dominant eigenvalue). We choose µ as the largest possible value of the dominant eigenvalue: µ = max row sum = 7. This ensures that the dominant eigenvalue is the eigenvalue closest to µ, ensuring we converge towards the right eigenvector. We get: −4 3 1 0 1 −6 0 3 A − µI = A − 7I = 1 0 −6 3 1 3 0 −5 And we get its inverse as: (A − µI)−1 −0.4038 −0.1538 = −0.1538 −0.1731 −0.3173 −0.3590 −0.1923 −0.2788 −0.0673 −0.0256 −0.1923 −0.0288 −0.2308 −0.2308 −0.2308 −0.3846 Starting with x0 = [1 1 1 1]> gives: x1 = (A − µI)−1 x0 k(A − µI)−1 x0 k = [−0.5913 − 0.4463 − 0.4463 − 0.5020]> If we continue to iterate a couple of times and multiply the end result of those that are negative with −1 to end up with a positive vector we get: 125 • x1 = [0.5913 0.4463 0.4463 0.5020]> • x2 = [0.6074 0.4368 0.4368 0.4995]> • x3 = [0.6105 0.4351 0.4351 0.4986]> • xtrue = [0.6113 0.4347 0.4347 0.4983]> We see that we are very close to the true eigenvector after only three iterations, we in fact have a rather good approximation after only one iteration. Using the found eigenvector we can now also find a better approximation of the true dominant eigenvalue λ = 5.8445 as well from the relation Ax = λx. For example using x3 we get: Ax3 = [3.5721 2.5414 2.5414 2.9131]> → λ ≈ 3.5721 = 5.8511 0.6105 Which is very close to the true eigenvalue λ = 5.8445. The Inverse power method is often used when you have a good approximation of an eigenvalue and need to find an approximation of the corresponding eigenvector. This is often the case in real time systems where a particular eigenvector might needs to be computed multiple times during a very short time as the matrix changes. Overall both the Power method and the Inverse Power method are unsuitable for general matrices since we cannot guarantee the convergence without further inspecting the matrix. In order to solve the limitations of these methods we will need to calculate many eigenvalues at the same time as for example in the QR-method below. 11.2 The QR-method Often we might not know anything about the matrix beforehand and we might want to find not only one or a couple of eigenvalues/eigenvectors but all of them. We also don’t want to be limited to only finding the eigenpairs for simple eigenvalues. The QR-method solves all of these problems, it finds all eigenpairs, and it’s generally very fast unless we are working with very large matrices where we might need specified algorithms. Since the method is fast, reliable and finds all eigenpairs it is almost always the QR-method (or a variation there of) that is used whenever you calculate eigenvalues/eigenvectors using computational software. Unless you have very specific requirements or already know a more suitable method it’s usually a good idea to start by trying the QR-method and then continue looking at other methods if that is unsufficient. 126 Like the Power method, the QR-method is a iterative method, which after some time converge and gives us the eigenvalues and eigenvectors. Like the name suggest the method uses the QR-decomposition of the original matrix. We remember that we could find a QR decomposition A = QR where Q is a unitary matrix and R is upper triangular using for example the Gram-Schmidt process. The method works by iterating: • We start by assigning A0 := A • In every step we then compute the QR-decomposition Ak = Qk Rk . −1 H • We then form Ak+1 = Rk Qk = QH k Qk Rk Qk = Qk Ak Qk = Qk Ak Qk • Since Q is unitary all Ak are similar and therefor have the same eigenvalues. • When the method converge Ak will be a triangular matrix and therefor have it’s eigenvalues on the diagonal. The QR-method can be seen as a variation of the Power method, but instead of working with only one vector, we work with a complete basis of vectors, using the QR-decomposition to normalize and orthogonalize in every step. In practice a few modifications are needed in order to handle a couple of problems in it’s most basic form some of which are: • Computing the QR-decomposition of a matrix is expensive, and we need to do it in every step. • The convergence itself like with the Power method can also be very slow. • Like the Power method we have problems with complex and non-simple eigenvalues. However if we can handle these problems we have a very stable and fast algorithm for general matrices. Especially the stability of the algorithm is one of it’s huge selling point and the biggest reason why it’s used over other similar methods using for example the LUP decomposition. But how do we improve the algorithm? We start by looking at how we can speed up the computations of the QR-decomposition since even if we use Gram Schmidt or some other method it’s still a very expensive operation. We remember a matrix (A) is (upper) Hessenberg if all elements ai,j = 0, i > j + 1. In other words A is of the form: F F F ··· F F F F F F · · · F F F 0 F F · · · F F F A = 0 0 F · · · F F F .. .. .. . . . .. .. . . .. . . . . 0 0 0 · · · F F F 0 0 0 ··· 0 F F 127 If A is Hessenberg then we would speed up the algorithm considerably since: • Calculating the QR-factorization of a Hessenberg matrix is much faster than for a generall matrix. • Since multiplication of a Hessenberg matrix with a triangular matrix (both in the same direction) is itself a Hessenberg matrix. If A0 is Hessenberg every other Ak , k = 0, 1, 2, . . . is Hessenberg as well. • Since a Hessenberg matrix is nearly triangular, it’s convergence towards the triangular matrix will be improved, resulting in fewer total steps as well We don’t want to limit our algorithm to only Hessenberg matrices, however what if we can change the matrix we work with to one in Hessenberg form? This is possible since: Theorem 11.1. Every square matrix A have a matrix decomposition A = UH HU where • U is a unitary matrix. • H is a Hessenberg matrix. Since U is a unitary matrix, A and U have the same eigenvalues so in order to find the eigenvalues of A we can instead find the eigenvalues of H which is as we wanted Hessenberg. A Hessenberg decomposition can be calculated effectively using for example householder transformations. While we generally have a faster convergence using H, it can still be very slow. We remember we could solve this problem in the Power method by shifting the eigenvalues of the matrix. In the same way we can shift the eigenvalues in the QR-method using (Ak − σI). For Ak we then get: H Ak+1 = QH k (Ak − σI)Qk + σI = Qk (Ak )Qk We see that Ak , Ak+1 is still similar and therefor will have the same eigenvalues. As in the Power method we want σ to be as close as possible to one of the eigenvalues. Usually the diagonal elements of the hessenberg matrix is used since they prove to give a rather good approximation. Since the diagonal elements of Ak should give increasingly accurate approximation of the eigenvalues we can change our shift to another eigenvalue or improve the current one between iterations as well. As fast as a subdiagonal element is close to zero, we have either found an eigenvalue if it’s one of the outer elements. Or we can divide the problem in two smaller ones since we would have: A1,1 A1,2 A= 0 A2,2 128 This still leaves the problem with complex eigenvalues in real matrices. In order to solve this problem an implicit method using multiple shifts at the same time is used in practice. We leave this implicit method, also called the Francis algorithm by noting that it exist and works by much the same principle as the inverse Power algorithm. We give a summary of the modified QR-method using a single shift: 1. Compute the Hessenberg form (A = UH HU) using for example Householder transformations. 2. Assign A0 = H. Then in every step, choose a shift σ as one of the diagonal elements of Ak . 3. Compute the Qk Rk factorization of (Ak − σI). 4. Compute Ak+1 = Rk Qk + σI 5. Iterate 2 − 4 until Ak is a triangular matrix, optionally dividing the problem in two whenever a subdiagonal element becomes zero. 11.3 Applications 11.3.1 PageRank Pagerank is the method originally designed by Google to rank homepages in their search engine in order to show the most relevant pages first. In the ranking algorithm we consider homepages as nodes in a directed graph, and a link between two homepages as an edge between the corresponding vertices in the graph. Since the number of homepages is huge (and rapidly increasing) this does not only put high demand on the accuracy of the model (few looks at more than the first couple of pages of results) it also puts a huge demand on the speed of the algorithm. The principle used in the ranking is such that homepages that are linked to by many other homepages and/or homepages that are linked to by other important homepages are considered more important (higher rank) than others. An image illustrating the principle can be seen in Figure ??. PageRank uses the adjacency matrix 3.1 of the graph with a couple of modifications. First we do not allow any homepage to link to itself. Secondly, any homepages which links to no other page, usually called dangling nodes needs to be handled separately. Usually they are assumed to instead link to all pages (even itself even though we otherwise don’t allow it). This is the method we will use here, although there are other methods available as well. Before calculating PageRank we make a modification to the adjacency matrix. We weight the matrix by normalizing the sum of every non-zero row such that it sums to one. This is our matrix A. Pagerank is then the eigenvector normalized such that the sum is equal to one of the dominant eigenvalue of matrix M : M = c(A + gv> )> + (1 − c)ve> 129 image by Felipe Micaroni Lalli Where the 0 < c < 1 is a scalar, usually c = 0.85, e is the one vector and A as above is a non-negative square matrix whose rows all sum to one. v is a non-negative weigh vector with the sum of all elements equal to one. g is a vector with elements equal to zero for all elements except those corresponding to dangling nodes which are equal to one. We see that saying that we consider dangling nodes linking to all other nods is not strictly true, rather we assume they link according to a weight vector with sum one, usually however all or at least most of the elements of v are equal such that this is more or less true. To illustrate the method we give a short example where we consider the following system: 1 2 3 4 5 We assume all elements of v equal such that we give no node a higher weight. Using the matrix for the graph we get the matrix (A + gv> ) by normalizing every row to one and add the change on the third row corresponding to the dangling node we get: 0 1/3 1/3 1/3 0 0 0 0 1 0 > (A + gv ) = 1/5 1/5 1/5 1/5 1/5 0 0 0 0 1 0 0 0 1 0 130 Using this we can now 0 0 M= c/5 0 0 get our matrix M = c(A + gv> )> + (1 − c)ve> : > c/3 c/3 c/3 0 1 1 1 1 1 1 1 1 1 1 0 0 c 0 1−c 1 1 1 1 1 c/5 c/5 c/5 c/5 + 5 0 0 0 c 1 1 1 1 1 0 0 c 0 1 1 1 1 1 If we choose c = 0.85 and calculate the eigenvector to the dominant eigenvalue we end up with PageRank p = [0.0384, 0.0492, 0.0492, 0.4458, 0.4173]> . From the result we can see a couple of things, first we see that node one which have no other node linking to it have the lowest PageRank and node two and three which are only linked to by node one have equal and the second lowest PageRank. Node four have a very high PageRank since it’s linked to by many nodes and node five have high PageRank since it’s linked to by node four which have a very high PageRank. We see that the definition seems reasonable, however we still have a large problem to consider: how do we find the eigenvector to the dominant eigenvalue of such a huge matrix in a reasonable time? For reference 2008 − 2009 both Google and Bing reported indexing over 1.000.000.000.000 webpages which in turn means our matrix M would have about 1024 elements. Obviously something needs to be done. We note that the matrix M has been constructed such that it’s a non-negative matrix. And it also seems reasonable to assume it’s irreducible through our vector v. In fact unless some nodes have a zero-weight in v, M will be a positive matrix. We also have that the sum of every column of M equal to one, so we can say that M is actually a (column) stochastic matrix. This gives us a more intuitive way to describe pagerank looking at is from a Markov chain perspective. Consider a internet surfer surfing the web. The internet surfer starts surfing at a random page then with probability c it follows any of the links from the current page ending up in a new page or with probability 1 − c it starts over at a new random page. The Internet surfer then continue repeatedly either follows a link from the current page or starts over at a new random page. The pagerank of a node (page) can then be seen as the probability that the Internet surfer is at that page after a long time. While this by itself doesn’t make it easier, we note something about A. Since every homepage on average only links to a couple of other homepages, A is an extremely sparse matrix. While A have maybe 1024 elements only 1012 m, where m is the average number of links on a homepage is non-zero. If we have a sparse matrix, we could use the Power method to use the sparsity of the data rather than trying (and failing) using other methods such as the QR-method. We note that even though M is a full matrix, the multiplication xn+1 = Mxn in the Power method can be done fast by calculating Axn first. Take a couple of minutes to think through the following: 131 • How can we calculate xn+1 = Mxn by multiplying xn with cA> instead so that we can use that A is sparse? • How can the weightvector v be used in order to combat linkfarms, where people try to artificially increase ones PageRank? • The attribute ’nofollow’ can be used on links to not increase the PageRank of the target page. Can you think of any sites where this could be useful? • What influence does the value c have on the PageRank? For example what is important for a high PageRank if c is close to zero rather than 0.85? 132 Index adjacency matrix, 28 adjugate formula, 106 adjugate matrix, adj(A), 106 aperiodic, 41 associativity, 60 basis, 65 bijection, 84 bijective, 88 canonical form, 22 stochastic matrix, 52 Cayley-Hamilton theorem, 106 Ces´aro sum, 50 characteristic polynomial, 13 Cholesky factorization, 21 commutativity, 60 connected (graph), 31 connected component (graph), 31 continous time Markov chain, 56 covariance matrix, 22 cycle (graph), 30 degree matrix, 33 determinant, 10, 21 determinant function, 10 diagonal matrix, 6 diagonalizable, 13 distance, 68 distance matrix, 29 distributivity, 60 echelon form, 20 eigenvalue, 13, 39 eigenvector, 13, 39 error vector, 94 field, 60 Gauss elimination, 9 Hermitian conjugate, 18 Hermitian matrix, 18, 21 hermitian matrix, 8 hermitian transpose, 8 Hessenberg matrix, 17 hitting probability, 54 hitting time, 55 homogeneous coordinates, 86 Image, R(T), 88 imprimitive matrix, 42 injection, 84 injective, 88 inner product space, 66 interpolation, 95 irreducible matrix, 31, 39 Isomorphism, 91 Jordan block, 23 Jordan normal form, 23 Kernel, N(T), 88 Kroenecker product, 117 Laplacian matrix, 34 least-square sense, 93 linear combination, 62 linear space, 60 linear system, 8 linear transformation, 79 linearly dependent, 64 linearly independent, 64 LU-factorization, 21 LUP-factorization, 21 Markov chain, 46, 46 matrix decomposition, 16 matrix exponential function, eA , 113 matrix factorization, 16 matrix polynomial, 106 matrix power function,√An , 100 1 n matrix root function, A, A n , 103 Moore-Penrose pseudoinverse, 98 multilinear, 11 133 nilpotent matrix, 102 non-negative matrix, 39 normal equations, 95 Nullity, 90 Nullspace, N(T), 88 orthogonal, 69 orthogonal set, 69 orthonormal set, 69 path (graph), 30 permutation matrix, 21 Perron-Frobenius (theorem), 39 positive definite matrix, 19, 21 positive matrix, 40 Power method, 121 primitive matrix, 40 pseudoinverses, 98 QR-factorization, 22 QR-method, 126 rank, 90 rank factorization, 20 reduced row echelon form, 23 residual vector, 94 rotation matrix, 82 row echelon form, 9 Schwarz inequality, 68 similar matrix, 14 simple graph, 33 singular value decomposition, 24 span, 62 spanning set, 63 sparse matrix, 112 spectral factorization, 20 spectral radius, 13, 39 spectrum, 13 √ 1 square root of a matrix, A, A 2 , 103 stable, 119 standard matrix, 81 stationary distribution, 48 stochastic matrix, 46 irreducible and imprimitive, 50 134 irreducible and primitive, 49 reducible, 51 strongly connected (graph), 31 strongly connected component, 31 subspace, 61 surjection, 84 surjective, 88 SVD, 24 symmetric matrix, 8 trace, 6 transpose, 8 triangular matrix, 17, 21, 51 unitary matrix, 18, 22, 70 vector space, 60 weighted least square method, 97 References [1] K. Bryan and T. Leise. The $25, 000, 000, 000 eigenvector: the linear algebra behind google. SIAM Rev., 48(3):569–581, 2006. [2] M. Haugh. The monte carlo framework, examples from finance and generating correlated random variables. Course Notes, 2004. [3] J.R.Norris. Markov Chains. Cambridge university press, New York, 2009. [4] C. D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, Boston, 2001. [5] C. Moler. Cleve’s corner: Chebfun, roots, October 2012. visited November 22 2012. [6] W. Nicholson. Elementary linear algebra with applications, second edition. PWS Publishing company, Boston, 1993. [7] P. G. D. J. L. Snell. Random walks and electric networks. Free Software Foundation, 2000. [8] S. Spanne. F¨ orel¨ asningar i Matristeori. KFS i Lund AB, Lund, 1994. [9] R. Tobias and L. Georg. Markovprocesser. Univ., Lund, 2000. [10] C. M. C. van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review, 45, 2003. 135