COMP 598 – Applied Machine Learning, Lecture 12: Ensemble Learning (cont’d) and Support Vector Machines
Transcription
COMP 598 – Applied Machine Learning
Lecture 12: Ensemble Learning (cont’d) and Support Vector Machines
Instructor: Joelle Pineau (jpineau@cs.mcgill.ca)
TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca), Angus Leigh (angus.leigh@cs.mcgill.ca)
Class web page: www.cs.mcgill.ca/~jpineau/comp598

Outline
• Perceptrons
  – Definition
  – Perceptron learning rule
  – Convergence
• Margin & max margin classifiers
• Linear support vector machines
  – Formulation as optimization problem
  – Generalized Lagrangian and dual

Perceptrons
• Given a binary classification task with data {xi, yi}i=1:n, yi ∈ {-1, +1}.
• A perceptron (Rosenblatt, 1957) is a classifier of the form:
  hw(x) = sign(wTx) = {+1 if wTx ≥ 0; -1 otherwise}
  – As usual, w is the weight vector (including the bias term w0).
• The decision boundary is wTx = 0.
• Perceptrons output a class, not a probability.
• An example <xi, yi> is classified correctly if and only if yi(wTxi) > 0.

[Figure: perceptron diagram – inputs 1, x1, …, xm with weights w0, w1, …, wm feeding a summation node ∑ whose sign gives the output y.]

Perceptron learning rule
• Consider the following procedure:
  – Initialize wj, j=0:m, randomly.
  – While any training examples remain incorrectly classified:
    – Loop through all misclassified examples.
    – For misclassified example xi, perform the update:
      w ← w + α yi xi
      where α is the learning rate (or step size).
• Intuition: for misclassified positive examples, increase wTx; for misclassified negative examples, decrease it.

Gradient-descent learning
• The perceptron learning rule can be interpreted as a gradient descent procedure, with optimization criterion:
  Err(w) = ∑i=1:n { 0 if yiwTxi ≥ 0; -yiwTxi otherwise }
• For correctly classified examples, the error is zero.
• For incorrectly classified examples, the error tells by how much wTxi is on the wrong side of the decision boundary.
• The error is zero when all examples are classified correctly.
• The error is piecewise linear, so it has a gradient almost everywhere. (A minimal code sketch of the update procedure follows below.)
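To make the update rule concrete, here is a minimal NumPy sketch (not from the slides; the function name train_perceptron, the fixed step size, the zero initialization, and the toy data are illustrative assumptions). It assumes each input vector has a constant 1 prepended so that w0 is folded into w, as on the slides.

```python
import numpy as np

def train_perceptron(X, y, alpha=1.0, max_epochs=1000):
    """Perceptron learning rule: w <- w + alpha * y_i * x_i for misclassified x_i.

    X is (n, m+1) with a leading column of 1s (bias folded into w); y is in {-1, +1}.
    Weights start at zero for simplicity (the slides initialize randomly).
    Returns w, or the last iterate if no separator is found within max_epochs.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:          # misclassified (or on the boundary)
                w = w + alpha * y[i] * X[i]     # perceptron update
                mistakes += 1
        if mistakes == 0:                        # all examples classified correctly
            return w
    return w

# Tiny separable example: class +1 above the line x1 + x2 = 1, class -1 below it.
X = np.array([[1.0, 0.9, 0.9], [1.0, 0.8, 0.7], [1.0, 0.1, 0.2], [1.0, 0.2, 0.1]])
y = np.array([+1, +1, -1, -1])
w = train_perceptron(X, y)
print(w, np.sign(X @ w))   # predictions should match y on separable data
```

On linearly separable data such as this toy set, the loop terminates after a finite number of updates, which is exactly the convergence result stated on the next slides.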
Linear separability
• The data set is linearly separable if and only if there exists a w such that:
  – For all examples, yiwTxi > 0.
  – Or equivalently, the 0-1 loss is zero for some set of parameters w.
[Figure: (a) a linearly separable data set; (b) a data set that is not linearly separable.]

Perceptron convergence theorem
• The basic theorem:
  – If the perceptron learning rule is applied to a linearly separable data set, a solution will be found after some finite number of updates.
• Additional comments:
  – The number of updates depends on the data set, and also on the learning rate.
  – If the data is not linearly separable, there will be oscillation (which can be detected automatically).
  – Decreasing the learning rate to 0 can cause the oscillation to settle on some particular solution.

Perceptron learning example – separable data
[Figure: a separable 2-D data set with the initial decision boundary for w = [0 0], w0 = 0, plotted over x1 and x2.]

Perceptron learning example – separable data
[Figure: the same data set with the decision boundary obtained after the perceptron updates.]

Weights as a combination of input vectors
• Recall the perceptron learning rule: w ← w + α yi xi.
• If the initial weights are zero, then at any step the weights are a linear combination of the feature vectors of the examples:
  w = ∑i=1:n αi yi xi
  where αi is the sum of the step sizes used for all updates based on example i.
  – By the end of training, some examples may never have participated in an update, so the corresponding αi = 0.
• This is called the dual representation of the classifier; a small sketch in this form follows below.
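Below is a minimal sketch of the same perceptron in dual form (illustrative; the name train_perceptron_dual, the constant step size, and the toy data are assumptions, not from the slides). Instead of updating w directly, it accumulates one coefficient αi per example and recovers w = ∑i αi yi xi.

```python
import numpy as np

def train_perceptron_dual(X, y, alpha_step=1.0, max_epochs=1000):
    """Dual perceptron: keep one coefficient alpha[i] per training example.

    Each update on example i adds alpha_step to alpha[i]; the primal weights
    are always w = sum_i alpha[i] * y[i] * X[i].
    """
    n, d = X.shape
    alpha = np.zeros(n)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            w = (alpha * y) @ X                 # current weights from the dual coefficients
            if y[i] * (w @ X[i]) <= 0:          # misclassified (or on the boundary)
                alpha[i] += alpha_step
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, (alpha * y) @ X

X = np.array([[1.0, 0.9, 0.9], [1.0, 0.8, 0.7], [1.0, 0.1, 0.2], [1.0, 0.2, 0.1]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
alpha, w = train_perceptron_dual(X, y)
print(alpha, w)   # examples with alpha[i] == 0 never participated in an update
```

Examples whose αi remains 0 never drove an update; those are the examples shown faintly on the next slide.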
Perceptron learning example
• Examples used (bold) and not used (faint) in updates. What do you notice?
[Figure: the separable data set with the learned boundary; examples that drove updates are shown in bold, the others faintly.]

Perceptron learning example
• Solutions are often non-unique.
• The solution depends on the set of instances and on the order in which they are sampled for updates.
[Figure: the same data set with a different separating boundary obtained from a different ordering of the updates.]

Perceptron summary
• Perceptrons can be learned to fit linearly separable data, using a gradient-descent rule.
• Solutions are non-unique.
• There are other fitting approaches, e.g., formulation as a linear constraint satisfaction problem / linear program.
• For non-linearly separable data:
  – Perhaps the data can be linearly separated in a different feature space?
  – Perhaps we can relax the criterion of separating all the data?
• The logistic function offers a “smooth” version of the perceptron.

Support Vector Machines (SVMs)
• Support vector machines (SVMs) for binary classification can be viewed as a way of training perceptrons.
• Main new ideas:
  – An alternative optimization criterion (the “margin”), which eliminates the non-uniqueness of solutions and handles non-separable data.
  – An efficient way of operating in expanded feature spaces, which allows non-linear functions to be represented (the “kernel trick”).
• SVMs can also be used for multiclass classification and regression.

The non-uniqueness issue
• Consider a linearly separable binary classification data set.
• There is an infinite number of hyperplanes that separate the classes.
[Figure: a separable data set with several different separating lines drawn through it.]
• Which plane is best? (A small illustration of this non-uniqueness follows below.)
• Relatedly, for a given plane, for which points should we be most confident in the classification?
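To see the non-uniqueness concretely, the sketch below (an illustration added to this transcription; the helper name and toy data are assumptions) trains the perceptron on the same data with different presentation orders. The runs typically end with different separating hyperplanes, all of which classify the training set perfectly.

```python
import numpy as np

def perceptron(X, y, order, alpha=1.0, max_epochs=1000):
    """Perceptron trained by visiting the examples in the given order."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for i in order:
            if y[i] * (w @ X[i]) <= 0:
                w = w + alpha * y[i] * X[i]
                mistakes += 1
        if mistakes == 0:
            break
    return w

X = np.array([[1.0, 0.9, 0.9], [1.0, 0.8, 0.7], [1.0, 0.1, 0.2], [1.0, 0.2, 0.1]])
y = np.array([+1, +1, -1, -1])

rng = np.random.default_rng(0)
for _ in range(3):
    order = rng.permutation(len(y))
    w = perceptron(X, y, order)
    print(order, w / np.linalg.norm(w))   # different orders often yield different separators
```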
The margin and linear SVMs
• For a given separating hyperplane, the margin is twice the (Euclidean) distance from the hyperplane to the nearest training example.
  – It is the width of the “strip” around the decision boundary containing no training examples.
[Figure: two separating hyperplanes for the same data, each drawn with the strip defined by its margin.]
• A linear SVM is a perceptron for which we choose w such that the margin is maximized.

Distance to the decision boundary
• Suppose we have a decision boundary that separates the data.
[Figure: a positive example xi at distance γi from the decision boundary, with xi0 its projection onto the boundary.]
• Let γi be the distance from instance xi to the decision boundary.
• How can we write γi in terms of xi, yi, w?

Distance to the decision boundary
• The vector w is normal to the decision boundary, thus w/||w|| is the unit normal.
• The vector from xi0 to xi is γi w/||w||.
• xi0, the point on the decision boundary nearest xi, is xi - γi w/||w||.
• As xi0 is on the decision boundary, wT(xi - γi w/||w||) = 0.
• Solving for γi yields, for a positive example: γi = wTxi / ||w||.
  (For an example of either class, γi = yiwTxi / ||w||.)

The margin
• The margin of the hyperplane is 2M, where M = mini γi.
• The most direct statement of the problem of finding a maximum-margin separating hyperplane is thus:
  maxw mini γi = maxw mini yiwTxi / ||w||
• Alternately: maximize M with respect to w, subject to yiwTxi / ||w|| ≥ M, ∀i.
• However, this turns out to be inconvenient for optimization:
  – w appears nonlinearly in the constraints.
  – The problem is underconstrained: if (w, M) is optimal, so is (βw, M), for any β > 0.
• Fix: add the constraint ||w|| M = 1. The problem then becomes:
  min ||w|| w.r.t. w, s.t. yiwTxi ≥ 1, ∀i

Final formulation
• Let’s minimize ||w||² instead of ||w||.
  (Taking the square is a monotone transformation, as ||w|| is positive, so this doesn’t change the optimal solution.)
• This gets us to:
  min ||w||² w.r.t. w, s.t. yiwTxi ≥ 1, ∀i
• This can be solved! How?
  – It is a quadratic programming (QP) problem – a standard type of optimization problem for which many efficient packages are available (one hedged sketch follows below).
  – Better yet, it’s a convex (positive semidefinite) QP.
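One way to solve this QP in practice is sketched below, under the assumption that the cvxopt package is available (the slides do not prescribe any particular solver, and the helper name and toy data are illustrative). It casts min ||w||² s.t. yiwTxi ≥ 1 into the standard form min (1/2) wᵀPw + qᵀw s.t. Gw ≤ h; note that, with the bias folded into w as on these slides, the bias is also (slightly) penalized.

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve min ||w||^2  s.t.  y_i * w^T x_i >= 1  (bias folded into w, as on the slides).

    Standard QP form: P = 2I, q = 0, G has rows -y_i * x_i^T, h = -1.
    """
    n, d = X.shape
    P = matrix(2.0 * np.eye(d))
    q = matrix(np.zeros(d))
    G = matrix(-(y[:, None] * X))        # row i is -y_i * x_i^T, so G w <= h means y_i w^T x_i >= 1
    h = matrix(-np.ones(n))
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h)
    return np.array(sol["x"]).ravel()

X = np.array([[1.0, 0.9, 0.9], [1.0, 0.8, 0.7], [1.0, 0.1, 0.2], [1.0, 0.2, 0.1]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
w = hard_margin_svm(X, y)
print(w, y * (X @ w))   # all constraint values should be >= 1; the tight ones hint at support vectors
```

Off-the-shelf SVM implementations (for example, a linear-kernel SVM with a very large penalty parameter) solve essentially the same problem, usually via the dual formulation introduced on the following slides.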
Example
[Figure: two separable data sets, each with the max-margin decision boundary obtained by solving the QP.]
• We have a solution, but no support vectors yet...

Lagrange multipliers
• Consider the following optimization problem, called the primal:
  minw f(w) s.t. gi(w) ≤ 0, i=1…k
• We define the generalized Lagrangian:
  L(w, α) = f(w) + ∑i=1:k αi gi(w)
  where the αi, i=1…k, are the Lagrange multipliers.

A different optimization problem
• Consider P(w) = maxα:αi≥0 L(w, α).
• Observe that the following is true:
  P(w) = { f(w) if all constraints are satisfied; +∞ otherwise }
• Hence, instead of computing minw f(w) subject to the original constraints, we can compute the primal:
  p* = minw P(w) = minw maxα:αi≥0 L(w, α)

Dual optimization problem
• Let d* = maxα:αi≥0 minw L(w, α) (max and min are reversed); this is the dual.
• We can show that d* ≤ p*:
  – Let p* = L(wp, αp).
  – Let d* = L(wd, αd).
  – Then d* = L(wd, αd) ≤ L(wp, αd) ≤ L(wp, αp) = p*.
• If f and the gi are convex and the gi can all be satisfied simultaneously for some w, then we have equality: d* = p* = L(w*, α*).
• Moreover, w* and α* solve the primal and dual if and only if they satisfy the Karush-Kuhn-Tucker (KKT) conditions.
  (A small numeric illustration of weak duality appears after the summary below.)

What you should know
From today:
• The perceptron algorithm.
• The margin definition for linear SVMs.
• The use of Lagrange multipliers to transform optimization problems.
After the next class:
• The primal and dual optimization problems for SVMs.
• Feature space version of SVMs.
• The kernel trick and examples of common kernels.
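As a small, self-contained illustration of the primal/dual relationship above (added to this transcription, not part of the slides), the sketch below takes f(w) = w² with the single constraint g(w) = 1 - w ≤ 0, forms the generalized Lagrangian L(w, α) = w² + α(1 - w), and checks on a grid that maxα minw L(w, α) ≤ minw maxα L(w, α). Because f and g are convex and the constraint is satisfiable, the two values coincide here: both are 1, attained at w* = 1, α* = 2.

```python
import numpy as np

# Primal: min f(w) = w^2  s.t.  g(w) = 1 - w <= 0.   Lagrangian: L(w, a) = w^2 + a * (1 - w).
f = lambda w: w ** 2
g = lambda w: 1.0 - w
L = lambda w, a: f(w) + a * g(w)

w_grid = np.linspace(-3, 3, 1201)
a_grid = np.linspace(0, 10, 1201)          # Lagrange multipliers are constrained to a >= 0

# p* = min_w max_{a>=0} L(w, a): the inner max equals f(w) when g(w) <= 0 and +inf otherwise.
p_star = np.where(g(w_grid) <= 0, f(w_grid), np.inf).min()

# d* = max_{a>=0} min_w L(w, a): minimize over w for each a, then maximize over a.
W, A = np.meshgrid(w_grid, a_grid)
d_star = L(W, A).min(axis=1).max()

print(p_star, d_star)   # weak duality: d* <= p*; here both are approximately 1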