Kernel machines - PART II
MLSS'08, September 2008
Stéphane Canu, stephane.canu@litislab.eu

Roadmap
1 Kernels and the learning problem (three learning problems; learning from data: the problem; kernelizing the linear regression; kernel machines: a definition)
2 Tools: the functional framework (in the beginning was the kernel; kernel and hypothesis set; optimization, loss function and the regularization cost)
3 Kernel machines (non sparse kernel machines; sparse kernel machines; practical SVM)
4 Conclusion

In the beginning was the kernel...

Definition (Kernel): a function of two variables k from X × X to IR.

Definition (Positive kernel): a kernel k(s, t) on X is said to be positive if
- it is symmetric: k(s, t) = k(t, s)
- and if for any finite positive integer n:
  ∀ {α_i}_{i=1,n} ∈ IR, ∀ {x_i}_{i=1,n} ∈ X,   ∑_{i=1}^n ∑_{j=1}^n α_i α_j k(x_i, x_j) ≥ 0.
It is strictly positive if, for α_i ≠ 0,  ∑_{i=1}^n ∑_{j=1}^n α_i α_j k(x_i, x_j) > 0.

Examples of positive kernels

- the linear kernel: s, t ∈ IR^d, k(s, t) = s^T t
  symmetric: s^T t = t^T s
  positive: ∑_i ∑_j α_i α_j k(x_i, x_j) = ∑_i ∑_j α_i α_j x_i^T x_j = (∑_i α_i x_i)^T (∑_j α_j x_j) = ||∑_i α_i x_i||^2 ≥ 0
- the product kernel: k(s, t) = g(s) g(t) for some g: IR^d → IR
  symmetric by construction
  positive: ∑_i ∑_j α_i α_j k(x_i, x_j) = ∑_i ∑_j α_i α_j g(x_i) g(x_j) = (∑_i α_i g(x_i)) (∑_j α_j g(x_j)) = (∑_i α_i g(x_i))^2 ≥ 0

k is positive ⇔ its square root exists ⇔ k(s, t) = ⟨φ_s, φ_t⟩       [J.P. Vert, 2006]

Example: finite kernel

Let φ_j, j = 1, ..., p, be a finite dictionary of functions from X to IR (polynomials, wavelets...).
Feature map: Φ: X → IR^p, s ↦ Φ(s) = (φ_1(s), ..., φ_p(s)).
Linear kernel in the feature space: k(s, t) = (φ_1(s), ..., φ_p(s))^T (φ_1(t), ..., φ_p(t)).

E.g. the quadratic kernel: s, t ∈ IR^d, k(s, t) = (s^T t + b)^2.
Feature map: Φ: IR^d → IR^p with p = 1 + d + d(d+1)/2,
  s ↦ Φ(s) = (1, √2 s_1, ..., √2 s_d, s_1^2, ..., s_d^2, ..., √2 s_i s_j, ...).
Evaluating the kernel directly costs d + 1 multiplications instead of the p needed in feature space: use the kernel to save computation (a numerical check is sketched after the closure properties below).

Positive definite kernel (PDK) algebra (closure)

If k_1(s, t) and k_2(s, t) are two positive kernels:
- PDK form a convex cone: ∀ a_1 ∈ IR^+, a_1 k_1(s, t) + k_2(s, t) is a PDK
- for any measurable function ψ from X to IR, k(s, t) = ψ(s) ψ(t) is a PDK
- the product k_1(s, t) k_2(s, t) is a PDK

Proofs:
- by linearity: ∑_i ∑_j α_i α_j (a_1 k_1(x_i, x_j) + k_2(x_i, x_j)) = a_1 ∑_i ∑_j α_i α_j k_1(x_i, x_j) + ∑_i ∑_j α_i α_j k_2(x_i, x_j) ≥ 0
- for ψ(s)ψ(t): ∑_i ∑_j α_i α_j ψ(x_i) ψ(x_j) = (∑_i α_i ψ(x_i)) (∑_j α_j ψ(x_j)) = (∑_i α_i ψ(x_i))^2 ≥ 0
- for the product k_1 k_2, assuming ∃ ψ_ℓ such that k_1(s, t) = ∑_ℓ ψ_ℓ(s) ψ_ℓ(t):
  ∑_i ∑_j α_i α_j k_1(x_i, x_j) k_2(x_i, x_j) = ∑_ℓ ∑_i ∑_j (α_i ψ_ℓ(x_i)) (α_j ψ_ℓ(x_j)) k_2(x_i, x_j) ≥ 0

[N. Cristianini and J. Shawe-Taylor, Kernel Methods for Pattern Analysis, 2004]
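As a quick illustration of the finite-kernel slide (not part of the original slides; Python and NumPy are assumed), the sketch below builds the explicit quadratic feature map and checks that its inner product matches (s^T t + b)^2 evaluated directly with the kernel:

```python
import numpy as np

def quad_kernel(s, t, b=1.0):
    # direct evaluation: d + 1 multiplications
    return (s @ t + b) ** 2

def quad_feature_map(s, b=1.0):
    # explicit map into IR^p, p = 1 + d + d(d+1)/2 (first coordinate is b, i.e. 1 when b = 1)
    d = len(s)
    quad = [s[i] * s[i] for i in range(d)]                                   # s_i^2 terms
    quad += [np.sqrt(2) * s[i] * s[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate(([b], np.sqrt(2 * b) * s, quad))

rng = np.random.default_rng(0)
s, t = rng.normal(size=10), rng.normal(size=10)
assert np.isclose(quad_feature_map(s) @ quad_feature_map(t), quad_kernel(s, t))
```

Already with d = 10 the feature space has p = 66 coordinates, which is why evaluating the kernel directly is preferable.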
Kernel engineering: building PDK

- for any polynomial φ from IR to IR with positive coefficients, φ(k(s, t)) is a PDK
- if Ψ is a function from IR^d to IR^d, k(Ψ(s), Ψ(t)) is a PDK
- if ϕ from IR^d to IR^+ has its minimum at 0, k(s, t) = ϕ(s + t) − ϕ(s − t) is a PDK
- the convolution of two positive kernels is a positive kernel: K_1 ⋆ K_2

The Gaussian kernel is a PDK:
  exp(−||s − t||^2) = exp(−||s||^2 − ||t||^2 + 2 s^T t) = exp(−||s||^2) exp(−||t||^2) exp(2 s^T t)
- s^T t is a PDK and exp is the limit of a series expansion with positive coefficients, so exp(2 s^T t) is a PDK
- exp(−||s||^2) exp(−||t||^2) is a PDK as a product kernel
- the product of two PDK is a PDK

[O. Catoni, master lecture, 2005]

An attempt at classifying PD kernels

- stationary kernels (also called translation invariant): k(s, t) = k_s(s − t)
  - radial (isotropic), e.g. Gaussian: exp(−r^2/b), r = ||s − t||
  - with compact support, e.g. Matérn type: max(0, 1 − r/b)^κ (r/b)^k B_k(r/b), κ ≥ (d + 1)/2
  - locally stationary kernels: k(s, t) = k_1(s + t) k_s(s − t), where k_1 is a non-negative function and k_s a radial kernel
- non stationary (projective) kernels: k(s, t) = k_p(s^T t)
- separable kernels: k(s, t) = k_1(s) k_2(t) with k_1 and k_2 PDK; in this case K = k_1 k_2^T, where k_1 = (k_1(x_1), ..., k_1(x_n))^T

[M.G. Genton, Classes of Kernels for Machine Learning: A Statistics Perspective, JMLR, 2002]

Kernel spectral representation

- stationary kernels (Bochner's theorem):
  k_s(s − t) = ∫_{IR^d} cos(ω^T (s − t)) F(dω),  with F a positive finite measure
- radial kernels:
  k_r(||s − t||) = ∫_0^∞ Ψ_d(ω ||s − t||) F(dω),  where F is non-decreasing and bounded, and Ψ_d a specific function
- non stationary (projective) kernels:
  k_p(s, t) = ∫_{IR^d} ∫_{IR^d} cos(ω_1^T s − ω_2^T t) F(dω_1, dω_2),  with F a positive bounded symmetric measure

The Fourier transform may help to design PD kernels (a sampling sketch follows).
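Bochner's theorem is also constructive: sampling frequencies from F yields a random finite-dimensional approximation of a stationary kernel. This is the random Fourier features idea of Rahimi and Recht, which is not in the slides; the hedged NumPy sketch below applies it to the Gaussian kernel, whose spectral measure is Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 4000, 1.5                       # input dim, number of sampled frequencies, bandwidth
W = rng.normal(scale=1.0 / sigma, size=(D, d))   # omega_i drawn from F, the Gaussian spectral measure
u = rng.uniform(0.0, 2.0 * np.pi, size=D)        # random phases

def z(x):
    # random feature map: z(s) @ z(t) approximates k(s, t)
    return np.sqrt(2.0 / D) * np.cos(W @ x + u)

s, t = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((s - t) ** 2) / (2.0 * sigma ** 2))
print(exact, z(s) @ z(t))                        # the two values agree up to O(1/sqrt(D))
```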
Some examples of PD kernels (r = ||s − t||)

  type         name           k(s, t)
  radial       Gaussian       exp(−r^2/b)
  radial       Laplacian      exp(−r/b)
  radial       rational       1 − r^2/(r^2 + b)
  radial       loc. Gauss.    max(0, 1 − r/(3b))^d exp(−r^2/b)
  non stat.    χ^2            exp(−r/b),  with r = ∑_k (s_k − t_k)^2/(s_k + t_k)
  projective   polynomial     (s^T t)^p
  projective   affine         (s^T t + b)^p
  projective   cosine         s^T t / (||s|| ||t||)
  projective   correlation    exp(s^T t / (||s|| ||t||) − b)

Most of these kernels depend on a quantity b called the bandwidth.

The importance of the kernel bandwidth

- for the affine kernel, bandwidth = bias:  k(s, t) = (s^T t + b)^p = b^p (s^T t / b + 1)^p
- for the Gaussian kernel, bandwidth = influence zone:  k(s, t) = (1/Z) exp(−||s − t||^2 / (2σ^2)),  b = 2σ^2

Illustration: 1-d density estimation. Given data (x_1, x_2, ..., x_n), the Parzen estimate is
  P̂(x) = (1/Z) ∑_{i=1}^n k(x, x_i)
[figure: Parzen estimates for b = 1/2 and b = 2]

Kernels for objects and structures

- kernels on histograms and probability distributions:  k(p, q) = ∫ k_i(p(x), q(x)) IP(x) dx
- kernels on strings:
  - spectral string kernel, using subsequences:  k(s, t) = ∑_u φ_u(s) φ_u(t)
  - similarities by alignments:  k(s, t) = ∑_π exp(β(s, t, π))
- kernels on graphs (A the adjacency matrix, D the degree matrix, L = D − A the graph Laplacian):
  - the pseudo-inverse of the (regularized) graph Laplacian
  - diffusion kernels:  (1/Z(b)) exp(bL)
  - subgraph kernel convolution (using random walks)
- and kernels on heterogeneous data (images), HMM, automata...

[Shawe-Taylor & Cristianini's book, 2004; J.P. Vert, 2006]

Gram matrix

Definition (Gram matrix): let k(s, t) be a positive kernel on X and (x_i)_{i=1,n} a sequence on X. The Gram matrix is the square matrix K of dimension n with general term K_ij = k(x_i, x_j).

Practical trick to check kernel positivity: K is positive ⇔ its eigenvalues are non-negative: if K u_i = λ_i u_i with ||u_i|| = 1, i = 1, ..., n, then u_i^T K u_i = λ_i u_i^T u_i = λ_i ≥ 0.

The Gram matrix K is the object the algorithms actually use.

[figure: Gram matrices of the same raw data for bandwidths b = 0.5, b = 2 and b = 10]

Different points of view about kernels

- kernel and scalar product:  k(s, t) = ⟨φ(s), φ(t)⟩_H
- kernel and distance:  d(s, t)^2 = k(s, s) + k(t, t) − 2 k(s, t)
- kernel and covariance: a positive matrix is a covariance matrix,
  IP(f) = (1/Z) exp(−½ (f − f_0)^T K^{−1} (f − f_0));  if f_0 = 0 and f = Kα, then IP(α) = (1/Z) exp(−½ α^T K α)
- kernel and regularity (Green's function):  k(s, t) = P* P δ_{s−t} for some operator P (e.g. a differential operator)
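The "practical trick" above is easy to run. A hedged NumPy sketch (not from the slides): build Gaussian Gram matrices at several bandwidths and inspect the smallest eigenvalue; a non-positive kernel such as the sigmoid tanh(s^T t + a), mentioned later, can fail the same test.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                         # raw data

def gaussian_gram(X, b):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / b)

for b in (0.5, 2.0, 10.0):
    lam_min = np.linalg.eigvalsh(gaussian_gram(X, b)).min()
    print(f"Gaussian, b={b}: smallest eigenvalue {lam_min:.2e}")   # >= 0 up to round-off

K_tanh = np.tanh(X @ X.T + 1.0)                      # symmetric but not positive in general
print("tanh kernel:", np.linalg.eigvalsh(K_tanh).min())            # can be negative
```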
Let's summarize

- positive kernels: there are a lot of them and they can be rather complex
- two classes: radial / projective
- the bandwidth matters (more than the kernel itself)
- the Gram matrix summarizes the pairwise comparisons

Roadmap (part 2, Tools: the functional framework): kernel and hypothesis set

From kernel to functions

  H_0 = { f | m_f < ∞; f_j ∈ IR; t_j ∈ X; f(x) = ∑_{j=1}^{m_f} f_j k(x, t_j) }

Define the bilinear form (with g(x) = ∑_{i=1}^{m_g} g_i k(x, s_i)):
  ∀ f, g ∈ H_0,  ⟨f, g⟩_{H_0} = ∑_{j=1}^{m_f} ∑_{i=1}^{m_g} f_j g_i k(t_j, s_i)

Evaluation functional:  ∀ x ∈ X,  f(x) = ⟨f(·), k(x, ·)⟩_{H_0}

From k to H: with any positive kernel, a hypothesis set H can be constructed, together with its metric.

RKHS

Definition (reproducing kernel Hilbert space, RKHS): a Hilbert space H endowed with the inner product ⟨·,·⟩_H is said to have a reproducing kernel if there exists a positive kernel k such that
  ∀ s ∈ X, k(·, s) ∈ H   and   ∀ f ∈ H, f(s) = ⟨f(·), k(s, ·)⟩_H.

positive kernel ⇔ RKHS
- any function is pointwise defined
- the kernel defines the inner product
- it defines the regularity (smoothness) of the hypothesis set

Functional differentiation in RKHS

Let J be a functional J: H → IR, f ↦ J(f).   Examples: J_1(f) = ||f||^2, J_2(f) = f(x).

Directional derivative in direction g at point f:
  dJ(f, g) = lim_{ε→0} (J(f + εg) − J(f)) / ε

Gradient ∇J: H → H, f ↦ ∇J(f),  if  dJ(f, g) = ⟨∇J(f), g⟩_H.

Exercise: find ∇J_1(f) and ∇J_2(f).

Hint:  dJ(f, g) = d/dε J(f + εg) |_{ε=0}

Solution

  dJ_1(f, g) = lim_{ε→0} (||f + εg||^2 − ||f||^2)/ε
             = lim_{ε→0} (||f||^2 + ε^2 ||g||^2 + 2ε⟨f, g⟩_H − ||f||^2)/ε
             = lim_{ε→0} (ε ||g||^2 + 2⟨f, g⟩_H) = ⟨2f, g⟩_H
  ⇔ ∇J_1(f) = 2f

  dJ_2(f, g) = lim_{ε→0} (f(x) + εg(x) − f(x))/ε = g(x) = ⟨k(x, ·), g⟩_H
  ⇔ ∇J_2(f) = k(x, ·)

  Minimize_{f∈H} J(f)  ⇔  ∀ g ∈ H, dJ(f, g) = 0  ⇔  ∇J(f) = 0
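For functions in the span, f = ∑_j α_j k(·, x_j), everything above becomes finite-dimensional: ||f||^2_H = α^T K α and ⟨f, g⟩_H = α^T K β. A hedged NumPy sketch (not from the slides) checking the solution for J_1 by finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(15, 2))                             # points x_j carrying the expansion
K = np.exp(-np.sum((x[:, None] - x[None, :]) ** 2, -1) / 2.0)
alpha, beta = rng.normal(size=15), rng.normal(size=15)   # f = sum_j alpha_j k(., x_j); g likewise with beta

J1 = lambda a: a @ K @ a                                 # J_1(f) = ||f||_H^2 = alpha^T K alpha
eps = 1e-6
finite_diff = (J1(alpha + eps * beta) - J1(alpha)) / eps
inner = 2 * alpha @ K @ beta                             # <2f, g>_H, i.e. grad J_1(f) = 2f
print(finite_diff, inner)                                # agree up to O(eps)
```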
Remark about the regularity matter: from H to k

How to build H together with k? Let Γ_t, t ∈ X, be a family in L_2 and define the mapping
  S_t: L_2 → IR,  g ↦ S_t(g) = ∫ g(x) Γ_t(x) dx,   i.e.  f(t) = S_t(g) = ⟨g, Γ_t⟩_{L_2}.

The following RKHS can be constructed:
  H = Im(S),   ⟨f_1, f_2⟩_H = ⟨g_1, g_2⟩_{L_2},   k(s, t) = ⟨Γ_s, Γ_t⟩_{L_2}.

The reproducing property is verified:  ⟨f(·), k(·, t)⟩_H = ⟨g, Γ_t⟩_{L_2} = f(t).

Γ_x examples and associated kernels

                     Γ_x(u)                              K(x, y)
  Cameron-Martin     1I_{u ≤ x}                          min(x, y)
  polynomial         e_0(u) + ∑_{i=1}^d x_i e_i(u)       x^T y + 1
  Gaussian           (1/Z) exp(−(x − u)^2/2)             (1/Z') exp(−(x − y)^2/4)

Here {e_i}_{i=1,d} is a finite sub-sample of an orthonormal basis of L_2.

The Cameron-Martin case: ∀ f ∈ H, ∃ g ∈ L_2 such that
  f(t) = (Sg)(t) = ∫ Γ_t(u) g(u) du = ∫ 1I_{u ≤ t} g(u) du = G(t),
so S integrates (S = P^{−1}) while P is a generalized differential operator, and
  ||f||^2_H = ||Sg||^2_H = ||g||^2_{L_2} = ||f'||^2_{L_2}.

Other kernels (what really matters)

- finite kernels:  k(s, t) = (φ_1(s), ..., φ_p(s))^T (φ_1(t), ..., φ_p(t))
- Mercer kernels: positive on a compact set ⇔ k(s, t) = ∑_{j=1}^p λ_j φ_j(s) φ_j(t)
- positive (semi-definite) kernels
- conditionally positive kernels (with respect to some functions p_j):
  ∀ {x_i}_{i=1,n}, ∀ α_i such that ∑_i α_i p_j(x_i) = 0, j = 1, ..., p:   ∑_{i=1}^n ∑_{j=1}^n α_i α_j k(x_i, x_j) ≥ 0
- symmetric non-positive kernels, e.g. k(s, t) = tanh(s^T t + α_0)
- non-symmetric, non-positive kernels

The key property ∇J_t(f) = k(t, ·) is what really matters.     [C. Ong et al., ICML, 2004]
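The Mercer point of view can be illustrated numerically: an eigendecomposition of the Gram matrix gives an empirical feature map whose inner products reproduce K exactly on the sample. A hedged NumPy sketch (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, -1) / 2.0)   # Gaussian Gram matrix

lam, U = np.linalg.eigh(K)                      # K = U diag(lam) U^T, lam >= 0 since k is positive
Phi = U * np.sqrt(np.clip(lam, 0.0, None))      # row i is an empirical feature vector phi(x_i)
print(np.allclose(Phi @ Phi.T, K))              # K_ij = <phi(x_i), phi(x_j)>  -> True
```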
Let's summarize

- positive kernels ⇔ RKHS H ⇔ regularity ||f||^2_H
- the key property ∇J_t(f) = k(t, ·) holds, and not only for positive kernels
- f(x_i) exists (pointwise defined functions)
- universal consistency in RKHS
- the Gram matrix summarizes the pairwise comparisons

Roadmap (part 2, Tools: the functional framework): optimization, loss function and the regularization cost

Convex optimization in one slide

The problem: H is a Hilbert space (IR^n or a RKHS)

  (OP)   min_{f∈H} J(f)                       (objective)
         such that P_i(f) ≤ 0, i = 1, ..., p
         and R_j(f) = 0, j = 1, ..., q         (constraints)
  ⇔ min_{f∈D⊂H} J(f),  with J, P_i and R_j from H to IR

Questions:
- modelization (transformations)
- optimization theory (existence, uniqueness, characterization of the solution)
- algorithms (no analytical solution) and implementations

The importance of being convex: a convex OP has a convex objective J and convex constraints P_i, while the R_j are affine. Why convex? It is the solvable class: the solution exists, is unique and can be found efficiently.

[Boyd and Vandenberghe, 2004; Bonnans et al., 2006]

Examples of convex OP

- Linear programming (LP, standard form):
    min_{x∈IR^n} c^T x   s.t.  Ax = b  and  x ⪰ 0  (x_i ≥ 0, i = 1, ..., n)
  and its RKHS analogue: min_{f∈H} ⟨f, k⟩_H  s.t.  Tf = y, with T: H → IR^n linear.
  Linear objective and linear constraints.
- Quadratic programming (QP):
    min_{x∈IR^n} ½ x^T C x − d^T x   s.t.  Ax ⪯ b
  Quadratic objective and linear constraints.
- Second order cone programming (SOCP):
    min_{x∈IR^n} c^T x   s.t.  ||A_i x − b_i||_2 ≤ d_i^T x + e_i, i = 1, ..., p,  and  Fx = g
  Second order cone constraints.

Convex optimization in three slides

- Initial problem (primal):
    min_{f∈H} J(f)   s.t.  P_i(f) ≤ 0, i = 1, ..., p  and  R_j(f) = 0, j = 1, ..., q
- Lagrangian:
    L(f, α, β) = J(f) + ∑_{i=1}^p α_i P_i(f) + ∑_{j=1}^q β_j R_j(f),   with α_i ≥ 0
- Lagrange dual function:  Q(α, β) = min_{f∈H} L(f, α, β)
- with linear constraints Tf = y and p = 0:  Q(α) = −y^T α − J*(−T* α), where J* denotes the conjugate of J
- for any feasible f̂ (such that P_i(f̂) ≤ 0 and R_j(f̂) = 0):
    Q(α, β) = min_{f∈H} L(f, α, β) ≤ L(f̂, α, β) ≤ J(f̂)
- Lagrange dual problem:  max_{α,β} Q(α, β)  s.t.  α ⪰ 0

Convex optimization in four slides

Optimality conditions (Karush-Kuhn-Tucker, KKT): f*, α*, β* is an optimum ⇔
  ∇_f J(f*) + ∑_{i=1}^p α_i* ∇_f P_i(f*) + ∑_{j=1}^q β_j* ∇_f R_j(f*) = 0
  P_i(f*) ≤ 0,   R_j(f*) = 0,   α_i* ≥ 0,   α_i* P_i(f*) = 0

Duality gap:  J(f*) − Q(α*, β*) = 0.
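A hedged NumPy sketch (not from the slides): for a convex QP with only equality constraints the KKT conditions reduce to a linear system, which can be solved and checked directly; the problem data below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 6, 2
M = rng.normal(size=(n, n)); C = M @ M.T + np.eye(n)   # C positive definite: convex objective
d = rng.normal(size=n)
A = rng.normal(size=(q, n)); b = rng.normal(size=q)

# min 1/2 x^T C x - d^T x  s.t.  A x = b.   KKT: C x - d + A^T beta = 0 and A x = b.
kkt = np.block([[C, A.T], [A, np.zeros((q, q))]])
sol = np.linalg.solve(kkt, np.concatenate([d, b]))
x_star, beta_star = sol[:n], sol[n:]

print(np.allclose(C @ x_star - d + A.T @ beta_star, 0.0))   # stationarity
print(np.allclose(A @ x_star, b))                           # primal feasibility
```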
Optimization summary
1. (try to) transform your problem into a convex one (standard form)
2. more variables or more constraints: is the dual simpler?
3. solve it and check the KKT conditions

Loss function: the fitting cost

  ℓ: H × X × Y → IR^+,  (f, x, y) ↦ ℓ(y, f(x)),   with (x)_+ = max(x, 0)

              convex                                   non convex
  regular     logistic: log(1 + exp(−y f(x)))          sigmoid: 1 − tanh(y f(x))
              L2 (Gaussian): (f(x) − y)^2              Cauchy: log(1 + (f(x) − y)^2)
  singular    hinge: (1 − y f(x))_+                    0/1: |sign(f(x)) − y|
              L1 (Laplacian): |f(x) − y|               Lp square root: |f(x) − y|^{1/2}

[figure: fidelity to the data, plots of the loss functions]

Regularity criterion: regularization

Problem: H is too big. Let S_0 = { f ∈ H | ∑_{i=1}^n ℓ(f, x_i, y_i) = 0 }.
The problem is ill posed: the solution is not unique, so we have to choose one.

- f_I, the minimal norm solution:  min_{f∈S_0} ||f||^2
- regularized: build a regularization path, a sequence of problems whose solutions f_λ converge towards f_I

Three ways to do it:
1. penalization R(f):  min_{f∈H} ( ∑_{i=1}^n ℓ(f, x_i, y_i), R(f) )
2. subspaces H_1 ⊂ ... ⊂ H_k ⊂ ... ⊂ H:  min_{f∈H_k} ∑_{i=1}^n ℓ(f, x_i, y_i)
3. iterative approach: gradient (Landweber-Fridman) or conjugate gradient (Krylov subspace)

Penalization choice

With f(x) = ∑_{i=1}^n α_i k(x, x_i):
- convex, regular (L2):  ||f||^2_H = α^T K α
- convex, singular (L1):  ||f||_1 = ∑_{i=1}^n |α_i|
- non convex: normalized / SCAD:  ∑_j α_j^2 / (1 + α_j^2);  Lp with p < 1:  ||f||^p_p = ∑_{i=1}^n |α_i|^p  (no longer convex when p < 1)

[J. Weston et al., JMLR 2003; A. Ng, ICML 2004]

An attempt at classifying some kernel learning algorithms

                                penalty: singular (L1)     penalty: regular (L2)
  regression,     ℓ singular    K Dantzig selector          SVR
  regression,     ℓ regular     K LASSO, K LARS             splines
  classification, ℓ singular    LP SVM                      SVM
  classification, ℓ regular     K reg. log. L1              K-logistic reg., Lagrangian SVM

Table: SVM and SVR stand for support vector machine and support vector regression, LP for linear programming, LARS for least angle regression stagewise, and reg. log. for logistic regression. "K" denotes the kernelized version of the linear algorithm.

Things are changing: why ℓ1? "The Gaussian Hare and the Laplacian Tortoise: Computability of ℓ1 vs. ℓ2 Regression Estimators", Portnoy & Koenker, 1997.
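The "regular loss + L2 penalty" cell (splines / kernel ridge regression) has a closed form: with f(x) = ∑_i α_i k(x, x_i), minimizing ∑_i (f(x_i) − y_i)^2 + λ ||f||^2_H gives (K + λI) α = y (for invertible K). A hedged NumPy sketch (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=50))
y = np.sin(x) + 0.1 * rng.normal(size=50)

b, lam = 0.5, 1e-2
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / b)       # Gaussian Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)  # (K + lam I) alpha = y

def f(x_new):
    # f(x) = sum_i alpha_i k(x, x_i): all coefficients are non-zero (non sparse kernel machine)
    return np.exp(-(x_new[:, None] - x[None, :]) ** 2 / b) @ alpha

print(np.abs(alpha).min())   # typically > 0: the L2 penalty does not produce sparsity
```

The contrast with the ℓ1 penalty is illustrated at the end of the section.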
ℓ1 gives sparsity: even faster!

Definition (strongly homogeneous set of variables):  I_0 = { i ∈ {1, ..., d} | β_i = 0 }

Theorem (Nikolova, 2000). Consider the criterion S(β) + λT(β) and data y with I_0(y) ≠ ∅.
- Regular case: if the criterion is differentiable at 0, then ∀ ε > 0, ∃ y_0 ∈ B(y, ε) such that I_0(y_0) ≠ I_0(y).
- Singular case: if the criterion is NOT differentiable at 0, then ∃ ε > 0 such that ∀ y_0 ∈ B(y, ε), I_0(y_0) = I_0(y).

A criterion that is non-smooth at zero ⇒ sparsity.

Let's summarize

- hypothesis set: k and H
- data fidelity (loss): ℓ convex
- regularity: learning as a multicriterion optimization
- ℓ1 = sparsity + convexity = some efficiency
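To close, a hedged NumPy illustration (not from the slides) of the last point: on the same least-squares problem, the ℓ2 penalty has a smooth closed form and keeps every coefficient non-zero, while the ℓ1 penalty, non-smooth at zero, sets most coefficients exactly to zero. The ℓ1 problem is solved with plain proximal gradient (ISTA) and soft-thresholding; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.normal(size=(n, d))
beta_true = np.zeros(d); beta_true[:3] = (2.0, -1.5, 1.0)   # only 3 informative variables
y = X @ beta_true + 0.1 * rng.normal(size=n)
lam = 5.0

# L2 penalty (ridge): smooth at 0, closed form, no exact zeros
beta_l2 = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# L1 penalty (lasso): singular at 0, solved by proximal gradient (ISTA)
step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the smooth part
beta_l1 = np.zeros(d)
for _ in range(2000):
    z = beta_l1 - step * X.T @ (X @ beta_l1 - y)
    beta_l1 = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding

print("exact zeros, L2:", int((np.abs(beta_l2) < 1e-8).sum()))   # 0
print("exact zeros, L1:", int((np.abs(beta_l1) < 1e-8).sum()))   # typically most of the 20
```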