{"title": "Regularized Winnow Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 703, "page_last": 709, "abstract": null, "full_text": "Regularized Winnow Methods \n\nTong Zhang \n\nMathematical Sciences Department \nIBM TJ. Watson Research Center \n\nYorktown Heights, NY 10598 \n\ntzhang@watson.ibm.com \n\nAbstract \n\nIn theory, the Winnow multiplicative update has certain advantages over \nthe Perceptron additive update when there are many irrelevant attributes. \nRecently, there has been much effort on enhancing the Perceptron algo(cid:173)\nrithm by using regularization, leading to a class of linear classification \nmethods called support vector machines. Similarly, it is also possible to \napply the regularization idea to the Winnow algorithm, which gives meth(cid:173)\nods we call regularized Winnows. We show that the resulting methods \ncompare with the basic Winnows in a similar way that a support vector \nmachine compares with the Perceptron. We investigate algorithmic is(cid:173)\nsues and learning properties of the derived methods. Some experimental \nresults will also be provided to illustrate different methods. \n\n1 Introduction \n\nIn this paper, we consider the binary classification problem that is to determine a label \ny E {-1, 1} associated with an input vector x. A useful method for solving this problem is \nthrough linear discriminant functions, which consist of linear combinations of the compo(cid:173)\nnents of the input variable. Specifically, we seek a weight vector W and a threshold () such \nthat w T x < () if its label y = -1 and w T x 2: () if its label y = 1. Given a training set of \nlabeled data ( Xl, yl ), . .\n. , (x n , yn ), a number of approaches to finding linear discriminant \nfunctions have been advanced over the years. In this paper, we are especially interested in \nthe following two families of online algorithms: Perceptron [12] and Winnow [10]. 
These algorithms typically fix the threshold θ and update the weight vector w by going through the training data repeatedly. They are mistake driven in the sense that the weight vector is updated only when the algorithm fails to correctly classify an example.

For the Perceptron algorithm, the update rule is additive: if the linear discriminant function misclassifies an input training vector x^i with true label y^i, then we update each component j of the weight vector w as w_j ← w_j + η x^i_j y^i, where η > 0 is a parameter called the learning rate. The initial weight vector can be taken as w = 0.

For the (unnormalized) Winnow algorithm (with positive weights), the update rule is multiplicative: if the linear discriminant function misclassifies an input training vector x^i with true label y^i, then we update each component j of the weight vector w as w_j ← w_j exp(η x^i_j y^i), where η > 0 is the learning rate parameter, and the initial weight vector can be taken as w_j = μ_j > 0. The Winnow algorithm belongs to a general family of algorithms called exponentiated gradient descent with unnormalized weights (EGU) [9]. There can be several variants. One is called balanced Winnow, which is equivalent to an embedding of the input space into a higher dimensional space as x̃ = [x, -x]. This modification allows the positive-weight Winnow algorithm for the augmented input x̃ to have the effect of both positive and negative weights for the original input x. Another modification is to normalize the one-norm of the weight w so that Σ_j w_j = W, leading to the normalized Winnow.

Theoretical properties of multiplicative update algorithms have been extensively studied since the introduction of Winnow.
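As an illustrative sketch of the two mistake-driven updates just described (the function names, parameter defaults, and test data here are ours, not from the paper), one pass over the training data can be written as:

```python
import numpy as np

def perceptron_epoch(X, y, w, eta=0.1, theta=0.0):
    """One pass of the mistake-driven Perceptron additive update."""
    for xi, yi in zip(X, y):
        if yi * (w @ xi - theta) <= 0:      # mistake on (xi, yi)
            w = w + eta * yi * xi           # additive update
    return w

def winnow_epoch(X, y, w, eta=0.1, theta=0.0):
    """One pass of the (unnormalized, positive-weight) Winnow
    multiplicative update; w must start positive, e.g. w_j = mu_j > 0."""
    for xi, yi in zip(X, y):
        if yi * (w @ xi - theta) <= 0:      # mistake on (xi, yi)
            w = w * np.exp(eta * yi * xi)   # multiplicative update
    return w
```

Note that Winnow keeps all weights positive, so for data needing negative weights one would use the balanced embedding x̃ = [x, -x] described above.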
For linearly separable binary classification problems, both Perceptron and Winnow are able to find a weight vector that separates the in-class vectors from the out-of-class vectors in the training set within a finite number of steps. However, the number of mistakes (updates) before finding a separating hyperplane can be very different [10, 9]. This difference suggests that the two algorithms serve different purposes.

For linearly separable problems, Vapnik proposed a method that optimizes the Perceptron mistake bound, which he calls the "optimal hyperplane" (see [15]). The same method has also appeared in the statistical mechanical learning literature (see [1, 8, 11]), where it is referred to as achieving optimal stability. For non-separable problems, a generalization of the optimal hyperplane was proposed in [2] by introducing a "soft-margin" loss term. In this paper, we derive regularized Winnow methods by constructing "optimal hyperplanes" that minimize the Winnow mistake bound (rather than the Perceptron mistake bound as in an SVM). We then derive a "soft-margin" version of the algorithms for non-separable problems.

For simplicity, we shall assume θ = 0 in this paper. The restriction does not cause problems in practice since one can always append a constant feature to the input data x, which offsets the effect of θ. The formulation with θ = 0 can be more amenable to theoretical analysis. For an SVM, a fixed threshold also allows a simple Perceptron-like numerical algorithm, as described in chapter 12 of [13] and in [7]. Although more complex, a non-fixed θ does not introduce any fundamental difficulty.

The paper is organized as follows. In Section 2, we review mistake bounds for Perceptron and Winnow. Based on the bounds, we show how regularized Winnow methods can be derived by mimicking the optimal stability method (and SVM) for Perceptron.
We also discuss the relationship of the newly derived methods with related methods. In Section 3, we investigate learning aspects of the newly proposed methods in a context similar to some known SVM results. An example is given in Section 4 to illustrate these methods.

2 SVM and regularized Winnow

2.1 From Perceptron to SVM

We review the derivation of SVM from Perceptron, which serves as a reference for our derivation of regularized Winnow. Consider linearly separable problems and let ŵ be a weight vector that separates the in-class vectors from the out-of-class vectors in the training set. It is well known that the Perceptron algorithm computes a weight vector that correctly classifies all training data after at most M updates (a proof can be found in [15]), where M = ||ŵ||_2^2 max_i ||x^i||_2^2 / (min_i ŵ^T x^i y^i)^2. The weight vector w* that minimizes the right-hand side of the bound is called the optimal hyperplane in [15], or the optimal stability hyperplane in [1, 8, 11]. This optimal hyperplane is the solution to the following quadratic programming problem:

    min_w (1/2) w^T w    s.t. w^T x^i y^i ≥ 1 for i = 1, ..., n.

For non-separable problems, we introduce a slack variable ξ_i for each data point (x^i, y^i) (i = 1, ..., n), and compute a weight vector w*(C) that solves

    min_{w,ξ} (1/2) w^T w + C Σ_i ξ_i    s.t. w^T x^i y^i ≥ 1 - ξ_i, ξ_i ≥ 0 for i = 1, ..., n,

where C > 0 is a given parameter [15]. It is known that as C → ∞, ξ_i → 0 and w*(C) converges to the weight vector w* of the optimal hyperplane. We can write down the KKT conditions for the above optimization problem, and let α_i be the Lagrangian multiplier for w^T x^i y^i ≥ 1 - ξ_i. After elimination of w and ξ, we obtain the following dual optimization problem in the dual variable α (see [15], chapter 10 for details):

    max_α Σ_i α_i - (1/2) (Σ_i α_i x^i y^i)^2    s.t. α_i ∈ [0, C] for i = 1, ..., n.

The weight w*(C) is given by w*(C) = Σ_i α_i x^i y^i at the optimal solution.
To solve this problem, one can use the following modification of the Perceptron update algorithm (see [7] and chapter 12 of [13]): at each data point (x^i, y^i), we fix all α_k with k ≠ i, and update α_i to maximize the dual objective functional, which gives:

    α_i ← max(min(C, α_i + η (1 - w^T x^i y^i)), 0),

where w = Σ_i α_i x^i y^i. The learning rate η can be set as η = 1/(x^{iT} x^i), which corresponds to the exact maximization of the dual objective functional.

2.2 From Winnow to regularized Winnow

Similar to Perceptron, if a problem is linearly separable with a positive weight vector ŵ, then Winnow computes a solution that correctly classifies all training data after at most M updates, with M = 2W (Σ_j ŵ_j ln(ŵ_j ||μ||_1 / (μ_j ||ŵ||_1))) max_i ||x^i||_∞^2 / δ^2, where 0 < δ ≤ min_i ŵ^T x^i y^i, W ≥ ||ŵ||_1, and the learning rate is η = δ / (W max_i ||x^i||_∞^2). The proof of this specific bound can be found in [16], which employed techniques in [5] (also see [10] for earlier results). Note that unlike the Perceptron mistake bound, the above bound is learning rate dependent. It also depends on the prior μ_j > 0, which is the initial value of w in the basic Winnows.

For problems separable with positive weights, to obtain an optimal stability hyperplane associated with the Winnow mistake bound, we consider fixing ||w||_1 such that ||w||_1 = W > 0. It is then natural to define the optimal hyperplane as the (positive weight) solution to the following convex programming problem:

    min_w Σ_j w_j ln(w_j / (e μ_j))    s.t. w^T x^i y^i ≥ 1 for i = 1, ..., n.

We use e to denote the base of the natural logarithm. Similar to the derivation of SVM, for non-separable problems, we introduce a slack variable ξ_i for each data point (x^i, y^i), and compute a weight vector w*(C) that solves

    min_{w,ξ} Σ_j w_j ln(w_j / (e μ_j)) + C Σ_i ξ_i    s.t. w^T x^i y^i ≥ 1 - ξ_i, ξ_i ≥ 0 for i = 1, ..., n,

where C > 0 is a given parameter.
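As a concrete sketch of the dual coordinate update for the SVM formulation of Section 2.1 (the function name and test data are ours; the update and the exact learning rate η = 1/(x^{iT} x^i) are as described above):

```python
import numpy as np

def svm_dual_ascent(X, y, C, epochs=100):
    """Perceptron-like coordinate ascent on the SVM dual (threshold fixed at 0):
    alpha_i <- clip(alpha_i + eta * (1 - y_i w^T x_i), 0, C), eta = 1 / (x_i^T x_i),
    maintaining w = sum_i alpha_i y_i x_i incrementally."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            eta = 1.0 / (X[i] @ X[i])
            new_ai = np.clip(alpha[i] + eta * (1 - y[i] * (w @ X[i])), 0.0, C)
            w += (new_ai - alpha[i]) * y[i] * X[i]   # keep w consistent with alpha
            alpha[i] = new_ai
    return w, alpha
```

For separable data and large C, the returned hyperplane attains margin close to 1 on every training point, as the KKT conditions require.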
Note that to derive the above methods, we have assumed that ||w||_1 is fixed at ||w||_1 = ||μ||_1 = W, where W is a given parameter. This implies that the derived methods are in fact regularized versions of the normalized Winnow. One can also ignore this normalization constraint, so that the derived methods correspond to regularized versions of the unnormalized Winnow. The entropy regularization condition is natural to all exponentiated gradient methods [9], as can be observed from the theoretical results in [9]. The regularized normalized Winnow is closely related to maximum entropy discrimination [6] (the two methods are almost identical for linearly separable problems). However, in the framework of maximum entropy discrimination, the Winnow connection is non-obvious. As we shall show later, it is possible to derive interesting learning bounds for our methods that are connected with the Winnow mistake bound.

Similar to the SVM formulation, the non-separable formulation of regularized Winnow approaches the separable formulation as C → ∞. We shall thus only focus on the non-separable case below. Also similar to an SVM, we can write down the KKT conditions and let α_i be the Lagrangian multiplier for w^T x^i y^i ≥ 1 - ξ_i. After elimination of w and ξ, we obtain (the algebra resembles that of [15], chapter 10, which we skip due to the limitation of space) the following dual formulation for regularized unnormalized Winnow:

    max_α Σ_i α_i - Σ_j μ_j exp(Σ_i α_i x^i_j y^i)    s.t. α_i ∈ [0, C] for i = 1, ..., n.

The j-th component of the weight w*(C) is given by w*(C)_j = μ_j exp(Σ_i α_i x^i_j y^i) at the optimal solution. For regularized normalized Winnow with ||w||_1 = W > 0, we obtain

    max_α Σ_i α_i - W ln(Σ_j μ_j exp(Σ_i α_i x^i_j y^i))    s.t. α_i ∈ [0, C] for i = 1, ..., n.
\n\ni \n\nj \n\ni \n\nThe weight w*(C) is given by w*(C)j = Wf-Lj exp(2:i o:ix~yi)/ 2:j f-Lj exp(2:i o:ix~yi) \nat the optimal solution. \n\nSimilar to the Perceptron-like update rule for the dual SVM formulation, it is possible to \nderive Winnow-like update rules for the regularized Winnow formulations. At each data \npoint (xi, yi), we fix all O:k with k -# i, and update O:i to maximize the dual objective \nascent method with a learning rate TJ: O:i -+ O:i + TJ a:, LD (O:i), where we use LD to denote \nfunctionals. We shall not try to derive an analytical solution, but rather use a gradient \n\nthe dual objective function to be maximized. TJ can be either fixed as a small number or \ncomputed by the Newton's method. It is not hard to verify that we obtain the following \nupdate rule for regularized unnormalized Winnow: \n\no:i -+ max(min(C, o:i + TJ(1 - wT xiyi)), 0), \n\nwhere Wj = f-Lj exp(2:i o:ixjyi). This gradient ascent on the dual variable gives an EGU \nrule as in [9]. Compared WIth the SVM dual update rule which is a soft-margin version \nof the Perceptron update rule, this method naturally corresponds to a soft-margin version \nof unnormalized Winnow update. Similarly, we obtain the following dual update rule for \nregularized normalized Winnow: \n\no:i -+ max(min(C, o:i + TJ(1 - w T xiyi)), 0), \n\nwhere Wj = Wf-Lj exp(2:i o:ix~yi)/ 2:j f-Lj exp(2:i o:ix~yi). Again, this rule (which is \nan EG rule in [9]) can be naturally regarded as the soft-margin version of the normalized \nWinnow update. In our experience, these update rules are numerically very efficient. Note \nthat for regularized normalized Winnow, the normalization constant W needs to be care(cid:173)\nfully chosen based on the data. For example, if data is infinity-norm bounded by 1, then it \ndoes not seem to be appropriate if we choose W ::; 1 since IwT x I ::; 1: a hyperplane with \nIIWllt ::; 1 does not achieve reasonable margin. 
This problem is less crucial for unnormalized Winnow, but the norm of the initial weight μ_j still affects the solution.

Besides maximum entropy discrimination, which is closely related to regularized normalized Winnow, a large margin version of unnormalized Winnow has also been proposed based on some heuristics [3, 4]. However, their algorithm was purely mistake driven without dual variables α_i (the algorithm does not compute an optimal stability hyperplane for the Winnow mistake bound). In addition, they did not include a regularization parameter C, which in practice may be important for non-separable problems.

3 Some statistical properties of regularized Winnows

In this section, we derive some learning bounds based on our formulations that minimize the Winnow mistake bound. The following result is an analogue of a leave-one-out cross-validation bound for separable SVMs (Theorem 10.7 in [15]).

Theorem 3.1 The expected misclassification error err_n with the true distribution, by using the hyperplane w obtained from the linearly separable (C = ∞) unnormalized regularized Winnow algorithm with n training samples, is bounded by err_n < (1/(n+1)) E min(K, 1.5 W (Σ_j w_j ln(w_j / μ_j)) max_i ||x^i||_∞^2), where the right-hand-side expectation is taken with n+1 random samples (x^1, y^1), ..., (x^{n+1}, y^{n+1}). K is the number of support vectors of the solution. Let w be the optimal solution using all the samples, with dual α_i for i = 1, ..., n+1. Let w^k be the weight obtained from setting α_k = 0; then W = max(||w||_1, ||w^1||_1, ..., ||w^{n+1}||_1).

Proof Sketch. We only describe the major steps due to the limitation of space. Denote by ŵ^k the weight obtained from the optimal solution by removing (x^k, y^k) from the training sample. Similar to the proof of Theorem 10.7 in [15], we need to bound the leave-one-out cross-validation error, which is at most K.
Also note that the leave-one-out cross-validation error is at most |{k : ||ŵ^k - w||_1 ||x^k||_∞ ≥ 1}|. We then use the following two inequalities: ||ŵ^k - w||_1^2 ≤ 2W (Σ_j ŵ^k_j - w_j - w_j ln(ŵ^k_j / w_j)), and Σ_j ŵ^k_j - w_j - w_j ln(ŵ^k_j / w_j) ≤ Σ_j w^k_j - w_j - w_j ln(w^k_j / w_j); the latter inequality can be obtained by comparing the dual objective functionals and by using the corresponding KKT conditions of the dual problem. The remaining problem is now reduced to proving that |{k : Σ_j w^k_j - w_j - w_j ln(w^k_j / w_j) ≥ 1/(2W ||x^k||_∞^2)}| ≤ (3/2) W max_k ||x^k||_∞^2 Σ_j w_j ln(w_j / μ_j). For the dual formulation, by summing over the index k of the KKT first-order condition with respect to the dual α_k, multiplied by α_k, one obtains Σ_k α_k = Σ_j w_j ln(w_j / μ_j). We thus only need to show that if Σ_j w^k_j - w_j - w_j ln(w^k_j / w_j) ≥ 1/(2W ||x^k||_∞^2), then α_k ≥ 2/(3W ||x^k||_∞^2). This can be checked directly through a Taylor expansion. □

By using the same technique, we may also obtain a bound for regularized normalized Winnow. One disadvantage of the above bound is that it is the expectation of a random estimator that is no better than the leave-one-out cross-validation error based on observed data. However, the bound does convey some useful information: for example, we can observe that the expected misclassification error (learning curve) converges at a rate of O(1/n) as long as W (Σ_j w_j ln(w_j / μ_j)) and sup ||x||_∞ are reasonably bounded.

It is also not difficult to obtain interesting PAC style bounds by using the covering number result for entropy regularization in [16] and ideas in [14]. Although the PAC analysis would imply a slightly suboptimal learning curve of O(log n / n) for linearly separable problems, the bound itself provides a probability confidence and can be generalized to non-separable problems. We state below an example for non-separable problems, which justifies the entropy regularization.
The bound itself is a direct consequence of Theorem 2.2 and a covering number result with entropy regularization in [16]. Note that as in [14], the square root can be removed if k_γ = 0; γ can also be made data-dependent.

Theorem 3.2 If the data is infinity-norm bounded as ||x||_∞ ≤ b, then consider the family Γ of hyperplanes w such that ||w||_1 ≤ a and Σ_j w_j ln(w_j ||μ||_1 / (μ_j ||w||_1)) ≤ c. Denote by err(w) the misclassification error of w with the true distribution. Then there is a constant C such that for any γ > 0, with probability 1 - η over n random samples, any w ∈ Γ satisfies:

    err(w) < k_γ/n + sqrt((C/n) ((b^2 (a^2 + ac)/γ^2) ln(nab/γ + 2) + ln(1/η))),

where k_γ = |{i : w^T x^i y^i < γ}| is the number of samples with margin less than γ.

4 An example

We use an artificial dataset to show that a regularized Winnow can enhance a Winnow just as an SVM can enhance a Perceptron. In addition, it shows that for problems with many irrelevant features, the Winnow algorithms are superior to the Perceptron family algorithms.

The data in this experiment are generated as follows. We select an input data dimension d, with d = 500 or d = 5000. The first 5 components of the target linear weight w are set to ones; the 6th component is -1; and the remaining components are zeros. The linear threshold θ is 2. Data are generated as random vectors with each component randomly chosen to be either 0 or 1 with probability 0.5 each. Five percent of the data are given wrong labels. The remaining data are given correct labels, but we remove data with margins less than 1. One thousand training and one thousand test data are generated.

We shall only consider balanced versions of the Winnows. We also compensate for the effect of θ by appending a constant 1 to each data point, as mentioned earlier. We use UWin and NWin to denote the basic unnormalized and normalized Winnows, respectively.
LM-UWin and LM-NWin denote the corresponding large margin versions. The SVM-style large margin Perceptron is denoted as LM-Perc. We use 200 iterations over the training data for all algorithms. The initial values for the Winnows are set to be the priors: μ_j = 0.01. For online algorithms, we fix the learning rates at 0.01. For large margin Winnows, we use learning rates η = 0.01 in the gradient ascent update. For the (2-norm regularized) large margin Perceptron, we use the exact update, which corresponds to the choice η = 1/(x^{iT} x^i).

Accuracies (in percent) of the different methods are listed in Table 1. For regularization methods, accuracies are reported with the optimal regularization parameters. The superiority of the regularized Winnows is obvious, especially for high dimensional data. Accuracies of the regularized algorithms with different regularization parameters are plotted in Figure 1. These behaviors are very typical for regularized algorithms. In practice, the optimal regularization parameter can be found by cross-validation.

    dimension   Perceptron   LM-NWin
    500         82.2         94.3
    5000        67.9         88.6

Table 1: Test set accuracy (in percent) on the artificial dataset

Figure 1: Test set accuracy (in percent) as a function of the regularization parameter (left: d = 500, right: d = 5000)

5 Conclusion

In this paper, we derived regularized versions of the Winnow online update algorithms. We studied algorithmic and theoretical properties of the newly obtained algorithms, and compared them to the Perceptron family algorithms.
Experimental results indicated that for problems with many irrelevant features, the Winnow family algorithms are superior to the Perceptron family algorithms. This is consistent with the implications from both online learning theory and the learning bounds obtained in this paper.

References

[1] J.K. Anlauf and M. Biehl. The AdaTron: an adaptive perceptron algorithm. Europhys. Lett., 10(7):687-692, 1989.

[2] C. Cortes and V.N. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[3] I. Dagan, Y. Karov, and D. Roth. Mistake-driven learning in text categorization. In Proceedings of the Second Conference on Empirical Methods in NLP, 1997.

[4] A. Grove and D. Roth. Linear concepts and hidden variables. Machine Learning, 2000. To appear; early version appeared in NIPS-10.

[5] A.J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. In Proc. 10th Annu. Conf. on Comput. Learning Theory, pages 171-183, 1997.

[6] Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum entropy discrimination. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 470-476. MIT Press, 2000.

[7] T.S. Jaakkola, Mark Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, to appear.

[8] W. Kinzel. Statistical mechanics of the perceptron with maximal stability. In Lecture Notes in Physics, volume 368, pages 175-188. Springer-Verlag, 1990.

[9] J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation, 132:1-64, 1997.

[10] N. Littlestone. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2:285-318, 1988.

[11] M. Opper.
Learning times of neural networks: Exact solution for a perceptron algorithm. Phys. Rev. A, 38(7):3824-3826, 1988.

[12] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan, New York, 1962.

[13] Bernhard Schölkopf, Christopher J.C. Burges, and Alexander J. Smola, editors. Advances in Kernel Methods: Support Vector Learning. The MIT Press, 1999.

[14] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory, 44(5):1926-1940, 1998.

[15] V.N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.

[16] Tong Zhang. Analysis of regularized linear functions for classification problems. Technical Report RC-21572, IBM, 1999. Abstract in NIPS'99, pp. 370-376.
", "award": [], "sourceid": 1833, "authors": [{"given_name": "Tong", "family_name": "Zhang", "institution": null}]}