Hostname: page-component-586b7cd67f-r5fsc Total loading time: 0 Render date: 2024-11-30T15:19:56.109Z Has data issue: false hasContentIssue false

Binary separation and training support vector machines*

Published online by Cambridge University Press:  10 May 2010

Roger Fletcher
Affiliation:
Department of Mathematics, University of Dundee, Dundee DD1 4HN, UK, E-mail: [email protected]
Gaetano Zanghirati
Affiliation:
Department of Mathematics and Math4Tech Center, University of Ferrara, 44100 Ferrara, Italy, E-mail: [email protected]

Extract

We introduce basic ideas of binary separation by a linear hyperplane, which is a technique exploited in the support vector machine (SVM) concept. This is a decision-making tool for pattern recognition and related problems. We describe a fundamental standard problem (SP) and show how this is used in most existing research to develop a dual-based algorithm for its solution. This algorithm is shown to be deficient in certain aspects, and we develop a new primal-based SQP-like algorithm, which has some interesting features. Most practical SVM problems are not adequately handled by a linear hyperplane. We describe the nonlinear SVM technique, which enables a nonlinear separating surface to be computed, and we propose a new primal algorithm based on the use of low-rank Cholesky factors.

It may be, however, that exact separation is not desirable due to the presence of uncertain or mislabelled data. Dealing with this situation is the main challenge in developing suitable algorithms. Existing dual-based algorithms use the idea of L1 penalties, which has merit. We suggest how penalties can be incorporated into a primal-based algorithm. Another aspect of practical SVM problems is often the huge size of the data set, which poses severe challenges both for software package development and for control of ill-conditioning. We illustrate some of these issues with numerical experiments on a range of problems.

Type
Research Article
Copyright
Copyright © Cambridge University Press 2010

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

REFERENCES

Agarwal, S. and Niyogi, P. (2009), ‘Generalization bounds for ranking algorithms via algorithmic stability’, J. Mach. Learn. Res. 10, 441474.Google Scholar
Bennett, K. P. and Parrado-Hernández, E. (2006), ‘The interplay of optimization and machine learning research’, J. Mach. Learn. Res. 7, 12651281.Google Scholar
Bordes, A., Ertekin, S., Weston, J. and Bottou, L. (2005), ‘Fast kernel classifiers with online and active learning’, J. Mach. Learn. Res. 6, 15791619.Google Scholar
Boser, B. E., Guyon, I. and Vapnik, V. N. (1992), A training algorithm for optimal margin classifiers. In Proc. 5th Annual ACM Workshop on Computational Learning Theory (Haussler, D., ed.), ACM Press, Pittsburgh, pp. 144152.Google Scholar
Bottou, L. and Lin, C.-J. (2007), Support vector machine solvers. In Large Scale Kernel Machines (Bottou, L., Chapelle, O., DeCoste, D. and Weston, J., eds), The MIT Press, pp. 301320.CrossRefGoogle Scholar
Burges, C. J. and Schölkopf, B. (1997), Improving the accuracy and speed of support vector machines. In Advances in Neural Information Processing Systems, Vol. 9, The MIT Press, pp. 375381.Google Scholar
Burges, C. J. C. (1998), ‘A tutorial on support vector machines for pattern recognition’, Data Min. Knowl. Discovery 2, 121167.Google Scholar
Caponnetto, A. and Rosasco, L. (2004), Non standard support vector machines and regularization networks. Technical report DISI-TR-04–03, University of Genoa, Italy.Google Scholar
Catanzaro, B., Sundaram, N. and Keutzer, K. (2008), Fast support vector machine training and classification on graphics processors. In Proc. 25th International Conference on Machine Learning, Helsinki, Finland, pp. 104111.Google Scholar
Chang, C.-C. and Lin, C.-J. (2001), LIBSVM: A library for support vector machines. www.csie.ntu.edu.tw/~cjlin/libsvmGoogle Scholar
Chang, C.-C., Hsu, C.-W. and Lin, C.-J. (2000), ‘The analysis of decomposition methods for support vector machines’, IEEE Trans. Neural Networks 11, 10031008.CrossRefGoogle ScholarPubMed
Chang, E., Zhu, K., Wang, H., Bai, H., Li, J., Qiu, Z. and Cui, H. (2008), Parallelizing support vector machines on distributed computers. In Advances in Neural Information Processing Systems, Vol. 20, The MIT Press, pp. 257264.Google Scholar
Chapelle, O. (2007), ‘Training a support vector machine in the primal’, Neural Comput. 19, 11551178.CrossRefGoogle ScholarPubMed
Chen, P.-H., Fan, R.-E. and Lin, C.-J. (2006), ‘A study on SMO-type decomposition methods for support vector machines’, IEEE Trans. Neural Networks 17, 893908.Google Scholar
Collobert, R. and Bengio, S. (2001), ‘SVMTorch: Support vector machines for large-scale regression problems’, J. Mach. Learn. Res. 1, 143160.Google Scholar
Cristianini, N. and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines and other Kernel-Based Learning Methods, Cambridge University Press.Google Scholar
Cucker, F. and Smale, S. (2001), ‘On the mathematical foundations of learning’, Bull. Amer. Math. Soc. 39, 149.CrossRefGoogle Scholar
Cucker, F. and Smale, S. (2002), ‘Best choices for regularization parameter in learning theory: On the bias-variance problem’, Found. Comput. Math. 2, 413428.CrossRefGoogle Scholar
Cucker, F. and Zhou, D. X. (2007), Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press.Google Scholar
De Vito, E., Rosasco, L., Caponnetto, A., De Giovannini, U. and Odone, F. (2005), ‘Learning from examples as an inverse problem’, J. Mach. Learn. Res. 6, 883904.Google Scholar
De Vito, E., Rosasco, L., Caponnetto, A., Piana, M. and Verri, A. (2004), ‘Some properties of regularized kernel methods’, J. Mach. Learn. Res. 5, 13631390.Google Scholar
Dong, J.-X., Krzyzak, A. and Suen, C. Y. (2003), A fast parallel optimization for training support vector machine. In Proc. 3rd International Conference on Machine Learning and Data Mining (Perner, P. and Rosenfeld, A., eds), Vol. 2734 of Lecture Notes in Artificial Intelligence, Springer, pp. 96105.Google Scholar
Dong, J.-X., Krzyzak, A. and Suen, C. Y. (2005), ‘Fast SVM training algorithm with decomposition on very large data sets’, IEEE Trans. Pattern Anal. Mach. Intelligence 27, 603618.Google Scholar
Drineas, P. and Mahoney, M. W. (2005), ‘On the Nyström method for approximating a Gram matrix for improved kernel-based learning’, J. Mach. Learn. Res. 6, 21532175.Google Scholar
Durdanovic, I., Cosatto, E. and Graf, H.-P. (2007), Large-scale parallel SVM implementation. In Large Scale Kernel Machines (Bottou, L., Chapelle, O., DeCoste, D. and Weston, J., eds), The MIT Press, pp. 105138.Google Scholar
Evgeniou, T., Pontil, M. and Poggio, T. (2000), ‘Regularization networks and support vector machines’, Adv. Comput. Math. 13, 150.Google Scholar
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R. and Lin, C.-J. (2008), ‘LIB-LINEAR: A library for large linear classification’, J. Mach. Learn. Res. 9, 18711874.Google Scholar
Ferris, M. C. and Munson, T. S. (2002), ‘Interior-point methods for massive support vector machines’, SIAM J. Optim. 13, 783804.Google Scholar
Fine, S. and Scheinberg, K. (2001), ‘Efficient SVM training using low-rank kernel representations’, J. Mach. Learn. Res. 2, 243264.Google Scholar
Fine, S. and Scheinberg, K. (2002), INCAS: An incremental active set method for SVM. Technical report, IBM Research Labs, Haifa, Israel.Google Scholar
Fletcher, R. (1987), Practical Methods of Optimization, 2nd edn, Wiley, Chichester.Google Scholar
Fletcher, R. (19962007), BQPD: Linear and quadratic programming solver. www-new.mcs.anl.gov/otc/Guide/SoftwareGuide/Blurbs/bqpd.htmlGoogle Scholar
Franc, V. and Sonnenburg, S. (2008 a), LIBOCAS: Library implementing OCAS solver for training linear SVM classifiers from large-scale data. cmp.felk.cvut.cz/~xfrancv/ocas/htmlGoogle Scholar
Franc, V. and Sonnenburg, S. (2008 b), Optimized cutting plane algorithm for support vector machines. In Proc. 25th International Conference on Machine Learning, Helsinki, Finland, Vol. 307, ACM Press, New York, pp. 320327.Google Scholar
Gertz, E. M. and Griffin, J. D. (2005), Support vector machine classifiers for large data sets. Technical report, Mathematics and Computer Science Division, Argonne National Laboratory, USA.CrossRefGoogle Scholar
Goldfarb, D. and Scheinberg, K. (2004), ‘A product-form cholesky factorization method for handling dense columns in interior point methods for linear programming’, Math. Program. Ser. A 99, 134.CrossRefGoogle Scholar
Graf, H. P., Cosatto, E., Bottou, L., Dourdanovic, I. and Vapnik, V. N. (2005), Parallel support vector machines: The Cascade SVM. In Advances in Neural Information Processing Systems (Saul, L., Weiss, Y. and Bottou, L., eds), Vol. 17, The MIT Press, pp. 521528.Google Scholar
Groenen, P. J. F., Nalbantov, G. and Bioch, J. C. (2007), Nonlinear support vector machines through iterative majorization and I-splines. In Advances in Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization, Springer, pp. 149161.CrossRefGoogle Scholar
Groenen, P. J. F., Nalbantov, G. and Bioch, J. C. (2008), ‘SVM-Maj: A majorization approach to linear support vector machines with different hinge errors’, Adv. Data Analysis and Classification 2, 1743.CrossRefGoogle Scholar
Hastie, T., Tibshirani, R. and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer.Google Scholar
Herbrich, R. (2002), Learning Kernel Classifiers. Theory and Algorithms, The MIT Press.Google Scholar
Hush, D. and Scovel, C. (2003), ‘Polynomial-time decomposition algorithms for support vector machines’, Machine Learning 51, 5171.Google Scholar
Hush, D., Kelly, P., Scovel, C. and Steinwart, I. (2006), ‘QP algorithms with guaranteed accuracy and run time for support vector machines’, J. Mach. Learn. Res. 7, 733769.Google Scholar
Joachims, T. (1999), Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning (Schölkopf, B., Burges, C. J. C. and Smola, A., eds), The MIT Press, pp. 169184.Google Scholar
Joachims, T. (2006), Training linear SVMs in linear time. In Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, ACM Press, New York, pp. 217226.Google Scholar
Keerthi, S. S. and DeCoste, D. M. (2005), ‘A modified finite Newton method for fast solution of large-scale linear SVMs’, J. Mach. Learn. Res. 6, 341361.Google Scholar
Keerthi, S. S. and Gilbert, E. G. (2002), ‘Convergence of a generalized SMO algorithm for SVM classifier design’, Machine Learning 46, 351360.Google Scholar
Keerthi, S. S., Chapelle, O. and DeCoste, D. M. (2006), ‘Building support vector machines with reduced classifier complexity’, J. Mach. Learn. Res. 7, 14931515.Google Scholar
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C. and Murthy, K. R.K. (2001), ‘Improvements to Platt‘s SMO algorithm for SVM classifier design’, Neural Comput. 13, 637649.Google Scholar
Kulis, B., Sustik, M. and Dhillon, I. (2006), Learning low-rank kernel matrices. In Proc. 23rd International Conference on Machine Learning: ICML, pp. 505512.Google Scholar
LeCun, Y. and Cortes, C. (1998), The MNIST database of handwritten digits. www.research.att.com/~yann/ocr/mnistGoogle Scholar
LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998), ‘Gradient-based learning applied to document recognition’, 86, 22782324.Google Scholar
Lee, Y.-J. and Mangasarian, O. L. (2001 a), RSVM: Reduced support vector machines. In Proc. 1st SIAM International Conference on Data Mining, Chicago, April 5–7, 2001, SIAM, Philadelphia, pp. 116.Google Scholar
Lee, Y.-J. and Mangasarian, O. L. (2001 b), ‘SSVM: A smooth support vector machine for classification’, Comput. Optim. Appl. 20, 522.Google Scholar
Lin, C.-J. (2001 a), Linear convergence of a decomposition method for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.Google Scholar
Lin, C.-J. (2001 b), ‘On the convergence of the decomposition method for support vector machines’, IEEE Trans. Neural Networks 12, 12881298.Google ScholarPubMed
Lin, C.-J. (2002), ‘Asymptotic convergence of an SMO algorithm without any assumptions’, IEEE Trans. Neural Networks 13, 248250.Google ScholarPubMed
Mangasarian, O. L. (2000), Generalized support vector machines. In Advances in Large Margin Classifiers, The MIT Press, pp. 135146.Google Scholar
Mangasarian, O. L. (2002), ‘A finite Newton method for classification’, Optim. Methods Software 17, 913939.Google Scholar
Mangasarian, O. L. (2006), ‘Exact 1-norm support vector machines via unconstrained convex differentiable minimization’, J. Mach. Learn. Res. 7, 15171530.Google Scholar
Mangasarian, O. L. and Musicant, D. R. (2001), ‘Lagrangian support vector machines’, J. Mach. Learn. Res. 1, 161177.Google Scholar
Osuna, E., Freund, R. and Girosi, F. (1997), Training support vector machines: An application to face detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition: CVPR97, IEEE Computer Society, New York, pp. 130136.CrossRefGoogle Scholar
Platt, J. C. (1998), Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning (Schölkopf, B., Burges, C. and Smola, A., eds), The MIT Press, pp. 185210.Google Scholar
Platt, J. C. (1999), Using analytic QP and sparseness to speed training of support vector machines. In Advances in Neural Information Processing Systems (Kearns, M.et al., eds), Vol. 11, The MIT Press, pp. 557563.Google Scholar
Prato, M., Zanni, L. and Zanghirati, G. (2007) ‘On recent machine learning algorithms for brain activity interpretation’, Applied Computational Electromagnetics Society Journal 22, 19391946.Google Scholar
Scheinberg, K. (2006), ‘An efficient implementation of an active set method for SVMs’, J. Mach. Learn. Res. 7, 22372257.Google Scholar
Schölkopf, B. and Smola, A. J. (2002), Learning with Kernels, The MIT Press.Google Scholar
Serafini, T. and Zanni, L. (2005), ‘On the working set selection in gradient projection-based decomposition techniques for support vector machines’, Optim. Methods Software 20, 583596.Google Scholar
Serafini, T., Zanghirati, G. and Zanni, L. (2005), ‘Gradient projection methods for quadratic programs and applications in training support vector machines’, Optim. Methods Software 20, 353378.Google Scholar
Shawe-Taylor, J. and Cristianini, N. (2004), Kernel Methods for Pattern Analysis, Cambridge University Press.Google Scholar
Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B. and Vandewalle, J. (2002), Least Squares Support Vector Machines, World Scientific, Singapore.Google Scholar
Teo, C. H., Le, Q. V., Smola, A. and Vishwanathan, S. (2009), BMRM: Bundle methods for regularized risk minimization. users.rsise.anu.edu.au/~chteo/BMRM.htmlGoogle Scholar
Tsang, I. W., Kwok, J. T. and Cheung, P.-M. (2005), ‘Core vector machines: Fast SVM training on very large data sets’, J. Mach. Learn. Res. 6, 363392.Google Scholar
Vapnik, V. N. (1998), Statistical Learning Theory, Wiley, New York.Google Scholar
Vapnik, V. N. (1999), The Nature of Statistical Learning Theory, Information Science and Statistics, Springer.Google Scholar
Williams, C. K. and Seeger, M. (2001), Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, Vol. 13, The MIT Press, pp. 682688.Google Scholar
Woodsend, K. and Gondzio, J. (2007 a), Exploiting separability in large scale support vector machine training. Technical report MS-07–002, The University of Edinburgh.Google Scholar
Woodsend, K. and Gondzio, J. (2007 b), Parallel support vector machine training with nonlinear kernels. Technical report MS-07–007, The University of Edinburgh.Google Scholar
Woodsend, K. and Gondzio, J. (2009), ‘Hybrid MPI/OpenMP parallel linear support vector machine training’, J. Mach. Learn. Res. 20, 19371953.Google Scholar
Wyganowski, M. (2008), Classification algorithms on the cell processor. PhD thesis, Kate Gleason College of Engineering, Rochester Institute of Technology, Rochester, NY, USA. http://hdl.handle.net/1850/7767Google Scholar
Zanni, L. (2006), ‘An improved gradient projection-based decomposition technique for support vector machines’, Comput. Management Sci. 3, 131145.CrossRefGoogle Scholar
Zanni, L., Serafini, T. and Zanghirati, G. (2006), ‘Parallel software for training large-scale support vector machines on multiprocessors systems’, J. Mach. Learn. Res. 7, 14671492. http://dm.unife.it/gpdtGoogle Scholar