
References

Published online by Cambridge University Press: 05 July 2014

Shai Shalev-Shwartz, Hebrew University of Jerusalem
Shai Ben-David, University of Waterloo, Ontario
Type: Chapter
Book: Understanding Machine Learning: From Theory to Algorithms, pp. 385–394
Publisher: Cambridge University Press
Print publication year: 2014
Chapter DOI: https://doi.org/10.1017/CBO9781107298019.036



Abernethy, J., Bartlett, P. L., Rakhlin, A. & Tewari, A. (2008), "Optimal strategies and minimax lower bounds for online convex games," in Proceedings of the Nineteenth Annual Conference on Computational Learning Theory.
Ackerman, M. & Ben-David, S. (2008), "Measures of clustering quality: A working set of axioms for clustering," in Proceedings of Neural Information Processing Systems (NIPS), pp. 121–128.
Agarwal, S. & Roth, D. (2005), "Learnability of bipartite ranking functions," in Proceedings of the 18th Annual Conference on Learning Theory, pp. 16–31.
Agmon, S. (1954), "The relaxation method for linear inequalities," Canadian Journal of Mathematics 6(3), 382–392.
Aizerman, M. A., Braverman, E. M. & Rozonoer, L. I. (1964), "Theoretical foundations of the potential function method in pattern recognition learning," Automation and Remote Control 25, 821–837.
Allwein, E. L., Schapire, R. & Singer, Y. (2000), "Reducing multiclass to binary: A unifying approach for margin classifiers," Journal of Machine Learning Research 1, 113–141.
Alon, N., Ben-David, S., Cesa-Bianchi, N. & Haussler, D. (1997), "Scale-sensitive dimensions, uniform convergence, and learnability," Journal of the ACM 44(4), 615–631.
Anthony, M. & Bartlett, P. (1999), Neural Network Learning: Theoretical Foundations, Cambridge University Press.
Baraniuk, R., Davenport, M., DeVore, R. & Wakin, M. (2008), "A simple proof of the restricted isometry property for random matrices," Constructive Approximation 28(3), 253–263.
Barber, D. (2012), Bayesian Reasoning and Machine Learning, Cambridge University Press.
Bartlett, P., Bousquet, O. & Mendelson, S. (2005), "Local Rademacher complexities," Annals of Statistics 33(4), 1497–1537.
Bartlett, P. L. & Ben-David, S. (2002), "Hardness results for neural network approximation problems," Theoretical Computer Science 284(1), 53–66.
Bartlett, P. L., Long, P. M. & Williamson, R. C. (1994), "Fat-shattering and the learnability of real-valued functions," in Proceedings of the Seventh Annual Conference on Computational Learning Theory, ACM, pp. 299–310.
Bartlett, P. L. & Mendelson, S. (2001), "Rademacher and Gaussian complexities: Risk bounds and structural results," in 14th Annual Conference on Computational Learning Theory (COLT) 2001, Vol. 2111, Springer, Berlin, pp. 224–240.
Bartlett, P. L. & Mendelson, S. (2002), "Rademacher and Gaussian complexities: Risk bounds and structural results," Journal of Machine Learning Research 3, 463–482.
Ben-David, S., Cesa-Bianchi, N., Haussler, D. & Long, P. (1995), "Characterizations of learnability for classes of {0,…, n}-valued functions," Journal of Computer and System Sciences 50, 74–86.
Ben-David, S., Eiron, N. & Long, P. (2003), "On the difficulty of approximately maximizing agreements," Journal of Computer and System Sciences 66(3), 496–514.
Ben-David, S. & Litman, A. (1998), "Combinatorial variability of Vapnik-Chervonenkis classes with applications to sample compression schemes," Discrete Applied Mathematics 86(1), 3–25.
Ben-David, S., Pál, D. & Shalev-Shwartz, S. (2009), "Agnostic online learning," in Conference on Learning Theory (COLT).
Ben-David, S. & Simon, H. (2001), "Efficient learning of linear perceptrons," in Advances in Neural Information Processing Systems, pp. 189–195.
Bengio, Y. (2009), "Learning deep architectures for AI," Foundations and Trends in Machine Learning 2(1), 1–127.
Bengio, Y. & LeCun, Y. (2007), "Scaling learning algorithms towards AI," Large-Scale Kernel Machines 34.
Bertsekas, D. (1999), Nonlinear Programming, Athena Scientific.
Beygelzimer, A., Langford, J. & Ravikumar, P. (2007), "Multiclass classification with filter trees," Preprint, June.
Birkhoff, G. (1946), "Three observations on linear algebra," Rev. Univ. Nac. Tucumán, Ser. A 5, 147–151.
Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Vol. 1, Springer, New York.
Blum, L., Shub, M. & Smale, S. (1989), "On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions and universal machines," Bulletin of the American Mathematical Society 21(1), 1–46.
Blumer, A., Ehrenfeucht, A., Haussler, D. & Warmuth, M. K. (1987), "Occam's razor," Information Processing Letters 24(6), 377–380.
Blumer, A., Ehrenfeucht, A., Haussler, D. & Warmuth, M. K. (1989), "Learnability and the Vapnik-Chervonenkis dimension," Journal of the Association for Computing Machinery 36(4), 929–965.
Borwein, J. & Lewis, A. (2006), Convex Analysis and Nonlinear Optimization, Springer.
Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992), "A training algorithm for optimal margin classifiers," in COLT, pp. 144–152.
Bottou, L. & Bousquet, O. (2008), "The tradeoffs of large scale learning," in NIPS, pp. 161–168.
Boucheron, S., Bousquet, O. & Lugosi, G. (2005), "Theory of classification: A survey of recent advances," ESAIM: Probability and Statistics 9, 323–375.
Bousquet, O. (2002), Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms, PhD thesis, Ecole Polytechnique.
Bousquet, O. & Elisseeff, A. (2002), "Stability and generalization," Journal of Machine Learning Research 2, 499–526.
Boyd, S. & Vandenberghe, L. (2004), Convex Optimization, Cambridge University Press.
Breiman, L. (1996), Bias, Variance, and Arcing Classifiers, Technical Report 460, Statistics Department, University of California at Berkeley.
Breiman, L. (2001), "Random forests," Machine Learning 45(1), 5–32.
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. (1984), Classification and Regression Trees, Wadsworth & Brooks.
Candès, E. (2008), "The restricted isometry property and its implications for compressed sensing," Comptes Rendus Mathematique 346(9), 589–592.
Candès, E. J. (2006), "Compressive sampling," in Proceedings of the International Congress of Mathematicians, Madrid, Spain.
Candès, E. & Tao, T. (2005), "Decoding by linear programming," IEEE Transactions on Information Theory 51, 4203–4215.
Cesa-Bianchi, N. & Lugosi, G. (2006), Prediction, Learning, and Games, Cambridge University Press.
Chang, H. S., Weiss, Y. & Freeman, W. T. (2009), "Informative sensing," arXiv preprint arXiv:0901.4275.
Chapelle, O., Le, Q. & Smola, A. (2007), "Large margin optimization of ranking measures," in NIPS Workshop: Machine Learning for Web Search.
Collins, M. (2000), "Discriminative reranking for natural language parsing," in Machine Learning.
Collins, M. (2002), "Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms," in Conference on Empirical Methods in Natural Language Processing.
Collobert, R. & Weston, J. (2008), "A unified architecture for natural language processing: Deep neural networks with multitask learning," in International Conference on Machine Learning (ICML).
Cortes, C. & Vapnik, V. (1995), "Support-vector networks," Machine Learning 20(3), 273–297.
Cover, T. (1965), "Behavior of sequential predictors of binary sequences," in Transactions of the Fourth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, pp. 263–272.
Cover, T. & Hart, P. (1967), "Nearest neighbor pattern classification," IEEE Transactions on Information Theory 13(1), 21–27.
Crammer, K. & Singer, Y. (2001), "On the algorithmic implementation of multiclass kernel-based vector machines," Journal of Machine Learning Research 2, 265–292.
Cristianini, N. & Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge University Press.
Daniely, A., Sabato, S., Ben-David, S. & Shalev-Shwartz, S. (2011), "Multiclass learnability and the ERM principle," in COLT.
Daniely, A., Sabato, S. & Shalev-Shwartz, S. (2012), "Multiclass learning approaches: A theoretical comparison with implications," in NIPS.
Davis, G., Mallat, S. & Avellaneda, M. (1997), "Greedy adaptive approximation," Journal of Constructive Approximation 13, 57–98.
Devroye, L. & Györfi, L. (1985), Nonparametric Density Estimation: The L1 View, Wiley.
Devroye, L., Györfi, L. & Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, Springer.
Dietterich, T. G. & Bakiri, G. (1995), "Solving multiclass learning problems via error-correcting output codes," Journal of Artificial Intelligence Research 2, 263–286.
Donoho, D. L. (2006), "Compressed sensing," IEEE Transactions on Information Theory 52(4), 1289–1306.
Dudley, R., Giné, E. & Zinn, J. (1991), "Uniform and universal Glivenko-Cantelli classes," Journal of Theoretical Probability 4(3), 485–510.
Dudley, R. M. (1987), "Universal Donsker classes and metric entropy," Annals of Probability 15(4), 1306–1326.
Fisher, R. A. (1922), "On the mathematical foundations of theoretical statistics," Philosophical Transactions of the Royal Society of London, Series A 222, 309–368.
Floyd, S. (1989), "Space-bounded learning and the Vapnik-Chervonenkis dimension," in COLT, pp. 349–364.
Floyd, S. & Warmuth, M. (1995), "Sample compression, learnability, and the Vapnik-Chervonenkis dimension," Machine Learning 21(3), 269–304.
Frank, M. & Wolfe, P. (1956), "An algorithm for quadratic programming," Naval Research Logistics Quarterly 3, 95–110.
Freund, Y. & Schapire, R. (1995), "A decision-theoretic generalization of on-line learning and an application to boosting," in European Conference on Computational Learning Theory (EuroCOLT), Springer-Verlag, pp. 23–37.
Freund, Y. & Schapire, R. E. (1999), "Large margin classification using the perceptron algorithm," Machine Learning 37(3), 277–296.
Garcia, J. & Koelling, R. (1996), "Relation of cue to consequence in avoidance learning," Foundations of Animal Behavior: Classic Papers with Commentaries 4, 374.
Gentile, C. (2003), "The robustness of the p-norm algorithms," Machine Learning 53(3), 265–299.
Georghiades, A., Belhumeur, P. & Kriegman, D. (2001), "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 643–660.
Gordon, G. (1999), "Regret bounds for prediction problems," in Conference on Learning Theory (COLT).
Gottlieb, L.-A., Kontorovich, L. & Krauthgamer, R. (2010), "Efficient classification for metric data," in 23rd Conference on Learning Theory, pp. 433–440.
Guyon, I. & Elisseeff, A. (2003), "An introduction to variable and feature selection," Journal of Machine Learning Research, Special Issue on Variable and Feature Selection 3, 1157–1182.
Hadamard, J. (1902), "Sur les problèmes aux dérivées partielles et leur signification physique," Princeton University Bulletin 13, 49–52.
Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning, Springer.
Haussler, D. (1992), "Decision theoretic generalizations of the PAC model for neural net and other learning applications," Information and Computation 100(1), 78–150.
Haussler, D. & Long, P. M. (1995), "A generalization of Sauer's lemma," Journal of Combinatorial Theory, Series A 71(2), 219–240.
Hazan, E., Agarwal, A. & Kale, S. (2007), "Logarithmic regret algorithms for online convex optimization," Machine Learning 69(2–3), 169–192.
Hinton, G. E., Osindero, S. & Teh, Y.-W. (2006), "A fast learning algorithm for deep belief nets," Neural Computation 18(7), 1527–1554.
Hiriart-Urruty, J.-B. & Lemaréchal, C. (1993), Convex Analysis and Minimization Algorithms, Springer.
Hsu, C.-W., Chang, C.-C. & Lin, C.-J. (2003), "A practical guide to support vector classification."
Hyafil, L. & Rivest, R. L. (1976), "Constructing optimal binary decision trees is NP-complete," Information Processing Letters 5(1), 15–17.
Joachims, T. (2005), "A support vector method for multivariate performance measures," in Proceedings of the International Conference on Machine Learning (ICML).
Kakade, S., Sridharan, K. & Tewari, A. (2008), "On the complexity of linear prediction: Risk bounds, margin bounds, and regularization," in NIPS.
Karp, R. M. (1972), Reducibility Among Combinatorial Problems, Springer.
Kearns, M. & Mansour, Y. (1996), "On the boosting ability of top-down decision tree learning algorithms," in ACM Symposium on the Theory of Computing (STOC).
Kearns, M. & Ron, D. (1999), "Algorithmic stability and sanity-check bounds for leave-one-out cross-validation," Neural Computation 11(6), 1427–1453.
Kearns, M. & Valiant, L. G. (1988), "Learning Boolean formulae or finite automata is as hard as factoring," Technical Report TR-14-88, Harvard University, Aiken Computation Laboratory.
Kearns, M. & Vazirani, U. (1994), An Introduction to Computational Learning Theory, MIT Press.
Kearns, M. J., Schapire, R. E. & Sellie, L. M. (1994), "Toward efficient agnostic learning," Machine Learning 17, 115–141.
Kleinberg, J. (2003), "An impossibility theorem for clustering," in NIPS, pp. 463–470.
Klivans, A. R. & Sherstov, A. A. (2006), "Cryptographic hardness for learning intersections of halfspaces," in FOCS.
Koller, D. & Friedman, N. (2009), Probabilistic Graphical Models: Principles and Techniques, MIT Press.
Koltchinskii, V. & Panchenko, D. (2000), "Rademacher processes and bounding the risk of function learning," in High Dimensional Probability II, Springer, pp. 443–457.
Kuhn, H. W. (1955), "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly 2(1–2), 83–97.
Kutin, S. & Niyogi, P. (2002), "Almost-everywhere algorithmic stability and generalization error," in Proceedings of the 18th Conference in Uncertainty in Artificial Intelligence, pp. 275–282.
Lafferty, J., McCallum, A. & Pereira, F. (2001), "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in International Conference on Machine Learning, pp. 282–289.
Langford, J. (2006), "Tutorial on practical prediction theory for classification," Journal of Machine Learning Research 6(1), 273.
Langford, J. & Shawe-Taylor, J. (2003), "PAC-Bayes & margins," in NIPS, pp. 423–430.
Le, Q. V., Ranzato, M.-A., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J. & Ng, A. Y. (2012), "Building high-level features using large scale unsupervised learning," in ICML.
Le Cun, L. (2004), "Large scale online learning," in Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, Vol. 16, MIT Press, p. 217.
LeCun, Y. & Bengio, Y. (1995), "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks, MIT Press.
Lee, H., Grosse, R., Ranganath, R. & Ng, A. (2009), "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in ICML.
Littlestone, N. (1988), "Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm," Machine Learning 2, 285–318.
Littlestone, N. & Warmuth, M. (1986), Relating Data Compression and Learnability, unpublished manuscript.
Littlestone, N. & Warmuth, M. K. (1994), "The weighted majority algorithm," Information and Computation 108, 212–261.
Livni, R., Shalev-Shwartz, S. & Shamir, O. (2013), "A provably efficient algorithm for training deep networks," arXiv preprint arXiv:1304.7045.
Livni, R. & Simon, P. (2013), "Honest compressions and their application to compression schemes," in COLT.
MacKay, D. J. (2003), Information Theory, Inference and Learning Algorithms, Cambridge University Press.
Mallat, S. & Zhang, Z. (1993), "Matching pursuits with time-frequency dictionaries," IEEE Transactions on Signal Processing 41, 3397–3415.
McAllester, D. A. (1998), "Some PAC-Bayesian theorems," in COLT.
McAllester, D. A. (1999), "PAC-Bayesian model averaging," in COLT, pp. 164–170.
McAllester, D. A. (2003), "Simplified PAC-Bayesian margin bounds," in COLT, pp. 203–215.
Minsky, M. & Papert, S. (1969), Perceptrons: An Introduction to Computational Geometry, MIT Press.
Mukherjee, S., Niyogi, P., Poggio, T. & Rifkin, R. (2006), "Learning theory: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization," Advances in Computational Mathematics 25(1–3), 161–193.
Murata, N. (1998), "A statistical study of on-line learning," in Online Learning and Neural Networks, Cambridge University Press.
Murphy, K. P. (2012), Machine Learning: A Probabilistic Perspective, MIT Press.
Natarajan, B. (1995), "Sparse approximate solutions to linear systems," SIAM Journal on Computing 25(2), 227–234.
Natarajan, B. K. (1989), "On learning sets and functions," Machine Learning 4, 67–97.
Nemirovski, A., Juditsky, A., Lan, G. & Shapiro, A. (2009), "Robust stochastic approximation approach to stochastic programming," SIAM Journal on Optimization 19(4), 1574–1609.
Nemirovski, A. & Yudin, D. (1978), Problem Complexity and Method Efficiency in Optimization, Nauka, Moscow.
Nesterov, Y. (2005), Primal-Dual Subgradient Methods for Convex Problems, Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL).
Nesterov, Y. (2004), Introductory Lectures on Convex Optimization: A Basic Course, Vol. 87, Springer, Netherlands.
Novikoff, A. B. J. (1962), "On convergence proofs on perceptrons," in Proceedings of the Symposium on the Mathematical Theory of Automata, Vol. XII, pp. 615–622.
Parberry, I. (1994), Circuit Complexity and Neural Networks, MIT Press.
Pearson, K. (1901), "On lines and planes of closest fit to systems of points in space," The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559–572.
Phillips, D. L. (1962), "A technique for the numerical solution of certain integral equations of the first kind," Journal of the ACM 9(1), 84–97.
Pisier, G. (1980–1981), "Remarques sur un résultat non publié de B. Maurey."
Pitt, L. & Valiant, L. (1988), "Computational limitations on learning from examples," Journal of the Association for Computing Machinery 35(4), 965–984.
Poon, H. & Domingos, P. (2011), "Sum-product networks: A new deep architecture," in Conference on Uncertainty in Artificial Intelligence (UAI).
Quinlan, J. R. (1986), "Induction of decision trees," Machine Learning 1, 81–106.
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann.
Rabiner, L. & Juang, B. (1986), "An introduction to hidden Markov models," IEEE ASSP Magazine 3(1), 4–16.
Rakhlin, A., Shamir, O. & Sridharan, K. (2012), "Making gradient descent optimal for strongly convex stochastic optimization," in ICML.
Rakhlin, A., Sridharan, K. & Tewari, A. (2010), "Online learning: Random averages, combinatorial parameters, and learnability," in NIPS.
Rakhlin, S., Mukherjee, S. & Poggio, T. (2005), "Stability results in learning theory," Analysis and Applications 3(4), 397–419.
Ranzato, M., Huang, F., Boureau, Y. & LeCun, Y. (2007), "Unsupervised learning of invariant feature hierarchies with applications to object recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8.
Rissanen, J. (1978), "Modeling by shortest data description," Automatica 14, 465–471.
Rissanen, J. (1983), "A universal prior for integers and estimation by minimum description length," The Annals of Statistics 11(2), 416–431.
Robbins, H. & Monro, S. (1951), "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400–407.
Rogers, W. & Wagner, T. (1978), "A finite sample distribution-free performance bound for local discrimination rules," The Annals of Statistics 6(3), 506–514.
Rokach, L. (2007), Data Mining with Decision Trees: Theory and Applications, Vol. 69, World Scientific.
Rosenblatt, F. (1958), "The perceptron: A probabilistic model for information storage and organization in the brain," Psychological Review 65, 386–407. (Reprinted in Neurocomputing, MIT Press, 1988.)
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986), "Learning internal representations by error propagation," in D. E. Rumelhart & J. L. McClelland, eds, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, chapter 8, pp. 318–362.
Sankaran, J. K. (1993), "A note on resolving infeasibility in linear programs by constraint relaxation," Operations Research Letters 13(1), 19–20.
Sauer, N. (1972), "On the density of families of sets," Journal of Combinatorial Theory, Series A 13, 145–147.
Schapire, R. (1990), "The strength of weak learnability," Machine Learning 5(2), 197–227.
Schapire, R. E. & Freund, Y. (2012), Boosting: Foundations and Algorithms, MIT Press.
Schölkopf, B. & Smola, A. J. (2002), Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, MIT Press.
Schölkopf, B., Herbrich, R. & Smola, A. (2001), "A generalized representer theorem," in Computational Learning Theory, pp. 416–426.
Schölkopf, B., Herbrich, R., Smola, A. & Williamson, R. (2000), "A generalized representer theorem," in NeuroCOLT.
Schölkopf, B., Smola, A. & Müller, K.-R. (1998), "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation 10(5), 1299–1319.
Seeger, M. (2003), "PAC-Bayesian generalisation error bounds for Gaussian process classification," Journal of Machine Learning Research 3, 233–269.
Shakhnarovich, G., Darrell, T. & Indyk, P. (2006), Nearest-Neighbor Methods in Learning and Vision: Theory and Practice, MIT Press.
Shalev-Shwartz, S. (2007), Online Learning: Theory, Algorithms, and Applications, PhD thesis, The Hebrew University.
Shalev-Shwartz, S. (2011), "Online learning and online convex optimization," Foundations and Trends in Machine Learning 4(2), 107–194.
Shalev-Shwartz, S., Shamir, O., Srebro, N. & Sridharan, K. (2010), "Learnability, stability and uniform convergence," Journal of Machine Learning Research 11, 2635–2670.
Shalev-Shwartz, S., Shamir, O. & Sridharan, K. (2010), "Learning kernel-based halfspaces with the zero-one loss," in COLT.
Shalev-Shwartz, S., Shamir, O., Sridharan, K. & Srebro, N. (2009), "Stochastic convex optimization," in COLT.
Shalev-Shwartz, S. & Singer, Y. (2008), "On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms," in Proceedings of the Nineteenth Annual Conference on Computational Learning Theory.
Shalev-Shwartz, S., Singer, Y. & Srebro, N. (2007), "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM," in International Conference on Machine Learning, pp. 807–814.
Shalev-Shwartz, S. & Srebro, N. (2008), "SVM optimization: Inverse dependence on training set size," in International Conference on Machine Learning (ICML), pp. 928–935.
Shalev-Shwartz, S., Zhang, T. & Srebro, N. (2010), "Trading accuracy for sparsity in optimization problems with sparsity constraints," SIAM Journal on Optimization 20, 2807–2832.
Shamir, O. & Zhang, T. (2013), "Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes," in ICML.
Shapiro, A., Dentcheva, D. & Ruszczyński, A. (2009), Lectures on Stochastic Programming: Modeling and Theory, Vol. 9, Society for Industrial and Applied Mathematics.
Shelah, S. (1972), "A combinatorial problem; stability and order for models and theories in infinitary languages," Pacific Journal of Mathematics 41, 247–261.
Sipser, M. (2006), Introduction to the Theory of Computation, Thomson Course Technology.
Slud, E. V. (1977), "Distribution inequalities for the binomial law," The Annals of Probability 5(3), 404–412.
Steinwart, I. & Christmann, A. (2008), Support Vector Machines, Springer-Verlag, New York.
Stone, C. (1977), "Consistent nonparametric regression," The Annals of Statistics 5(4), 595–620.
Taskar, B., Guestrin, C. & Koller, D. (2003), "Max-margin Markov networks," in NIPS.
Tibshirani, R. (1996), "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B 58(1), 267–288.
Tikhonov, A. N. (1943), "On the stability of inverse problems," Dokl. Akad. Nauk SSSR 39(5), 195–198.
Tishby, N., Pereira, F. & Bialek, W. (1999), "The information bottleneck method," in The 37th Allerton Conference on Communication, Control, and Computing.
Tsochantaridis, I., Hofmann, T., Joachims, T. & Altun, Y. (2004), "Support vector machine learning for interdependent and structured output spaces," in Proceedings of the Twenty-First International Conference on Machine Learning.
Valiant, L. G. (1984), "A theory of the learnable," Communications of the ACM 27(11), 1134–1142.
Vapnik, V. (1992), "Principles of risk minimization for learning theory," in J. E. Moody, S. J. Hanson & R. P. Lippmann, eds, Advances in Neural Information Processing Systems 4, Morgan Kaufmann, pp. 831–838.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, Springer.
Vapnik, V. N. (1982), Estimation of Dependences Based on Empirical Data, Springer-Verlag.
Vapnik, V. N. (1998), Statistical Learning Theory, Wiley.
Vapnik, V. N. & Chervonenkis, A. Y. (1971), "On the uniform convergence of relative frequencies of events to their probabilities," Theory of Probability and Its Applications 16(2), 264–280.
Vapnik, V. N. & Chervonenkis, A. Y. (1974), Theory of Pattern Recognition, Nauka, Moscow (in Russian).
von Luxburg, U. (2007), "A tutorial on spectral clustering," Statistics and Computing 17(4), 395–416.
von Neumann, J. (1928), "Zur Theorie der Gesellschaftsspiele (On the theory of parlor games)," Mathematische Annalen 100, 295–320.
von Neumann, J. (1953), "A certain zero-sum two-person game equivalent to the optimal assignment problem," Contributions to the Theory of Games 2, 5–12.
Vovk, V. G. (1990), "Aggregating strategies," in COLT, pp. 371–383.
Warmuth, M., Glocer, K. & Vishwanathan, S. (2008), "Entropy regularized LPBoost," in Algorithmic Learning Theory (ALT).
Warmuth, M., Liao, J. & Rätsch, G. (2006), "Totally corrective boosting algorithms that maximize the margin," in Proceedings of the 23rd International Conference on Machine Learning.
Weston, J., Chapelle, O., Vapnik, V., Elisseeff, A. & Schölkopf, B. (2002), "Kernel dependency estimation," in Advances in Neural Information Processing Systems, pp. 873–880.
Weston, J. & Watkins, C. (1999), "Support vector machines for multi-class pattern recognition," in Proceedings of the Seventh European Symposium on Artificial Neural Networks.
Wolpert, D. H. & Macready, W. G. (1997), "No free lunch theorems for optimization," IEEE Transactions on Evolutionary Computation 1(1), 67–82.
Zhang, T. (2004), "Solving large scale linear prediction problems using stochastic gradient descent algorithms," in Proceedings of the Twenty-First International Conference on Machine Learning.
Zhao, P. & Yu, B. (2006), "On model selection consistency of Lasso," Journal of Machine Learning Research 7, 2541–2567.
Zinkevich, M. (2003), "Online convex programming and generalized infinitesimal gradient ascent," in International Conference on Machine Learning.
