
Deep learning: a statistical viewpoint

Published online by Cambridge University Press: 04 August 2021

Peter L. Bartlett
Affiliation:
Departments of Statistics and EECS, University of California, Berkeley, CA 94720, USA. E-mail: [email protected]
Andrea Montanari
Affiliation:
Departments of EE and Statistics, Stanford University, Stanford, CA 94304, USA. E-mail: [email protected]
Alexander Rakhlin
Affiliation:
Department of Brain & Cognitive Sciences and Statistics and Data Science Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: [email protected]

Abstract


The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
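The implicit regularization described above, namely that gradient methods converge to the minimal norm function that perfectly fits the training data, can be illustrated numerically in the simplest overparametrized setting: linear regression with more parameters than samples. The sketch below is not taken from the article; the dimensions, step size and iteration count are illustrative assumptions chosen only to make the phenomenon visible.

```python
# Minimal sketch (illustrative assumptions, not the paper's experiments):
# gradient descent on overparametrized least squares, started at zero,
# converges to the minimum-l2-norm interpolating solution.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                          # n samples, p >> n parameters
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

theta = np.zeros(p)                     # initialize at the origin
step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L for the squared-loss gradient
for _ in range(2000):                   # plain gradient descent
    theta -= step * X.T @ (X @ theta - y)

theta_min_norm = np.linalg.pinv(X) @ y  # minimum-norm solution of X theta = y

print("training residual:", np.linalg.norm(X @ theta - y))                 # ~ 0
print("gap to min-norm interpolant:", np.linalg.norm(theta - theta_min_norm))  # ~ 0
```

Because the iterate starts at zero and every gradient lies in the row space of X, its limit is the minimum-Euclidean-norm interpolant; no explicit penalty is imposed, yet the algorithm selects a particular solution among the infinitely many that fit the data exactly. Benign overfitting then concerns when such an interpolant nevertheless predicts well on fresh data.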

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press
