
10 - Information-Theoretic Stability and Generalization

Published online by Cambridge University Press:  22 March 2021

Miguel R. D. Rodrigues, University College London
Yonina C. Eldar, Weizmann Institute of Science, Israel

Summary

Machine-learning algorithms can be viewed as stochastic transformations that map training data to hypotheses. Following Bousquet and Elisseeff, we say such an algorithm is stable if its output does not depend too much on any individual training example. Since stability is closely connected to generalization capabilities of learning algorithms, it is of interest to obtain sharp quantitative estimates on the generalization bias of machine-learning algorithms in terms of their stability properties. We describe several information-theoretic measures of algorithmic stability and illustrate their use for upper-bounding the generalization bias of learning algorithms. Specifically, we relate the expected generalization error of a learning algorithm to several information-theoretic quantities that capture the statistical dependence between the training data and the hypothesis. These include mutual information and erasure mutual information, and their counterparts induced by the total variation distance. We illustrate the general theory through examples, including the Gibbs algorithm and differentially private algorithms, and discuss strategies for controlling the generalization error.
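To make the connection concrete, here is a minimal sketch of one bound of this type (the mutual-information bound of Xu and Raginsky, stated under the assumption that the loss is sub-Gaussian; the symbols S, W, μ, and n below are illustrative notation rather than necessarily the chapter's). Let S = (Z_1, …, Z_n) be a training sample drawn i.i.d. from a distribution μ, and let W be the hypothesis produced by a (possibly randomized) algorithm P_{W|S}. The expected generalization error is

\mathrm{gen}(\mu, P_{W|S}) \;=\; \mathbb{E}\bigl[L_\mu(W) - L_S(W)\bigr],
\qquad
L_\mu(w) = \mathbb{E}_{Z \sim \mu}\,\ell(w, Z),
\quad
L_S(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w, Z_i),

and, if \ell(w, Z) is \sigma-sub-Gaussian under Z \sim \mu for every hypothesis w, then

\bigl|\mathrm{gen}(\mu, P_{W|S})\bigr| \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(S; W)}.

A weaker statistical dependence between the training data S and the hypothesis W (i.e., a smaller mutual information I(S; W)) thus translates directly into a smaller generalization gap, which is the sense in which information-theoretic stability controls generalization.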

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2021


References

Vapnik, V. N., Statistical learning theory. Wiley, 1998.
Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K., “Learnability, stability, and uniform convergence,” J. Machine Learning Res., vol. 11, pp. 2635–2670, 2010.
Taylor, J. and Tibshirani, R. J., “Statistical learning and selective inference,” Proc. Natl. Acad. Sci. USA, vol. 112, no. 25, pp. 7629–7634, 2015.
Devroye, L. and Wagner, T., “Distribution-free performance bounds for potential function rules,” IEEE Trans. Information Theory, vol. 25, no. 5, pp. 601–604, 1979.
Rogers, W. H. and Wagner, T. J., “A finite sample distribution-free performance bound for local discrimination rules,” Annals Statist., vol. 6, no. 3, pp. 506–514, 1978.
Ron, D. and Kearns, M., “Algorithmic stability and sanity-check bounds for leave-one-out cross-validation,” Neural Computation, vol. 11, no. 6, pp. 1427–1453, 1999.
Bousquet, O. and Elisseeff, A., “Stability and generalization,” J. Machine Learning Res., vol. 2, pp. 499–526, 2002.
Poggio, T., Rifkin, R., Mukherjee, S., and Niyogi, P., “General conditions for predictivity in learning theory,” Nature, vol. 428, no. 6981, pp. 419–422, 2004.
Kutin, S. and Niyogi, P., “Almost-everywhere algorithmic stability and generalization error,” in Proc. 18th Conference on Uncertainty in Artificial Intelligence (UAI 2002), 2002, pp. 275–282.
Rakhlin, A., Mukherjee, S., and Poggio, T., “Stability results in learning theory,” Analysis and Applications, vol. 3, no. 4, pp. 397–417, 2005.
Dwork, C., McSherry, F., Nissim, K., and Smith, A., “Calibrating noise to sensitivity in private data analysis,” in Proc. Theory of Cryptography Conference, 2006, pp. 265–284.
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A., “Generalization in adaptive data analysis and holdout reuse,” arXiv:1506.02629, 2015.
Dwork, C. and Roth, A., “The algorithmic foundations of differential privacy,” Foundations and Trends in Theoretical Computer Sci., vol. 9, nos. 3–4, pp. 211–407, 2014.
Kairouz, P., Oh, S., and Viswanath, P., “The composition theorem for differential privacy,” in Proc. 32nd International Conference on Machine Learning (ICML), 2015, pp. 1376–1385.
Raginsky, M., Rakhlin, A., Tsao, M., Wu, Y., and Xu, A., “Information-theoretic analysis of stability and bias of learning algorithms,” in Proc. IEEE Information Theory Workshop, 2016.
Xu, A. and Raginsky, M., “Information-theoretic analysis of generalization capability of learning algorithms,” in Proc. Conference on Neural Information Processing Systems, 2017.
Strasser, H., Mathematical theory of statistics: Statistical experiments and asymptotic decision theory. Walter de Gruyter, 1985.
Polyanskiy, Y. and Wu, Y., “Dissipation of information in channels with input constraints,” IEEE Trans. Information Theory, vol. 62, no. 1, pp. 35–55, 2016.
Verdú, S. and Weissman, T., “The information lost in erasures,” IEEE Trans. Information Theory, vol. 54, no. 11, pp. 5030–5058, 2008.
Hoeffding, W., “Probability inequalities for sums of bounded random variables,” J. Amer. Statist. Assoc., vol. 58, no. 301, pp. 13–30, 1963.
Boucheron, S., Lugosi, G., and Massart, P., Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
Cover, T. M. and Thomas, J. A., Elements of information theory, 2nd edn. Wiley, 2006.
Bassily, R., Nissim, K., Smith, A., Steinke, T., Stemmer, U., and Ullman, J., “Algorithmic stability for adaptive data analysis,” in Proc. 48th ACM Symposium on Theory of Computing (STOC), 2016, pp. 1046–1059.
McAllester, D., “PAC-Bayesian model averaging,” in Proc. 1999 Conference on Learning Theory, 1999.
McAllester, D., “A PAC-Bayesian tutorial with a dropout bound,” arXiv:1307.2118, 2013, http://arxiv.org/abs/1307.2118.
Russo, D. and Zou, J., “Controlling bias in adaptive data analysis using information theory,” in Proc. 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, pp. 1232–1240.
Jiao, J., Han, Y., and Weissman, T., “Dependence measures bounding the exploration bias for general measurements,” in Proc. IEEE International Symposium on Information Theory, 2017, pp. 1475–1479.
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A., “Preserving statistical validity in adaptive data analysis,” in Proc. 47th ACM Symposium on Theory of Computing (STOC), 2015.
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A., “Generalization in adaptive data analysis and holdout reuse,” in 28th Annual Conference on Neural Information Processing Systems (NIPS), 2015.
Bassily, R., Moran, S., Nachum, I., Shafer, J., and Yehudayoff, A., “Learners that use little information,” in Proc. Conference on Algorithmic Learning Theory (ALT), 2018.
Vershynin, R., High-dimensional probability: An introduction with applications in data science. Cambridge University Press, 2018.
Devroye, L., Györfi, L., and Lugosi, G., A probabilistic theory of pattern recognition. Springer, 1996.
Buescher, K. L. and Kumar, P. R., “Learning by canonical smooth estimation. I. Simultaneous estimation,” IEEE Trans. Automatic Control, vol. 41, no. 4, pp. 545–556, 1996.
Cohen, J. E., Kemperman, J. H. B., and Zbǎganu, G., Comparisons of stochastic matrices, with applications in information theory, statistics, economics, and population sciences. Birkhäuser, 1998.
Polyanskiy, Y., “Channel coding: Non-asymptotic fundamental limits,” Ph.D. dissertation, Princeton University, 2010.
Sason, I. and Verdú, S., “f-divergence inequalities,” IEEE Trans. Information Theory, vol. 62, no. 11, pp. 5973–6006, 2016.
McSherry, F. and Talwar, K., “Mechanism design via differential privacy,” in Proc. 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2007.
Polyanskiy, Y. and Wu, Y., “Lecture notes on information theory,” Lecture Notes for ECE563 (UIUC) and 6.441 (MIT), 2012–2016, http://people.lids.mit.edu/yp/homepage/data/itlectures_v4.pdf.
Verdú, S., “The exponential distribution in information theory,” Problems of Information Transmission, vol. 32, no. 1, pp. 86–95, 1996.
Raginsky, M., “Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels,” IEEE Trans. Information Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
Goodfellow, I., Bengio, Y., and Courville, A., Deep learning. MIT Press, 2016.
Feldman, V. and Steinke, T., “Calibrating noise to variance in adaptive data analysis,” in Proc. 2018 Conference on Learning Theory, 2018.
Cuff, P. and Yu, L., “Differential privacy as a mutual information constraint,” in Proc. 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2016, pp. 43–54.
