This is the first rigorous, self-contained treatment of the theory of deep learning. Starting with the foundations of the theory and building it up, this is essential reading for scientists, instructors, and students interested in artificial intelligence and deep learning. It provides guidance on how to think about scientific questions, and leads readers through the history of the field and its fundamental connections to neuroscience. The author discusses many applications to beautiful problems in the natural sciences, in physics, chemistry, and biomedicine. Examples include the search for exotic particles and dark matter in experimental physics, the prediction of molecular properties and reaction outcomes in chemistry, and the prediction of protein structures and the diagnostic analysis of biomedical images in biomedicine. The text is accompanied by a full set of exercises at different difficulty levels and encourages out-of-the-box thinking.
This chapter provides an introduction to uncertainty relations underlying sparse signal recovery. We start with the seminal work by Donoho and Stark (1989), which defines uncertainty relations as upper bounds on the operator norm of the band-limitation operator followed by the time-limitation operator, generalize this theory to arbitrary pairs of operators, and then develop, out of this generalization, the coherence-based uncertainty relations due to Elad and Bruckstein (2002), plus uncertainty relations in terms of concentration of the 1-norm or 2-norm. The theory is completed with set-theoretic uncertainty relations which lead to best possible recovery thresholds in terms of a general measure of parsimony, the Minkowski dimension. We also elaborate on the remarkable connection between uncertainty relations and the “large sieve,” a family of inequalities developed in analytic number theory. We show how uncertainty relations allow one to establish fundamental limits of practical signal recovery problems such as inpainting, declipping, super-resolution, and denoising of signals corrupted by impulse noise or narrowband interference.
In compressed sensing (CS) a signal x ∈ R^n is measured as y = Ax + z, where A ∈ R^(m×n) (m < n) and z ∈ R^m denote the sensing matrix and the measurement noise, respectively. The goal is to recover x from the measurements y when m < n. CS is possible because we typically want to capture highly structured signals, and recovery algorithms take advantage of a signal’s structure to solve the under-determined system of linear equations. As in CS, data-compression codes take advantage of a signal’s structure to encode it efficiently. Structures used by compression codes are much more elaborate than those used by CS algorithms. Using more complex structures in CS, like those employed by data-compression codes, potentially leads to more efficient recovery methods requiring fewer linear measurements or giving better reconstruction quality. We establish connections between data compression and CS, giving CS recovery methods based on compression codes, which indirectly take advantage of all structures used by compression codes. This elevates the class of structures used by CS algorithms to those used by compression codes, leading to more efficient CS recovery methods.
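As a rough illustration of the measurement model y = Ax + z with m < n, the following sketch measures a 1-sparse signal with a random Gaussian matrix and recovers it by a single correlation step. It is not the compression-based recovery method of the chapter; the function names `measure` and `recover_one_sparse`, the dimensions, and the choice of a noiseless 1-sparse signal are all assumptions made for the example.

```python
import random

def measure(A, x, noise_std=0.0, rng=None):
    """Return y = A x + z, where z is i.i.d. Gaussian noise (z = 0 if noise_std == 0)."""
    rng = rng or random.Random(0)
    y = []
    for row in A:
        value = sum(a * xi for a, xi in zip(row, x))
        z = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        y.append(value + z)
    return y

def recover_one_sparse(A, y):
    """Recover a signal assumed to have exactly one nonzero entry.

    Picks the column of A most correlated with y (after normalization)
    and fits its coefficient by least squares.
    """
    m, n = len(A), len(A[0])
    best_j, best_score, best_coef = 0, -1.0, 0.0
    for j in range(n):
        col = [A[i][j] for i in range(m)]
        dot = sum(c * yi for c, yi in zip(col, y))
        norm_sq = sum(c * c for c in col)
        score = abs(dot) / norm_sq ** 0.5
        if score > best_score:
            best_j, best_score, best_coef = j, score, dot / norm_sq
    x_hat = [0.0] * n
    x_hat[best_j] = best_coef
    return x_hat

# Hypothetical sizes: m = 8 measurements of an n = 32 dimensional signal.
rng = random.Random(42)
m, n = 8, 32
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
x = [0.0] * n
x[5] = 3.0                      # the single nonzero entry (the "structure")
y = measure(A, x)               # noiseless measurements
x_hat = recover_one_sparse(A, y)
```

The system has far fewer equations than unknowns, yet the sparsity assumption makes the noiseless recovery exact here; richer structures, such as those exploited by compression codes, would need a correspondingly richer recovery step.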
This chapter provides a survey of the common techniques for determining the sharp statistical and computational limits in high-dimensional statistical problems with planted structures, using community detection and submatrix detection problems as illustrative examples. We discuss tools including the first- and second-moment methods for analyzing the maximum-likelihood estimator, information-theoretic methods for proving impossibility results using mutual information and rate-distortion theory, and methods originating from statistical physics such as the interpolation method. To investigate computational limits, we describe a common recipe to construct a randomized polynomial-time reduction scheme that approximately maps instances of the planted clique problem to the problem of interest in total variation distance.
We study the compression of sources located at the nodes of a network when the goal is to compute a function of these sources at the receiver(s). The rate region of this problem has been considered under restrictive assumptions. We present results that significantly relax these assumptions. For a one-stage tree network, we characterize a rate region by a necessary and sufficient condition for any achievable coloring-based coding scheme, the coloring connectivity condition. We propose a modularized coding scheme based on graph colorings that performs arbitrarily close to the derived rate lower bounds. For a general tree network, we provide a rate lower bound based on graph entropies and show that it is tight for independent sources. We show that, in a general tree network with independent sources, intermediate nodes should perform computations in order to achieve the rate lower bound; however, for a family of functions and random variables, which we call chain-rule proper sets, no computations at intermediate nodes are needed to perform arbitrarily close to the rate lower bound. We consider practicalities of coloring-based coding schemes and propose an efficient algorithm to compute a minimum-entropy coloring of a characteristic graph.
Clustering is a general term for techniques that, given a set of objects, aim to select those that are closer to one another than to the rest, according to a chosen notion of closeness. It is an unsupervised-learning problem since objects are not externally labeled by category. Much effort has been expended on finding natural mathematical definitions of closeness and then developing/evaluating algorithms in these terms. Many have argued that there is no domain-independent mathematical notion of similarity but that it is context-dependent; categories are perhaps natural in that people can evaluate them when they see them. Some have dismissed the problem of unsupervised learning in favor of supervised learning, saying it is not a powerful natural phenomenon. Yet, most learning is unsupervised. We largely learn how to think through categories by observing the world in its unlabeled state. Drawing on universal information theory, we ask whether there are universal approaches to unsupervised clustering. In particular, we consider instances wherein the ground-truth clusters are defined by the unknown statistics governing the data to be clustered.
Information theory plays an indispensable role in the development of algorithm-independent impossibility results, both for communication problems and for seemingly distinct areas such as statistics and machine learning. While numerous information-theoretic tools have been proposed for this purpose, the oldest one remains arguably the most versatile and widespread: Fano’s inequality. In this chapter, we provide a survey of Fano’s inequality and its variants in the context of statistical estimation, adopting a versatile framework that covers a wide range of specific problems. We present a variety of key tools and techniques used for establishing impossibility results via this approach, and provide representative examples covering group testing, graphical model selection, sparse linear regression, density estimation, and convex optimization.
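For orientation, one standard form of Fano’s inequality can be written as follows. The notation is an assumption made here, not taken from the chapter: V is uniform on a set of M ≥ 2 hypotheses, \hat{V} is any estimate of V, and P_e is the probability of error.

```latex
P_e \;=\; \Pr\bigl[\hat{V} \neq V\bigr]
\;\ge\; 1 - \frac{I(V;\hat{V}) + \log 2}{\log M}.
```

In words, if the estimate carries little information about V relative to \log M, then every estimator, whatever algorithm produces it, must err with probability bounded away from zero.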
This chapter introduces basic ideas of information-theoretic models for distributed statistical inference problems with compressed data, and discusses current and future research directions and challenges in applying these models to various statistical learning problems. In these applications, data are distributed across multiple terminals, which can communicate with each other via limited-capacity channels. Instead of recovering data at a centralized location first and then performing inference, this chapter describes schemes that can perform statistical inference without recovering the underlying data. Information-theoretic tools are borrowed to characterize the fundamental limits of the classical statistical inference problems using compressed data directly. In this chapter, distributed statistical learning problems are first introduced. Then, models and results of distributed inference are discussed. Finally, new directions that generalize and improve the basic scenarios are described.
Machine-learning algorithms can be viewed as stochastic transformations that map training data to hypotheses. Following Bousquet and Elisseeff, we say such an algorithm is stable if its output does not depend too much on any individual training example. Since stability is closely connected to generalization capabilities of learning algorithms, it is of interest to obtain sharp quantitative estimates on the generalization bias of machine-learning algorithms in terms of their stability properties. We describe several information-theoretic measures of algorithmic stability and illustrate their use for upper-bounding the generalization bias of learning algorithms. Specifically, we relate the expected generalization error of a learning algorithm to several information-theoretic quantities that capture the statistical dependence between the training data and the hypothesis. These include mutual information and erasure mutual information, and their counterparts induced by the total variation distance. We illustrate the general theory through examples, including the Gibbs algorithm and differentially private algorithms, and discuss strategies for controlling the generalization error.
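A well-known bound of this type, given here only as an illustration and under assumptions not stated in the abstract, relates the expected generalization error to the mutual information between the training sample and the output hypothesis. Assume S = (Z_1, \dots, Z_n) is drawn i.i.d. from \mu, W is the hypothesis returned by the algorithm, L_\mu(W) is the population risk, L_S(W) is the empirical risk, and the loss \ell(w, Z) is \sigma-subgaussian under \mu for every w.

```latex
\Bigl|\, \mathbb{E}\bigl[ L_\mu(W) - L_S(W) \bigr] \Bigr|
\;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(S; W)}.
```

The less the hypothesis depends on the training sample in the mutual-information sense, the smaller the bound, which matches the intuition that stable algorithms generalize.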
A grand challenge in representation learning is the development of computational algorithms that learn the explanatory factors of variation behind high-dimensional data. Representation models (encoders) are often chosen to optimize performance on training data when the real objective is to generalize well to other (unseen) data. This chapter provides an overview of fundamental concepts in statistical learning theory and the information-bottleneck principle. This serves as a mathematical basis for the technical results, in which an upper bound on the generalization gap corresponding to the cross-entropy risk is given. When this penalty term times a suitable multiplier and the cross-entropy empirical risk are minimized jointly, the problem is equivalent to optimizing the information-bottleneck objective with respect to the empirical data distribution. This result provides an interesting connection between mutual information and generalization, and helps to explain why noise injection during the training phase can improve the generalization ability of encoder models and enforce invariances in the resulting representations.
The ability to understand and solve high-dimensional inference problems is essential for modern data science. This chapter examines high-dimensional inference problems through the lens of information theory and focuses on the standard linear model as a canonical example that is both rich enough to be practically useful and simple enough to be studied rigorously. In particular, this model can exhibit phase transitions where an arbitrarily small change in the model parameters can induce large changes in the quality of estimates. For this model, the performance of optimal inference can be studied using the replica method from statistical physics but, until recently, it was not known whether the resulting formulas were actually correct. In this chapter, we present a tutorial description of the standard linear model and its connection to information theory. We also describe the replica prediction for this model and outline the authors’ recent proof that it is exact.
Processing, storing, and communicating information that originates as an analog phenomenon involve conversion of the information to bits. This conversion can be described by the combined effect of sampling and quantization. The digital representation in this procedure is achieved by first sampling the analog signal so as to represent it by a set of discrete-time samples and then quantizing these samples to a finite number of bits. Traditionally, these two operations are considered separately. The sampler is designed to minimize information loss due to sampling based on prior assumptions about the continuous-time input. The quantizer is designed to represent the samples as accurately as possible, subject to the constraint on the number of bits that can be used in the representation. The goal of this chapter is to revisit this paradigm by considering the joint effect of these two operations and to illuminate the dependence between them.
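The two-step conversion described above can be sketched as follows. This is a minimal illustration of the traditional, separate design (uniform sampling followed by a uniform quantizer), not the joint treatment the chapter develops; the function names `sample` and `quantize`, the 5 Hz sinusoid, the 100 Hz sampling rate, and the 3-bit budget are all assumptions made for the example.

```python
import math

def sample(signal, rate_hz, duration_s):
    """Uniformly sample a continuous-time signal (a function of time in seconds)."""
    n = int(rate_hz * duration_s)
    return [signal(k / rate_hz) for k in range(n)]

def quantize(samples, bits, lo=-1.0, hi=1.0):
    """Uniform quantizer with 2**bits cells on [lo, hi].

    Each sample is mapped to the midpoint of the cell it falls in;
    samples outside [lo, hi] are clipped to the nearest cell.
    """
    levels = 2 ** bits
    step = (hi - lo) / levels
    out = []
    for s in samples:
        idx = math.floor((s - lo) / step)
        idx = min(max(idx, 0), levels - 1)
        out.append(lo + (idx + 0.5) * step)
    return out

# Hypothetical analog input: a 5 Hz sinusoid with amplitude 1.
analog = lambda t: math.sin(2 * math.pi * 5 * t)

samples = sample(analog, rate_hz=100, duration_s=1.0)   # sampling step
digital = quantize(samples, bits=3)                     # quantization step

# Worst-case quantization error; at most half a cell width for in-range samples.
max_error = max(abs(s - q) for s, q in zip(samples, digital))
```

Raising the sampling rate while holding the total bit budget fixed leaves fewer bits per sample, which is one simple way to see why the two operations interact.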