This chapter is concerned with quasi-Monte Carlo rules, i.e., multivariate quadrature rules featuring equal weights and deterministically chosen evaluation points. The variation of a function and the star discrepancy of a set of points are defined as a prerequisite to the Koksma–Hlawka inequality, which bounds the error of a quasi-Monte Carlo rule by the product of the variation and the star discrepancy. Finally, some evaluation points with small star discrepancy are uncovered, namely the Halton sequence and the Hammersley set.
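As a rough illustration of an equal-weight rule evaluated at Halton points, the sketch below builds the sequence from the radical inverse in coprime bases. The function names, the choice of bases 2 and 3, the starting index 1, and the test integrand are assumptions made for this example, not taken from the chapter.

```python
def radical_inverse(index, base):
    """Mirror the base-`base` digits of `index` about the radix point."""
    value, scale = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        value += digit * scale
        scale /= base
    return value


def halton(n_points, bases=(2, 3)):
    """First `n_points` Halton points (indices 1..n_points), one base per coordinate."""
    return [tuple(radical_inverse(i, b) for b in bases)
            for i in range(1, n_points + 1)]


def qmc_rule(f, points):
    """Equal-weight quadrature: the plain average of f over the points."""
    return sum(f(*p) for p in points) / len(points)


points = halton(1000)
# The exact integral of x*y over the unit square is 1/4.
estimate = qmc_rule(lambda x, y: x * y, points)
print(estimate)
```

The Hammersley set differs only in that one coordinate is replaced by the equispaced values i/N, which requires fixing the number of points N in advance.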
This appendix recalls some key notions of probability theory, such as tails and moment generating functions. These notions are essential in the proof of some concentration inequalities, e.g., the McDiarmid inequality. In turn, these inequalities are used to establish the restricted isometry properties for sparse vectors and for low-rank matrices required earlier.
The high dimensionality of datapoints often constitutes an obstacle to efficient computations. This chapter investigates three workarounds that replace the datapoints by some substitutes selected in a lower dimensional set. The first workaround is principal component analysis, where the lower dimensional set is a linear space spanned by the top singular vectors of the data matrix. The second workaround is a Johnson–Lindenstrauss projection, where the lower dimensional set is a random linear space. The third workaround is locally linear embedding, where the lower dimensional set is not chosen as a linear space anymore.
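To make the second workaround concrete, here is a minimal sketch of a Johnson–Lindenstrauss-type map. It assumes a Gaussian matrix scaled by 1/sqrt(target dimension), which is one common variant and not necessarily the construction used in the chapter; the dimensions, seed, and helper names are illustrative.

```python
import math
import random


def jl_matrix(target_dim, source_dim, rng):
    """Gaussian matrix scaled so that squared norms are preserved in expectation."""
    scale = 1.0 / math.sqrt(target_dim)
    return [[scale * rng.gauss(0.0, 1.0) for _ in range(source_dim)]
            for _ in range(target_dim)]


def project(matrix, x):
    """Apply the linear map to one datapoint."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


rng = random.Random(0)  # fixed seed, for reproducibility of this example only
source_dim, target_dim, n_points = 1000, 200, 5
data = [[rng.gauss(0.0, 1.0) for _ in range(source_dim)] for _ in range(n_points)]
matrix = jl_matrix(target_dim, source_dim, rng)
reduced = [project(matrix, x) for x in data]

# Ratios of pairwise distances after/before reduction; they should be close to 1.
ratios = [distance(reduced[i], reduced[j]) / distance(data[i], data[j])
          for i in range(n_points) for j in range(i + 1, n_points)]
print(min(ratios), max(ratios))
```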
This chapter studies binary classification from a non-statistical viewpoint. For data that are linearly separable, the perceptron algorithm is presented first. It is followed by an optimization program, known as the hard support vector machine (SVM), consisting in maximizing the margin. For data that are not exactly linearly separable, this optimization program is relaxed into soft SVM. Finally, for data that are linearly separable only after applying a feature map, the representer theorem is used to validate the so-called kernel trick.
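For the linearly separable case, the perceptron update can be sketched as follows. The toy data, the explicit bias term, and the `max_passes` safeguard are assumptions added for this example.

```python
def perceptron(points, labels, max_passes=1000):
    """Look for (w, b) with labels[i] * (<w, points[i]> + b) > 0 for every i."""
    w, b = [0.0] * len(points[0]), 0.0
    for _ in range(max_passes):
        mistakes = 0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin <= 0:
                # Misclassified (or on the boundary): move the hyperplane toward x.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b
    raise RuntimeError("no separating hyperplane found within max_passes")


points = [(2.0, 1.0), (3.0, 2.0), (2.0, 3.0), (-1.0, -1.0), (-2.0, 0.0), (-1.0, -2.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = perceptron(points, labels)
margins = [y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
           for x, y in zip(points, labels)]
print(w, b, margins)
```

The returned hyperplane separates the data but does not maximize the margin; that is the role of the hard SVM program.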
This chapter corroborates the empirical belief in the superiority of deep networks over shallow ones. It does so by highlighting three situations where a clear advantage can be demonstrated. First, using depth two, there are activation functions turning neural networks into universal approximators even when restricting the width. Second, depth overcomes the limitation that shallow ReLU networks cannot generate compactly supported functions. Third, the approximation rate of Lipschitz functions by deep ReLU networks is better than that of shallow ones.
This chapter considers the unsupervised learning task known as clustering, which consists in grouping unlabeled datapoints based on some similarity information. The single-linkage algorithm is examined first. Then, the Lloyd algorithm is presented to illustrate the center-based clustering strategy. Finally, the problem of detecting two communities via spectral clustering is analyzed under the stochastic block model.
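The center-based strategy can be illustrated by a bare-bones version of the Lloyd iteration, which alternates between assigning points to their nearest center and recomputing each center as a centroid. The initial centers are passed explicitly here, and the toy points and helper names are assumptions for this sketch.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))


def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def lloyd(points, centers, max_iter=100):
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [centroid(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, [nearest(p, centers) for p in points]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = lloyd(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers, assignment)
```

The outcome depends on the initial centers in general; this example uses well-separated groups so that the result is unambiguous.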
This chapter starts by introducing the key concepts attached to neural networks, such as architecture, weights, biases, and activation function. It proceeds with the specific choice of the rectified linear unit (ReLU) as activation function. In this case, neural networks generate continuous piecewise linear (CPwL) functions. It is then shown that, in the univariate setting, any CPwL function can be generated by a shallow ReLU network. This is no longer true in the multivariate setting, for which it is nonetheless shown that any CPwL function can be generated by a deep ReLU network.
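The univariate statement can be checked on small examples. The sketch below uses one standard construction, which may differ from the chapter's proof: for breakpoints t_1 < ... < t_n with slopes s_0, ..., s_n, one has f(x) = f(t_1) - s_0 ReLU(t_1 - x) + s_1 ReLU(x - t_1) + sum over i >= 2 of (s_i - s_{i-1}) ReLU(x - t_i). The function names and the two test functions are assumptions for this example.

```python
def relu(z):
    return max(z, 0.0)


def shallow_relu_from_cpwl(breakpoints, values, left_slope, right_slope):
    """Hidden units (inner weight, inner bias, outer weight) and the output bias."""
    slopes = [left_slope]
    for i in range(len(breakpoints) - 1):
        slopes.append((values[i + 1] - values[i])
                      / (breakpoints[i + 1] - breakpoints[i]))
    slopes.append(right_slope)
    first = breakpoints[0]
    # The unit ReLU(first - x) carries the linear piece to the left of the first breakpoint.
    units = [(-1.0, first, -left_slope)]
    for i, t in enumerate(breakpoints):
        coeff = slopes[i + 1] - slopes[i]
        if i == 0:
            coeff += left_slope  # so that the unit ReLU(x - first) has weight s_1
        units.append((1.0, -t, coeff))
    return units, values[0]


def evaluate(units, output_bias, x):
    return output_bias + sum(c * relu(w * x + b) for w, b, c in units)


# Hat function: 0 outside [0, 2], peak value 1 at x = 1.
hat_units, hat_bias = shallow_relu_from_cpwl([0.0, 1.0, 2.0], [0.0, 1.0, 0.0], 0.0, 0.0)
# Absolute value: one breakpoint at 0, slopes -1 and +1.
abs_units, abs_bias = shallow_relu_from_cpwl([0.0], [0.0], -1.0, 1.0)
print(evaluate(hat_units, hat_bias, 0.5), evaluate(abs_units, abs_bias, -3.0))
```

A function with n breakpoints uses n + 1 hidden units in this construction.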
This chapter touches on some aspects related to the training of neural networks. First, a method called backpropagation is presented as a way to efficiently compute gradients in descent algorithms when deep networks are used. Next, the chapter considers shallow networks in the overparametrized regime, and it is proved that the empirical-risk landscape, despite its nonconvexity, features no strict local minimizers. Finally, convolutional neural networks are briefly mentioned.
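The idea behind backpropagation is to reuse the activations stored during a forward pass while applying the chain rule from the output back to the input. The following sketch does this for a deliberately tiny network with one neuron per layer, a tanh activation (chosen here for differentiability), and a squared loss. All of these choices, and the numbers used, are assumptions for illustration.

```python
import math


def forward(weights, biases, x):
    """Return all activations a_0 = x, a_{k+1} = tanh(w_k * a_k + b_k)."""
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(math.tanh(w * activations[-1] + b))
    return activations


def loss_and_gradients(weights, biases, x, y):
    a = forward(weights, biases, x)
    loss = 0.5 * (a[-1] - y) ** 2
    grad_w, grad_b = [0.0] * len(weights), [0.0] * len(biases)
    delta = a[-1] - y  # derivative of the loss w.r.t. the last activation
    for k in reversed(range(len(weights))):
        delta *= 1.0 - a[k + 1] ** 2  # now w.r.t. the pre-activation of layer k
        grad_w[k] = delta * a[k]
        grad_b[k] = delta
        delta *= weights[k]  # now w.r.t. the activation a_k
    return loss, grad_w, grad_b


def numerical_grad_w(weights, biases, x, y, k, eps=1e-6):
    """Central finite difference in the k-th weight, used only as a sanity check."""
    up, down = list(weights), list(weights)
    up[k] += eps
    down[k] -= eps
    return (loss_and_gradients(up, biases, x, y)[0]
            - loss_and_gradients(down, biases, x, y)[0]) / (2.0 * eps)


weights, biases = [0.5, -1.2, 0.8], [0.1, 0.3, -0.2]
loss, grad_w, grad_b = loss_and_gradients(weights, biases, 0.7, 0.25)
checks = [numerical_grad_w(weights, biases, 0.7, 0.25, k) for k in range(3)]
print(grad_w, checks)
```

A single backward sweep yields every partial derivative, whereas the finite-difference check needs two forward passes per parameter.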
In this chapter, a variation of the standard compressive sensing problem is studied. In this variation, sparse vectors are replaced by low-rank matrices. Recovery is now performed by nuclear-norm minimization, with success characterized by an analog of the null space property for the observation map. This property holds with high probability for random observation maps, again as a consequence of an analog of the restricted isometry property. Finally, a formulation of nuclear norm minimization as a semidefinite program is justified.
This appendix states and proves several important results about completeness, convexity, and extreme points. These results, including the supporting hyperplane theorem and the Hahn–Banach extension theorem, are invoked throughout the text.
This chapter introduces the key concepts of optimization, such as objective function, constraints, local and global minimizers, and gradient descent algorithms. The rate of convergence for the steepest descent algorithm is analyzed when the objective function is smooth and convex or smooth and strongly convex. The analysis is extended to the stochastic gradient descent algorithm.
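A minimal gradient descent loop with constant step size 1/L, where L is the smoothness constant, might look as follows. The quadratic objective f(x) = 0.5 (x_0^2 + 10 x_1^2) - x_0 - 10 x_1, whose minimizer is (1, 1), and the iteration count are assumptions chosen so that the objective is both smooth (L = 10) and strongly convex (parameter 1).

```python
def gradient_descent(gradient, x0, step, n_iter):
    """Iterate x <- x - step * gradient(x) a fixed number of times."""
    x = list(x0)
    for _ in range(n_iter):
        g = gradient(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def quadratic_gradient(x):
    """Gradient of f(x) = 0.5 * (x0^2 + 10 * x1^2) - x0 - 10 * x1."""
    return [x[0] - 1.0, 10.0 * (x[1] - 1.0)]


smoothness = 10.0  # largest eigenvalue of the Hessian diag(1, 10)
minimizer = gradient_descent(quadratic_gradient, [0.0, 0.0], 1.0 / smoothness, 300)
print(minimizer)
```

With this step size, the error in the slow coordinate shrinks by the factor 1 - 1/10 at each iteration, which is the kind of linear rate expected in the strongly convex case.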
This chapter returns to the recovery of sparse vectors, but this time the linear measurements are quantized to retain only their signs. With the help of the restricted isometry property from ℓ₂ to ℓ₁, it is shown that the direction of sparse vectors can still be approximately recovered via a hard thresholding procedure or via a linear program. Furthermore, it is shown that the magnitude, too, can be recovered if an appropriate modification of the signed observations is allowed.
This appendix establishes some crucial results about eigenvalues, singular values, and matrix norms. Of particular importance are the Mirsky inequality and the von Neumann trace inequality.
This chapter presents three examples of nonconvex optimization programs that can be solved (almost) exactly. The first example concerns quadratically constrained quadratic programs, whose treatment relies on the so-called S-lemma. The second example is dynamic programming, which is utilized to compute best approximants by sparse and disjointed vectors. The third example consists of projected gradient descent algorithms, including iterative hard thresholding algorithms.
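For the third example, a projected gradient step onto the set of s-sparse vectors amounts to keeping the s largest entries in magnitude. The sketch below writes iterative hard thresholding in its basic form x <- H_s(x + step * A^T (y - A x)). The unit step size, the iteration count, and the small orthogonal matrix used in the demonstration are assumptions, and the demonstration is deliberately an easy case.

```python
def hard_threshold(x, s):
    """Keep the s entries of largest magnitude and set the others to zero."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:s])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]


def iht(A, y, s, n_iter=50, step=1.0):
    """Iterative hard thresholding for y = A x with x assumed s-sparse."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(n_iter):
        residual = [yi - sum(a * xj for a, xj in zip(row, x))
                    for row, yi in zip(A, y)]
        moved = [x[j] + step * sum(A[i][j] * residual[i] for i in range(n_rows))
                 for j in range(n_cols)]
        x = hard_threshold(moved, s)
    return x


# Easy sanity check: an orthogonal matrix, for which one step already recovers x.
A = [[0.6, -0.8, 0.0],
     [0.8, 0.6, 0.0],
     [0.0, 0.0, 1.0]]
x_true = [0.0, 2.0, 0.0]
y = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]
x_hat = iht(A, y, s=1)
print(x_hat)
```

In compressive sensing the matrix has fewer rows than columns, and whether the iteration succeeds then depends on properties of the matrix such as the restricted isometry property; this toy case does not exercise that regime.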