This final chapter covers topics that build on the material discussed in the book, with the aim of pointing to avenues for further study and research. The selection of topics is clearly a matter of personal choice, but care has been taken to present both well-established topics, such as probabilistic graphical models, and emerging ones, such as causality and quantum machine learning. The topics are distinct, and each section can be read separately. The presentation is brief, and only meant as a launching pad for exploration.
As discussed so far in this book, the standard formulation of machine learning makes the following two basic assumptions:
1. Statistical equivalence of training and testing. The statistical properties of the data observed during training match those to be experienced during testing; i.e., the population distribution underlying the generation of the data is the same during both training and testing.
2. Separation of learning tasks. Training is carried out separately for each learning task; i.e., for any new data set and/or loss function, training is viewed as a new problem to be addressed from scratch.
In this chapter, we use the optimization tools presented in Chapter 5 to develop supervised learning algorithms that move beyond the simple settings studied in Chapter 4 for which the training problem could be solved exactly, typically by addressing an LS problem. We will focus specifically on binary and multi-class classification, with a brief discussion at the end of the chapter about the (direct) extension to regression problems. Following Chapter 4, the presentation will mostly concentrate on parametric model classes, but we will also touch upon mixture models and non-parametric methods.
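The contrast the abstract draws can be sketched in a few lines: least squares regression admits an exact, closed-form training solution, while logistic classification must be trained iteratively. The data, step size, and iteration count below are illustrative assumptions, not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Least squares (LS) regression: the training problem is solved exactly.
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Binary classification via logistic regression: no closed form, so the
# training loss (cross-entropy) is minimized by gradient descent.
labels = (X @ w_true > 0).astype(float)
w = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))              # predicted probabilities
    w -= 0.1 * X.T @ (p - labels) / len(labels)   # gradient step
```
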
This chapter focuses on three key problems that underlie the formulation of many machine learning methods for inference and learning, namely variational inference (VI), amortized VI, and variational expectation maximization (VEM). We have already encountered these problems in simplified forms in previous chapters, and they will be essential in developing the more advanced techniques to be covered in the rest of the book. Notably, VI and amortized VI underpin optimal Bayesian inference, which was used, e.g., in Chapter 6 to design optimal predictors for generative models; and VEM generalizes the EM algorithm that was introduced in Chapter 7 for training directed generative latent-variable models.
The previous chapters have adopted a limited range of probabilistic models, namely Bernoulli and categorical distributions for discrete rvs and Gaussian distributions for continuous rvs. While these are common modeling choices, they clearly do not represent many important situations of interest for machine learning applications. For instance, discrete data may a priori take arbitrarily large values, making categorical models unsuitable. Continuous data may need to satisfy certain constraints, such as non-negativity, rendering Gaussian models far from ideal.
So far, this book has focused on conventional centralized learning settings in which data are collected at a central server, which carries out training. When data originate at distributed agents, such as personal devices, organizations, or factories run by different companies, this approach has two clear drawbacks:
• First, it requires transferring data from the agents to the server, which may incur a prohibitive communication load.
• Second, in the process of transferring, storing, and processing the agents’ data, sensitive information may be exposed or exploited.
The Science of Deep Learning emerged from courses taught by the author that have provided thousands of students with training and experience for their academic studies, and prepared them for careers in deep learning, machine learning, and artificial intelligence in top companies in industry and academia. The book begins by covering the foundations of deep learning, followed by key deep learning architectures. Subsequent parts on generative models and reinforcement learning may be used as part of a deep learning course or as part of a course on each topic. The book includes state-of-the-art topics such as Transformers, graph neural networks, variational autoencoders, and deep reinforcement learning, with a broad range of applications. The appendices provide equations for computing gradients in backpropagation and optimization, and best practices in scientific writing and reviewing. The text presents an up-to-date guide to the field built upon clear visualizations using a unified notation and equations, lowering the barrier to entry for the reader. The accompanying website provides complementary code and hundreds of exercises with solutions.
Beyond quantifying the amount of association between two variables, as was the goal in a previous chapter, regression analysis aims at describing that association and/or at predicting one of the variables based on the others. Examples of applications where this is needed abound in engineering and a broad range of industries. For example, in the insurance industry, when pricing a policy, the predictor variable encapsulates the available information about what is being insured, and the response variable is a measure of risk that the insurance company would take if underwriting the policy. In this context, a procedure is evaluated solely on its performance at predicting that risk, and can otherwise be very complicated and have no simple interpretation. The chapter covers both local methods such as kernel regression (e.g., local averaging) and empirical risk minimization over a parametric model (e.g., linear models fitted by least squares). Cross-validation is introduced as a method for estimating the prediction power of a given regression or classification method.
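As a minimal sketch of the local-averaging idea mentioned above, the Nadaraya–Watson kernel regression estimator predicts at a query point by a kernel-weighted average of nearby responses. The Gaussian kernel, bandwidth, and simulated data are illustrative assumptions.

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, bandwidth=0.3):
    """Nadaraya-Watson estimator: kernel-weighted local average of y_train."""
    # Gaussian kernel weight between each query point and each training point.
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(1)
x = rng.uniform(0, 2 * np.pi, 200)
y = np.sin(x) + 0.1 * rng.normal(size=200)   # noisy observations of sin(x)
x_new = np.linspace(0.5, 5.5, 20)
y_hat = kernel_regression(x, y, x_new)        # local-average predictions
```

The bandwidth controls the bias–variance trade-off; in practice it would be chosen by cross-validation, as the chapter describes.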
Measurements are often numerical in nature, which naturally leads to distributions on the real line. We start our discussion of such distributions in the present chapter, and in the process introduce the concept of a random variable, which is really a device to facilitate the writing of probability statements and the derivation of the corresponding computations. We introduce objects such as the distribution function, survival function, and quantile function, any of which characterizes the underlying distribution.
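A small sketch of these three objects for a standard normal distribution: the distribution function via the error function, the survival function as its complement, and the quantile function obtained by inverting the CDF numerically (the bisection inversion is an illustrative choice).

```python
import math

def cdf(x):
    """Distribution function F(x) = P(X <= x) for a standard normal X."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def survival(x):
    """Survival function S(x) = P(X > x) = 1 - F(x)."""
    return 1.0 - cdf(x)

def quantile(p, lo=-10.0, hi=10.0):
    """Quantile function F^{-1}(p), computed by bisection on the CDF."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Each of the three functions determines the other two, which is the sense in which any one of them characterizes the distribution.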
Some experiments lead to considering not one, but several measurements. As before, each measurement is represented by a random variable, and these are stacked into a random vector. For example, in the context of an experiment that consists in flipping a coin multiple times, we defined in a previous chapter one random variable per flip, each indicating the result of that coin flip. These are then concatenated to form a random vector, compactly describing the outcome of the entire experiment. Concepts such as conditional probability and independence are introduced.
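The coin-flip example above can be sketched directly: each coordinate of the vector is a Bernoulli random variable, and independence lets joint probabilities factorize. The sample size and fair-coin assumption are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each coordinate is a Bernoulli rv indicating one coin flip (1 = heads);
# stacking n of them gives a random vector describing the whole experiment.
n = 10
flips = rng.integers(0, 2, size=n)   # one realization of the random vector

# Independence of the coordinates lets joint probabilities factorize, e.g.
# P(all heads) = (1/2)^n for a fair coin.
p_all_heads = 0.5 ** n
```
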
We consider an experiment that yields, as data, a sample of independent and identically distributed (real-valued) random variables with a common distribution on the real line. The estimation of the underlying mean and median is discussed at length, and bootstrap confidence intervals are constructed. Tests comparing the underlying distribution to a given distribution (e.g., the standard normal distribution) or a family of distributions (e.g., the normal family of distributions) are introduced. Censoring, which is very common in some clinical trials, is briefly discussed.
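The bootstrap confidence interval mentioned above can be sketched with the percentile method: resample the data with replacement, recompute the statistic on each resample, and take empirical percentiles. The simulated data and the number of resamples are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(loc=5.0, scale=2.0, size=300)   # i.i.d. observations

# Percentile bootstrap: resample with replacement, recompute the mean.
B = 2000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(B)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])      # 95% interval for the mean
```

The same recipe applies to the median or any other statistic, simply by replacing `.mean()` with the statistic of interest.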
In this chapter we introduce some tools for sampling from a distribution. We also explain how to use computer simulations to approximate probabilities and, more generally, expectations, which can allow one to circumvent complicated mathematical derivations. The methods that are introduced include Monte Carlo sampling/integration, rejection sampling, and Markov Chain Monte Carlo sampling.
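Two of the tools named above can be sketched in a few lines: Monte Carlo integration (approximating a probability by a sample proportion) and rejection sampling (drawing from a target density via accepted uniform proposals). The target density and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Monte Carlo integration: approximate pi as 4 * P((U, V) in quarter circle).
u, v = rng.uniform(size=(2, n))
pi_hat = 4.0 * np.mean(u**2 + v**2 <= 1.0)

# Rejection sampling: draw from the density f(x) = 2x on [0, 1] using
# uniform proposals and envelope constant M = 2 (since f <= 2 on [0, 1]).
x = rng.uniform(size=n)
accept = rng.uniform(size=n) <= (2.0 * x) / 2.0   # accept w.p. f(x) / M
samples = x[accept]                                # draws from f
```
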
An expectation is simply a weighted mean, and means are at the core of Probability Theory and Statistics. In Statistics, in particular, such expectations are used to define parameters of interest. It turns out that an expectation can be approximated by an empirical average based on a sample from the distribution of interest, and the accuracy of this approximation can be quantified via what is referred to as concentration inequalities.
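The two points above, approximating an expectation by an empirical average and quantifying the accuracy with a concentration inequality, can be sketched as follows; the choice of distribution, sample size, and tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Approximate E[g(X)] by an empirical average of i.i.d. draws.
n = 100_000
x = rng.uniform(size=n)            # X ~ Uniform(0, 1)
estimate = np.mean(x**2)           # approximates E[X^2] = 1/3

# Hoeffding's inequality for [0, 1]-valued terms: the deviation exceeds t
# with probability at most 2 * exp(-2 * n * t^2).
t = 0.01
bound = 2.0 * np.exp(-2.0 * n * t**2)
```

With n = 100,000 and t = 0.01, the Hoeffding bound is already astronomically small, illustrating how fast empirical averages concentrate around the expectation.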