In this chapter, we address so-called “error generic” data poisoning (DP) attacks (hereafter called DP attacks) on classifiers. Unlike backdoor attacks, DP attacks aim to degrade overall classification accuracy. (Previous chapters were concerned with “error specific” DP attacks involving specific backdoor patterns and source and target classes for classification applications.) To effectively mislead classifier training using relatively few poisoned samples, an attacker introduces “feature collision” into the training set by, for example, flipping the class labels of clean samples. Another possibility is to poison with synthetic data not typical of any class. The information extracted from clean and poisoned samples labeled to the same class (as well as from clean samples originating from the same class as the (mislabeled) poisoned samples) is largely inconsistent, which prevents the learning of an accurate class decision boundary. We develop a BIC-based framework for both detection and cleansing of such data poisoning. This method is compared with existing DP defenses on both image and document classification domains.
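To make the feature-collision idea concrete, here is a minimal sketch (not the chapter’s BIC-based framework) of error-generic poisoning by label flipping, using scikit-learn on synthetic two-class data; the 15% poisoning fraction and the logistic-regression victim are illustrative assumptions.

```python
# Minimal sketch: error-generic poisoning by label flipping on a synthetic
# two-class problem (illustrative only; not the chapter's BIC framework).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def train_and_score(y_train):
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
    return clf.score(X_te, y_te)

# Poison: flip the labels of a small fraction of training samples,
# creating "feature collisions" between the two classes.
poison_frac = 0.15
idx = rng.choice(len(y_tr), size=int(poison_frac * len(y_tr)), replace=False)
y_poisoned = y_tr.copy()
y_poisoned[idx] = 1 - y_poisoned[idx]

print("clean-trained accuracy: ", train_and_score(y_tr))
print("poison-trained accuracy:", train_and_score(y_poisoned))
```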
In this chapter, we introduce the design of statistical anomaly detectors. We discuss the types of data – continuous, discrete categorical, and discrete ordinal features – encountered in practice. We then discuss how to model such data, in particular to form a null model for statistical anomaly detection, with emphasis on mixture densities. The EM algorithm is developed for estimating the parameters of a mixture density, with K-means shown to be a specialization of EM for Gaussian mixtures. The Bayesian information criterion (BIC), widely used for estimating the number of components in a mixture density, is discussed and developed. We also discuss parsimonious mixtures, which economize on the number of model parameters in a mixture density (by sharing parameters across components). These models allow BIC to obtain accurate model order estimates even when the feature dimensionality is huge and the number of data samples is small (a case where BIC applied to traditional mixtures grossly underestimates the model order). Key performance measures are discussed, including the true positive rate, the false positive rate, and the receiver operating characteristic (ROC) curve with its associated area under the curve (ROC AUC). The density models are used in the attack detection defenses of Chapters 4 and 13. The detection performance measures are used throughout the book.
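As a small illustration of BIC-based model-order selection for a mixture null model, the following sketch uses scikit-learn’s GaussianMixture on synthetic data; the chapter’s parsimonious mixtures are not implemented here, and the 1% detection budget is an illustrative choice.

```python
# Minimal sketch: BIC model-order selection for a Gaussian mixture null model,
# then anomaly scoring by (negative) log-likelihood under the selected model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic "clean" data drawn from a 3-component mixture.
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(300, 2)) for m in (-5.0, 0.0, 5.0)])

# Fit mixtures of increasing order and keep the one minimizing BIC.
fits = []
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    fits.append((gmm.bic(X), k, gmm))
best_bic, best_k, best_gmm = min(fits, key=lambda t: t[0])
print("BIC-selected number of components:", best_k)

# The fitted mixture serves as a null model: unusually low log-likelihood
# under it flags a sample as anomalous.
scores = -best_gmm.score_samples(X)        # per-sample anomaly scores
threshold = np.quantile(scores, 0.99)      # e.g., a ~1% false-positive budget
print("fraction flagged:", np.mean(scores > threshold))
```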
In this chapter we consider attacks that do not alter the machine learning model, but “fool” the classifier (and any supplementary defense, including human monitoring) into making erroneous decisions. These are known as test-time evasion attacks (TTEs). In addition to representing a threat, TTEs reveal the non-robustness of existing deep learning systems. One can alter the class decision made by the DNN through small changes to the input – changes that would not alter the (robust) decision-making of a human being, for example, one performing visual pattern recognition. Thus, TTEs are a foil to claims that deep learning, currently, is achieving truly robust pattern recognition, let alone that it is close to achieving true artificial intelligence. They are also a spur to the machine learning community to devise more robust pattern recognition systems. We survey various TTE attacks, including FGSM, JSMA, and CW. We then survey several types of defenses, including anomaly detection as well as robust classifier training strategies. Experiments are included for anomaly detection defenses based on classical statistical anomaly detection, as well as a class-conditional generative adversarial network, which effectively learns to discriminate “normal” from adversarial samples without any supervision (i.e., without attack examples).
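As one concrete example of a TTE, here is a minimal FGSM sketch in PyTorch; `model` is assumed to be any trained classifier taking images in [0, 1], and the perturbation budget `eps` is an illustrative choice.

```python
# Minimal FGSM sketch (one of the TTE attacks surveyed in this chapter).
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """Return an adversarial example x' with ||x' - x||_inf <= eps."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction of the sign of the loss gradient, then clamp
    # back to the valid pixel range.
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```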
In this chapter, we introduce attacks/threats against machine learning. A primary aim of an attack is to cause the neural network to make errors. An attack may target the training dataset (its integrity or privacy), the training process (deep learning), or the parameters of the DNN once trained. Alternatively, an attack may target vulnerabilities by discovering test samples that produce erroneous output. The attacks include: (i) TTEs, which make subtle changes to a test pattern, causing the classifier’s decision to change; (ii) data poisoning attacks, which corrupt the training set to degrade the accuracy of the trained model; (iii) backdoor attacks, a special case of data poisoning in which a subtle (backdoor) pattern is embedded into some training samples, with their supervising labels altered, so that the classifier learns to misclassify to a target class when the backdoor pattern is present; (iv) reverse-engineering attacks, which query a classifier to learn its decision-making rule; and (v) membership inference attacks, which seek information about the training set from queries to the classifier. Defenses aim to detect attacks and/or to proactively improve the robustness of machine learning. An overview is given of the three main types of attacks (TTEs, data poisoning, and backdoors) investigated in subsequent chapters.
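To illustrate the backdoor mechanism in (iii), the following hedged sketch poisons a training set with a pixel-patch backdoor; the array shapes, patch size, and target class are illustrative assumptions, not taken from the book.

```python
# Minimal sketch of backdoor (Trojan) training-set poisoning: a small pixel
# patch is embedded into some training images and their labels are changed
# to the attacker's target class.
import numpy as np

def poison(images, labels, frac=0.1, target_class=7, patch_value=1.0, seed=0):
    """images: (N, H, W, C) floats in [0, 1]; labels: (N,) ints."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(frac * len(images)), replace=False)
    # Embed a 3x3 backdoor patch in the bottom-right corner.
    images[idx, -3:, -3:, :] = patch_value
    # Mislabel the poisoned samples to the target class.
    labels[idx] = target_class
    return images, labels, idx
```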
In this chapter, we focus on before/during training backdoor defense, where the defender is also the training authority, with control of the training process and responsibility for providing an accurate, backdoor-free DNN classifier. Deployment of a backdoor defense during training is supported by the fact that the training authority usually has greater computational and storage resources than a downstream user of the trained classifier. Moreover, before/during training detection could be easier than post-training detection because the defender has access to the (possibly poisoned) training set and, thus, to samples that contain the backdoor pattern. However, before/during training detection is still highly challenging because it is unknown whether there is poisoning and, if so, which subset of samples (among many possible subsets) is poisoned. A detailed review of backdoor attacks (Trojans) is given, and an optimization-based reverse-engineering defense for training-set cleansing, deployed before/during classifier training, is described. The defense is designed to detect backdoor attacks that use a human-imperceptible backdoor pattern, as widely considered in existing attacks and defenses. Detection of training set poisoning is achieved by reverse engineering (estimating) the pattern of a putative backdoor attack, considering each class as the possible target class of an attack.
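The following is a minimal sketch, in the spirit of optimization-based reverse engineering (not the chapter’s exact formulation), of estimating a putative additive backdoor perturbation for one candidate target class; PyTorch, images in [0, 1], and the L1 penalty weight are assumptions.

```python
# Minimal sketch: estimate a common additive perturbation that drives a batch
# of images to a candidate target class, kept small by an L1 penalty
# (a genuine backdoor should need only a small perturbation).
import torch
import torch.nn.functional as F

def reverse_engineer(model, x_batch, target_class, steps=200, lam=1e-2, lr=0.1):
    v = torch.zeros_like(x_batch[:1], requires_grad=True)
    opt = torch.optim.Adam([v], lr=lr)
    y_t = torch.full((x_batch.shape[0],), target_class, dtype=torch.long)
    for _ in range(steps):
        opt.zero_grad()
        logits = model((x_batch + v).clamp(0.0, 1.0))
        loss = F.cross_entropy(logits, y_t) + lam * v.abs().sum()
        loss.backward()
        opt.step()
    return v.detach()
```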
Previous chapters exclusively considered attacks against classifiers. In this chapter, we devise a backdoor attack and defense for deep regression or prediction models. Such models may be used, for example, to predict housing prices in an area given measured features, to estimate a city’s power consumption on a given day, or to price financial derivatives (where they replace complex equation solvers and vastly improve the speed of inference). The developed attack is made most effective by surrounding the poisoned samples (with their mis-supervised target values) with clean samples, in order to localize the attack and thus make it evasive to detection. The developed defense uses a kind of query-by-synthesis active learning that trades off depth (local error maximizers) and breadth of search. Both the developed attack and defense are evaluated on an application domain involving the pricing of a simple (single-barrier) financial option.
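As a toy illustration of localized poisoning of a regression training set (not the chapter’s attack on deep prediction models), the sketch below surrounds a mis-supervised point with tightly clustered clean points so the corruption stays local; the k-nearest-neighbor regressor and all values are illustrative assumptions.

```python
# Minimal sketch: localized backdoor poisoning of a 1-D regression training set.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 1))
y = np.sin(2 * np.pi * X[:, 0])                 # clean regression target

x_trigger = 0.40                                # attacker's trigger input
X_poison = np.array([[x_trigger]])
y_poison = np.array([5.0])                      # mis-supervised target value
# Surround the poisoned point with clean samples to localize the corruption.
X_ring = x_trigger + rng.normal(scale=0.02, size=(30, 1))
y_ring = np.sin(2 * np.pi * X_ring[:, 0])

X_tr = np.vstack([X, X_ring, X_poison])
y_tr = np.concatenate([y, y_ring, y_poison])
model = KNeighborsRegressor(n_neighbors=3).fit(X_tr, y_tr)
print("prediction at trigger: ", model.predict([[x_trigger]])[0])       # pulled toward 5.0
print("prediction a bit away: ", model.predict([[x_trigger + 0.1]])[0])  # near the clean value
```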
In this chapter we describe unsupervised post-training defenses that do not make explicit assumptions regarding the backdoor pattern or how it was incorporated into clean samples. These backdoor defenses aim to be “universal.” They do not produce an estimate of the backdoor pattern (which may be valuable information as the basis for detecting backdoor triggers at test time, the subject of Chapter 10). We start by describing a universal backdoor detector that does not require any clean labeled data. This approach optimizes over the input image to the DNN, seeking the input that yields the maximum margin for each putative target class of an attack. The premise, under a winner-take-all decision rule, is that backdoors produce much larger classifier margins than those of un-attacked examples. A universal backdoor mitigation strategy is then described that does leverage a small clean dataset: it optimizes a threshold for each neuron in the network, tamping down unusually large ReLU activations. For each backdoor attack scenario described, different detection and mitigation strategies are compared; some mitigation strategies are also known as “unlearning” defenses. Some universal backdoor defenses modify or augment the DNN itself, while others do not.
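A minimal sketch of the maximum-margin detection idea follows: for each putative target class, search over the input for the largest achievable logit margin and flag a class whose margin is an extreme outlier relative to the others. PyTorch is assumed, and the optimizer settings are illustrative.

```python
# Minimal sketch: estimate the maximum achievable classification margin for one
# putative target class by gradient ascent over the input.
import torch

def max_margin(model, target_class, input_shape, steps=300, lr=0.05):
    x = torch.rand((1,) + input_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x.clamp(0.0, 1.0)).squeeze(0)
        others = torch.cat([logits[:target_class], logits[target_class + 1:]])
        margin = logits[target_class] - others.max()
        (-margin).backward()                    # gradient ascent on the margin
        opt.step()
    with torch.no_grad():
        logits = model(x.clamp(0.0, 1.0)).squeeze(0)
        others = torch.cat([logits[:target_class], logits[target_class + 1:]])
        return (logits[target_class] - others.max()).item()

# Usage: compute max_margin(...) for every class; a backdoored target class
# typically yields a much larger margin than the clean classes.
```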
In this chapter we focus on post-training defense against backdoor data poisoning (Trojans). The defender has access to the trained DNN but not to the training set. The following are examples. (i) Proprietary: a customized DNN model purchased by a government or company without data rights and without training-set access. (ii) Legacy: the training data is long forgotten or no longer maintained. (iii) Cell phone apps: the user has no access to the training set for the app’s classifier. It is also assumed that a clean labeled dataset (free of backdoor poisoning) is available, with a small number of examples from each class of the domain. This clean dataset is too small for retraining, and its small size makes its availability a reasonable assumption. Reverse-engineering defenses (REDs) are described, including one that estimates putative backdoor patterns for each candidate (source class, target class) backdoor pair and then assesses an order-statistic p-value on the sizes of these perturbations. This is successful at detecting subtle backdoor patterns, including sparse patterns involving few pixels and global patterns where many pixels are modified subtly. A computationally efficient variant is presented. The method addresses additive backdoor embeddings as well as other embedding functions.
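The order-statistic p-value step might look roughly like the following sketch, which fits a null density to the reciprocal perturbation sizes of all but the most suspicious class pair; the Gamma null and the example sizes are illustrative assumptions, not the book’s exact recipe.

```python
# Minimal sketch: order-statistic p-value on reverse-engineered perturbation
# sizes (one size per candidate (source, target) class pair).
import numpy as np
from scipy import stats

def order_stat_pvalue(perturbation_sizes):
    r = 1.0 / np.asarray(perturbation_sizes)   # small perturbation -> large statistic
    r_max, r_rest = r.max(), np.sort(r)[:-1]
    a, loc, scale = stats.gamma.fit(r_rest, floc=0.0)   # null fit, excluding the maximum
    # Probability that the maximum of len(r) null draws exceeds the observed maximum.
    return 1.0 - stats.gamma.cdf(r_max, a, loc=loc, scale=scale) ** len(r)

sizes = [12.0, 10.5, 11.8, 9.9, 0.8, 13.1]     # hypothetical perturbation sizes
print("p-value:", order_stat_pvalue(sizes))    # small p-value => attack detected
```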
In this chapter we focus on post-training detection of backdoor attacks that replace a patch of pixels with a common backdoor pattern, with emphasis on scene-plausible perceptible backdoor patterns. Scene-plausibility is important for a perceptible attack to be evasive to human and machine-based detection, whether the attack is physical or digital. Though the focus is on image classification, the methodology could be applied to audio, where “scene-plausibility” means the backdoor pattern does not sound artificial or incongruous amongst the other sounds in the audio clip. For the Neural Cleanse method, the common backdoor pattern may be scene-plausible or incongruous; in the latter case, backdoor trigger images (at test time) might be noticed by a human, thus thwarting the attack. The focus here is on defending against patch attacks that are scene-plausible, meaning that the backdoor pattern cannot in general be embedded into the same location in every (poisoned) image. For example, a rainbow (one of the attack patterns) must be embedded in the sky (and this location may vary). The main method described builds on RED. It exploits both the need for scene-plausibility and attack “durability” – that the backdoor trigger remains effective in the presence of noise and occlusion.
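As a simple illustration, the sketch below embeds a perceptible patch at a per-image variable location, with a placeholder constraint standing in for a real scene-plausibility check (e.g., “in the sky”); shapes and the allowed region are assumptions.

```python
# Minimal sketch: embed a perceptible backdoor patch at a variable location.
import numpy as np

def embed_patch(image, patch, rng):
    """image: (H, W, C); patch: (h, w, C); returns a poisoned copy."""
    H, W, _ = image.shape
    h, w, _ = patch.shape
    # Placeholder for a scene-plausibility constraint: here, only the top
    # third of the image (e.g., a sky region) is allowed.
    r = rng.integers(0, max(1, H // 3 - h))
    c = rng.integers(0, W - w)
    out = image.copy()
    out[r:r + h, c:c + w, :] = patch
    return out
```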
In this chapter we provide an introduction to deep learning. This includes pattern recognition concepts, neural network architectures, basic optimization techniques (as used by gradient-based deep learning algorithms), and various deep learning paradigms, for example, for coping with limited labeled training data and for improving the representation power of embedded deep features. The former include semi-supervised learning, transfer learning, and contrastive learning. The latter include mainstays of deep learning such as convolutional layers, pooling layers, ReLU activations, dropout layers, attention mechanisms, and transformers. Gated recurrent neural networks (such as LSTMs) are not discussed in depth because they are not used in subsequent chapters. Some topics introduced in this chapter, such as neural network inversion and robust classifier training strategies, are revisited frequently in subsequent chapters, as they form the basis both for attacks against deep learning and for defenses against such attacks.
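For concreteness, here is a minimal PyTorch sketch of a small image classifier assembled from the building blocks named above (convolutional layers, pooling, ReLU, dropout); the layer sizes and 32x32 input are illustrative.

```python
# Minimal sketch: a small CNN classifier built from standard deep learning blocks.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.5),
            nn.Linear(64 * 8 * 8, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = SmallCNN()(torch.randn(4, 3, 32, 32))            # shape (4, 10)
```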
In this chapter, we focus on post-training backdoor defense for classification problems involving only a few classes, particularly just two classes (K = 2), and with arbitrary numbers of backdoor attacks, including different backdoor patterns with the same source and/or target classes. In Chapter 6, null models were estimated using (K – 1)² statistics. For K = 2, only one such statistic is available, which is insufficient for estimating a null density model. Thus, that detection inference approach cannot be directly applied in the two-class case. Other detection statistics, such as the median absolute deviation (MAD) statistic used by Neural Cleanse, are also unsuitable for the two-class case. The developed method relies on high transferability of putative backdoor patterns that are estimated sample-wise; that is, a perturbation specifically designed to cause one sample to be misclassified also induces other (neighboring) samples to be misclassified. Intriguingly, the proposed method works effectively with a common (theoretically derived) detection threshold, irrespective of the classification domain and the particular attack. This is significant, as it may be difficult in practice to set the detection threshold for any method. The proposed method can be applied to various attack embedding functions (additive, patch, multiplicative, etc.).
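A rough sketch of the transferability statistic follows: a perturbation estimated to flip one sample’s decision is applied to other samples of the same class, and the fraction also flipped is measured (high transferability being suspicious). PyTorch is assumed, and the simple gradient-based perturbation estimator is a placeholder, not the chapter’s estimator.

```python
# Minimal sketch: per-sample perturbation estimation and its transferability.
import torch
import torch.nn.functional as F

def per_sample_perturbation(model, x, target_class, steps=50, lr=0.05):
    """x: a single-sample batch (1, C, H, W); returns a perturbation tensor."""
    v = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([v], lr=lr)
    y_t = torch.tensor([target_class])
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model((x + v).clamp(0, 1)), y_t).backward()
        opt.step()
    return v.detach()

def transferability(model, v, x_others, target_class):
    """Fraction of other samples pushed to the target class by the same v."""
    preds = model((x_others + v).clamp(0, 1)).argmax(dim=1)
    return (preds == target_class).float().mean().item()   # high => suspicious
```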
Previous chapters considered detection of backdoors before/during training and post-training. Here, our objective is to detect use of a backdoor trigger operationally, that is, at test time. Such detection may prevent potentially catastrophic decisions, as well as catch culprits in the act of exploiting a learned backdoor mapping. We also refer to such detection as “in-flight.” A likelihood-based backdoor trigger detector is developed and compared against other detectors.
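One way such a likelihood-based detector might be structured is sketched below: a null density is fit to an internal-layer representation of clean samples, and a test input whose representation has unusually low likelihood is flagged; the GMM null, the feature extractor, and the threshold rule are assumptions.

```python
# Minimal sketch: likelihood-based in-flight trigger detection on
# internal-layer features of the deployed classifier.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_null(clean_features, n_components=5):
    """clean_features: (N, d) internal-layer activations of clean samples."""
    return GaussianMixture(n_components=n_components, random_state=0).fit(clean_features)

def is_trigger(null_model, test_feature, threshold):
    log_lik = null_model.score_samples(test_feature.reshape(1, -1))[0]
    return log_lik < threshold      # low likelihood under the null => flag

# The threshold can be set from clean data, e.g., the 1st percentile of
# clean-sample log-likelihoods for a ~1% false-positive rate.
```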
Backdoor attacks have been considered in non-image data domains, including speech and audio, text, and regression applications (Chapter 12). In this chapter, we consider classification of point cloud data, for example, LiDAR data used by autonomous vehicles. Point cloud data differs significantly from images, representing a given scene/object by a collection of points in 3D (or higher-dimensional) space. Accordingly, point cloud DNN classifiers (such as PointNet) deviate significantly from the DNN architectures commonly used for image classification, so backdoor (as well as test-time evasion) attacks also need to be customized to the nature of the point cloud data. Such attacks typically involve adding points, deleting points, or modifying (transforming) the points representing a given scene/object. While test-time evasion attacks against point cloud classifiers have previously been proposed, in this chapter we develop backdoor attacks against point cloud classifiers, based on insertion of points designed to defeat the classifier as well as anomaly detectors that identify and remove point outliers. We also devise a post-training detector designed to defeat this attack, as well as other point cloud backdoor attacks.
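As a toy illustration of a point-insertion backdoor, the sketch below adds a tight cluster of points near an existing surface point so that a simple statistical outlier-removal preprocessor is less likely to strip it; shapes and parameters are illustrative assumptions, not the chapter’s attack.

```python
# Minimal sketch: insert a small cluster of backdoor points into a point cloud.
import numpy as np

def insert_backdoor_points(points, n_insert=16, spread=0.05, seed=0):
    """points: (N, 3) array; returns the point cloud with inserted points."""
    rng = np.random.default_rng(seed)
    anchor = points[rng.integers(len(points))]
    # A tight cluster close to an existing surface point, so the inserted
    # points are less likely to be flagged as statistical outliers.
    cluster = anchor + rng.normal(scale=spread, size=(n_insert, 3))
    return np.vstack([points, cluster])
```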