This enthusiastic introduction to the fundamentals of information theory builds from classical Shannon theory through to modern applications in statistical learning, equipping students with a uniquely well-rounded and rigorous foundation for further study. It introduces core topics such as data compression, channel coding, and rate-distortion theory using a distinctive finite block-length approach. With over 210 end-of-part exercises and numerous examples, students are introduced to contemporary applications in statistics, machine learning, and modern communication theory. This textbook presents information-theoretic methods with applications in statistical learning and computer science, such as f-divergences, PAC-Bayes and the variational principle, Kolmogorov's metric entropy, strong data processing inequalities, and entropic upper bounds for statistical estimation. Accompanied by a solutions manual for instructors and additional standalone chapters on more specialized topics in information theory, this is the ideal introductory textbook for senior undergraduate and graduate students in electrical engineering, statistics, and computer science.
Word embeddings are now a vital resource for social science research. However, obtaining high-quality training data for non-English languages can be difficult, and fitting embeddings to them may be computationally expensive. In addition, social scientists typically want to make statistical comparisons and do hypothesis tests on embeddings, yet this is nontrivial with current approaches. We provide three new data resources designed to address this combination of issues: (1) a new version of fastText model embeddings, (2) a multilanguage “à la carte” (ALC) embedding version of the fastText model, and (3) a multilanguage ALC embedding version of the well-known GloVe model. All three are fit to Wikipedia corpora. These materials are aimed at “low-resource” settings where analysts lack access to large corpora in their language of interest or to the computational resources required to produce high-quality vector representations. We make these resources available for 40 languages, along with a code pipeline for another 117 languages available from Wikipedia corpora. We extensively validate the materials via reconstruction tests and other proofs of concept. We also conduct human crowdworker tests of our embeddings for Arabic, French, (traditional Mandarin) Chinese, Japanese, Korean, Russian, and Spanish. Finally, we offer some advice to practitioners using our resources.
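As a rough illustration of the ALC idea (not the authors' released pipeline), the sketch below induces an embedding for a word by averaging the pretrained vectors of its context words and applying a linear transform; the matrix `A`, the vocabulary, and the dimensions are placeholder assumptions.

```python
# Minimal sketch of "a la carte" (ALC) embedding induction, assuming a
# pretrained embedding matrix and a learned linear transform A (here random
# placeholders). The released resources ship transforms fitted to Wikipedia.
import numpy as np

rng = np.random.default_rng(0)
dim = 300
vocab = {"river": 0, "bank": 1, "money": 2, "water": 3}
E = rng.standard_normal((len(vocab), dim))   # pretrained (e.g. fastText/GloVe) vectors
A = rng.standard_normal((dim, dim))          # ALC induction matrix (normally learned)

def alc_embed(context_tokens):
    """ALC embedding: transform the average of the context word vectors."""
    ids = [vocab[t] for t in context_tokens if t in vocab]
    context_avg = E[ids].mean(axis=0)
    return A @ context_avg

# Embed a target word from the contexts in which it occurs.
v = alc_embed(["river", "water", "bank"])
print(v.shape)  # (300,)
```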
Peatlands, covering approximately one-third of global wetlands, provide various ecological functions but are highly vulnerable to climate change, and their changes in space and time require monitoring. The sub-Antarctic Prince Edward Islands (PEIs) are a key conservation area for South Africa, as well as for the preservation of terrestrial ecosystems in the region. The peatlands (mires) found there are threatened by climate change, yet the factors governing their distribution are poorly understood. This study attempted to predict mire distribution on the PEIs using species distribution models (SDMs) employing multiple regression-based and machine-learning models. The random forest model performed best. The key influencing factors were the Normalized Difference Water Index and slope, with low annual mean temperature, precipitation seasonality, and distance from the coast being less influential. Despite moderate predictive ability, the model could only identify general areas of mires, not specific ones. This study therefore showed limited support for the use of SDMs in predicting mire distributions on the sub-Antarctic PEIs. We recommend refining the criteria used to select environmental factors and enhancing the geospatial resolution of the data to improve the predictive accuracy of the models.
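A minimal sketch of the kind of SDM workflow described above, using scikit-learn's random forest on fabricated presence/absence data; the predictor names echo the study's variables, but the data and the generating rule are invented for illustration.

```python
# Illustrative random-forest SDM on synthetic presence/absence data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "ndwi": rng.uniform(-1, 1, n),            # Normalized Difference Water Index
    "slope_deg": rng.uniform(0, 40, n),
    "annual_mean_temp": rng.uniform(4, 8, n),
    "precip_seasonality": rng.uniform(5, 30, n),
    "dist_to_coast_km": rng.uniform(0, 10, n),
})
# Toy rule: mires favour wet, flat ground (for demonstration only).
y = ((X["ndwi"] > 0.2) & (X["slope_deg"] < 10)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
rf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
print(dict(zip(X.columns, rf.feature_importances_)))  # variable importance
```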
Developing large-eddy simulation (LES) wall models for separated flows is challenging. We propose to address this issue by combining data from separated flows, for which existing theories are not applicable, with established knowledge of wall-bounded flows (such as the law of the wall) through embedded learning. The proposed features-embedded-learning (FEL) wall model comprises two submodels: one for predicting the wall shear stress and another for calculating the eddy viscosity at the first off-wall grid nodes. We train the former using wall-resolved LES (WRLES) data of the periodic hill flow and the law of the wall. For the latter, we propose a modified mixing-length model, with the model coefficient trained using the ensemble Kalman method. The proposed FEL model is assessed on separated flows with different flow configurations, grid resolutions and Reynolds numbers. Overall good a posteriori performance is observed for predicting the statistics of the recirculation bubble, wall stresses and turbulence characteristics. The statistics of the modelled subgrid-scale (SGS) stresses at the first off-wall grid nodes are compared with those calculated using the WRLES data. The comparison shows that the amplitude and distribution of the SGS stresses and energy transfer obtained using the proposed model agree better with the reference data than those from the conventional SGS model.
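The following is a hedged sketch of the two ingredients named above, under assumed functional forms: a mixing-length eddy viscosity with a tunable coefficient, and a scalar ensemble Kalman update of that coefficient against a reference value standing in for WRLES data. The paper's actual model and training procedure are more elaborate.

```python
# Mixing-length eddy viscosity with a coefficient trained by an
# ensemble Kalman update; all numbers are illustrative.
import numpy as np

KAPPA = 0.41

def eddy_viscosity(c, y, dudy):
    """Modified mixing-length model evaluated at the first off-wall node."""
    return (c * KAPPA * y) ** 2 * abs(dudy)

rng = np.random.default_rng(2)
y1, dudy = 0.01, 120.0             # first off-wall height and velocity gradient (toy)
nu_t_ref, obs_var = 0.9e-3, 1e-8   # "reference" value playing the role of WRLES data

C = rng.normal(1.0, 0.2, size=50)  # ensemble of model coefficients
for _ in range(20):                # iterative ensemble Kalman update
    h = np.array([eddy_viscosity(c, y1, dudy) for c in C])  # predicted observable
    gain = np.cov(C, h)[0, 1] / (h.var() + obs_var)
    C = C + gain * (nu_t_ref + rng.normal(0, np.sqrt(obs_var), C.size) - h)
print("trained coefficient:", C.mean())
```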
A biofilm refers to an intricate community of microorganisms firmly attached to surfaces and enveloped within a self-generated extracellular matrix. Machine learning (ML) methodologies have been harnessed across diverse facets of biofilm research, encompassing predictions of biofilm formation, identification of pivotal genes, and the formulation of novel therapeutic approaches. This investigation undertook a bibliographic analysis focused on ML applications in biofilm research, aiming to present a comprehensive overview of the field’s current status. Our exploration involved searching the Web of Science database for articles incorporating the term “machine learning biofilm,” leading to the identification and analysis of 126 pertinent articles. Our findings indicate a substantial upswing in the publication count concerning ML in biofilm research over the last decade, underscoring an escalating interest in deploying ML techniques for biofilm investigations. The analysis further disclosed prevalent research themes, predominantly revolving around biofilm formation, prediction, and control. Notably, artificial neural networks and support vector machines emerged as the most frequently employed ML techniques in biofilm research. Overall, our study furnishes valuable insights into prevailing trends and future trajectories within the realm of ML applied to biofilm research. It underscores the significance of collaborative efforts between biofilm researchers and ML experts, advocating for interdisciplinary synergy to propel innovation in this domain.
This study reveals the morphological evolution of a splashing drop by a newly proposed feature extraction method, and a subsequent interpretation of the classification of splashing and non-splashing drops performed by an explainable artificial intelligence (XAI) video classifier. Notably, the values of the weight matrix elements of the XAI that correspond to the extracted features are found to change with the temporal evolution of the drop morphology. We compute the rate of change of the contributions of each frame with respect to the classification value of a video as an importance index to quantify the contributions of the extracted features at different impact times to the classification. Remarkably, the rate computed for the extracted splashing features of ethanol and 1 cSt silicone oil is found to have a peak value at the early impact times, while the extracted features of 5 cSt silicone oil are more obvious at a later time when the lamella is more developed. This study provides an example that clarifies the complex morphological evolution of a splashing drop by interpreting the XAI.
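A toy numpy sketch of the importance-index idea, under the assumption that per-frame contributions come from a weight-matrix row applied to the extracted features; the paper's XAI architecture is more involved, so the shapes and names here are illustrative.

```python
# Per-frame contributions to a video classification score, and an
# importance index as their rate of change over impact time.
import numpy as np

rng = np.random.default_rng(3)
n_frames, n_feat = 30, 16
features = rng.standard_normal((n_frames, n_feat))  # extracted features per frame
w = rng.standard_normal(n_feat)                     # classifier weight-matrix row

contrib = features @ w          # contribution of each frame to the class score
video_score = contrib.sum()     # classification value of the whole video

# Importance index: rate of change of frame contributions with respect to the
# video's classification value (a simple normalized finite difference in time).
importance = np.gradient(contrib) / video_score
peak_frame = int(np.argmax(np.abs(importance)))
print("most influential impact time (frame):", peak_frame)
```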
Anticipating future migration trends is instrumental to the development of effective policies to manage the challenges and opportunities that arise from population movements. However, anticipation is challenging. Migration is a complex system, with multifaceted drivers, such as demographic structure, economic disparities, political instability, and climate change. Measurements encompass inherent uncertainties, and the majority of migration theories are either under-specified or hardly actionable. Moreover, approaches for forecasting generally target specific migration flows, and this poses challenges for generalisation.
In this paper, we present the results of a case study predicting Irregular Border Crossings (IBCs) through the Central Mediterranean Route and asylum requests in Italy. We applied a set of machine learning techniques in combination with a suite of traditional data sources to forecast migration flows. We then applied an ensemble modelling approach that aggregates the results of the different machine learning models to improve predictive capacity.
Our results show the potential of this modelling architecture for producing forecasts of IBCs and asylum requests over a six-month horizon. On a validation set, the explained variance of our models is as high as 80%. This study offers a robust basis for the construction of timely forecasts. In the discussion, we comment on how this approach could benefit migration management in the European Union at various levels of policy making.
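A minimal sketch of the aggregation step, assuming the base learners' forecasts are simply averaged (one common ensemble scheme); the drivers, models, and data below are placeholders.

```python
# Simple-average ensemble of two regressors, scored by explained variance
# on a held-out validation set.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 8))                            # drivers (economic, political, ...)
y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(0, 0.3, 300)  # toy monthly IBC counts

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=4)
models = [RandomForestRegressor(random_state=4),
          GradientBoostingRegressor(random_state=4)]
preds_va = np.column_stack([m.fit(X_tr, y_tr).predict(X_va) for m in models])

ensemble = preds_va.mean(axis=1)                 # aggregate the base forecasts
print("explained variance:", explained_variance_score(y_va, ensemble))
```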
Public procurement is a fundamental aspect of public administration. Its vast size makes its oversight and control very challenging, especially in countries where resources for these activities are limited. To support decisions and operations at public procurement oversight agencies, we developed and delivered VigIA, a data-based tool with two main components: (i) machine learning models to detect inefficiencies measured as cost overruns and delivery delays, and (ii) risk indices to detect irregularities in the procurement process. These two components cover complementary aspects of the procurement process, considering both active and passive waste, and help the oversight agencies to prioritize investigations and allocate resources. We show how the models developed shed light on specific features of the contracts to be considered and how their values signal red flags. We also highlight how these values change when the analysis focuses on specific contract types or on information available for early detection. Moreover, the models and indices developed only make use of open data and target variables generated by the procurement processes themselves, making them ideal to support continuous decisions at overseeing agencies.
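As a loose illustration of the risk-index component, the sketch below scores contracts with generic procurement red flags and weights; VigIA's actual flags, weights, and data schema are not given in this abstract, so everything here is assumed.

```python
# Toy weighted red-flag risk index over open contracting data.
import pandas as pd

contracts = pd.DataFrame({
    "contract_id": ["c1", "c2", "c3"],
    "n_bidders": [1, 5, 3],
    "ad_period_days": [2, 30, 15],
    "cost_overrun_pct": [45.0, 0.0, 5.0],
})

flags = pd.DataFrame({
    "single_bidder": contracts["n_bidders"] == 1,
    "short_ad_period": contracts["ad_period_days"] < 7,
    "large_overrun": contracts["cost_overrun_pct"] > 20,
})
weights = {"single_bidder": 0.4, "short_ad_period": 0.3, "large_overrun": 0.3}

contracts["risk_index"] = sum(flags[f] * w for f, w in weights.items())
print(contracts.sort_values("risk_index", ascending=False))  # prioritize reviews
```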
Focusing on methods for data that are ordered in time, this textbook provides a comprehensive guide to analyzing time series data using modern techniques from data science. It is specifically tailored to economics and finance applications, aiming to provide students with rigorous training. Chapters cover Bayesian approaches, nonparametric smoothing methods, machine learning, and continuous time econometrics. Theoretical and empirical exercises, concise summaries, bolded key terms, and illustrative examples are included throughout to reinforce key concepts and bolster understanding. Ancillary materials include an instructor's manual with solutions and additional exercises, PowerPoint lecture slides, and datasets. With its clear and accessible style, this textbook is an essential tool for advanced undergraduate and graduate students in economics, finance, and statistics.
Active flow control based on reinforcement learning has received much attention in recent years. However, the substantial data required for trial-and-error training of reinforcement learning policies has posed a significant impediment to their practical application, and also limits the training of cross-case agents. This study proposes an in-context active flow control policy learning framework grounded in reinforcement learning data. A transformer-based policy improvement operator is set up to model the reinforcement learning process as a causal sequence and to autoregressively generate actions, given a sufficiently long context, on new unseen cases. In flow separation problems, this framework demonstrates the capability to successfully learn and apply efficient flow control strategies across various airfoil configurations. Compared with general reinforcement learning, this learning mode, which requires no updates to the network parameters, is even more efficient. This study presents an effective novel technique that uses a single transformer model to address the active flow control problem of flow separation on different airfoils. Additionally, the study provides an innovative demonstration of incorporating reinforcement-learning-based flow control into aerodynamic shape optimization, leading to a collective enhancement in performance. This method efficiently lessens the training burden of the new flow control policy during shape optimization and opens up a promising avenue for interdisciplinary intelligent co-design of future vehicles.
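A hedged PyTorch sketch of the central idea: a causal transformer that consumes an RL history of (state, action, reward) tokens and emits the next control action with no parameter update at deployment. The dimensions, token layout, and architecture size are assumptions for illustration.

```python
# Causal-transformer policy that maps an in-context RL history to an action.
import torch
import torch.nn as nn

class InContextPolicy(nn.Module):
    def __init__(self, obs_dim=8, act_dim=2, d_model=64, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(obs_dim + act_dim + 1, d_model)  # (s, a, r) tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, context):              # context: (B, T, obs+act+1)
        T = context.shape[1]
        # Causal mask: each step attends only to earlier steps.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(context), mask=mask)
        return self.head(h[:, -1])           # action for the current step

policy = InContextPolicy()
context = torch.randn(1, 50, 8 + 2 + 1)      # 50 steps of (state, action, reward)
action = policy(context)                     # no gradient update needed at deployment
print(action.shape)                          # torch.Size([1, 2])
```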
When they occur, azimuthal thermoacoustic oscillations can detrimentally affect the safe operation of gas turbines and aeroengines. We develop a real-time digital twin of the azimuthal thermoacoustics of a hydrogen-based annular combustor. The digital twin seamlessly combines two sources of information about the system: (i) a physics-based low-order model; and (ii) raw and sparse experimental data from microphones, which contain both aleatoric noise and turbulent fluctuations. First, we derive a deterministic low-order thermoacoustic model for azimuthal instabilities. Second, we propose a real-time data assimilation framework to infer the acoustic pressure, the physical parameters, and the model bias and measurement shift simultaneously. This is the bias-regularized ensemble Kalman filter, for which we find an analytical solution to the optimization problem. Third, we propose a reservoir computer, which infers both the model bias and the measurement shift to close the assimilation equations. Fourth, we build a real-time digital twin of the azimuthal thermoacoustic dynamics of a laboratory hydrogen-based annular combustor for a variety of equivalence ratios. We find that the real-time digital twin (i) autonomously predicts azimuthal dynamics, in contrast to bias-unregularized methods; (ii) uncovers the physical acoustic pressure from the raw data, i.e. it acts as a physics-based filter; and (iii) is a time-varying parameter system, generalizing existing models that have constant parameters and capture only slowly varying variables. The digital twin generalizes to all equivalence ratios, bridging a gap left by existing models. This work opens new opportunities for real-time digital twinning of multi-physics problems.
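A simplified sketch of the assimilation backbone: a stochastic ensemble Kalman update over an augmented state-and-parameter ensemble. The bias and measurement-shift regularization that define the paper's filter are omitted here for brevity.

```python
# Stochastic ensemble Kalman filter update on an augmented
# state-and-parameter vector; all values are illustrative.
import numpy as np

rng = np.random.default_rng(5)
m, n_obs, N = 6, 2, 64            # state+parameter size, observations, ensemble size
H = np.zeros((n_obs, m)); H[0, 0] = H[1, 1] = 1.0   # observe two pressure states
R = 1e-2 * np.eye(n_obs)          # measurement noise covariance

X = rng.standard_normal((m, N))   # forecast ensemble (columns are members)
y = np.array([0.8, -0.3])         # microphone data at this assimilation step

Xm = X.mean(axis=1, keepdims=True)
A = X - Xm
P = A @ A.T / (N - 1)             # sample covariance
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)        # Kalman gain

Y = y[:, None] + rng.multivariate_normal(np.zeros(n_obs), R, N).T  # perturbed obs
Xa = X + K @ (Y - H @ X)          # analysis ensemble: updated states and parameters
print(Xa.mean(axis=1))
```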
The identification of predictors of treatment response is crucial for improving treatment outcome for children with anxiety disorders. Machine learning methods provide opportunities to identify combinations of factors that contribute to risk prediction models.
Methods
A machine learning approach was applied to predict anxiety disorder remission in a large sample of 2114 anxious youth (5–18 years). Potential predictors included demographic, clinical, parental, and treatment variables, with data collected pre-treatment, post-treatment, and at one or more follow-ups.
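A schematic sketch of this modelling step on synthetic data: fit a classifier on pre-treatment predictors and score remission with AUC. The predictor names only echo the variable families above; the data and effect sizes are fabricated.

```python
# Toy remission classifier evaluated with AUC.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n = 2000
X = pd.DataFrame({
    "age": rng.integers(5, 19, n),
    "n_anxiety_disorders": rng.integers(1, 5, n),
    "comorbid_depression": rng.integers(0, 2, n),
    "parent_anxiety_score": rng.normal(50, 10, n),
    "group_treatment": rng.integers(0, 2, n),
})
# Toy outcome: remission less likely with more disorders and higher parent anxiety.
logit = -0.3 * X["n_anxiety_disorders"] - 0.02 * (X["parent_anxiety_score"] - 50) + 0.8
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=6)
clf = GradientBoostingClassifier(random_state=6).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```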
Results
All machine learning models performed similarly for remission outcomes, with AUC between 0.67 and 0.69. There was significant alignment between the factors that contributed to the models predicting two target outcomes: remission of all anxiety disorders and the primary anxiety disorder. Children who were older, had multiple anxiety disorders, comorbid depression, comorbid externalising disorders, received group treatment and therapy delivered by a more experienced therapist, and who had a parent with higher anxiety and depression symptoms, were more likely than other children to still meet criteria for anxiety disorders at the completion of therapy. In both models, the absence of a social anxiety disorder and being treated by a therapist with less experience contributed to the model predicting a higher likelihood of remission.
Conclusions
These findings underscore the utility of prediction models that may indicate which children are more likely to remit or are more at risk of non-remission following CBT for childhood anxiety.
This experimental study employs Bayesian optimisation to maximise the cross-flow (transverse) flow-induced vibration (FIV) of an elastically mounted thin elliptical cylinder by implementing axial (or angular) flapping motions. The flapping amplitude was proportional to the vibration amplitude, with a relative phase angle imposed between the angular and transverse displacements of the cylinder. The control parameter space spanned the ranges of proportional gain and phase difference $0 \leqslant K_p^* \leqslant 5$ and $0 \leqslant \phi_d \leqslant 360^\circ$, respectively, over a reduced velocity range of $3.0 \leqslant {U^*} = U/({{f_{nw}}} b) \leqslant 8.5$. The corresponding Reynolds number range was $1250 \leqslant {{Re}} = (U b)/\nu \leqslant 3580$. Here, $U$ is the free stream velocity, $b$ is the major cross-sectional diameter of the cylinder, ${{f_{nw}}}$ is the natural frequency of the system in quiescent fluid (water) and $\nu$ is the kinematic viscosity of the fluid. It was found that the controlled body rotation extended the wake-body synchronisation across the entire ${U^*}$ range tested, with a larger amplitude response than the non-rotating case for all flow speeds. Interestingly, two new wake-body synchronisation regimes were identified, which have not been reported in previous studies. As this geometry acts as a ‘hard-oscillator’ for ${U^*} \geqslant 6.3$, an adaptive gain (i.e. one that varies as a function of oscillation amplitude) was also implemented, allowing the body vibration, achieved for a non-rotating cylinder using increasing ${U^*}$ increments, to be excited from rest. The findings of the present study hold potential implications for the use of FIV as a means to efficiently extract energy from free-flowing water sources, a topic of increasing interest over the last decade.
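A sketch of the optimisation loop using scikit-optimize, with a stand-in analytic objective in place of the measured vibration amplitude; in the experiment each evaluation corresponds to a water-channel run at the commanded gain and phase.

```python
# Bayesian optimisation over the two control parameters (K_p*, phi_d).
import numpy as np
from skopt import gp_minimize

def neg_amplitude(params):
    """Placeholder for the measured cross-flow vibration amplitude A*(K_p*, phi_d)."""
    kp, phi_d = params
    return -(kp * np.exp(-0.5 * (kp - 3) ** 2)
             * (1 + np.cos(np.radians(phi_d - 200))))

res = gp_minimize(
    neg_amplitude,                            # minimise the negative amplitude
    dimensions=[(0.0, 5.0), (0.0, 360.0)],    # K_p* and phi_d (degrees)
    n_calls=40,
    random_state=7,
)
print("best gain K_p*, phase phi_d:", res.x, "amplitude:", -res.fun)
```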
The performance of, and confidence in, fault detection and diagnostic systems can be undermined by data pipelines that feature multiple compounding sources of uncertainty. These issues further inhibit the deployment of data-based analytics in industry, where variable data quality and lack of confidence in model outputs are already barriers to their adoption. The methodology proposed in this paper supports trustworthy data pipeline design and transfers knowledge gained from one fully observed data pipeline to a similar, under-observed case. The transfer of uncertainties provides insight into uncertainty drivers without repeating the computational or cost overhead of fully redesigning the pipeline. A SHAP-based, human-readable explainable AI (XAI) framework was used to rank and explain the impact of each choice in a data pipeline, allowing positive and negative performance drivers to be decoupled and facilitating the selection of highly performing pipelines. This empirical approach is demonstrated in bearing fault classification case studies using well-understood open-source data.
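A hedged sketch of the SHAP-based ranking: encode each pipeline design choice as a feature of a model that predicts pipeline performance, then rank the choices by mean absolute SHAP value. The choice names and data are invented.

```python
# Rank pipeline design choices by mean |SHAP| value.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
n = 400
pipelines = pd.DataFrame({
    "window_length": rng.choice([256, 512, 1024], n),
    "use_envelope_spectrum": rng.integers(0, 2, n),
    "n_features_selected": rng.integers(5, 50, n),
})
# Toy performance: envelope spectrum helps, feature count helps slightly.
f1 = (0.5 + 0.3 * pipelines["use_envelope_spectrum"]
      + 0.0005 * pipelines["n_features_selected"]
      + rng.normal(0, 0.05, n)).clip(0, 1)

model = RandomForestRegressor(random_state=8).fit(pipelines, f1)
shap_values = shap.TreeExplainer(model).shap_values(pipelines)
ranking = pd.Series(np.abs(shap_values).mean(axis=0), index=pipelines.columns)
print(ranking.sort_values(ascending=False))  # impact of each pipeline choice
```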
Modern machine-learning techniques are generally considered data-hungry. However, this may not be the case for turbulence, as each of its snapshots can hold more information than a single data file in general machine-learning settings. This study asks whether nonlinear machine-learning techniques can effectively extract physical insights from as little as a single snapshot of turbulent flow. As an example, we consider machine-learning-based super-resolution analysis, which reconstructs a high-resolution field from low-resolution data, for two examples: two-dimensional isotropic turbulence and three-dimensional turbulent channel flow. First, we reveal that a carefully designed machine-learning model trained with flow tiles sampled from only a single snapshot can reconstruct vortical structures across a range of Reynolds numbers for two-dimensional decaying turbulence. Successful flow reconstruction indicates that nonlinear machine-learning techniques can leverage scale-invariance properties to learn turbulent flows. We also show that training data for turbulent flows can be cleverly collected from a single snapshot by considering the characteristics of the rotation and shear tensors. Second, we perform the single-snapshot super-resolution analysis for turbulent channel flow, showing that it is possible to extract physical insights from a single flow snapshot even in the presence of inhomogeneity. The present findings suggest that embedding prior knowledge in designing a model and collecting data is important for a range of data-driven analyses of turbulent flows. More broadly, this work hopes to stop machine-learning practitioners from being wasteful with turbulent flow data.
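A small sketch of how a training set can be harvested from one snapshot: sample many tiles from a single 2-D field and pair each high-resolution tile with an average-pooled low-resolution input. The snapshot here is synthetic.

```python
# Build (low-res, high-res) training pairs from a single 2-D snapshot.
import numpy as np

rng = np.random.default_rng(9)
snapshot = rng.standard_normal((512, 512))   # stand-in for one vorticity field

def sample_pairs(field, n_tiles=1000, tile=32, factor=4):
    hi, lo = [], []
    for _ in range(n_tiles):
        i = rng.integers(0, field.shape[0] - tile)
        j = rng.integers(0, field.shape[1] - tile)
        t = field[i:i + tile, j:j + tile]
        hi.append(t)
        # Low-resolution input: average pooling by the downsampling factor.
        lo.append(t.reshape(tile // factor, factor,
                            tile // factor, factor).mean(axis=(1, 3)))
    return np.stack(lo), np.stack(hi)

X_lo, X_hi = sample_pairs(snapshot)
print(X_lo.shape, X_hi.shape)   # (1000, 8, 8) (1000, 32, 32)
```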
Psychotic disorders are characterized by abnormalities in the synchronization of neuronal responses. A 40 Hz gamma band deficit during the auditory steady-state response (ASSR) measured by electroencephalogram (EEG) is a robust observation in psychosis and is associated with symptoms and functional deficits. However, the majority of ASSR studies focus on specific electrode sites, while whole-scalp analyses using all channels, and their association with clinical symptoms, are rare.
Methods:
In this study, we use whole-scalp 40 Hz ASSR EEG measurements—power and phase-locking factor—to establish deficits in early-stage psychosis (ESP) subjects, classify ESP status using an ensemble of machine learning techniques, identify correlates with principal components obtained from clinical/demographic/functioning variables, and correlate these measures with functional outcome after a short-term follow-up.
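For concreteness, here is a sketch of the two per-channel measures on simulated epochs: 40 Hz evoked power and the phase-locking factor (inter-trial phase coherence) computed from the FFT of each trial.

```python
# Power and phase-locking factor at 40 Hz from epoched EEG (simulated).
import numpy as np

rng = np.random.default_rng(10)
fs, n_trials, n_samples = 500, 100, 1000        # 2 s epochs at 500 Hz
t = np.arange(n_samples) / fs
# Simulated channel: 40 Hz response with jittered phase plus noise.
phases = rng.normal(0, 0.6, n_trials)
trials = np.array([np.sin(2 * np.pi * 40 * t + p) for p in phases])
trials += rng.standard_normal(trials.shape)

spectrum = np.fft.rfft(trials, axis=1)
freqs = np.fft.rfftfreq(n_samples, 1 / fs)
k = np.argmin(np.abs(freqs - 40.0))             # 40 Hz frequency bin

power = np.mean(np.abs(spectrum[:, k]) ** 2)    # mean 40 Hz power across trials
plf = np.abs(np.mean(spectrum[:, k] / np.abs(spectrum[:, k])))  # in [0, 1]
print(f"power={power:.1f}, phase-locking factor={plf:.2f}")
```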
Results:
We identified significant spatially distributed group-level differences for power and phase locking. The performance of different machine learning techniques, and interpretation of the extracted feature importance, indicate that phase locking has a more predictive and parsimonious pattern than power. Phase locking is also associated with principal components composed of measures of cognitive processes. Short-term functional outcome is associated with baseline 40 Hz ASSR signals from the FCz and other channels, in both phase locking and power.
Conclusion:
This whole-scalp EEG study provides additional evidence linking deficits in 40 Hz ASSRs with cognition and functioning in ESP, and corroborates prior studies of phase locking from a subset of EEG channels. The confirmed 40 Hz ASSR deficit serves as a candidate phenotype for identifying circuit dysfunction and as a biomarker for clinical outcomes in psychosis.
Turbulent flows are chaotic and multi-scale dynamical systems, which have large numbers of degrees of freedom. Turbulent flows, however, can be modeled with a smaller number of degrees of freedom when using an appropriate coordinate system, which is the goal of dimensionality reduction via nonlinear autoencoders. Autoencoders are expressive tools, but they are difficult to interpret. This article proposes a method to aid the interpretability of autoencoders. First, we introduce the decoder decomposition, a post-processing method to connect the latent variables to the coherent structures of flows. Second, we apply the decoder decomposition to analyze the latent space of synthetic data of a two-dimensional unsteady wake past a cylinder. We find that the dimension of the latent space has a significant impact on the interpretability of autoencoders. We identify the physical and spurious latent variables. Third, we apply the decoder decomposition to the latent space of wind-tunnel experimental data of a three-dimensional turbulent wake past a bluff body. We show that the reconstruction error is a function of both the latent space dimension and the decoder size, which are correlated. Finally, we apply the decoder decomposition to rank and select latent variables based on the coherent structures that they represent. This is useful to filter unwanted or spurious latent variables or to pinpoint specific coherent structures of interest. The ability to rank and select latent variables will help users design and interpret nonlinear autoencoders.
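A schematic numpy version of the decoder-decomposition idea with a linear stand-in decoder: decode each latent variable in isolation and rank the latents by the energy of the structure they produce. For the trained nonlinear decoders in the paper the procedure is analogous, and the names here are illustrative.

```python
# Rank latent variables by the energy of their individual decoder modes.
import numpy as np

rng = np.random.default_rng(11)
n_latent, n_grid = 4, 1024
D = rng.standard_normal((n_grid, n_latent))   # stand-in for the trained decoder
z = rng.standard_normal(n_latent)             # latent vector for one snapshot

# Decode one latent at a time (others zeroed) to get its "decoder mode".
modes = [D @ (z * np.eye(n_latent)[i]) for i in range(n_latent)]
energy = np.array([np.sum(m ** 2) for m in modes])

ranking = np.argsort(energy)[::-1]
print("latent variables ranked by contribution:", ranking)
# Low-energy latents are candidates for spurious variables to filter out.
```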
Machine learning is increasingly being utilised across various domains of nutrition research due to its ability to analyse complex data, especially as large datasets become more readily available. However, at times, this enthusiasm has led to the adoption of machine learning techniques prior to a proper understanding of how they should be applied, leading to non-robust study designs and results of questionable validity. To ensure that research standards do not suffer, key machine learning concepts must be understood by the research community. The aim of this review is to facilitate a better understanding of machine learning in research by outlining good practices and common pitfalls in each of the steps in the machine learning process. Key themes include the importance of generating high-quality data, employing robust validation techniques, quantifying the stability of results, accurately interpreting machine learning outputs, adequately describing methodologies, and ensuring transparency when reporting findings. Achieving this aim will facilitate the implementation of robust machine learning methodologies, which will reduce false findings and make research more reliable, as well as enable researchers to critically evaluate and better interpret the findings of others using machine learning in their work.
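As one concrete example of the good practices discussed, the sketch below uses nested cross-validation so that hyperparameter tuning never leaks into the performance estimate; the dataset and model are placeholders.

```python
# Nested cross-validation: tuning in the inner loop, evaluation in the outer.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=12)

inner = GridSearchCV(                          # tuning happens inside each fold
    RandomForestClassifier(random_state=12),
    param_grid={"max_depth": [3, 5, None]},
    cv=3,
)
scores = cross_val_score(inner, X, y, cv=5)    # outer loop estimates performance
print(f"nested CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```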
We propose an unsupervised, corpus-independent method to extract keywords from a single text. It is based on the spatial distribution of words and the response of this distribution to a random permutation of words. Our method has three advantages over existing unsupervised methods (such as YAKE). First, it is significantly more effective at extracting keywords from long texts in terms of precision and recall. Second, it allows inference of two types of keywords: local and global. Third, it extracts basic topics from texts. Additionally, our method is language-independent and applies to short texts. The results are obtained via human annotators with previous knowledge of texts from our database of classical literary works. The agreement between annotators is moderate to substantial. Our results are supported via human-independent arguments based on the average length of extracted content words and on the average number of nouns in extracted words. We discuss relations of keywords with higher-order textual features and reveal a connection between keywords and chapter divisions.
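A sketch of the core mechanism on toy text: score each word by how strongly its occurrences cluster relative to random permutations of the text. The specific statistic below (normalized gap deviation minus its permutation baseline) is a common choice and an assumption, not necessarily the paper's exact formula.

```python
# Keyword scoring from the spatial distribution of word occurrences,
# benchmarked against random permutations of the text.
import numpy as np

def gap_sigma(positions):
    """Std/mean of gaps between successive occurrences (clustering measure)."""
    gaps = np.diff(np.sort(positions))
    return gaps.std() / gaps.mean() if len(gaps) > 1 and gaps.mean() > 0 else 0.0

def keyword_scores(tokens, min_count=5, n_perm=20, seed=13):
    rng = np.random.default_rng(seed)
    tokens = np.array(tokens)
    scores = {}
    for w in set(tokens.tolist()):
        pos = np.flatnonzero(tokens == w)
        if len(pos) < min_count:
            continue
        observed = gap_sigma(pos)
        # Null model: the same word count scattered by random permutation.
        null = [gap_sigma(rng.choice(len(tokens), len(pos), replace=False))
                for _ in range(n_perm)]
        scores[w] = observed - float(np.mean(null))  # excess clustering
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text = ("the whale surfaced and the crew watched the whale dive "
        "while the sea was calm and the whale was gone") * 5
print(keyword_scores(text.split())[:3])
```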