Peatlands, covering approximately one-third of global wetlands, provide various ecological functions but are highly vulnerable to climate change, so their changes in space and time require monitoring. The sub-Antarctic Prince Edward Islands (PEIs) are a key conservation area for South Africa, as well as for the preservation of terrestrial ecosystems in the region. Peatlands (mires) found here are threatened by climate change, yet their distribution factors are poorly understood. This study attempted to predict mire distribution on the PEIs using species distribution models (SDMs) employing multiple regression-based and machine-learning models. The random forest model performed best. Key influencing factors were the Normalized Difference Water Index and slope, with low annual mean temperature, precipitation seasonality and distance from the coast being less influential. Despite moderate predictive ability, the model could only identify general areas of mires, not specific ones. Therefore, this study showed limited support for the use of SDMs in predicting mire distributions on the sub-Antarctic PEIs. It is recommended to refine the criteria used to select environmental factors and enhance the geospatial resolution of the data to improve the predictive accuracy of the models.
Weeds are one of the greatest challenges to snap bean production. Anecdotal observation posits certain species frequently escape the weed management system by the time of crop harvest, hereafter called residual weeds. The objectives of this work were to 1) quantify the residual weed community in snap bean (Phaseolus vulgaris L.) grown for processing across the major growing regions in the U.S., and 2) investigate linkages between the density of residual weeds and their contributions to weed canopy cover. In surveys of 358 fields across the Northwest (NW), Midwest (MW), and Northeast (NE), residual weeds were observed in 95% of the fields. While a total of 109 species or species groups were identified, one to three species dominated the residual weed community of individual fields in most cases. It was not uncommon to have >10 weeds m⁻² with a weed canopy covering >5% of the field’s surface area. Some of the most abundant and problematic species or species groups escaping control included amaranth species (such as smooth pigweed (Amaranthus hybridus L.), Palmer amaranth (Amaranthus palmeri S. Watson), redroot pigweed (Amaranthus retroflexus L.), and waterhemp [Amaranthus tuberculatus (Moq.) J. D. Sauer]), common lambsquarters (Chenopodium album L.), large crabgrass [Digitaria sanguinalis (L.) Scop.], and ivyleaf morningglory (Ipomoea hederacea Jacq.). Emerging threats include hophornbeam copperleaf (Acalypha ostryifolia Riddell) in the MW and sharppoint fluvellin [Kickxia elatine (L.) Dumort.] in the NW. Beyond crop losses due to weed interference, the weed canopy at harvest poses a risk of contaminating snap bean products with foreign material. Random forest modeling predicts the residual weed canopy is dominated by common lambsquarters, large crabgrass, carpetweed (Mollugo verticillata L.), I. hederacea, amaranth species, and A. ostryifolia. This is the first quantitative report on the weed community escaping control in U.S. snap bean production.
The aspirations-ability framework proposed by Carling has begun to place the question of who aspires to migrate at the center of migration research. In this article, building on key determinants assumed to impact individual migration decisions, we investigate their prediction accuracy when observed in the same dataset and in different mixed-migration contexts. In particular, we use a rigorous model selection approach and develop a machine learning algorithm to analyze two original cross-sectional face-to-face surveys conducted in Turkey and Lebanon among Syrian migrants and their respective host populations in early 2021. Studying similar nationalities in two hosting contexts with a distinct history of both immigration and emigration and large shares of assumed-to-be mobile populations, we illustrate that a) (im)mobility aspirations are hard to predict even under ‘ideal’ methodological circumstances, b) commonly referenced “migration drivers” fail to perform well in predicting migration aspirations in our study contexts, while c) aspects relating to social cohesion, political representation and hope play an important role that warrants more emphasis in future research and policymaking. Methodologically, we identify key challenges in quantitative research on predicting migration aspirations and propose a novel modeling approach to address these challenges.
Deficiency of vitamin B12 (B12 or cobalamin), an essential water-soluble vitamin, leads to anaemia and to neurological damage, which can be irreversible, and is sometimes associated with chronic disorders such as osteoporosis and cardiovascular diseases. Clinical tests to detect B12 deficiency lack specificity and sensitivity. Delays in detecting B12 deficiency pose a major threat because the progressive decline in organ functions may go unnoticed until the damage is advanced or irreversible. Here, using targeted unbiased metabolomic profiling in the sera of subjects with low B12 levels v control individuals, we set out to identify biomarker(s) of B12 insufficiency. Metabolomic profiling identified seventy-seven metabolites, and partial least squares discriminant analysis and hierarchical clustering analysis showed a differential abundance of taurine, xanthine, hypoxanthine, chenodeoxycholic acid, neopterin and glycocholic acid in subjects with low B12 levels. Random forest multivariate analysis identified a taurine/chenodeoxycholic acid ratio, with an AUC score of 1, to be the best biomarker to predict low B12 levels. Mechanistic studies using a mouse model of B12 deficiency showed that B12 deficiency reshaped the transcriptomic and metabolomic landscape of the cell, identifying a downregulation of methionine, taurine, urea cycle and nucleotide metabolism and an upregulation of Krebs cycle. Thus, we propose taurine/chenodeoxycholic acid ratio in serum as a potential biomarker of low B12 levels in humans and elucidate, using a mouse model, the cellular metabolic pathways regulated by B12 deficiency.
Pinyon–juniper woodlands are dry ecosystems defined by the presence of juniper (Juniperus spp.) and pinyon pine (Pinus spp.), which stretch over 400 000 km² across 10 US states. Certain areas have become unnaturally dense and have moved into former shrublands and grasslands, while others have experienced widespread mortality. To properly manage these woodlands, sites must be evaluated individually and decisions made based on scientific information that is often not available. Many species utilize pinyon–juniper woodlands, including the pinyon jay (Gymnorhinus cyanocephalus), named for its mutualism with pinyon pine, whose population has declined by c. 2.2% per year from 1966 to 2022, an overall decrease of c. 71%. To increase the likelihood of further research progress, we propose a tool to model the distribution of pinyon pine at a finer scale than current woodland classification tools in the northern US Great Basin: a random forest model using geographical, ecological and climate variables. Our results achieved an accuracy of 93.94%, indicating high predictive power to identify locations of pinyon pine in north-eastern Nevada, the south-eastern corner of Oregon and southern Idaho. These findings can inform managers and planners researching pinyon pine, pinyon–juniper woodlands and potentially the pinyon jay.
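The two pinyon jay decline figures quoted in the abstract above (c. 2.2% per year and c. 71% overall) can be checked against each other. The short sketch below assumes the annual rate is constant and compounds over the full 1966 to 2022 span, which the abstract does not state explicitly.

```python
# Compound annual decline: a constant yearly rate applied over a span of years.
# Assumption: the c. 2.2% rate is constant and compounds every year.
annual_rate = 0.022        # c. 2.2% per year, as quoted in the abstract
years = 2022 - 1966        # 56 years

remaining = (1 - annual_rate) ** years   # fraction of the 1966 population left
overall_decline = 1 - remaining          # fraction lost over the whole period

print(f"Remaining: {remaining:.1%}, overall decline: {overall_decline:.1%}")
```

Under that assumption the compounded loss comes out at roughly 71%, consistent with the overall decrease reported.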
Machine learning methods have been used in identifying omics markers for a variety of phenotypes. We aimed to examine whether a supervised machine learning algorithm can improve identification of alcohol-associated transcriptomic markers. In this study, we analysed array-based, whole-blood derived expression data for 17 873 gene transcripts in 5508 Framingham Heart Study participants. By using the Boruta algorithm, a supervised random forest (RF)-based feature selection method, we selected twenty-five alcohol-associated transcripts. In a testing set (30 % of entire study participants), AUC (area under the receiver operating characteristics curve) of these twenty-five transcripts were 0·73, 0·69 and 0·66 for non-drinkers v. moderate drinkers, non-drinkers v. heavy drinkers and moderate drinkers v. heavy drinkers, respectively. The AUC of the selected transcripts by the Boruta method were comparable to those identified using conventional linear regression models, for example, AUC of 1958 transcripts identified by conventional linear regression models (false discovery rate < 0·2) were 0·74, 0·66 and 0·65, respectively. With Bonferroni correction for the twenty-five Boruta method-selected transcripts and three CVD risk factors (i.e. at P < 6·7e-4), we observed thirteen transcripts were associated with obesity, three transcripts with type 2 diabetes and one transcript with hypertension. For example, we observed that alcohol consumption was inversely associated with the expression of DOCK4, IL4R, and SORT1, and DOCK4 and SORT1 were positively associated with obesity, and IL4R was inversely associated with hypertension. In conclusion, using a supervised machine learning method, the RF-based Boruta algorithm, we identified novel alcohol-associated gene transcripts.
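The Bonferroni threshold quoted in the abstract above (P < 6·7e-4) follows from dividing a family-wise significance level by the number of tests. The sketch below assumes a conventional level of 0.05, which the abstract does not state, and counts one test per transcript and risk factor pair.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

n_transcripts = 25     # Boruta-selected transcripts, from the abstract
n_risk_factors = 3     # obesity, type 2 diabetes, hypertension
alpha = 0.05           # assumed family-wise level (not stated in the abstract)

threshold = bonferroni_threshold(alpha, n_transcripts * n_risk_factors)
print(f"{threshold:.1e}")
```

With 75 tests, 0.05 / 75 rounds to the 6.7e-4 reported.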
This article provides a structured description of openly available news topics and forecasts for armed conflict at the national and grid cell level starting January 2010. The news topics, as well as the forecasts, are updated monthly at conflictforecast.org and provide coverage for more than 170 countries and about 65,000 grid cells of size 55 × 55 km worldwide. The forecasts rely on natural language processing (NLP) and machine learning techniques to leverage a large corpus of newspaper text for predicting sudden onsets of violence in peaceful countries. Our goals are a) to support conflict prevention efforts by making our risk forecasts available to practitioners and research teams worldwide, b) to facilitate additional research that can utilize risk forecasts for causal identification, and c) to provide an overview of the news landscape.
Real-time gait trajectory planning is challenging for legged robots walking on unknown terrain. In this paper, to realize a more efficient and faster motion control of a quadrupedal robot, we propose an optimized gait planning generator (GPG) based on the decision tree (DT) and random forest (RF) model of the robot leg workspace. First, the framework of this embedded GPG and some of the modules associated with it are illustrated. Aiming at the leg workspace model described by DT and RF used in GPG, this paper introduces in detail how to collect the original data needed for training the model and puts forward an Interpolation Labeling with Dilation and Erosion (ILDE) data processing algorithm. After the DT and RF models are trained, we preliminarily evaluate their performance. We then present how these models can be used to predict the location relation between a spatial point and the leg workspace based on its distributional features. The DT model takes only 0.00011 s to process a sample, while the RF model can give the prediction probability. As a complement, the PID inverse kinematic model used in GPG is also mentioned. Finally, the optimized GPG is tested during a real-time single-leg trajectory planning experiment and an unknown terrain recognition simulation of a virtual quadrupedal robot. According to the test results, the GPG shows a remarkable rapidity for processing large-scale data in the gait trajectory planning tasks, and the results can prove it has an application value for quadruped robot control.
A variable annuity is a modern life insurance product that offers its policyholders participation in investment with various guarantees. To address the computational challenge of valuing large portfolios of variable annuity contracts, several data mining frameworks based on statistical learning have been proposed in the past decade. Existing methods utilize regression modeling to predict the market value of most contracts. Despite the efficiency of those methods, a regression model fitted to a small amount of data produces substantial prediction errors, and thus, it is challenging to rely on existing frameworks when highly accurate valuation results are desired or required. In this paper, we propose a novel hybrid framework that effectively chooses and assesses easy-to-predict contracts using the random forest model while leaving hard-to-predict contracts for the Monte Carlo simulation. The effectiveness of the hybrid approach is illustrated with an experimental study.
Prospective studies on the mental health of university students highlighted a major concern. Specifically, young adults in academia are affected by markedly worse mental health status than their peers or adults in other vocations. This situation predisposes to exacerbated disability-adjusted life-years.
Methods
We enrolled 1,388 students at the baseline, 557 of whom completed follow-up after 6 months, incorporating their demographic information and self-report questionnaires on depressive, anxiety and obsessive–compulsive symptoms. We applied multiple regression modelling to determine associations – at baseline – between demographic factors and self-reported mental health measures and supervised machine learning algorithms to predict the risk of poorer mental health at follow-up, by leveraging the demographic and clinical information collected at baseline.
Results
Approximately one out of five students reported severe depressive symptoms and/or suicidal ideation. An association of economic worry with depression was evidenced both at baseline (when high-frequency worry odds ratio = 3.11 [1.88–5.15]) and during follow-up. The random forest algorithm exhibited high accuracy in predicting the students who maintained well-being (balanced accuracy = 0.85) or absence of suicidal ideation but low accuracy for those whose symptoms worsened (balanced accuracy = 0.49). The most important features used for prediction were the cognitive and somatic symptoms of depression. However, while the negative predictive value of worsened symptoms after 6 months of enrolment was 0.89, the positive predictive value was essentially null.
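A high negative predictive value can coexist with a balanced accuracy near 0.5 and a positive predictive value near zero when the positive class is rare and the classifier almost never detects it. The sketch below shows this using a confusion matrix whose counts are purely illustrative; they were chosen by us to show the pattern and are not taken from the study.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Balanced accuracy, PPV and NPV from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    npv = tn / (tn + fn) if (tn + fn) else 0.0
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "ppv": ppv,
        "npv": npv,
    }

# Hypothetical counts: a rare "worsened" class that the model never detects.
metrics = binary_metrics(tp=0, fn=61, fp=6, tn=490)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because most cases are negative, predicting "not worsened" is usually right (high NPV) even though the model has no ability to flag the cases that matter.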
Conclusions
Students’ severe mental health problems reached worrying levels, and demographic factors were poor predictors of mental health outcomes. Further research including people with lived experience will be crucial to better assess students’ mental health needs and improve the predictive outcome for those most at risk of worsening symptoms.
Vitamin D is an essential nutrient to be consumed in the habitual dietary intake, whose deficiency is associated with various disturbances. This study represents a validation of vitamin D status estimation using a semi-quantitative FFQ, together with data from additional physical activity and lifestyle questionnaires. This information was combined to forecast the serum vitamin D status. Different statistical methods were applied to estimate the vitamin D status using predictors based on diet and lifestyle. Serum vitamin D was predicted using linear regression (with leave-one-out cross-validation) and random forest models. Intraclass correlation coefficients, Lin’s agreement coefficients, Bland–Altman plots and other methods were used to assess the accuracy of the predicted v. observed serum values. Data were collected in Spain. A total of 220 healthy volunteers aged between 18 and 78 years were included in this study. They completed validated questionnaires and agreed to provide blood samples to measure serum 25-hydroxyvitamin D (25(OH)D) levels. The common final predictors in both models were age, sex, sunlight exposure, vitamin D dietary intake (as assessed by the FFQ), BMI, time spent walking, physical activity and skin reaction after sun exposure. The intraclass correlation coefficient for the prediction was 0·60 (95 % CI: 0·52, 0·67; P < 0·001) using the random forest model. The magnitude of the correlation was moderate, which means that our estimation could be useful in future epidemiological studies to establish a link between the predicted 25(OH)D values and the occurrence of several clinical outcomes in larger cohorts.
In this study, we quantify the relationship between socio-economic status and life expectancy and identify combinations of socio-economic variables that are particularly useful for explaining mortality differences between neighbourhoods in England. We achieve this by examining socio-economic variation in mortality experiences across small areas in England known as lower layer super output areas (LSOAs). We then consider 12 socio-economic variables that are known to have a strong association with mortality. We estimate the relationship between those variables and mortality rates using a random forest algorithm. Based on the resulting estimate, we then create a new socio-economic mortality index – the Longevity Index for England (LIFE). The index is constructed in a way that eliminates the impact of care homes that might artificially increase mortality rates in LSOAs with care homes compared to LSOAs that do not contain a care home. Using mortality data for different age groups, we make the index age-dependent and investigate the impact of specific socio-economic characteristics on the age-specific mortality risk. We compare the explanatory power of the LIFE index to the English Index of Multiple Deprivation (IMD) as predictors of mortality. While we find that the IMD can explain regional mortality differences to some extent, the LIFE index has significantly greater explanatory power for mortality differences between regions. Our empirical results also indicate that income deprivation amongst the elderly and employment deprivation are the most significant socio-economic factors for explaining mortality variation across LSOAs in England.
A decision tree is a tree-like model of decisions and their consequences, with classification and regression tree (CART) being the most commonly used. Being simple models, decision trees are considered ‘weak learners’ relative to more complex and more accurate models. By using a large ensemble of weak learners, methods such as random forest can compete well against strong learners such as neural networks. An alternative to random forest is boosting. While random forest constructs all the trees independently, boosting constructs one tree at a time. At each step, boosting tries to build a weak learner that improves on the previous one.
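The contrast between independent and sequential tree construction can be sketched with one-split regression trees (stumps) in plain Python. This is a minimal illustration under our own assumptions, not the chapter's code: the bagging function covers only the bootstrap-averaging part of random forest (there is one feature here, so no per-split feature subsampling), the boosting function is squared-error gradient boosting with an assumed learning rate of 0.5, and all function names and the toy data are hypothetical.

```python
import math
import random

def fit_stump(xs, ys):
    """Least-squares regression stump: one threshold, two leaf means."""
    best = None
    for t in sorted(set(xs))[:-1]:           # exclude max so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:                         # all x identical: constant predictor
        m = sum(ys) / len(ys)
        return (xs[0], m, m)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def bagging(xs, ys, n_trees, rng):
    """Independent stumps, each fit on its own bootstrap sample, then averaged."""
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(predict_stump(s, x) for s in stumps) / len(stumps)

def boosting(xs, ys, n_trees, lr=0.5):
    """Sequential stumps, each fit to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    resid = [y - base for y in ys]
    stumps = []
    for _ in range(n_trees):
        s = fit_stump(xs, resid)
        stumps.append(s)
        resid = [r - lr * predict_stump(s, x) for x, r in zip(xs, resid)]
    return lambda x: base + lr * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (hypothetical): a noisy sine curve.
rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]

bagged = bagging(xs, ys, n_trees=20, rng=random.Random(1))
boosted = boosting(xs, ys, n_trees=20)
print(f"bagged stumps  training MSE: {mse(bagged, xs, ys):.4f}")
print(f"boosted stumps training MSE: {mse(boosted, xs, ys):.4f}")
```

In the bagging loop no stump sees what the others did, so the trees could be built in parallel; in the boosting loop each stump depends on the residuals of its predecessors, so training error on this toy data cannot increase as rounds are added.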
Archaeologists tend to produce slow data that is contextually rich but often difficult to generalize. An example is the analysis of lithic microdebitage, or knapping debris, that is smaller than 6.3 mm (0.25 in.). So far, scholars have relied on manual approaches that are prone to intra- and interobserver errors. In the following, we present a machine learning–based alternative together with experimental archaeology and dynamic image analysis. We use a dynamic image particle analyzer to measure each particle in experimentally produced lithic microdebitage (N = 5,299) as well as an archaeological soil sample (N = 73,313). We have developed four machine learning models based on Naïve Bayes, glmnet (generalized linear regression), random forest, and XGBoost (“Extreme Gradient Boost[ing]”) algorithms. Hyperparameter tuning optimized each model. A random forest model performed best with a sensitivity of 83.5%. It misclassified only 28 or 0.9% of lithic microdebitage. XGBoost models reached a sensitivity of 67.3%, whereas Naïve Bayes and glmnet models stayed below 50%. Except for glmnet models, transparency proved to be the most critical variable to distinguish microdebitage. Our approach objectifies and standardizes microdebitage analysis. Machine learning allows studying much larger sample sizes. Algorithms differ, though, and a random forest model offers the best performance so far.
Broiler chickens are among the main livestock sectors worldwide. With individual treatments being inapplicable, contrary to many other animal species, the need for antimicrobial use (AMU) is relatively high. AMU in animals is known to drive the emergence and spread of antimicrobial resistance (AMR). High farm biosecurity is a cornerstone for animal health and welfare, as well as food safety, as it protects animals from the introduction and spread of pathogens and therefore the need for AMU. The goal of this study was to identify the main biosecurity practices associated with AMU in broiler farms and to develop a statistical model that produces customised recommendations as to which biosecurity measures could be implemented on a farm to reduce its AMU, including a cost-effectiveness analysis of the recommended measures. AMU and biosecurity data were obtained cross-sectionally in 2014 from 181 broiler farms across nine European countries (Belgium, Bulgaria, Denmark, France, Germany, Italy, the Netherlands, Poland and Spain). Using mixed-effects random forest analysis (Mix-RF), recursive feature elimination was implemented to determine the biosecurity measures that best predicted AMU at the farm level. Subsequently, an algorithm was developed to generate AMU reduction scenarios based on the implementation of these measures. In the final Mix-RF model, 21 factors were present: 10 about internal biosecurity, 8 about external biosecurity and 3 about farm size and productivity, with the latter showing the largest (Gini) importance. Other AMU predictors, in order of importance, were the number of depopulation steps, compliance with a vaccination protocol for non-officially controlled diseases, and requiring visitors to check in before entering the farm. K-means clustering on the proximity matrix of the final Mix-RF model revealed that several measures interacted with each other, indicating that high AMU levels can arise for various reasons depending on the situation. 
The algorithm utilised the AMU predictive power of biosecurity measures while accounting also for their interactions, representing a first step toward aiding the decision-making process of veterinarians and farmers who are in need of implementing on-farm biosecurity measures to reduce their AMU.
When one or several classes are much less prevalent than another class (unbalanced data), class error rates and variable importances of the machine learning algorithm random forest can be biased, particularly when sample sizes are smaller, imbalance levels higher, and effect sizes of important variables smaller. Using simulated data varying in size, imbalance level, number of true variables, their effect sizes, and the strength of multicollinearity between covariates, we evaluated how eight versions of random forest ranked and selected true variables out of a large number of covariates despite class imbalance. The version that calculated variable importance based on the area under the curve (AUC) was least adversely affected by class imbalance. For the same number of true variables, effect sizes, and multicollinearity between covariates, the AUC variable importance ranked true variables still highly at the lower sample sizes and higher imbalance levels at which the other seven versions no longer achieved high ranks for true variables. Conversely, using the Hellinger distance to split trees or downsampling the majority class already ranked true variables lower and more variably at the larger sample sizes and lower imbalance levels at which the other algorithms still ranked true variables highly. In variable selection, a higher proportion of true variables were identified when covariates were ranked by AUC importances and the proportion increased further when the AUC was used as the criterion in forward variable selection. In three case studies, known species–habitat relationships and their spatial scales were identified despite unbalanced data.
We compared climatic relationships to insurance loss across the inland Pacific Northwest region of the United States, using a design matrix methodology, to identify optimum temporal windows for climate variables by county in relationship to wheat insurance loss due to drought. The results of our temporal window construction for water availability variables (precipitation, temperature, evapotranspiration, and the Palmer drought severity index [PDSI]) identified spatial patterns across the study area that aligned with regional climate patterns, particularly with regard to drought-prone counties of eastern Washington. Using these optimum time-lagged correlational relationships between insurance loss and individual climate variables, along with commodity pricing, we constructed a regression-based random forest model for insurance loss prediction and evaluation of climatic feature importance. Our cross-validated model results indicated that PDSI was the most important factor in predicting total seasonal wheat/drought insurance loss, with wheat pricing and potential evapotranspiration having noted contributions. Our overall regional model had an R² of 0.49 and an RMSE of $30.8 million. Model performance typically underestimated annual losses, with moderate spatial variability in terms of performance between counties.
Chronic food insecurity remains a challenge globally, exacerbated by climate change-driven shocks such as droughts and floods. Forecasting food insecurity levels and targeting vulnerable households is a priority for humanitarian programming to ensure timely delivery of assistance. In this study, we propose to harness a machine learning approach trained on high-frequency household survey data to infer the predictors of food insecurity and forecast household level outcomes in near real-time. Our empirical analyses leverage the Measurement Indicators for Resilience Analysis (MIRA) data collection protocol implemented by Catholic Relief Services (CRS) in southern Malawi, a series of sentinel sites collecting household data monthly. When focusing on predictors of community-level vulnerability, we show that a random forest model outperforms other algorithms and that location and self-reported welfare are the best predictors of food insecurity. We also show performance results across several neural networks and classical models for various data modeling scenarios to forecast food security. We pose that problem as binary classification via dichotomization of the food security score based on two different thresholds, which results in two different positive class to negative class ratios. Our best performing model has an F1 of 81% and an accuracy of 83% in predicting food security outcomes when the outcome is dichotomized based on threshold 16 and predictor features consist of historical food security score along with 20 variables selected by artificial intelligence explainability frameworks. These results showcase the value of combining high-frequency sentinel site data with machine learning algorithms to predict future food insecurity outcomes.
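Dichotomizing a continuous score at different thresholds changes the class ratio, and F1 and accuracy are then computed on the resulting binary labels. The sketch below illustrates this with made-up scores and predictions. Only the threshold of 16 comes from the abstract; the second threshold (20), the direction of the cut (score at or above the threshold counted as positive) and all data values are assumptions for illustration.

```python
def dichotomize(scores, threshold):
    """Binary labels: 1 if score >= threshold, else 0 (direction is assumed)."""
    return [1 if s >= threshold else 0 for s in scores]

def f1_and_accuracy(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return f1, accuracy

scores = [4, 9, 12, 15, 16, 18, 21, 23, 25, 30]   # hypothetical scores

labels_16 = dichotomize(scores, 16)   # threshold from the abstract
labels_20 = dichotomize(scores, 20)   # hypothetical second threshold
print("positives at 16:", sum(labels_16), "of", len(scores))
print("positives at 20:", sum(labels_20), "of", len(scores))

y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 1, 1]           # hypothetical model output
f1, accuracy = f1_and_accuracy(labels_16, y_pred)
print(f"F1 = {f1:.2f}, accuracy = {accuracy:.2f}")
```

The same scores give six positives at one threshold and four at the other, which is why each threshold yields its own class balance and has to be evaluated separately.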
The authors apply logistic regression, multinomial regression, classification trees and random forests to a ternary outcome variable: the variation between the ’s-genitive, the of-genitive and functionally equivalent noun + noun combinations. The statistical approaches discussed fall into regression models on the one hand and classification trees on the other. Specifically, as an alternative to successive binomial regression analyses, the authors implement a multinomial model, which can analyse the entire dataset with three outcome categories simultaneously. Further, a basic classification tree is calculated alongside a more complex (and more robust) random forest. The chapter does not only weigh advantages and shortcomings of all four models, but it also explicates the different rationales and interpretations that come with them. As a major insight, it emerges that the nature of the dataset, the analytic purpose and the statistical model are interdependent and condition each other in several non-trivial respects.
In a comparison of generalised linear mixed-effects models, generalised linear mixed-effects model trees and random forests, the author applies the three methodologies to a binary variable from the field of interactional pragmatics, the choice between filled and unfilled pauses across varieties of English represented by components of the International Corpus of English. Based on a large number of examples annotated for linguistic and extralinguistic factors the steps and decisions involved in the analyses are demonstrated. Though different in essence, the three resulting models share central trends. A more fine-grained evaluation of results and interpretations shows, however, that the three approaches differ in their systematicity of handling multiple observations from the same source, in that only the mixed-effects models explicitly account for and systematically partial out the relatedness of data points contributed by the same speaker. As to the way the approaches balance researcher involvement and control of the outcome, the approaches also differ substantially. A modelling choice can thus lead to notably different perspectives on an identical set of data and variables.