1 Introduction
Children acquire the languages used by those around them. Toddlers in English-speaking families say dog, while those in French-speaking families say chien. Thus, all theories of language development are grounded in the assumption that the language children experience (i.e., the input) plays a critical role in language development. Over the last forty years, however, a substantial body of research has been conducted to test a stronger claim: that differences in the amount or kind of input that children receive are associated with differences in the pace with which they acquire language (Zauche et al., 2016).
In these studies, researchers record interactions between a young child and their caregiver during daily activities (like meals) or a structured play session. This speech sample is then transcribed and coded for properties that might predict individual differences in children's language development, such as the amount of speech directed at the child (Huttenlocher et al., 1991), its lexical diversity (Hart & Risley, 1995), or its grammatical complexity (Hoff-Ginsberg, 1986). Finally, parent input is compared to the child's language ability, which can be assessed in a wide variety of ways, such as by administering a vocabulary assessment, collecting the information via a parent report, or sampling speech produced by the child during observation. Most of these studies find that parent input predicts their children's language outcomes. This work is frequently cited as motivation for interventions that seek to increase parents' verbal engagement with their children (e.g., Dupas et al., 2023; Suskind et al., 2016; Weber et al., 2017; Wong et al., 2020).
The present paper is a meta-analysis exploring the size of these input–output correlations, the range of conditions under which they are observed, and the degree to which the size of these effects depends on the choice of input measure, outcome measure, study design, or population studied. In the remainder of this introduction, we review the principal findings of the input literature, describe meta-analytic methods and what they can accomplish, present the questions motivating the present meta-analysis, and discuss the findings from previous meta-analyses on this topic and how our work goes beyond them.
1.1 The growing interest in studies of caregiver input
While research into the relationship between parental speech and language development began in the 1970s (e.g., Newport & Gleitman, 1977), much of the current interest in this topic stems from Hart & Risley's (1995) book (H&R). H&R followed 42 families for nearly 2.5 years, collecting data on children's linguistic milestones and sampling naturalistic speech in the home during monthly hour-long visits. They found that the thirteen children in the highest income group heard on average over 2000 words per hour from their primary caregiver, whereas the six children in families receiving welfare heard around 600. Critically, differences in caregiver input predicted individual differences in the rate of children's vocabulary growth, such that children who heard more words had larger vocabularies than children who heard fewer. These effects persisted: parental vocabulary use at age 3 predicted performance on standardized language tests at age 9 (Walker et al., 1994). The authors concluded that child-directed speech plays a critical role in language development.
While more recent studies have found that socioeconomic differences in child-directed speech are neither as large nor as prevalent as H&R suggest (e.g., Dailey & Bergelson, 2022; Sperry et al., 2019), a growing body of research has supported H&R's second conclusion that differences in the properties of caregiver input predict differences in language growth. These findings have been replicated in a variety of socioeconomic contexts (e.g., Hoff, 2003; Pan et al., 2004; Rowe, 2008; Huttenlocher et al., 2010; Hirsh-Pasek et al., 2015; Romeo et al., 2018). While most input studies have been conducted with English-speaking families in the U.S., similar patterns have also been observed in other contexts (e.g., Hurtado et al., 2008; Mastin & Vogt, 2016; Shneidman & Goldin-Meadow, 2012; Weber et al., 2017; Weisleder & Fernald, 2013; Zhang et al., 2023).
1.2 Motivating the meta-analytic approach
While there is a broad consensus that input and outcome are correlated, several important questions remain that can be addressed by meta-analysis. First, it is unclear how large these correlations are. H&R found that input measures collected between 34 and 36 months of age predicted over half of the variance in vocabulary at 36 months (R² = 0.53), suggesting that input variation is the primary factor setting the pace for vocabulary growth. Studies conducted since then, however, have reported effect sizes ranging from R² = 0.61 (Leech & Rowe, 2014) to R² = 0.00 (Pancsofar & Vernon-Feagans, 2006). Knowing how large we should expect these effects to be, in general, allows us to more accurately determine how typical or unusual a given finding is, opening the way for further discovery. This information also allows researchers to set their priors more accurately for power analyses. Finally, it allows for direct comparisons of parental input to other predictors of language development (e.g., Is parent input more predictive of language outcomes than input from teachers or genetic differences that impact learning?).
Second, there is no consensus on which input measures are most predictive of language outcomes. Child-directed speech is a rich stimulus that can be characterized in a variety of ways. Researchers may be interested in whether simply hearing more speech facilitates development or whether the benefits come from hearing speech of a specific kind. Thus, speech coding is often broken down into two categories: the quantity and quality of the input. Quantity measures, like the number of words, capture how much speech children hear during interactions with their caregivers. Quality measures, like lexical richness, capture the degree to which caregiver speech contains features that are thought to facilitate language learning. Some researchers have found that measures of quality are more strongly associated with child outcomes than measures of quantity, particularly later in development (e.g., Hsu et al., 2017; Pan et al., 2004; Rowe, 2012). Findings of this kind are central to understanding the mechanism by which input shapes language development and the kind of input that matters most (Golinkoff et al., 2019). Thus, it is important to know if such patterns are consistently observed across studies and how large the difference in effect size is. In our study, we focus on the four measures that are most often reported in input studies. Two are measures of quantity (number of utterances and word tokens), and two are measures of quality (number of word types and the mean length of utterances). Word tokens are counted as the total number of words in the sample, while word types are counted as the number of different words in the sample. Mean length of utterance, or MLU, is defined as the average length of an utterance in words or morphemes.
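To make these definitions concrete, the sketch below computes all four measures from a toy transcript. It is illustrative only: the utterances are invented, and published studies apply additional conventions (e.g., morphemic segmentation for MLU) that are omitted here.

```r
# Four standard input measures computed from a toy transcript (base R)
utterances <- c("look at the dog",
                "the dog is eating",
                "do you see the big dog")

words <- strsplit(utterances, " ")

n_utterances <- length(utterances)             # quantity: 3 utterances
n_tokens     <- length(unlist(words))          # quantity: 14 word tokens
n_types      <- length(unique(unlist(words)))  # quality: 10 distinct word types
mlu_words    <- mean(lengths(words))           # quality: MLU in words, 14/3 = 4.67
```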
Third, because input studies differ greatly from one another, differences in study design, participant characteristics, or language measures could moderate the effects of input. Speech from parents can be sampled: in the participants’ homes or in the researcher’s laboratory; during episodes of play or reading; for a few minutes or for many hours. Children’s language ability can be assessed via direct testing, by observing children’s speech production with their caregivers, or by surveying their primary caregiver about their language use. Studies also vary in the characteristics of the participants, such as the age of the child or the gender of the parent. Studies like H&R that find correlations between SES and input raise the possibility that the strength of the association between input and outcome might depend on the SES composition of the sample, with more economically diverse samples having larger correlations.
Finally, there is a possibility that research on input–outcome relationships is skewed by publication bias. The association between children's language and caregiver speech is supported by a large literature and has become a fixture of interventions and policy initiatives (Dupas et al., 2023; Suskind et al., 2016; Weber et al., 2017; Wong et al., 2020). Thus, many researchers may expect to find such a relationship in their data. This kind of consensus can create a "file-drawer effect," where negative or null results are not published (Rosenthal, 1979), which would inflate the apparent effect size. Fortunately, meta-analysis offers tools to estimate the possible effects of publication bias. By comparing data that have been published to unpublished data obtained through contact with authors, one can determine whether positive results are more likely to be reported than null results. One can also create funnel plots, which visualize the distribution of effect sizes relative to their standard errors, to determine whether there are more positive results than would be expected in studies with greater variance (suggesting selective reporting). Methods like Egger's test can be used to determine whether effect sizes are asymmetrically distributed against standard errors (Egger et al., 1997). These plots can also include contour lines plotting the distribution of p-values, which can reveal other possible causes of asymmetry, such as variable study quality (Peters et al., 2008).
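For readers unfamiliar with these diagnostics, the sketch below shows how a contour-enhanced funnel plot and a classical Egger's test can be produced with the metafor package. The data frame `dat` (with Fisher's z effects `yi` and sampling variances `vi`) is a hypothetical example, and regtest() as shown applies to standard random-effects models; the multilevel variant used in our own analysis is described in the Methods.

```r
# A sketch of funnel-plot diagnostics for publication bias (metafor)
library(metafor)

res <- rma(yi, vi, data = dat)  # standard random-effects model

# Contour-enhanced funnel plot: shaded bands mark conventional
# significance contours (p < .10, .05, .01; Peters et al., 2008)
funnel(res, level = c(90, 95, 99),
       shade = c("white", "gray75", "gray55"),
       refline = 0, legend = TRUE)

# Egger's regression test: does effect size grow with standard error?
regtest(res, predictor = "sei")
```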
1.3 Prior meta-analyses
Two prior meta-analyses have explored input–outcome correlations. Wang et al. (2020) conducted a meta-analysis comparing language outcomes to caregiver input collected via the use of LENA recording devices, which produce automatic quantitative measures of speech spoken near the child extracted from hours of audio. They collected 17 studies of children from birth to 48 months, including two non-English studies (one in Mandarin and one in Finnish), and ran three analyses on each of the LENA input measures: adult word counts, child vocalization counts, and conversational turn counts. Collapsing across measures, they found a moderate relationship between LENA measures and language outcomes (R² = 0.07). Examining each measure individually, they found that adult word counts had the weakest relationship with outcomes (R² = 0.04), followed by conversational turn counts (R² = 0.09) and child vocalizations (R² = 0.09). In addition, longer elapsed time between input and outcome collection was associated with smaller correlations. No evidence of publication bias was found.
Most recently, Anderson et al. (2021; AGPJM) conducted a meta-analysis focusing on the relationship between language outcomes and input quantity and quality. Studies were included in the analysis of quantity if they included either word tokens or utterances as an input measure. Studies were included in the analysis of quality if they included either a measure of input diversity (e.g., word types/roots, type–token ratio) or complexity (mean length of utterance, rare word usage, lexical richness, or multi-clausal utterances). Only studies examining typically developing English-speaking children were included, and studies using LENA were omitted. The analyses included 33 quantity studies and 35 quality studies of 1- to 72-month-olds. Using hierarchically ordered study selection criteria, they selected a single statistical effect size from each study. Studies of quantity had significant effects overall (R² = 0.04). There were two reliable moderators: effect sizes were larger in longitudinal studies and in studies where children were assessed in naturalistic contexts. The funnel plot was asymmetric for the quantity studies, suggesting publication bias, but the authors estimated that the true effect size was only slightly smaller than the reported effect size. Studies of quality had larger effect sizes (R² = 0.11) with no evidence of publication bias. For studies of quality, there were larger effect sizes for longer studies, studies where input was collected in naturalistic contexts, and studies of older children. They conducted separate analyses on their measures of diversity (primarily types, R² = 0.07) and complexity (primarily MLU, R² = 0.11), which were not significantly different from one another.
1.4 Current study
This study builds upon the AGPJM meta-analysis but goes beyond it in six critical ways.
First, because this research area is highly active, we were able to include 23 studies that were not available when AGPJM conducted their analysis.
Second, we adopted a multilevel mixed-effects design that allows us to include multiple effect sizes per study (e.g., correlations collected with different language assessments or different speakers), increasing the total number of effect sizes analysed from k = 71 to k = 323. By using robust variance estimation to control for statistical interdependence between effect sizes in the same study, we are able to include all effect sizes reported in our sample, increasing the precision of our estimates (Pustejovsky & Tipton, 2022). This approach was also used by Wang et al. (2020) and departed from our original preregistered analysis, which used hierarchical study selection to derive a single effect size per study. The findings of the preregistered analysis, which differed little from the current analysis, can be found in the Supplementary Materials.
Third, we adopted a finer-grained coding system for input, examining the four standard measures of input separately: utterances, tokens, types, and MLU. This approach differentiates between measures of input within quality and quantity and allows us to perform more precise analyses of pooled effect size and moderators. Using multilevel modelling, we could also compare the pooled effect sizes of these different measures directly by including them in the meta-regression model as moderating variables. We use a similar technique to test for publication bias. If there were publication bias, we would expect studies with large standard errors to have disproportionately large and positive effect sizes. We can test for this by constructing a regression model with standard error as a moderating variable (Rodgers & Pustejovsky, 2021).
Fourth, in addition to exploring previously studied moderators using our novel statistical approach, we included several moderators that have gone unexamined. For example, we were interested in whether effect sizes are larger when input and outcome measures are identical, which often occurs when children’s language is assessed on the basis of the input speech sample. If the context of the conversation shapes the behaviour of both participants in similar ways (e.g., a discussion of different animals might produce greater lexical diversity from parent and child than pretend play with cars), we might expect these studies to produce larger correlations.
Fifth, we wished to investigate whether input–outcome associations showed evidence of boundedness. Most studies of input claim that associations between input and outcomes are causal in nature: children who hear speech more often or hear more complex speech learn their language more quickly. To our knowledge, no studies have examined whether the effects of input are bounded. In other words, is there a point at which children begin seeing diminishing returns on learning from additional input? Intuitively, we might expect there to be a limit to how much new information young children can learn in a day, after which additional child-directed speech does not produce more learning. If this were true, we would expect input–outcome correlations to be smaller in samples where children, on average, hear more speech or more complex speech.
Finally, we have taken a simple step to explore the effects of language and culture on input–outcome associations by including non-English studies in our analysis. While we do not have access to data from studies of rural, non-Western populations, we wanted to determine how large effect sizes were in non-English and in non-U.S. samples, and whether they differed substantially from English/U.S. samples. Such an analysis was not included in the meta-analyses reviewed earlier. It is an initial, small step towards understanding whether there are systematic differences in these effects across languages and countries.
2 Methods
We followed PRISMA guidelines for conducting and reporting results for meta-analyses (Page et al., 2021) (Figure 1). A full breakdown of our study selection procedure can be found in the Supplementary Materials.
2.1 Search procedure
2.1.1. Forward search
We constructed a Boolean search query based on our inclusion criteria, searching the abstract, title, and keywords for references to (1) caregiver participants, (2) child participants, (3) input measures, and (4) output measures. We refined our query by testing candidate queries to ensure that the 28 studies in our literature review (see below) were found. We surveyed the following databases based on their relevance to research in child language development: ERIC, PsycInfo, Academic Search Premier, PubMed, Web of Science, and Proquest (for dissertations). The most recent search was conducted 14 January 2021 and produced 6763 abstracts for further screening.
2.1.2. Other sources
Expert knowledge: Prior to our search, we identified 28 publications for eligibility screening, drawn from a literature review conducted for a previous study of child-directed speech (Coffey et al., 2022).
Contacting authors: We reached out to the research community through the ICIS and CHILDES email listservs for missed studies and unpublished data. We considered an additional eight publications collected this way.
Prior meta-analysis: Finally, we compared the results of our literature review, forward search, and author contacts to the list of publications included in the AGPJM meta-analysis. We considered an additional 23 publications collected this way. Nineteen were excluded due to data overlap or differences in our respective inclusion criteria (Figure 1).
2.2 Inclusion criteria
We screened study abstracts for four criteria. First, studies must be English-language journal articles, book chapters, dissertations, or conference proceedings. Reviews and meta-analyses were excluded. Second, studies must examine typically developing, monolingual children between the ages of 1 and 8. Studies of atypically developing children (e.g., children with autism, preterm infants) and of multilingual children were excluded. Third, studies must include one of our four input measures (utterances, tokens, types, or MLU) from speech directed to children by caregivers in naturalistic or semi-naturalistic settings. Studies that only collected other measures of input (e.g., questions, decontextualized speech) or measures of interaction (e.g., warmth, responsiveness) were excluded. We also excluded input that was scripted (e.g., only words read from a book). Fourth, studies must include either a measure of children's vocabulary, a broad measure of language development, or an observation of children's language use. We excluded studies that measured other specific forms of language proficiency (syntactic knowledge, pragmatics, novel word learning), primarily because of the small number of studies using any one of these measures. Our screening of abstracts left 248 potentially eligible publications. We located full-text versions of these studies for further review and coding. Studies whose data overlapped significantly with those of other studies were reviewed further. When available, the earliest report of the original data set was preferred. We assumed the initial report of data would be the primary analysis, whereas subsequent reports would be affected by what had been found previously. Full-text review was conducted by the lead author.
This procedure resulted in 75 studies that were coded for effect size and moderating variables. Four studies that did not provide Pearson's r were omitted from analysis. In addition, five of the reported input–outcome correlations were calculated from composite measures of input (e.g., the average of multiple standardized input values) and could not be included in any of the individual input analyses. In total, we examined 71 studies, reporting 323 correlations, across 4760 unique participants.
2.3 Study coding
Coding was conducted by the lead author. An additional annotator independently coded 21 studies to check for accuracy. Studies were coded for four different kinds of variables: input measures, outcome measures, subject characteristics, and study characteristics.
2.3.1. Input measures
Word tokens: Word tokens are a raw count of the total number of words produced. The number of tokens produced by parents has been found to predict language development in children (e.g., Hoff, 2003; Rowe, 2012).
Total utterances: An utterance is defined as a continuous segment of speech. An utterance can be a single sentence, a word, a phrase, or a portion of a sentence, and utterances are commonly delimited by pauses in speech. The number of parent utterances has been found to predict individual differences in language outcomes (e.g., Pancsofar & Vernon-Feagans, 2006).
Word types: Word types are a measure of how many different words are produced in a sample. They index the lexical diversity of speech, or the number of different words that children have the opportunity to learn. Parent word types are frequently found to predict language outcomes (e.g., Hart & Risley, 1995; Rowe, 2012). In addition to word types, we also included closely associated measures, like the number of different word roots or morphemes.
Mean length of utterance (MLU): MLU is the average number of linguistic units an utterance contains. This measure has been used to index the grammatical complexity of an utterance. MLU and language outcomes are often positively correlated (e.g., Hoff-Ginsberg, 1986). MLU is commonly defined as the average number of morphemes in an utterance but can also be defined as the average number of words in an utterance. These measures are highly correlated (Parker & Brorson, 2005), and thus, we included both in our analysis.
2.3.2. Outcome measures
We coded studies for how they measured language outcomes. We distinguished between three kinds of assessment types: observation, direct assessment, and parent report. Within these assessments, we also distinguished between two kinds of assessment measures: expressive and receptive language. We also distinguished whether an assessment was a measure of vocabulary specifically. Finally, for studies using word types as the input measure, we coded whether the outcome measure captured the same construct in the child (e.g., parent word types and child word types produced during observation). Studies that did were coded as matched.
2.3.3. Subject characteristics
Gender was coded as the percentage of children who were female. Age was coded at the time input measures were collected and the time outcome measures were collected. We coded information about the speaker(s) providing input (e.g., mother, father, primary caregiver) and the native language and country of origin of the household. Next, we coded for household SES by categorizing studies into the three groups used by AGPJM. Studies focusing on samples with lower levels of income/education (relative to national norms) were coded as low SES. Studies with samples from across different income/education levels were coded as diverse SES. All other studies were coded as middle/high SES by default. Finally, for each input measure collected in a study, we coded mean input, or the average recorded value across observations. To ensure input means were comparable from study to study, we normalized word tokens and utterances for observation duration, producing measures of word tokens per minute and utterances per minute, respectively. This was unnecessary for MLU, which is already normalized for total caregiver utterances. We did not code the average word types produced because the rate at which new word types appear declines as a function of time (i.e., the longer a session goes on, the less likely new word types are to be encountered), and thus word types per minute is confounded with observation duration.
2.3.4. Study characteristics
Studies were coded for their total language sample duration in minutes. Study location was coded as home, lab, or other. The activity taking place during the study was coded as either naturalistic (participants were told to go about their day), semi-naturalistic play (participants were asked to play as they would normally), structured play (participants were asked to play with a particular set of toys), or other. Studies were also coded for temporal design, either cross-lagged or concurrent. Cross-lagged studies were those for which parent input was collected at a different time (Time 1) than children’s outcomes (Time 2). Finally, we coded publication type. First, we coded a baseline set which included all studies that were published in peer-reviewed journals and where our correlation coefficient was taken from that paper. Next, data from books, dissertations, or other non-peer-reviewed sources were coded as non-peer-reviewed. Then, all studies in which the correlation coefficient was not included in the paper but could be calculated from the paper or were retrieved after contacting authors were coded as non-reported. We conducted two moderator analyses: one comparing baseline studies to non-peer-reviewed studies and another comparing baseline studies to studies coded as non-peer-reviewed or non-reported.
2.3.5. Effect sizes
Studies were coded for Pearson's r correlations between input and outcome measures. When these measures could not be found in the study, we reached out to the authors for either the correlation coefficient or the raw data from which we could calculate the correlation ourselves. We did not include partial correlations or regression coefficients with covariates, in order to maintain the comparability of effects across studies. All correlation coefficients were converted to Fisher's z scores, with a sampling variance (squared standard error) of $\frac{1}{n-3}$, where n is the sample size for each study (Hedges & Olkin, 1985).
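In metafor, this conversion is a one-liner; the sketch below assumes a hypothetical data frame `dat` with one row per reported correlation (`ri`) and its sample size (`ni`).

```r
# Fisher's z transformation of correlations (metafor)
library(metafor)

dat <- escalc(measure = "ZCOR", ri = ri, ni = ni, data = dat)
# dat$yi now holds z = atanh(r); dat$vi holds the sampling
# variance 1/(n - 3)
```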
2.4 Analysis
All analyses were conducted in R (v 4.4.1) (R Core Team, 2024), using the metafor (v 4.6.0) and clubSandwich (v 0.5.11) packages (Pustejovsky, 2024; Viechtbauer, 2010). To determine the size of input–outcome correlations, random-effects meta-regression models were fitted for each of the four input measures, where the intercept is the pooled effect size estimate. These models controlled for interdependence between shared effect sizes using robust variance estimation (rho = 0.8). Q-statistics from the resulting models were used to assess whether there was sufficient heterogeneity across studies to motivate an analysis of potential moderating variables. Moderators were then added to each of the base models as predictors. When considering continuous moderators, we removed outliers by excluding studies that reported values more than three standard deviations from the mean in our sample to avoid data skew. These studies were included in all other analyses. Statistical significance was determined in a two-step process: first, individual coefficients were tested for significance in the regression; second, likelihood ratio testing was used to determine whether the addition of the variable improved fit over the base model. When multiple moderators were found to be significant alone, they were included together in a single model, which was then checked for significance and improvement of fit. Base cases for categorical moderators were dummy coded zero and indicated in each table (e.g., for household SES, the base case is middle-upper or MU).
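The sketch below illustrates this pipeline under stated assumptions: the data frame `dat`, its columns (`yi`, `vi`, `study`, `es_id`), and the moderator `child_age` are hypothetical names for illustration, not the code used in our analysis.

```r
# Multilevel meta-regression with robust variance estimation
library(metafor)
library(clubSandwich)

# Working covariance matrix assuming r = 0.8 between effect sizes
# drawn from the same study
V <- impute_covariance_matrix(dat$vi, cluster = dat$study, r = 0.8)

# Base model: effect sizes nested within studies; the intercept is
# the pooled Fisher's z
base <- rma.mv(yi, V, random = ~ 1 | study/es_id, data = dat)
coef_test(base, vcov = "CR2", cluster = dat$study)  # robust test

# Two-step moderator test: coefficient significance, then a
# likelihood-ratio test of improved fit (models refit with ML)
base_ml <- rma.mv(yi, V, random = ~ 1 | study/es_id, data = dat,
                  method = "ML")
mod_ml  <- rma.mv(yi ~ child_age, V, random = ~ 1 | study/es_id,
                  data = dat, method = "ML")
anova(base_ml, mod_ml)
```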
We then checked for publication bias using two methods. First, we included publication status as a binary predictor variable in our moderator analysis. Second, we constructed funnel plots for each of our input variables. These figures plot the correlation reported for each study against its standard error. Studies with smaller standard errors would be expected to find correlations closer to the pooled effect size than studies with larger standard errors (resulting in a downward funnel). If there were publication bias, it would lead to the disappearance of studies with large standard errors but small (non-significant) effect sizes, resulting in an effect size that gets larger as the standard error increases. To test for this, we approximated Egger's regression test for asymmetry in metafor by constructing a regression model using the standard error of each effect as a moderator (Egger et al., 1997; Rodgers & Pustejovsky, 2021). We then applied our criteria for moderator significance (regression and likelihood ratio test) to determine whether there was a significant effect of standard error on effect size.
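Continuing the hypothetical sketch above: because metafor's regtest() does not accept multilevel rma.mv models, the standard error is entered directly as a moderator.

```r
# Egger-style asymmetry test for the multilevel model
# (Rodgers & Pustejovsky, 2021)
dat$sei <- sqrt(dat$vi)  # standard error of each effect

egger <- rma.mv(yi ~ sei, V, random = ~ 1 | study/es_id,
                data = dat, method = "ML")
coef_test(egger, vcov = "CR2", cluster = dat$study)
anova(base_ml, egger)  # does adding SE improve fit?
# A reliably positive slope on sei means higher-variance studies
# report larger effects, consistent with selective reporting.
```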
Finally, we determined whether certain forms of input were more strongly associated with children’s outcomes by fitting a single multilevel model with all data from all input types, including input measure as a moderator and applying our significance criteria.
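In the same hypothetical setup, this cross-measure comparison amounts to one more moderator model, with input measure as a factor and word tokens as the reference level (`measure` is an assumed column name).

```r
# Comparing pooled effects across the four input measures
dat$measure <- relevel(factor(dat$measure), ref = "tokens")

cmp <- rma.mv(yi ~ measure, V, random = ~ 1 | study/es_id,
              data = dat, method = "ML")
coef_test(cmp, vcov = "CR2", cluster = dat$study)
anova(base_ml, cmp)
```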
3 Results
Our preregistration, data, code, and Supplementary Materials can be accessed through OSF (https://osf.io/aydcf/). A version of this analysis using hierarchical study selection, replicating most of the findings below, can be found in the Supplementary Materials.
A full breakdown of the study characteristics for each input analysis is given in Table 1. To determine whether the studies using the different input variables differed along other dimensions, we conducted a series of mixed-effect linear regressions using lme4 (v 1.1-35.4) (Bates et al., 2015), where the moderating variable is used as the response variable, input measure is included as a categorical predictor (word tokens coded as zero), and study ID is used as a random intercept. We used mixed-effect modelling for the comparisons to account for multiple values introduced by the same study. We found that studies of word types were shorter on average than studies of word tokens (β = −15.86, SE = 7.16, p = 0.03). Studies of word tokens were longer than other studies, but these comparisons did not reach significance.
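One such comparison, as a hedged sketch (the column names `duration`, `measure`, and `study_id` and the data frame `study_dat` are assumptions; lmerTest is an optional add-on for p-values):

```r
# Do studies using different input measures differ in, e.g., duration?
library(lme4)
library(lmerTest)  # optional: adds p-values to lmer summaries

m <- lmer(duration ~ measure + (1 | study_id), data = study_dat)
summary(m)  # assumes word tokens is the reference level of `measure`
```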
We provide forest plots for each analysis; for studies with multiple reported correlations, a single composite correlation was calculated using the aggregate function in metafor. Reported pooled effect sizes have been converted from z-scores back to r for interpretability. Moderator effects are given in tables, with significant effects bolded (i.e., significant coefficient and improved model fit by χ² test). Finally, funnel plots are presented illustrating each reported correlation plotted against its standard error, its statistical significance, the overall pooled effect size (with 95% confidence intervals), and the degree of distributional asymmetry given by an Egger's test.
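These plotting steps follow directly from the earlier sketch (again with hypothetical object names):

```r
# Collapse multiple correlations per study for the forest plot,
# assuming r = 0.8 between effects from the same study
agg <- aggregate(dat, cluster = study, rho = 0.8)
res_agg <- rma(yi, vi, data = agg)
forest(res_agg, transf = transf.ztor)  # display effects as r

predict(base, transf = transf.ztor)    # pooled r with 95% CI
funnel(base)                           # funnel plot for the model
```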
3.1 Word tokens
We examined 93 correlations across 38 studies that measured word tokens (n = 1986 unique participants). The number of effect size estimates per study ranged from 1 to 6 (median: 2). We found a medium-sized effect across studies (r = 0.23, Figure 2). Q-statistics revealed significant evidence for between-study heterogeneity (Q(92) = 362.68, p < 0.001), which motivated an analysis of possible moderators. While some moderators were significant when included in the model (Table 2), none of them resulted in improved model fit.
To check for moderating effects of publication type, we compared the non-peer-reviewed studies and the studies with unreported statistics to our baseline. Neither of these variables was found to moderate the effect of word tokens on outcomes. In addition, we found no evidence of asymmetry in our funnel plot using Egger's test (β = 0.69, SE = 0.53, p = 0.22) (Figure 3). In sum, there was no evidence of publication bias in the word token studies.
3.2 Utterances
3.2.1. Summary statistics
We examined 45 correlations across 17 studies that measured utterances (n = 956 unique participants). The number of effect size estimates per study ranged from 1 to 6 (median: 2). We found a medium-sized effect across studies (r = 0.19, Figure 4). Q-statistics revealed significant evidence for between-study heterogeneity (Q(44) = 338.75, p < 0.001). Some moderators were significant when included in the model (Table 3), but none of them improved overall model fit. Neither our analysis of publication status nor the Egger's test (β = 0.83, SE = 0.34, p = 0.08) revealed any evidence of publication bias (Figure 5).
3.3 Word types
We examined 111 correlations across 37 studies that measured word types (n = 2420 unique participants), with effect size estimates per study ranging from 1 to 10 (median: 3). We found a medium-sized effect across studies (r = 0.27, Figure 6). Q-statistics revealed significant evidence for between-study heterogeneity, motivating an analysis of moderators (Q(110) = 621.82, p < 0.001). We found that studies with children who were older at the time of input data collection reported larger correlations (Table 4), significantly improving model fit (χ²(1) = 12.64, p < 0.001). Studies with children who were older at the time of language outcome data collection also had marginally larger effect sizes (p = 0.05), also improving model fit (χ²(1) = 7.60, p = 0.006). Unsurprisingly, we found that these two variables were highly correlated with one another (R² = 0.39, p < 0.001), and thus it is unclear whether it is the age at input collection, the age at outcome assessment, or both that predicts the larger effect size.
We found that studies using parent-reported outcomes exhibited lower correlations with input than studies using direct assessments (χ²(1) = 9.35, p = 0.03). Parent reports are more commonly used when children are younger, resulting in a difference in child age across these assessment types (β = −6.72, SE = 2.8, p = 0.02). When assessment type is included in models with age at input collection and outcome assessment as predictors, it is non-significant and fails to improve model fit. Thus, this effect was most likely due to the partial confound with child age.
Finally, while including vocabulary matching (i.e., parent word types and child word types produced during observation) in our model was found to improve fit, effect sizes were not significantly higher in studies where input and outcome measures were matched (Table 4). As a follow-up, we added temporal design as a main effect and an interaction term to this model to see whether the matching effect was significant for studies where input and outcome are collected during the same observation session (i.e., where parent input might situationally influence child output, or vice versa). We found a significant interaction between these variables, such that reported correlations were higher in concurrent studies where input and outcomes were measured in the same way (main effect of matching: β = −0.01, SE = 0.08, p = 0.88; interaction: β = 0.23, SE = 0.07, p = 0.01).
In our analysis of publication bias, neither peer review nor whether the correlation was reported was found to moderate the effect of word types on outcomes. However, we did find evidence of funnel plot asymmetry using Egger’s test (β = 1.22, SE = 0.41, p = 0.02), such that studies with larger standard errors tended to be skewed towards larger positive values (Figure 7).
3.4 MLU
We examined 74 correlations across 27 studies that measured mean length of utterance (n = 2340 unique participants), with the number of effect size estimates per study ranging from 1 to 6 (median: 3). We found a medium-sized effect across studies (r = 0.21, Figure 8). Q-statistics revealed significant evidence for between-study heterogeneity (Q(73) = 339.42, p < 0.001), motivating an analysis of possible moderators. We found a significant and positive correlation between the length of the observation session and effect size (Table 5, χ²(1) = 12.92, p < 0.001), with longer studies producing larger effect sizes. We might expect to see such an effect if MLU measures were more stable when the sample of utterances is larger. Neither our analysis of publication status nor the Egger's test (β = 0.54, SE = 0.47, p = 0.25) revealed any evidence of publication bias (Figure 9).
3.5 Comparison of input measures
Finally, to compare the effect sizes for our four input measures, we first constructed a baseline model with no moderators, containing 323 input–outcome correlations across all 71 studies (Table 6). Overall, there was a medium-sized association between input and children's language outcomes (r = 0.24, p < 0.001; 95% CI [0.20, 0.29]). Next, we added input measure as a moderating variable, using tokens as the contrast case. No significant difference in effect size was found between our input measures, and there was no improvement of model fit.
4 Discussion
The present meta-analysis drew upon 71 studies and 4760 participants to explore the magnitude of input effects. The analysis included 38 studies that were not included in the most recent prior meta-analysis on this topic (AGPJM). In addition, our analysis employed an innovative statistical method that allowed us to include multiple effect sizes per study, resulting in a total of 323 effect sizes. As yet, there are no widely accepted methods for determining power in multilevel meta-regression models (e.g., Vembye et al., 2023). Nevertheless, we should expect, on first principles, that the accuracy and sensitivity of an analysis will increase as the number of studies and effects that are included increases. Our meta-analysis also contributed to this literature by including studies conducted on languages other than English and using more sensitive within-study comparisons to explore differences in effect size across measures.
We found that the relationship between caregiver input and child language outcomes is reliable across four different measures of caregiver input: utterances, word tokens, word types, and MLU. These measures all produced similar small-to-medium-sized effects, with no significant differences between them. For word types, we found that effect sizes were reliably larger when children were older. For MLU, we found that effect sizes were larger in studies with longer observation sessions. We also found evidence that using parent and child word types collected from the same session produces larger correlations.
However, most of the moderators that have been hypothesized to be relevant were not reliable predictors of effect size in our analyses. These included caregiver demographics, child demographics, and whether the measure was based on a speech sample, an experimenter-administered test, or a parent report. Critically, we did not replicate three findings from the AGPJM meta-analysis. In our sample, we found no evidence that naturalistic studies have larger effect sizes than more structured observations, nor that studies with cross-lagged observations have larger effects than studies with concurrent observations. Furthermore, we found no evidence to support the claim that measures of input quality are more reliable predictors than measures of input quantity, despite using potentially more sensitive within-study models.
Finally, we found evidence of publication bias for studies where parental word types were the critical input variable. In contrast, AGPJM found evidence for publication bias in studies of input quantity (i.e., tokens and utterances).
In the remainder of our discussion, we address four issues: (1) assessing the pooled effect sizes observed in this meta-analysis and how they affect our understanding of the input literature and its policy implications; (2) interpreting the moderator effects observed in our analysis; (3) understanding the null effects in this analysis; and (4) the limitations of this meta-analysis and the input literature more broadly. Throughout our discussion, we conduct exploratory analyses on critical subgroups of studies within our sample to rule out different hypotheses about our results.
4.1 Assessing the magnitude of the pooled effects
A central goal of meta-analysis is to better understand how large a particular effect truly is. In our analyses, we found pooled effect sizes that ranged from r = 0.19 to r = 0.27. It is easier to conceptualize these effects if we convert them to R² so that they represent the proportion of variance accounted for by the input variable. On this scale, the effects range from R² = 0.04 for utterances to R² = 0.07 for types. These estimates are quite similar to those in AGPJM, even though less than half (46%) of the studies in our sample appeared in their sample and the outcome measures were categorized differently. Specifically, in AGPJM, the estimates ranged from R² = 0.04 for quantity to R² = 0.11 for speech complexity.
These estimates might seem modest to those who were introduced to this question by Hart & Risley's seminal 1995 study. It is hard to overstate the effect that H&R have had on language acquisition research and social policy; as of July 2024, Google Scholar lists over 12,000 citations to their 1995 book. However, the magnitude of the input–outcome relationship found by H&R for caregiver word types is substantially higher than the pooled effect size found in our meta-analysis (R² = 0.53 versus R² = 0.07). This difference would be critical, for example, for our expectations about the impact of a policy that sought to improve child language outcomes with parent training. Thus, understanding this discrepancy is critical to understanding how we can best use limited resources.
There are two broad explanations for these divergent effect size estimates. First, the H&R study might have unique properties that lead the true effect to be larger in their sample. Second, it is possible that H&R is simply one sample drawn from an underlying distribution in which the true effect size is roughly equivalent to the estimate from our analysis. The first explanation could lead to new directions for research and new ways in which policies might be targeted. The second hypothesis suggests that we, as scientists and policy makers, may need to adjust our expectations.
There are several features of H&R that stand out as potential reasons for a larger effect size. First, their measures of parent speech and their outcome measures were based on large samples of speech, collected over an unusually long time frame, in a naturalistic context. Specifically, as many as 29 hour-long observation sessions were conducted in the child's home between the ages of 7 months and 3 years. In contrast, the other studies of word types in our sample used between 2 minutes and 2 hours of input (M = 26 minutes). As a result, H&R may have produced more accurate estimates of parental speech, resulting in a larger observed correlation. Similarly, their primary outcome variable was an estimate of lexical types produced across three hour-long observation sessions. The mean length of the child observations across studies of word types in our sample was 15 minutes. If these factors were responsible for the larger effect size, it would suggest (1) that policymakers could expect large effects from interventions that are effective in changing parental input in enduring ways, and (2) that researchers (and clinicians) should consider longer data collection periods for input studies. Our meta-analysis, however, does not support this conclusion: for studies that used parental types as an input measure, we found no evidence that the length of the observation period moderated the effect size. One might question the relevance of our moderator analysis, since there are few studies with an observation period anywhere near as long as H&R's. We disagree: if the advantage of larger speech samples is that they are less noisy, then we would expect to see the steepest improvement in stability at the low end of the scale. This is, however, ultimately an empirical question. The relationship between sample size and correlation strength could be directly tested by conducting secondary analyses of the H&R data set to determine how rapidly input measures approximate the estimate from the total sample as increasingly large subsamples of input are analysed.
The second feature that makes H&R unusual is that the sample was selected to overrepresent the extreme ends of the socio-economic spectrum in the U.S. Of the 42 participating families, six were receiving welfare benefits and thirteen were recruited because the primary wage earner was a high-status professional. In their sample, there was a strong relationship between SES and input, with caregivers in professional families producing about three times as much speech as the parents receiving benefits. Subsequent studies have not found differences between their SES groups that are anywhere near this large (see Dailey & Bergelson, 2022). For example, Hoff (2003) found that the high SES families in her sample produced roughly 33% more speech than her mid SES group (see also Gilkerson et al., 2017). It is unclear whether this difference in findings reflects the unusual composition of the Hart and Risley sample, changes in child-rearing practices across communities in the U.S. over time, or the way in which their input measures were collected and conceptualized. But critically, whatever its cause, the tighter link between SES and input in the Hart and Risley sample raises the possibility that the unusually large correlation between input and outcome in that study is attributable to other causal pathways linking parental SES to child language outcomes, such as passive gene–environment correlations (see Coffey et al., 2022 for discussion) or the effect of higher maternal education on a wider range of parenting practices that might influence children's linguistic and cognitive development.
As we noted above, the second hypothesis is that H&R is drawn from the same underlying distribution as the other studies, with a true effect size around R² = 0.07. H&R's sample size was modest for a study focused on individual differences (n = 42). We expect the variability of effect sizes to increase as sample size decreases (giving funnel plots their characteristic shape). The H&R results, however, fall outside the range of what we might expect on that basis alone (see Figure 7). In fact, even if the true effect size were moderately larger than our estimate (e.g., r = 0.35, the upper limit of our 95% confidence interval), H&R would remain an outlier.
4.2 Moderator effects
4.2.1. Older children benefit more from lexical diversity
We found that the pooled effect size for word type studies was larger when the children studied were older at both the time of input collection and the time of language assessment. Why might younger children benefit less from lexical diversity?
One possibility is that the words that children acquire early in life are so common and so concrete (Braginsky et al., 2019; Coffey & Snedeker, 2024) that they are likely to appear in informative contexts even in the speech of parents who exhibit lower lexical diversity. After children pick up these more common words, they learn words that are less frequent and less consistent across parents but are still quite concrete, like tiger or truck (Coffey et al., 2024). At this stage, children who hear more diverse input could reap the benefit of encountering more word types. In addition, as children become more linguistically proficient, they are more likely to learn words that are frequent but not particularly concrete (Coffey et al., 2024). Words of this kind can often only be acquired by using information from the different contexts in which the word is used (Gillette et al., 1999). More lexically diverse speech might be more likely to provide these clues.
We did not find any evidence that age moderated the effects of the other input measures. The fact that the quantity of input (tokens and utterances) is equally helpful across this developmental range is consistent with most learning theories—more learning opportunities are helpful even for the easiest words. However, to the extent that these measures are correlated with types (see below), we would expect to find an effect of age given a sufficiently large sample of studies and participants.
The prior literature on MLU has been mixed. Some conclude that it only predicts outcomes when it is tailored to children's level of development (in which case we would expect an effect of age, e.g., Murray et al., 1990). Others find that complex speech predicts outcomes at all ages (in which case we would not expect an effect of age, e.g., Hoff-Ginsberg, 1986). The fact that MLU does predict outcomes across studies but is not moderated by age suggests the latter may be true.
Interestingly, while AGPJM found an effect of child age on correlations with input quality, when they conducted separate analyses on studies that contained measures of lexical diversity (equivalent to our word types) and sentence complexity (consisting of not only MLU, but also other measures, such as sophistication, rare words, and multi-clausal utterances), they found no effects. Our results suggest that the age effects in their primary analysis were likely driven by word types, rather than complexity. Our ability to find this effect within the studies of lexical diversity is likely due to the larger sample of relevant studies available to us (N = 37 versus N = 17).
4.2.2. Length of the observation and MLU
We found that MLU studies with longer observation sessions reported larger correlations. This could be because longer observation sessions produce more stable measures of MLU. However, this leaves open the question of why we did not find a moderating effect in our analyses of utterances, word tokens, or word types. One possibility is that MLU is intrinsically a less stable measure than the others, requiring a longer session to measure reliably. This could be assessed using existing data sets (e.g., by comparing correlations across different-sized sub-samples). AGPJM also found that observation duration moderated the effect size of input quality studies. Our results suggest that this finding may be driven by studies using sentence complexity measures, rather than lexical diversity measures.
4.3 Surprising non-moderators
4.3.1. Observation activity
We did not find significant differences in the size of the input–outcome correlations depending on the activity during the observation session. In contrast, in their analysis of input quality, AGPJM found that studies using naturalistic observation produced larger effect sizes, as compared to other contexts. A priori, we might expect larger effect sizes from naturalistic observations because they might be more representative of typical input. Previous studies, however, have found that input measures from structured and naturalistic observations are correlated (Tamis-LeMonda et al., 2017).
One possibility is that we reduced the difference in effect size between naturalistic studies and other studies by including LENA studies, which are naturalistic but were omitted by AGPJM. We do not believe that this is the case: omitting LENA studies from our analysis did not change our results (see Supplementary Materials). We do not find this surprising, as meta-analyses of LENA studies and non-LENA studies result in similar effect-size estimates (Wang et al., 2020; AGPJM).
Instead, we suspect that the finding in AGPJM is attributable to the very small number of naturalistic studies in their sample (5 for the quality analysis). This could make their moderator analysis vulnerable to skewing due to a couple of naturalistic studies with unusually large effect sizes (such as H&R). In contrast, our sample of naturalistic studies was larger (10 for types), which may have made our analysis less sensitive to skewing.
4.3.2. Cross-lagged versus concurrent studies
We found no significant differences between studies where input and outcome are collected concurrently and studies where data collection was cross-lagged. In contrast, AGPJM found larger correlations in quantity studies that were cross-lagged. This is unlikely to reflect skew due to outliers since there are a number of studies of both kinds in their sample (N = 16 for concurrent, N = 17 for cross-lagged).
Given our large sample of effect sizes (k = 93 for tokens; k = 45 for utterances), it is unlikely that we lacked the power to detect such an effect. Instead, we suspect that their finding was a side-effect of their hierarchical data selection procedure: if studies had both cross-lagged and concurrent correlations, only a cross-lagged correlation was included in the meta-analysis, creating a confound between study complexity/length and temporal design. In our study, all correlations were included. In addition, given the large number of moderators in these meta-analyses and the confounds between them, we are likely to find effects that shrink or disappear as more data are collected (Barnett et al., 2005; Gelman & Carlin, 2014).
4.3.3. Quality versus quantity of input
Another notable difference between our study and AGPJM is that we did not find a difference in effect size between any of our input measures. Some researchers have argued that measures of input quality are better suited to predict individual differences in language outcomes than measures of input quantity (e.g., Golinkoff et al., Reference Golinkoff, Hoff, Rowe, Tamis‐LeMonda and Hirsh‐Pasek2019). This would be expected if the pace of acquisition depended not primarily on the number of words a child encounters but on the degree to which the context of word use allows the child to infer their meanings. While the logic behind this argument is sound, one might still expect the effect sizes for quantity and quality measures to be quite similar because in practice the two are often highly correlated. To explore this, we calculated these correlations for the studies in our sample with available data. There were large correlations between types and tokens (range: r = 0.65–0.94; median: r = 0.88; k = 9), types and utterances (range: r = 0.45–0.90; median: r = 0.75; k = 6), and MLU and tokens (range: r = 0.19–0.71; median: r = 0.44; k = 8). The only correlation that was small and sometimes negative was between utterances and MLU (range: r = −0.44 to 0.36; median: r = 0.12; k = 8).
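For concreteness, a minimal sketch of the collinearity check described above might look as follows. The per-study data frames and column names are assumptions of ours; `df.corr()` simply computes the pairwise Pearson correlations of the kind reported in the text.

```python
# A minimal sketch (ours) of the collinearity check described above: within
# one study's child-level data, correlate the input measures; across studies,
# summarize one pair by its median and range. Data frames and column names
# are hypothetical.
import numpy as np
import pandas as pd

def input_measure_correlations(df):
    """Pairwise Pearson correlations among per-child input measures."""
    return df[["tokens", "types", "utterances", "mlu"]].corr()

def summarize_pair(study_dfs, a="types", b="tokens"):
    """Range and median of the a-b correlation across studies."""
    rs = [input_measure_correlations(df).loc[a, b] for df in study_dfs]
    return {"k": len(rs), "min": min(rs), "max": max(rs),
            "median": float(np.median(rs))}
```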
4.4 Remaining questions
4.4.1. Culture and language
The dearth of input studies conducted outside of the Western world or with speakers of non-Western languages makes a systematic investigation of cultural or linguistic moderators of the input–outcome relationship difficult. In our analysis, we approached this question by characterizing studies as either “within the U.S.” or “outside the U.S.” and as either “English” or “non-English.” This classification cannot be justified on cultural, geographic, or linguistic grounds. It makes sense only in light of the degree to which developmental research in general, and work on this topic in particular, has focused on English-speaking populations within the United States (Kidd & Garcia, Reference Kidd and Garcia2022). It was the only coding scheme that would allow us to amass a reasonable, albeit small, number of studies in the second group.
Nevertheless, we see this as one small but important step in using meta-analytic approaches to examine cross-cultural input studies. We found no evidence that studies conducted in English or in the U.S. produced larger or smaller correlations than other studies. This finding has two critical limitations. First, due to the small number of non-English studies (12/75) and non-US studies (16/75), we may lack the power to detect modest effects. Second, the non-English and non-US samples consisted of families living in urban areas of Europe, East Asia, or North America. While there are a few input studies conducted in rural agrarian settings (e.g., Mastin & Vogt, Reference Mastin and Vogt2016; Shneidman & Goldin-Meadow, Reference Shneidman and Goldin-Meadow2012; Weber et al., Reference Weber, Fernald and Diop2017; Zhang et al., Reference Zhang, Liu, Pappas, Dill, Feng, Zhang, Zhao, Rozelle and Ma2023), these studies were not eligible for our meta-analysis for a number of reasons (e.g., no appropriate input measures, no input–outcome correlations reported, or conducted after the final search). Thus, we cannot speak to the degree to which input–output correlations vary across the full range of human societies.
Cross-cultural research is critical for understanding the nature of input–outcome correlations and what they might reveal about the causal role of input in language development. There is considerable cross-cultural variation in how parents speak to their children and in parents’ beliefs about the role this plays in language development (Schieffelin & Ochs, Reference Schieffelin and Ochs1986). Our current understanding of the relationship between input variation and outcome is based almost entirely on a narrow set of environments (mostly in the U.S., mostly in English) in which talking to young children is not only accepted but encouraged and deemed valuable. Determining whether the magnitude of these input–outcome correlations is affected by variation in mean input amount or by variation in language socialization practices would provide critical insights into the causal connections between input and outcome. If the correlations with caregiver speech shrink or disappear entirely in some contexts, it might suggest that other sources of input play a larger role in those contexts, that additional factors need to be present for input to set the pace for outcomes, or that third variables (like maternal education) inflate the correlations in WEIRD societies. If the correlations are present cross-culturally but increase with the degree of input variation within the population, this would provide additional support for the simple causal model in which input sets the pace for early acquisition.
Currently, however, we are in no position to address these questions. While our analysis of mean input as a moderator was negative for all input measures, the range of variation it covered was restricted to what is found in urbanized societies where formal education is valued. Our review confirms the need for additional research on input effects in non-Western societies, small-scale societies, agrarian societies, and societies where secondary education is less common.
4.4.2. Socioeconomic status
There are compelling reasons to believe a priori that we would find larger input effects in studies of low-SES households. Environmental differences account for more variance in the developmental outcomes of children from lower-SES households than in those of children from higher-SES households (Turkheimer et al., Reference Turkheimer, Haley, Waldron, D’Onofrio and Gottesman2003). One explanation for this pattern is that children in lower-SES households experience more environmental heterogeneity than children in higher-SES households. If this were the case, we would expect to see larger input–outcome correlations in studies with low-SES households, which we did not. One possibility is that we were underpowered to find such effects: we had fewer studies that drew only from low-SES households than from middle-to-upper-SES households (e.g., N = 7 Low versus N = 24 M-U in our analysis of word types).
We might also have expected to find larger correlations in studies with socioeconomically diverse samples of children (relative to middle-upper SES samples). Recent studies have confirmed that there are, on average, modest but reliable differences in the amount of child-directed speech between higher-SES and lower-SES households (Dailey & Bergelson, Reference Dailey and Bergelson2022). Thus, we would expect that studies sampling across socioeconomic groups would capture greater variation in input and therefore yield larger input–outcome correlations. However, the absence of a moderating effect in both AGPJM’s meta-analysis and our own suggests that this is not the case. Here again, power is a concern: there are fewer studies in our analyses that draw from multiple socioeconomic groups (e.g., N = 6 Diverse versus N = 24 M-U in our analysis of word types), as the simulation sketch below illustrates.
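A back-of-the-envelope simulation makes the power concern concrete. The group sizes match our word-types comparison (k = 7 low-SES vs. k = 24 middle-upper SES); all other parameter values (per-study sample size, heterogeneity, and the size of the hypothesized subgroup difference) are assumptions chosen purely for illustration.

```python
# A back-of-the-envelope power simulation (ours, not from the paper) for
# detecting a subgroup difference in mean effect size with k = 7 vs. k = 24
# studies. Study size, heterogeneity (tau), and the hypothesized difference
# are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)

def subgroup_power(k1=7, k2=24, diff=0.10, tau=0.10,
                   n_per_study=60, sims=5000):
    se = 1.0 / np.sqrt(n_per_study - 3)  # Fisher-z sampling SE per study
    v = se**2 + tau**2                   # total per-study variance
    hits = 0
    for _ in range(sims):
        g1 = rng.normal(0.25 + diff, np.sqrt(v), k1)  # low-SES studies
        g2 = rng.normal(0.25, np.sqrt(v), k2)         # middle-upper SES
        z = (g1.mean() - g2.mean()) / np.sqrt(v / k1 + v / k2)
        hits += abs(z) > 1.96
    return hits / sims

# print(subgroup_power())  # roughly 0.3 under these assumptions
```

Under these assumptions, the power to detect a 0.10 difference in mean Fisher-z effect size is roughly 0.3, far below conventional thresholds.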
4.5 Limitations
4.5.1. Unable to establish causality
Although our approach has demonstrated that associations between input and outcome are reliable across studies, correlational research of this kind cannot disambiguate the causal relationship between these factors. It is possible that the robust relationships between input and outcome we observe are caused by a third variable that influences both parental speech and the pace of child language acquisition. For example, input could be related to other environmental factors that impact development, such as general parental attentiveness or the availability of educational materials in the home. In addition, most input studies are conducted with children and their biological parents. Language ability, like most human characteristics, is greatly influenced by genetic factors (Polderman et al., Reference Polderman, Benyamin, De Leeuw, Sullivan, Van Bochoven, Visscher and Posthuma2015; Stromswold, Reference Stromswold2001). This introduces the possibility that the associations between caregiver input and children’s language outcomes we observe are genetic in nature: verbal parents have verbal children because they pass on those genes.
Nevertheless, there are several reasons to believe that these effects might be causal in nature. First, parent-targeted randomized controlled trials that produce changes in input often also improve language outcomes (e.g., Suskind et al., Reference Suskind, Leffel, Graf, Hernandez, Gunderson, Sapolich, Suskind, Leininger, Goldin-Meadow and Levine2016; Weber et al., Reference Weber, Fernald and Diop2017). Second, as we have seen in our meta-analysis, input effects persist across a range of environments and measures, suggesting that, if there is a third variable underlying the pattern, it must be one that is correlated with both input and outcome across all of these environments. Finally, although there are only a few studies that use genetically non-confounded designs, these studies find reliable input–outcome correlations (Hardy-Brown et al., Reference Hardy-Brown, Plomin and DeFries1981; Huttenlocher et al., Reference Huttenlocher, Vasilyeva, Cymerman and Levine2002; Gauthier et al., Reference Gauthier, Genesee, Dubois and Kasparian2013; Coffey et al., Reference Coffey, Shafto, Geren and Snedeker2022, but see Wadsworth et al., Reference Wadsworth, Corley, Hewitt, Plomin and DeFries2002). Unfortunately, there are not enough studies of this kind to use meta-analysis to determine whether the input–outcome correlation in these studies is smaller than in studies with a genetic confound. Future work of this kind is necessary to understand the complex causal pathways linking language input and outcomes.
4.5.2. Remaining sources of bias
The only evidence of publication bias that we found was asymmetry in the funnel plot for word types, indicating that studies with smaller samples reported larger effects than we would expect. This asymmetry could reflect differences in the methods used in larger and smaller studies, but it could also result from studies with non-significant correlations being culled from the literature. We attempted to address this by reaching out to authors for unpublished studies, but this approach is unlikely to eliminate this source of bias entirely. In a paper examining 10 meta-analyses across different areas of language and cognitive development, Tsuji et al. (Reference Tsuji, Cristia, Frank and Bergmann2020) found that including unpublished literature did not significantly change estimated effect sizes. This may be because unpublished data are often collected by contacting authors within familiar networks, which may favour the reporting of positive results. Furthermore, it is likely that many published studies reported only a subset of the correlations they calculated. Preregistration and open data access have emerged as partial solutions to this problem.
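As one concrete way to probe funnel-plot asymmetry of the kind described above, an Egger-style regression can be run directly on per-study correlations. The sketch below is a generic implementation under standard assumptions (Fisher-z effects with SE = 1/√(n − 3)), not a reproduction of our analysis pipeline.

```python
# A generic Egger-style regression test for funnel-plot asymmetry (a sketch
# under standard assumptions -- Fisher-z effects with SE = 1/sqrt(n - 3) --
# not a reproduction of the paper's analysis pipeline). A nonzero intercept
# signals that smaller studies report systematically different effects.
import numpy as np
import statsmodels.api as sm

def egger_test(r, n):
    """r: per-study correlations; n: per-study sample sizes."""
    r, n = np.asarray(r, float), np.asarray(n, float)
    rz = np.arctanh(r)             # Fisher z transform
    se = 1.0 / np.sqrt(n - 3)      # standard error of rz
    X = sm.add_constant(1.0 / se)  # precision as the predictor
    fit = sm.OLS(rz / se, X).fit() # standardized effect on precision
    return fit.params[0], fit.pvalues[0]  # intercept and its p-value
```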
4.5.3. Limited data on other kinds of input
Almost all of the studies considered here tracked child-directed speech produced by adults. An open question in language development is the degree to which children benefit from overheard speech or speech produced by other children. Many accounts of rural societies stress the importance placed on children’s ability to learn about the adult world by watching and listening (e.g., Schieffelin & Ochs, Reference Schieffelin and Ochs1986; Shneidman & Goldin-Meadow, Reference Shneidman and Goldin-Meadow2012). In many cultures, older children assume caregiving responsibilities early in life and potentially account for a large share of the input to young learners (e.g., Loukatou et al., Reference Loukatou, Scaff, Demuth, Cristia and Havron2022; Shneidman & Goldin-Meadow, Reference Shneidman and Goldin-Meadow2012). Some previous studies have suggested that overheard speech and sibling speech are less useful for learners, at least in WEIRD societies (e.g., Mannle et al., Reference Mannle, Barton and Tomasello1992). This might lead us to expect smaller or non-significant relationships with outcomes than those found for maternal input. Within our sample, the few studies of overheard speech (N = 3) and sibling input (N = 2) report uniformly null results. Nevertheless, omitting these sources of speech risks mischaracterizing the early language environments of children in other cultural contexts (Sperry et al., Reference Sperry, Sperry and Miller2019).
4.6 Conclusion
In our sample of 71 input studies, we found that caregiver input predicted child language outcomes, albeit to a lesser degree than some early studies suggested (R² = 0.04–0.07). The size of these input–outcome associations is similar across different input measures. For word types, we found evidence that the correlation increases with age, as well as evidence of publication bias. For mean length of utterance, we found larger associations between input and outcome measures in longer observation sessions.
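For readers who prefer correlation units, the variance-explained range converts directly (for a simple bivariate association):

\[
r = \sqrt{R^{2}}, \qquad \sqrt{0.04} = 0.20, \qquad \sqrt{0.07} \approx 0.26.
\]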
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0305000924000692.
Data availability statement
Our data are fully available via OSF: “Does Talking To Children Matter? A Meta-Analysis” (https://osf.io/aydcf/).
Acknowledgements
We would like to thank all the authors who contributed publications and data to this project: Carina Lüke, Meredith Rowe, Elaine Smolen, Ronda Rufsvold, Maria Hartman, Jill Gilkerson, Erika Hoff, Riccardo Fusaroli, Eric Walle, Laurie Hoffman, Mitsuhiko Ota, Mele Taumoepeau, Katrina D’Apice, Gabrielle Strauss, Naja Ferjan Ramírez, Daniel Swingley, Shuxia Liu, Aleka Akoyunoglou Blackwell, Allegra Cattani, Katie Alcock, and Christine Cox Eriksson. We would also like to thank our research assistant Claire Lin for her work on data collection and coding.
Competing interest
We have no known competing interests to disclose.