
The cross-linguistic performance of word segmentation models over time

Published online by Cambridge University Press: 11 October 2019

Andrew CAINES*
Affiliation:
Department of Computer Science & Technology, University of Cambridge, Cambridge, UK
Emma ALTMANN-RICHER
Affiliation:
Faculty of Modern & Medieval Languages, University of Cambridge, Cambridge, UK
Paula BUTTERY
Affiliation:
Department of Computer Science & Technology, University of Cambridge, Cambridge, UK
*Corresponding author: Department of Computer Science & Technology, William Gates Building, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK. E-mail: [email protected]

Abstract

We select three word segmentation models with psycholinguistic foundations – transitional probabilities, the diphone-based segmenter, and PUDDLE – which track phoneme co-occurrence and positional frequencies in input strings, and in the case of PUDDLE build lexical and diphone inventories. The models are evaluated on caregiver utterances in 132 CHILDES corpora representing 28 languages and 11.9 million words. PUDDLE shows the best performance overall, albeit with wide cross-linguistic variation. We explore the reasons for this variation, fitting regression models to performance scores with linguistic properties which capture lexico-phonological characteristics of the input: word length, utterance length, diversity in the lexicon, the frequency of one-word utterances, the regularity of phoneme patterns at word boundaries, and the distribution of diphones in each language. These properties together explain four-tenths of the observed variation in segmentation performance, a strong outcome and a solid foundation for studying further variables which make the segmentation task difficult.
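To make the idea of tracking phoneme co-occurrence concrete, the following is a minimal sketch of a transitional-probability segmenter. It is an illustration only and not the configuration evaluated in the article: it assumes forward transitional probabilities, TP(a→b) = count(ab) / count(a), and a relative threshold that places a boundary wherever the probability is a local minimum. The function names and the toy corpus of invented syllables are hypothetical.

```python
from collections import Counter


def train_tp(utterances):
    """Count phoneme unigrams and bigrams over unsegmented utterances.

    Each utterance is a string whose characters stand for phonemes
    (an assumption made for this toy example).
    """
    unigrams, bigrams = Counter(), Counter()
    for utt in utterances:
        phones = list(utt)
        # Count only phonemes that have a successor, so that
        # bigram count / unigram count is a proper conditional probability.
        unigrams.update(phones[:-1])
        bigrams.update(zip(phones, phones[1:]))
    return unigrams, bigrams


def forward_tp(a, b, unigrams, bigrams):
    """Forward transitional probability of phoneme b given phoneme a."""
    return bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0


def segment(utterance, unigrams, bigrams):
    """Split an utterance wherever the forward TP is a local minimum."""
    phones = list(utterance)
    tps = [forward_tp(a, b, unigrams, bigrams)
           for a, b in zip(phones, phones[1:])]
    words, start = [], 0
    # tps[i] is the transition between phones[i] and phones[i + 1];
    # a dip relative to both neighbours is treated as a word boundary.
    for i in range(1, len(tps) - 1):
        if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]:
            words.append("".join(phones[start:i + 1]))
            start = i + 1
    words.append("".join(phones[start:]))
    return words


# Toy input: three invented "words" (ba, di, ku) in varying order.
corpus = ["badi", "baku", "diku", "kuba", "diba", "kudi"]
unigrams, bigrams = train_tp(corpus)
print(segment("badiku", unigrams, bigrams))
```

In this toy corpus the within-word transitions are fully predictable while the between-word transitions are not, so the dips in transitional probability coincide with the word boundaries. Real caregiver speech is far noisier, which is one reason performance varies across languages.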

Type
Articles
Copyright
Copyright © Cambridge University Press 2019 

