The use of film subtitles to estimate word frequencies

Published online by Cambridge University Press: 28 September 2007

BORIS NEW (Université Paris Descartes and CNRS)
MARC BRYSBAERT (Royal Holloway, University of London)
JEAN VERONIS (Université de Provence)
CHRISTOPHE PALLIER (CNRS, INSERM, and Service Hospitalier Frédéric Joliot)

Abstract

We examine the use of film subtitles as an approximation of word frequencies in human interactions. Because subtitle files are widely available on the Internet, they may present a fast and easy way to obtain word frequency measures in language registers other than text writing. We compiled a corpus of 52 million French words, coming from a variety of films. Frequency measures based on this corpus compared well to other spoken and written frequency measures, and explained variance in lexical decision times in addition to what is accounted for by the available French written frequency measures.
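
As a rough illustration of the kind of measure described above, the following sketch counts word occurrences in subtitle text and converts them to frequencies per million words. It is not the authors' procedure: the SubRip-style input layout, the simple letter-run tokenizer, and the names `TOKEN_RE`, `is_cue_metadata`, and `frequencies_per_million` are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of Unicode letters, so "l'eau" becomes "l" and "eau".
# A real French corpus would need more careful handling of elision and hyphens.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def is_cue_metadata(line):
    """Return True for SubRip-style cue numbers and timestamp lines (assumed format)."""
    stripped = line.strip()
    return stripped.isdigit() or "-->" in stripped


def frequencies_per_million(lines):
    """Map each lowercased word to its frequency per million tokens."""
    counts = Counter()
    for line in lines:
        if is_cue_metadata(line):
            continue  # skip cue indices and timing information
        counts.update(token.lower() for token in TOKEN_RE.findall(line))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count * 1_000_000 / total for word, count in counts.items()}


if __name__ == "__main__":
    sample = [
        "1",
        "00:00:01,000 --> 00:00:03,000",
        "Bonjour, ça va ?",
        "",
        "2",
        "00:00:04,000 --> 00:00:06,000",
        "Ça va bien.",
    ]
    for word, freq in sorted(frequencies_per_million(sample).items()):
        print(f"{word}\t{freq:.1f}")
```

Normalizing to occurrences per million words makes counts from corpora of different sizes comparable. Note that this sketch also discards subtitle lines consisting only of digits, which a production pipeline might want to keep.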

Type: Articles
Copyright: 2007 Cambridge University Press
