
Assessing two methods of webcam-based eye-tracking for child language research

Published online by Cambridge University Press:  07 May 2024

Margaret Kandel*
Affiliation:
Department of Psychology, Harvard University, USA
Jesse Snedeker
Affiliation:
Department of Psychology, Harvard University, USA
Corresponding author: Margaret Kandel; Email: [email protected].

Abstract

We assess the feasibility of conducting web-based eye-tracking experiments with children using two methods of webcam-based eye-tracking: automatic gaze estimation with the WebGazer.js algorithm and hand annotation of gaze direction from recorded webcam videos. Experiment 1 directly compares the two methods in a visual-world language task with five- to six-year-old children. Experiment 2 more precisely investigates WebGazer.js’ spatiotemporal resolution with four- to twelve-year-old children in a visual-fixation task. We find that it is possible to conduct web-based eye-tracking experiments with children in both supervised (Experiment 1) and unsupervised (Experiment 2) settings – however, the webcam eye-tracking methods differ in their sensitivity and accuracy. Webcam video annotation is well-suited to detecting fine-grained looking effects relevant to child language researchers. In contrast, WebGazer.js gaze estimates appear noisier and less temporally precise. We discuss the advantages and disadvantages of each method and provide recommendations for researchers conducting child eye-tracking studies online.

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Introduction

Visual-world eye-tracking is an important tool for studying real-time language processing in children. In the visual-world paradigm, participants are presented with a display, and their eye-movements are recorded as they listen to or produce an utterance. Individuals systematically look to referents or associates of the words they hear (e.g., Cooper, Reference Cooper1974; Tanenhaus et al., Reference Tanenhaus, Spivey-Knowlton, Eberhard and Sedivy1995) or are planning to produce (e.g., Griffin & Bock, Reference Griffin and Bock2000; Meyer et al., Reference Meyer, Sleiderink and Levelt1998). Saccades are tightly linked to linguistic information, with fixations to relevant stimuli rising within 200ms of the onset of linguistic cues in adults (e.g., Allopenna et al., Reference Allopenna, Magnuson and Tanenhaus1998; Cooper, Reference Cooper1974). This relationship has allowed researchers to use eye-movements to investigate a variety of questions in language processing (see Huettig et al., Reference Huettig, Rommers and Meyer2011 for review). This paradigm is particularly useful for child research, as it provides a non-invasive, real-time measure of language processing that doesn’t require meta-linguistic reasoning (cf. grammaticality judgments, lexical decision), reading ability (cf. self-paced reading), or a lengthy set-up (cf. electroencephalography). Children similarly look to relevant stimuli shortly after the onset of linguistic cues, and visual-world experiments have been used with children to study multiple levels of language processing, including phonological (e.g., McMurray et al., Reference McMurray, Danelz, Rigler and Seedorff2018; Sekerina & Brooks, Reference Sekerina and Brooks2007), morphological (e.g., Özge et al., Reference Özge, Kornfilt, Maquate, Küntay and Snedeker2022; Zhou et al., Reference Zhou, Crain and Zahn2014), syntactic (e.g., Contemori et al., Reference Contemori, Carlson and Marnis2018; Snedeker & Trueswell, Reference Snedeker and Trueswell2004; Trueswell et al., Reference Trueswell, Sekerina, Hill and Logrip1999), semantic (e.g., Borovsky et al., Reference Borovsky, Elman and Fernald2012; Brouwer et al., Reference Brouwer, Özkan and Küntay2019), and pragmatic processing (e.g., Cooper-Cunningham et al., Reference Cooper-Cunningham, Charest, Porretta and Järvikivi2020; Huang & Snedeker, Reference Huang and Snedeker2009; Kampa & Papafragou, Reference Kampa and Papafragou2020).

Visual-world experiments are primarily conducted in university labs where researchers employ specialized equipment to monitor participant gaze (e.g., SR Research, 2021; Tobii, 2021). More recently, however, algorithms that determine gaze location based on webcam video have increased interest in conducting eye-tracking experiments without specialized equipment and outside of lab settings (e.g., Erel et al., Reference Erel, Shannon, Chu, Scott, Kline Struhl, Cao, Tan, Hart, Raz, Piccolo, Mei, Potter, Jaffe-Dax, Lew-Williams, Tenenbaum, Fairchild, Barmano and Liu2022; Fraser et al., Reference Fraser, Gattas, Hurman, Robison, Duta and Scerif2021; Papoutsaki et al., Reference Papoutsaki, Sangkloy, Laskey, Daskalova, Huang and Hays2016; Valenti et al., Reference Valenti, Staiano, Sebe and Gevers2009; Valliappan et al., Reference Valliappan, Dai, Steinberg, He, Rogers, Ramachandran, Xu, Shojaeizadeh, Guo, Kohlhoff and Navalpakkam2020; Xu et al., Reference Xu, Ehinger, Zhang, Finkelstein, Kulkarni and Xiao2015). Webcam-based eye-tracking allows researchers to conduct experiments over the internet, in either supervised settings (with an experimenter present over video conferencing) or unsupervised settings (with no experimenter present). Web-based testing has several advantages, many of which are particularly relevant to child research. Participants can complete experiments from the comfort of their own homes, where children may feel more at ease. This frees families from needing to travel to the lab and make babysitting arrangements for siblings. Unsupervised web-based experiments allow for even more efficient data collection, as sessions can occur outside of working hours at whatever time is most convenient for families. Collecting data over the internet gives researchers access to more diverse populations (see Henrich et al., Reference Henrich, Heine and Norenzayan2010 for the importance of sample diversity) and languages not spoken near their home institutions. Webcam-based eye-tracking can also be used in conjunction with direct participant contact, allowing researchers to set up mobile labs wherever they can bring a laptop (e.g., schools, parks, museums, etc.).

Of the algorithms that track eye-gaze from webcam videos, the JavaScript library WebGazer.js (hereafter “WebGazer”; Papoutsaki et al., Reference Papoutsaki, Sangkloy, Laskey, Daskalova, Huang and Hays2016) has garnered the most attention from behavioral researchers. WebGazer is open-source and has been integrated into popular frameworks for running online behavioral tasks, such as PCIbex (Zehr & Schwarz, Reference Zehr and Schwarz2018), JsPsych (de Leeuw, Reference de Leeuw2015), and Gorilla (Anwyl-Irvine et al., Reference Anwyl-Irvine, Massonnié, Flitton, Kirkam and Evershed2020). Gaze estimation occurs locally in the user’s web-browser, and no video is saved, thus maintaining participant privacy. Although initially designed to detect eye-gaze during user interactions with webpages (Papoutsaki et al., Reference Papoutsaki, Sangkloy, Laskey, Daskalova, Huang and Hays2016), recent studies have explored WebGazer’s suitability for behavioral research with adults.

The results of these investigations are promising. WebGazer detects looks to perceptual stimuli shortly after they appear (e.g., Semmelmann & Weigelt, Reference Semmelmann and Weigelt2018; Slim & Hartsuiker, Reference Slim and Hartsuiker2022) and has been used to replicate previously-observed eye-tracking effects in a variety of domains, including visual inspection of faces (Semmelmann & Weigelt, Reference Semmelmann and Weigelt2018), decision making (X. Yang & Krajbich, Reference Yang and Krajbich2021), and language processing (Degen et al., Reference Degen, Kursat and Leigh2021; Slim & Hartsuiker, Reference Slim and Hartsuiker2022; Vos et al., Reference Vos, Minor and Ramchand2022). However, WebGazer has limitations compared to the eye-tracking devices typically used for in-lab studies. Specifically, the offset between estimated gaze and stimulus locations is greater and looking patterns are delayed relative to in-lab studies (e.g., Degen et al., Reference Degen, Kursat and Leigh2021; Semmelmann & Weigelt, Reference Semmelmann and Weigelt2018; Slim & Hartsuiker, Reference Slim and Hartsuiker2022). At present, it is not clear to what extent this noise is attributable to WebGazer itself as opposed to properties of the less controlled web-based setting (e.g., variations in software, hardware, environments, and internet connections) or differences in participant behavior when completing studies online.

Given these findings with adults, it seems reasonable to consider using WebGazer for web-based psycholinguistic studies with children. However, it is not obvious that WebGazer would perform as well when estimating child gaze. Child faces are smaller than those of adults, and children are likely to be in a different position relative to the webcam because of their height, which could reduce the accuracy of WebGazer’s pupil detection and gaze estimation algorithms. In addition, young children are less likely to remain in the same position for the duration of a task, and they are unlikely to have the patience to sit through extensive calibration/recalibration procedures that improve accuracy in adult studies (e.g., Semmelmann & Weigelt, Reference Semmelmann and Weigelt2018; X. Yang & Krajbich, Reference Yang and Krajbich2021). In fact, even high-end in-lab eye-trackers are less accurate when used with children (Dalrymple et al., Reference Dalrymple, Manner, Harmelink, Teska and Elison2018). Furthermore, children may have more difficulty maintaining attention when completing an experiment from home, where there may be more distractions than in controlled lab settings.

In the present study, we investigate whether it is possible to run web-based visual-world studies with school-aged children. We test two webcam eye-tracking methods: automatic gaze estimation with WebGazer and frame-by-frame annotation of gaze direction (e.g., Snedeker & Trueswell, Reference Snedeker and Trueswell2004) from webcam videos recorded via Zoom teleconferencing software (https://zoom.us/). Experiment 1 directly compares these two methods in a visual-world language task with five- to six-year-old children. We assess how well these methods discriminate both robust fixation patterns (looks to target stimuli) and more subtle eye-movement patterns of the kind relevant to child language researchers (phonemic cohort competition effects; e.g., Allopenna et al., Reference Allopenna, Magnuson and Tanenhaus1998; Sekerina & Brooks, Reference Sekerina and Brooks2007). By collecting both forms of gaze data simultaneously, we can assess the extent to which any noise observed in the WebGazer data stems from WebGazer itself as opposed to participant behavior or the web-based setting. Experiment 2 focuses more specifically on WebGazer, assessing its performance with child participants aged four to twelve years in a visual-fixation task. Experiment 2 was run without an experimenter present, allowing us to assess the feasibility of conducting unsupervised web-based eye-tracking studies with child participants.

Experiment 1: visual-world task

Experiment 1 comprised two linked experiments focused on the phonemic cohort competition effect. This effect is well-suited for testing the efficacy of web-based visual-world eye-tracking, as it has been replicated many times with both adults (e.g., Allopenna et al., Reference Allopenna, Magnuson and Tanenhaus1998; Dahan & Gaskell, Reference Dahan and Gaskell2007; Dahan et al., Reference Dahan, Magnuson and Tanenhaus2001; Farris-Trimble & McMurray, Reference Farris-Trimble and McMurray2013; Magnuson et al., Reference Magnuson, Tanenhaus, Aslin and Dahan1999; inter alia) and children (e.g., Desroches et al., Reference Desroches, Joanisse and Robertson2006; Sekerina & Brooks, Reference Sekerina and Brooks2007; Rigler et al., Reference Rigler, Farris-Trimble, Greiner, Walker, Tomblin and McMurray2015; Weighall et al., Reference Weighall, Henderson, Barr, Cairney and Gaskell2017; inter alia), and the presence of cohort activation is often used to investigate higher-level linguistic constraints on incremental language processing (e.g., Dahan & Tanenhaus, Reference Dahan and Tanenhaus2004; Gaston et al., Reference Gaston, Lau and Phillips2020; Ito et al., Reference Ito, Pickering and Corley2018; Li et al., Reference Li, Li and Qu2022; Paul et al., Reference Paul, Ziegler, Chalmers and Snedeker2019). In a visual-world context, cohort competition effects arise when listeners hear a target word that shares onset phonemes with one of the images on the screen; when hearing the onset of the target word (e.g., beaker), listeners fixate more on the image of a cohort competitor (e.g., beetle) than phonologically-unrelated distractors (e.g., carriage) (e.g., Allopenna et al., Reference Allopenna, Magnuson and Tanenhaus1998). The onset of competition effects follows a similar time-course in both adults and children, though effects continue longer in young children (Sekerina & Brooks, Reference Sekerina and Brooks2007).

Experiment 1 used two different visual displays to see how each is affected by the noise introduced in web-based experimentation. Experiment 1A used a simple two-image display (with images on the left and right), similar to many infant preferential-looking studies. Experiment 1B used the four-image display that is common in visual-world studies (one image in each quadrant). Experiment 1B’s four-image display further allows us to assess the performance of the eye-tracking methods on horizontal and vertical look discrimination.

The experiment methods and WebGazer phonemic cohort analysis were preregistered (https://osf.io/cn3ur/). The analysis of the webcam video data was exploratory. Prior to conducting Experiment 1, we ran a pilot experiment (N=24) to assess WebGazer’s performance with adult participants (see Supplementary Materials).

Methods

A more detailed description of the methods is available in the Supplementary Materials. All experiments reported in this paper were approved by the Harvard University-Area Committee on the Use of Human Subjects.

Participants

Experiment 1 had 64 participants of five and six years of age who were native monolingual speakers of American English. Half completed Experiment 1A (N=32, 14 F, 18 M; Mage=5.8 years, SD=0.6, range=5;0–6;11), and half completed Experiment 1B (N=32, 20 F, 12 M; Mage=6.2 years, SD=0.5, range=5;0–6;11). Our sample size (32 participants per experiment) is similar to psycholinguistic experiments in general and to previous studies of the phonemic cohort effect (e.g., Farris-Trimble & McMurray, Reference Farris-Trimble and McMurray2013; Huettig & McQueen, Reference Huettig and McQueen2007). Informed written consent was received from the parent or guardian for their child’s participation. Participants were compensated with a $5.00 gift card.

Materials

We selected 36 target–cohort pairs with onset overlap of one or more phonemes. As a control, each target word was pseudo-randomly assigned a competitor from another target–cohort pair with no onset overlap. The experiments consisted of 36 trials (one per word pair). The trial displays included a target image (corresponding to the target word) and a competitor image. In Experiment 1B, the displays also included two pseudo-randomly assigned distractor images whose names had different onsets from the target and competitor.Footnote 1 The trials were rotated through two conditions in two presentation lists. In the cohort condition, the competitor image depicted the cohort pair of the target (e.g., the target milk appeared with the competitor mitten). In the control condition, the target appeared with its control competitor (e.g., the target milk appeared with the competitor windmill from the cohort pair window – windmill). The cohort effect was assessed by comparing looks to the competitor images in the cohort and control conditions.

The experiments were built in PCIbex (Zehr & Schwarz, Reference Zehr and Schwarz2018) using PCIbex’s implementation of WebGazer v2 and were completed in the participant’s web-browser. To accommodate the variability in screen-sizes across participant computers, stimulus size and location were defined by browser window size (equivalent to screen-size since the experiment was displayed fullscreen). Images appeared on canvases centered in their quadrant or half of the screen (Figure 1). Throughout each trial, WebGazer tracked looks to these canvases. When WebGazer detected a look to a canvas, the canvas border turned purple.Footnote 2

Figure 1. Example Experiment 1A (left) and Experiment 1B (right) trials. Each competitor image (e.g., mitten) appeared with its own target in the cohort condition (e.g., milk, right) and with another target in the control condition (e.g., banana, left). Image canvas borders turned from gray to purple when WebGazer estimated eye-gaze to fall on the image. Stills include images from Duñabeitia et al. (Reference Duñabeitia, Crepaldi, Meyer, New, Pliatsikas, Smolka and Brysbaert2018) and Rossion and Pourtois (Reference Rossion and Pourtois2004).

Procedure

Participants completed the experiment while in a Zoom teleconference call with the experimenter(s), and the session was recorded via the Zoom meeting recording function. The participant opened the link to the experiment on their computer in Google Chrome or Mozilla Firefox and used the Zoom screen-sharing function to share the display with the experimenter. Participants using a non-Mac computer (with the exception of one Chromebook user) turned off their Zoom video prior to opening the experiment, as piloting revealed that many of these computers do not allow the same webcam to be used by Zoom and WebGazer simultaneously.

At the beginning of the experiment, the participant completed an audio check and a WebGazer calibration sequence. As we were interested in the range of calibration accuracy that would be obtained with our sample, we did not specify a minimum calibration threshold. After calibration, participants completed three practice trials followed by the 36 experimental trials. Each trial started with a calibration check. Next, the images appeared. After 2000ms, participants heard pre-recorded audio instructions telling them to “Look at the [target word]”. The images remained on screen for 2250ms after audio offset. The full experiment session took approximately 20–30 minutes.

Analysis

The data for Experiments 1A and 1B were analyzed separately. All analyses were conducted using R v4.1.0 (R Core Team, 2021).

WebGazer

In each trial, WebGazer recorded looks from trial onset to two seconds after audio offset. In each sample, a 0 or 1 was recorded for each image canvas indicating whether or not participant gaze fell upon it (0=no, 1=yes). Sampling rate varied by participant, likely dependent upon their computer, webcam, and internet connection (grand mean time between samples=96ms, SD=43ms).Footnote 3 Samples which recorded no looks to any of the image canvases were excluded from analysis (41.24% of Experiment 1A samples; 27.07% of Experiment 1B samples). To regularize sampling rates prior to analysis, we analyzed gaze locations in bins of 100ms. A time bin received a value of 1 for a canvas if at least 50% of recorded looks within the bin fell on that canvas.
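
To illustrate, the sketch below shows one way to implement this exclusion and binning step in R. It is a minimal example with hypothetical column names (participant, trial, time, target_look, competitor_look) standing in for the structure of our WebGazer output; our actual processing scripts differ in their details.

```r
# Minimal sketch of the 100ms binning procedure (hypothetical column names).
# 'samples' holds one row per WebGazer sample, with a 0/1 column per image canvas.
library(dplyr)

samples <- data.frame(
  participant     = "p01",
  trial           = 1,
  time            = c(0, 96, 180, 275, 390),   # ms from trial onset (irregular sampling)
  target_look     = c(0, 1, 1, 1, 0),
  competitor_look = c(1, 0, 0, 0, 0)
)

binned <- samples %>%
  filter(target_look + competitor_look > 0) %>%   # drop samples with no canvas look
  mutate(bin = floor(time / 100) * 100) %>%       # assign each sample to a 100ms bin
  group_by(participant, trial, bin) %>%
  summarise(
    # a bin counts as a look to a canvas if at least 50% of its samples fall on it
    target_look     = as.integer(mean(target_look) >= 0.5),
    competitor_look = as.integer(mean(competitor_look) >= 0.5),
    .groups = "drop"
  )
```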

We preregistered a cluster permutation analysis to investigate competitor looks 0–2000ms after target onset (e.g., Hahn et al., Reference Hahn, Snedeker and Rabagliati2015; Yacovone et al., Reference Yacovone, Shafto, Worek and Snedeker2021). This analysis assessed the effect of interest at each time step using generalized linear mixed-effect models (GLMMs) with a binomial distribution and logit link (step size=100ms).Footnote 4 All models in the present study were fit using the {lme4} package v1.1-27.1 (Bates et al., Reference Bates, Mächler, Bolker and Walker2015). The models had looks to the competitor image (0, 1) as the dependent variable, a fixed effect of condition (cohort, control), and random slopes and intercepts for condition by participant and item. Item was individuated by competitor image identity to account for variance in properties of the competitor images. An effect was considered reliable at a step if the absolute value of its z-value was greater than 2 (Gelman & Hill, Reference Gelman and Hill2007).Footnote 5 A minimum of two sequential reliable effects were required to comprise a cluster. To assess cluster reliability, we performed 1000 simulations reshuffling the condition labels for each participant. In each simulation, we summed the z-values of the adjacent steps in identified clusters to obtain a z-sum statistic. We compared the z-sum of the observed cluster to the distribution of each simulation’s largest z-sum. A p-value for the observed cluster was determined by its position in this distribution (e.g., for a p-value of <0.05, 95% of the z-sums in the distribution must be greater than or equal to the observed statistic).Footnote 6
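
The core of this step-by-step analysis can be sketched in R as below. The simulated data frame and column names are hypothetical stand-ins for our binned data, only a single 100ms step is fit, and the permutation loop over reshuffled condition labels is described only in the comments; the sketch illustrates the approach rather than reproducing our full analysis code.

```r
# Sketch of the per-step GLMM underlying the cluster permutation analysis
# (simulated data with hypothetical column names; one 100ms step shown).
library(lme4)

set.seed(1)
step_data <- data.frame(
  participant     = rep(paste0("p", 1:32), each = 36),
  item            = rep(paste0("i", 1:36), times = 32),
  condition       = sample(c("cohort", "control"), 32 * 36, replace = TRUE),
  competitor_look = rbinom(32 * 36, 1, 0.4)
)

m <- glmer(
  competitor_look ~ condition +
    (1 + condition | participant) + (1 + condition | item),
  data = step_data, family = binomial
)

# The condition effect counts as reliable at this step if |z| > 2. A cluster is a run
# of two or more adjacent reliable steps; its statistic is the sum of those z-values,
# which is compared against the largest z-sums obtained from 1000 datasets in which
# the condition labels are reshuffled within participants.
z_condition <- coef(summary(m))[2, "z value"]
abs(z_condition) > 2
```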

We also analyzed the effect of condition on competitor looks in two time windows: 300–700ms after target onset (preregistered) and 600–1000ms after target onset (exploratory to account for a potential WebGazer delay in look detection). The results of these analyses are broadly consistent with the findings from the cluster analyses reported below and appear in the Supplementary Materials.

We conducted an additional exploratory analysis to investigate when target image looks were reliably different from chance in each condition. For each condition, we performed cluster permutation analyses assessing looks to the side of the screen containing the target image 0–2000ms after target onset; for Experiment 1B, we performed separate analyses for the horizontal and vertical side distinctions. In Experiment 1A, a look was considered to fall on the same side of the screen as the target if it fell on the target image; the analysis thus assesses the likelihood of target image looks. In Experiment 1B, a look was considered to fall on the same side of the screen as the target if it fell on the target or on the image vertically-adjacent (for the horizontal-side analysis) or horizontally-adjacent (for the vertical-side analysis). The analyses followed the same procedure described above, except that to assess reliability, we reshuffled the trial image location configurations by participant (thus preserving for each participant the overall number of target and non-target images appearing in each quadrant). The GLMMs computed at each step had target side looks (0, 1) as the dependent variable and random intercepts for participant and item (i.e., target image identity); as the model had no fixed effect, the likelihood of target side looks was compared to chance (50%). This analysis allows us to identify when each method is able to discriminate looks to the target quadrant along both the horizontal and vertical dimensions. For Experiment 1B, we supported the results of this analysis with a multinomial regression analysis assessing when looks differed between the target and the horizontally-, vertically-, and diagonally-adjacent images (see Supplementary Materials); the results align with the target side looks analyses.
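
Because the per-step model in this analysis has no fixed effect, the comparison to chance amounts to testing whether the intercept differs from zero on the log-odds scale (a 50% looking probability corresponds to a logit of 0). A minimal sketch with simulated data and hypothetical names:

```r
# Sketch of the intercept-only GLMM used to compare target-side looks to chance (50%)
# at a single 100ms step (simulated data, hypothetical column names).
library(lme4)

set.seed(2)
side_data <- data.frame(
  participant      = rep(paste0("p", 1:32), each = 36),
  item             = rep(paste0("i", 1:36), times = 32),
  target_side_look = rbinom(32 * 36, 1, 0.6)
)

m0 <- glmer(target_side_look ~ 1 + (1 | participant) + (1 | item),
            data = side_data, family = binomial)

# Chance (50%) corresponds to an intercept of 0 on the logit scale, so |z| > 2 for the
# intercept indicates looks reliably above (or below) chance at this step.
coef(summary(m0))["(Intercept)", "z value"]
```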

Webcam video annotation

To gain further information about the eye-gaze patterns of our participants, we hand-annotated gaze direction in the webcam videos of all participants who were able to keep their Zoom video on as they completed the experiment. Trial onset times were identified from Zoom screen recordings using Python scripts that detected when the colored stimulus images appeared on screen (Anthony Yacovone, personal communication). These onsets were used to divide the continuous webcam videos into separate trial videos. Coders (blind to condition and target/competitor location) annotated gaze direction for each frame of these videos (annotation script by Anthony Yacovone).

Paralleling the WebGazer analysis, samples that were not coded as looks to one of the image locations were removed from analysis (i.e., center looks, blinks, etc.) (34.72% of Experiment 1A samples; 23.24% of Experiment 1B samples). The webcam videos had 40ms between samples. To compare to the WebGazer data, we analyzed gaze locations in bins of 100ms, following the binning procedure described above.

All videos were annotated by a single coder. To assess reliability, each video was additionally annotated by a secondary coder. Within our cluster analysis window (0–2000ms after target onset), inter-coder agreement was 92.18% in the Experiment 1A dataset and 90.02% in the Experiment 1B dataset (see Supplementary Materials for details). We performed the same analyses on the webcam video data as on the WebGazer data.
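
Assuming agreement is computed as the percentage of frames on which the two coders assign the same gaze code, a minimal R sketch (with hypothetical column names) is:

```r
# Minimal sketch of inter-coder agreement as frame-by-frame percentage agreement
# (hypothetical column names; one row per annotated video frame).
codes <- data.frame(
  coder1 = c("target", "competitor", "center", "target"),
  coder2 = c("target", "competitor", "target", "target")
)

agreement <- mean(codes$coder1 == codes$coder2) * 100   # percentage of matching frames
agreement
```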

WebGazer results

Ten Experiment 1A trials across eight participants and 18 Experiment 1B trials across seven participants were omitted from the WebGazer analysis because no data were saved for them on our server.

Calibration scores

Participant calibration scores in the initial calibration sequence ranged from 2–80% across Experiments 1A and 1B, with an average of 43% (SD=18, see Supplementary Materials for plots and more detail). Mean participant calibration scores during the calibration checks at the beginning of each experimental trial ranged from 8–50%, with an average of 30% (SD=11).

Experiment 1A

Figure 2 illustrates the increase in looks to the target image in the WebGazer output following target word articulation in both the cohort and control conditions. This pattern was similar for targets on the left and right of the screen (see Supplementary Materials). While there was a substantial rise in target looks in both conditions (~75% of looks), this rise was smaller than commonly observed in two-image studies with children and adults (e.g., 80–85% with adults and three to four year-olds in Simmons, Reference Simmons2017).

Figure 2. Mean WebGazer looks to the target and competitor images by condition in Experiment 1A. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates when looks to the target image differed from chance.

Target looks were reliably different from chance in clusters starting 800ms after target onset in the control condition (z-sum=64.94, p<0.001) and 1000ms after target onset in the cohort condition (z-sum=61.82, p<0.001).

Figure 3 focuses on the cohort effect by plotting looks to the competitor image in the cohort and control conditions. Prior to target word onset, looks to the competitor image were at chance (50%). These looks began to decline approximately 700ms after target word offset (as target looks increased). Our analyses explored whether this decline was faster in the control condition than the cohort condition. The analysis identified a reliable difference in competitor looks between conditions in a cluster 900–1099ms after target onset (z-sum=4.87, p=0.02).

Figure 3. Mean WebGazer looks to the competitor image by condition in Experiment 1A. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates when looks between conditions were reliably different in the cluster analysis.

Experiment 1B

Figure 4 shows looks to the target image, competitor image, and two distractor images (collapsed) as detected by WebGazer in the cohort and control conditions. In both conditions, WebGazer detected increased looks to the target image following target word onset. However, the effects appeared smaller than in previous studies (≤50% in the present study vs. >60% with five and six year-olds in Sekerina & Brooks, Reference Sekerina and Brooks2007).

Figure 4. Mean WebGazer looks to the target image, competitor image, and distractor images (collapsed) by condition in Experiment 1B. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates the temporal overlap of the clusters when target side looks differed from chance in both the horizontal and vertical directions.

In the control condition, looks to the side of the screen containing the target were reliably different from chance in clusters starting 900ms after target onset along the horizontal axis (z-sum=62.73, p<0.001) and 1200ms after target onset along the vertical axis (z-sum=29.10, p<0.001). In the cohort condition, clusters emerged 1000ms after target onset for the horizontal-side distinction (z-sum=50.49, p<0.001) and 1400ms after target onset for the vertical-side distinction (z-sum=17.83, p=0.001).

The observed clusters for the horizontal-side distinction had similar onsets to those in Experiment 1A (800ms in the control condition, 1000ms in the cohort condition) – however, the observed clusters for the vertical-side distinction started 300–400ms later, suggesting that WebGazer may have more difficulty discriminating looks along the vertical axis. Figure 5 plots participant looks to the target and distractor (non-target) images in the control condition 1200–2000ms after target onset (when participants were likely fixating on the target quadrant, according to WebGazer). In this window, there were more looks to the vertical distractor than the other non-target images, supporting the hypothesis that WebGazer has increased difficulty discriminating vertical looks (this pattern was confirmed in an exploratory multinomial analysis; see Supplementary Materials). A figure showing target and distractor looks by target location is available in the Supplementary Materials.

Figure 5. Boxplot of participant WebGazer fixation proportions to the target and non-target images in the Experiment 1B control trials from 1200–2000ms after target onset. Mean fixation proportions for each image are labeled and identified by black diamonds. The gray points represent participant means.

Figure 6 plots looks to the competitor image in the cohort and control conditions. Prior to target word onset, looks to the competitor image were at chance (25%). These looks began to decrease approximately 1200ms after target onset. The cluster analysis did not identify any clusters where competitor looks differed in the two conditions. Thus, we did not replicate the phonemic cohort effect.

Figure 6. Mean WebGazer looks to the competitor image by condition in Experiment 1B. Ribbons indicate standard error. Vertical lines indicate average target word duration.

Webcam video annotation results

We had video data for 13 of 32 participants for each experiment.

Experiment 1A

Figure 7 plots looks to the target and competitor images in the cohort and control conditions as detected by hand annotation and WebGazer for the 13 participants with video data. Webcam video annotation identified a higher proportion of target image looks than WebGazer. The pattern of performance was similar for targets on the left and right of the screen (see Supplementary Materials).

Figure 7. Mean looks to the target and competitor images by condition in the Experiment 1A annotated webcam video data and in the WebGazer data from the same participants. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates when looks to the target image differed from chance.

Target looks were reliably different from chance in clusters starting 500ms after target onset in the control condition (z-sum=93.02, p<0.001) and 800ms after target onset in the cohort condition (z-sum=75.36, p<0.001). These clusters started earlier than in the WebGazer data from the same participants, in which the corresponding clusters began 1100ms after target onset in both the control (z-sum=38.65, p<0.001) and cohort (z-sum=35.28, p<0.001) conditions.

Figure 8 shows looks to the competitor image in the cohort and control conditions. In the video data, competitor looks in the control condition decreased during target word articulation, whereas looks in the cohort condition did not decrease until target word offset. In contrast, in the WebGazer data from the same participants, competitor looks decreased only after target word offset in both conditions (similar to the pattern observed in the full WebGazer dataset), and competitor looks were more similar in the two conditions. In the video data, the analysis identified a reliable difference in competitor looks between conditions in a cluster 700–1099ms after target onset (z-sum=13.26, p=0.001), thereby showing evidence of a phonemic cohort effect. A cluster analysis of the corresponding WebGazer data did not identify any clusters.

Figure 8. Mean looks to the competitor image by condition in the Experiment 1A annotated webcam video data and in the WebGazer data from the same participants. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates when looks between conditions reliably differed.

Experiment 1B

Figure 9 plots looks to the target and competitor images in the cohort and control conditions, as identified by webcam video annotation and WebGazer for the same 13 participants (plots including distractor images are available in the Supplementary Materials). Target looks rose earlier and reached higher proportions in the video data than the WebGazer data.

Figure 9. Mean looks to the target and competitor images by condition in the Experiment 1B annotated webcam video data and in the WebGazer data from the same participants. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates the temporal overlap of the clusters when target side looks differed from chance in both the horizontal and vertical directions.

In the video data for the control condition, looks to the side of the screen containing the target were reliably different from chance in clusters starting 600ms after target onset along the horizontal axis (z-sum=60.27, p<0.001) and 700ms after target onset along the vertical axis (z-sum=63.13, p<0.001). In the cohort condition, clusters emerged 800ms after target onset for the horizontal-side distinction (z-sum=65.33, p<0.001) and 600ms after target onset for the vertical-side distinction (z-sum=75.98, p<0.001).

In the WebGazer data from the same participants, the detection of target looks appeared considerably later. In the control condition, target-side looks were reliably different from chance in clusters emerging 1200ms after target onset along both the horizontal (z-sum=34.73, p<0.001) and vertical (z-sum=15.96, p=0.001) axes.Footnote 7 In the cohort condition, clusters emerged 1000ms after target onset for the horizontal-side distinction (z-sum=36.28, p<0.001) and 1400ms after target onset for the vertical-side distinction (z-sum=15.80, p=0.001).

Figure 10 shows participants’ looks to the target and distractor images in the control condition 700–2000ms after target onset (when participants were likely fixating on the target quadrant, according to the video annotation) for the webcam video data. The proportion of looks to the target was higher than during detected target fixations in the full WebGazer sample (Figure 5), and there were fewer distractor looks. Similar to the full WebGazer sample, there was a slight preference for vertical distractors over the other non-target images (this pattern was confirmed in an exploratory multinomial analysis; see Supplementary Materials) – however, the relative differences were smaller in the webcam video data. A figure showing target and distractor looks by target location is available in the Supplementary Materials.

Figure 10. Boxplot of participant fixation proportions to the target and non-target images in the Experiment 1B control trials from 700–2000ms after target onset for the annotated webcam video data. Mean fixation proportions for each image are labeled and identified by black diamonds. The gray points represent participant means.

Figure 11 shows looks to the competitor image in the cohort and control conditions. In the video data, looks to the competitor image in the cohort condition increased during target articulation, while looks in the control condition decreased. In the WebGazer data from the same participants, there was no obvious difference between conditions. In the video data, the analysis identified a reliable difference in competitor image looks between conditions in a cluster 600–999ms after target onset (z-sum=11.19, p<0.01), thus finding evidence of a phonemic cohort effect. A cluster analysis of the corresponding WebGazer data did not identify any clusters.

Figure 11. Mean looks to the competitor image by condition in the Experiment 1B annotated webcam video data and in the WebGazer data from the same participants. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates when looks between conditions reliably differed.

Experiment 1 summary

Experiment 1 used a standard visual-world task to assess the relative performance of two webcam-based eye-tracking methods with five- to six-year-old children: automatic WebGazer gaze coding and hand annotation of gaze direction from recorded webcam videos. Both methods detected increased looks to named (target) images in both two- and four-image displays. However, the rise in target fixations was lower and later in the WebGazer data compared to in-lab experiments with children of the same age or younger (e.g., Sekerina & Brooks, Reference Sekerina and Brooks2007; Simmons, Reference Simmons2017). The annotated video data, on the other hand, looked more like data collected in in-lab experiments: the onset of target looks was faster, and the proportion of target looks was considerably higher than in simultaneously-collected WebGazer data. Interestingly, for both methods, unrelated images vertically-adjacent to the target received more looks than distractor images in the other locations of the display; this pattern was especially notable in the WebGazer data.

The differences between the two methods were particularly pronounced in the analysis of the phonemic cohort effect. In the video data, the cohort effect emerged in both the four- and two-image displays in clusters beginning 600–700ms after target onset and was detectable in a sample of just 13 children. This effect is later than observed in previous lab-based studies, in which cohort effects began 200–400ms after target onset (e.g., Allopenna et al., Reference Allopenna, Magnuson and Tanenhaus1998; Huettig & McQueen, Reference Huettig and McQueen2007; Sekerina & Brooks, Reference Sekerina and Brooks2007). While this difference could reflect our small sample size or a difference in our analysis method, it is consistent with other research using webcam video annotation (i.e., the web-based replication of Allopenna et al., Reference Allopenna, Magnuson and Tanenhaus1998 by Ovans, Reference Ovans2022). In the WebGazer data, the effect was detectable only in the two-image display with a larger sample (N=32), and this effect window emerged later (900ms after target onset). These results suggest that while WebGazer can detect robust fixation patterns like target looks, webcam video annotation is better suited to detecting more fine-grained effects.

In Experiment 1 we tracked looks in a binary fashion, monitoring whether or not a look fell inside a particular region. While this measure reflects how visual-world studies are generally conducted, we cannot tell from these results how close WebGazer’s gaze estimates are to the true locations of visual stimuli. Experiment 2 explores WebGazer’s accuracy more directly. This additionally allows us to address one limitation of Experiment 1: because our image canvases did not cover the full halves (Experiment 1A) or quadrants (Experiment 1B) of the screen, gazes that were estimated to fall near a canvas, but not within it, may have been coded as looks in our video data but not in the WebGazer data.

Experiment 2: fixation task

Experiment 2 used a visual-fixation task to investigate the spatial and temporal resolution of WebGazer’s gaze estimation with four- to twelve-year-old children. This task was adapted from Slim and Hartsuiker (Reference Slim and Hartsuiker2022) (“S&H2022”). The experiment had four goals: i) to assess the feasibility of conducting web-based eye-tracking tasks with children without an experimenter present; ii) to assess how closely WebGazer estimates correspond to stimulus locations; iii) to assess whether there are age-related differences in WebGazer performance between four and twelve years; and iv) to assess whether the accuracy of quadrant-based analyses with WebGazer is improved by using larger canvases.

Participants

The study included 45 participants between four and twelve years of age (Table 1). Participants spoke American or British English natively. Participants were not required to be monolingual, as the experiment was non-linguistic. Three participants in Experiment 2 previously took part in Experiment 1 during a different experiment session. Informed written consent was received from the parent or guardian for their child’s participation; child participants additionally provided written assent. Participants were compensated with a $5.00 gift card.

Table 1. Experiment 2 participant ages

Materials

The experiment was built in PCIbex (Zehr & Schwarz, Reference Zehr and Schwarz2018) using WebGazer v2 and was completed in the participant’s web-browser. The stimuli were modeled on those from S&H2022. Participants looked to fixation crosses that appeared in 13 possible screen positions (Figure 12). Each fixation cross appeared in each location six times, resulting in 78 total trials. Trial order was randomized for each participant. To accommodate variability in computer screen-sizes, the experiment was completed in fullscreen, and stimulus size and location were defined by browser window size. To make the task more fun for child participants, the 78 trials were divided into six blocks: in each block, the fixation cross appeared in a different color and was accompanied by a different audio sound effect.

Figure 12. The 13 possible target stimulus locations in Experiment 2. The panel represents the full experiment screen (the axis labels indicate percentage of screen-size).

Procedure

The experiment was completed by participants from their own computers, unsupervised by researchers. An experiment access link was sent to the parent’s email. Participants were asked to complete the experiment on a computer or laptop using either Google Chrome or Mozilla Firefox. An adult was asked to help the child get set up and to remain in the room as they completed the task.

The experiment started with an introductory sequence that walked participants through an audio check, the WebGazer calibration, and the experiment instructions. Following the audio check, the sequence included both written and auditory instructions so that it would be accessible to both child participants and adult supervisors. As in Experiment 1, we did not specify a minimum calibration threshold. Participants were instructed to look at the plus signs that appeared on the screen as fast as they could and to stare at them until they disappeared.

To start each block of the task, the participant pressed the spacebar, which initiated a calibration check (resulting in seven total calibration scores per participant). The trial structure was the same as in S&H2022. Each trial began with a small black fixation cross (+) appearing in the center of the screen for 500ms (font size defined as 5% of the screen height). This cross then disappeared and the colored target fixation cross appeared on screen for 1500ms (size defined as 10% of the screen height). The trial then ended, and the next trial began automatically. The experiment took approximately 10–15 minutes to complete.

Data processing

WebGazer tracked participants’ eye-movements from target stimulus onset to trial offset. We recorded looks to canvases covering each quadrant of the screen as a binary variable. These canvases together covered the entire screen (each 50% of browser window height and width). We also recorded coordinate estimates of gaze location (in pixels). If either the x- or y-coordinate estimate was missing in the recorded WebGazer data, the sample was omitted from the data prior to processing (0.53% of samples).

Data processing followed the procedure outlined by S&H2022 using the scripts made available in their OSF repository (https://osf.io/yfxmw/). We aggregated the data into 100ms bins, calculating for each bin the mean x- and y-coordinate estimates and mean looks to each quadrant canvas (quadrant looks were later binarized for analysis). We restricted the dataset to bins ranging from 0–1500ms after trial onset, resulting in exclusion of 170 out of 3484 recorded bins (4.88%). To account for participants’ different screen-sizes, we converted the pixel coordinate estimates to a distance metric based on screen-size proportion, such that the pixel in the center of the screen had coordinates (0.5, 0.5) and the pixel in the bottom right corner had coordinates (1,1). For each bin, we calculated the Euclidean distance between the estimated gaze location and the center of the target fixation cross (in proportion of screen-size) using the formula below.

$$ \sqrt{\left(x_{target} - x_{gaze\ estimation}\right)^{2} + \left(y_{target} - y_{gaze\ estimation}\right)^{2}} $$
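
The conversion and distance computation can be sketched in R as follows. The column names, and the assumption that window dimensions in pixels are available for each participant, are ours rather than taken from the S&H2022 scripts.

```r
# Sketch of converting pixel gaze estimates to screen proportions and computing the
# Euclidean distance from the target (hypothetical column names; one 100ms bin shown).
bins <- data.frame(
  gaze_x   = 640,  gaze_y   = 512,    # mean WebGazer estimate for the bin (pixels)
  screen_w = 1280, screen_h = 800,    # participant's window dimensions (pixels)
  target_x = 0.5,  target_y = 0.25    # target centre in proportion of screen-size
)

# Rescale so that the screen centre is (0.5, 0.5) and the bottom-right corner is (1, 1)
bins$prop_x <- bins$gaze_x / bins$screen_w
bins$prop_y <- bins$gaze_y / bins$screen_h

# Euclidean distance in proportion of screen-size (the formula above)
bins$distance <- sqrt((bins$target_x - bins$prop_x)^2 +
                      (bins$target_y - bins$prop_y)^2)
bins$distance
```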

In some of our analyses, we compare the Experiment 2 data to S&H2022’s adult data accessed from their OSF repository.

Results

Twenty-six trials across ten participants were omitted from the analysis because no WebGazer data were saved for them on our server.

Calibration scores

Participant calibration scores in the initial calibration sequence ranged from 6–80%, with an average score of 52% (SD=16). The mean participant scores for the six calibration checks ranged from 4–67%, with an average of 39% (SD=13). Table 2 summarizes participant mean calibration scores (calculated using all seven scores for each participant) by age group.

Table 2. Experiment 2 mean participant calibration scores by age group

As in S&H2022, participant mean calibration scores were significantly correlated with webcam sampling rates (measured in frames per second) (ρ=0.33, p=0.03), suggesting that WebGazer’s estimates are more precise when there are more recorded samples. In addition, mean calibration score was significantly correlated with participant age in months (to one decimal place) (ρ=0.52, p<0.001), indicating that older participants tended to have higher calibration scores. This trend still holds when accounting for sampling rate (see Supplementary Materials for more details and plots).
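
Assuming these are Spearman rank correlations (consistent with the ρ values reported), the tests take the following form in R; the summary data frame, its column names, and the values in it are hypothetical.

```r
# Sketch of the participant-level correlations (hypothetical data frame with one row
# per participant; values are illustrative only). We assume Spearman correlations,
# matching the reported rho statistics.
ppt_summary <- data.frame(
  mean_calibration  = c(22, 35, 41, 48, 56, 63),   # mean calibration score (%)
  sampling_rate_fps = c(8, 11, 15, 14, 19, 23),    # webcam sampling rate (frames/second)
  age_months        = c(52, 70, 95, 104, 121, 140) # age in months
)

cor.test(ppt_summary$mean_calibration, ppt_summary$sampling_rate_fps,
         method = "spearman")   # calibration ~ sampling rate
cor.test(ppt_summary$mean_calibration, ppt_summary$age_months,
         method = "spearman")   # calibration ~ age in months
```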

Euclidean distance from the target over time

To assess how closely WebGazer estimates match stimulus location, we plotted the mean Euclidean distance (in percentage of screen-size) between the target stimulus and estimated gaze location from stimulus onset to trial offset (Figure 13). The plot includes data for the Experiment 2 child participants as well as S&H2022’s adult participants. In both populations, distance from the target began to decrease 200ms after stimulus onset and plateaued around 500ms after onset. While this timing is similar for the two populations, the Euclidean offset was larger and more variable for children, settling at an offset of approximately 38% of screen distance from the target.

Figure 13. Mean Euclidean distance (in percentage of screen-size) from the target stimulus over the course of the trial. Error bars indicate standard deviation. Ribbons indicate standard error.

To better understand the factors influencing this offset, we plotted mean distance over time broken down by calibration score (Figure 14) and child age group (Figure 15).Footnote 8 Figure 14 illustrates the relationship between mean calibration score and WebGazer’s spatiotemporal accuracy: mean Euclidean offset was smaller for participants in higher calibration bins. Figure 15 suggests that there was also a relationship between Euclidean distance and age: offsets plateaued at the shortest distance for the 10–12 year-old participants, followed by the 8–9 year-old and 6–7 year-old participants, with the longest offsets for the 4–5 year-old participants. However, as discussed above, mean calibration score and age were correlated; therefore, the extent to which age contributed to Euclidean offset independently of calibration score is not obvious from Figure 15. We address this below.

Figure 14. Mean Euclidean distance (in percentage of screen-size) from the target stimulus over the course of the trial, broken down by participant calibration score. Ribbons indicate standard error.

Figure 15. Mean Euclidean distance (in percentage of screen-size) from the target stimulus over the course of the trial, broken down by participant age bin. Ribbons indicate standard error.

Looks in the fixation window

As in S&H2022, we analyzed a fixation time window 500–1500ms after target onset to assess WebGazer’s spatial resolution when gaze had settled on the target location. Figure 16 plots the density of looks on the screen during this time window for all 13 target locations. Density plots of the quadrant fixations for the youngest and oldest age groups in our sample are available in the Supplementary Materials.

Figure 16. Density plots indicating estimated looks on the screen 500–1500ms after target onset for each possible target location. Each panel represents the full experiment screen (the axis labels indicate percentage of screen-size), and the black crosses indicate the center of the target locations.

For each location, estimated looks tended to fall around the stimulus, though the range in which the looks fell was large. In the plots for the targets appearing in the center of each quadrant (the second and fourth row of Figure 16), estimated looks often extended into quadrants other than the one containing the target stimulus, with particular overlap in the vertical direction. In fact, within the fixation window, participants’ mean vertical offsets between their estimated gaze location and the true stimulus location (M=0.27, SD=0.08) were greater than their mean horizontal offsets (M=0.21, SD=0.08) (t(44)=5.55, p<0.0001). WebGazer’s reduced vertical accuracy appears particularly pronounced in the upper quadrants of the screen (the second row of Figure 16).

To assess the relative contributions of calibration accuracy and participant age to Euclidean distance offset during target fixations, we calculated the mean distance from the target during the 500–1500ms fixation window for each participant and computed a linear regression with fixed effects of mean calibration score and age in months (to one decimal place). Both the effects of mean calibration score (β=-0.004, t=-3.93, p<0.001) and age (β=-0.001, t=-2.30, p=0.03) were reliable.Footnote 9 Model comparison using ANOVA revealed significant differences between models with both predictors and models with only calibration score (F(1,42)=5.27, p=0.03) and only age (F(1,42)=15.4, p<0.001). These results suggest that there was an effect of age on Euclidean offset that was distinct from the effect of calibration score.
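
A sketch of how such a regression and the nested model comparisons can be set up in R appears below; the participant-level data frame is simulated and its column names are hypothetical.

```r
# Sketch of the fixation-window regression and nested model comparisons
# (simulated participant-level data with hypothetical column names).
set.seed(3)
ppt_fix <- data.frame(
  mean_distance    = runif(45, 0.25, 0.55),  # mean Euclidean offset (proportion of screen)
  mean_calibration = runif(45, 5, 80),       # mean calibration score (%)
  age_months       = runif(45, 48, 155)      # age in months
)

full_model <- lm(mean_distance ~ mean_calibration + age_months, data = ppt_fix)
calib_only <- lm(mean_distance ~ mean_calibration, data = ppt_fix)
age_only   <- lm(mean_distance ~ age_months, data = ppt_fix)

summary(full_model)             # t-tests for the calibration and age coefficients
anova(calib_only, full_model)   # does adding age improve on calibration alone?
anova(age_only, full_model)     # does adding calibration improve on age alone?
```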

Quadrant looks over time

In addition to investigating Euclidean distance over time, we also analyzed quadrant looks over time, allowing us to assess WebGazer’s accuracy discriminating quadrant looks when using larger canvases than in Experiment 1B. We restricted the data to the trials in which the fixation cross appeared in the center of each screen quadrant. We binarized quadrant looks using the same procedure as in Experiment 1B. For comparison, we also binarized S&H2022’s adult data in the same fashion.

Figure 17 plots looks to the target quadrant over time compared to the other quadrant locations (horizontally, vertically, or diagonally across from the target) for both populations. A plot showing quadrant looks by target location for the child participants is available in the Supplementary Materials. The pattern of target quadrant looks in the child data resembles that observed in Experiment 1B: looks to all quadrants began at chance (25%), and then looks to the target increased and plateaued around 50%.

Figure 17. Quadrant looks over time for the Experiment 2 child participants and Slim and Hartsuiker’s (Reference Slim and Hartsuiker2022) adult participants. Ribbons indicate standard error. Shading indicates the temporal overlap of the clusters when target side looks differed from chance in both the horizontal and vertical directions.

To assess target quadrant looks, we performed the same target side analyses as we conducted for Experiment 1B. We analyzed looks from 0–1400ms after target onset given the reduced number of samples at the end of the trial. The analyses followed the same procedure as the Experiment 1B analyses, except quadrants were used instead of images, and item was defined as target location (top left, top right, bottom left, bottom right). In the Experiment 2 child data, looks to the side of the screen containing the target were reliably different from chance in clusters starting 300ms after target onset along both the horizontal (z-sum=92.71, p<0.001) and vertical (z-sum=56.30, p<0.001) axes (in the exploratory multinomial analysis, target quadrant looks similarly differed from looks to all other quadrants in a cluster starting 300ms after onset; see Supplementary Materials). In the S&H2022 adult data, clusters started at target onset for the horizontal-side distinction (z-sum=136.59, p<0.001)Footnote 10 and 200ms after target onset for the vertical-side distinction (z-sum=90.31, p<0.001), suggesting that WebGazer detected target quadrant fixations starting 200ms after target onset (in the exploratory multinomial analysis, looks to the target differed in a cluster starting 300ms after onset).Footnote 11

In both the adult and child data, looks to the quadrant vertically adjacent to the target remained elevated compared to the other non-target quadrants during target fixations (Figure 17). This pattern was confirmed in an exploratory multinomial analysis (see Supplementary Materials).

Experiment 2 summary

The Experiment 2 results demonstrate that it is possible to conduct unsupervised web-based eye-tracking tasks with school-aged children. There was a sharp increase in looks toward the target shortly after it appeared, indicating that participants were able to perform the task without an experimenter to guide them. Furthermore, the data suggest that parents, acting on their own, were just as effective in setting up the experiment as parents guided by researchers; there was no significant difference between the mean calibration scores of participants in Experiment 1 and those of the same age range (five–six years) in Experiment 2 (t(7)=-0.74, p=0.49).

Experiment 2 assessed how closely WebGazer estimates track with stimulus location. The Euclidean offset between estimated gaze and target stimulus location was approximately 38% of screen-size. This offset is greater than observed in S&H2022’s adult data (30% of screen-size) and much larger than reported for in-lab eye-trackers (see General Discussion). In addition, we found age-related differences in performance: calibration scores tended to be higher for older participants, and there was a relationship between participant age and Euclidean offset above and beyond the effect of calibration.

Analyzing the Experiment 2 data using the quadrant-based approach common for visual-world studies showed a similar overall pattern to Experiment 1B, suggesting that increasing the size of the tracked quadrant canvases did not substantially improve WebGazer data quality. Target quadrant looks increased faster in Experiment 2 than Experiment 1, likely due to the larger quadrant sizes and the fact that attention was directed to the stimulus by a single visual cue (the target was the only item on screen), whereas in Experiment 1 participants needed to process linguistic input to determine which of multiple visual stimuli to fixate upon. Our results furthermore support the finding from Experiment 1 that WebGazer is less accurate at detecting vertical distinctions: during target fixations, offsets between estimated gaze locations and the true stimulus location were greater in the vertical direction than the horizontal direction, and in the quadrant-based analysis, there were elevated looks to the quadrant vertically adjacent to the target. This inaccuracy appears to be particularly pronounced in the top–down direction, with greater vertical offsets for targets appearing on the top half of the screen (Figure 16).

General discussion

The present study investigated the suitability of two webcam eye-tracking methods for child language research: automatic WebGazer gaze estimation and frame-by-frame annotation of gaze direction from webcam videos. Experiment 1 compared these two methods with five and six year-olds in a visual-world task replicating the phonemic cohort effect. The experiment used two display types: a two-image display with one image on each side of the screen (Experiment 1A) and a four-image display with one image in each quadrant (Experiment 1B). Experiment 2 investigated WebGazer's gaze estimation accuracy in an unsupervised visual-fixation task with four to twelve year-old children. Our results suggest that while it is possible to conduct webcam eye-tracking studies with children (supervised and unsupervised), the two eye-tracking methods differ in their spatiotemporal resolution and thus are not equally suitable for detecting all types of eye-movement patterns. Below, we discuss the spatiotemporal accuracy of the two methods, their ability to detect fine-grained linguistic effects, and our recommendations for researchers conducting web-based eye-tracking experiments with children.

Spatiotemporal accuracy of the eye-tracking methods

Spatial resolution

Both webcam eye-tracking methods were sufficiently accurate to detect the preference to look at a target that either is explicitly mentioned (Experiment 1) or suddenly appears on the screen (Experiment 2). This was true both when the target and foil occupied different halves of the screen (Experiment 1A) and when the target occupied a single quadrant (Experiment 1B, Experiment 2). Nevertheless, webcam video annotation had a higher signal-to-noise ratio, as evidenced by a higher proportion of target looks than in the simultaneously-collected WebGazer data (89% vs. 72% in Experiment 1A; 80% vs. 47% in Experiment 1B). In fact, the target looks in the video data parallel those from prior in-lab experiments using commercial eye-trackers with children of this age (Sekerina & Brooks, 2007; Simmons, 2017).

Experiment 2 confirmed WebGazer's reduced spatial accuracy compared to in-lab eye-tracking using a more fine-grained distance metric. Target fixations as detected by WebGazer were offset from the true stimulus location by approximately 38% of screen-size, compared to offsets of 1–2% (0.4–0.9º of visual angle) reported for Tobii TX300 eye-trackers under standard laboratory conditions (Tobii, 2010; see Dalrymple et al., 2018 for data from 8–11 year-old children). This offset could reflect WebGazer inaccuracy, or it could arise because participants were looking somewhere other than the target. We believe that the latter explanation does not play a substantial role, as: i) piloting the task over Zoom showed children directing their eyes towards target stimuli; ii) children similarly directed their eyes towards visual cues in the Experiment 1 WebGazer calibration sequence; and iii) if inattention were the primary driver of this gap, we would expect a more random distribution of looks in the Figure 16 density plots. Furthermore, this larger offset is consistent with prior WebGazer studies with adults; for example, Semmelmann and Weigelt (2018) and Slim and Hartsuiker (2022) reported offsets of 18% and 30% of screen-size (respectively) in online fixation tasks.
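
To make the metric concrete, the following is one plausible way to compute this kind of offset in R. It assumes a hypothetical data frame samples in which WebGazer's pixel estimates and the target centers have already been divided by the participant's window width and height, so all coordinates range from 0 to 1; the column names are illustrative, not taken from our experiment scripts.

# Euclidean offset between estimated gaze and target center, as a percentage of screen-size
samples$offset_pct <- 100 * sqrt((samples$gaze_x - samples$target_x)^2 +
                                 (samples$gaze_y - samples$target_y)^2)

# Horizontal and vertical components, e.g., for comparing accuracy across the two axes
samples$offset_x_pct <- 100 * abs(samples$gaze_x - samples$target_x)
samples$offset_y_pct <- 100 * abs(samples$gaze_y - samples$target_y)

# Mean offset per participant, for relating accuracy to age or calibration score
aggregate(offset_pct ~ participant, data = samples, FUN = mean)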

In particular, WebGazer appears to have difficulty discriminating looks along the vertical axis. This was evidenced by elevated looks to the image or quadrant vertically adjacent to the target and by greater vertical than horizontal offsets between gaze and target location. The results of Experiment 2 suggest that this difficulty may be greater for stimuli appearing on the top half of the screen. We observed a similar (but less pronounced) pattern in the video data in Experiment 1. Poor vertical resolution could reflect three constraints. First, most computer screens are rectangles with a landscape orientation, thus vertical distances between stimuli are generally smaller than horizontal ones. Second, webcams are typically placed above the screen but centered on the left–right axis. Consequently, a left look will be in the opposite direction relative to the webcam from a right look. In contrast, looks to both the upper and the lower half of the screen will be downward relative to the webcam. Finally, while it is easy to encourage participants to center themselves relative to their screen on the left–right axis (by sliding their computer or chair), vertical position is variable and more difficult to control. Most adults sit with their eyes above the screen, and thus the WebGazer algorithm was presumably trained on data of this kind. Children, who are shorter but live in a world of artifacts scaled to adults, typically sit with their heads nearer to the level of the screen. This may explain why the vertical spread is greater for children in the WebGazer data.

In our study, we identified two factors that influence WebGazer's performance: calibration score and participant age. Higher calibration scores are associated with data patterns suggesting better gaze tracking. In Experiment 2, the distance between estimated looks and the true stimulus location was reduced for participants with higher mean calibration scores (see also Slim & Hartsuiker, 2022). In both Experiment 1 and the adult pilot experiment, the size of the cohort effect was larger in trials with higher scores on the preceding calibration check (see Supplementary Materials). These results highlight the potential utility of calibration thresholds as a means to improve data quality, though the threshold of 50% often used in adult WebGazer experiments (e.g., Slim & Hartsuiker, 2022, Experiment 2; Vos et al., 2022) may be too high a bar for younger child participants (see Table 2).

Participant age also seems to influence WebGazer accuracy. WebGazer’s spatial resolution appears higher for adult participants compared to child participants: in Experiment 2, the Euclidean distance offset between estimated gaze location and the true target location was smaller for adults, and in the quadrant analysis, the adult data yielded a higher proportion of target quadrant looks. Moreover, the age of the child participants influenced both calibration score and Euclidean distance offset: calibration scores were higher and distance offsets were smaller for older children.

Age-related differences could reflect factors specific to WebGazer. For example, older children may be in a more optimal position for WebGazer, because they are generally taller and thus may be positioned more like adults. In addition, older children tend to have larger faces than younger children, which could facilitate WebGazer’s pupil detection and gaze estimation algorithms. Alternatively, age-related differences could reflect differences between participants that are independent of the technology used to estimate gaze. For example, older children may be less susceptible to distraction and more likely to sit still throughout the duration of the task.

Temporal resolution

In addition to the differences in spatial resolution, we found that effects emerged later than expected when WebGazer was used. In Experiment 1, in the absence of cohort competition, WebGazer detected reliable preferences for the target in clusters starting 800ms after target word onset for the two-picture display and 1200ms after target onset for the four-picture display. In contrast, in the annotated video data, this preference emerged in clusters starting 500ms after target word onset in the two-picture display and 700ms after target word onset in the four-picture display, similar to the timing in laboratory-based studies (Sekerina & Brooks, 2007; Simmons, 2017). We also found delays in the timing of the phonemic cohort effects (discussed below).

The apparent lag is not limited to studies with linguistic stimuli: we observed comparable WebGazer fixation delays in Experiment 2. In in-lab settings, saccade latencies in response to perceptual stimuli are approximately 200–250ms for adults (e.g., Matin et al., 1993; Rayner, Slowiaczek et al., 1983; Saslow, 1967; Theeuwes et al., 1998; Walker et al., 2000; White et al., 1962) and for ten to twelve year-old children (Q. Yang et al., 2002). Q. Yang et al. (2002) observed mean latencies of approximately 300–350ms for children between the ages of four-and-a-half and twelve years. In contrast, in Experiment 2, looks settled on the target location approximately 500ms after onset (see also Semmelmann & Weigelt, 2018; Slim & Hartsuiker, 2022, for evidence of fixation delays with WebGazer).

We can imagine two possible explanations for this lag, which are not mutually exclusive. First, WebGazer could detect the same eye-movements as other eye-tracking measures but do so later due to time-consuming steps in the execution of the algorithm. Second, the lag could be a side-effect of WebGazer's poorer signal-to-noise ratio: effect sizes at the onset of an eye-movement pattern are typically smaller, making differences more difficult to detect. The data to date suggest that both factors play a role. On the one hand, streamlining WebGazer's algorithm to remove unnecessary computations improves its temporal resolution (X. Yang & Krajbich, 2021), suggesting processing limitations result in temporal delays. On the other hand, the variability that we observed in the WebGazer estimates well after stimulus onset (Figure 16) demonstrates that the spatial signal has substantial noise. Since even more streamlined versions of the WebGazer algorithm produce smaller effects than in-lab baselines (Vos et al., 2022), we expect that they would also fail to detect the earliest and weakest effects. Critically, we did not see comparable delays in the simultaneously-collected video data (Experiment 1), demonstrating that these delays are due to properties of the WebGazer algorithm and its execution and not to the less controlled nature of web-based settings.

Using webcam eye-tracking to detect fine-grained linguistic effects

Our findings suggest that WebGazer is not well suited for studying small or fleeting effects in children, particularly in the typical quadrant-based visual-world display. This was most clearly demonstrated by our analyses of the phonemic cohort effect in Experiment 1. In the annotated webcam video data, we found significant cohort effects in both the two- and four-image displays, despite a sample of just 13 participants in each experiment. In contrast, even though our WebGazer sample contained more than twice as many participants (N=32 per experiment), WebGazer only detected evidence of a cohort effect in the two-image display. Moreover, the cluster window containing the effect was later and shorter than that in the video data (extending from 900–1099ms vs. 700–1099ms). Prior studies with adults have similarly observed WebGazer effects emerging later than in-lab baselines (Degen et al., 2021; Slim & Hartsuiker, 2022) as well as effects that are smaller and/or noisier than in-lab counterparts (Degen et al., 2021; Vos et al., 2022; Slim & Hartsuiker, 2022).

We conducted a series of power simulations using the {mixedpower} package v0.1.0 (Kumle et al., 2021) to assess the relative effect sizes in our webcam video and WebGazer data (see Supplementary Materials). We analyzed the likelihood of competitor image looks in the time windows where the Experiment 1 cluster analyses identified a difference between the two conditions (700–1099ms in Experiment 1A; 600–999ms in Experiment 1B).
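
For readers planning similar simulations, the general shape of the procedure is sketched below, assuming the interface of {mixedpower} v0.1.0. The object and column names (window_data, comp_look, condition, participant, item) are hypothetical stand-ins, and the model is simplified relative to those described in the Supplementary Materials; the key idea is that {mixedpower} refits the model to data simulated from the fitted estimates while varying the number of participants.

library(lme4)        # mixed-effects models
library(mixedpower)  # simulation-based power estimation (Kumle et al., 2021)

# Likelihood of a competitor-image look within the analysis window, by condition
fit <- glmer(comp_look ~ condition + (1 | participant) + (1 | item),
             family = binomial, data = window_data)

# Estimate power at several sample sizes, varying the number of participants
power_sim <- mixedpower(model = fit, data = window_data,
                        fixed_effects = "condition",
                        simvar = "participant",
                        steps = c(13, 32, 65, 125),  # sample sizes to probe
                        critical_value = 2,          # |z| > 2 counts as significant
                        n_sim = 1000)
power_sim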

In Experiment 1A (the two-image display), the effect of condition was larger in the webcam video data (β=-1.12, z=-3.52, p<0.001) than in the WebGazer data (β=-0.29, z=-1.83, p=0.07). In fact, the WebGazer effect was only 26% as large as the webcam video effect (as measured by the standardized beta coefficients). Given our sample sizes, the observed power was 94% in the video data (N=13) and 45% in the WebGazer data (N=32). To achieve 94% power with the WebGazer data, the sample would have to be increased to approximately 125 participants; to achieve at least 80% power, it would have to be increased to approximately 65 participants (power=80%). In contrast, reaching 80% power in the video data requires only seven participants (power=81%). In short, these simulations suggest that to achieve comparable power, a WebGazer study of this kind would require almost ten times as many participants as a study relying on webcam video annotation.

In Experiment 1B (the four-image display), the effect of condition was significant in the webcam video data (β=-1.08, z=-3.97, p<0.0001), with an observed power of 98% (N=13). To reach at least 80% power for an effect of this size required only five participants (power=82%). The effect of condition was not reliable in the WebGazer data (β=-0.03, z=-0.18, p=0.86; observed power 5% for N=32). If we assume that the true effect in the WebGazer data was 25% the size of the effect in the video data, then a sample of approximately 120 participants would be required to achieve power greater than 80% (power=84%). This sample is 24 times the required minimum for the video data effect size. This conjecture is based on the relative effect sizes in Experiment 1A, though it is of course possible that the true effect size for WebGazer is considerably larger, or smaller, than our estimate.

In sum, our results suggest that webcam video annotation is a far more sensitive means of detecting the kind of fine-grained eye-movement effects that are relevant to many child language researchers. WebGazer estimation may be better suited to detecting fairly long-lasting effects in which the primary outcome measure is which part of the screen participants fixated on.

Recommendations for practice and directions for future research

Our results suggest that while both webcam video annotation and WebGazer estimation can be used with child participants in web-based tasks, the two methods have different advantages and disadvantages.

Webcam video annotation has better spatiotemporal accuracy than WebGazer (drastically reducing the amount of noise in our child data), making the method better suited to detecting the temporally-sensitive, fine-grained looking patterns assessed in studies of real-time language processing. Collecting webcam video data over Zoom requires relatively little technical expertise, as the experiment itself can be built and run in any software; the experiment can either be run on the participant's computer (as in Experiment 1) or displayed from the experimenter's computer using Zoom's screen sharing function (as in unpublished work by Anthony Yacovone, personal communication). It is possible to collect webcam video for gaze annotation in unsupervised web experiments using Zoom (Slim et al., 2022) or other webcam recording functions (e.g., via PCIbex; Ovans, 2022). However, the hand annotation process is time consuming (in Experiment 1, annotating a seven-second video took approximately one minute), and the resulting gaze location estimates are relatively coarse-grained (representing regions of the display instead of coordinate estimates).

WebGazer’s gaze coding, on the other hand, is automatic, reducing the data processing burden on the researcher. It can be used to obtain either gaze coordinate estimates or binary looks to relevant screen locations, and the data are saved in text format, thus helping to maintain participant privacy and requiring less storage space than video recordings. Our results suggest that it is possible to achieve similar target look resolution with WebGazer in quadrant-based analyses in both supervised and unsupervised web-based studies. In addition, WebGazer is free to use and has implementations in popular frameworks for web-based research. However, use of these implementations often requires working proficiency in programming languages, and implementations may not be compatible with all web-browsers. Furthermore, WebGazer’s low spatiotemporal accuracy makes it more difficult to detect fine-grained effects with sufficient resolution and power. The sample sizes required to detect such effects with sufficient power are much larger than for webcam video annotation (10x the size or greater). These sample sizes may be prohibitively large for experiments targeting smaller effects. Nevertheless, WebGazer was able to detect looks towards targets in both of our experiments, suggesting that it is suitable for tasks that require spatial discrimination of robust looking patterns.
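
As an illustration of how coordinate estimates can be converted into the binary quadrant looks used in analyses like ours, the sketch below maps each WebGazer sample to a screen quadrant. The data frame and column names are hypothetical, and the sketch assumes the browser convention that y increases downward from the top of the page.

# Map a WebGazer (x, y) estimate in pixels to a screen quadrant
assign_quadrant <- function(gaze_x, gaze_y, window_w, window_h) {
  horiz <- ifelse(gaze_x < window_w / 2, "left", "right")
  vert  <- ifelse(gaze_y < window_h / 2, "top", "bottom")  # y = 0 is the top of the page
  paste(vert, horiz, sep = "_")
}

samples$quadrant  <- assign_quadrant(samples$gaze_x, samples$gaze_y,
                                     samples$window_w, samples$window_h)
samples$on_target <- as.numeric(samples$quadrant == samples$target_quadrant)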

It is possible that the quality of data collected with WebGazer would be improved by having participants complete the experiment in the same environment or with the same computer (e.g., Özgoy et al., 2023; Semmelmann & Weigelt, 2018) – for instance, if a researcher uses a laptop as a mobile lab. However, recent work in our lab with Mieke Slim and Anthony Yacovone suggests that the limitations of WebGazer persist under more controlled conditions; in a comparison of an infrared eye-tracker, WebGazer, and webcam video annotation, we found no substantial differences in eye-movement effects when the two webcam methods were applied in the lab or in a web-based setting (where participants completed the experiment from their own computers).

Researchers should consider these trade-offs when deciding whether to conduct eye-tracking studies online and which gaze estimation method to use. For researchers interested in using WebGazer for online studies, we have several recommendations:

1) When designing the task, do not rely on vertical distinctions between critical stimuli. Consider simplifying the task to involve a two-image display or placing critical stimuli on different halves of the screen in quadrant-based designs. Looks to diagonally-adjacent stimuli may be most easily discriminated (see Experiment 2).

2) When determining sample size, assume a 50–75% reduction in effect size relative to in-lab effects. Specifically, we found that the effects observed using WebGazer were roughly 25% as large (for the Experiment 1A cohort effect) to 45% as large (for horizontal target-side looks in the Experiment 1 control trials) as in the webcam video data, which produced effects of roughly the same magnitude as prior in-lab studies. The estimated reduction in effect size for WebGazer appears to vary based on effect type (short-lived, small effects vs. long-lasting fixations). Future work should investigate the performance of webcam eye-tracking methods in detecting various types of effects in order to provide more accurate recommendations for estimating expected effect sizes.

3) When planning the analysis, consider the likelihood of temporal delays in effect emergence. To account for such delays, researchers should shift or widen their planned analysis window appropriately or use an analysis method that does not assume a precise effect time window (e.g., cluster permutation analyses).

4) Consider setting calibration thresholds and/or including recalibration checkpoints to encourage participants to remain in an optimal position for WebGazer. Based on the data in Figure 14, we tentatively recommend a calibration threshold of at least 30%, though thresholds may need to be higher for smaller effects and/or more complicated displays (see Footnote 12); a minimal sketch of such a filter appears after this list. To help improve WebGazer performance, ask parents to adjust their child's distance from the computer, the camera angle, and room lighting as necessary so that the participant's eyes can easily be seen in the webcam video feed at the onset of the calibration sequence.
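
A minimal sketch of the threshold filter mentioned in point 4, assuming a hypothetical trial-level data frame trials with a calibration_score column (0–100) recorded at the preceding calibration check:

threshold <- 30   # tentative minimum calibration score (see Figure 14)
kept <- subset(trials, calibration_score >= threshold)

# Check how much data the threshold costs each participant (proportion of trials
# retained), since younger children tend to have lower calibration scores
retention <- with(trials, tapply(calibration_score >= threshold, participant, mean))
summary(retention)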

For researchers interested in using webcam video annotation, we have the following recommendations:

1) Consider placing critical stimuli on different halves of the screen. While hand annotators were better at distinguishing quadrant looks than WebGazer in Experiment 1, horizontal differences are still easier for annotators to discriminate.

2) Re-center participant gaze with a central fixation prior to the onset of the experimental stimuli; having the gaze begin in the center of the screen makes it easier to identify in which direction looks are launched.

3) If using Zoom to record webcam video, utilize the gallery view layout for the recording (as opposed to the active speaker view) and hide non-video participants. This will ensure that the participant's webcam stream is present in the recording throughout the entire duration of the experiment. If the experimenter(s) turn off their video after starting the recording, the participant's face will be the only recorded view (note that at the time of writing, turning the experimenter video off prior to starting the recording causes Zoom to default to recording active speaker view). Zoom allows for simultaneous recordings in multiple layouts (e.g., screen recording, screen recording + thumbnail speaker view), which may be useful for aligning the recorded gaze data to trial onsets in the experiment.

4) If using teleconferencing software like Zoom to collect screen and/or webcam video recordings, test the available functions and settings for recordings. Some functions (e.g., Zoom's "optimize for video" function) may produce unexpected delays in the audio–visual sync within recordings.

5) Consider using a combination of visual and auditory prompts to identify trial onsets within the Zoom recordings; this will allow researchers to recover trial onsets should there be any issues with audio–visual synchrony in the recording.

6) Make sure that the participant and/or experimenter has a way to view the participant's webcam stream prior to the start of the experiment (e.g., through the Zoom teleconference or by showing a video preview in PCIbex) so that the participant can adjust their positioning and lighting to ensure that their eyes are visible in the video recording.

While this work provides a starting point for evaluating online eye-tracking research with children, much remains to be done. For instance, future work should compare webcam-based eye-tracking methods to traditional high-end in-lab eye-trackers and should further assess the feasibility of running unsupervised web-based experiments with children. Despite the success of the Experiment 2 fixation task, we know very little about the limits of unsupervised tasks, particularly those with more complicated designs. In addition, while we observed age-related differences in WebGazer accuracy, it is unclear what is driving those differences, the extent to which they might influence the detection of linguistic effects, and whether we should expect similar differences in annotated webcam videos. Finally, as improvements continue to be made to automatic gaze-coding algorithms, their performance with child populations will need to be re-assessed.

Conclusion

We have demonstrated in two experiments that it is possible to run web-based visual-world studies with school-aged children in both supervised and unsupervised experimental settings. We tested two webcam eye-tracking methods and found that they are differentially suitable for detecting different kinds of effects. While both methods can discriminate looks to a target (albeit with different levels of accuracy), we found that WebGazer is not well-suited to detecting effects that require a high level of spatiotemporal accuracy (see Slim & Hartsuiker, 2022 for a similar conclusion). In contrast, frame-by-frame annotation of gaze direction from webcam videos provided sufficient spatial and temporal resolution to detect a fleeting and subtle effect typical of those studied by child language researchers. We anticipate that webcam eye-tracking will continue to improve as researchers develop tools, experimental protocols, and practices that are more precise, accurate, and efficient. We hope that these improvements will allow child language researchers to take advantage of the benefits of large-scale web-based experimentation for eye-tracking research.

Data availability statement

Data, analysis code, and Supplementary Materials are available from https://osf.io/hmeyb/.

Acknowledgements

Many thanks to Beatriz Leitão, Emily Liu, Mikaela Martin, Danielle Novak, Madeleine Presgrave, and Anthony Yacovone for their assistance with stimulus preparation, data collection, and/or data processing for this project. We are additionally grateful to Anthony Yacovone and Mieke Slim for sharing their thoughts on webcam eye-tracking methods, to Joshua Cetron and Patrick Mair for sharing their thoughts on the analyses, to the anonymous reviewers of this article for their helpful comments, and to all of the families and individuals who took the time to participate in this research. This work was supported by an internal grant from the Harvard University Psychology Department.

Competing interest

The authors declare none.

Footnotes

1 One target (doctor) was accidentally assigned a distractor (dolphin) that shared onset sounds, so this trial was omitted from the Experiment 1B analysis.

2 This color-change functionality allowed participants to use their eyes to select images from the screen (see Supplementary Materials for additional information about the task instructions given to participants). Initial piloting of web-based tasks with young children revealed that they were not familiar with how to use a computer mouse or trackpad, and click-based selection responses thus prompted a large number of participant looks directed at these tools instead of on the screen. Piloting with the color-change functionality indicated that it kept participants’ attention on the screen, gave them a sense of agency in the task, and was not distracting.

3 Note that this average sampling rate (approximately 10 Hz) is slower than observed in some other WebGazer investigations, in which sampling rates range from 14–21 Hz (e.g., Prystauka et al., 2023; Semmelmann & Weigelt, 2018; Vos et al., 2022). Vos et al. (2022) and Prystauka et al. (2023) both implemented exclusion criteria to omit participants with a sampling rate below 5 Hz. Applying this same exclusion criterion to Experiment 1 (resulting in the omission of n=2 participants), the mean time between samples is 93ms (SD=38ms), or approximately 11 Hz, suggesting that the slower sampling rate observed in Experiment 1 is not due to the lack of exclusion criteria but rather reflects the variability of web-based experimentation.

4 It is important to note that while cluster-based permutation analyses provide information about the presence of effects, they cannot be used to make inferences about the onset and duration of these effects (for discussion, see Fields & Kuperberg, 2019; Groppe et al., 2011; Sassenhagen & Draschkow, 2019). As there are no corrections for multiplicity, false positives may emerge in the initial cluster identification, meaning that researchers cannot make inferences about effect significance at any one time bin in the cluster (including the first or final time bins). In addition, the cluster-mass permutation test does not assess how adding or removing time bins from the cluster (e.g., at the beginning or end) influences its overall reliability. Furthermore, cluster duration is sensitive to data quantity, power, and the chosen threshold for including time bins within a cluster, which could lead to under- or overestimations of the extent of effects.

5 If a model failed to converge at a step (excluding singular fit warnings), we did not use the computed model estimates for that step. Instead, following Yacovone et al. (2021), we used the model estimates from the prior step; if the model at the first step did not converge, the z-value was set to zero. This procedure prevents models that do not converge properly from breaking up or prematurely ending a cluster. There were no steps with non-convergence in the analyses of the observed data.

6 In this analysis, it is possible to produce a p-value equal to zero if 0% of z-sums in the distribution of simulated statistics are greater than or equal to the observed statistic. We report these p-values as p < 0.001.

7 The analysis of vertical-side looks identified two clusters: one 1200–1699ms after target onset (z-sum=15.96, p=0.001) and one 1800–1999ms after target onset (z-sum=6.65, p=0.049). The effect in the 1700ms bin had a z-score of 1.98, so it did not meet the threshold to be included in a cluster.

8 Given variations in WebGazer sampling, there were fewer samples towards the end of the trial (see also S&H2022). We thus ended the plots at 1200ms, the latest time point for which we had enough samples to calculate standard errors in all bins for both plots.

9 Given the correlation between the two model predictors, multicollinearity in the model was assessed by calculating the Variance Inflation Factor (VIF); VIF for both predictors was 1.38.

10 The early cluster onset identified in the horizontal target-side looks analysis appears to be driven by a slight preference for target-side looks in the 0ms and 100ms time bins (in both bins, 54% of recorded looks fell on the target side). The beta coefficients for these time bins (0.17 and 0.12, respectively) suggest that the effect was small; for reference, the beta coefficient 500ms after target onset (once target looks plateau in Figures 13 & 15) was >2.

11 This timing differs from S&H2022's analysis, which identified clusters in the 0–200ms and 400–1400ms bins (Slim & Hartsuiker, 2022). Their analysis collapsed all non-target quadrants into a single other quadrant variable that received a 1 if there was a look to any quadrant other than the target quadrant; they then analyzed whether looks (0,1) differed based on focus (target quadrant, other quadrant). We did not perform our analyses this way due to concerns about dependencies in the data structure. Repeating our cluster analysis using this structure, we obtained the same two clusters identified by S&H2022 (in the bins 0–200ms and 400–1400ms after target onset). Using S&H2022's analysis structure on the Experiment 2 child data yields clusters in the 0–300ms and 500–1400ms bins.

12 See Supplementary Materials for an analysis of the cohort effect in the Experiment 1 WebGazer data restricted to trials that meet this calibration threshold. In Experiment 1A, the cohort effect cluster increased in size (800–1199ms after target onset; z-sum=9.42, p<0.01) relative to our original analysis; there was still no cohort effect identified in the Experiment 1B WebGazer data. See Supplementary Materials for additional analyses relating trial calibration score to the size of the phonemic cohort effect in Experiments 1A and 1B.

References

Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38(4), 419–439. https://doi.org/10.1006/jmla.1997.2558
Anwyl-Irvine, A. L., Massonnié, J., Flitton, A., Kirkham, N., & Evershed, J. K. (2020). Gorilla in our midst: An online behavioral experiment builder. Behavior Research Methods, 52(1), 388–407. https://doi.org/10.3758/s13428-019-01237-x
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
Borovsky, A., Elman, J. L., & Fernald, A. (2012). Knowing a lot for one's age: Vocabulary skill and not age is associated with anticipatory incremental sentence interpretation in children and adults. Journal of Experimental Child Psychology, 112(4), 417–436. https://doi.org/10.1016/j.jecp.2012.01.005
Brouwer, S., Özkan, D., & Küntay, A. (2019). Verb-based prediction during language processing: The case of Dutch and Turkish. Journal of Child Language, 46(1), 80–97. https://doi.org/10.1017/S0305000918000375
Contemori, C., Carlson, M., & Marnis, T. (2018). On-line processing of English which-questions by children and adults: A visual world paradigm study. Journal of Child Language, 45(2), 415–441. https://doi.org/10.1017/S0305000917000277
Cooper, R. M. (1974). The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing. Cognitive Psychology, 6(1), 84–107. https://doi.org/10.1016/0010-0285(74)90005-X
Cooper-Cunningham, R., Charest, M., Porretta, V., & Järvikivi, J. (2020). When couches have eyes: The effect of visual context on children's reference processing. Frontiers in Communication, 5, Article 576236. https://doi.org/10.3389/fcomm.2020.576236
Dahan, D., & Gaskell, M. G. (2007). The temporal dynamics of ambiguity resolution: Evidence from spoken-word recognition. Journal of Memory and Language, 57(4), 483–501. https://doi.org/10.1016/j.jml.2007.01.001
Dahan, D., Magnuson, J. S., & Tanenhaus, M. K. (2001). Time course of frequency effects in spoken-word recognition: Evidence from eye movements. Cognitive Psychology, 42(4), 317–367. https://doi.org/10.1006/cogp.2001.0750
Dahan, D., & Tanenhaus, M. K. (2004). Continuous mapping from sound to meaning in spoken-language comprehension: Immediate effects of verb-based thematic constraints. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(2), 498–513. https://doi.org/10.1037/0278-7393.30.2.498
Dalrymple, K. A., Manner, M. D., Harmelink, K. A., Teska, E. P., & Elison, J. T. (2018). An examination of recording accuracy and precision from eye tracking data from toddlerhood to adulthood. Frontiers in Psychology, 9, Article 803. https://doi.org/10.3389/fpsyg.2018.00803
Degen, J., Kursat, L., & Leigh, D. (2021). Seeing is believing: Testing an explicit linking assumption for visual world eye-tracking in psycholinguistics. Proceedings of the Annual Meeting of the Cognitive Science Society, 43(1), 1500–1506. https://escholarship.org/uc/item/6182t9jb
de Leeuw, J. R. (2015). jsPsych: A JavaScript library for creating behavioral experiments in a Web browser. Behavior Research Methods, 47(1), 1–12. https://doi.org/10.3758/s13428-014-0458-y
Desroches, A. S., Joanisse, M. F., & Robertson, E. K. (2006). Specific phonological impairments in dyslexia revealed by eyetracking. Cognition, 100(3), B32–B42. https://doi.org/10.1016/j.cognition.2005.09.001
Duñabeitia, J. A., Crepaldi, D., Meyer, A. S., New, B., Pliatsikas, C., Smolka, E., & Brysbaert, M. (2018). MultiPic: A standardized set of 750 drawings with norms for six European languages. Quarterly Journal of Experimental Psychology, 71(4), 808–816. https://doi.org/10.1080/17470218.2017.1310261
Erel, Y., Shannon, K. A., Chu, J., Scott, K., Kline Struhl, M., Cao, P., Tan, X., Hart, P., Raz, G., Piccolo, S., Mei, C., Potter, C., Jaffe-Dax, S., Lew-Williams, C., Tenenbaum, J., Fairchild, K., Barmano, A., & Liu, S. (2022, May 1). iCatcher+: Robust and automated annotation of infants' and young children's gaze direction from videos collected in laboratory, field, and online studies. PsyArXiv. https://doi.org/10.31234/osf.io/up97k
Farris-Trimble, A., & McMurray, B. (2013). Test-retest reliability of eye tracking in the visual world paradigm for the study of real-time spoken word recognition. Journal of Speech, Language, and Hearing Research, 56(4), 1328–1345. https://doi.org/10.1044/1092-4388(2012/12-0145)
Fields, E. C., & Kuperberg, G. R. (2019). Having your cake and eating it too: Flexibility and power with mass univariate statistics for ERP data. Psychophysiology, 57(2), Article e13468. https://doi.org/10.1111/psyp.13468
Fraser, A., Gattas, S. U., Hurman, K., Robison, M., Duta, M., & Scerif, G. (2021, June 22). Automated gaze direction scoring from videos collected online through conventional webcam. PsyArXiv. https://doi.org/10.31234/osf.io/4dmjk
Gaston, P., Lau, E., & Phillips, C. (2020, December 4). How does(n't) syntactic context guide auditory word recognition? PsyArXiv. https://doi.org/10.31234/osf.io/sbxpn
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models (Vol. 1). Cambridge University Press.
Griffin, Z., & Bock, K. (2000). What the eyes say about speaking. Psychological Science, 11(4), 274–279. https://doi.org/10.1111/1467-9280.00255
Groppe, D. M., Urbach, T. P., & Kutas, M. (2011). Mass univariate analysis of event-related brain potentials/fields I: A critical tutorial review. Psychophysiology, 48(2), 1711–1725. https://doi.org/10.1111/j.1469-8986.2011.01273.x
Hahn, N., Snedeker, J., & Rabagliati, H. (2015). Rapid linguistic ambiguity resolution in young children with Autism Spectrum Disorder: Eye tracking evidence for the limits of weak central coherence. Autism Research, 8(6), 717–726. https://doi.org/10.1002/aur.1487
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3), 61–83. https://doi.org/10.1017/s0140525x0999152x
Huang, Y. T., & Snedeker, J. (2009). Semantic meaning and pragmatic interpretation in 5-year-olds: Evidence from real-time spoken language comprehension. Developmental Psychology, 45(6), 1723–1739. https://doi.org/10.1037/a0016704
Huettig, F., & McQueen, J. M. (2007). The tug of war between phonological, semantic and shape information in language-mediated visual search. Journal of Memory and Language, 57(4), 460–482. https://doi.org/10.1016/j.jml.2007.02.001
Huettig, F., Rommers, J., & Meyer, A. S. (2011). Using the visual world paradigm to study language processing: A review and critical evaluation. Acta Psychologica, 137(2), 151–171. https://doi.org/10.1016/j.actpsy.2010.11.003
Ito, A., Pickering, M. J., & Corley, M. (2018). Investigating the time-course of phonological prediction in native and non-native speakers of English: A visual world eye-tracking study. Journal of Memory and Language, 98, 1–11. https://doi.org/10.1016/j.jml.2017.09.002
Kampa, A., & Papafragou, A. (2020). Four-year-olds incorporate speaker knowledge into pragmatic inferences. Developmental Science, 23(3), Article e12920. https://doi.org/10.1111/desc.12920
Kumle, L., Võ, M. L.-H., & Draschkow, D. (2021). Estimating power in (generalized) linear mixed models: An open introduction and tutorial in R. Behavior Research Methods, 53(6), 2528–2543. https://doi.org/10.3758/s13428-021-01546-0
Li, X., Li, X., & Qu, Q. (2022). Predicting phonology in language comprehension: Evidence from the visual world eye-tracking task in Mandarin Chinese. Journal of Experimental Psychology: Human Perception and Performance, 48(5), 531–547. https://doi.org/10.1037/xhp0000999
Magnuson, J. S., Tanenhaus, M. K., Aslin, R. N., & Dahan, D. (1999). Spoken word recognition in the visual world paradigm reflects the structure of the entire lexicon. Proceedings of the Twenty-first Annual Conference of the Cognitive Science Society, 331–336.
Matin, E., Shao, K. C., & Boff, K. R. (1993). Saccadic overhead: Information-processing time with and without saccades. Perception & Psychophysics, 53(4), 372–380. https://doi.org/10.3758/BF03206780
McMurray, B., Danelz, A., Rigler, H., & Seedorff, M. (2018). Speech categorization develops slowly through adolescence. Developmental Psychology, 54(8), 1472–1491. https://doi.org/10.1037/dev0000542
Meyer, A. S., Sleiderink, A. M., & Levelt, W. J. M. (1998). Viewing and naming objects: Eye movements during noun phrase production. Cognition, 66(2), B25–B33. https://doi.org/10.1016/s0010-0277(98)00009-2
Ovans, Z. (2022). Developmental parsing and cognitive control (Doctoral dissertation). https://doi.org/10.13016/en2r-ce6z
Özge, D., Kornfilt, J., Maquate, K., Küntay, A. C., & Snedeker, J. (2022). German-speaking children use sentence-initial case marking for predictive language processing at age four. Cognition, 221, Article 104988. https://doi.org/10.1016/j.cognition.2021.104988
Papoutsaki, A., Sangkloy, P., Laskey, J., Daskalova, N., Huang, J., & Hays, J. (2016). WebGazer: Scalable webcam eye tracking using user interactions. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), 25(1), 3839–3845.
Paul, P., Ziegler, J., Chalmers, E., & Snedeker, J. (2019). Children and adults successfully comprehend subject-only sentences online. PLoS ONE, 14(1), Article e0209670. https://doi.org/10.1371/journal.pone.0209670
Prystauka, Y., Altmann, G. T., & Rothman, J. (2023). Online eye tracking and real-time sentence processing: On opportunities and efficacy for capturing psycholinguistic effects of different magnitudes and diversity. Behavior Research Methods. https://doi.org/10.3758/s13428-023-02176-4
Rayner, K., Slowiaczek, M. L., Clifton, C., & Bertera, J. H. (1983). Latency of sequential eye movements: Implications for reading. Journal of Experimental Psychology: Human Perception and Performance, 9(6), 912–922. https://doi.org/10.1037/0096-1523.9.6.912
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Rigler, H., Farris-Trimble, A., Greiner, L., Walker, J., Tomblin, J. B., & McMurray, B. (2015). The slow developmental time course of real-time spoken word recognition. Developmental Psychology, 51(12), 1690–1703. https://doi.org/10.1037/dev0000044
Rossion, B., & Pourtois, G. (2004). Revisiting Snodgrass and Vanderwart's object pictorial set: The role of surface detail in basic-level object recognition. Perception, 33(2), 217–236. https://doi.org/10.1068/p5117
Saslow, M. G. (1967). Latency for saccadic eye movement. Journal of the Optical Society of America, 57(8), 1030–1033. https://doi.org/10.1364/JOSA.57.001030
Sassenhagen, J., & Draschkow, D. (2019). Cluster-based permutation tests of MEG/EEG data do not establish significance of effect latency or location. Psychophysiology, 56(6), Article e13335. https://doi.org/10.1111/psyp.13335
Sekerina, I. A., & Brooks, P. J. (2007). Eye movements during spoken word recognition in Russian children. Journal of Experimental Child Psychology, 98(1), 20–45. https://doi.org/10.1016/j.jecp.2007.04.005
Semmelmann, K., & Weigelt, S. (2018). Online webcam-based eye tracking in cognitive science: A first look. Behavior Research Methods, 50(2), 451–465. https://doi.org/10.3758/s13428-017-0913-7
Simmons, E. S. (2017). The timecourse of phonological competition in spoken word recognition: A comparison of adults and very young children (Master's thesis). https://opencommons.uconn.edu/gs_theses/1156/
Slim, M. S., & Hartsuiker, R. J. (2022). Moving visual world experiments online? A web-based replication of Dijkgraaf, Hartsuiker, and Duyck (2017) using PCIbex and WebGazer.js. Behavior Research Methods. https://doi.org/10.3758/s13428-022-01989-z
Slim, M. S., Hartsuiker, R. J., & Snedeker, J. (2022). The real-time resolution of quantifier scope ambiguity. Paper presented at the 22nd ESCOP Conference, Université de Lille, Lille, France.
Snedeker, J., & Trueswell, J. C. (2004). The developing constraints on parsing decisions: The role of lexical-biases and referential scenes in child and adult sentence processing. Cognitive Psychology, 49(3), 238–299. https://doi.org/10.1016/j.cogpsych.2004.03.001
SR Research. (2021). EyeLink® 1000 Plus Brochure.
Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217), 1632–1634. https://doi.org/10.1126/science.7777863
Theeuwes, J., Kramer, A. F., Hahn, S., & Irwin, D. E. (1998). Our eyes do not always go where we want them to go: Capture of the eyes by new objects. Psychological Science, 9(5), 379–385. https://doi.org/10.1111/1467-9280.00071
Tobii (2010). Tobii TX300 Eye Tracker.
Tobii (2021). Pro Spectrum User Manual.
Trueswell, J. C., Sekerina, I., Hill, N. M., & Logrip, M. L. (1999). The kindergarten-path effect: Studying on-line sentence processing in young children. Cognition, 73(2), 89–134. https://doi.org/10.1016/s0010-0277(99)00032-3
Valenti, R., Staiano, J., Sebe, N., & Gevers, T. (2009). Webcam-based visual gaze estimation. International Conference on Image Analysis and Processing – ICIAP 2009, 5716, 662–671. https://doi.org/10.1007/978-3-642-04146-4_71
Valliappan, N., Dai, N., Steinberg, E., He, J., Rogers, K., Ramachandran, V., Xu, P., Shojaeizadeh, M., Guo, L., Kohlhoff, K., & Navalpakkam, V. (2020). Accelerating eye movement research via accurate and affordable smartphone eye tracking. Nature Communications, 11(1), 4553. https://doi.org/10.1038/s41467-020-18360-5
Vos, M., Minor, S., & Ramchand, G. C. (2022). Comparing infrared and webcam eye tracking in the Visual World Paradigm. Glossa Psycholinguistics, 1(1). https://doi.org/10.5070/G6011131
Walker, R., Walker, D. G., Husain, M., & Kennard, C. (2000). Control of voluntary and reflexive saccades. Experimental Brain Research, 130(4), 540–544. https://doi.org/10.1007/s002219900285
Weighall, A. R., Henderson, L. M., Barr, D. J., Cairney, S. A., & Gaskell, M. G. (2017). Eye-tracking the time-course of novel word learning and lexical competition in adults and children. Brain and Language, 167, 13–27. https://doi.org/10.1016/j.bandl.2016.07.010
White, C. T., Eason, R. G., & Bartlett, N. R. (1962). Latency and duration of eye movements in the horizontal plane. Journal of the Optical Society of America, 52(2), 210–213. https://doi.org/10.1364/josa.52.000210
Xu, P., Ehinger, K. A., Zhang, Y., Finkelstein, A., Kulkarni, S. R., & Xiao, J. (2015). TurkerGaze: Crowdsourcing saliency with webcam based eye tracking. arXiv. http://arxiv.org/abs/1504.06755
Yacovone, A., Shafto, C. L., Worek, A., & Snedeker, J. (2021). Word vs. world knowledge: A developmental shift from bottom-up lexical cues to top-down plausibility. Cognitive Psychology, 131, Article 101442. https://doi.org/10.1016/j.cogpsych.2021.101442
Yang, Q., Bucci, M. P., & Kapoula, Z. (2002). The latency of saccades, vergence, and combined eye movements in children and in adults. Investigative Ophthalmology & Visual Science, 43(9), 2939–2949.
Yang, X., & Krajbich, I. (2021). Webcam-based online eye-tracking for behavioral research. Judgment and Decision Making, 16(6), 1485–1505. https://doi.org/10.1017/S1930297500008512
Zehr, J., & Schwarz, F. (2018). PennController for Internet Based Experiments (IBEX). https://doi.org/10.17605/OSF.IO/MD832
Zhou, P., Crain, S., & Zahn, L. (2014). Grammatical aspect and event recognition in children's online sentence comprehension. Cognition, 133(1), 262–276. https://doi.org/10.1016/j.cognition.2014.06.018
Figures and tables

Figure 1. Example Experiment 1A (left) and Experiment 1B (right) trials. Each competitor image (e.g., mitten) appeared with its own target in the cohort condition (e.g., milk, right) and with another target in the control condition (e.g., banana, left). Image canvas borders turned from gray to purple when WebGazer estimated eye-gaze to fall on the image. Stills include images from Duñabeitia et al. (2018) and Rossion and Pourtois (2004).
Figure 2. Mean WebGazer looks to the target and competitor images by condition in Experiment 1A. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates when looks to the target image differed from chance.
Figure 3. Mean WebGazer looks to the competitor image by condition in Experiment 1A. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates when looks between conditions were reliably different in the cluster analysis.
Figure 4. Mean WebGazer looks to the target image, competitor image, and distractor images (collapsed) by condition in Experiment 1B. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates the temporal overlap of the clusters when target side looks differed from chance in both the horizontal and vertical directions.
Figure 5. Boxplot of participant WebGazer fixation proportions to the target and non-target images in the Experiment 1B control trials from 1200–2000ms after target onset. Mean fixation proportions for each image are labeled and identified by black diamonds. The gray points represent participant means.
Figure 6. Mean WebGazer looks to the competitor image by condition in Experiment 1B. Ribbons indicate standard error. Vertical lines indicate average target word duration.
Figure 7. Mean looks to the target and competitor images by condition in the Experiment 1A annotated webcam video data and in the WebGazer data from the same participants. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates when looks to the target image differed from chance.
Figure 8. Mean looks to the competitor image by condition in the Experiment 1A annotated webcam video data and in the WebGazer data from the same participants. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates when looks between conditions reliably differed.
Figure 9. Mean looks to the target and competitor images by condition in the Experiment 1B annotated webcam video data and in the WebGazer data from the same participants. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates the temporal overlap of the clusters when target side looks differed from chance in both the horizontal and vertical directions.
Figure 10. Boxplot of participant fixation proportions to the target and non-target images in the Experiment 1B control trials from 700–2000ms after target onset for the annotated webcam video data. Mean fixation proportions for each image are labeled and identified by black diamonds. The gray points represent participant means.
Figure 11. Mean looks to the competitor image by condition in the Experiment 1B annotated webcam video data and in the WebGazer data from the same participants. Ribbons indicate standard error. Vertical lines indicate average target word duration. Shading indicates when looks between conditions reliably differed.
Table 1. Experiment 2 participant ages
Figure 12. The 13 possible target stimulus locations in Experiment 2. The panel represents the full experiment screen (the axis labels indicate percentage of screen-size).
Table 2. Experiment 2 mean participant calibration scores by age group
Figure 13. Mean Euclidean distance (in percentage of screen-size) from the target stimulus over the course of the trial. Error bars indicate standard deviation. Ribbons indicate standard error.
Figure 14. Mean Euclidean distance (in percentage of screen-size) from the target stimulus over the course of the trial, broken down by participant calibration score. Ribbons indicate standard error.
Figure 15. Mean Euclidean distance (in percentage of screen-size) from the target stimulus over the course of the trial, broken down by participant age bin. Ribbons indicate standard error.
Figure 16. Density plots indicating estimated looks on the screen 500–1500ms after target onset for each possible target location. Each panel represents the full experiment screen (the axis labels indicate percentage of screen-size), and the black crosses indicate the center of the target locations.
Figure 17. Quadrant looks over time for the Experiment 2 child participants and Slim and Hartsuiker's (2022) adult participants. Ribbons indicate standard error. Shading indicates the temporal overlap of the clusters when target side looks differed from chance in both the horizontal and vertical directions.