1. Introduction
There exists a logically infinite number of potential sound sequences in any given language, yet only some are considered permissible (or well-formed) by speakers. The term phonotactics refers to this implicit knowledge that speakers use to determine the permissible sound sequences in their language. Phonotactic knowledge does not apply uniformly to the entire lexicon – certain lexical exceptions can violate otherwise universally applicable patterns (Guy 2007; Wolf 2011). However, children can acquire regular patterns in the presence of lexical exceptions. For example, despite the existence of disharmonic sequences in their language experience, experimental studies have shown that Turkish infants tune in to non-local phonotactics in vowel harmony patterns as early as six months (Hohenberger et al. 2016; Sundara et al. 2022; see §7 for details).
The challenge of phonotactic learning in the presence of lexical exceptions is illustrated in Figure 1. Under the positive-evidence-only assumption, the learner relies exclusively on unlabelled input data (Marcus 1993), denoted by the filled dots in the figures; conversely, the unfilled dots represent unattested data that are absent from the input. The learning problem is to arrive at a target grammar that can differentiate between grammatical and ungrammatical sequences, represented by 1s and 0s in Figure 1b.

Figure 1 The learning problem in the presence of exceptions (adapted from Mohri et al. 2018: 8). In both (a) and (b), filled dots represent attested data, while unfilled dots indicate unattested data. In (b), 0 indicates the ungrammatical items and 1 indicates grammatical items, assuming Boolean grammaticality.
Learning models that assume all attested sound sequences are grammatical run the risk of building attested but ungrammatical noise into the model. This is a case of ‘overfitting’ in machine learning, in which a model is trained too well on the input data, to the extent that it starts to fit noise, consequently reducing its ability to generalise to unseen data (Mohri et al. 2018). The optimal model does not necessarily fit the input data perfectly; instead, it should filter out or heavily penalise lexical exceptions as perceived noise.
Although exceptionality has been a topic of perennial interest in phonology (Wolf 2011; Moore-Cantwell & Pater 2016; Mayer et al. 2022),Footnote 1 learning models based on categorical grammars capable of handling exceptions remain to be developed. Categorical grammars make clear-cut demarcations between grammatical and ungrammatical sequences (Yang 2016: 3), which can facilitate the identification of lexical exceptions. However, learning models based on categorical grammars are generally considered vulnerable to exceptions in naturalistic corpora, as discussed in Gouskova & Gallagher (2020: 107; emphasis added):
In contrast to our approach, Heinz (2010), Jardine (2016) and Jardine and Heinz (2016) characterise non-local phonology as an idealised problem of searching for unattested substrings. Their learners memorise attested precedence relations between segments and induce constraints against those sequences that they have not encountered. One of the problems with this approach is that it can reify accidental gaps to the level of categorical phonotactic constraints, whereas stochastic patterns with exceptions will stymie it (Wilson & Gallagher 2018).
However, it would be uninsightful to dismiss categorical grammars altogether based on the modest performance of several idealised models, which were designed to explore the mathematical underpinnings of phonological learning, instead of handling real-world corpora. Recent developments have both demonstrated promising results using simple categorical phonotactic learning models in naturalistic corpora (Gorman 2013; Durvasula 2020; Kostyszyn & Heinz 2022) and begun to address challenges such as accidental gaps (Rawski 2021).
The current study undertakes a similar endeavour: rooted in formal language theory, it proposes a novel approach to address the problem of exceptions by integrating frequency information from the input data. This proposal draws inspiration from probabilistic approaches, especially the Hayes & Wilson (2008) phonotactic learner and the traditional observed-over-expected ($O/E$) criterion (Pierrehumbert 1993), and takes the initiative to bridge the gap between the mathematical underpinnings of phonological learning and realistic data, harnessing the potential that categorical grammars can offer. The discrete nature of categorical grammars allows the proposed model to completely filter out lexical exceptions and demonstrates robust performance across noisy corpora from English, Polish and Turkish, successfully learning phonotactic grammars that approximate acceptability judgements in behavioural experiments. Compared to benchmark models, the model performs increasingly better with data that contain a higher proportion of lexical exceptions, reaching its peak in learning Turkish non-local vowel phonotactics despite the complexity introduced by disharmonic forms in the input data.
This article is structured as follows: §2 outlines the theoretical background and related assumptions; §3 introduces the current proposal, the Exception-Filtering learning algorithm; §4 illustrates the evaluation methods and provides an overview of the three subsequent case studies in English (§5), Polish (§6) and Turkish (§7). §8 discusses topics arising from the current study and outlines the directions for future work.
2. Background
This section outlines the essential concepts, underlying assumptions and relevant evidence involved in the current proposal.
2.1. The competence–performance dichotomy
This study assumes three interconnected components involved in phonotactic learning: grammar, lexicon and performance. The relationship between these components is visualised in Figure 2. Together, the lexicon and grammar constitute competence, representing internalised knowledge of a language. Speakers’ acceptability judgements are influenced by both competence and performance factors. For example, a word [sfid] will receive low acceptability if the lexicon does not contain the word and the grammar penalises its substructure *sf. (The rating ‘1 out of 7’ is provided as an example and does not represent actual data.) As highlighted in Figure 2, this article focuses on the acquisition of grammar, abstracting away from lexicon acquisition and general performance factors.

Figure 2 The relationship between lexicon, grammar and performance.
The current study distinguishes between the terms grammaticality (or well-formedness) and acceptability, which have frequently been conflated in previous research in phonology (Hayes & Wilson 2008; Albright 2009). In this context, acceptability refers to the judgements made by speakers on real-world performance, which can be influenced by both grammar and extragrammatical factors, such as processing difficulty, lexical frequency and similarity (Schütze 1996; see §8 for detailed discussion). In contrast, grammaticality refers to the abstract, internalised knowledge represented by the grammar, such as phonotactic constraints in the current article, independent of any extragrammatical factors, such as frequency information. A sound sequence is deemed grammatical only if it adheres strictly to the hypothesis grammar. As in the dual-route model (Pinker & Prince 1988; Zuraw 2000; Zuraw et al. 2021), the lexical route allows the speaker to access the lexicon and evaluate the acceptability of existing (or attested) words, regardless of possible grammar violations. If the lexicon does not contain certain sound sequences, as in nonce words, the speaker instead evaluates their acceptability via the non-lexical route, in which grammaticality is predicted based on the grammar. This grammaticality then interacts with other extragrammatical factors, resulting in acceptability at the performance level.
Therefore, the relationship between grammaticality and acceptability is not one-to-one: certain ungrammatical forms in the lexicon could be deemed more acceptable than some grammatical forms. Due to the existence of extragrammatical factors, models that perfectly align with acceptability could actually deviate from the grammar. This is not due to any inherent inability to explain acceptability, but rather to an overreach in explanatory power, which is caused by representing extragrammatical factors in the grammar (Kahng & Durvasula 2023: 3).
Acceptability judgements are commonly collected through rating tasks employing a numeric Likert scale and characterised as ‘gradient’ (non-categorical) in nature (Albright 2009). Individual Likert ratings correspond to categorical multilevel, rather than continuous, values (e.g., 1 = ‘strongly disagree’, 2 = ‘disagree’, 3 = ‘neutral’, 4 = ‘agree’, 5 = ‘strongly agree’) and exhibit considerable individual variability; such ratings are not incompatible with categorical grammars.Footnote 2 When averaged over multiple participants, these results can present as gradient values, hinting at the need to incorporate individual variability within a categorical framework (see §8 for a discussion). Furthermore, because they are influenced by task effects, rating tasks can elicit gradient responses even for inherently discrete concepts, such as the concept of odd and even numbers (Armstrong et al. 1983; Gorman 2013). Another extragrammatical factor at play in acceptability judgements is auditory illusion, as shown in Kahng & Durvasula (2023).
In light of these considerations, the acceptability judgements reported in previous studies are not incompatible with categorical grammar. On the one hand, the current study assumes that the grammaticality of sound sequences, categorical or probabilistic, is reflected in acceptability judgements, and a successful grammar should exhibit a robust correlation between predicted grammaticality and acceptability judgements to allow ‘direct investigation’ of linguistic competence (Lau et al. 2017). On the other hand, the current study argues that gradient acceptability collected through numerical rating tasks does not necessitate gradient or probabilistic grammars, nor does it negate categorical grammars (cf. Coleman & Pierrehumbert 1997; Hayes & Wilson 2008).
The current study employs categorical grammars using a discrete set of constraints that simply accept grammatical sequences and reject ungrammatical ones. In contrast, probabilistic grammars, such as maximum entropy (MaxEnt) grammars (Hayes & Wilson 2008), involve constraints with continuous weights, assigning a probability continuum across all possible sequences. Analogous to probabilistic grammars, grammaticality in categorical grammars is associated with discrete, often binary values, where 0 signifies ungrammatical sequences, and 1 designates grammatical ones. However, categorical grammars cannot be conflated with probabilistic grammars with thresholds (Hale & Smolensky 2006), which cannot define infinite languages, as mathematically demonstrated in Alves (2023). Probabilistic grammars have been noted for their ability to model human sensitivity to frequency information and approximate gradient acceptability judgements (Hayes & Wilson 2008), whereas categorical grammars delineate a clear boundary between grammatical words and lexical exceptions (Yang 2016: 3). This discrete nature can be used to facilitate phonological learning, as shown in the current study.
Hale & Reiss (2008: 18) adopt a nihilistic view of phonotactic grammars, arguing that phonotactics is not part of phonological grammar, as it is computationally inert in morphophonological alternations (Reiss 2017, §6). However, experimental evidence has shown that infants do acquire phonotactics (Jusczyk et al. 1993, 1994; Jusczyk & Aslin 1995; Archer & Curtin 2016). Recent work also shows that phonotactics can facilitate the learning of morphophonological alternations (Chong 2021). Gorman (2013, §1) demonstrates the internalisation of phonotactic constraints in various domains, such as wordlikeness judgements and loanword adaptation. Furthermore, the current study maintains the concept of categorical grammars, which essentially motivated the adoption of the nihilistic view (see Reiss 2017: 436, who cautions against throwing out the ‘categorical baby’ with – or instead of – the ‘phonotactic bathwater’). In light of this, the current study models the learning of phonotactic grammar as a crucial component within a broader framework of phonological learning (see the discussion in §8).
2.2. Attestedness vs. grammaticality
While the grammar is a finite system representing an infinite number of grammatical sound sequences, the lexicon lists all words that a speaker knows, including all exceptional and unpredictable features of attested input data (Chomsky 1965: 229; Chomsky & Halle 1965; Jackendoff 2002: 153). In turn, the input data in phonotactic learning drawn from the lexicon can include sound sequences that deviate from the grammar.
The current study assumes that exceptionality is not labelled in the input data or the lexicon but emerges from the discrepancy between attestedness in the input data and grammaticality with respect to the hypothesis grammar. Grammaticality indicates whether phonological representations conform to the hypothesis grammar internalised by the learner. Researchers have used various converging methodologies to approximate the hypothesis grammar, especially statistical generalisations (e.g., observed-to-expected ratio; detailed in §3) or performance data such as nonce-word acceptability (detailed below) and speech errors.Footnote 3 For convenience in the discussion, consider a hypothesis grammar consisting of the categorical constraints {*sf, *bn}. The symbol * is used here only to indicate ungrammatical sequences (as opposed to unattested ones). In contrast, attestedness indicates whether a sound sequence occurs in the input data. [brɪk] (as in brick) and *[sfiə] (sphere) are both attested in the English lexicon, while [blɪk] (blick) and *[bnɪk] (bnick) are not, as illustrated in Table 1.
Table 1 The distinction between attestedness and grammaticality (adapted from Hyman 1975)

This discrepancy between attestedness and grammaticality yields both accidental gaps (grammatical but unattested) and lexical exceptions (attested but ungrammatical), with this article particularly emphasising the latter. For example, although both are nonexistent words, blick is grammatical while *bnick is not, as speakers uniformly reject *bnick while accepting blick, a classic example of an accidental gap (Chomsky & Halle 1965; Hayes & Wilson 2008).Footnote 4
The attested sequences are considered lexical exceptions if and only if they violate the hypothesis grammar, such as $\{$*sf, *bn$\}$ in the above example. Sphere is a classic example of a lexical exception: the onset [sf] rarely occurs in English and has been labelled ungrammatical in previous work (Hyman 1975; Algeo 1978; Kostyszyn & Heinz 2022). The architecture in Figure 2 predicts that the acceptability of the attested word ‘sphere’ itself is directly influenced by the lexicon and is considered highly acceptable by some speakers. However, when they are not stored in the lexicon, [sf]-onset nonce words are commonly judged unacceptable, as shown in an experiment conducted by Scholes (1966: 114): 33 seventh-grade English speakers were asked if a nonce word ‘is likely to be usable as a word of English’. Only seven participants responded ‘yes’ to the [sf]-onset nonce word [sfid], lower than [blʌŋ] (31 ‘yes’), and even lower than words with unattested onsets such as [mlʌŋ] (13 ‘yes’). Leveraging the converging evidence that *sf is a phonotactic constraint in the hypothesis grammar, the attested [sf]-onset word sphere can be considered a lexical exception, in contrast to attested and grammatical brick.Footnote 5
Lexical exceptions are also commonly observed in loanwords, leading to an evolving lexicon that could incorporate ungrammatical sound sequences from various languages (Kang 2011). For example, exceptional onsets can be observed in English loanwords, such as [bw] Bois, [sr] sri, [ʃm] schmuck, [ʃl] schlock, [ʃt] shtick, [zl] zloty, and adapted names from different languages, including [vr] Vradenburg. All these onsets exhibit low type frequencies in English, according to the CMU Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict), and they receive relatively low acceptability scores in nonce word judgements (Scholes 1966; see also §5). Similar examples have been observed in other languages where putative phonotactic restrictions do not extend to loanwords (Gorman 2013: 6–7). Thus, this article takes the position that input data drawn from the lexicon can contain lexical exceptions according to the hypothesised phonotactic grammar.
2.3. Summary
This section has underscored the tension between competence and performance and clarified the nuanced distinctions between acceptability and grammaticality. The current study uses a categorical grammar that distinguishes between grammatical and ungrammatical data. This section argues that the learning model should correlate the grammaticality scores predicted by the learnt grammar with acceptability judgements and handle lexical exceptions by using an exception-filtering mechanism based on frequency information.
3. The Exception-Filtering phonotactic learner
This section proposes a categorical-grammar-plus-exception-filtering approach to select a hypothesised categorical grammar (hereafter ‘hypothesis grammar’) from the hypothesis space. This section starts by justifying the concepts and assumptions of the current proposal and then introduces the core learning algorithm in §3.4.
3.1. Segment-based representation
The primary objective of this study is not to build an all-around model of phonotactic learning, but to distill the problem of exceptions to its essence at the computational level (Marr 1982). For this reason, the current proposal adopts segmental representations derived from input data for their practical advantages, a departure from the prespecified feature representations advocated by previous studies (Hayes & Wilson 2008; Gouskova & Gallagher 2020). In this article, a segmental approach facilitates the analysis of exceptions tied to segment-based constraints. For example, the presence of [sf] in the word sphere explicitly violates a single segmental constraint *sf but could be associated with several feature-based constraints such as *[+sibilant, −voice][+labiodental, −voice] and *[+alveolar][+labiodental]. Moreover, when training data are phonemically transcribed, segmental representations can be directly obtained from the input data, independent of any prespecified feature system. Employing segmental representations also significantly narrows down the hypothesis space, as discussed below.
3.2. The structure of grammars and hypothesis space
Phonotactic learning involves selecting a hypothesis grammar (G, a set of constraints) from the hypothesis space (a term adapted from OT). The current study uses a non-cumulative, inviolable and unranked categorical grammar, labelling any sequence with non-zero constraint violations as ‘ungrammatical’ and those with zero violations as ‘grammatical’. The current study intentionally departs from the cumulative effects suggested in previous experimental work (Coleman & Pierrehumbert 1997; Breiss 2020; Kawahara & Breiss 2021), and primarily investigates whether phonotactic learning of categorical grammars is possible in the presence of exceptions. One possible way to incorporate cumulativity in the future could involve replacing the grammaticality function with the sum of constraint violations (see also §8).
This structure, while similar to that of Optimality Theory (OT; Prince & Smolensky 1993; Prince & Tesar 2004), diverges significantly from OT’s cumulative, violable and ranked constraint grammars. In contrast to OT, the hypothesis grammar in the current proposal is drawn from a highly restrictive hypothesis space.Footnote 6
Based on the analytical results of formal language theory (FLT), the current study adopts tier-based strictly k-local (TSL$_k$) languages (Heinz et al. 2011; Jardine & Heinz 2016; Lambert & Rogers 2020) as the hypothesis space. In formal language theory, a ‘language’ is a set of strings (e.g., sound sequences) that adhere to its associated grammar, which can be mathematically characterised as a set of forbidden structures.

k-factors are substrings of length k. A TSL$_k$ grammar consists of all forbidden k-factors on a specific tier, known as TSL$_k$ constraints. The tier, also referred to as a projection (Hayes & Wilson 2008), functions as a targeted subset of the inventory of phonological representations (e.g., segments, consonants, vowels) for constraint evaluation. In the context of local phonotactics, the tier encompasses the full inventory, such as all segments, while in non-local phonotactics, it includes only specific segments, such as vowels. For example, as shown in Figure 3, the Turkish word [døviz] ‘currency’ is represented as [øi] on the vowel tier. Non-tier segments are ignored during the evaluation of tier-based constraints. Therefore, [døviz] violates a tier-based local constraint *øi on the vowel tier. This concept of a tier is similar to, but distinct from, the traditional feature-based definition in autosegmental phonology (Goldsmith 1976).

Figure 3 Extraction of vowel tier from the Turkish word [døviz] ‘currency’. The vowel tier contains the vowels in this word, disregarding the non-tier consonants.
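Tier projection as described above can be sketched in a few lines of Python. This is an illustrative sketch, not the article's implementation; the helper name `project_tier` and the vowel set (a simplified subset of the Turkish vowel inventory in IPA) are assumptions for the example.

```python
# Simplified subset of Turkish vowels in IPA (illustrative only).
TURKISH_VOWELS = {"a", "e", "i", "o", "u", "ø", "y", "ɯ", "œ"}

def project_tier(word, tier):
    """Project a word onto a tier: keep tier segments, drop the rest."""
    return "".join(seg for seg in word if seg in tier)

# [døviz] 'currency' surfaces as [øi] on the vowel tier,
# so it contains the forbidden tier sequence *øi.
vowel_tier = project_tier("døviz", TURKISH_VOWELS)  # "øi"
violates = "øi" in vowel_tier                       # True
```

Non-tier segments (here, the consonants [d], [v], [z]) are simply invisible to the constraint, which is what makes the constraint ‘local on the tier’ even though the vowels are non-adjacent in the word.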
A string is labelled as grammatical if it does not contain any forbidden k-factors specified by the grammar; otherwise, the string is considered ungrammatical. This can be formalised by the function $\texttt{factor}(s,k)$, which generates all k-factors of a string s. For example, $\texttt{factor}(\text{CCV},2) = \{\text{CC}, \text{CV}\}$ and $\texttt{factor}(\text{CVC},2) = \{\text{CV}, \text{VC}\}$. The grammaticality score of a string s under a grammar G, denoted as $g(s, G)$, is defined as follows:

$$g(s, G) = \begin{cases} 1 & \text{if } \texttt{factor}(s, k) \cap G = \emptyset\\ 0 & \text{otherwise.} \end{cases}$$

For example, consider a grammar $G = \{$*CC$\}$, which forbids any strings containing the sequence CC. In this case, the string CCV would be deemed ungrammatical, while the string CVC would be classified as grammatical.
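These definitions translate directly into code. The following is a minimal sketch under the assumption that constraint symbols are stored without the leading * (so *CC is represented as the string "CC"):

```python
def factor(s, k):
    """All k-factors (contiguous substrings of length k) of string s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def g(s, grammar, k=2):
    """Categorical grammaticality: 1 if s contains no forbidden
    k-factor from the grammar, 0 otherwise."""
    return 0 if factor(s, k) & grammar else 1

assert factor("CCV", 2) == {"CC", "CV"}
assert g("CCV", {"CC"}) == 0  # *CC is violated
assert g("CVC", {"CC"}) == 1  # no forbidden 2-factor
```

Because the grammar is non-cumulative, a single violation suffices for ungrammaticality; the number of violations is never counted.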
TSL$_k$ languages delineate a formally restrictive but typologically robust hypothesis space, capturing a range of local and non-local phonotactics (Heinz et al. 2011). Specifically, McMullin & Hansson (2019) provide experimental evidence for TSL$_2$ as a viable working hypothesis space for phonotactic learning, demonstrating that adult participants in artificial learning experiments were able to learn TSL$_2$ patterns, but struggled with patterns that fall outside the TSL$_2$ class. Formal language-theoretic studies have also demonstrated that this hypothesis space is accompanied by efficient learning properties (Heinz et al. 2011; Jardine & Heinz 2016; Jardine & McMullin 2017). This approach has been successfully applied in previous work spanning both probabilistic and categorical approaches (Heinz 2007; Hayes & Wilson 2008; Jardine & Heinz 2016; Gouskova & Gallagher 2020; Mayer 2021; Dai et al. 2023).

One of the main challenges of phonotactic learning, as discussed in Hayes & Wilson (2008: 392), is the rapid growth of the hypothesis space with increasing size of k. In response to this challenge, the current study limits k to two (TSL$_2$), which is sufficient to capture a large number of local and non-local phonotactic patterns. Although this article only examines local phonotactics of English and Polish onsets and non-local phonotactics of Turkish vowels, the proposed hypothesis space is broadly applicable for suitable domains, extending to phenomena such as non-local laryngeal phonotactics in Quechua (Gouskova & Gallagher 2020), Hungarian vowel harmony (Hayes & Londe 2006) and Arabic OCP-Place patterning (Frisch & Zawaydeh 2001; Frisch et al. 2004). To summarise, the learner hypothesises a non-cumulative, inviolable and unranked categorical TSL$_2$ grammar, derived from the hypothesis space of TSL$_2$ languages.
3.3. The Exception-Filtering mechanism and O/E criterion
The goal of phonotactic learning is to select the grammar that distinguishes between grammatical and ungrammatical sequences from unlabelled input data. This problem is challenging in the presence of exceptions because intrusions of ungrammatical sequences can mislead the learner to build exceptional patterns into the hypothesis grammar. Computationally, a learning model exposed solely to positive evidence struggles to identify the target grammar from the hypothesis space of numerous formal language classes (Gold 1967; Osherson et al. 1986). This challenge is particularly evident in classes of linguistic interest, such as the (tier-based) strictly 2-local languages. An in-depth review of this issue can be found in Wu & Heinz (2023).
One approach to address the challenge of exceptions uses an exception-filtering mechanism to exclude exceptions while learning categorical grammars. Hayes & Wilson (2008: 427–428) hypothesise that children possess an innate ability to discern the unique status of certain exotic items, and improve their learning results by excluding exotic items from input data. This ability to detect and exclude anomalies aligns closely with the concept of exception-filtering in the current proposal. Although such a mechanism has been considered challenging to formulate (Clark & Lappin 2011: 105), the current study achieves it by leveraging indirect negative evidence derived from frequency information (Clark & Lappin 2009; Pearl & Lidz 2009; Yang 2016), specifically from type frequency (Pierrehumbert 2001; Hayes & Wilson 2008; Richtsmeier 2011).Footnote 7 Indirect negative evidence allows learners to infer grammaticality labels for unseen data, despite the absence of such labels in positive evidence, guided by the principle that a sequence occurring less frequently than expected in the input data is likely ungrammatical.
The comparison between observed (O) and expected (E) type frequencies embodies the exception-filtering mechanism in the current study and has been widely applied in the identification of phonotactic constraints (Pierrehumbert 1993, 2001; Frisch et al. 2004; Hayes & Wilson 2008) since Trubetzkoy (1939, ch. 7). For instance, the exceptional [sf] sequence would have the same expected type frequency as grammatical sequences like [br] (as in brick) if no constraints are present in the current grammar. However, if [sf] only appears in a limited number of words, such as sphere, its observed type frequency would be significantly lower than its expected type frequency. This discrepancy allows the learner to infer a *sf constraint and classify the observed sphere as a lexical exception.
The traditional $O/E$ equation proposed by Pierrehumbert (1993) has been widely applied to discover phonotactic constraints (Pierrehumbert 2001; Frisch et al. 2004). However, this equation assumes an empty hypothesis grammar, which becomes inaccurate once any constraint is added, as discussed in Wilson & Obdeyn (2009) and Wilson (2022).
The current $O/E$ criterion draws inspiration from Hayes & Wilson (2008); a crucial difference is that the hypothesis grammar here is non-cumulative, leading to distinct calculations of O and E. The observed type frequency (O) of a potential constraint C is determined by the count of unique strings in the sample that violate C:

$$O[C] = \left|\{s \in S : C \in \texttt{factor}(s, 2)\}\right|$$

In a toy sample $S = \{$CVC, CVV, VVC, VVV, VCV, CCV$\}$, $O[\text{*CC}]=1$, $O[\text{*CV}]=4$, $O[\text{*VC}]=3$ and $O[\text{*VV}]=3$. Here, $O[\text{*VV}]$ is $3$ rather than $4$ because, by definition, $O[C]$ counts the number of strings that violate the potential constraint (at least once) rather than the cumulative number of substring violations across all strings; the string VVV is therefore counted only once in $O[\text{*VV}]$. Moreover, O is updated during the learning process, as the learner filters out lexical exceptions from the input data S every time a new constraint is added to the hypothesis grammar.
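The non-cumulative counting of O can be sketched as follows (illustrative helper names; constraints are stored without the leading *):

```python
def factor(s, k=2):
    """All k-factors (contiguous substrings of length k) of string s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def observed(C, sample, k=2):
    """O[C]: number of unique strings in the sample containing the
    forbidden factor C at least once (types, not cumulative tokens)."""
    return sum(1 for s in set(sample) if C in factor(s, k))

S = {"CVC", "CVV", "VVC", "VVV", "VCV", "CCV"}
assert observed("CC", S) == 1
assert observed("VV", S) == 3  # VVV counted once, not twice
```

Because `factor` returns a set, multiple occurrences of the same 2-factor inside one string collapse to one, matching the type-based definition of O.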
The expected type frequency $E[C]$ represents the number of unique strings in the hypothesised language L that violate C, under a non-cumulative hypothesis grammar G.Footnote 8 Following Hayes & Wilson (2008), the current study works with an estimation of $E[C]$ by limiting the maximum string length in L to $\ell_{\max}$, the length of the longest string in the input data S. $E[C]$ is then approximated by

$$E[C] \approx \sum_{\ell=1}^{\ell_{\max}} E_{\ell}[C]$$
Here, the learner first partitions the input data $S = S_1 \cup S_2 \cup \ldots \cup S_{\ell_{\max}}$ and the hypothesised language $L = L_1 \cup L_2 \cup \ldots \cup L_{\ell_{\max}}$ into subsets by string length. $E_{\ell}[C]$ is the expected number of unique strings in each $S_{\ell}$ that violate C:

$$E_{\ell}[C] = |S_{\ell}| \times \textit{Ratio}(C, G, \ell).$$

$\textit{Ratio}(C, G, \ell)$ represents the proportion of strings of length $\ell$ accepted by G but violating C. This is found by comparing the accepted strings in G and $G' = G \cup \{C\}$, where C is added to G.Footnote 9 $\textit{Count}(G,\ell)$ is the count of unique $\ell$-length strings in the hypothesis language L accepted by G. Therefore, $\textit{Count}(G,\ell)-\textit{Count}(G',\ell)$ is the number of unique strings that violate C in L, so that $\textit{Ratio}(C, G, \ell) = (\textit{Count}(G,\ell)-\textit{Count}(G',\ell))/\textit{Count}(G,\ell)$.
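Under the same definitions, a brute-force sketch of this approximation enumerates all strings over the tier alphabet, which is feasible only for toy inventories; the helper names `accepts`, `count` and `expected` are mine, not the authors':

```python
from itertools import product

def accepts(grammar, string):
    """A non-cumulative grammar accepts a string iff no banned two-factor occurs in it."""
    return not any(c.lstrip("*") in string for c in grammar)

def count(grammar, length, alphabet=("C", "V")):
    """Count(G, l): number of unique l-length strings over the tier accepted by G."""
    return sum(accepts(grammar, "".join(t)) for t in product(alphabet, repeat=length))

def expected(constraint, grammar, sample):
    """E[C], summed over lengths: |S_l| * (Count(G,l) - Count(G',l)) / Count(G,l)."""
    total = 0.0
    for l in {len(s) for s in sample}:
        s_l = sum(len(s) == l for s in sample)
        c_g, c_gp = count(grammar, l), count(grammar | {constraint}, l)
        total += s_l * (c_g - c_gp) / c_g
    return total

S = ["CVC", "CVV", "VVC", "VVV", "VCV", "CCV"]
assert count(set(), 3) == 8            # empty grammar permits all 2^3 strings
assert count({"*CC"}, 3) == 5          # CCC, CCV, VCC are excluded
assert expected("*CC", set(), S) == 2.25
```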
Table 2 illustrates this calculation with exception-free input data that perfectly align with each corresponding hypothesis grammar G. The first row shows an empty hypothesis grammar ($G=\emptyset$) along with input data $\{$CCC, CCV, CVC, CVV, VVV, VCV, VCC, VVC$\}$ (where $|S_{3}| = 8$). $\textit{Count}(\emptyset,3) = 8$, given that the empty hypothesis grammar permits eight potential strings $\{$CCC, CCV, VCC, CVC, CVV, VVV, VCV, VVC$\}$ of length 3.
Table 2 The list of idealised input data and corresponding hypothesis grammar, as well as expected frequencies for length 3. The input data $S_3$ here is idealised and identical to the target language $L_3$
When *CC is added to the intersected grammar, resulting in $G' = \{\text{*CC}\}$, $G'$ only permits five strings $\{$CVC, CVV, VVV, VCV, VVC$\}$ ($\textit{Count}(\{\text{*CC}\},3) = 5$). The expected frequency of *CC is calculated as in (6):

$$E[\text{*CC}] = |S_3| \times \textit{Ratio}(\text{*CC}, \emptyset, 3) = 8 \times \frac{8-5}{8} = 3. \qquad (6)$$

This matches the fact that three strings $\{$CCC, CCV, VCC$\}$ violate the potential constraint *CC in the idealised input data $L_3$ in the first row. Here, $E\left[\text{*CC}\right] = E_3\left[\text{*CC}\right]$ because only strings of length 3 exist in the input data.
Following this update, ungrammatical strings (violating G) are filtered from the input data S. When G becomes $\{$*CC$\}$, as shown in the second row of Table 2, the input data shrink to $\{$CVC, CVV, VVV, VCV, VVC$\}$ ($|S_3| = 5$). $E[\text{*CC}]$ drops to zero, because *CC is already penalised by G ($\text{*CC}\in G$). For other potential constraints, for example, $E[\text{*VV}] = |S_3| \times (\frac{5-2}{5}) = 5 \times \frac{3}{5} = 3$, since three of the five strings allowed by $G = \{\text{*CC}\}$ violate *VV.
Although alternative calculations, such as $O-E$, yield similar learning results, $O/E$ has the advantage of a clear range from $0$ ($O=0$) to $1$ ($O=E$). During the learning process, a constraint is included in the grammar if the $O/E$ ratio falls below a specified threshold ($O/E<\theta$). This comparison is performed at increasing threshold levels, ranging from $0.001$ to $\theta_{\max}$, also known as the accuracy schedule (Hayes & Wilson Reference Hayes and Wilson2008), where the interval after $0.1$ is fixed to $0.1$. For example, the accuracy schedule is $\Theta = [0.001, 0.01, 0.1, 0.2, 0.3, \ldots, 1]$ if $\theta_{\max} = 1$. This structure prioritises the integration of potential constraints with the lowest $O/E$ values.Footnote 10 $\theta_{\max}$ can be interpreted as follows: a higher $\theta_{\max}$ indicates the need for more statistical support, that is, higher $O/E$, before a two-factor is considered grammatical. This also allows for the modelling of individual variability in phonotactic learning, where some learners require more statistical support for grammatical sequences, reflected by a higher $\theta_{\max}$.
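The schedule just described can be sketched as follows (an illustrative helper; the function name and the floating-point handling are mine):

```python
def accuracy_schedule(theta_max):
    """Theta = [0.001, 0.01, 0.1, 0.2, ..., theta_max]; after 0.1 the step is fixed at 0.1."""
    thetas = [t for t in (0.001, 0.01) if t < theta_max]
    k = 1
    while round(0.1 * k, 1) <= theta_max:
        thetas.append(round(0.1 * k, 1))  # round to avoid float drift (0.30000000000000004)
        k += 1
    return thetas

assert accuracy_schedule(0.1) == [0.001, 0.01, 0.1]
assert accuracy_schedule(0.5) == [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5]
```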
Equipping the Exception-Filtering learner with the accuracy schedule adapted from Hayes & Wilson (Reference Hayes and Wilson2008) controls the contrast between the two learners and facilitates direct comparison between their best-performing models. Dealing with realistic corpora and experimental data requires posterior adjustments of $\theta_{\max}$: the analyst/user sets this hyperparameter to the value between $0$ and $1$ that achieves the highest scores on all statistical tests in each test data set. In this article, $\theta_{\max}$ is set to $0.1$ for the English and Polish case studies and $0.5$ for the Turkish case study. The current study shows that once an appropriate hyperparameter is in place, the proposed model can successfully acquire categorical grammars despite the existence of lexical exceptions.
Future psycholinguistic studies are required to better model the factors that determine $\theta_{\max}$. For example, Frisch et al. (Reference Frisch, Large, Zawaydeh, Pisoni, Bybee and Hopper2001) showed that the larger the lexicon size of individual participants in their experiment, the more likely they were to accept sequences with low type frequency, which corresponds to a lower $\theta_{\max}$ in the Exception-Filtering learner.
3.4. Learning procedure
Building on the concepts above, the Exception-Filtering learner models how a child learner acquires a categorical phonotactic grammar given the input data. The learning problem in the presence of exceptions is formalised as follows: given the input data S, select a hypothesis grammar G from the hypothesis space, so that G approximates the target grammar $\mathcal{T}$ that defines the target language $\mathcal{L}$.Footnote 11 The input data S include grammatical strings from $\mathcal{L}$ and a limited number of ungrammatical strings outside $\mathcal{L}$, that is, lexical exceptions; speech errors and other noise are disregarded and reserved for future investigation.
Let us look at a toy example: given the tier (also the inventory) $\{$C, V$\}$, consider the target grammar $\mathcal{T}=\{\text{*CC}\}$. The hypothesis space consists of all possible two-factors on the tier: $\{$*CC, *CV, *VV, *VC$\}$. The toy input data $\mathcal{S}=\{\text{CVC, CVV, VVC, VVV, VCV, CCV}\}$ includes one exception, CCV, which violates the target grammar $\mathcal{T}$. Though the toy example limits the string length to three, the learner can handle samples of varying lengths.
As visualised in Figure 4, given the input data S, the tier and the maximum $O/E$ threshold $\theta_{\max}$, the learner first initialises an empty hypothesis grammar G and the hypothesis space Con (Step 1). The learner then selects the next threshold $\theta$ from the accuracy schedule $\Theta$ (Step 2). Subsequently, the learner computes $O/E$ for each potential constraint in Con (Step 3). Constraints with $O/E < \theta$ are integrated into G and removed from Con, and all lexical exceptions that violate these constraints are filtered out of the input data S (Step 4). This is followed by a reselection of $\theta$, a reevaluation of the values of $O/E$ and an update of G, Con and S (Steps 2, 3 and 4). The learner follows the accuracy schedule and incrementally sets a higher threshold for constraint selection. The iteration continues until the threshold reaches its maximum value ($\theta = \theta_{\max}$), marking the termination. The following paragraphs illustrate this learning procedure using the toy input data containing the exceptional string CCV. Given the page limitations, a simplified accuracy schedule $\Theta = [0.5, 1]$ with $\theta_{\max}=1$ is used to avoid too many iterations.

Figure 4 The learning procedure of the Exception-Filtering learner.
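The whole loop in Figure 4 can be sketched as follows. This is a minimal, self-contained illustration, not the authors' implementation: all names are mine, and, unlike the batch update described above, constraints are adopted one at a time within each pass over the schedule.

```python
from itertools import product

def accepts(grammar, string):
    """A non-cumulative grammar accepts a string iff no banned two-factor occurs in it."""
    return not any(c.lstrip("*") in string for c in grammar)

def count(grammar, length, alphabet):
    """Count(G, l): number of unique l-length strings over the tier accepted by G."""
    return sum(accepts(grammar, "".join(t)) for t in product(alphabet, repeat=length))

def o_and_e(c, grammar, data, alphabet):
    """Observed and expected type frequencies of a candidate constraint c."""
    o = sum(1 for s in set(data) if c.lstrip("*") in s)
    e = 0.0
    for l in {len(s) for s in data}:
        c_g = count(grammar, l, alphabet)
        c_gp = count(grammar | {c}, l, alphabet)
        e += sum(len(s) == l for s in data) * (c_g - c_gp) / c_g
    return o, e

def exception_filtering_learner(data, alphabet=("C", "V"), schedule=(0.5, 1.0)):
    """Step 1: initialise empty G and the two-factor hypothesis space Con.
    Steps 2-4: walk the accuracy schedule, adopt constraints with O/E < theta,
    and filter violating strings (exceptions) out of the data."""
    grammar = set()
    con = {"*" + a + b for a in alphabet for b in alphabet}
    data = list(data)
    for theta in schedule:                                 # Step 2: select theta
        for c in sorted(con):
            o, e = o_and_e(c, grammar, data, alphabet)     # Step 3: compute O/E
            if e > 0 and o / e < theta:                    # Step 4: update G, Con, S
                grammar.add(c)
                con.discard(c)
                data = [s for s in data if c.lstrip("*") not in s]
    return grammar

toy = ["CVC", "CVV", "VVC", "VVV", "VCV", "CCV"]
assert exception_filtering_learner(toy) == {"*CC"}  # CCV is filtered as an exception
```

On the toy data, *CC is adopted at $\theta = 0.5$ (since $1/2.25 \approx 0.44 < 0.5$), CCV is filtered out, and no further constraint reaches $O/E < 1$.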
3.4.1. Step 1: initialisation
Given the input data S and the tier $\{$C, V$\}$, the learning process begins with the initialisation of a hypothesis grammar G. Initially, G is an empty set, implying that all possible sequences are assumed to be grammatical prior to the learning procedure. The learner also defines the hypothesis space Con, which encompasses all forbidden two-factors. This initialisation process is shown in Table 3, where the left side shows the initialisation of O and E, and the right side stores the variables.
Table 3 Initialisation

3.4.2. Steps 2 and 3: select θ, compute O/E
Following the initialisation, the learner selects the first $\theta = 0.5$ from the accuracy schedule and calculates the observed type frequency O and expected type frequency E for each potential constraint within the hypothesis space Con. In essence, $O[C]$ is the number of strings in the input data that violate a potential constraint C, while $E[C]$ is the number of violating strings that would be expected under the current grammar G.

Consider the toy input data $S = \text{\{CVC, CVV, VVC, VVV, VCV, CCV\}}$ ($|S|=6$). For the potential constraint *CC, $\textit{Count}(G,3) = 8$ and $\textit{Count}(G',3) = 5$, because three strings in the language defined by G (namely, CCV, VCC, CCC) violate the updated grammar $G' = \{\text{*CC}\}$. The proportion of strings violating *CC is $\textit{Ratio}(\text{*CC}, \emptyset, 3) = 1-\frac{5}{8} = \frac{3}{8}$. As a result, $E[\text{*CC}] = |S|\times\textit{Ratio}(\text{*CC}, \emptyset, 3)=6\times\frac{3}{8}=2.25$, as illustrated in Table 4.
Table 4 Compute O and E

3.4.3. Step 4: update G, Con and S (exception filtering)
The learner then stores potential constraints with $O/E < \theta$ in G. Here, $O[\text{*CC}]/E[\text{*CC}] = 1/2.25 \approx 0.44 < \theta = 0.5$, so the learner updates G with *CC, as shown in Table 5. The sample S is also updated: strings that contradict the updated hypothesis grammar are filtered out. In this case, the potential constraint *CC is added to G and removed from Con, and the string CCV is removed from S. This process is depicted in Table 5.
Table 5 Update G, Con and S

To prevent the overestimation of $O/E$, the learner filters out ungrammatical strings, including exceptions, from the input data. This is because adding one constraint to the hypothesis grammar affects the expected frequencies of other two-factors.Footnote 12 For instance, after integrating *CC into the hypothesis grammar, CCV, VCC and CCC should no longer be considered in the expected frequency count, thereby reducing the expected frequencies of *CV and *VC. This mechanism ensures that the learner continues the subsequent learning process without the negative impact of identified lexical exceptions.
3.4.4. Iteration and termination
The learner then enters an iterative process and returns to Step 2 to reselect $\theta$ and recalculate O and E based on the updated hypothesis grammar G. This iteration is crucial, as the values of O and E depend on the current state of G. The process continues until the accuracy schedule is exhausted ($\theta=\theta_{\max}$) and no further constraints are selected, marking the termination of learning. The term convergence is avoided in this context because establishing its conditions requires a more general proof, which is reserved for future research.
In the second iteration of the toy example, after *CC is added to G and removed from Con (hence ‘–’ in the $O[\text{*CC}]$ and $E[\text{*CC}]$ cells of Table 6), $\theta$ is reassigned to $1$, and no constraint satisfies $O/E<\theta$. $\theta=\theta_{\max}=1$ indicates the termination of the learning process. The learnt grammar matches the target grammar $\mathcal{T} = \{$*CC$\}$, as shown in Table 6.
Table 6 Steps 2 and 3 after the first iteration

3.5. Summary
To summarise, the Exception-Filtering learner initiates the learning process with an empty hypothesis grammar, allowing all possible sequences. As it accumulates indirect negative evidence from the input data, the learner gradually filters out exceptions, shrinks the space of possible sequences, and updates the hypothesis grammar G based on the comparison of observed and expected type frequencies. The learner iteratively filters out lexical exceptions from the input data, rather than accepting them into the hypothesis grammar.
4. Evaluation
This section aims to provide a clear methodology for evaluating the proposed learning model. Inspired by Hastie et al. (Reference Hastie, Tibshirani and Friedman2009), the evaluation in the current study consists of four dimensions (two analytical and two statistical):

1. Scalability: can the model be applied successfully to a wide range of data sets?

2. Interpretability: can human analysts (e.g., linguists) interpret the learnt grammar?

3. Model assessment: evaluating the performance of the model on new data, achieved through the statistical tests against the test data sets discussed below.

4. Model comparison: comparing the performance of different models.
The current study examines these four dimensions through three case studies in representative data sets: local onset phonotactics in English and Polish and non-local vowel phonotactics in Turkish. Learning from onset phonotactics controls the influence of syllable structures and considerably simplifies the learning problem (Daland et al. Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011; Jarosz Reference Jarosz2017; Jarosz & Rysling Reference Jarosz and Rysling2017). In Turkish, however, learning models are applied to vowel tiers without specified syllabic structures.
Moreover, the current proposal is compared to the learning algorithm proposed by Hayes & Wilson (Reference Hayes and Wilson2008) – henceforth referred to as the HW learner – due to its widespread acceptance in the field and its accessible software (the UCLA Phonotactic Learner, available at https://linguistics.ucla.edu/people/hayes/Phonotactics/), making it an ideal benchmark for comparison. In the case studies, the hyperparameters Max $O/E$ (0.1 to 1; similar to $\theta_{\max}$ in this article) and Max gram size n (2 to 3) in the HW learner were fine-tuned so that only the highest-performing models across all tests are reported.Footnote 13 A maximum constraint limit of 300 was established only in the Turkish case study, due to hardware limitations when handling a large corpus. Moreover, the default Gaussian prior ($\mu=0, \sigma=1$) is used to reduce overfitting and handle exceptions (Hayes & Wilson Reference Hayes and Wilson2008: 387).Footnote 14
The current study also implements a simple categorical tier-based strictly 2-local phonotactic learner (henceforth Baseline, capitalised to distinguish it from other baseline models), adapted from the memory-seg learner (Wilson & Gallagher Reference Wilson and Gallagher2018) and other previous work (Gorman Reference Gorman2013; Kostyszyn & Heinz Reference Kostyszyn, Heinz, Jurgec, Duncan, Elfner, Kang, Kochetov, O’Neill, Ozburn, Rice, Sanders, Schertz, Shaftoe and Sullivan2022), in which a string is considered grammatical ($g = 1$) if all its two-factors have non-zero frequency in the input data, and ungrammatical ($g = 0$) otherwise.
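The Baseline criterion can be sketched as follows (an illustrative reconstruction, ignoring word boundaries for simplicity; the function name is mine):

```python
def baseline_grammatical(string, sample):
    """Baseline: g = 1 iff every two-factor of the string has non-zero
    type frequency in the input data, else g = 0 (word boundaries omitted)."""
    attested = {s[i:i + 2] for s in sample for i in range(len(s) - 1)}
    return int(all(string[i:i + 2] in attested for i in range(len(string) - 1)))

sample = ["CVC", "CCV"]
assert baseline_grammatical("CVC", sample) == 1  # CV and VC are both attested
assert baseline_grammatical("VVC", sample) == 0  # VV never occurs in the sample
```

Because the Baseline never filters anything, an attested exception such as CCV licenses the two-factor CC for all future judgements; the exception-filtering mechanism is what separates the proposed learner from this Baseline.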
As the current study proposes a categorical-grammar-plus-exception-filtering-mechanism approach, contrasting it with the HW learner sheds light on the role of categorical grammars, while comparing it with the Baseline learner highlights the significance of the exception-filtering mechanism. All models are trained on the same input data.
Although none of the learning models here claim to be the exact algorithm performed by child learners, comparing their learning results and behavioural data provides valuable insights into the underlying mechanisms of phonotactic learning in the face of exceptions. In English and Polish case studies, the learnt grammars are tested on the acceptability judgements from behavioural data. In the Turkish case study, while conducting a new experiment falls outside the scope of the current study, the study approximates the acceptability judgements using the experimental data collected by Zimmer (Reference Zimmer1969). This is in line with the methodology employed by Hayes & Wilson (Reference Hayes and Wilson2008) for deriving acceptability judgements in English from Scholes (Reference Scholes1966). Moreover, the learnt grammar is contrasted with the documented grammar as analysed by human linguists. This has been a standard method in phonotactic modelling. For example, Hayes & Wilson (Reference Hayes and Wilson2008) compared the learnt grammars of Shona and Wargamay with the phonological generalisations in the previous literature. Gouskova & Gallagher (Reference Gouskova and Gallagher2020) used a method to generate grammaticality labels for nonce words based on phonological generalisations that are experimentally verified (§7). The major statistical tests for model assessment and comparison are described below.
4.1. Correlation tests
The correlation between predicted judgements and gradient acceptability judgements, often based on Likert scales, can be assessed using various correlation tests: Pearson’s r (Pearson Reference Pearson1895), Spearman’s ρ (Spearman Reference Spearman1904), Goodman and Kruskal’s γ (Goodman & Kruskal Reference Goodman and Kruskal1954) and Kendall’s τ (Kendall Reference Kendall1938). These values range from −1 (highly negative) to 1 (highly positive).
Pearson’s r requires the assumption of linearity, positing that intervals between ratings are of equal size (e.g., the distance between 1 and 2 is the same as between 4 and 5). However, this assumption may not hold for Likert ratings (Gorman Reference Gorman2013; Dillon & Wagers Reference Dillon and Wagers2021), even if they are averaged over participants. Moreover, the Pearson correlation test also requires both variables to be continuous and their relationship to be normally distributed. The categorical grammaticality predicted in the current proposal does not satisfy this requirement. Therefore, Pearson’s r is not reported in this study.Footnote 15
Non-parametric tests measuring rank correlations are more appropriate, as they make weaker assumptions about the distribution of acceptability judgements (Gorman Reference Gorman2013: 27). Spearman’s ρ assumes monotonicity, meaning that the lower values in acceptability consistently correspond to lower levels of predicted grammaticality score. Spearman’s ρ requires stronger monotonic relationships to produce higher correlation coefficients, making the score more sensitive to inconsistent performance of subjects, compared to other non-parametric tests. For example, if subjects assign ratings on a scale of 1 to 6 inconsistently to intermediate judgements, such that a score of 4 could represent grammaticality less than or equal to a score of 2, this will disrupt monotonicity and thus greatly lower Spearman’s ρ.
In the Goodman and Kruskal’s γ and Kendall’s τ tests, pairs of observations $(X_i, Y_i)$ and $(X_j, Y_j)$ from predicted judgements (X) and gradient acceptability judgements (Y) are classified as concordant, discordant or tied. A pair is considered concordant if the order of elements in X matches that of Y ($X_i < X_j$ implies $Y_i < Y_j$), and discordant if the orders are reversed. If $X_i=X_j$ or $Y_i=Y_j$, the pair is considered a tie.
Goodman and Kruskal’s γ calculates the difference between the number of concordant and discordant pairs, normalised by the total number of non-tied pairs: γ = (concordant − discordant) / (concordant + discordant). Tied pairs are ignored in this computation. Kendall’s τ penalises tied pairs by modifying the denominator in γ based on the number of tied pairs. Goodman and Kruskal’s γ acts as a benchmark when Kendall’s τ incurs a severe penalty under a categorical grammar, which often produces a large number of tied pairs.
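The pair-counting logic can be sketched as follows. Note that the τ computed here is the simpler τ-a variant (all pairs in the denominator) rather than the tie-corrected τ-b that statistical packages typically report; the point of the example is why categorical predictions, with many ties in X, can yield a perfect γ but a reduced τ:

```python
from itertools import combinations

def gamma_and_tau_a(x, y):
    """Count concordant/discordant pairs; gamma drops tied pairs from the
    denominator, while tau-a keeps every pair in it."""
    conc = disc = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        conc += sign > 0
        disc += sign < 0
    pairs = len(x) * (len(x) - 1) / 2
    return (conc - disc) / (conc + disc), (conc - disc) / pairs

# Categorical predictions (many ties in x) against gradient ratings y:
g, t = gamma_and_tau_a([0, 0, 1, 1], [1.2, 2.0, 3.1, 4.5])
assert g == 1.0                  # ties are ignored: perfect gamma
assert abs(t - 2 / 3) < 1e-12    # the two tied pairs drag tau down
```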
4.2. Classification accuracy
When categorical grammaticality labels are provided in the test data, this article utilises binary accuracy and the F-score as performance measures for predicted grammaticality in the classification task. The binary accuracy represents the proportion of correct predictions over all labels. This value is then separately calculated for ‘ungrammatical’ and ‘grammatical’ labels. The F-score is an accuracy metric that takes into account both precision and recall. Precision is the ratio of true positives to the sum of true positives and false positives. Recall is the ratio of true positives to the sum of true positives and false negatives. The F-score is the harmonic mean of precision and recall, $2 \times (\text{precision} \times \text{recall}) / (\text{precision} + \text{recall})$, ranging from 0 to 1. A model devoid of false positives obtains a precision score of 1, while one without false negatives achieves a recall of 1. A model without either type of error yields an F-score of 1.
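A minimal sketch of these metrics, treating ‘grammatical’ (1) as the positive class (the helper name is mine):

```python
def f_score(predicted, gold):
    """F = 2PR / (P + R), the harmonic mean of precision and recall."""
    tp = sum(p == 1 and g == 1 for p, g in zip(predicted, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(predicted, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(predicted, gold))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

assert f_score([1, 1, 1, 1], [1, 1, 1, 1]) == 1.0      # no FP, no FN
assert abs(f_score([1, 1, 0, 1], [1, 0, 0, 1]) - 0.8) < 1e-9  # P = 2/3, R = 1
```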
To evaluate the HW learner in binary classification, a thresholding method was used to transform the harmony scores of the learnt MaxEnt grammar into categorical grammaticality judgements (Hayes & Wilson Reference Hayes and Wilson2008: 385). Specifically, sequences with harmony scores equal to or below a certain threshold were classified as grammatical, whereas those with harmony scores exceeding the threshold were classified as ungrammatical. The optimal threshold was chosen, from the minimum to the maximum of all harmony scores, to maximise the binary accuracy of the learnt MaxEnt grammar. In other words, the current proposal is compared to the maximal performance that a MaxEnt grammar can achieve in binary accuracy. The current study will evaluate this thresholding method empirically, while Alves (Reference Alves2023) has mathematically and theoretically shown the consequences of probabilistic grammars with thresholds.
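The thresholding search can be sketched as follows (an illustrative reconstruction with hypothetical harmony scores, not the exact evaluation script):

```python
def best_binary_accuracy(harmony, gold):
    """Scan candidate thresholds over the observed harmony scores; sequences
    with harmony at or below the threshold are classified as grammatical (1).
    Returns the maximal binary accuracy achievable by any threshold."""
    best = 0.0
    for t in sorted(set(harmony)):
        pred = [1 if h <= t else 0 for h in harmony]
        best = max(best, sum(p == g for p, g in zip(pred, gold)) / len(gold))
    return best

# Hypothetical scores (0 = perfectly harmonic) and gold labels:
assert best_binary_accuracy([0.0, 0.5, 3.2, 7.1], [1, 1, 0, 0]) == 1.0
```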
The following three sections apply the methodologies described above to the case studies of English and Polish onsets and Turkish vowel phonotactics.
5. Case study: English onsets
Gorman (Reference Gorman2013: 36) has shown that the HW learner does not reliably outperform the baseline learning model based on categorical grammar. This observation was based on the test data set from studies conducted by Albright & Hayes (Reference Albright and Hayes2003) and Scholes (Reference Scholes1966). This section extends this investigation by modelling the learning process from an exceptionful input data set and evaluating the learning results against a novel test data set drawn from Daland et al. (Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011).
5.1. English input data
To facilitate comparison with previous work, this case study uses the ‘modestly’ exceptionful data in Hayes & Wilson (Reference Hayes and Wilson2008, appendix B) as the input, assuming that this data set has a distribution of type frequencies similar to children’s learning experience. The data set consists of 31,985 onsets taken from distinct word types drawn from the CMU Pronouncing Dictionary. Each of these words has been encountered at least once in the CELEX English database (Baayen et al. Reference Baayen, Piepenbrock and Gulikers1995; Daland et al. Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011; Hayes Reference Hayes2012). This provides a representative sample that approximates the type frequencies of English onsets in the language experience of English speakers.
There are 90 unique onsets in the input data. Table 7 illustrates how the majority of the input data (31,641 to be precise) are classified as non-exotic (Table 7a), while the onsets of 344 words are considered exotic (Table 7b) per Hayes & Wilson (Reference Hayes and Wilson2008). The HW learner yields worse performance when exposed to input data with ‘exotic’ items compared to samples containing only non-exotic items. The current study claims that some, if not all, of these exotic items are lexical exceptions, especially those sequences borrowed from other languages, such as [zl] zloty from Polish. Following Hayes & Wilson (Reference Hayes and Wilson2008: 395), [Cj] onsets are removed from the corpus due to considerable phonological evidence indicating that the [j] portion of [Cj] onsets is better parsed as part of the nucleus and rhyme; for example, spew is analysed as [[sp]onset [ju]rhyme].Footnote 16 This filtering of [Cj] onsets leads to the input data being characterised as ‘modestly exceptionful’ because there are only a few remaining exotic onsets.
Table 7 Type frequency of English onsets in the input data

Several phonotactic patterns are worth noting while interpreting the learnt grammar, especially whether the attested ‘exotic’ onsets such as [sf, zl, zw] are deemed ungrammatical. Moreover, previous studies have emphasised the impact of the Sonority Sequencing Principle (SSP) on English phonotactic judgements. According to the SSP, onsets featuring large sonority rises, such as stop+liquid combinations (e.g., [pl, bl, dr]), are generally favoured as being well-formed (Daland et al. Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011).Footnote 17 The current study only uses the SSP to better interpret the learnt grammar. Capturing the effects of the SSP on unattested clusters, also known as sonority projection (Daland et al. Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011; Jarosz & Rysling Reference Jarosz and Rysling2017), would require feature-based representations, which are beyond the scope of this article.
5.2. Learning procedure and learnt grammar
For the given input data and the tier (all segments of the input data), the Exception-Filtering learner first initialises a hypothesis space for the 23 consonants that appear in the input data based on the TSL$_2$ language, excluding phonemes that never occur in word-initial position, such as [x] (as in loch) and [ŋ] (ring). As a result, the hypothesis space is populated with a total of $23 \times 23 = 529$ potential constraints for the English input data. For all case studies, two-factors involving the initial word boundary (#) and each consonant (e.g., *#z) are considered in the hypothesis space, but are ignored in the article, because they are always deemed grammatical in learnt grammars.
The Exception-Filtering learner learns consistent categorical grammars in every simulation, owing to the discrete nature of constraint selection. Arranged according to the sonority hierarchy, Table 8 illustrates the learnt grammar when the maximum threshold $\theta_{\max}$ is set at 0.1, which delivers the best performance during the evaluation. The rows of the table, labelled at the left, represent the first symbol in each two-factor, and the columns, labelled at the top, represent the second symbol. The learner marks grammatical two-factors, such as [pl], as $1$, and ungrammatical ones, such as [pt], as $0$. The grammatical two-factors, such as [bl], in the learnt grammar are all attested, while attested but ungrammatical two-factors, such as [pw], indicate detected lexical exceptions. The value $\theta_{\max} = 0.1$ demarcates ungrammatical two-factors (e.g., [dw]: $O/E = 17/174 \approx 0.098$) from grammatical ones (e.g., [ʃr]: $O/E = 40/265 \approx 0.151$).
Table 8 A grammar learnt from the English sample. The first symbols of two-factor sequences correspond to rows (labelled at left), and the second symbols to columns (labelled at the top). Shaded cells indicate the attested two-factors in the input data, with darker grey for grammatical two-factors and lighter grey for ungrammatical ones

Interpreting the learnt grammar yields several interesting insights. Only clusters with large sonority rises are permitted by the learnt grammar, such as stop+liquid and fricative+liquid, which is consistent with SSP and previous studies (Jarosz Reference Jarosz2017: 270), except for [s]-initial two-factors [sp, st, sk]. Moreover, most detected lexical exceptions involve a consonant followed by an approximant, as seen in [zl] zloty, [sr] Sri Lanka and [pw] Pueblo. These exceptional two-factors all exhibit substantial sonority rises, indicating a conflict between SSP and the learnt grammar.
Furthermore, many learnt segment-based constraints match the MaxEnt grammar learnt in Hayes & Wilson (Reference Hayes and Wilson2008: 397). For instance, the learnt grammar bans sonorants before other onset consonants (*[+sonorant][]; e.g., *rt) and fricative clusters with a preceding consonant (*[][+continuant]; e.g., *sf). Also identified are exceptional two-factors such as *gw, *dw, *θw, also noted by Hayes & Wilson, in which these two-factors are treated as violable constraints.
5.3. Model evaluation in English
This section evaluates whether the learnt grammar approximates the acceptability judgements from the experimental data in Daland et al. (Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011). The test data set includes 96 nonce words with a CC-VCVC structure, for example, $\textit{pr-} + \textit{-eebid} = \textit{preebid}$. The 48 word-initial CC onsets of these words were randomly concatenated with 6 VCVC tails. There are 18 onsets that never occur as English onsets (unattesteds), for example, [tl], [rg]; 18 clusters that frequently occur as English onsets (attesteds); and 12 clusters that are found only rarely or in loanwords (marginals), for example, [gw] in Gwendolyn and [ʃl] in schlep (Daland et al. Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011: 203).
Then each nonce word was rated on a Likert scale, ranging from 1 (unlikely) to 6 (likely), by highly proficient English speakers who were recruited through the Mechanical Turk platform (Daland et al. Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011). Individual scores were not disclosed by the authors, and the test data set only has averaged Likert ratings over all participants.
Table 9 shows the onsets presented to the subjects and the corresponding type frequency in the input data, the average Likert ratings and the predicted grammaticality (g) of the learnt grammar. Detected exceptions (non-zero frequency but deemed ungrammatical) are highlighted. Notably, the ungrammatical two-factors identified by the Exception-Filtering learner receive low to modest ratings (between 1.325 and 3.124), compared to grammatical two-factors (between 3 and 4.525).
Table 9 Type frequency, averaged Likert ratings and predicted grammaticality by the learnt grammar of English nonce word onsets, sorted by averaged Likert ratings. Detected exceptions (non-zero frequency and g = 0) are shaded

Table 10 provides a performance comparison among the Exception-Filtering ($\theta_{\max} = 0.1$), Baseline and HW learners (Max $O/E=0.3$, Max gram $n=3$, the same as Hayes & Wilson Reference Hayes and Wilson2008). Correlation scores are compared across the entire test data set as a whole. It should be noted that the test data set from Daland et al. (Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011) excludes several exceptional onsets penalised by the Exception-Filtering learner, such as *[sf].
Table 10 Results of the best performances by the Exception-Filtering ($\theta_{\max} = 0.1$), Baseline and HW learners (Max $O/E = 0.3$, $n = 3$). Correlation tests are reported with respect to averaged Likert ratings in English; best scores are in bold

The reported correlation scores of all models are significantly different from zero at a two-tailed alpha of 0.01. Both the Exception-Filtering and Baseline learners delivered comparable performances,Footnote 18 while the HW learner demonstrated slightly superior results, especially in terms of Spearman’s ρ and Kendall’s τ. Interestingly, the close-to-one Goodman and Kruskal’s γ observed in both Exception-Filtering and Baseline learners indicates a higher number of tied pairs in nonparametric tests, leading to a marginally reduced Kendall’s τ.
Although the Exception-Filtering learner shows a comparable performance on par with other well-established models, it did not stand out in approximating the acceptability judgements of Daland et al. (Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011). However, the relatively modest performance of the Exception-Filtering learner in the modestly exceptionful input data sets the stage for improved learning results in the forthcoming sections dealing with highly exceptionful data sets.
In summary, the proposed learner successfully learns a categorical phonotactic grammar from naturalistic input data of English onsets. The learnt grammar reveals several interesting observations about English phonotactics, approximates the gradient acceptability judgements from the behavioural data in Daland et al. (Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011), and delivers a robust performance comparable to the benchmark HW model on modestly exceptionful input data.
6. Case study: Polish onsets
In this section, the Exception-Filtering learner is applied to the input data and gradient behavioural data concerning Polish onsets (Jarosz Reference Jarosz2017; Jarosz & Rysling Reference Jarosz and Rysling2017).
6.1. Polish input data
To model the language acquisition experiences of children, the model was trained on input data consisting of 39,174 word-initial onsets, sourced from a phonetically transcribed Polish lexicon (Jarosz Reference Jarosz2017; Jarosz et al. Reference Jarosz, Calamaro and Zentz2017) derived from a corpus of spontaneous child-directed speech (Haman et al. Reference Haman, Etenkowski, Łuniewska, Szwabe, Dabrowska, Szreder and Łaziński2011). There are 384 unique onsets in the input data.
Table 11 shows the consonants that appear in the input data. The current study uses a uniform system for converting orthography to IPA, remaining neutral on the ongoing debate surrounding the specific phonetic properties of certain segments, particularly the retroflex consonants cz [t͡ʂ], drz/dż [d͡ʐ], sz [ʂ] and rz/ż [ʐ] (Jarosz & Rysling Reference Jarosz and Rysling2017; Kostyszyn & Heinz Reference Kostyszyn, Heinz, Jurgec, Duncan, Elfner, Kang, Kochetov, O’Neill, Ozburn, Rice, Sanders, Schertz, Shaftoe and Sullivan2022). Polish is known for allowing complex onsets (up to four consonants, such as [vzdw]) that defy the SSP (Jarosz Reference Jarosz2017; Kostyszyn & Heinz Reference Kostyszyn, Heinz, Jurgec, Duncan, Elfner, Kang, Kochetov, O’Neill, Ozburn, Rice, Sanders, Schertz, Shaftoe and Sullivan2022).Footnote 19 For example, a large number of glide+stop, liquid+fricative and nasal+stop sequences are attested, such as [wb, rʐ, mk]. Moreover, many attested onsets are judged as acceptable as, or even less acceptable than, unattested onsets, as shown in the test data set below, which presents a unique challenge for the Exception-Filtering learner.
Table 11 Polish consonant inventory (derived from the input data)

6.2. Learning procedure and learnt grammar in Polish
Similar to the English case study, for the given input data and tier (all segments from the input data), the Exception-Filtering learner initialises possible constraints for the 30 consonants that appear in the input data. As a result, the hypothesis space includes a total of 30 $\times $ 30 = 900 two-factors for the Polish input data. As above, two-factors involving the initial word boundary (#) are ignored because they are all considered grammatical by the learnt grammar.
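The size of this hypothesis space follows directly from the segment inventory: the candidate constraints are simply the Cartesian product of the inventory with itself. A minimal sketch (hypothetical helper code, not the learner’s actual implementation):

```python
from itertools import product

def two_factor_hypotheses(corpus):
    """Enumerate all candidate *xy two-factor constraints over the
    segments attested in the input; `corpus` is an iterable of
    words, each a tuple of segments."""
    segments = sorted({seg for word in corpus for seg in word})
    return [(a, b) for a, b in product(segments, repeat=2)]

# Toy input with 3 distinct segments -> 3 x 3 = 9 candidate two-factors;
# an inventory of 30 consonants yields 30 x 30 = 900, as in the text.
toy_corpus = [("s", "t", "a"), ("t", "a", "s")]
hyps = two_factor_hypotheses(toy_corpus)
```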
After the learning process, Table 12, arranged according to the sonority hierarchy, illustrates the learnt grammar when $\theta _{\max }$ is set at 0.1, which delivers the best performance during the evaluation. The learnt grammar provides intriguing information on attested SSP-defying onsets (Jarosz Reference Jarosz2017). Most grammatical two-factors that violate the SSP are obstruent pairs such as fricative+stop and fricative+fricative. Rubach & Booij (Reference Rubach and Booij1990) proposed that stops, affricates and fricatives have indistinguishable sonority and should be considered as a single category, obstruents, in the context of the SSP. If one follows this proposition and disregards obstruent-initial onsets, most of the remaining SSP-defying two-factors, such as liquid+fricative [rʐ] and glide+stop [wd], have relatively low type frequencies and are deemed ungrammatical by the learnt grammar. Only 4 of 900 two-factors are grammatical while defying the SSP (having equal sonority or a low rise), namely [lv, rv, mn, mɲ]. In essence, while a comprehensive evaluation of the SSP’s role in phonotactic learning is beyond the scope of this study, it is noteworthy that the learnt grammar here shows a viable approach to interpreting SSP-defying onsets in the context of lexical exceptions.
Table 12 Learnt grammar from Polish input data. The first symbols of two-factor sequences correspond to rows (labelled at left), and the second symbols to columns (labelled at the top). Shaded cells indicate the attested two-factors in the input data, with darker grey for grammatical two-factors and lighter grey for ungrammatical ones

6.3. Model evaluation in Polish data
This section evaluates the degree to which the learnt grammar reflects acceptability judgements gathered from experimental data in Polish. The test data set consists of 159 nonce words, which are constructed from a combination of 53 word-initial onsets (heads) and 3 trisyllabic VCVC(C)V(C) tails. The test data set also includes 240 attested fillers, varying in word length (1 to 4 syllables) and onset length (0 to 3 consonants). This setting allows for the evaluation of the learner’s performance on both attested and unattested sound sequences. Likert ratings were collected from 81 native Polish-speaking adults through an online experiment conducted on Ibex Farm (Jarosz & Rysling Reference Jarosz and Rysling2017).
Table 13 shows the onsets presented to the subjects and the corresponding type frequency in the input data, Likert ratings (averaged by onset) and the predicted grammaticality (g) of the learnt grammar. Exceptions detected by the learnt grammar (non-zero frequency and $g = 0$) are highlighted.Footnote 20 For instance, [ʐj] is deemed ungrammatical, which is reflected in its average score of 2.259 on a 1 to 7 Likert scale.
Table 13 Type frequency, averaged Likert ratings and predicted grammaticality by the learnt grammar of Polish onsets, sorted by Likert rating. Detected exceptional onsets are highlighted

Table 14 shows the correlation with respect to the average Likert ratings in Table 13.Footnote 21 Correlations in all models significantly differ from zero at a two-tailed alpha of 0.01. In all correlation tests, the Exception-Filtering learner modestly outperforms the Baseline learner. It performs comparably to the benchmark HW learner (Max $O/E = 0.7$, $n=2$), with a modestly lower Spearman’s ρ and a modestly higher Kendall’s τ.
Table 14 Results of the best performances by the Exception-Filtering ($\theta _{\max } = 0.1$), Baseline and HW learners (Max $O/E = 0.7$, $n = 2$). Correlation tests approximate averaged Likert ratings in Polish, categorised by attestedness; best scores are in bold

The Exception-Filtering learner identified more exceptional two-factors in the Polish input data. Moreover, its performance relative to the benchmark models improved compared to the English case study and surpassed the Baseline learner, which lacks the exception-filtering mechanism. These findings highlight the value of the exception-filtering mechanism in phonotactic learning, particularly when dealing with exceptionful real-world corpora.
To summarise, the Exception-Filtering learner, trained on a Polish child-directed corpus, has illustrated its potential for extracting categorical grammars that approximate acceptability judgements. The model performs on par with the HW learner in Spearman’s ρ, and modestly outperforms both the benchmark HW learner and the Baseline learner in Goodman and Kruskal’s γ and in Kendall’s τ. These results further substantiate the potential of the Exception-Filtering learner in inducing phonotactic patterns from realistic corpora.
7. Case study: Turkish vowel phonotactics
This section tests the Exception-Filtering learner’s capability in capturing non-local vowel phonotactics from highly exceptionful input data drawn from a comprehensive corpus in Turkish.
7.1. Turkish vowel phonotactics
This section applies the current proposal to vowel phonotactic patterns in Turkish. Turkish vowels are shown in Table 15. Turkish orthography is converted to IPA, including ö [ø], ü [y] and ı [ɯ].Footnote 22
Table 15 Turkish vowel system

Turkish vowel phonotactic patterns are summarised as follows, adapted from Kabak (Reference Kabak, van Oostendorp, Ewen, Hume and Rice2011):
1. Backness harmony: All vowels must agree in frontness or backness.
2. Roundedness harmony: High vowels must also agree in roundness with the immediately preceding vowel; hence, no high rounded vowel can occur after an unrounded vowel within a word.
3. No non-initial mid round vowels: No mid rounded vowels (i.e., [o] and [ø]) may be present in a non-initial syllable of a word, which means that they cannot follow other vowels.
First, a vowel cannot follow another vowel with a different [back] value (backness harmony). This is clearly demonstrated in morphophonological alternations, as shown in (7a) and (7b), adapted from Gorman (Reference Gorman2013: 46). For instance, when the plural suffix /lAr/ follows the root /pul/ ‘stamp’, it surfaces as [lɑr] rather than [ler]. This can be attributed to the non-local phonotactic constraint against the co-occurrence of u…e. In contrast, when /køj/ ‘village’ is combined with /lAr/, the resulting form is [køj-ler], demonstrating the non-local *ø…ɑ co-occurrence restriction. However, exceptions to this generalisation exist both within roots and across root–affix boundaries, as illustrated in (7c) and (7d). For example, both the root [silɑh] ‘weapon’ and its plural form [silɑh-lɑr] violate the vowel co-occurrence restriction *i…ɑ.
In the second phonotactic constraint related to harmony, a high vowel cannot follow another vowel with a different value for [round] (roundness harmony). (8) provides examples of this pattern. Yet again, exceptions are found, as in [boɰɑz-ɯn].Footnote 23
Last but not least, the mid round vowels [ø] and [o] are typically restricted to initial position in native Turkish words, as in [ødev] ‘homework’ and [ojun] ‘game’. Consequently, these vowels should not follow any other vowels, for example, *ɑ…ø and *e…o. However, in loanwords, mid round vowels may occur freely in any position.
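The three generalisations can be stated as a small checker over a word’s vowel tier. The sketch below is illustrative only: the feature classes are assumed from Table 15 and the prose above, and the function is not part of the proposed learner.

```python
# Assumed feature classes for the eight Turkish vowels (from Table 15).
BACK = set("ɑɯou")      # [+back]
ROUND = set("yøuo")     # [+round]
HIGH = set("iyɯu")      # [+high]
MID_ROUND = set("øo")   # mid round vowels

def violations(vowel_tier):
    """Return the set of documented patterns a vowel string violates.

    `vowel_tier` is the word's vowel sequence, e.g. "ɑi" for [silɑh].
    """
    found = set()
    for prev, cur in zip(vowel_tier, vowel_tier[1:]):
        if (prev in BACK) != (cur in BACK):
            found.add("backness harmony")
        if cur in HIGH and (prev in ROUND) != (cur in ROUND):
            found.add("roundedness harmony")
    if any(v in MID_ROUND for v in vowel_tier[1:]):
        found.add("no non-initial mid round vowels")
    return found
```

Under these assumptions, the tier ɑ…ɯ passes, u…e violates backness harmony, ɑ…u violates roundedness harmony, and e…o violates both backness harmony and the mid-round restriction.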
Generally, a substantial number of exceptions to these phonotactic patterns arise from compounds and loanwords (Lewis Reference Lewis2001; Göksel & Kerslake Reference Göksel and Kerslake2004; Kabak Reference Kabak, van Oostendorp, Ewen, Hume and Rice2011). For example, the loanword [piskopos] borrowed from Greek epískopos ‘bishop’ violates both the roundness harmony and the constraint on non-initial mid round vowels.
Despite many exceptions, these generalisations are not only well-documented in the literature, including Underhill (Reference Underhill1976: 25), Lewis (Reference Lewis2001: 16), Göksel & Kerslake (Reference Göksel and Kerslake2004: 11) and Kabak (Reference Kabak, van Oostendorp, Ewen, Hume and Rice2011: 4), but also supported by experimental studies (Zimmer Reference Zimmer1969; Arik Reference Arik2015). Furthermore, recent acquisition studies reveal that some harmony patterns are discernible by infants as young as six months old, who extract and pay attention to the harmonic patterns present in their language environment, filtering out any disharmonic tokens (Hohenberger et al. Reference Hohenberger, Altan, Kaya, Tuncer, Avcu, Ketrez and Haznedar2016).
Another layer of complexity in Turkish vowel phonotactics comes from root harmony. Turkish vowel phonotactic constraints are applicable within roots and across morpheme boundaries (Zimmer Reference Zimmer1969; Arik Reference Arik2015), while it is still a matter of debate whether harmony patterns in the domain of roots should be analysed as active phonological processes given the existence of exceptions in disharmonic roots (Kabak Reference Kabak, van Oostendorp, Ewen, Hume and Rice2011: 17), some of which may originate from loanwords. However, from the perspective of phonological learning, these roots constitute a significant part of the input data exposed to human learners, as most Turkish roots can stand alone.
Therefore, Turkish vowel phonotactic patterns pose a unique challenge for phonological learning: how does the learner acquire vowel phonotactic generalisations from both roots and derived forms, despite the high level of lexical exceptions in the input data?
7.2. Turkish input data and learning procedure
The current study uses the Turkish Electronic Living Lexicon (TELL; https://linguistics.berkeley.edu/TELL/; Inkelas et al. Reference Inkelas, Aylin, Orhan Orgun and Sprouse2000) as input data, which consists of approximately 66,000 roots and the elicited derived forms (root+affixes) produced by two adult native Turkish speakers.Footnote 24 Table 16 shows the type frequency of all non-local two-factors on the vowel tier in TELL. Two-factors that follow the Turkish vowel phonotactics introduced above are highlighted. This corpus is a great testing ground for evaluating the role of the exception-filtering mechanism. Notably, every non-local two-factor has a non-zero frequency in this data set. Therefore, any phonotactic learner that assumes every attested two-factor to be grammatical would invariably conclude that all combinations are allowed and completely miss the vowel harmony patterns.
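To make the $O/E$ quantities concrete, the sketch below projects words onto the vowel tier and computes a simple observed-over-expected ratio for each non-local two-factor. The estimate of E from marginal positional counts is an assumed normalisation chosen for illustration, not necessarily the study’s exact definition.

```python
from collections import Counter

VOWELS = set("ieøyɑɯou")  # the eight Turkish vowels

def vowel_tier(word):
    """Project a word onto its vowel tier, skipping consonants."""
    return [seg for seg in word if seg in VOWELS]

def o_over_e(corpus):
    """Toy O/E for vowel two-factors (illustrative only).

    O(xy) counts adjacent-on-the-tier pairs x...y; E(xy) is estimated
    here from marginal first- and second-position counts.
    """
    obs, first, second = Counter(), Counter(), Counter()
    n = 0
    for word in corpus:
        tier = vowel_tier(word)
        for x, y in zip(tier, tier[1:]):
            obs[(x, y)] += 1
            first[x] += 1
            second[y] += 1
            n += 1
    return {xy: obs[xy] / (first[xy[0]] * second[xy[1]] / n) for xy in obs}

oe = o_over_e(["tɑkɑz", "tikiz", "tɑkiz"])  # toy three-word corpus
```

In this toy corpus the disharmonic ɑ…i is underrepresented relative to expectation (O/E = 0.75), while the harmonic pairs exceed it (O/E = 1.5); this asymmetry, holding even when every two-factor has non-zero frequency, is what the threshold exploits.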
Table 16 The type frequency of two-factors in the input data; cells of documented grammatical two-factors are highlighted

Similar to previous case studies, for the given input data and tier (all vowels from the input data), the Exception-Filtering learner initialises possible constraints for the eight Turkish vowels, which yields 64 two-factors in the hypothesis space. The optimal maximum $O/E$ threshold is $0.5$. The learnt grammar is illustrated in the first test data set below.
7.3. Model evaluation
This section evaluates the learning models in two separate test data sets below.
7.3.1. The first test data set (categorical labels)
The first test data set consists of 64 nonce words in the template [tV$_1$kV$_2$z], such as [tokuz], representing all possible two-factors on the vowel tier. Each word is categorically labelled 1 (‘grammatical’; 16 in total) or 0 (‘ungrammatical’; 48 in total) based on the aforementioned well-documented phonotactic generalisations.Footnote 25 Only roots are included in this analysis, as the learning model disregards morpheme boundaries.
It is important to note that individual variability is expected and that the grammaticality labels here may not match the exact target grammar of every speaker. However, these categorical labels are supported by Zimmer’s (Reference Zimmer1969) behavioural experiment. In a binary wordlikeness task, Zimmer (Reference Zimmer1969) asked native adult Turkish speakers to select which of a pair of nonce words (e.g., temez and temaz) was ‘more like Turkish’. Experiment 1 had 23 participants and Experiment 2 had 32 (see Supplementary Material for details); the majority of participants preferred the harmonic to the disharmonic roots in a yes/no rating task, which provides evidence for the psychological reality of the Turkish vowel phonotactic patterns encoded in the first test data set. In other words, the first test data set aims to evaluate how well the learnt grammar mirrors the categorical phonotactic judgements of the majority of participants in Zimmer’s (Reference Zimmer1969) experiment. This follows common practice in previous computational studies when acceptability judgements of nonce words in the test data set are not accessible. For example, Gouskova & Gallagher (Reference Gouskova and Gallagher2020) manually labelled the categorical grammaticality of nonce words in the test data set based on documented phonotactic generalisations supported by behavioural experiments (Gallagher Reference Gallagher2014, Reference Gallagher2015, Reference Gallagher2016).
Table 17 summarises the tests of classification accuracy on the first test data set with categorical labels. The Baseline learner categorised all nonce words as grammatical, which earned it perfect recall but at the expense of the lowest precision (0.238), F-score (0.385) and binary accuracy (0.238), due to false positives.
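For reference, these metrics can be reproduced from 0/1 labels as follows. This is a generic sketch: on a toy 16-grammatical/48-ungrammatical split, an all-grammatical baseline gets precision and accuracy of 0.25 and an F-score of 0.4; the published figures (0.238, 0.385) come from the study’s own test set and differ slightly from this toy split.

```python
def classification_metrics(gold, pred):
    """Precision, recall, F-score and binary accuracy for 0/1 labels."""
    pairs = list(zip(gold, pred))
    tp = sum(g == 1 and p == 1 for g, p in pairs)
    fp = sum(g == 0 and p == 1 for g, p in pairs)
    fn = sum(g == 1 and p == 0 for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    return precision, recall, f_score, accuracy

# An all-grammatical baseline on a toy 16-grammatical/48-ungrammatical split:
gold = [1] * 16 + [0] * 48
baseline_pred = [1] * 64
p, r, f, acc = classification_metrics(gold, baseline_pred)
```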
Table 17 Performance comparison of the Exception-Filtering ($\theta _{\max } = 0.5$), Baseline and HW learners ($\text {Max } O/E = 0.7$, $n = 3$) in the first test data set (categorical labels). Best scores are in bold

As discussed in §4, the harmony scores of the benchmark HW learner are transformed into categorical labels to produce its highest binary accuracy. However, even at its best performance (Max $O/E = 0.7$, $n=3$, vowel tier: [high], [round], [back], [word boundary]), the HW learner displayed higher error rates in the classification of Turkish phonotactics than the Exception-Filtering learner.
When tested against these categorical labels, the Exception-Filtering learner ($\theta _{\max } = 0.5$) demonstrated outstanding performance in binary classification, with an F-score of 0.933 and a total binary accuracy of 0.969. Table 18 compares the grammars acquired by the Exception-Filtering learner (a) and the benchmark HW learner (b). A score of 0 indicates that a two-factor has been classified as ungrammatical, whereas a score of 1 designates it as grammatical. In (b), the degree of shading is proportional to the negative harmony score, which is rescaled according to the minimum and maximum harmony scores.
Table 18 Comparing the learnt grammars of (a) the Exception-Filtering learner and (b) the HW learner

Compared to the documented phonotactic generalisations in Turkish, the grammar learnt by the Exception-Filtering learner predicts two false negatives, which are reflected in the relatively lower recall (0.875) in classification accuracy. These two mismatches have an unexpectedly low type frequency (ø…e: 982; ø…y: 1,179) compared to other grammatical two-factors. The errors of the learnt MaxEnt grammar, by contrast, are mostly false positives driven by high type frequency, such as e…ɑ (2,873), ɑ…i (4,369), ɑ…u (1,526) and ɑ…e (3,197). The Exception-Filtering learner avoids these false positives by categorically penalising these exceptional two-factors.Footnote 26
7.3.2. The second test data set (approximated acceptability judgements)
The purpose of the second test data set is to demonstrate that the learnt categorical grammar can approximate the acceptability judgements in the behavioural data. The second test data set includes 36 nonce words from Zimmer (Reference Zimmer1969), and takes the proportion of ‘yes’ responses averaged across participants to approximate the acceptability judgements of speakers. The data show a gradient transition from harmonic words (e.g., [temez], with a score of $19/23\approx 0.826$) to disharmonic ones (e.g., [temɑz], at $3/23 \approx 0.130$). This method is similar to Hayes & Wilson’s (Reference Hayes and Wilson2008) approach to creating gradient acceptability judgements from Scholes’s (Reference Scholes1966) experiment, following previous studies (Pierrehumbert Reference Pierrehumbert and Keating1994; Coleman & Pierrehumbert Reference Coleman, Pierrehumbert and Coleman1997). In Zimmer’s (Reference Zimmer1969) study, some words were tested twice, leading to minor variations in response rates (e.g., [tatuz] receives either 0.375 or 0.3125), which do not significantly influence the results of the statistical tests below.
Table 19 presents the results of the statistical tests. The Baseline learner is omitted because its predictions have no variance, which makes correlation tests inapplicable. Notably, while correlations in all models differ significantly from zero at a two-tailed alpha of 0.01, the Exception-Filtering learner scored higher than the benchmark HW learner in all tests.
Table 19 Performance comparison of Exception-Filtering and HW learner in the second test data set adapted from Zimmer’s (Reference Zimmer1969) experiment. Best scores are in bold

Figure 5 visualises the distribution of predicted scores against the approximated acceptability in both the Exception-Filtering and the HW learner. Some words have two response rates, as they appeared in two separate experiments. A simple linear regression line is fitted in each plot, where the predictor (x-axis) is the predicted grammaticality score for the Exception-Filtering learner, and the exponentiated negative harmony score for the HW learner. The outcome (y-axis) is the proportion of ‘yes’ responses in Zimmer (Reference Zimmer1969), which approximates the acceptability judgements. The predicted scores of the Exception-Filtering learner cluster at 0 and 1, while exp(−harmony) is on a continuum.Footnote 27

Figure 5 Scatter plots based on the learning results of the two learners. Expected grammaticality is highlighted based on documented phonotactic generalisations. Some words have two response rates, as they appeared in two separate experiments. Overlapping words are omitted from the plots.
Both regression models reject the null hypothesis that the predicted judgements have no effect on the proportion of ‘yes’ responses (Exception-Filtering: $\text {residual deviance} = 2.264$, $p < 0.001$; HW: $\text {residual deviance} = 4.073$, $p = 0.013$), at an alpha level of 0.05. Furthermore, Figure 5 shows that the Exception-Filtering learner is capable of categorically penalising lexical exceptions, such as ɑ…i in [tɑtiz], which can mislead the HW learner into assigning relatively higher probabilities than harmonic sequences such as e…e in [pemez].
To summarise, the Exception-Filtering learner trained on a Turkish corpus acquired the documented vowel phonotactics of Turkish with only two mismatches. The learner not only succeeded in classifying grammatical and ungrammatical words, but also achieved a high correlation between its predicted judgements and the acceptability judgements approximated from a previous behavioural experiment. This result indicates the capability of the Exception-Filtering model to capture phonotactic patterns with exceptions.
8. Discussion
To summarise the case studies: in terms of interpretability and scalability, the categorical grammars learnt for English and Polish onset phonotactics largely align with the Sonority Sequencing Principle, penalising most sequences with low sonority rises, and the proposed learner also successfully generalised Turkish vowel phonotactics from highly exceptionful input data containing both roots and derived forms. In terms of model assessment and comparison, the grammaticality scores generated by the learnt grammars closely approximate the acceptability judgements observed in behavioural experiments and demonstrate competitive performance in model comparisons, highlighting the effectiveness of the exception-filtering mechanism. The following sections discuss topics that arise from the current study and outline directions for future work.
8.1. Extragrammatical factors
As elaborated in §2, this research adopts the competence–performance dichotomy (Pinker & Prince Reference Pinker and Prince1988; Zuraw Reference Zuraw2000; Zuraw et al. Reference Zuraw, Lin, Yang and Peperkamp2021). Within this framework, extragrammatical factors are conceptualised as originating from two main sources: performance-related and lexicon-related variables. Performance-related variables include individual differences, auditory illusions (Kahng & Durvasula Reference Kahng and Durvasula2023) and task effects in general (Armstrong et al. Reference Armstrong, Gleitman and Gleitman1983; Gorman Reference Gorman2013). Lexicon-related variables include lexical information such as lexical similarity (Bailey & Hahn Reference Bailey and Hahn2001, Reference Bailey and Hahn2005; Avcu et al. Reference Avcu, Newman, Ahlfors and Gow2023), frequency (Frisch et al. Reference Frisch, Large and Pisoni2000; Ernestus & Baayen Reference Ernestus and Baayen2003), etc.
In the current study, in tandem with the learnt grammar, extragrammatical factors contribute to acceptability judgements in behavioural experiments. For example, previous studies have shown that lexical similarity and frequency are significant predictors of acceptability judgements (Frisch et al. Reference Frisch, Large and Pisoni2000; Bailey & Hahn Reference Bailey and Hahn2001, Reference Bailey and Hahn2005). Performance-related variables, such as individual differences and task effects, can also influence acceptability judgements. Therefore, a comprehensive evaluation of a learnt grammar against acceptability judgements should take these factors into account. In future research, this evaluation could be carried out by adopting a mixed-effects regression model, in which the grammaticality score is treated as a fixed effect and extragrammatical factors are treated as other effects.
8.2. Accidental gaps
Accidental gaps, the unattested but grammatical sequences emerging from the lexicon–grammar discrepancy, pose a significant challenge to phonotactic learning. Given that there is a logically infinite number of grammatical strings and only some of them are associated with lexical meaning, gaps in the input data are inevitable. These accidental gaps can lower the $O/E$ ratio because expected sequences are absent from the input data, which could potentially lead the learner to misinterpret these sequences as ungrammatical. This issue does not cause severe problems in the current proposal, because the learner can potentially avoid misgeneralising accidental gaps by adjusting the maximum threshold. However, this is not a fundamental solution, and it places an excessive burden on a simple statistical criterion.
A more principled solution to the challenge of accidental gaps is to incorporate feature-based constraints, as suggested by Wilson & Gallagher (Reference Wilson and Gallagher2018). Segmental representations may overlook subsegmental generalisations – underrepresented segmental two-factors in the input data can exhibit high frequency in feature-based generalisations. For instance, in English, b[+approximant] sequences (e.g., [br], [bl]) are highly frequent, except for [bw], which has only three unique occurrences. In contrast, all segmental two-factors of the form b[−approximant] are unattested (e.g., [bn], [bg], [bt]). A feature-based grammar can penalise [−approximant] after [b], but allow b[+approximant], hence avoiding overpenalising accidental gaps with [bw] onsets. By considering the entire natural class, the grammar can recognise subsegmental patterns that are overlooked in segmental representation. As Hayes & Wilson (Reference Hayes and Wilson2008: 401) demonstrate, a feature-based model outperforms a segment-based model in their English case study.
It is feasible to integrate feature-based representations into the current approach using the generality heuristics in Hayes & Wilson (Reference Hayes and Wilson2008) and the bottom-up strategies proposed by Rawski (Reference Rawski2021). The current study offers a straightforward demonstration of the concept here: consider a simplified feature system illustrated in (9). A feature-based Exception-Filtering learner initialises the most general feature-based potential constraints, for example, *[+F][+F], *[+F][+G], etc.
After selecting the next threshold from the accuracy schedule, and computing the $O/E$ for each possible two-factor, the learner adds a two-factor to the hypothesis grammar if (a) the two-factor is not implied by any previously learnt constraints, and (b) the $O/E$ of the two-factor is lower than the current threshold. For example, a constraint such as *[+G][+G] would imply more specific two-factors such as *[+G][+F,+G] and *[+F,+G][+G], but not *[+F][+F]. Therefore, if *[+G][+G] is already learnt, the learner will not consider the implied *[+G][+F,+G], regardless of its $O/E$ value. The learning process continues until all thresholds have been exhausted. The next step of the current study is to incorporate more learning strategies proposed in Hayes & Wilson (Reference Hayes and Wilson2008) and Rawski (Reference Rawski2021) to optimise the learner for natural language corpora.
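The implication relation in condition (a) reduces to per-position subset inclusion between feature bundles: a bundle that mentions fewer features picks out a larger natural class, so banning the larger class entails banning any of its subclasses. A minimal sketch using the hypothetical features [F] and [G] from (9):

```python
def implies(general, specific):
    """True if constraint `general` implies constraint `specific`.

    A constraint is a pair of feature bundles (frozensets); the general
    bundle must be a subset of the specific one in each position.
    """
    return all(g <= s for g, s in zip(general, specific))

bundle = frozenset

# *[+G][+G] implies *[+G][+F,+G] but not *[+F][+F], as in the text.
GG = (bundle({"+G"}), bundle({"+G"}))
G_FG = (bundle({"+G"}), bundle({"+F", "+G"}))
FF = (bundle({"+F"}), bundle({"+F"}))
```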
8.3. Hayes & Wilson’s (Reference Hayes and Wilson2008) learner
The Exception-Filtering learner drew inspiration from probabilistic approaches, especially the benchmark HW learner, which learns a MaxEnt grammar (Berger et al. Reference Berger, Della Pietra and Della Pietra1996; Goldwater & Johnson Reference Goldwater and Johnson2003) from input data. The HW learner adjusts constraint weights to maximise the likelihood of the observed data predicted by the hypothesis grammar, also known as maximum likelihood estimation (MLE), aiming to approximate the underlying target grammar by maximising the likelihood of observed input data, including lexical exceptions.
Interestingly, although the HW learner also uses the $O/E$ criterion in constraint selection, it cannot exclude lexical exceptions from the input data even when the correct constraints are selected. The principle of MLE prevents the probabilistic grammar from assigning a zero probability to observed lexical exceptions and from completely excluding these anomalies. The underpenalisation of lexical exceptions can compromise generalisations for non-exceptional candidates (Moore-Cantwell & Pater Reference Moore-Cantwell and Pater2016). For example, in the Turkish case study, the HW learner underpenalised highly frequent disharmonic patterns such as ɑ…i in [tɑtiz] (Figure 5). As a result, researchers usually remove strings considered lexical exceptions from the training data manually prior to simulations, as in the English case study of Hayes & Wilson (Reference Hayes and Wilson2008).
This issue has motivated several interesting proposals to handle exceptions within the HW learner. Hayes & Wilson (Reference Hayes and Wilson2008: 386) add a Gaussian prior to prevent overfitting by adjusting the standard deviation $\sigma $ of the Gaussian distribution for constraint weights. Although this method proves effective for certain data sets based on their specific noise distribution, it still assigns non-zero, albeit low, probabilities to lexical exceptions.
Another strategy is to include lexically specific constraints in the hypothesis space to handle lexical exceptions (Pater Reference Pater2000; Linzen et al. Reference Linzen, Kasyanenko and Gouskova2013; Moore-Cantwell & Pater Reference Moore-Cantwell and Pater2016; Hughto et al. Reference Hughto, Lamont, Prickett and Jarosz2019; O’Hara Reference O’Hara2020). For example, a lexically specific constraint *sf$_i$ would penalise the sequence [sf] except when it is in the indexed lexical exception sphere$_i$. In this way, the learnt grammar is able to allow exceptions without compromising the generalisations for non-exceptional candidates. Meanwhile, nonce words are evaluated according to the general constraints of the grammar, as they do not have lexical indices. However, lexically specific constraints considerably escalate the computational complexity of the learning model, because the hypothesis space grows exponentially with the size of the input data. Such computational complexity not only restricts our capacity to test the proposal adequately, but also raises questions about its plausibility in child language acquisition.
Both proposals above handle the exception-related overfitting problem through the incorporation of a regularisation function during maximum likelihood estimation. An open question is whether the HW learner can be improved by incorporating the exception-filtering mechanism advocated in the current proposal, so that identified anomalies can be removed from input data during the learning process.
8.4. O/E and alternative criteria
Both the Exception-Filtering learner and the HW learner employ a ‘greedy’ algorithm that selects a constraint whenever its $O/E$ falls below the current threshold in the accuracy schedule. This approach, while computationally efficient, does not guarantee the discovery of a globally optimal grammar, given that the addition of one constraint may influence the $O/E$ of others. Because the learning model cannot look ahead, the analyst must examine the learning results across various threshold levels to uncover potential implications and enhancements. In the context of learning phonotactic grammars from exceptionful data, the $O/E$ criterion has proved to be an effective measure in case studies.
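The greedy selection procedure can be sketched as follows. The candidate constraints and their counts are invented; in the actual learner, expected counts are computed over the training data and tiers, and change as constraints are added, whereas this sketch holds them fixed for brevity.

```python
# A minimal sketch of greedy constraint selection by O/E, with invented
# counts. Expected counts are kept fixed here; the real learner would
# recompute them as the grammar grows.

def o_over_e(observed, expected):
    return observed / expected if expected > 0 else float("inf")

def greedy_select(candidates, schedule):
    """candidates maps a constraint to its (observed, expected) counts;
    schedule is an increasing sequence of O/E thresholds up to theta_max.
    A constraint is selected as soon as its O/E falls below the current
    threshold, with no lookahead."""
    grammar = []
    for theta in schedule:
        for c, (o, e) in candidates.items():
            if c not in grammar and o_over_e(o, e) < theta:
                grammar.append(c)
    return grammar

cands = {"*sf": (2, 500.0), "*st": (480, 500.0), "*tl": (0, 120.0)}
print(greedy_select(cands, [0.001, 0.01, 0.1]))  # ['*tl', '*sf']
```

Here *tl ($O/E = 0$) is selected at the strictest threshold, *sf ($O/E = 0.004$) at a later one, and *st ($O/E \approx 0.96$) never, since its observed count matches its expected count closely.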
An alternative strategy, such as the use of a depth-first search algorithm, could circumvent local optima by allowing the learner to examine future constraints before committing to the current one. However, this method comes with a considerable increase in computational complexity.
To ultimately solve the problem of local optima, a future direction is to consider other criteria, such as gain (Della Pietra et al. 1997; Berent et al. 2012) or the Tolerance Principle (Yang 2016). Similar to $\theta_{\max}$ in the accuracy schedule, gain is set at a specific threshold – the higher the gain, the more statistical support is required for a constraint to be added to the hypothesis grammar (Gouskova & Gallagher 2020: 5). The gain criterion was originally designed for well-defined probabilistic distributions, and its convex property ensures that the added constraints approximate a global optimum. Generalising this criterion to the current proposal involves some non-trivial adjustments, especially deriving a probabilistic distribution from categorical grammars.
The Tolerance Principle proposes that a rule will be generalised if the number of exceptions does not exceed $\frac{N}{\ln N}$, where $N$ is the number of words in the relevant category. This threshold is set a priori for each $N$ before the learner is exposed to the training data, rather than induced as in the current proposal. Although this constitutes a promising avenue for future research, it is worth noting that the Tolerance Principle was not originally formulated with phonotactic learning in mind, and it requires non-trivial adjustment in defining the scope of a phonotactic constraint.
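For concreteness, the threshold can be computed directly. The counts below are illustrative, not drawn from the case studies in this article.

```python
import math

# Tolerance Principle (Yang 2016): a rule over N items is productive if
# its exceptions number at most N / ln N. The counts are hypothetical.

def tolerance_threshold(n):
    return n / math.log(n)

n_words, n_exceptions = 500, 60
theta = tolerance_threshold(n_words)
print(round(theta, 1))        # 80.5
print(n_exceptions <= theta)  # True: 60 exceptions are tolerated
```

Crucially, the threshold depends only on $N$, so it is fixed in advance of learning, in contrast to the induced accuracy schedule of the current proposal.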
8.5. Other future directions
The current study represents an initial step towards understanding the interplay between lexical exceptions and phonotactic learning. The primary objective has been to address the issue of exceptions, rather than to develop an all-encompassing learning model, which has led to significant simplifications in the proposed learner. Therefore, the next step is to enhance the current proposal towards a more comprehensive model. First, this study uses a simplified non-cumulative categorical grammar, while experimental evidence has indicated a cumulative effect in phonotactic learning (Breiss 2020; Kawahara & Breiss 2021). A future direction involves adapting the current proposal to accommodate a cumulative grammar, which would subsequently alter the assignment of grammaticality and the calculation of $O/E$. Second, the learnt grammar for Polish suggests a viable approach to interpreting SSP-defying onsets in the context of lexical exceptions (Jarosz 2017). Third, this study prespecifies tiers for the hypothesis space during phonotactic learning. In the future, it would be beneficial to integrate an automatic tier-induction algorithm based on the principles proposed in previous studies (Jardine & Heinz 2016; Gouskova & Gallagher 2020). Another promising direction is to extend the current approach to hypothesis spaces defined by other formal languages (Jäger & Rogers 2012).
Last but not least, while phonotactic learning facilitates the learning of morphophonological alternations, it cannot independently motivate alternation learning, as shown in experimental studies (Pater & Tessier 2006; Chong 2021). Given this evidence, a future direction is to model phonotactic and alternation learning as distinct but interconnected components. The phonotactic model proposed in this article can be used to filter out lexical exceptions that interfere with alternation learning. For example, in Turkish rounding harmony, after a rounded stem vowel such as [ø], the suffixal high front vowel /i/ typically surfaces as round [y]. However, in noisy real-world data, unrounded [i] can exceptionally surface after [ø]. The phonotactic model proposed in the current study can filter out illicit sequences such as [ø…i] during alternation learning, allowing feature-based generalisations such as /i/ $\rightarrow$ [+round] / [+round] \_.
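This division of labour can be sketched as follows. The toy forms, the flattened vowel inventory and the tier definition are hypothetical simplifications (real Turkish rounding harmony is also conditioned by vowel height and locality, which this sketch ignores).

```python
# A hypothetical sketch of exception filtering before alternation
# learning: a learnt phonotactic ban on [ø...i], stated on the vowel
# tier, removes disharmonic forms from the alternation learner's input.
# The toy forms and the simplified vowel inventory are invented.

VOWELS = set("iyeøaouɯ")

def vowel_tier(word):
    """Project the word onto its vowel tier."""
    return [seg for seg in word if seg in VOWELS]

def has_banned_oe_i(word):
    """True if unrounded [i] follows [ø] anywhere on the vowel tier."""
    tier = vowel_tier(word)
    return any(v == "ø" and "i" in tier[k + 1:]
               for k, v in enumerate(tier))

corpus = ["gøzy", "gøzi", "evim"]  # [gøzi] is an exceptional disharmonic form
cleaned = [w for w in corpus if not has_banned_oe_i(w)]
print(cleaned)  # ['gøzy', 'evim']
```

The alternation learner then receives only harmony-conforming pairs, from which a feature-based rounding generalisation can be stated without interference from the filtered exceptions.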
9. Conclusion
This research represents a significant step forward in two key areas. First, it pioneers an exception-filtering mechanism for learning categorical grammars from naturalistic input data with lexical exceptions. Second, while the current study primarily focusses on the learning of categorical grammars, it lays the groundwork for integrating learnt grammars with extragrammatical factors to model behavioural data, and marks initial steps in reassessing the ability of categorical grammars to approximate human judgements.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0952675725000028.
Data availability statement
Replication data and code can be found at https://github.com/hutengdai/exception-filtering-phonotactic-learner. Zimmer’s (1969) original experimental data and the Polish training data can be found in the Supplementary Material.
Acknowledgments
I would like to thank Colin Wilson, Adam Jardine, Bruce Tesar, Yang Wang, Adam McCollum, Jeff Heinz, Tim Hunter, Bruce Hayes, Gaja Jarosz, Caleb Belth, Jon Rawski, anonymous reviewers, the audience at AMP 2022, UCI QuantLang Lab, and MIT Exp/Comp group for their help and suggestions.
Competing interests
The author declares that there are no competing interests regarding the publication of this article.
Ethical standards
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.