Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad
coverage grammars. In the simplest case, rules can be ‘read off’ the parse-annotations of
the corpus, producing either a simple or a probabilistic context-free grammar. Such grammars,
however, can be very large, making subsequent parsing with the grammar computationally
expensive. In this paper, we explore ways of reducing a treebank grammar in size, or
‘compacting’ it, using two kinds of technique: (i)
thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which
has both probabilistic and non-probabilistic variants. Our results show that by a combined
use of these two techniques, a probabilistic context-free grammar can be reduced in size by
62% without any loss in parsing performance, and by 71% with a gain in recall but some
loss in precision.
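As a rough illustration of the two compaction techniques, the following sketch thresholds rules by frequency and then greedily removes rules whose right-hand sides can already be parsed by the remaining rules. The rule representation, helper names, and greedy removal order are our own assumptions for illustration, not the paper's actual algorithm:

```python
def splits(seq, k):
    """Yield all ways to cut tuple `seq` into k non-empty contiguous parts."""
    if k == 1:
        yield (seq,)
        return
    for cut in range(1, len(seq) - k + 2):
        for rest in splits(seq[cut:], k - 1):
            yield (seq[:cut],) + rest

def derives(cat, seq, grammar):
    """True if category `cat` can derive the symbol sequence `seq`
    via at least one rule application from `grammar`."""
    def helper(c, s, seen):
        if (c, s) in seen:           # guard against unary cycles
            return False
        seen = seen | {(c, s)}
        for lhs, rhs in grammar:
            if lhs != c or len(rhs) > len(s):
                continue
            for parts in splits(s, len(rhs)):
                if all(p == (x,) or helper(x, p, seen)
                       for x, p in zip(rhs, parts)):
                    return True
        return False
    return helper(cat, seq, frozenset())

def compact(rules, counts, threshold=1):
    """Compact a treebank CFG: (i) drop rules seen fewer than `threshold`
    times; (ii) greedily drop any rule whose right-hand side the
    remaining rules can already parse (non-probabilistic rule-parsing)."""
    kept = [r for r in rules if counts[r] >= threshold]
    for rule in sorted(kept, key=counts.get):   # try rare rules first
        lhs, rhs = rule
        others = [r for r in kept if r != rule]
        if derives(lhs, rhs, others):
            kept = others
    return kept

# Toy grammar: the flat rule NP -> DT JJ NN is redundant, since
# NP -> DT NOM plus NOM -> JJ NN can parse its right-hand side.
rules = [
    ("NP", ("DT", "NOM")),
    ("NOM", ("JJ", "NN")),
    ("NOM", ("NN",)),
    ("NP", ("DT", "JJ", "NN")),
]
counts = {rules[0]: 50, rules[1]: 30, rules[2]: 40, rules[3]: 10}
compacted = compact(rules, counts)
```

The probabilistic variant of rule-parsing would additionally weigh the probability of the candidate rule against that of the derivation replacing it; this sketch implements only the non-probabilistic check.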