The TIPSTER Text Summarization Evaluation (SUMMAC) has developed several new
extrinsic and intrinsic methods for evaluating summaries. It has established definitively that
automatic text summarization is very effective in relevance assessment tasks on news articles.
Summaries as short as 17% of full text length sped up decision-making by almost a factor
of 2 with no statistically significant degradation in accuracy. Analysis of feedback forms
filled in after each decision indicated that the intelligibility of present-day machine-generated
summaries is high. Systems that performed most accurately in the production of indicative
and informative topic-related summaries used term frequency and co-occurrence statistics,
and vocabulary overlap comparisons between text passages. However, in the absence of a
topic, these statistical methods do not appear to provide any additional leverage: in the case
of generic summaries, the systems were indistinguishable in accuracy. The paper discusses
some of the tradeoffs and challenges faced by the evaluation, and lists some of the
lessons learned, impacts, and possible future directions. The evaluation methods used in
SUMMAC are of interest both for summarization evaluation and for the evaluation
of other ‘output-related’ NLP technologies, where there may be many potentially acceptable
outputs, with no automatic way to compare them.
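To make the point about topic-related summarization concrete, the sketch below shows one common form of vocabulary-overlap comparison: scoring a passage against a topic description by the cosine similarity of their term-frequency vectors. The tokenization, weighting, and example strings are illustrative assumptions, not details of the actual SUMMAC systems.

    from collections import Counter
    import math

    def term_freqs(text):
        # Lowercased whitespace tokenization; real systems would likely
        # also stem and drop stopwords (an assumption, not SUMMAC's recipe).
        return Counter(text.lower().split())

    def vocab_overlap(passage_a, passage_b):
        # Cosine similarity between raw term-frequency vectors.
        fa, fb = term_freqs(passage_a), term_freqs(passage_b)
        dot = sum(fa[t] * fb[t] for t in fa.keys() & fb.keys())
        norm = math.sqrt(sum(v * v for v in fa.values())) \
             * math.sqrt(sum(v * v for v in fb.values()))
        return dot / norm if norm else 0.0

    topic = "coup attempt against the elected government"
    sentence = "Rebel officers staged a coup against the government on Tuesday."
    print(round(vocab_overlap(topic, sentence), 3))

A topic-relative summarizer built on this idea would rank a document's passages by their overlap with the topic statement and extract the highest-scoring ones; without a topic to compare against, such overlap scores offer no guidance, which is consistent with the generic-summary result reported above.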