
Questionnaires for eliciting evaluation data from users of interactive question answering systems

Published online by Cambridge University Press: 01 January 2009

D. KELLY*
University of North Carolina, Chapel Hill, NC 27599-3360, USA. e-mail: [email protected]
P. B. KANTOR
Rutgers University, New Brunswick, NJ 08901, USA. e-mail: [email protected]
E. L. MORSE
National Institute of Standards & Technology, Gaithersburg, MD 20899, USA. e-mail: [email protected]
J. SCHOLTZ
Pacific Northwest National Laboratory, Richland, WA 99352, USA. e-mail: [email protected]
Y. SUN
University at Buffalo, The State University of New York, Buffalo, NY 14260, USA. e-mail: [email protected]
* To whom all correspondence should be addressed.

Abstract

Evaluating interactive question answering (QA) systems with real users can be challenging: traditional evaluation measures based on the relevance of returned items are difficult to employ because relevance judgments can be unstable in multi-user evaluations. The work reported in this paper evaluates the effectiveness of three questionnaires in distinguishing among a set of interactive QA systems: a Cognitive Workload Questionnaire (NASA TLX), and Task and System Questionnaires customized to a specific interactive QA application. These Questionnaires were evaluated with four systems, seven analysts, and eight scenarios during a 2-week workshop. Overall, results demonstrate that all three Questionnaires are effective at distinguishing among systems, with the Task Questionnaire being the most sensitive. Results also provide initial support for the validity and reliability of the Questionnaires.
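As a minimal illustrative sketch only (not a description of the paper's own analysis), the following Python fragment shows one common way Likert-type questionnaire scores can be compared across systems; the system names, the ratings, and the choice of a Kruskal-Wallis test are assumptions made for illustration.

    # Minimal sketch (assumptions only): comparing Likert-type questionnaire
    # scores across four hypothetical systems with a Kruskal-Wallis test.
    from scipy.stats import kruskal

    # Hypothetical per-scenario Task Questionnaire ratings (1-5 Likert scale).
    scores = {
        "System A": [4, 5, 4, 3, 5, 4, 4, 5],
        "System B": [3, 3, 4, 2, 3, 3, 4, 3],
        "System C": [2, 3, 2, 3, 2, 2, 3, 2],
        "System D": [4, 4, 5, 4, 3, 4, 5, 4],
    }

    # Kruskal-Wallis is a non-parametric analogue of one-way ANOVA, often used
    # for ordinal (Likert) data; a small p-value suggests at least one system
    # differs from the others on this questionnaire.
    statistic, p_value = kruskal(*scores.values())
    print(f"H = {statistic:.2f}, p = {p_value:.4f}")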

Type: Papers
Copyright: © Cambridge University Press 2008

