Using Latent Semantic Analysis and the Predication Algorithm to Improve Extraction of Meanings from a Diagnostic Corpus

Guillermo Jorge-Botana; Ricardo Olmos; José Antonio León

doi:10.1017/S1138741600001815

Published online by Cambridge University Press: 10 January 2013

Guillermo Jorge-Botana, Universidad Autónoma de Madrid (Spain)
Ricardo Olmos, Universidad Autónoma de Madrid (Spain)
José Antonio León*, Universidad Autónoma de Madrid (Spain)
* Correspondence concerning this article should be addressed to José Antonio León, Departamento de Psicología Básica, Facultad de Psicología, Universidad Autónoma de Madrid, Campus de Cantoblanco, 28049 Madrid (Spain). Phone: +34-914975226. Fax: +34-914975215. E-mail: [email protected].

Abstract

There is currently widespread interest in indexing and extracting taxonomic information from large text collections. An example is the automatic categorization of informally written medical or psychological diagnoses, followed by the extraction of epidemiological information or even of the terms and structures needed to formulate guiding questions as a heuristic tool for helping doctors. Vector space models have been used successfully to this end (Lee, Cimino, Zhu, Sable, Shanker, Ely & Yu, 2006; Pakhomov, Buntrock & Chute, 2006). In this study we apply a computational model known as Latent Semantic Analysis (LSA) to a diagnostic corpus with the aim of retrieving definitions (in the form of lists of semantic neighbors) of common structures it contains (e.g. “storm phobia”, “dog phobia”) or of less common structures that might be formed by logical combinations of categories and diagnostic symptoms (e.g. “gun personality” or “germ personality”). When trying to bring definitions into line with the meaning of the structures and to make them reasonably representative, various problems commonly arise in recovering content with vector space models. We propose some approaches that bypass these problems, such as Kintsch's (2001) predication algorithm and some corrections to the way lists of neighbors are obtained, which have already been tested on semantic spaces in a non-specific domain (Jorge-Botana, León, Olmos & Hassan-Montero, under review). The results support the idea that the predication algorithm may also be useful for extracting more precise meanings of certain structures from scientific corpora, and that the introduction of some corrections based on vector length may increase its efficiency on non-representative terms.

Actualmente existe un amplio interés en la indexación y extracción de información provenientes de grandes bancos de textos de índole taxonómica. Por ejemplo, la categorización automática de diagnósticos médicos o psicológicos redactados de manera informal y su consiguiente extracción de información epidemiológica o incluso en la extracción de términos y estructuras para la creación de preguntas-guía que asistan de forma heurística a los médicos en la búsqueda de información. Los modelos espacio-vectoriales han sido empleados con éxito en estos propósitos (Lee, Cimino, Zhu, Sable, Shanker, Ely, & Yu, 2006; Pakhomov, Buntrock, & Chute, 2006). En este estudio utilizamos un modelo computacional conocido como Análisis Semántico Latente (LSA) sobre un corpus diagnóstico con la motivación de recuperar definiciones (en forma de listados de vecinos semánticos) de estructuras habituales en ellos (e.g., “fobia a las tormentas”, “fobia a los perros”) o estructuras menos habituales, pero que pueden formarse por combinaciones lógicas de las categorías y síntomas diagnósticos (e.g., “personalidad de la pistola” o “personalidad de los gérmenes”). Para conseguir que las definiciones sean ajustadas al significado de las estructuras, y mínimamente representativas, se discuten algunos problemas que suelen surgir en la recuperación de contenidos con los modelos espacio-vectoriales, y se proponen algunas formas de evitarlos como el algoritmo de predicación de Kintsch (2001) y algunas correcciones en el modo de extraer listados de vecinos ya experimentadas sobre espacios semánticos de dominio general (Jorge-Botana, León, Olmos & Hassan-Montero, in review). Los resultados apoyan la idea de que el algoritmo de predicación puede ser también útil para extraer acepciones más precisas de ciertas estructuras en corpus científicos y que la introducción de algunas correcciones en base a la longitud de vector puede aumentar su eficacia ante términos poco representativos.
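The predication step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a semantic space is already available as a mapping from terms to vectors, and it follows a simplified reading of Kintsch's (2001) algorithm: take the m nearest neighbors of the predicate, keep the k of those most similar to the argument, and sum the predicate, the argument and the selected neighbors. The toy vectors, the function names and the values of m and k are illustrative assumptions, not the authors' corpus or parameters, and the vector-length corrections are not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbors(space, vector, n, exclude=()):
    """Return the n terms in `space` whose vectors are closest to `vector`."""
    candidates = [t for t in space if t not in exclude]
    candidates.sort(key=lambda t: cosine(space[t], vector), reverse=True)
    return candidates[:n]


def predication(space, predicate, argument, m=4, k=2):
    """Simplified predication (an assumption-laden sketch):
    1. take the m nearest neighbors of the predicate,
    2. keep the k of those most similar to the argument,
    3. sum predicate, argument and the k selected neighbors."""
    p, a = space[predicate], space[argument]
    near_p = neighbors(space, p, m, exclude=(predicate, argument))
    relevant = sorted(near_p, key=lambda t: cosine(space[t], a), reverse=True)[:k]
    parts = [p, a] + [space[t] for t in relevant]
    return [sum(dims) for dims in zip(*parts)], relevant


# Hypothetical 3-dimensional "semantic space"; the numbers are invented.
SPACE = {
    "phobia":  [0.9, 0.5, 0.1],
    "fear":    [0.8, 0.6, 0.1],
    "anxiety": [0.9, 0.3, 0.2],
    "thunder": [0.3, 0.9, 0.0],
    "storm":   [0.2, 1.0, 0.1],
    "dog":     [0.2, 0.1, 0.9],
    "bite":    [0.4, 0.1, 0.8],
}

# Vector for "storm phobia" and its "definition" as a list of neighbors.
vector, selected = predication(SPACE, predicate="phobia", argument="storm")
definition = neighbors(SPACE, vector, 3, exclude=("phobia", "storm"))
print(selected, definition)
```

In this toy space the neighbors of "phobia" that survive the filtering are the ones related to "storm", so the resulting neighbor list leans toward the storm-related sense and away from terms such as "dog". In a real LSA space the vectors would come from a truncated singular value decomposition of a term-by-document matrix.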

Keywords

LSA; latent semantic analysis; predication algorithm; taxonomy; discourse evaluation; knowledge representation

LSA; análisis de la semántica latente; algoritmo de predicación; taxonomía; evaluación del discurso; representación del conocimiento

Type: Research Article
Information: The Spanish Journal of Psychology, Volume 12, Issue 2, November 2009, pp. 424–440

DOI: https://doi.org/10.1017/S1138741600001815
Copyright: © Cambridge University Press 2009

References

Blackmon, M. H., Polson, P. G., Kitajima, M. & Lewis, C. (2002). Cognitive Walkthrough for the Web. In CHI 2002: Proceedings of the conference on Human Factors in Computing Systems (pp. 463–470).

Blackmon, M. H. (2004). Cognitive Walkthrough. In Bainbridge, W. S. (Ed.), Encyclopedia of Human-Computer Interaction, 2 volumes. Great Barrington, MA: Berkshire Publishing.

Burek, G., Vargas-Vera, M. & Moreale, E. (2004). Document retrieval based on intelligent query formulation. Techreport ID: kmi-04-13 [Previously known as KMI-TR-148].

Burgess, C. (2000). Theory and operational definitions in computational memory models: A response to Glenberg and Robertson. Journal of Memory and Language, 43, 402–408.

Cederberg, S. & Widdows, D. (2003). Using LSA and noun coordination information to improve the precision and recall of automatic hyponymy extraction. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL. Edmonton, Canada, 4.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. (1990). Indexing By Latent Semantic Analysis. Journal of the American Society For Information Science, 41, 391–407.

Denhière, G., Lemaire, B., Bellissens, C. & Jhean-Larose, S. (2007). A Semantic Space Modelling Children's Semantic Memory. In Landauer, T. K., McNamara, D., Dennis, S. & Kintsch, W. (Eds.), The handbook of Latent Semantic Analysis (pp. 143–167). Mahwah, NJ: Erlbaum.

Dumais, S. (2003). Data-Driven approaches to information access. Cognitive Science, 2, 491–524.

Glenberg, A. M. & Robertson, D. A. (2000). Symbol grounding and meaning: A comparison of high-dimensional and embodied theories of meaning. Journal of Memory and Language, 43(3), 379–401.

Jorge-Botana, G., León, J. A., Olmos, R. & Hassan-Montero, Y. (under review). Visualizing polysemic structures using LSA and the predication algorithm. Journal of the American Society for Information Science and Technology.

Juvina, I. & van Oostendorp, H. (2005). Bringing cognitive models into the domain of web accessibility. In Proceedings of the HCII2005 Conference, Las Vegas, USA.

Juvina, I., van Oostendorp, H., Karbor, P. & Pauw, B. (2005). Towards modeling contextual information in web navigation. In Bara, B. G., Barsalou, L. & Bucciarelli, M. (Eds.), Proceedings of the 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1078–1083). Austin, Texas: The Cognitive Science Society, Inc.

Kintsch, W. (1998). Comprehension: A paradigm for cognition. New York: Cambridge University Press.

Kintsch, W. (2000). Metaphor comprehension: A computational theory. Psychonomic Bulletin and Review, 7, 257–266.

Kintsch, W. (2001). Predication. Cognitive Science, 25, 173–202.

Kintsch, W. (2002). On the notion of theme and topic in psychological process models of text comprehension. In Louwerse, M. & van Peer, W. (Eds.), Thematics, Interdisciplinary Studies (pp. 157–170). Amsterdam: John Benjamins B.V.

Kintsch, W. & Bowles, A. (2002). Metaphor comprehension: What makes a metaphor difficult to understand? Metaphor and Symbol, 17, 249–262.

Kurby, C. A., Wiemer-Hastings, K., Ganduri, N., Magliano, J. P., Millis, K. K. & McNamara, D. S. (2003). Computerizing reading training: Evaluation of a latent semantic analysis space for science text. Behavior Research Methods, Instruments & Computers, 35, 244–250.

Landauer, T. K. (2002). On the computational basis of learning and cognition: Arguments from LSA. In Ross, N. (Ed.), The Psychology of Learning and Motivation: Advances in research and theory (pp. 43–84). San Diego: Academic Press.

Landauer, T. K. & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.

Landauer, T. K., Foltz, P. W. & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25, 259–284.

Lemaire, B. & Denhière, G. (2006). Effects of High-Order Co-occurrences on Word Semantic Similarity. Current Psychology Letters, 18, 1.

Lee, M., Cimino, J., Zhu, H., Sable, C., Shanker, V., Ely, J. et al. (2006). Beyond information retrieval – Medical question answering. In Proceedings of the American Medical Informatics Association. Washington DC, USA.

Lemaire, B., Denhière, G., Bellissens, C. & Jhean-Larose, S. (2006). A Computational Model for Simulating Text Comprehension. Behavior Research Methods, 38(4), 628–637.

Mandl, T. (1999). Efficient Preprocessing for Information Retrieval with Neural Networks. In Zimmermann, H.-J. (Ed.), Proceedings of the EUFIT '99. 7th European Congress on Intelligent Techniques and Soft Computing. Aachen, Germany, 13.

Mill, W. & Kontostathis, A. (2004). Analysis of the values in the LSI term-term matrix. Technical Report. http://webpages.ursinus.edu/akontostathis/MillPaper.pdf

Nakov, P., Popova, A. & Mateev, P. (2001). Weight functions impact on LSA performance. In Proceedings of the EuroConference RANLP'2001 (Recent Advances in NLP). Tzigov Chark, Bulgaria, 187–193.

Pakhomov, S., Buntrock, J. D. & Chute, C. G. (2006). Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. Journal of the American Medical Informatics Association, 13(5), 516–525.

Quesada, J. (2007). Creating Your Own LSA Spaces. In Landauer, T. K., McNamara, D., Dennis, S. & Kintsch, W. (Eds.), The handbook of Latent Semantic Analysis (pp. 71–88). Mahwah, NJ: Erlbaum.

Quesada, J. F., Kintsch, W. & Gomez-Milán, E. (2001). A Computational Theory of Complex Problem Solving Using the Vector Space Model (part II): Latent Semantic Analysis Applied to Empirical Results from Adaptation Experiments. In Cañas (Ed.), Cognitive research with Microworlds (pp. 147–158).

Rehder, B., Schreiner, M. E., Wolfe, M. B., Laham, D., Landauer, T. K. & Kintsch, W. (1998). Using Latent Semantic Analysis to assess knowledge: Some technical considerations. Discourse Processes, 25, 337–354.

Rosch, E. & Mervis, C. B. (1975). Family resemblances: Studies in the internal structures of categories. Cognitive Psychology, 7, 573–605.

Rumelhart, D. E. & McClelland, J. L. (1992). Introducción al procesamiento distribuido en paralelo. Madrid: Alianza Editorial.

Chen, Rung-Ching, Lee, Ya-Ching & Pan, Ren-Hao (2006). Adding New Concepts On The Domain Ontology Based on Semantic Similarity. In Proceedings of the International Conference on Business and Information. July 12–14, 2006, Singapore.

Skoyles, J. R. (1999). Autistic language abnormality: Is it a second-order context learning defect? The view from Latent Semantic Analysis. In Barriere, I., Chiat, S., Morgan, G. & Woll, B. (Eds.), Proceedings of Child Language Seminar. London, pp 1.

Seidenberg, M. S. & McClelland, J. L. (1989). A Distributed, Developmental Model of Word Recognition and Naming. Psychological Review, 96, 523–568.

Serafin, R. & Di Eugenio, B. (2003). FLSA: Extending Latent Semantic Analysis with features for dialogue act classification. In Proceedings of ACL04, 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain, July (pp. 692-es).

Schunn, C. D. (1999). The presence and absence of category knowledge in LSA. In Proceedings of the 21st Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.

Turney, P. (2001). Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. In De Raedt, L. & Flach, P. (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), Freiburg, Germany (pp. 491–502).

Wiemer-Hastings, P., Wiemer-Hastings, K. & Graesser, A. (1999). Improving an intelligent tutor's comprehension of students with Latent Semantic Analysis. In Lajoie, S. P. & Vivet, M. (Eds.), Artificial Intelligence in Education (pp. 535–542). Amsterdam: IOS Press.

Wiemer-Hastings, P. (2000). Adding syntactic information to LSA. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society (pp. 989–993). Mahwah, NJ: Erlbaum.

Wiemer-Hastings, P. & Zipitria, I. (2001). Rules for syntax, vectors for semantics. In Proceedings of the 23rd Cognitive Science Conference. Mahwah, NJ: Lawrence Erlbaum Associates.

Wild, F., Stahl, C., Stermsek, G. & Neumann, G. (2005). Parameters Driving Effectiveness of Automated Essay Scoring with LSA. In Proceedings of the 9th International Computer Assisted Assessment Conference. Loughborough, UK (pp. 485–494).
