Keyword extraction from emails*

Published online by Cambridge University Press: 09 September 2016

S. LAHIRI
Affiliation:
University of Michigan, Ann Arbor, MI, USA 48109
R. MIHALCEA
Affiliation:
University of Michigan, Ann Arbor, MI, USA 48109
P.-H. LAI
Affiliation:
Samsung Research America, Richardson, TX, USA 75082

Abstract

Emails constitute an important genre of online communication. Many of us face the daunting task of sifting through an increasingly large volume of email on a daily basis. Keywords extracted from emails can help us combat such information overload by allowing a systematic exploration of the topics contained in emails. Existing literature on keyword extraction has not covered the email genre, and no human-annotated gold standard datasets are currently available. In this paper, we introduce a new dataset for keyword extraction from emails, and evaluate supervised and unsupervised methods for this task. The results obtained with our supervised keyword extraction system (38.99% F-score) improve over the results obtained with the best performing systems participating in the SemEval 2010 keyword extraction task.
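
The F-score quoted in the abstract combines precision and recall over extracted keywords. This page does not state the matching protocol used in the paper, so the following is only a minimal sketch of the standard set-based computation. The function name `keyword_f_score`, the exact-match rule after lowercasing, and the sample keywords are all illustrative assumptions, not the authors' evaluation code.

```python
def keyword_f_score(predicted, gold):
    """Return (precision, recall, F1) for two collections of keywords.

    Assumption of this sketch: a predicted keyword counts as correct only
    if it exactly matches a gold keyword after trimming and lowercasing.
    """
    pred = {k.strip().lower() for k in predicted}
    ref = {k.strip().lower() for k in gold}

    matched = len(pred & ref)  # keywords present in both sets
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: keywords for a single email.
p, r, f = keyword_f_score(
    ["meeting", "Budget", "lunch"],
    ["budget", "meeting", "deadline", "report"],
)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this made-up example two of the three predicted keywords are correct and two of the four gold keywords are found, so precision is 2/3, recall is 1/2, and F1 is 4/7. Shared-task evaluations often apply stemming before matching, which this sketch omits.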

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 

Footnotes

* We are grateful to the annotators who made this work possible. This material is based in part upon work supported by Samsung Research America under agreement GN0005468 and by the National Science Foundation under IIS award #1018613. Any opinions, findings, conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of Samsung Research America or the National Science Foundation. We also thank the anonymous reviewers whose insightful comments helped improve the draft substantially.
