Natural Language Processing

doi:10.1017/9781108973021.036

35 - Natural Language Processing

from Part 6 - Experimental and Quantitative Approaches

Published online by Cambridge University Press: 16 May 2024

Tomaž Erjavec

Edited by Danko Šipka and Wayles Browne

Danko Šipka, Arizona State University
Wayles Browne, Cornell University, New York

Summary

This chapter surveys the history and main directions of natural language processing (NLP) research in general, and for Slavic languages in particular. The field has grown enormously since its beginnings. Since 2010 in particular, the amount of digital text has grown rapidly, and research has yielded an ever-greater number of highly usable applications. This growth is reflected in the increasing number of NLP conferences and workshops and in their rising attendance. Slavic countries are no exception; several have been organising international conferences for decades, and the proceedings of these conferences are the best place to find publications on Slavic NLP research. The overall direction in which NLP will evolve is difficult to predict. It is nevertheless certain that deep learning, including various new types of word embeddings (e.g. contextual, multilingual) and similar ‘deep’ models, will play an increasing role. Predictions also point to the growing importance of the Universal Dependencies framework and its treebanks, and of research into the theory, not only the practice, of deep learning, coupled with attempts to make the resulting models more explainable.

Keywords

computational linguistics, corpus linguistics, natural language processing, language resources, taggers, parsers

Type: Chapter
Information: The Cambridge Handbook of Slavic Linguistics, pp. 732–750

DOI: https://doi.org/10.1017/9781108973021.036

Publisher: Cambridge University Press

Print publication year: 2024

