RoLEX: The development of an extended Romanian lexical dataset and its evaluation at predicting concurrent lexical information

Beáta Lőrincz; Elena Irimia; Adriana Stan; Verginica Barbu Mititelu

doi:10.1017/S1351324922000419

Published online by Cambridge University Press: 26 August 2022

Affiliations:
Beáta Lőrincz*: Babeş-Bolyai University, Cluj-Napoca, Romania
Elena Irimia: Research Institute for Artificial Intelligence ‘Mihai Dragănescu’, Romanian Academy, Bucharest, Romania
Adriana Stan: Technical University of Cluj-Napoca, Cluj-Napoca, Romania
Verginica Barbu Mititelu: Research Institute for Artificial Intelligence ‘Mihai Dragănescu’, Romanian Academy, Bucharest, Romania
*Corresponding author. E-mail: [email protected]

Abstract

In this article, we introduce RoLEX, an extended, freely available lexical resource for the Romanian language. The dataset was developed mainly for speech processing applications, yet its applicability extends beyond this domain. RoLEX includes over 330,000 curated entries with information on lemma, morphosyntactic description, syllabification, lexical stress and phonemic transcription. We describe in detail how the list of word entries was selected and how the complete lexical information associated with each entry was semi-automatically annotated.

The dataset’s inherent knowledge is then evaluated on the task of concurrently predicting syllabification, lexical stress marking and phonemic transcription. The evaluation examines several dataset design factors: the minimum viable number of entries for correct prediction, the reduction of that minimum through expert selection of entries, the augmentation of the input with morphosyntactic information, and the influence of each task on the overall accuracy. The best results, a word error rate of 3.08% and a character error rate of 1.08%, were obtained when the orthographic form of the entries was augmented with the complete morphosyntactic tags. We show that training on a carefully selected subset of entries can yield performance similar to that obtained with a randomly selected set twice as large. In terms of prediction complexity, lexical stress marking posed the most problems, accounting for around 60% of the errors in the predicted sequence.
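The two error rates reported above can be illustrated with a minimal sketch. It assumes that the word error rate is the share of entries whose full predicted sequence differs from the reference, and that the character error rate is the Levenshtein distance normalised by the total reference length; the exact scoring procedure is not given in this abstract. The function names, the notation and the toy strings are hypothetical and are not taken from RoLEX.

```python
from typing import Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Assumed definition: share of entries predicted with any error."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)


def char_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Assumed definition: total edit distance over total reference length."""
    edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)


# Invented toy targets in an assumed notation:
# '-' marks a syllable boundary and "'" precedes the stressed syllable.
refs = ["'ma-s@", "ko-'pil"]
hyps = ["'ma-s@", "'ko-pil"]  # second entry: stress mark misplaced

wer = word_error_rate(refs, hyps)
cer = char_error_rate(refs, hyps)
print(f"WER = {wer:.2%}, CER = {cer:.2%}")
```

In this toy case a single misplaced stress mark makes the whole entry count as wrong at word level, while costing only two character edits, which is one way a concurrent prediction can show a word error rate well above its character error rate.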

Keywords

Lexical dataset; Romanian; Transformer; Concurrent lexical prediction

Type: Article
Information: Natural Language Engineering, Volume 29, Issue 3, May 2023, pp. 720–745

DOI: https://doi.org/10.1017/S1351324922000419
Copyright: © The Author(s), 2022. Published by Cambridge University Press

Footnotes

Beáta Lőrincz, Elena Irimia, Adriana Stan, and Verginica Barbu Mititelu contributed equally.
