Volume 26 (2021): Issue 2 (December 2021)
Journal Details
Format: Journal
eISSN: 2255-8691
First Published: 08 Nov 2012
Publication timeframe: 2 times per year
Languages: English
Access type: Open Access

Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora

Published Online: 30 Dec 2021
Volume & Issue: Volume 26 (2021) - Issue 2 (December 2021)
Page range: 132 - 138
Abstract

Nowadays, natural language processing (NLP) increasingly relies on pre-trained word embeddings for use in various tasks. However, there is little research devoted to Latvian – a language that is morphologically much more complex than English. In this study, several experiments were carried out on three NLP tasks with four different methods of creating word embeddings: word2vec, fastText, Structured Skip-Gram and ngram2vec. The obtained results can serve as a baseline for future NLP research on the Latvian language. The main conclusions are the following. First, in the part-of-speech tagging task, using a training corpus 46 times smaller than in a previous study, the accuracy was 91.4 % (versus 98.3 % in the previous study). Second, fastText demonstrated the best overall effectiveness. Third, for all methods, the best results were observed for embeddings with a dimension size of 200. Finally, word lemmatization generally did not improve results.
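A common intrinsic way to evaluate word embeddings of the kind compared above is the word-analogy task: given a pair (a, b) and a query word c, find the word d whose vector is closest to b − a + c by cosine similarity (the 3CosAdd method). The sketch below illustrates this with tiny hypothetical 2-D vectors; the embedding table, words, and dimensionality are illustrative assumptions, not data from the study.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(emb, a, b, c):
    """Return the word d maximizing cos(d, b - a + c), excluding a, b, c (3CosAdd)."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -2.0
    for word, vec in emb.items():
        if word in (a, b, c):
            continue  # the query words themselves are excluded as candidates
        sim = cosine(vec, target)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy 2-D embedding table (hypothetical values, for illustration only)
emb = {
    "king":  np.array([0.9, 0.1]),
    "queen": np.array([0.9, 0.9]),
    "man":   np.array([0.1, 0.1]),
    "woman": np.array([0.1, 0.9]),
    "apple": np.array([0.5, -0.3]),
}

print(analogy(emb, "man", "king", "woman"))  # → queen
```

In practice the same ranking is done over a full vocabulary of trained vectors (e.g. ones produced by word2vec or fastText), and accuracy is the fraction of analogy questions answered correctly.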

