Open Access

Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora


Nowadays, natural language processing (NLP) increasingly relies on pre-trained word embeddings for use in various tasks. However, little research has been devoted to Latvian, a language that is morphologically much more complex than English. In this study, several experiments were carried out on three NLP tasks using four different methods of creating word embeddings: word2vec, fastText, Structured Skip-Gram and ngram2vec. The obtained results can serve as a baseline for future NLP research on the Latvian language. The main conclusions are the following: First, in the part-of-speech tagging task, using a training corpus 46 times smaller than in a previous study, the accuracy was 91.4% (versus 98.3% in the previous study). Second, fastText demonstrated the best overall effectiveness. Third, for all methods, the best results were observed for embeddings with a dimension size of 200. Finally, word lemmatization generally did not improve results.
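As a minimal illustration of how word embeddings such as those compared in the study are queried, the sketch below computes cosine similarity between toy word vectors. The vocabulary and vectors here are invented for illustration only; they do not come from the study, which used much higher-dimensional embeddings (200 dimensions performed best).

```python
import math

# Toy 4-dimensional embeddings (invented for illustration; the study
# found a dimension size of 200 to work best for all four methods).
embeddings = {
    "suns":  [0.9, 0.1, 0.3, 0.0],   # Latvian for "dog"
    "kakis": [0.8, 0.2, 0.4, 0.1],   # "cat"
    "galds": [0.0, 0.9, 0.1, 0.7],   # "table"
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, emb):
    """Return the vocabulary word closest to `word` by cosine similarity."""
    return max((w for w in emb if w != word),
               key=lambda w: cosine(emb[word], emb[w]))
```

In practice, embeddings trained by word2vec, fastText, Structured Skip-Gram or ngram2vec are queried the same way, only over a full vocabulary; a subword-based method such as fastText can additionally compose vectors for inflected forms it never saw, which matters for a morphologically rich language like Latvian.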

eISSN:
2255-8691
Language:
English
Publication frequency:
2 times per year
Journal subjects:
Computer Sciences, Artificial Intelligence, Information Technology, Project Management, Software Development