Morphological Tagging and Lemmatization in the Albanian Language

An important element of Natural Language Processing is part-of-speech tagging. Fine-grained word-class annotations enrich the word forms in a text and can be used in downstream processes, such as dependency parsing. The improved search options that tagged data offers also greatly benefit linguists and lexicographers. Natural language processing research is becoming increasingly popular and important as unsupervised learning methods are developed. Some aspects of the Albanian language make the creation of a part-of-speech tag set challenging.

This research discusses those linguistic phenomena and proposes a part-of-speech tag set that can adequately represent them.

The corpus contains more than 250,000 tokens, each annotated with a medium-sized tag set that adequately represents the syntagmatic aspects of the Albanian language. In addition, this paper presents morphologically and part-of-speech tagged corpora for the Albanian language, as well as a lemmatizer and a neural morphological tagger trained on these corpora. On the held-out evaluation set, the model achieves 93.65% accuracy on part-of-speech tagging, 85.31% on morphological tagging, and 88.95% on lemmatization. Furthermore, the TF-IDF technique is used to weight terms, and the resulting scores highlight words that carry additional information in the Albanian corpus.
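The TF-IDF weighting mentioned above can be illustrated with a minimal sketch. The abstract does not say which TF-IDF variant was used, so this example assumes length-normalized term frequency and an unsmoothed logarithmic inverse document frequency. The function name `tf_idf` and the toy documents are hypothetical and are not taken from the paper's corpus.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed variant (not confirmed by the paper):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy, hypothetical documents used only for illustration.
docs = [
    "gjuha shqipe është e bukur".split(),
    "gjuha shqipe ka shumë dialekte".split(),
    "korpusi përmban shumë fjalë".split(),
]
scores = tf_idf(docs)

# Terms found in fewer documents receive higher weights.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.4f}")
```

Under this variant, a term that occurs in every document scores zero, while terms concentrated in a few documents stand out, which is the sense in which the scores highlight informative words.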

eISSN: 1857-8462
Language: English

Publication timeframe: 2 times per year
Journal Subjects: General Interest



Published Online: Dec 30, 2021

Page range: 3 - 16

DOI: https://doi.org/10.2478/seeur-2021-0015

Keywords
Part of speech tagging, Albanian language, Natural Language Processing, Lemmatization, Corpora

© 2021 Diellza Nagavci Mati et al., published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
