Slovak Question Answering Dataset Based on the Machine Translation of the Squad V2.0

This paper describes the construction of SQuAD-sk, the first large-scale machine-translated question answering dataset for the Slovak language. The dataset was translated automatically from the original English SQuAD v2.0 using the Marian neural machine translation framework with the Helsinki-NLP Opus English-Slovak model. We also propose an effective approach for approximately locating the translated answer in the translated paragraph by measuring the similarity of their word vectors. In this way, we retained more than 92% of the questions and answers from the original English dataset. We then used the machine-translated dataset to train a Slovak question answering system by fine-tuning monolingual and multilingual BERT-based language models. The fine-tuned mBERT model achieved EM = 69.48% and F1 = 78.87%, which is comparable to results recently published for machine-translated SQuAD datasets in other European languages.
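The approximate answer search mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify which word vectors or scoring function were used, so the character n-gram `embed` function, the cosine scoring, and the names `span_vector`, `find_answer`, and `slack` are all assumptions made here for illustration. The idea shown is only the general one: slide a window over the translated paragraph and keep the span whose vector is most similar to the separately translated answer.

```python
import math
from collections import Counter


def embed(word, n=3):
    """Toy stand-in for a word vector (assumption): a bag of character n-grams."""
    padded = f"<{word.lower()}>"
    return Counter(padded[i:i + n] for i in range(max(1, len(padded) - n + 1)))


def span_vector(words):
    """Sum the toy word vectors of all words in a span."""
    total = Counter()
    for word in words:
        total.update(embed(word))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_answer(paragraph, answer, slack=1):
    """Return the paragraph span most similar to the answer, with its score.

    `slack` (hypothetical parameter) lets the span be a few words shorter
    or longer than the answer, since translation may change word counts.
    """
    tokens = paragraph.split()
    answer_words = answer.split()
    target = span_vector(answer_words)
    best_score, best_start, best_end = 0.0, 0, 0
    shortest = max(1, len(answer_words) - slack)
    longest = len(answer_words) + slack
    for length in range(shortest, longest + 1):
        for start in range(len(tokens) - length + 1):
            score = cosine(span_vector(tokens[start:start + length]), target)
            if score > best_score:
                best_score, best_start, best_end = score, start, start + length
    return " ".join(tokens[best_start:best_end]), best_score


# The answer was translated on its own and came out in a different
# grammatical case than the form that appears in the translated paragraph.
paragraph = "Bratislava je hlavné mesto Slovenska a leží na Dunaji"
span, score = find_answer(paragraph, "hlavným mestom Slovenska")
print(span, round(score, 2))
```

In this example an exact string match would fail because the inflected forms differ, while the similarity search still recovers the intended span. A real pipeline would presumably use trained word embeddings and a similarity threshold below which the question-answer pair is discarded.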

Language: English

Publication frequency: 2 times per year
Journal subjects: Linguistics and Semiotics, Theoretical Frameworks and Disciplines, Linguistics, other

Ján Staš

Daniel Hládek

Tomáš Koctúr

Published: 25 Dec 2023

Page range: 381-390

DOI: https://doi.org/10.2478/jazcas-2023-0054

Keywords: language modeling, machine reading comprehension, machine translation, natural language processing, question answering

© 2023 Ján Staš et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
