RADFORD, A. ‒ WU, J. ‒ CHILD, R. ‒ LUAN, D. ‒ AMODEI, D. ‒ SUTSKEVER, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog.
PAPINENI, K. ‒ ROUKOS, S. ‒ WARD, T. ‒ ZHU, W. J. (2002). BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311-318.
LIN, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 74-81.
CHINCHOR, N. (1991). MUC-3 Evaluation Metrics and Linguistic Phenomena Tests. Natural Language Processing Systems Evaluation Workshop, p. 13.
ZHANG, T. ‒ KISHORE, V. ‒ WU, F. ‒ WEINBERGER, K. Q. ‒ ARTZI, Y. (2019). BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
ZHAO, W. ‒ PEYRARD, M. ‒ LIU, F. ‒ GAO, Y. ‒ MEYER, C. M. ‒ EGER, S. (2019). MoverScore: Text generation evaluating with contextualized embeddings and Earth Mover Distance. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 563-578.
BROWN, T. B. ‒ MANN, B. ‒ RYDER, N. ‒ SUBBIAH, M. ‒ KAPLAN, J. ‒ DHARIWAL, P. ‒ AMODEI, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
WEI, J. ‒ WANG, X. ‒ SCHUURMANS, D. ‒ BOSMA, M. ‒ ICHTER, B. ‒ XIA, F. ‒ LE, Q. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
MAYNEZ, J. ‒ NARAYAN, S. ‒ BOHNET, B. ‒ MCDONALD, R. (2020). On faithfulness and factuality in abstractive summarization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1906-1919.
SHENG, E. ‒ CHANG, K. W. ‒ NATARAJAN, P. ‒ PENG, N. (2021). Societal biases in language generation: Progress and challenges. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 4275-4293.
GEHMAN, S. ‒ GURURANGAN, S. ‒ SAP, M. ‒ CHOI, Y. ‒ SMITH, N. A. (2020). RealToxicityPrompts: Evaluating neural toxic degeneration in language models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3356-3369.
POPOVIĆ, M. (2017). chrF++: words helping character n-grams. Proceedings of the Second Conference on Machine Translation, 612-618.
JOSHI, P. ‒ SANTY, S. ‒ BUDHIRAJA, A. ‒ BALI, K. ‒ CHOUDHURY, M. (2020). The state and fate of linguistic diversity and inclusion in the NLP world. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 6282-6293.
BEDNÁR, P. ‒ DOBEŠ, M. ‒ GARABÍK, R. (2024). Training of large language model Mistral on Slovak language data. Jazykovedný časopis. Under review.
VAN DER LEE, C. ‒ GATT, A. ‒ VAN MILTENBURG, E. ‒ WUBBEN, S. ‒ KRAHMER, E. (2019). Best practices for the human evaluation of automatically generated text. Proceedings of the 12th International Conference on Natural Language Generation, 355-368.
CHIANG, W.-L. ‒ LI, Z. ‒ LIN, Z. ‒ SHENG, Y. ‒ WU, Z. ‒ ZHANG, P. ‒ ZHANG, C. (2023). Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
KOCMI, T. ‒ FEDERMANN, C. (2023). Large language models are state-of-the-art evaluators of translation quality. arXiv preprint arXiv:2302.14520.
BENDER, E. M. ‒ GEBRU, T. ‒ MCMILLAN-MAJOR, A. ‒ SHMITCHELL, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610-623.