Volume 72 (2022): Issue 4 (June 2022)
Building Web corpora as sources for linguistic research and its applications
Journal Details
Format: Journal
eISSN: 1338-4287
First Published: 05 Mar 2010
Publication timeframe: 2 times per year
Languages: English
Access type: Open Access

Identifying Errors in Russian Web Corpora

Published online: 17 Aug 2022
Volume & Issue: Volume 72 (2022) - Issue 4 (June 2022) - Building Web corpora as sources for linguistic research and its applications
Pages: 977–985
Abstract

The rapid growth of the Web produces large amounts of text and inevitably affects its quality. Frequent errors can distort results, especially when the texts are used for research or for language teaching and learning. Hence, existing corpora based on web texts need to be examined and cleaned of such “noisy” fragments. In this study, we analyze errors in the Araneum Russicum Maximum corpus. The most prominent are encoding errors, incorrect font types, and segments written in other languages. These phenomena lead to incorrect morphological analysis and lemmatization, distort frequencies, and prevent lexical units from being found and displayed to corpus users. The paper describes the error types and outlines possible ways to eliminate them.
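
To make the error types more concrete, the following Python sketch shows one possible heuristic for flagging suspicious tokens in a Russian corpus. It is an illustration under our own assumptions, not the procedure used in the paper: the function names `scripts_in` and `classify_token` are hypothetical, the Unicode replacement character U+FFFD is taken as a sign of encoding damage, and a token that mixes Cyrillic and Latin letters is taken as a candidate for character substitution or foreign-language material.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"  # inserted by decoders when bytes cannot be decoded


def scripts_in(token):
    """Return the set of script labels found among the letters of a token."""
    found = set()
    for ch in token:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, symbols
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            found.add("cyrillic")
        elif name.startswith("LATIN"):
            found.add("latin")
        else:
            found.add("other")
    return found


def classify_token(token):
    """Assign a coarse label to a token (hypothetical labels, for illustration)."""
    if REPLACEMENT_CHAR in token:
        return "encoding_error"
    scripts = scripts_in(token)
    if not scripts:
        return "other"  # no letters at all
    if len(scripts) > 1:
        return "mixed"  # e.g. a Latin look-alike inside a Cyrillic word
    return next(iter(scripts))


# Example tokens; the third starts with Latin "c" followed by Cyrillic letters.
tokens = [
    "\u0441\u043b\u043e\u0432\u043e",   # fully Cyrillic
    "word",                             # fully Latin
    "c\u043b\u043e\u0432\u043e",        # mixed scripts
    "\u0441\u043b\u043e\ufffd\u043e",   # contains U+FFFD
]
labels = [classify_token(t) for t in tokens]
```

Tokens labelled `mixed` or `encoding_error` would then be candidates for manual inspection or normalization before morphological analysis; the labels alone do not say how a token should be corrected.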

