Identifying Errors in Russian Web Corpora

BAEZA-YATES, Ricardo – RELLO, Luz: On measuring the lexical quality of the web. In: Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality. Eds. C. Castillo – Z. Gyongyi – A. Jatowt – K. Tanaka. Lyon, France 2012, pp. 1–6. Available at: https://dl.acm.org/doi/pdf/10.1145/2184305.2184307

BENKO, Vladimír: Aranea: Yet another family of (comparable) web corpora. In: International Conference on Text, Speech, and Dialogue. Eds. P. Sojka – A. Horák – I. Kopeček – K. Pala. Cham: Springer 2014, pp. 247–256.

British National Corpus. Available at: http://www.natcorp.ox.ac.uk/corpus/

BUKCHINA – KALAKUTSKAYA: БУКЧИНА, Бронислава З. – КАЛАКУЦКАЯ, Лариса П.: Слитно или раздельно. Москва: Дрофа 2006. 936 с.

CLARK, Eleanor – ARAKI, Kenji: Text Normalization in Social Media: Progress, Problems and Applications for a Pre-Processing System of Casual English. In: Procedia - Social and Behavioral Sciences. Eds. N. A. Aziz – K. Hasida – A. W. A. Rahman – H. Saito. 2011, 27, pp. 2–11.

GILYAREVSKIY – GRIVNIN: ГИЛЯРЕВСКИЙ, Руджеро С. – ГРИВНИН, Владимир С.: Определитель языков мира по письменностям. Москва: Наука 1965. 376 с.

JAKUBÍČEK, Miloš – KOVÁŘ, Vojtěch – RYCHLÝ, Pavel – SUCHOMEL, Vít: Current Challenges in Web Corpus Building. In: Proceedings of the 12th Web as Corpus Workshop. Language Resources and Evaluation Conference (LREC 2020). Eds. A. Barbaresi – F. Bildhauer – R. Schäfer – E. Stemle. Marseille, 11–16 May 2020, pp. 1–4.

KHOKHLOVA, Maria: Large Corpora and Frequency Nouns. In: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2016”. Ed. V. P. Selegey, Vol. 15(22). Moscow: RSUH 2016, pp. 224–238.

KHOKHLOVA, Maria – BENKO, Vladimír: Size of corpora and collocations: the case of Russian. In: Slovenščina 2.0, 2020, Vol. 8, No. 2, pp. 58–77.

KUTUZOV, Andrey – KUNILOVSKAYA, Maria: Size vs. structure in training corpora for word embedding models: Araneum Russicum maximum and Russian national corpus. In: Analysis of Images, Social Networks and Texts. AIST 2017. Lecture Notes in Computer Science. Eds. W. M. P. van der Aalst et al. 10716 LNCS. Cham: Springer 2018. https://doi.org/10.1007/978-3-319-73013-4_5

RINGLSTETTER, Christoph – SCHULZ, Klaus – MIHOV, Stoyan: Orthographic Errors in Web Pages: Toward Cleaner Web Corpora. Computational Linguistics, 2006, 32(3), pp. 295–340.

ROSENTHAL: РОЗЕНТАЛЬ, Дитмар Э.: Справочник по правописанию и литературной правке. Москва: Айрис-пресс 2016. 368 с.

SHAPOVAL: ШАПОВАЛ, Виктор В.: Новые типы ошибок в письменной речи. In: Русский язык в школе, 2009, № 9, с. 76–83.

SHAVRINA – SOROKIN: ШАВРИНА, Татьяна О. – СОРОКИН, Алексей А.: Моделирование расширенной лемматизации для русского языка на основе морфологического парсера TnT-Russian. In: Компьютерная лингвистика и интеллектуальные технологии. По материалам ежегодной Международной конференции «Диалог». Ред. В. П. Селегей. Москва: Российский государственный гуманитарный университет 2015. URL: http://www.dialog-21.ru/digests/dialog2015/materials/pdf/ShavrinaTOSorokinAA.pdf.

SHAVRINA: ШАВРИНА, Татьяна Олеговна: Методы обнаружения и исправления опечаток: исторический обзор. In: Вопросы языкознания, 2017, № 4, с. 115–134.

eISSN: 1338-4287
Language: English

Publication frequency: 2 times per year
Journal subjects: Linguistics and Semiotics, Theoretical Frameworks and Disciplines, Linguistics, other

Published online: 17 Aug 2021

Pages: 977–985

DOI: https://doi.org/10.2478/jazcas-2022-0021

Keywords: corpora, web texts, errors, typos, orthography, typography, Russian language

© 2022 Maria Khokhlova, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
