Linguistic Annotation of Translated Chinese Texts: Coordinating Theory, Algorithms and Data

This article addresses the problems of linguistic annotation of the Chinese texts in Ruzhcorp, the Russian-Chinese Parallel Corpus of the RNC, and proposes ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present a theoretical comparison of the widespread standards for Chinese text processing. On the other hand, we describe our experiments in three areas (word segmentation, grapheme-to-phoneme conversion, and PoS-tagging) on corpus data that contains many transliterations and loanwords. As a result, we propose a preprocessing pipeline for Chinese texts that will be implemented in Ruzhcorp.

Language: English

Publication frequency: 2 issues per year
Journal subjects: Linguistics and Semiotics, Theoretical Frameworks and Disciplines, Linguistics, other

Kirill I. Semenov

Armine K. Titizian

Aleksandra O. Piskunova

Yulia O. Korotkova

Alena D. Tsvetkova

Elena A. Volf

Alexandra S. Konovalova

Yulia N. Kuznetsova

Published online: 30 Dec 2021

Page range: 590-602

DOI: https://doi.org/10.2478/jazcas-2021-0054

Keywords: Mandarin, Russian, parallel corpus, Chinese word segmentation (CWS), grapheme-to-phoneme conversion (G2P), PoS-tagging, code-switching detection

© 2021 Kirill I. Semenov et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
