Linguistic Annotation of Translated Chinese Texts: Coordinating Theory, Algorithms and Data

eISSN: 1338-4287
Language: English

Publication timeframe: 2 times per year
Journal Subjects: Linguistics and Semiotics, Theoretical Frameworks and Disciplines, Linguistics, other

Published Online: Dec 30, 2021

Page range: 590–602

DOI: https://doi.org/10.2478/jazcas-2021-0054

Keywords: Mandarin, Russian, parallel corpus, Chinese word segmentation (CWS), grapheme-to-phoneme conversion (G2P), PoS-tagging, code-switching detection

© 2021 Kirill I. Semenov et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
