Chinese Language Word Embeddings Based on the Corpus Hanku

Radovan Garabík

Open Access

Published Online: Aug 17, 2022

Journal of Linguistics/Jazykovedný časopis

Volume 72 (2021): Issue 4 (December 2021)

Building Web corpora as sources for linguistic research and its applications

Page range: 996 - 1004

DOI: https://doi.org/10.2478/jazcas-2022-0023

Keywords
word embeddings, Chinese, Pǔtōnghuà, corpus, NLP

© 2022 Radovan Garabík, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

BOJANOWSKI, Piotr – GRAVE, Edouard – JOULIN, Armand – MIKOLOV, Tomáš: Enriching word vectors with subword information. In: Transactions of the Association for Computational Linguistics, 2017, Vol. 5, pp. 135–146.

GAJDOŠ, Ľuboš – GARABÍK, Radovan – BENICKÁ, Jana: The New Chinese Webcorpus Hanku – Origin, Parameters, Usage. In: Studia Orientalia Slovaca, 2016, Vol. 15, No. 1, pp. 21–33.

GAJDOŠ, Ľuboš: The discrepancy between spoken and written Chinese methodological notes on linguistics. In: Studia Orientalia Slovaca, 2011, Vol. 10, No. 1, pp. 155–159.

GAJDOŠ, Ľuboš: Čínsky jazyk a čínske písmo. In: Historická revue, 2012, Vol. 23, No. 7, pp. 47–50.

GAJDOŠ, Ľuboš: Synsémantické slová v rámci stratifikácie čínskeho jazyka. In: Miscellanea Asiae Orientalis Slovaca. Bratislava: Univerzita Komenského 2014, pp. 121–131.

GARABÍK, Radovan: Word Embedding Based on Large-Scale Web Corpora as a Powerful Lexicographic Tool. In: Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje, 2020, Vol. 46, No. 2, pp. 603–618.

中华人民共和国中央人民政府: 国务院关于推广普通话的指示, 1956. Available online: http://www.gov.cn/test/2005-08/02/content_19132.htm

HANSELL, Mark: The Sino-Alphabet: The Assimilation of Roman Letters into the Chinese Writing System. In: Sino-Platonic Papers, 1994, Vol. 45, pp. 1–28.

MICHELFEIT, Jan – POMIKÁLEK, Jan – SUCHOMEL, Vít: Text Tokenisation Using unitok. In: 8th Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU 2014, pp. 71–75.

MIKOLOV, Tomáš – CHEN, Kai – CORRADO, Greg – DEAN, Jeffrey: Efficient Estimation of Word Representations in Vector Space. In: Proceedings of Workshop at ICLR 2013.

ŘEHŮŘEK, Radim – SOJKA, Petr: Software Framework for Topic Modelling with Large Corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45–50.

ŞENEL, Lutfi Kerem – UTLU, İhsan – YÜCESOY, Veysel – KOÇ, Aykut – ÇUKUR, Tolga: Semantic structure and interpretability of word embeddings. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, Vol. 26, No. 10, pp. 1769–1779.

SPROAT, Richard W. – SHIH, Chilin – GALE, William – CHANG, Nancy: A stochastic finite-state word-segmentation algorithm for Chinese. In: Computational Linguistics, 1996, Vol. 22, No. 3, pp. 377–404.

ZHANG, Yue – CLARK, Stephen: Syntactic Processing Using the Generalized Perceptron and Beam Search. In: Computational Linguistics, 2011, Vol. 37, No. 1, pp. 105–151.

eISSN: 1338-4287
Language: English
Publication timeframe: 2 times per year
Journal Subjects: Linguistics and Semiotics, Theoretical Frameworks and Disciplines, Linguistics, other
