Volume 72 (2022): Issue 4 (June 2022)
Building Web corpora as sources for linguistic research and its applications
Journal details
Format: Journal
eISSN: 1338-4287
First published: 05 Mar 2010
Publication schedule: 2 times per year
Languages: English
Access type: Open access

Chinese Language Word Embeddings Based on the Corpus Hanku

Published online: 17 Aug 2022
Volume & Issue: Volume 72 (2022) - Issue 4 (June 2022) - Building Web corpora as sources for linguistic research and its applications
Pages: 996 - 1004
Abstract

Vector models based on word embeddings are an indispensable part of advanced Natural Language Processing research and language analysis. We describe several word embedding models for Chinese (Pǔtōnghuà) and how they differ from models of “western” languages because of specific orthographic and linguistic features of written Chinese. We also introduce a publicly available web interface for querying the vector models, aimed at linguistically or pedagogically oriented users.

