A Lexicon-Corpus-based Unsupervised Chinese Word Segmentation Approach

This paper presents a Lexicon-Corpus-based Unsupervised (LCU) approach to Chinese word segmentation. It combines the advantages of the lexicon-based and corpus-based approaches to identify out-of-vocabulary (OOV) words while keeping the segmentation of the words that actually occur in a text consistent. A Forward Maximum Fixed-count Segmentation (FMFS) algorithm is developed to identify phrases in texts as a first step. The detailed rules of LCU and the experimental results are also presented. Compared with a purely lexicon-based or purely corpus-based approach, LCU substantially improves Chinese word segmentation, especially in identifying n-char words. Two evaluation indexes are also proposed to describe the effectiveness of phrase extraction: the segmentation rate (S) and the segmentation consistency degree (D).
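For readers unfamiliar with the lexicon-based baseline mentioned in the abstract, the following is a minimal sketch of classic forward maximum matching, a common lexicon-based segmentation technique. It is not the FMFS algorithm or the LCU rules proposed in the paper, whose details are not given here. The function name, the `max_len` parameter, and the toy lexicon are assumptions made purely for illustration.

```python
def forward_maximum_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation against a lexicon.

    Illustrative baseline only (classic forward maximum matching),
    not the paper's FMFS algorithm. `max_len` is an assumed cap on
    the word length, in characters.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"中文", "分词", "方法"}

in_vocabulary = forward_maximum_match("中文分词方法", toy_lexicon)
out_of_vocabulary = forward_maximum_match("中文新词", toy_lexicon)
print(in_vocabulary)
print(out_of_vocabulary)
```

In the second call, "新词" is absent from the toy lexicon, so the baseline falls back to single characters. This is the kind of OOV failure that motivates combining lexicon evidence with corpus statistics.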

Language: English

Publication frequency: 1 time per year
Journal subjects: Engineering, Introductions and Overviews, Engineering, other

Lu Pengyu

Pu Jingchuan

Du Mingming

Lou Xiaojuan

Jin Lijun

Published online: 01 Mar 2014

Pages: 263 - 282

DOI: https://doi.org/10.21307/ijssis-2017-655

Keywords: Chinese word segmentation, lexicon-based, corpus-based, word frequency, natural language processing

© 2014 Lu Pengyu et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
