A Lexicon-Corpus-based Unsupervised Chinese Word Segmentation Approach
Published online: 01 Mar 2014
Pages: 263 - 282
DOI: https://doi.org/10.21307/ijssis-2017-655
© 2014 Lu Pengyu et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
This paper presents a Lexicon-Corpus-based Unsupervised (LCU) approach to Chinese word segmentation. It combines the advantages of the lexicon-based and corpus-based approaches in order to identify out-of-vocabulary (OOV) words while keeping the segmentation of the actual words in a text consistent. A Forward Maximum Fixed-count Segmentation (FMFS) algorithm is developed to identify phrases in texts as a first step. The detailed rules of LCU and the experimental results are also presented. Compared with a purely lexicon-based or purely corpus-based approach, LCU substantially improves Chinese word segmentation, especially in identifying n-char words. Two evaluation indexes are also proposed to describe the effectiveness of phrase extraction: the segmentation rate (S) and the segmentation consistency degree (D).
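For readers unfamiliar with the lexicon-based baseline the abstract refers to, the sketch below shows classic forward maximum matching, which scans left to right and takes the longest dictionary word at each position. It is not the paper's FMFS algorithm or the LCU rules, whose details are not given here. The function name `forward_maximum_match`, the `max_len` parameter, and the toy lexicon are assumptions chosen only for illustration.

```python
def forward_maximum_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation using a word lexicon.

    Illustrative baseline only: at each position, take the longest
    substring (up to max_len characters) found in the lexicon, and
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted, so the loop terminates.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, not taken from the paper.
lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_maximum_match("研究生命起源", lexicon))
# → ['研究生', '命', '起源']
```

The greedy match picks 研究生 and leaves 命 stranded, although 研究 / 生命 / 起源 is the intended reading. Words missing from the lexicon likewise fall apart into single characters, which is the OOV weakness that motivates combining a lexicon with corpus evidence.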