A Lexicon-Corpus-based Unsupervised Chinese Word Segmentation Approach

This paper presents a Lexicon-Corpus-based Unsupervised (LCU) approach to Chinese word segmentation. It combines the advantages of the lexicon-based and corpus-based approaches to identify out-of-vocabulary (OOV) words while keeping the segmentation of the words that actually occur in a text consistent. A Forward Maximum Fixed-count Segmentation (FMFS) algorithm is developed to identify phrases in texts first. The detailed rules of LCU and the experimental results are also presented. Compared with either the lexicon-based or the corpus-based approach alone, LCU considerably improves Chinese word segmentation, especially in identifying n-char words. Two evaluation indexes are also proposed to describe the effectiveness of phrase extraction: the segmentation rate (S) and the segmentation consistency degree (D).
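For readers unfamiliar with the lexicon-based family the abstract refers to, the following is a minimal sketch of classic forward maximum matching, a common lexicon-based baseline. It is not the paper's FMFS algorithm or the LCU rules, which the abstract does not specify. The function name `forward_maximum_match`, the `max_len` parameter, and the toy lexicon are all assumptions made for illustration.

```python
def forward_maximum_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon word at each position.

    Illustrative baseline only; `max_len` (longest word length tried) is an assumed parameter.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback, so text
            # not covered by the lexicon (e.g. OOV words) is split into characters.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, chosen to show a well-known greedy error.
toy_lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_maximum_match("研究生命起源", toy_lexicon))
```

On this toy input the greedy match picks "研究生" first and strands "命" as a single character, although "研究 / 生命 / 起源" is the intended reading. Errors of this kind, together with the character-level splitting of words missing from the lexicon, are the sort of limitation that motivates combining lexicon evidence with corpus evidence.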

eISSN: 1178-5608
Language: English

Publication timeframe: Volume Open
Journal Subjects: Engineering, Introductions and Overviews, other

Published Online: Mar 01, 2014

Page range: 263 - 282

DOI: https://doi.org/10.21307/ijssis-2017-655

© 2014 Lu Pengyu et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.