
Splitting-merging model of Chinese word tokenization and segmentation

Published online by Cambridge University Press:  01 December 1998

YUAN YAO
Affiliation:
Department of Information Systems & Computer Science, National University of Singapore, Lower Kent Ridge Road, Singapore 119260
KIM TENG LUA
Affiliation:
Department of Information Systems & Computer Science, National University of Singapore, Lower Kent Ridge Road, Singapore 119260

Abstract

Word tokenization and segmentation remain active research topics in natural language processing, especially for languages such as Chinese, in which no blank spaces delimit words. Three major problems arise: (1) tokenizing direction and efficiency; (2) insufficient tokenization dictionaries and new words; and (3) ambiguity in tokenization and segmentation. Most existing tokenization and segmentation methods have not addressed these problems together. To tackle all three problems within a single framework, this paper presents a novel dictionary-based method, the Splitting-Merging Model (SMM), for Chinese word tokenization and segmentation. It uses the mutual information of Chinese characters to identify the boundaries and non-boundaries of Chinese words, resolving ambiguities and detecting new words to produce a final word segmentation.
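The abstract states only that mutual information between Chinese characters is used to locate word boundaries and non-boundaries. The sketch below, which is not the authors' implementation, illustrates that core statistic: MI(a, b) = log2(P(a, b) / (P(a) P(b))) is estimated from a corpus, and a low-MI character pair is treated as a candidate split point while a high-MI pair is merged. The threshold value, function names, and the simple left-to-right scan are illustrative assumptions, not details from the paper.

```python
import math
from collections import Counter

def mutual_information(corpus: str):
    """Estimate MI(a, b) = log2(P(a, b) / (P(a) * P(b))) for adjacent character pairs."""
    chars = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    n_chars = sum(chars.values())
    n_bigrams = sum(bigrams.values())
    mi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bigrams
        p_a = chars[a] / n_chars
        p_b = chars[b] / n_chars
        mi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return mi

def segment(sentence: str, mi, threshold: float = 0.0):
    """Split between two characters whose MI falls below the threshold (likely boundary);
    otherwise merge them into the current word (likely non-boundary)."""
    if not sentence:
        return []
    words, current = [], sentence[0]
    for a, b in zip(sentence, sentence[1:]):
        if mi.get((a, b), float("-inf")) >= threshold:
            current += b            # high MI: characters likely belong to the same word
        else:
            words.append(current)   # low MI: likely word boundary, start a new word
            current = b
    words.append(current)
    return words

# Illustrative usage: statistics would normally come from a large Chinese corpus.
mi = mutual_information("研究生命科学研究生命科学")
print(segment("研究生命科学", mi))
```

In the SMM described by the paper, these boundary/non-boundary decisions additionally interact with a dictionary to resolve ambiguities and detect new words; the sketch shows only the mutual-information component.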

Type
Research Article
Copyright
© 1998 Cambridge University Press
