OXFORD UNIVERSITY COMPUTING LABORATORY

Chinese Segmentation Using a Word-based Perceptron Algorithm

Yue Zhang and Stephen Clark

abstract

Standard approaches to Chinese word segmentation treat the problem as a tagging task, assigning labels to the characters in the sequence that indicate whether each character marks a word boundary. Discriminatively trained models based on local character features are used to make the tagging decisions, with Viterbi decoding finding the highest scoring segmentation. In this paper we propose an alternative, word-based segmentor, which uses features based on complete words and word sequences. The generalized perceptron algorithm is used for discriminative training, and we use a beam search decoder. Closed tests on the first and second SIGHAN bakeoffs show that our system is competitive with the best in the literature, achieving the highest reported F-score for a number of corpora.
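
The combination of word-based features, perceptron training and beam search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature set (word unigrams and bigrams), the function names features, score, decode and train, the beam size, and the decision to score an unfinished final word as if it were complete are all assumptions made for illustration. The sketch also uses the plain perceptron update and omits refinements such as weight averaging.

```python
from collections import defaultdict


def features(words):
    """Hypothetical word-based features: word unigrams and word bigrams."""
    feats = []
    prev = "<s>"  # assumed sentence-start marker
    for w in words:
        feats.append(("w", w))
        feats.append(("ww", prev, w))
        prev = w
    return feats


def score(words, weights):
    """Linear model score: sum of the weights of the features that fire."""
    return sum(weights[f] for f in features(words))


def decode(sentence, weights, beam_size=8):
    """Beam search over characters, left to right.

    Each candidate is a list of words. For every new character a candidate
    is extended in two ways: the character is appended to the last word,
    or it starts a new word. Only the beam_size best candidates are kept.
    """
    if not sentence:
        return []
    beam = [[sentence[0]]]
    for ch in sentence[1:]:
        candidates = []
        for words in beam:
            candidates.append(words[:-1] + [words[-1] + ch])  # extend last word
            candidates.append(words + [ch])                   # start a new word
        candidates.sort(key=lambda ws: score(ws, weights), reverse=True)
        beam = candidates[:beam_size]
    return beam[0]


def train(data, epochs=5, beam_size=8):
    """Perceptron training: decode each sentence, update on mistakes.

    data is a list of gold segmentations, each a list of words.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for gold in data:
            pred = decode("".join(gold), weights, beam_size)
            if pred != gold:
                for f in features(gold):
                    weights[f] += 1.0  # reward features of the gold output
                for f in features(pred):
                    weights[f] -= 1.0  # penalise features of the wrong output
    return weights
```

Calling train on a list of gold segmentations returns a weight table, and decode(sentence, weights) then returns a list of words whose concatenation is the input sentence. Because the features are defined over whole words, the decoder has to search over candidate segmentations directly instead of tagging characters one at a time, which is why beam search replaces Viterbi decoding in this setting.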

info

address: Prague, Czech Republic

book title: Proceedings of ACL

month: June

year: 2007
