Chinese Segmentation Using a Word-based Perceptron Algorithm
Yue Zhang and Stephen Clark
abstract
Standard approaches to Chinese word segmentation treat the problem as a tagging task, assigning labels to the characters in the sequence indicating whether the character marks a word boundary. Discriminatively trained models based on local character features are used to make the tagging decisions, with Viterbi decoding finding the highest scoring segmentation. In this paper we propose an alternative, word-based segmentor, which uses features based on complete words and word sequences. The generalized perceptron algorithm is used for discriminative training, and we use a beam search decoder. Closed tests on the first and second SIGHAN bakeoffs show that our system is competitive with the best in the literature, achieving the highest reported F-score for a number of corpora.
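The combination of word-level features, perceptron training, and beam search decoding described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the feature map (word unigrams and word bigrams only), the function names `features`, `score`, `decode`, and `train`, the default beam size, and the plain unaveraged update. The toy data uses Latin letters in place of Chinese characters.

```python
from collections import defaultdict


def features(words):
    """Hypothetical feature map: counts of word unigrams and word bigrams."""
    feats = defaultdict(int)
    prev = "<s>"  # assumed sentence-start symbol
    for w in words:
        feats[("w", w)] += 1
        feats[("ww", prev, w)] += 1
        prev = w
    return feats


def score(words, weights):
    """Linear model score of a (partial) segmentation."""
    return sum(weights.get(f, 0.0) * v for f, v in features(words).items())


def decode(sentence, weights, beam_size=8):
    """Beam search: each character either starts a new word or
    extends the last word of a partial segmentation."""
    beam = [[]]
    for ch in sentence:
        candidates = []
        for words in beam:
            candidates.append(words + [ch])  # start a new word
            if words:
                candidates.append(words[:-1] + [words[-1] + ch])  # extend last word
        candidates.sort(key=lambda ws: score(ws, weights), reverse=True)
        beam = candidates[:beam_size]
    return beam[0]


def train(data, epochs=5, beam_size=8):
    """Perceptron training: on a wrong prediction, add the gold
    features and subtract the predicted ones (no averaging here)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for gold in data:
            pred = decode("".join(gold), weights, beam_size)
            if pred != gold:
                for f, v in features(gold).items():
                    weights[f] += v
                for f, v in features(pred).items():
                    weights[f] -= v
    return weights


# Toy usage: one training sentence, segmented as "ab" + "c".
weights = train([["ab", "c"]])
print(decode("abc", weights))  # ['ab', 'c']
```

Because candidates are scored as whole word sequences, features can refer to complete words, which a character-tagging model with local features cannot do directly. The sketch rescores each candidate from scratch for brevity; a practical decoder would score incrementally.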
info
address | Prague, Czech Republic |
book title | Proceedings of ACL |
month | June |
year | 2007 |