The Second International Chinese Word Segmentation Bakeoff Coordinated by Thomas Emerson
Roadmap
Contest Details
–Corpora, Tracks, and Sites
Results
–Baselines and Measures
Discussion
Thanks
Corpora
Four corpora: 2 in simplified characters, 2 in traditional characters
All provide ground truth and a segmentation standard
Tracks and Sites
Two tracks:
–Open: Participants may use any data to train (external lexica, POS information, etc.)
–Closed: Sites may ONLY use the training data set
23 participating sites completed the bakeoff
–9 PRC, 4 HK, 4 US, 2 TW, 1 GB, 1 JP, 1 SG
130 runs submitted
Results
Baseline: L-to-R MaxMatch w/training vocab:
Topline: L-to-R MaxMatch w/test truth vocab: 0.99
Measures: Recall, Precision, F-measure
–Recall on OOV, Recall on in-vocab
Best F-score: Open 0.972, median
–Best closed: (on MSR corpus)
–Best OOV recall: Open 0.872; Closed
Vs 2003: best F-score: 0.961: now 17 reach
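The baseline and the measures named on this slide can be sketched in a few lines. The following is a minimal illustration, not the bakeoff's official scorer: the function names (max_match, spans, score, oov_recall) are hypothetical, and it assumes that a predicted word counts as correct only when both of its character boundaries match the ground truth.

```python
def max_match(text, vocab, max_len=None):
    """Greedy left-to-right maximum matching over a known vocabulary."""
    if max_len is None:
        max_len = max((len(w) for w in vocab), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when nothing in the vocabulary matches at this position.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


def spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def score(gold, pred):
    """Return (precision, recall, F-measure) over matching word spans."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def oov_recall(gold, pred, train_vocab):
    """Recall restricted to gold words absent from the training vocabulary."""
    p = spans(pred)
    pos, total, hit = 0, 0, 0
    for w in gold:
        span = (pos, pos + len(w))
        pos += len(w)
        if w not in train_vocab:
            total += 1
            hit += span in p
    return hit / total if total else 0.0
```

Under these assumptions, calling max_match with a vocabulary drawn from the training data gives a baseline-style segmentation, while building the vocabulary from the test ground truth gives a topline-style one. Recall on in-vocabulary words can be obtained the same way as oov_recall by inverting the membership test.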
Results
AS Closed: NAIST, Stanford
AS Open: SG, Yahoo!, Sheffield
MSR Closed: Stanford, UHK, Yahoo!
MSR Open: Harbin, SG, UHK
Thanks & Future
Thanks to participants and providers
–Academia Sinica, ICL Beijing, CUHK, MSRA
Future Bakeoffs:
–Different training/test registers?
–Additional tasks? NER?
–Suggestions?