1
The Second International Chinese Word Segmentation Bakeoff
Coordinated by Thomas Emerson
2
Roadmap
Contest Details
–Corpora, Tracks, and Sites
Results
–Baselines and Measures
Discussion
Thanks
3
Corpora
Four corpora: 2 simplified chars, 2 traditional
All provide ground truth and segmentation standard
4
Tracks and Sites
Two tracks:
–Open: Participants may use any data to train
  External lexica, POS information, etc.
–Closed: Sites may ONLY use training data set
23 participating sites completed bakeoff
–9 PRC, 4 HK, 4 US, 2 TW, 1 GB, 1 JP, 1 SG
130 runs submitted
5
Results
Baseline: L-to-R MaxMatch w/ training vocab: 0.83-0.93
Topline: L-to-R MaxMatch w/ test truth vocab: 0.99
Measures: Recall, Precision, F-measure
–Recall on OOV, Recall on in-vocab
Best F-score: Open 0.972, median 0.941
–Best closed: 0.964 (on MSR corpus)
–Best OOV recall: Open 0.872; Closed 0.813
Vs 2003: best F-score: 0.961: now 17 reach
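The baseline and the measures on this slide can be illustrated with a small sketch. It assumes a greedy left-to-right maximum-matching segmenter and word-level scoring by character spans; the function names `max_match`, `spans`, and `score` are hypothetical, and this is not the bakeoff's official scoring script.

```python
def max_match(text, vocab):
    """Greedy left-to-right maximum matching (illustrative baseline)."""
    # Longest word in the vocabulary bounds the lookahead; never below 1.
    max_len = max([len(w) for w in vocab] + [1])
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


def spans(words):
    """Map a word sequence to (start, end) character offsets."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out


def score(gold, pred, train_vocab):
    """Word-level precision, recall, F-measure, plus OOV and in-vocab recall."""
    pred_spans = set(spans(pred))
    # A gold word counts as correct only if both of its boundaries match.
    hits = [s in pred_spans for s in spans(gold)]
    correct = sum(hits)
    precision = correct / len(pred)
    recall = correct / len(gold)
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    oov = [h for w, h in zip(gold, hits) if w not in train_vocab]
    iv = [h for w, h in zip(gold, hits) if w in train_vocab]
    return {
        "precision": precision,
        "recall": recall,
        "f": f,
        # None when the gold text has no words of that kind.
        "oov_recall": sum(oov) / len(oov) if oov else None,
        "iv_recall": sum(iv) / len(iv) if iv else None,
    }


if __name__ == "__main__":
    gold = ["研究", "生命", "起源"]
    text = "".join(gold)
    train_vocab = {"研究", "研究生", "生命", "起源"}  # toy training vocabulary
    baseline = max_match(text, train_vocab)   # greedy match picks 研究生 first
    topline = max_match(text, set(gold))      # vocabulary taken from test truth
    print(baseline, score(gold, baseline, train_vocab))
    print(topline, score(gold, topline, train_vocab))
```

In this toy case the training vocabulary misleads the greedy matcher, while a vocabulary drawn from the test truth recovers the gold segmentation, which mirrors the gap between the baseline and topline figures.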
6
Results
AS Closed: NAIST, Stanford
AS Open: SG, Yahoo!, Sheffield
MSR Closed: Stanford, UHK, Yahoo!
MSR Open: Harbin, SG, UHK
7
Thanks & Future
Thanks to participants and providers
–Academia Sinica, ICL Beijing, CUHK, MSRA
Future Bakeoffs:
–Different training/test registers?
–Additional tasks? NER?
–Suggestions?