The Second International Chinese Word Segmentation Bakeoff Coordinated by Thomas Emerson.


1 The Second International Chinese Word Segmentation Bakeoff Coordinated by Thomas Emerson

2 Roadmap
Contest Details
–Corpora, Tracks, and Sites
Results
–Baselines and Measures
Discussion
Thanks

3 Corpora
Four corpora: 2 in simplified characters, 2 in traditional
All provide ground truth and a segmentation standard

4 Tracks and Sites
Two tracks:
–Open: Participants may use any data to train (external lexica, POS information, etc.)
–Closed: Sites may ONLY use the training data set
23 participating sites completed the bakeoff
–9 PRC, 4 HK, 4 US, 2 TW, 1 GB, 1 JP, 1 SG
130 runs submitted

5 Results
Baseline: L-to-R MaxMatch w/ training vocab: 0.83-0.93
Topline: L-to-R MaxMatch w/ test truth vocab: 0.99
Measures: Recall, Precision, F-measure
–Recall on OOV, Recall on in-vocab
Best F-score: Open 0.972, median 0.941
–Best closed: 0.964 (on MSR corpus)
–Best OOV recall: Open 0.872; Closed 0.813
Vs 2003: best F-score: 0.961: now 17 reach
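The baseline and measures on this slide can be illustrated with a minimal sketch. It assumes greedy left-to-right maximum matching against a word list, and it scores a segmentation by comparing character spans of predicted words with those of the gold words. The function names (`max_match`, `score`, `oov_recall`) and the toy vocabulary are hypothetical and chosen only for illustration; the bakeoff's own scoring procedure may differ in its details.

```python
from typing import List, Optional, Set, Tuple


def max_match(text: str, vocab: Set[str]) -> List[str]:
    """Greedy left-to-right maximum matching (illustrative baseline)."""
    max_len = max((len(w) for w in vocab), default=1)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


def spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    out = set()
    start = 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def score(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure); a word counts as correct
    only if both of its boundaries match the gold segmentation."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(p)
    recall = correct / len(g)
    return precision, recall, 2 * precision * recall / (precision + recall)


def oov_recall(gold: List[str], pred: List[str],
               vocab: Set[str]) -> Optional[float]:
    """Recall restricted to gold words absent from the training vocabulary.
    Returns None when the gold segmentation contains no OOV words."""
    p = spans(pred)
    start = total = hit = 0
    for w in gold:
        if w not in vocab:
            total += 1
            hit += (start, start + len(w)) in p
        start += len(w)
    return hit / total if total else None


# Toy example (hypothetical vocabulary, not bakeoff data).
vocab = {"研究", "研究生", "生命", "命", "起源", "的"}
gold = ["研究", "生命", "的", "起源"]
pred = max_match("研究生命的起源", vocab)
print(pred, score(gold, pred))
```

In the toy example the greedy matcher prefers the longer word 研究生 and so misplaces a boundary, which is the kind of error that keeps a MaxMatch baseline below the submitted systems. Using the vocabulary of the test truth instead of the training vocabulary removes OOV words, which corresponds to the topline setting on the slide.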

6 Results
AS Closed: NAIST, Stanford
AS Open: SG, Yahoo!, Sheffield
MSR Closed: Stanford, UHK, Yahoo!
MSR Open: Harbin, SG, UHK

7 Thanks & Future
Thanks to participants and providers
–Academia Sinica, ICL Beijing, CUHK, MSRA
Future Bakeoffs:
–Different training/test registers?
–Additional tasks? NER?
–Suggestions?

