1
SCTB: A Chinese Treebank in Scientific Domain
Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara, and Sadao Kurohashi
ALR 2016
2
The Significantly Increasing Need for Chinese Analysis in the Scientific Domain
Intelligent access to Chinese scientific text, such as:
Text mining
Knowledge discovery
Translation
(Charts: # scientific papers and # patents, with China labeled)
3
Segmentation and POS tagger
Chinese Analysis
Input: 烟草使用者 (tobacco user)
Segmentation and POS tagger: 烟草_NN 使用_VV 者_SFN (tobacco / use / person)
Syntactic parser
Treebank
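The tagged output on this slide uses a `word_TAG` notation. As a minimal sketch, assuming that notation is also used as a whitespace-separated text serialization (the slides do not say so), the tokens can be split back into (word, tag) pairs. The function name `parse_tagged` is hypothetical and is not part of any tool named in the slides.

```python
def parse_tagged(text):
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs.

    Assumes tokens are separated by whitespace and that the last
    underscore in each token separates the word from its POS tag.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


# Example taken from the slide: 烟草使用者 (tobacco user)
tagged = "烟草_NN 使用_VV 者_SFN"
pairs = parse_tagged(tagged)

# Joining the words recovers the unsegmented input string.
surface = "".join(word for word, _ in pairs)
```

Here `者` carries the tag `SFN`, which is not a CTB tag; it appears to be one of the additional bound-morpheme tags mentioned in the annotation standard.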
4
Chinese Treebanks

Treebank          Size          Syntactic structure   Domain
CTB [Xue+ 2005]   18k in CTB5   phrase                News
PKU [Yu+ 2003]    14k           both                  News
HIT [Che+ 2012]   50k           dependency            News

There are no available Chinese treebanks in the scientific domain!
5
Chinese Analysis Accuracies (with analyzers trained on CTB5)
Solution: construct a Chinese treebank in the scientific domain (SCTB)
(Chart: F-Measure (%))
6
SCTB Annotation: Raw Sentence Selection
Randomly selected from the parallel data in the LCAS (National Science Library, Chinese Academy of Sciences) corpus
As there are corresponding translated Japanese and English sentences, we leave open the possibility of developing our treebank into a trilingual one
7
SCTB Annotation: Annotation Standard
Phrase structure: follows the annotation standard of CTB [Xue+ 2005]
NEW! POS tagging: follows CTB [Xue+ 2005], but with 6 additional tags for bound morphemes
Segmentation: short and consistent units [Shen+ COLING 2016] (12/13 14:00~)
(Callout: one word in conventional segmentation standards)
8
SCTB Annotation: Annotation Process
Used the SynTree toolkit as the annotation interface
Two annotators, with inter-annotator agreements of 98.95%, 97.78%, and 95.05% for segmentation, POS tagging, and parsing, respectively
Finished 5,133 sentences in 6 months at the submission time of this paper (now about 8,000 sentences)
9
Experimental Settings for Chinese Analysis
Chinese analyzers:
KyotoMorph [Shen+ 2014] for segmentation and POS tagging
Berkeley parser [Petrov+ 2007] for syntactic parsing
Comparison (tested on 200 scientific sentences):
Baseline: trained on CTB5
Baseline+SCTB: trained on CTB5 with SCTB added
10
Chinese Analysis Accuracies with SCTB
F-Measure (%)
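The slides report F-measure without defining it. A minimal sketch, assuming the usual balanced F-measure (the harmonic mean of precision and recall), is given here; the precision and recall values are purely illustrative and are not results from the slides.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Illustrative values only (not taken from the slides).
example_f = f_measure(0.9, 0.8)
print(f"F-measure: {100 * example_f:.2f}%")
```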
11
An Improved Parsing Example
(Parse trees shown for: Gold, Baseline, Baseline+SCTB)
被动 (passive) / 肌肉 (muscle) / 骨骼 (skeleton) / 亚 (sub) / 系统 (system)
12
Learning Curve (%)
13
Machine Translation Experiments
Shared MT tasks:
ASPEC-CJ: Chinese-to-Japanese in the scientific domain [Nakazawa+ 2015]
NTCIR-CE: Chinese-to-English in the patent domain [Goto+ 2013]
MT system: Moses tree-to-string [Koehn+ 2007]

System          ASPEC-CJ (BLEU)   NTCIR-CE (BLEU)
Baseline        39.12             33.19
Baseline+SCTB   40.08             33.90
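The BLEU gains from adding SCTB can be read directly off the scores on this slide. The short sketch below only subtracts the reported numbers; the dictionary layout and variable names are illustrative.

```python
# BLEU scores as reported on the slide.
bleu = {
    "ASPEC-CJ": {"Baseline": 39.12, "Baseline+SCTB": 40.08},
    "NTCIR-CE": {"Baseline": 33.19, "Baseline+SCTB": 33.90},
}

# Absolute BLEU gain of Baseline+SCTB over Baseline, per task,
# rounded to two decimals to hide floating-point noise.
gains = {
    task: round(scores["Baseline+SCTB"] - scores["Baseline"], 2)
    for task, scores in bleu.items()
}

for task, gain in gains.items():
    print(f"{task}: +{gain:.2f} BLEU")
```

Both tasks improve by less than one BLEU point: 0.96 on ASPEC-CJ and 0.71 on NTCIR-CE.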
14
An Improved MT Example
(Outputs shown for: Baseline, Baseline+SCTB)
15
Conclusion
We constructed SCTB: a Chinese treebank in the scientific domain
SCTB 1.0 has been released for free!
Future Work
Annotate more sentences
Develop it into a trilingual treebank