Adaptive segmentation for Chinese DDL

1 Adaptive segmentation for Chinese DDL
Simon Smith, Coventry University, UK

2 Collaborators on EPSRC bid
With many thanks to:
Hilary Nesi, Coventry University
Miloš Jakubíček, Lexical Computing, Brno
Nicole Keng, University of Vaasa
Serge Sharoff, University of Leeds
Siân Alsop, Coventry University

3 Talk sketch
Chinese learning and ICT
Sketch Engine and SkELL
Wordhood and segmentation
Adaptive segmentation
Planned implementation on SkELL

4 Importance of Chinese teaching/learning
Mandarin Chinese has the most native speakers of any language.
Secondary school uptake has increased in recent years: GCSE entries were up 18% in 2015, despite an overall decline in language exam entries (Guardian 2015).
Coventry University: first and only Confucius Institute in the West Midlands.

5 ICT in Chinese learning
Good provision in some respects




9 What is missing?
Authentic texts to use
Vocabulary in use
Collocations in context
Whereas for English…
Learner dictionaries based on corpora since the 80s
Data-driven language learning (DDL) since the 90s

10 Acclaimed corpus analysis platform
Sketch Engine
Used by most English learner dictionaries (Macmillan, Cambridge etc.)
Resources for many languages, including Chinese
Suitable for expert users
Can be used by teachers/learners
Sketch Engine for Language Learning
Free on web
Only English & Russian currently available
Suitable for teachers/learners
Extensible to other languages
Chinese an in-demand language

11 Chinese corpora on Sketch Engine
(not SkELL, however)

12 Sketch Engine for Language Learning
Light-touch corpus tool for language learners.
English, Czech, Russian available
What is being taught here?

13 Sketch Engine for Language Learning
What is being taught here?

14 Sketch Engine for Language Learning
Collocation is being taught here

15 Chinese Applied Corpus Linguistics Symposium, Lancaster University, 11/9/2018
Please find words that commonly go with 造成


17 Q: How are these grammatical relations found? A: gramrels
Simple VO rule for English:
1:"V" "(DET|NUM|ADJ|ADV|N)"* 2:"N"
Adjectival Modifier/Modifies in Chinese:
2:"V.*" [word="的"] modifying_noun{0,2} 1:noun not_noun
Object-fronting direct object rule for Chinese:
[word="把"|word="将"] NP adv{0,2} 1:"VV" (particle|prep)? NP1 noun
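To make the simple English VO rule concrete, here is a minimal Python sketch of what such a pattern does over POS-tagged tokens. It is an illustration only, not the Sketch Engine implementation: the function name, the simplified tag names and the choice to report every noun reachable through the filler run are all assumptions.

```python
from typing import List, Tuple

# Tags allowed between the verb and its object, as in the slide's rule.
# Simplified tag names are an assumption; real tagsets differ.
FILLER_TAGS = {"DET", "NUM", "ADJ", "ADV", "N"}


def find_verb_object_pairs(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (verb, object) pairs matching: V (DET|NUM|ADJ|ADV|N)* N."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "V":
            continue
        j = i + 1
        # Walk through the run of filler tags after the verb;
        # every noun met on the way is a candidate object.
        while j < len(tagged) and tagged[j][1] in FILLER_TAGS:
            if tagged[j][1] == "N":
                pairs.append((word, tagged[j][0]))
            j += 1
    return pairs


example = [("cause", "V"), ("a", "DET"), ("serious", "ADJ"), ("problem", "N")]
print(find_verb_object_pairs(example))  # [('cause', 'problem')]
```

The labels 1: and 2: in the rule mark the two words that end up in the word sketch; in this sketch they correspond to the first and second element of each returned pair.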

18 Chinese NLP steps…
But to do segmentation, we must first define "word"!
POS tagging (詞類標記)
Sentence parsing (for gramrel)

19 Wordhood
Trivial for English?
Blackbird, greenfinch vs red admiral
Don't; runner-up
Whitespace: an analysis of convenience
But at least it is available
Used by NLP
Chinese, Japanese, Latin (in antiquity)
No whitespace
Japanese and Latin have inflection

20 The Chinese word for “word”
Not really 字
Not really 词
Chao's (1968: 136) sociological word = 字
字 = character ≈ morpheme ≈ syllable
People are not 100% in agreement over word boundaries:

21 Segmentation software privileges long words
zhōng guó rén
middle country person

22 中国人
[中国人]  Lexicography standard
[中国] [人]  Learners: word level
[中] [国] [人]  Learners: intra-word
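One way to hold several granularities for the same string is a small tree in which each node can be split further. The sketch below is illustrative only: the Segment class and segment_at_level function are hypothetical names, and the tree for 中国人 is written by hand, not produced by a segmenter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """A stretch of text plus its optional finer-grained parts."""
    text: str
    parts: List["Segment"] = field(default_factory=list)


def segment_at_level(node: Segment, level: int) -> List[str]:
    """Return the segmentation obtained by descending `level` steps."""
    if level == 0 or not node.parts:
        # Either the requested depth is reached or the node cannot be split.
        return [node.text]
    result: List[str] = []
    for part in node.parts:
        result.extend(segment_at_level(part, level - 1))
    return result


# Hand-built tree for the slide's example (an assumption, for illustration).
zhongguoren = Segment("中国人", [
    Segment("中国", [Segment("中"), Segment("国")]),
    Segment("人"),
])

for level in range(3):
    print(level, segment_at_level(zhongguoren, level))
```

Level 0 gives the long lexicographic unit, level 1 the word-level split and level 2 the character-level split, so a learner-facing tool could switch between them on demand.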


24 Chinese NLP steps…
Segmentation now set in stone!
POS tagging (詞類標記)
Sentence parsing (for gramrel)
Solution: multiple levels of segmentation

25 中华人民共和国尊重在中国境内的外国人的宗教信仰自由
(From the zhTenTen corpus)
The PRC respects the right of foreigners residing within its territory to practise freedom of religion

26 People’s Republic of China
Level 1
Level 2: People's Republic of China
Level 3: Chinese | people | republic
Level 3a: China | the people
Level 4: middle | Chinese | person | nationality | republican | country
Level 4a: common | union
Useful for MT
Useful for IR
ALL levels: useful for learners

27 Segmentation algorithms
Longest match procedure (Tseng & Chen 2002)
Mutual Information scores: Sproat and Shih (1990); Sun et al. (1998), without dictionaries
Machine learning:
Conditional Random Fields (Sha & Pereira 2003; Chang et al. 2008)
HMM (Baidu's Jieba segmenter)
Neural nets (Zheng et al. 2013; Pei et al. 2014)
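As a rough illustration of the mutual information idea, the sketch below scores adjacent character pairs by pointwise mutual information, log2(P(ab) / (P(a) P(b))), so that characters which tend to occur together score higher. This is a generic textbook formulation, not the exact method of the cited papers; the function name and the four-string toy corpus are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def adjacent_pmi(corpus: List[str]) -> Dict[Tuple[str, str], float]:
    """Pointwise mutual information for every adjacent character pair."""
    char_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for line in corpus:
        char_counts.update(line)                 # single characters
        pair_counts.update(zip(line, line[1:]))  # adjacent pairs
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        p_ab = n / total_pairs
        p_a = char_counts[a] / total_chars
        p_b = char_counts[b] / total_chars
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy corpus, invented for illustration only.
toy_corpus = ["中国人", "中国", "外国人", "人民"]
scores = adjacent_pmi(toy_corpus)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print("".join(pair), round(score, 2))
```

On this toy data 中国 scores higher than 国人, so a segmenter that groups the strongest pair first would split 中国人 as 中国 + 人. Real systems need far larger corpora for the estimates to be meaningful.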

28 Segmentation algorithms
Longest match procedure: Deng & Long (1987); Lai (1990)
Step  Segments to look up  Lookup status (success/fail)
1  ABCDEF (f)  A (s)
2  ABCDE  AB
3  ABCD  ABC
4  C
5  CD
6  CDEFG  CDE
7  CDEF
8  G
9
Result  AB CDEF G  AB CDEF G
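The longest match procedure can be sketched in a few lines of Python. This is a minimal forward maximum-matching version under stated assumptions: the function name, the toy lexicon and the maximum word length of 6 are illustrative, and unknown single characters are simply emitted on their own.

```python
from typing import List, Set


def longest_match_segment(text: str, lexicon: Set[str], max_len: int = 6) -> List[str]:
    """Greedy forward segmentation: at each position take the longest lexicon entry."""
    segments = []
    start = 0
    while start < len(text):
        end = min(len(text), start + max_len)
        # Shrink the candidate from the right until it is in the lexicon
        # or only a single character is left.
        while end > start + 1 and text[start:end] not in lexicon:
            end -= 1
        segments.append(text[start:end])
        start = end
    return segments


# Abstract example from the slide: ABCDEF, ABCDE, ABCD and ABC fail, AB succeeds, and so on.
print(longest_match_segment("ABCDEFG", {"AB", "CDEF", "G"}))  # ['AB', 'CDEF', 'G']

# With both 中国 and 中国人 in the lexicon, the longer entry wins.
print(longest_match_segment("中国人", {"中国", "中国人", "人"}))  # ['中国人']
```

The second call shows why this family of algorithms privileges long words: the shorter analysis 中国 + 人 is never returned once 中国人 is in the lexicon.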

29 Work planned on EPSRC bid
With Serge and Siân:
Neural-net-based adaptive segmenter
Reveals different granularities of segmentation to learners of Chinese
With Miloš and Lexical Computing:
Overhaul of current Chinese gramrels
Implementation of adaptive segmenter on SkELL platform

30 References
Chang, P. C., Galley, M., & Manning, C. D. (2008). Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the third workshop on statistical machine translation (pp ). Association for Computational Linguistics.
Chao, Y. (1968). Grammar of Spoken Chinese. Berkeley: University of California Press.
Deng, Q., & Long, Z. (1987). A microcomputer retrieval system realising automatic information indexing in Chinese. Journal of Information Science, 6, 427–432.
Lai, M. (1990). Development of a system of automatic indexing for Chinese scientific and technical literature. In M. Zeng (Ed.), Database developments and Chinese information needs: Proceedings of the Second Beijing International Symposium on Computerised Information Retrieval, Beijing (pp. 179–188). London: Aslib.
Pei, W., Ge, T., & Chang, B. (2014). Max-Margin Tensor Neural Network for Chinese Word Segmentation. In Proceedings of ACL 2014 (1) (pp ).
Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL'03 (pp. 134–141). Edmonton, Canada.
Sproat, R., & Shih, C. (1990). A statistical method for finding word boundaries in Chinese text. Computer Processing of Chinese and Oriental Languages, 4(4), 336–351.
Sun, M., Shen, D., & Tsou, B. K. (1998). Chinese word segmentation without using lexicon and hand-crafted training data. In Proceedings of the 17th international conference on Computational linguistics, Volume 2 (pp ). Association for Computational Linguistics.
Tseng, H., & Chen, K. J. (2002). Design of Chinese morphological analyzer. In Proceedings of the first SIGHAN workshop on Chinese language processing (pp. 1–7). Association for Computational Linguistics.
Zheng, X., Chen, H., & Xu, T. (2013). Deep learning for Chinese word segmentation and POS tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 647–657). Seattle, Washington, USA: Association for Computational Linguistics.

