1 Chinese WordSketch Online, corpus-based summaries of word usage
2 Participants: Adam Kilgarriff, Lexical Computing, UK; David Tugwell, Tech University Budapest; Pavel Rychly, Brno University; Simon Smith, Ming Chuan University (Academia Sinica); 黃居仁, Academia Sinica; 巫宜靜, Tsing Hua University (Academia Sinica)
3 Facing the problem: lexical choice. “You shall know a word by the company it keeps” (Firth, 1957). The meaning of face depends on the collocation (詞語搭配): –學漢語的外國人要面對詞語選擇的問題 (Foreigners learning Chinese have to face the problem of word choice; face = 面對) –許多種動物正在面臨絕種 (Many animal species are facing extinction; face = 面臨) Similarly with save: –Save money –Save life –Save a seat for me
4 Look in a dictionary? A corpus? Some modern English dictionaries give some collocation ( 詞語搭配 ) information –Chinese dictionaries give very limited help Since the 1980s, corpus KWIC (KeyWord In Context) concordances have been available
5 Pre-computer corpus! Oxford English Dictionary: 20 million index cards
6 KWIC Concordance
7 The coloured pens method: 1 political association; 2 social event; 3 group of people; 4 person in an agreement/dispute; 5 to be party to something...
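A KWIC concordance can be sketched in a few lines of Python. This is only an illustration, not the tool used in the project: the function name `kwic`, the whitespace tokenisation, and the sample text are all assumptions made for the example.

```python
def kwic(tokens, keyword, width=4):
    """Return one (left, keyword, right) triple per occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Toy text (invented for illustration), tokenised on whitespace.
text = ("the party won the election . we went to a party last night . "
        "he was party to the agreement")
tokens = text.split()

for left, kw, right in kwic(tokens, "party"):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>25}  {kw}  {right}")
```

Each output line shows one occurrence with a fixed window of context on either side, which is the view a lexicographer would then sort and mark up by sense.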
8 Limitation of KWIC analysis As corpora get bigger: too much data –50 lines for a word: read all –500 lines: could read all, takes a long time –5000 lines: no Instead, create a statistical summary of word usage –Show most salient (最有顯著性) collocates (Mutual Information)
9 Mutual Information (Church and Hanks 1989) MI: how much more often does a word pair occur than one might expect by chance: I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}, with the probabilities estimated from corpus frequencies
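As a rough sketch, MI in the sense of Church and Hanks can be computed from raw corpus counts as the base-2 log of the observed joint probability over the product of the two individual probabilities. The function name `mutual_information` and the counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """Mutual information log2(P(x,y) / (P(x) P(y))) estimated from raw counts.

    f_xy: joint frequency of the pair, f_x and f_y: frequencies of each word,
    n: corpus size in tokens.
    """
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts (not taken from any real corpus).
N = 1_000_000
score = mutual_information(f_xy=8, f_x=1000, f_y=2000, n=N)
print(round(score, 2))
```

With these counts the pair occurs four times more often than chance would predict, giving a score of about 2. A pair that co-occurs exactly as often as chance predicts scores 0.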
10 Collocation listing
For right collocates of save (>5 hits)
word       f(x+y)  f(y)     word       f(x+y)  f(y)
forests    6       170      life
$                           dollars    8       1668
lives      37      1697     costs      7       1719
enormous   6       301      thousands  6       1481
annually   7       447      face       9       2590
jobs       20      2001     estimated  6       2387
money      64      6776     your       7       3141
11 Limitations of collocation listing Some items are not genuine collocates –your appears only because it is adjacent to save The collocates can belong to any part of speech –It would be better if they were classified by POS –and by the role they play in the sentence Thus, for arrest in “The police were quick to arrest a number of suspects on the spot” we would like to see –Keyword: arrest –Subject: police –Object: suspect(s) –Modifier: on the spot
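One way to picture the desired output is as a set of (relation, keyword, collocate) triples grouped by relation. The sketch below is an assumption about a convenient data layout, not the system's actual representation; the triples are written by hand for the example sentence, and `group_by_relation` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written triples for the example sentence. A real system would
# extract these from a tagged corpus rather than list them manually.
triples = [
    ("subject", "arrest", "police"),
    ("object", "arrest", "suspect"),
    ("modifier", "arrest", "on the spot"),
]

def group_by_relation(triples, keyword):
    """Collect the collocates of keyword under each grammatical relation."""
    sketch = defaultdict(list)
    for relation, head, collocate in triples:
        if head == keyword:
            sketch[relation].append(collocate)
    return dict(sketch)

sketch = group_by_relation(triples, "arrest")
for relation, collocates in sketch.items():
    print(f"{relation}: {', '.join(collocates)}")
```

Over a large corpus, each relation's list would hold many collocates, which could then be ranked by a salience score such as MI.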
12 Wordsketch Attempts to meet these requirements A corpus-derived one-page summary of a word’s grammatical and collocational behaviour Implemented for English and Czech Chinese and Irish implementations in progress
13 The corpus: Chinese Gigaword A Linguistic Data Consortium corpus –Very large: over 1 billion characters –Compiled by David Graff & Ke Chen in 2003 –Minimally tagged 286 newswire stories, half from each of: –CNA Taiwan (740 million traditional characters) –Xinhua PRC (380 million simplified characters) Corpus was segmented and tagged using Academia Sinica tools
14 逮捕 (arrest) 教 (teach) 學習 (learn) 銀行 (bank) 捉 (catch)
21 Functions KWIC concordance –Sorting, filtering etc Word sketch Automatic thesaurus Sketch difference –discriminate near-synonyms In development –key words in a subcorpus / text type –how word varies with text type
23 Grammar writing Uses CQL (Corpus Query Language) –Christ and Schulze, U. Stuttgart, 1994 Defining an object: v (adj|n|det|num|adv)* n Rewriting in CQL with BNC/CLAWS-5 tags: [tag="VV.*"] [tag="(A[JTV]|D|O).*"]* [tag="NN.*"]
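The effect of the CQL query can be imitated in Python by running a regular expression over the sequence of tags. This is not a CQL engine, only a sketch of the same verb, optional modifiers, noun pattern; the sample sentence, its hand-assigned CLAWS-5 tags, and the helper `find_objects` are assumptions for illustration.

```python
import re

# (word, CLAWS-5 tag) pairs; the tagging is hand-assigned for illustration.
tagged = [
    ("The", "AT0"), ("police", "NN2"), ("arrested", "VVD"),
    ("a", "AT0"), ("young", "AJ0"), ("suspect", "NN1"), (".", "PUN"),
]

# Per-token tag patterns mirroring the CQL query above.
VERB = r"VV\S*"
MOD = r"(?:A[JTV]|D|O)\S*"
NOUN = r"NN\S*"
PATTERN = re.compile(rf"(?<!\S){VERB} (?:{MOD} )*{NOUN}(?!\S)")

def find_objects(tagged):
    """Yield a (verb, noun) word pair for each match of the tag pattern."""
    tag_string = " ".join(tag for _, tag in tagged)
    for m in PATTERN.finditer(tag_string):
        # Convert character offsets to token indices by counting spaces.
        start = tag_string[:m.start()].count(" ")
        end = start + m.group().count(" ")
        yield tagged[start][0], tagged[end][0]

print(list(find_objects(tagged)))
```

The first and last tokens of each match give the verb and its candidate object, which is the pair a word sketch would record under the object relation.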
24 Further work Improve grammatical relations, especially sentence objects, to account for –topicalization (啤酒, 葡萄酒, 他都愛喝: “Beer, wine, he loves to drink them all”) –把 fronting (請把啤酒喝完: “Please finish the beer”) Create “Dr Eye” style interface, to show common collocations online, in a text
25 English version available For personal use –You are welcome to register and make good use of it!