Sketch Engine for Chinese: discussion notes

Wordsketch, subsequently Sketch Engine
Developed by Kilgarriff et al. at Brighton
Gives automatic, corpus-based summaries of a word’s grammatical and collocational behaviour
Captures information in a more accessible way than hundreds of KWIC lines
Uses an MI-based salience algorithm
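To make the last point concrete, here is a minimal sketch of an MI-based salience score. The slide does not give the formula, so the weighting used here (pointwise mutual information multiplied by the log of the pair frequency) is an assumption, and the function names and all counts are invented for illustration.

```python
import math

def mi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (base 2) for a word pair.

    pair_count: how often the two words co-occur
    w1_count, w2_count: how often each word occurs on its own
    total: corpus size in the same units
    """
    return math.log2(pair_count * total / (w1_count * w2_count))

def salience(pair_count, w1_count, w2_count, total):
    """Assumed salience score: MI weighted by log pair frequency.

    Plain MI favours very rare pairs; the log-frequency weight
    pulls well-attested pairs back up the ranking.
    """
    return mi(pair_count, w1_count, w2_count, total) * math.log(pair_count + 1)

# Invented counts: a pair seen once vs. a pair seen fifty times.
rare = (1, 2, 2, 4096)
frequent = (50, 200, 200, 4096)

print("MI:      ", mi(*rare), mi(*frequent))
print("salience:", salience(*rare), salience(*frequent))
```

With these toy counts, plain MI ranks the one-off pair higher, while the weighted score ranks the frequent pair higher, which is the behaviour a salience measure is meant to provide.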

Other corpus query tools do collocational salience too, but…
Sketch Engine uses lemmata, not word-forms
  –So that eat and eats are treated the same
And it takes account of grammatical relations
  –So that The plane banks and The investment banks are treated separately
  –And (if the corpus is appropriately parsed) He robs banks and He robbed the bank would be accorded similar treatment
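The difference between counting word-forms and counting lemmata within grammatical relations can be sketched as follows. The observation tuples, relation names and helper functions are hypothetical; a real system would derive them from a tagged and parsed corpus.

```python
from collections import Counter

# Hypothetical parsed observations:
# (word form, lemma, part of speech, grammatical relation, collocate lemma)
observations = [
    ("banks", "bank", "verb", "subject", "plane"),        # The plane banks
    ("banks", "bank", "noun", "modifier", "investment"),  # The investment banks
    ("banks", "bank", "noun", "object_of", "rob"),        # He robs banks
    ("bank", "bank", "noun", "object_of", "rob"),         # He robbed the bank
    ("eat", "eat", "verb", "object", "apple"),
    ("eats", "eat", "verb", "object", "apple"),
]

def count_by_form(obs):
    """Count collocations keyed on the surface word form only."""
    return Counter((form, coll) for form, _, _, _, coll in obs)

def count_by_lemma_and_relation(obs):
    """Count collocations keyed on lemma, part of speech and relation."""
    return Counter((lemma, pos, rel, coll) for _, lemma, pos, rel, coll in obs)

by_form = count_by_form(observations)
by_lemma = count_by_lemma_and_relation(observations)

for key, n in sorted(by_lemma.items()):
    print(key, n)
```

Keyed on word form, the two "rob" examples are split across banks and bank; keyed on lemma and relation, they are counted together, while the verb and noun uses of bank stay apart.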

Grammatical relations example
Unary relations: Word2 and Prep are not specified
Binary relations: Prep not specified
Binary relations: Word2 not specified
Trinary relations: both Word2 and Prep specified
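One way to picture these relation types is as a record with two optional slots. The class below is only an illustrative sketch: the field names follow the slide's Word2 and Prep, while the class name, the relation labels and the example words are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GramRel:
    """A grammatical relation instance with optional Word2 and Prep slots."""
    relation: str
    word1: str
    word2: Optional[str] = None
    prep: Optional[str] = None

    def kind(self) -> str:
        """Classify by how many of the optional slots are filled."""
        filled = sum(slot is not None for slot in (self.word2, self.prep))
        return ("unary", "binary", "trinary")[filled]

# Hypothetical instances, one per relation type on the slide.
examples = [
    GramRel("intransitive", "bank"),                       # Word2 and Prep unset
    GramRel("object", "rob", word2="bank"),                # Prep unset
    GramRel("prep_comp", "rely", prep="on"),               # Word2 unset
    GramRel("pp", "invest", word2="bank", prep="in"),      # both set
]

for rel in examples:
    print(rel.kind(), rel)
```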

Sketch Engine modules
Concordance
  –KWIC or sentence context
Thesaurus
  –A list of “similar” words
Sketch differences, for distinguishing near-synonyms
  –If both lemmata x and y have strong collocational salience with a, then they are near-synonyms
Wordsketch
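The sketch-difference idea can be illustrated by comparing the strong collocates of two lemmata: shared collocates are evidence of near-synonymy, and the unshared ones show where the words diverge. This is a simplified sketch, not the tool's own procedure; the function name, the threshold and every salience score below are invented.

```python
def sketch_difference(sketch_x, sketch_y, min_salience=1.0):
    """Split strong collocates into shared and lemma-specific sets.

    Each sketch maps (relation, collocate) to a salience score.
    Only entries scoring at least min_salience count as strong.
    """
    strong_x = {key for key, score in sketch_x.items() if score >= min_salience}
    strong_y = {key for key, score in sketch_y.items() if score >= min_salience}
    return {
        "shared": sorted(strong_x & strong_y),
        "only_x": sorted(strong_x - strong_y),
        "only_y": sorted(strong_y - strong_x),
    }

# Hypothetical word sketches for two near-synonymous verbs.
express = {
    ("object", "opinion"): 8.1,
    ("object", "gratitude"): 7.4,
    ("object", "concern"): 6.9,
    ("object", "fact"): 0.5,   # below the threshold, so not a strong collocate
}
state = {
    ("object", "opinion"): 5.2,
    ("object", "fact"): 7.7,
    ("object", "reason"): 6.0,
}

diff = sketch_difference(express, state)
for label, items in diff.items():
    print(label, items)
```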

Sample of grammatical relation definitions script (m4 macros)
define(`wh_word',`[tag="AVQ"|tag="DTQ"|tag="PNQ"]')
define(`whether_if',`[tag="PNQ" & word="if" |word="whether"]')
define(`determiner',`[tag="AT."|tag="DT."|tag=poss_pro]')
define(`conjunction',`"CJC"')
define(`simple_neg',`"XX."')
define(`rel_start',`[tag="DTQ"|tag="PNQ"|tag=that_comp]')
define(`adv_neg',`[tag=any_adv|tag=simple_neg]')
define(`number',`"[OC]RD"')
define(`goal_adv',`[word="back"|word="over"|word="home"|word="away"|word="out"]')
define(`long_np',`[tag="AT."|tag="DT."|tag=poss_pro|tag=number|tag=any_adv|tag=any_adj|tag=genitive]{0,3} any_noun{0,2} 2:any_noun [tag!=any_noun & tag != genitive]')
define(`np_start',`[tag="AT."|tag="DT."|tag=poss_pro|tag=number|tag=any_adj|tag=any_noun]')

Applications
Intended as an aid to lexicographers
At least one paper on an MT application
Could be used in pedagogical applications
  –Earlier NSF grant aimed at a complete Chinese learning platform, with Wordsketch as a module
  –Comparison of similar lexemes cross-linguistically
Yiching is publishing about express vs biaoshi, and this work may use Wordsketch

Chinese Wordsketch
Kilgarriff et al. report that Wordsketch can be ported to any language
  –Pavel Rychly in the Czech Republic has implemented concordancing at Chinese character level only
AS has acquired Chinese Gigaword, and POS-tagged it automatically
  –No parsing has been attempted so far
A grammatical relations ruleset for Chinese is needed
I would plan to
  –contribute to the writing of this ruleset
  –collaborate on cross-linguistic lexical analyses, using Wordsketch where possible
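Character-level concordancing needs no word segmentation or parsing, which is why it can be done before a ruleset exists. The following is a minimal sketch of a KWIC search over raw characters; it is not the implementation mentioned above, and the function name and the sample sentence are invented for illustration.

```python
def kwic(text, keyword, width=5):
    """Return (left context, keyword, right context) for each hit.

    Context is measured in characters, so no segmentation is needed.
    """
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        hits.append((left, keyword, right))
        start = text.find(keyword, start + 1)
    return hits

# Invented sample text containing two occurrences of 表示 (biaoshi).
sample = "他表示同意。她也表示感谢。"

for left, key, right in kwic(sample, "表示", width=2):
    print(f"{left:>2} [{key}] {right}")
```

A search like this returns every occurrence of the character string, including cases where the characters straddle a word boundary, which is one reason segmentation and POS tagging matter for the later stages.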

links e/ e/ –test chin –ssmith ssmith