Seminar in Applied Corpus Linguistics: Introduction APLNG 597A Xiaofei Lu August 26, 2009.



2 Overview
- What is a corpus
- Corpus design and compilation
- Corpus annotation
- Corpus querying and analysis
- Resources
- GOLD

3 What is a corpus?
- Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
- Sinclair (1991): a collection of naturally-occurring language text, chosen to characterise a state or a variety of language
- Sinclair (2004): a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research

4 Types of corpora
- General-purpose vs. specialized corpora
  - The British National Corpus
  - Michigan Corpus of Academic Spoken English
- Native vs. learner corpora
  - International Corpus of Learner English
- Monolingual vs. parallel & comparable corpora
  - The JRC-Acquis Multilingual Parallel Corpus
  - The English-Chinese Parallel Concordancer
- Corpora representing one or diverse language varieties
  - International Corpus of English
- Synchronic vs. diachronic corpora
- Spoken vs. written corpora

5 Corpus design
- Purpose/orientation, type
- External criteria for content selection
  - Communicative function of a text
  - Mode, medium, interaction, domain, topic
- Sampling, size
- Representativeness, balance, homogeneity
- Design of the BNC

6 Corpus annotation
- Why annotate
- Levels of corpus annotation
- Difficulties for corpus annotation
- Standards and encoding

7 Why annotate
- For linguistic research
  - Allow more effective corpus searches
- For natural language processing
  - Spelling and grammar checking
  - Machine translation

8 Levels of corpus annotation
- Sentence and word segmentation
- Lemmatization and part-of-speech (POS) tagging
- Chunking and syntactic parsing
- Semantic, pragmatic, discourse, and stylistic tagging
- Learner corpora: error annotation
- Project-specific annotation

9 Difficulties for corpus annotation
- Ambiguity
  - I saw a pig with binoculars.
  - Problems for tagging, parsing, & WSD
- Unknown words
  - Identification
  - POS tagging
  - Semantic annotation
- Precision, recall, inter-annotator agreement
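
The evaluation measures named on this slide can be illustrated with a small sketch. It assumes two token-aligned tag sequences for the example sentence (a gold standard and a second annotator or tagger); the tags, the function names, and the choice of Cohen's kappa as the agreement measure are illustrative assumptions, not something the slides specify.

```python
from collections import Counter

def precision_recall(gold, predicted, label):
    """Precision and recall for one label over token-aligned tag lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical tags for "I saw a pig with binoculars"
gold      = ["PRP", "VBD", "DT", "NN", "IN", "NNS"]
predicted = ["PRP", "NN",  "DT", "NN", "IN", "NNS"]  # "saw" mis-tagged as a noun

print(precision_recall(gold, predicted, "NN"))  # (0.5, 1.0)
print(round(cohen_kappa(gold, predicted), 3))   # 0.8
```

The single mis-tagged token halves precision for NN while leaving recall untouched, which is why both figures are usually reported together.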

10 Standards and encoding
- Useful standards
  - Separable
  - Documentation
  - Linguistically consensual
  - Compatibility with existing standards
- Encoding
  - Simple encoding: present_JJ
  - XML-style: present
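
To make the two encoding styles concrete, here is a minimal sketch that converts the simple `word_TAG` encoding into an XML-style one. The element name `w`, the attribute name `pos`, and the wrapping `s` element are assumptions chosen for illustration; the slide does not say which XML scheme was intended.

```python
import xml.etree.ElementTree as ET

def underscore_to_xml(tagged_sentence):
    """Convert 'word_TAG word_TAG ...' into an <s> element with <w> children."""
    sentence = ET.Element("s")
    for item in tagged_sentence.split():
        # rpartition splits on the LAST underscore, so a word that itself
        # contains an underscore keeps it
        word, _, tag = item.rpartition("_")
        w = ET.SubElement(sentence, "w", pos=tag)
        w.text = word
    return ET.tostring(sentence, encoding="unicode")

print(underscore_to_xml("the_DT present_JJ situation_NN"))
# <s><w pos="DT">the</w><w pos="JJ">present</w><w pos="NN">situation</w></s>
```

Keeping the tag in an attribute leaves the original text recoverable by stripping the markup, which is one way to read the "separable" requirement listed above.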

11 Corpus querying and analysis
- Using windows- or web-based software
  - Good for processing raw corpora
  - Word frequency, concordances, lexical bundles, and keyword lists
  - Examples: AntConc and GOLD
- Using natural language processing tools
  - Good for processing annotated corpora
  - Extracting occurrences of grammatical patterns
  - Examples: Stanford parser and Tregex
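
Two of the raw-corpus operations listed here, a word frequency list and a concordance, can be sketched in a few lines. This is not how AntConc or GOLD implement them; the tokenizer is deliberately crude and the sample text is invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(tokens):
    """Word list sorted by descending frequency."""
    return Counter(tokens).most_common()

def concordance(tokens, keyword, window=3):
    """KWIC lines: `window` tokens of context on either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

text = ("The corpus is large. A corpus is a collection of texts. "
        "The texts are stored on a computer.")
tokens = tokenize(text)

print(word_frequencies(tokens)[:3])
for line in concordance(tokens, "corpus", window=2):
    print(line)
```

Right-aligning the left context keeps the keyword in a fixed column, which is what makes a KWIC display easy to scan and sort.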

12 Interpreting corpus data
- Statistical analysis examples
- Are frequency differences statistically significant?
  - w appears x times in an n-word corpus, and y times in an m-word corpus
  - Chi-square test and Fisher's Exact Test
- Collocation analysis
  - How strongly are x and y associated
  - Mutual information and t-test
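
As a worked illustration of these measures, the sketch below computes a Pearson chi-square statistic for the x-in-n versus y-in-m comparison, plus mutual information and a t-score for a word pair. The formulas are the textbook versions commonly used in corpus linguistics; all counts are invented, and the function names are hypothetical.

```python
import math

def chi_square_2x2(x, n, y, m):
    """Pearson chi-square (no continuity correction) for a word occurring
    x times in an n-word corpus and y times in an m-word corpus."""
    observed = [[x, n - x], [y, m - y]]
    row_totals = [n, m]
    col_totals = [x + y, (n + m) - (x + y)]
    total = n + m
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise MI: log2 of observed over expected co-occurrence frequency."""
    return math.log2(f_xy * n / (f_x * f_y))

def t_score(f_xy, f_x, f_y, n):
    """t-score for a collocation: (observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

# Invented counts: 30 hits in 10,000 words vs. 10 hits in 10,000 words
chi2 = chi_square_2x2(30, 10_000, 10, 10_000)
print(round(chi2, 2))  # 10.02, above 3.841 (critical value, df = 1, p = 0.05)

# Invented counts: pair seen 20 times; words seen 100 and 200 times; 1M tokens
print(round(mutual_information(20, 100, 200, 1_000_000), 2))  # 9.97
print(round(t_score(20, 100, 200, 1_000_000), 2))             # 4.47
```

Chi-square becomes unreliable when expected counts are small; that is the situation in which Fisher's Exact Test is normally preferred, and `scipy.stats.fisher_exact` is one available implementation.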

13 Resources
- Books
  - Hunston (2002): Corpora in Applied Linguistics
  - McEnery (2006): Corpus-Based Language Studies
- Journals
  - International Journal of Corpus Linguistics
  - Corpus Linguistics and Linguistic Theory
  - Corpora
- Websites and mailing lists
  - Bookmarks for corpus-based linguists
  - Linguistic Data Consortium
  - The Corpora List

14 Resources
- Corpus annotation and analysis tools
  - Stanford Natural Language Processing Group
- Places for exploration
  - MICASE
  - BNC Online

15 Note on research project design
- Purpose of project
- Corpus compilation and annotation
- Corpus analysis
  - Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  - Top-down: start with given categories and search for evidence of use and variance
- Caution on generalizability

16 GOLD: Graphic Online Language Diagnostic
- One of 10 projects in CALPER
- Co-directors: Michael McCarthy & Xiaofei Lu
- This is work in progress

17 Overview of functions
- An online tool for users to
  - Build, upload, and update their own corpora
  - Share corpora with each other
  - Search corpora

18 Corpus compilation
- A user can compile a corpus by
  - Directly creating and uploading an XML file
  - Using the guided XML creation interface
- An uploaded corpus can be easily updated
  - Documents can be added or deleted
  - The whole corpus can be deleted

19 Corpus sharing
- GOLD facilitates easy data sharing
- A corpus may be set to be private, shared, or public
- Corpus owner may give others right to view, add, edit, or delete corpora

20 Metadata information
- A corpus should contain informative metadata
  - Information about the learner
  - Information about the sample
- Facilitates contrastive and longitudinal studies

21 Corpus search
- Select one or more corpora to search
- Specify key words or phrases
  - May use the wildcard character, e.g. book*
- Specify contexts
  - Size of context window
  - Context words and their positions
- Specify metadata conditions
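
The search options on this slide (wildcard keyword, context window, context word, metadata conditions) can be approximated as follows. This is a sketch only, not GOLD's implementation: the corpus structure, the metadata keys `l1` and `level`, and the `search` function are all hypothetical.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical mini-corpus: each document carries learner/sample metadata
corpus = [
    {"meta": {"l1": "Chinese", "level": 2},
     "text": "I bought two books and booked a room."},
    {"meta": {"l1": "Spanish", "level": 3},
     "text": "The bookshop sells every book I need."},
]

def search(corpus, pattern, window=2, context_word=None, **conditions):
    """Return (left, hit, right) triples for tokens matching a wildcard pattern.

    conditions   -- metadata key/value pairs a document must satisfy
    context_word -- if given, must occur within the context window
    """
    hits = []
    for doc in corpus:
        if any(doc["meta"].get(k) != v for k, v in conditions.items()):
            continue  # document fails a metadata condition
        tokens = re.findall(r"\w+", doc["text"].lower())
        for i, tok in enumerate(tokens):
            if not fnmatchcase(tok, pattern):
                continue
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            if context_word and context_word not in left + right:
                continue
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits

print(search(corpus, "book*"))                        # all four matches
print(search(corpus, "book*", l1="Chinese"))          # first document only
print(search(corpus, "book*", context_word="every"))  # hits near "every"
```

Here the context word may appear anywhere in the window; restricting it to a specific position, as the slide mentions, would mean checking a single index in `left` or `right` instead.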

22 Corpus search results
- Display of search results
  - Sortable KWIC display of search results
  - Sortable graphic display of search results
- Additional statistics of selected corpora
  - Sortable wordlist
  - MLS, MLW, Type/Token ratio
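
A rough sketch of the summary statistics follows. It assumes that MLS means mean length of sentence (in words) and MLW means mean length of word (in characters); the slide does not expand the abbreviations, so both readings are assumptions, as is the naive sentence splitting on end punctuation.

```python
import re

def text_statistics(text):
    """Compute MLS, MLW and type/token ratio under the assumptions above."""
    # Naive sentence split on ., ! and ? (abbreviations would break this)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "MLS": len(tokens) / len(sentences),               # words per sentence
        "MLW": sum(len(t) for t in tokens) / len(tokens),  # characters per word
        "TTR": len(set(tokens)) / len(tokens),             # types / tokens
    }

stats = text_statistics("The cat sat. The cat ran away.")
print(stats)
```

The type/token ratio falls as texts get longer, so comparisons across learners are only meaningful for samples of similar length.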

23 N-gram search
- Procedure
  - Select one or more corpora to search
  - Specify search word
  - Specify contexts
  - Specify metadata conditions
- Search results
  - Sortable list of n-grams found in selected corpora
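
The core of such a search, listing the n-grams that contain a given word and sorting them by frequency, might look like the sketch below. The function name and the sample tokens are invented, and the metadata filtering step is omitted.

```python
from collections import Counter

def ngrams_containing(tokens, word, n=3):
    """Frequency-sorted list of n-grams that include `word`."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for i in range(len(tokens) - n + 1)
        if word in tokens[i:i + n]
    )
    return counts.most_common()

tokens = "on the other hand i think that on the other hand it is".split()
for ngram, freq in ngrams_containing(tokens, "other"):
    print(freq, " ".join(ngram))
```

Recurrent n-grams of this kind are what the corpus literature calls lexical bundles, so the same routine with a frequency cut-off serves for bundle extraction.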

24 Summary of features
- Difference from other online tools
  - Can create, share, and search multiple corpora
  - Ability to work with any language
- With informative metadata, one can
  - Compare performance of different learners
  - Track development of a learner or a group of learners over time

25 Challenges
- Corpora for benchmarking
- Multilingual natural language processing
- Suggestions on desirable functions welcome