Seminar in Applied Corpus Linguistics: Introduction APLNG 597A Xiaofei Lu August 26, 2009.

1 Seminar in Applied Corpus Linguistics: Introduction APLNG 597A Xiaofei Lu August 26, 2009

2 Overview
- What is a corpus
- Corpus design and compilation
- Corpus annotation
- Corpus querying and analysis
- Resources
- GOLD

3 What is a corpus?
- Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
- Sinclair (1991): a collection of naturally-occurring language text, chosen to characterise a state or a variety of language
- Sinclair (2004): a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research

4 Types of corpora
- General-purpose vs. specialized corpora
  - The British National Corpus
  - Michigan Corpus of Academic Spoken English
- Native vs. learner corpora
  - International Corpus of Learner English
- Monolingual vs. parallel & comparable corpora
  - The JRC-Acquis Multilingual Parallel Corpus
  - The English-Chinese Parallel Concordancer
- Corpora representing one or diverse language varieties
  - International Corpus of English
- Synchronic vs. diachronic corpora
- Spoken vs. written corpora

5 Corpus design
- Purpose/orientation, type
- External criteria for content selection
  - Communicative function of a text
  - Mode, medium, interaction, domain, topic
- Sampling, size
- Representativeness, balance, homogeneity
- Design of the BNC

6 Corpus annotation
- Why annotate
- Levels of corpus annotation
- Difficulties for corpus annotation
- Standards and encoding

7 Why annotate
- For linguistic research
  - Allow more effective corpus searches
- For natural language processing
  - Spelling and grammar checking
  - Machine translation

8 Levels of corpus annotation
- Sentence and word segmentation
- Lemmatization and part-of-speech (POS) tagging
- Chunking and syntactic parsing
- Semantic, pragmatic, discourse, and stylistic tagging
- Learner corpora: error annotation
- Project-specific annotation

9 Difficulties for corpus annotation
- Ambiguity
  - I saw a pig with binoculars.
  - Problems for tagging, parsing, & word sense disambiguation (WSD)
- Unknown words
  - Identification
  - POS tagging
  - Semantic annotation
- Precision, recall, inter-annotator agreement
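The evaluation measures named on this slide can be illustrated with a minimal sketch. The tag sequences below are invented for illustration, and Cohen's kappa is only one common agreement measure; the slide does not say which one was intended.

```python
from collections import Counter

def precision_recall(gold, predicted, label):
    """Precision and recall for one label, comparing predicted tags to gold tags."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    predicted_count = sum(1 for p in predicted if p == label)
    gold_count = sum(1 for g in gold if g == label)
    precision = true_pos / predicted_count if predicted_count else 0.0
    recall = true_pos / gold_count if gold_count else 0.0
    return precision, recall

def cohen_kappa(tags_a, tags_b):
    """Chance-corrected agreement between two annotators on the same tokens."""
    n = len(tags_a)
    observed = sum(1 for a, b in zip(tags_a, tags_b) if a == b) / n
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    # Agreement expected by chance, from each annotator's tag distribution.
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical data: gold-standard tags vs. an automatic tagger's output.
gold = ["NN", "VB", "NN", "JJ", "NN", "VB"]
auto = ["NN", "NN", "NN", "JJ", "VB", "VB"]

precision, recall = precision_recall(gold, auto, "NN")
kappa = cohen_kappa(gold, auto)
```

The same `cohen_kappa` function applies to two human annotators: pass their two tag sequences instead of a gold standard and tagger output.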

10 Standards and encoding
- Useful standards
  - Separable
  - Documentation
  - Linguistically consensual
  - Compatibility with existing standards
- Encoding
  - Simple encoding: present_JJ
  - XML-style: present
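As a rough illustration of the simple `word_TAG` encoding, the sketch below splits tagged tokens back into (word, tag) pairs. The function name `parse_tagged` and the sample sentence are hypothetical, and the tags are assumed to follow a Penn Treebank-style tag set.

```python
def parse_tagged(text, sep="_"):
    """Split whitespace-separated word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST separator, so words that themselves
        # contain an underscore keep it.
        word, _, tag = token.rpartition(sep)
        if not word:
            # No usable separator: treat the token as untagged.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs

# Hypothetical tagged sentence.
pairs = parse_tagged("the_DT present_JJ situation_NN")
adjectives = [word for word, tag in pairs if tag == "JJ"]
```

Keeping the tag separable from the word in this way is what lets a search target the adjective "present" without also returning the noun or the verb.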

11 Corpus querying and analysis
- Using Windows- or web-based software
  - Good for processing raw corpora
  - Word frequency, concordances, lexical bundles, and keyword lists
  - Examples: AntConc and GOLD
- Using natural language processing tools
  - Good for processing annotated corpora
  - Extracting occurrences of grammatical patterns
  - Examples: Stanford parser and Tregex
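To make the raw-corpus operations concrete, here is a minimal sketch of a word frequency list and a keyword-in-context (KWIC) concordance. It is not how AntConc or GOLD work internally; the tokenizer is a deliberately crude assumption and the sample text is invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Hypothetical two-sentence corpus.
tokens = tokenize("The cat sat on the mat. The dog sat on the log.")
freq = Counter(tokens)                      # word frequency list
hits = concordance(tokens, "sat", window=2)

for left, kw, right in hits:
    print(f"{left:>12}  {kw}  {right}")
```

`freq.most_common()` gives the frequency list in descending order, which is the usual starting point for a wordlist view.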

12 Interpreting corpus data
- Statistical analysis examples
- Are frequency differences statistically significant?
  - w appears x times in an n-word corpus, and y times in an m-word corpus
  - Chi-square test and Fisher’s exact test
- Collocation analysis
  - How strongly are x and y associated?
  - Mutual information and t-test
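The two questions on this slide can be sketched in a few lines. The chi-square function uses the standard formula for a 2x2 contingency table without continuity correction. The mutual information and t-score functions use the observed-versus-expected formulations common in corpus linguistics, which is an assumption about what the slide intends. All counts are invented.

```python
import math

def chi_square_2x2(x, n, y, m):
    """Chi-square statistic for a word occurring x times in an n-word corpus
    and y times in an m-word corpus (no continuity correction).
    Assumes the word occurs at least once and is not the only word."""
    a, b = x, n - x          # corpus 1: the word vs. all other words
    c, d = y, m - y          # corpus 2: the word vs. all other words
    total = n + m
    numerator = total * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

def mutual_information(f_xy, f_x, f_y, corpus_size):
    """log2 of observed over expected co-occurrence frequency."""
    expected = f_x * f_y / corpus_size
    return math.log2(f_xy / expected)

def t_score(f_xy, f_x, f_y, corpus_size):
    """(observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / corpus_size
    return (f_xy - expected) / math.sqrt(f_xy)

# Hypothetical counts: 10 hits in 100 words vs. 20 hits in 100 words.
chi2 = chi_square_2x2(10, 100, 20, 100)
significant = chi2 > 3.841   # critical value for df = 1 at p = 0.05

# Hypothetical collocation: x occurs 16 times, y 32 times, together 8 times.
mi = mutual_information(8, 16, 32, 4096)
t = t_score(8, 16, 32, 4096)
```

With very small expected frequencies the chi-square approximation becomes unreliable, which is where Fisher's exact test is the usual alternative.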

13 Resources
- Books
  - Hunston (2002): Corpora in Applied Linguistics
  - McEnery (2006): Corpus-Based Language Studies
- Journals
  - International Journal of Corpus Linguistics
  - Corpus Linguistics and Linguistic Theory
  - Corpora
- Websites and mailing lists
  - Bookmarks for corpus-based linguists
  - Linguistic Data Consortium
  - The corpora list

14 Resources
- Corpus annotation and analysis tools
  - Stanford Natural Language Processing Group
- Places for exploration
  - MICASE
  - BNC Online

15 Note on research project design
- Purpose of project
- Corpus compilation and annotation
- Corpus analysis
  - Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  - Top-down: start with given categories and search for evidence of use and variance
- Caution on generalizability

16 GOLD: Graphic Online Language Diagnostic
- One of 10 projects in CALPER
- Co-directors: Michael McCarthy & Xiaofei Lu
- This is work in progress (2006-2010)

17 Overview of functions
- An online tool for users to
  - Build, upload, and update their own corpora
  - Share corpora with each other
  - Search corpora

18 Corpus compilation
- A user can compile a corpus by
  - Directly creating and uploading an XML file
  - Using the guided XML creation interface
- An uploaded corpus can be easily updated
  - Documents can be added or deleted
  - The whole corpus can be deleted

19 Corpus sharing
- GOLD facilitates easy data sharing
- A corpus may be set to be
  - Private, shared, or public
- Corpus owner may give others right to
  - View, add, edit, or delete corpora

20 Metadata information
- A corpus should contain informative metadata
  - Information about the learner
  - Information about the sample
- Facilitates contrastive and longitudinal studies

21 Corpus search
- Select one or more corpora to search
- Specify key words or phrases
  - May use the wildcard character, e.g. book*
- Specify contexts
  - Size of context window
  - Context words and their positions
- Specify metadata conditions
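The search options on this slide can be sketched as a single filter function. This is not GOLD's implementation: the `search` function, the document layout (a dict with `meta` and `tokens`), and the metadata fields are all hypothetical. Wildcard matching uses the standard library's `fnmatch.fnmatchcase`.

```python
from fnmatch import fnmatchcase

def search(documents, pattern, context=None, metadata=None):
    """Find tokens matching a wildcard pattern.

    documents: list of dicts with 'meta' (dict) and 'tokens' (list of str)
    context:   optional dict mapping a relative position (e.g. -1 for the
               word immediately to the left) to the word required there
    metadata:  optional dict of metadata values a document must have
    """
    hits = []
    for doc in documents:
        if metadata and any(doc["meta"].get(k) != v for k, v in metadata.items()):
            continue
        tokens = doc["tokens"]
        for i, tok in enumerate(tokens):
            if not fnmatchcase(tok, pattern):
                continue
            if context and not all(
                0 <= i + offset < len(tokens) and tokens[i + offset] == word
                for offset, word in context.items()
            ):
                continue
            hits.append((doc["meta"], i, tok))
    return hits

# Hypothetical learner corpus with two documents.
docs = [
    {"meta": {"learner": "A", "level": "beginner"},
     "tokens": "i read a book and two books".split()},
    {"meta": {"learner": "B", "level": "advanced"},
     "tokens": "she booked the bookstore".split()},
]

all_hits = search(docs, "book*")
after_a = search(docs, "book*", context={-1: "a"})
advanced = search(docs, "book*", metadata={"level": "advanced"})
```

Each hit carries the document metadata and the token position, so the results could be passed on to a KWIC display with a chosen context window.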

22 Corpus search results
- Display of search results
  - Sortable KWIC display of search results
  - Sortable graphic display of search results
- Additional statistics of selected corpora
  - Sortable wordlist
  - MLS, MLW, Type/Token ratio
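A minimal sketch of the summary statistics follows. It assumes MLS means mean length of sentence (words per sentence) and MLW means mean length of word (characters per word); the slide does not spell these out, and GOLD may define them differently. Sentence splitting on end punctuation is a simplification.

```python
import re

def corpus_statistics(text):
    """Compute MLS, MLW, and type/token ratio for a non-empty text."""
    # Naive sentence split on ., !, ? (abbreviations would be miscounted).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    return {
        "MLS": len(tokens) / len(sentences),               # words per sentence
        "MLW": sum(len(t) for t in tokens) / len(tokens),  # characters per word
        "TTR": len(types) / len(tokens),                   # distinct words / all words
    }

# Hypothetical learner sample.
stats = corpus_statistics("The cat sat. The cat ran away.")
```

Type/token ratio falls as texts get longer, so it is best compared across samples of similar size.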

23 N-gram search
- Procedure
  - Select one or more corpora to search
  - Specify search word
  - Specify contexts
  - Specify metadata conditions
- Search results
  - Sortable list of n-grams found in selected corpora
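The n-gram search can be approximated as counting every n-gram that contains the search word. The function name and sample text below are hypothetical, and GOLD's actual procedure may differ.

```python
from collections import Counter

def ngrams_with_word(tokens, word, n):
    """Count all n-grams of length n that contain the given word."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if word in gram:
            counts[gram] += 1
    return counts

# Hypothetical corpus.
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngrams_with_word(tokens, "cat", 2)

# Sortable list: by descending frequency, then alphabetically.
ranked = sorted(bigrams.items(), key=lambda item: (-item[1], item[0]))
```

Changing the sort key, for example to sort alphabetically first, gives the other orderings a sortable results table would offer.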

24 Summary of features
- Difference from other online tools
  - Can create, share, and search multiple corpora
  - Ability to work with any language
- With informative metadata, one can
  - Compare performance of different learners
  - Track development of a learner or a group of learners over time

25 Challenges
- Corpora for benchmarking
- Multilingual natural language processing
- Suggestions on desirable functions welcome
