Computational and Statistical Methods for Corpus Analysis: Overview


1 Computational and Statistical Methods for Corpus Analysis: Overview
Xiaofei Lu, Summer Institute of Applied Linguistics, July 6, 2009

2 Overview
What is a corpus
Corpus design and compilation
Corpus annotation
Corpus querying and analysis
Resources
GOLD

3 What is a corpus?
Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
Sinclair (1991): a collection of naturally-occurring language text, chosen to characterise a state or a variety of language
Sinclair (2004): a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research

4 Types of corpora
General-purpose vs. specialized corpora
  The British National Corpus
  Michigan Corpus of Academic Spoken English
Native vs. learner corpora
  International Corpus of Learner English
Monolingual vs. parallel & comparable corpora
  The JRC-Acquis Multilingual Parallel Corpus
  The English-Chinese Parallel Concordancer
Corpora representing one or diverse language varieties
  International Corpus of English
Synchronic vs. diachronic corpora
Spoken vs. written corpora

5 Corpus design
Purpose/orientation, type
External criteria for content selection
  Communicative function of a text
  Mode, medium, interaction, domain, topic
Sampling, size
Representativeness, balance, homogeneity
Design of the BNC

6 Corpus annotation
Why annotate
Levels of corpus annotation
Difficulties for corpus annotation
Standards and encoding

7 Why annotate
For linguistic research
  Allow more effective corpus searches
For natural language processing
  Spelling and grammar checking
  Machine translation

8 Levels of corpus annotation
Sentence and word segmentation
Lemmatization and part-of-speech (POS) tagging
Chunking and syntactic parsing
Semantic, pragmatic, discourse, and stylistic tagging
Learner corpora: error annotation
Project-specific annotation

9 Difficulties for corpus annotation
Ambiguity
  I saw a pig with binoculars.
  Problems for tagging, parsing, & WSD
Unknown words
  Identification
  POS tagging
  Semantic annotation
Precision, recall, inter-annotator agreement
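The evaluation measures named on this slide can be sketched in a few lines. The slide does not say which agreement coefficient is meant; Cohen's kappa is used here as one common choice, which is an assumption, and the function names and example data are hypothetical.

```python
from collections import Counter


def precision_recall(gold: set, predicted: set) -> tuple[float, float]:
    """Precision and recall of predicted annotations against a gold standard."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items.

    Assumption: Cohen's kappa stands in for "inter-annotator agreement".
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data: positions tagged as errors, and POS tags from two annotators.
precision, recall = precision_recall(gold={1, 2, 3, 4}, predicted={3, 4, 5})
kappa = cohen_kappa(["N", "V", "N", "N"], ["N", "V", "V", "N"])
```

Precision asks how many of the system's annotations are correct, recall asks how many of the gold annotations were found, and kappa discounts the agreement two annotators would reach by chance.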

10 Standards and encoding
Useful standards
  Separable
  Documentation
  Linguistically consensual
  Compatibility with existing standards
Encoding
  Simple encoding: present_JJ
  XML-style: <w type="JJ">present</w>
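The two encodings on this slide carry the same information, so one can be derived from the other. A minimal sketch, assuming tokens are whitespace-separated and the tag follows the last underscore; the function name `to_xml` is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def to_xml(tagged_text: str) -> str:
    """Convert whitespace-separated word_TAG tokens to <w type="TAG">word</w>."""
    elements = []
    for token in tagged_text.split():
        # Split on the last underscore so a word containing "_" keeps it.
        word, separator, tag = token.rpartition("_")
        if not separator:
            raise ValueError(f"token has no tag: {token!r}")
        # quoteattr adds the surrounding quotes; escape protects &, <, and >.
        elements.append(f"<w type={quoteattr(tag)}>{escape(word)}</w>")
    return " ".join(elements)


print(to_xml("the_DT present_JJ"))
```

Escaping matters as soon as the corpus contains characters such as `&` or `<`, which the simple underscore encoding passes through unchanged.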

11 Corpus querying and analysis
Using Windows- or web-based software
  Good for processing raw corpora
  Word frequency, concordances, lexical bundles, and keyword lists
  Examples: AntConc and GOLD
Using natural language processing tools
  Good for processing annotated corpora
  Extracting occurrences of grammatical patterns
  Examples: Stanford parser and Tregex

12 Interpreting corpus data
Statistical analysis examples
Are frequency differences statistically significant?
  w appears x times in an n-word corpus, and y times in an m-word corpus
  Chi-square test and Fisher's Exact Test
Collocation analysis
  How strongly are x and y associated?
  Mutual information and t-test
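The measures on this slide can be sketched with the standard library alone. The formulas below are the formulations commonly used in corpus linguistics, not necessarily the ones used in the talk: a Pearson chi-square on a 2x2 table without continuity correction, pointwise mutual information in bits, and the collocation t-score. All function names and the example counts are hypothetical.

```python
import math


def chi_square_2x2(x: int, n: int, y: int, m: int) -> tuple[float, float]:
    """Chi-square test for a word occurring x times in an n-word corpus
    and y times in an m-word corpus. Returns (statistic, p_value), 1 df."""
    a, b = x, n - x  # corpus 1: the word, all other words
    c, d = y, m - y  # corpus 2: the word, all other words
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the contingency table is empty")
    statistic = (n + m) * (a * d - b * c) ** 2 / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


def mutual_information(f_xy: int, f_x: int, f_y: int, corpus_size: int) -> float:
    """Pointwise MI in bits: log2 of observed over expected co-occurrence."""
    expected = f_x * f_y / corpus_size
    return math.log2(f_xy / expected)


def t_score(f_xy: int, f_x: int, f_y: int, corpus_size: int) -> float:
    """Collocation t-score: (observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / corpus_size
    return (f_xy - expected) / math.sqrt(f_xy)


# Hypothetical counts: a word seen 20 times in 100 words vs. 10 times in 100 words.
statistic, p_value = chi_square_2x2(20, 100, 10, 100)
# Hypothetical counts: x and y each occur 100 times and co-occur 10 times in 100,000 words.
mi = mutual_information(10, 100, 100, 100_000)
t = t_score(10, 100, 100, 100_000)
```

The chi-square approximation is unreliable when expected counts are small; that is the case Fisher's Exact Test addresses, and it is available in SciPy as `scipy.stats.fisher_exact`. Mutual information tends to favor rare pairs, while the t-score favors frequent ones, so the two are often reported together.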

13 Resources
Books
  Hunston (2002): Corpora in Applied Linguistics
  McEnery (2006): Corpus-Based Language Studies
Journals
  International Journal of Corpus Linguistics
  Corpus Linguistics and Linguistic Theory
  Corpora
Websites and mailing lists
  Bookmarks for corpus-based linguists
  Linguistic Data Consortium
  The corpora list

14 Resources
Corpus annotation and analysis tools
  Stanford Natural Language Processing Group
Places for exploration
  MICASE
  BNC Online

15 Note on research project design
Purpose of project
Corpus compilation and annotation
Corpus analysis
  Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  Top-down: start with given categories and search for evidence of use and variance
Caution on generalizability

16 GOLD: Graphic Online Language Diagnostic
One of 10 projects in CALPER
Co-directors: Michael McCarthy & Xiaofei Lu
This is work in progress ( )

17 Overview of functions
An online tool for users to
  Build, upload, and update their own corpora
  Share corpora with each other
  Search corpora

18 Corpus compilation
A user can compile a corpus by
  Directly creating and uploading an XML file
  Using the guided XML creation interface
An uploaded corpus can be easily updated
  Documents can be added or deleted
  The whole corpus can be deleted

19 Corpus sharing
GOLD facilitates easy data sharing
A corpus may be set to be
  Private, shared, or public
Corpus owner may give others right to
  View, add, edit, or delete corpora

20 Metadata information
A corpus should contain informative metadata
  Information about the learner
  Information about the sample
Facilitates contrastive and longitudinal studies

21 Corpus search
Select one or more corpora to search
Specify key words or phrases
  May use the wildcard character, e.g. book*
Specify contexts
  Size of context window
  Context words and their positions
Specify metadata conditions
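A wildcard keyword search with a context window can be sketched as follows. This is not GOLD's implementation, which the slides do not describe; the tokenizer, the function names, and the example sentence are assumptions made for illustration.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text: str) -> list[str]:
    """Naive tokenizer: lowercased runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def kwic(tokens: list[str], pattern: str, window: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each token that
    matches a wildcard pattern such as "book*"."""
    hits = []
    for i, token in enumerate(tokens):
        if fnmatchcase(token, pattern):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


tokens = tokenize("She booked a room and read two books about bookkeeping.")
for left, keyword, right in kwic(tokens, "book*", window=2):
    print(f"{left:>15} | {keyword} | {right}")
```

Note that `fnmatchcase` also gives `?` and `[...]` special meaning, which goes beyond the single `*` wildcard the slide mentions. Sorting the returned tuples by left or right context gives the kind of sortable display a concordancer offers.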

22 Corpus search results
Display of search results
  Sortable KWIC display of search results
  Sortable graphic display of search results
Additional statistics of selected corpora
  Sortable wordlist
  MLS, MLW, Type/Token ratio
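The summary statistics listed here are simple ratios. The slide does not define the abbreviations; the sketch below assumes MLS is mean sentence length in words and MLW is mean word length in characters, and it uses a deliberately naive sentence and word splitter. The function name is hypothetical.

```python
import re


def corpus_statistics(text: str) -> dict[str, float]:
    """Compute MLS, MLW, and type/token ratio for a plain-text sample.

    Assumptions: sentences end at ., !, or ?; words are runs of letters
    and apostrophes; types are counted case-insensitively.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text contains no words")
    return {
        "MLS": len(words) / len(sentences),               # words per sentence
        "MLW": sum(len(w) for w in words) / len(words),   # characters per word
        "TTR": len(set(words)) / len(words),              # distinct words / all words
    }


print(corpus_statistics("The cat sat. The dog ran away."))
```

Type/token ratio falls as a text gets longer, so it is only comparable across samples of similar size.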

23 N-gram search
Procedure
  Select one or more corpora to search
  Specify search word
  Specify contexts
  Specify metadata conditions
Search results
  Sortable list of n-grams found in selected corpora
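The n-gram search described here amounts to counting every n-gram that contains the search word and sorting the result. A minimal sketch under that reading; the function name and the sample sentence are hypothetical, and metadata conditions are left out.

```python
from collections import Counter


def ngrams_containing(tokens: list[str], word: str, n: int = 2) -> list[tuple[tuple[str, ...], int]]:
    """List the n-grams that contain `word`, most frequent first."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for i in range(len(tokens) - n + 1)
        if word in tokens[i:i + n]
    )
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


tokens = "the cat sat on the mat and the cat slept".split()
for ngram, frequency in ngrams_containing(tokens, "cat", n=2):
    print(frequency, " ".join(ngram))
```

Changing the sort key, for example to sort on the word before or after the search word, gives the other orderings a sortable list would expose.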

24 Summary of features
Difference from other online tools
  Can create, share, and search multiple corpora
  Ability to work with any language
With informative metadata, one can
  Compare performance of different learners
  Track development of a learner or a group of learners over time

25 Challenges
Corpora for benchmarking
Multilingual natural language processing
Suggestions on desirable functions welcome

