Research methods in corpus linguistics Xiaofei Lu.

1 Research methods in corpus linguistics Xiaofei Lu

2 Overview
- What is a corpus?
- Types of corpora
- Corpus design
- Where to obtain corpora
- Corpus annotation
- Corpus analysis
- Note on research project design
- Exercises and demos in between
- Future courses on corpus linguistics

3 What is a corpus?
- Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
- Francis (1982): a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis
- Sinclair (1991): a collection of naturally-occurring language text, chosen to characterise a state or a variety of language

4 Types of corpora
- General-purpose monolingual corpora
  - The British National Corpus
- Specialized corpora
  - Lancaster Corpus of Academic Written English
- Learner corpora
  - International Corpus of Learner English
- Parallel & comparable corpora
  - The JRC-Acquis Multilingual Parallel Corpus
  - The English-Chinese Parallel Concordancer
- Corpora and varieties
  - International Corpus of English
- Synchronic and diachronic corpora

5 Corpus design
- Purpose
- Comparability
- Type
- Content: mode, interaction, domain, medium
- Structure: proportions
- Size
- Sampling?
- Design of the BNC

6 Where to obtain corpora
- Linguistic Data Consortium
- Bookmarks for corpus-based linguists
- Ask on the corpora list
- Compile your own corpora
  - Design your corpus
  - Getting permission
  - File format, metadata, and data markup
  - Text capture
    - Scanning, typing, electronic files, web crawlers, e.g., WebSPHINX
    - Transcription tools, e.g., Transcriber
  - A Guide to Good Practice

7 Corpus annotation
- Why annotate
- Levels of corpus annotation
- Difficulties for corpus annotation
- Tools for corpus annotation

8 Why annotate
- For linguistic research
  - Allow more effective corpus searches
- For natural language processing
  - Spelling and grammar checking
  - Text summarization
  - Machine translation
  - Question answering

9 Levels of corpus annotation
- Sentence segmentation
- Word segmentation/tokenization
- Part-of-speech (POS) tagging
- Chunking/shallow parsing
- Syntactic parsing
- Semantic annotation
- Pragmatic annotation
- Parallel corpora: sentence alignment
- Learner corpora: error annotation
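The first two annotation levels can be sketched in a few lines of Python. This is a deliberately naive illustration, not the method used in the presentation: the function names `split_sentences` and `tokenize` and both regular expressions are assumptions made for this example, and real segmenters handle abbreviations such as "Dr." far more carefully.

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenization: words (allowing internal ' or -) or single punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

text = "I saw a pig with binoculars. Was it far away? Yes!"
for sent in split_sentences(text):
    print(tokenize(sent))
```

Higher levels such as POS tagging and parsing build on this output, which is why segmentation errors tend to propagate upward.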

10 Difficulties for corpus annotation
- Ambiguity
  - I saw a pig with binoculars.
  - Problems for tagging, parsing, & word sense disambiguation (WSD)
- Unknown words
  - Identification
  - POS tagging
  - Semantic annotation

11 Tools for corpus annotation
- Bookmarks for corpus-based linguists
- Corpora and Corpus Annotation Tools on the WWW
- POS tagger demonstration
  - Sentence segmentation
  - POS tagging
  - Extracting NPs of the form DT NN NN
- Dexter: Tools for analyzing language data
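Once text is POS-tagged, extracting NPs of the form DT NN NN amounts to matching a tag sequence. The sketch below assumes Penn Treebank-style tags and a hand-tagged sentence made up for illustration; `extract_dt_nn_nn` is a hypothetical helper, not part of the tagger demonstrated in the presentation.

```python
def extract_dt_nn_nn(tagged):
    """Return every three-word sequence whose tags are exactly DT NN NN."""
    matches = []
    for i in range(len(tagged) - 2):
        window = tagged[i:i + 3]
        if [tag for _, tag in window] == ["DT", "NN", "NN"]:
            matches.append(" ".join(word for word, _ in window))
    return matches

# Hand-assigned tags (assumption): in practice these come from a POS tagger.
tagged_sentence = [
    ("The", "DT"), ("research", "NN"), ("project", "NN"),
    ("used", "VBD"), ("a", "DT"), ("learner", "NN"), ("corpus", "NN"),
    (".", "."),
]

print(extract_dt_nn_nn(tagged_sentence))
# ['The research project', 'a learner corpus']
```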

12 Corpus analysis
- Levels of corpus analysis
- Tools for corpus analysis
- Interpreting corpus data

13 Levels of corpus analysis
- Word frequency lists
- Concordances
  - Collocation (lexical patterning)
  - Colligation (syntactic patterning)
- Keyword lists
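A word frequency list is the simplest of these analyses and is easy to script. The following is a minimal sketch using only the Python standard library; the tokenizing pattern (lowercased alphabetic strings with an optional apostrophe) and the sample text are assumptions for illustration, and tools such as AntConc let you configure what counts as a word.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercased word forms in a text."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

sample = "The cat sat on the mat. The dog sat on the log."
freq = word_frequencies(sample)

# Print the list in descending order of frequency.
for word, count in freq.most_common(5):
    print(f"{count:>4}  {word}")
```

To run it on a real text, replace `sample` with the contents of a file, e.g. one downloaded from Project Gutenberg.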

14 Tools for corpus analysis
- Bookmarks for corpus-based linguists
- Recommendations:
  - WordSmith Tools (not free)
  - AntConc (free)
  - TextStat (free)
- Unix tools
- Write your own scripts

15 Exercise (part 1)
- Download and install AntConc
- Download some text for processing
  - Project Gutenberg
- Generate a word frequency list for your mini-corpus

16 Interpreting corpus data
- Are frequency differences statistically significant?
  - w appears x times in an n-word corpus, and y times in an m-word corpus
  - Chi-square test (unreliable when expected frequencies are small)
  - Fisher’s Exact Test (in its standard form, applies only to a 2×2 cross table)
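Both tests can be sketched for the 2×2 case described above, where w appears x times in an n-word corpus and y times in an m-word corpus. This is an illustrative standard-library implementation under stated assumptions: the function names are hypothetical, the chi-square statistic is Pearson's without continuity correction, and the Fisher p-value is two-sided. The exact test as written is only practical for modest counts; for full-size corpora a statistics library is the better choice.

```python
import math

def chi_square_2x2(x, n, y, m):
    """Pearson chi-square (1 df, no continuity correction) and its p-value."""
    a, b = x, n - x          # corpus 1: occurrences of w, all other tokens
    c, d = y, m - y          # corpus 2: occurrences of w, all other tokens
    total = n + m
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table sums to zero")
    chi2 = total * (a * d - b * c) ** 2 / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def fisher_exact_2x2(x, n, y, m):
    """Two-sided Fisher's exact test p-value for the same 2x2 table."""
    k = x + y                # total occurrences of w in both corpora
    total = n + m

    def prob(i):
        # Hypergeometric probability that i of the k occurrences fall in corpus 1.
        return math.comb(n, i) * math.comb(m, k - i) / math.comb(total, k)

    p_observed = prob(x)
    lowest, highest = max(0, k - m), min(k, n)
    # Sum over all tables that are at most as probable as the observed one.
    return sum(p for p in (prob(i) for i in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

# Made-up counts: w occurs 30 times in 1,000 words vs. 10 times in 1,000 words.
chi2, p = chi_square_2x2(30, 1000, 10, 1000)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
print(f"Fisher p = {fisher_exact_2x2(30, 1000, 10, 1000):.4f}")
```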

17 Exercise (part 2)
- Compare your word frequency list with that of the BNC
- Anything interesting?
- Run the chi-square test and Fisher’s Exact Test on some interesting words

18 Interpreting corpus data (cont.)
- Collocational analysis: How strongly are X and Y associated?
  - Mutual information
    - Compares the observed and expected co-occurrence frequencies of (X,Y)
    - Higher MI, stronger association
    - Doesn’t work well for low frequencies
  - T-test
    - Measures confidence with which to claim strong association between X and Y
    - Higher t-score, higher association
  - Online calculations
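The slide does not give formulas, so the sketch below assumes the formulations commonly used in corpus linguistics: MI = log2(O / E) and t = (O - E) / sqrt(O), where O is the observed co-occurrence count and E = f(X) × f(Y) / N is the count expected if X and Y were independent. Span-size adjustments, which some tools apply, are left out, and all function names and counts are illustrative.

```python
import math

def expected_cooccurrences(freq_x, freq_y, corpus_size):
    """Co-occurrences expected by chance if X and Y were independent."""
    return freq_x * freq_y / corpus_size

def mutual_information(observed, freq_x, freq_y, corpus_size):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    expected = expected_cooccurrences(freq_x, freq_y, corpus_size)
    return math.log2(observed / expected)

def t_score(observed, freq_x, freq_y, corpus_size):
    """t-score: (observed - expected) scaled by the square root of observed."""
    expected = expected_cooccurrences(freq_x, freq_y, corpus_size)
    return (observed - expected) / math.sqrt(observed)

# Made-up counts: X occurs 1,000 times, Y 500 times, and they co-occur
# 20 times in a corpus of 1,000,000 words.
print(f"MI = {mutual_information(20, 1000, 500, 1_000_000):.2f}")
print(f"t  = {t_score(20, 1000, 500, 1_000_000):.2f}")
```

Because MI divides by a very small expected value when X or Y is rare, a single chance co-occurrence can yield a high score, which is the low-frequency weakness noted above; the t-score is less affected because it depends on the absolute size of the observed count.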

19 Exercise (part 3)
- Generate a concordance for a target word
- Find a word that co-occurs frequently with the target word
- Test if the word is strongly associated with the target word
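For readers who prefer scripting to a concordancer, the first two steps of this exercise can be approximated as follows. This is a rough sketch, not a replacement for AntConc: the helper names, the tokenizing pattern, the window width, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic word forms."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def concordance(tokens, target, width=4):
    """Key-word-in-context lines: `width` tokens on each side of the target."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines

def collocates(tokens, target, width=4):
    """Count the words that occur within `width` tokens of the target."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            counts.update(tokens[max(0, i - width):i])
            counts.update(tokens[i + 1:i + 1 + width])
    return counts

text = ("Strong tea is better than weak tea. She drinks strong tea "
        "every morning, and he drinks strong coffee.")
tokens = tokenize(text)

for line in concordance(tokens, "tea", width=2):
    print(line)
print(collocates(tokens, "tea", width=2).most_common(3))
```

The raw co-occurrence counts from `collocates` are the observed frequencies that an association measure would then evaluate in the third step.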

20 Note on research project design
- Purpose of project
- Corpus compilation and annotation
- Corpus analysis
  - Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  - Top-down: start with given categories and search for evidence of use and variance
- Caution on generalizability

21 Future courses on corpus linguistics
- Spring 2007
  - APLING 597E: Introduction to Corpus Linguistics
  - Hands-on course on principles and tools for corpus compilation, annotation, processing, and analysis
- Spring 2008
  - APLING 597: Seminar on Corpus Linguistics
  - Advanced seminar on using corpora for serious research projects

