Published by Antony Payne. Modified over 9 years ago.

Research methods in corpus linguistics
Xiaofei Lu

Overview
- What is a corpus?
- Types of corpora
- Corpus design
- Where to obtain corpora
- Corpus annotation
- Corpus analysis
- Note on research project design
- Exercises and demos in between
- Future courses on corpus linguistics

What is a corpus?
- Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
- Francis (1982): a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis
- Sinclair (1991): a collection of naturally-occurring language text, chosen to characterise a state or a variety of language

Types of corpora
- General-purpose monolingual corpora
  - The British National Corpus
- Specialized corpora
  - Lancaster Corpus of Academic Written English
- Learner corpora
  - International Corpus of Learner English
- Parallel & comparable corpora
  - The JRC-Acquis Multilingual Parallel Corpus
  - The English-Chinese Parallel Concordancer
- Corpora and varieties
  - International Corpus of English
- Synchronic and diachronic corpora

Corpus design
- Purpose
- Comparability
- Type
- Content: mode, interaction, domain, medium
- Structure: proportions
- Size
- Sampling?
- Design of the BNC

Where to obtain corpora
- Linguistic Data Consortium
- Bookmarks for corpus-based linguists
- Ask on the corpora list
- Compile your own corpora
  - Design your corpus
  - Getting permission
  - File format, metadata, and data markup
  - Text capture
    - Scanning, typing, electronic files, web crawlers, e.g., WebSPHINX
    - Transcription tools, e.g., Transcriber
  - A Guide to Good Practice

Corpus annotation
- Why annotate
- Levels of corpus annotation
- Difficulties for corpus annotation
- Tools for corpus annotation

Why annotate
- For linguistic research
  - Allow more effective corpus searches
- For natural language processing
  - Spelling and grammar checking
  - Text summarization
  - Machine translation
  - Question answering

Levels of corpus annotation
- Sentence segmentation
- Word segmentation/tokenization
- Part-of-speech (POS) tagging
- Chunking/shallow parsing
- Syntactic parsing
- Semantic annotation
- Pragmatic annotation
- Parallel corpora: sentence alignment
- Learner corpora: error annotation

Difficulties for corpus annotation
- Ambiguity
  - I saw a pig with binoculars.
  - Problems for tagging, parsing, & WSD
- Unknown words
  - Identification
  - POS tagging
  - Semantic annotation

Tools for corpus annotation
- Bookmarks for corpus-based linguists
- Corpora and Corpus Annotation Tools on the WWW
- POS tagger demonstration
  - Sentence segmentation
  - POS tagging
  - Extracting NPs of the form DT NN NN
- Dexter: Tools for analyzing language data
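
The last step of the POS tagger demonstration, extracting NPs of the form DT NN NN, can be sketched in a few lines once text has been tagged. This is a minimal illustration, not the tool used in the demo: the function name `extract_dt_nn_nn` is hypothetical, the input is assumed to be a list of (word, tag) pairs with Penn Treebank tags, and the sample sentence is hand-tagged rather than real tagger output.

```python
def extract_dt_nn_nn(tagged):
    """Return every three-token span whose tags are exactly DT, NN, NN."""
    pattern = ("DT", "NN", "NN")
    matches = []
    # Slide a three-token window over the tagged sentence.
    for i in range(len(tagged) - 2):
        window = tagged[i:i + 3]
        if tuple(tag for _, tag in window) == pattern:
            matches.append(" ".join(word for word, _ in window))
    return matches


# Hand-tagged toy sentence (assumed Penn Treebank tags), not tagger output.
sentence = [
    ("The", "DT"), ("corpus", "NN"), ("linguist", "NN"), ("built", "VBD"),
    ("a", "DT"), ("frequency", "NN"), ("list", "NN"), (".", "."),
]
noun_phrases = extract_dt_nn_nn(sentence)
print(noun_phrases)
```

In practice the (word, tag) pairs would come from whichever POS tagger you use; only the pattern-matching step is shown here.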

Corpus analysis
- Levels of corpus analysis
- Tools for corpus analysis
- Interpreting corpus data

Levels of corpus analysis
- Word frequency lists
- Concordances
- Collocation (lexical patterning)
- Colligation (syntactic patterning)
- Keyword lists
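
To make the idea of a concordance concrete, here is a rough keyword-in-context (KWIC) sketch. It is an illustration only: the helper names `tokenize` and `concordance` are hypothetical, the regular-expression tokenizer is deliberately naive (lowercased alphabetic words only), and the second sample sentence is made up for the demo.

```python
import re


def tokenize(text):
    """Naive tokenizer: lowercase alphabetic words, optional apostrophe part."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def concordance(tokens, target, width=3):
    """Return (left context, keyword, right context) for each hit of target."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


text = "I saw a pig with binoculars. The pig saw me."
tokens = tokenize(text)
kwic = concordance(tokens, "pig", width=2)
for left, keyword, right in kwic:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20}  {keyword}  {right}")
```

Dedicated concordancers add sorting, wildcard search, and links back to the source text; the sketch shows only the core windowing step.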

Tools for corpus analysis
- Bookmarks for corpus-based linguists
- Recommendations:
  - WordSmith Tools (not free)
  - AntConc (free)
  - TextStat (free)
  - Unix tools
  - Write your own scripts

Exercise (part 1)
- Download and install AntConc
- Download some text for processing
  - Project Gutenberg
- Generate a word frequency list for your mini-corpus
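
If you would rather try the "write your own scripts" route for this exercise, a word frequency list takes only a few lines. The sketch below is an assumption-laden illustration, not a replacement for AntConc: the function name `word_frequencies` is hypothetical, tokenization is a naive lowercase regular expression, and the sample string stands in for a downloaded text.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercased word tokens in a string of plain text."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


# Toy stand-in for a mini-corpus; replace with the contents of a text file.
sample = "The cat sat on the mat. The dog sat too."
freq = word_frequencies(sample)

# Print the most frequent words, one per line, as count<TAB>word.
for word, count in freq.most_common(3):
    print(f"{count}\t{word}")
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the result to the function.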

Interpreting corpus data
- Are frequency differences statistically significant?
  - w appears x times in an n-word corpus, and y times in an m-word corpus
  - Chi-square test (doesn’t work well for small numbers)
  - Fisher’s Exact Test (doesn’t work for a cross table larger than 2×2)
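
Both tests operate on the 2×2 table built from the slide's notation: rows are the two corpora, columns are "tokens of w" and "all other tokens", giving cells x, n − x, y, and m − y. The following is a minimal sketch under stated assumptions, not the online calculators linked from the slide: the function names are hypothetical, the chi-square statistic is Pearson's without a continuity correction, the Fisher p-value is two-sided (summing all tables no more probable than the observed one), and w is assumed to occur at least once overall but not to make up an entire corpus.

```python
from math import comb


def chi_square_2x2(x, n, y, m):
    """Pearson chi-square statistic (1 degree of freedom, no correction)."""
    a, b = x, n - x          # corpus 1: tokens of w, all other tokens
    c, d = y, m - y          # corpus 2: tokens of w, all other tokens
    total = n + m
    numerator = total * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def fisher_exact_2x2(x, n, y, m):
    """Two-sided Fisher's exact p-value for the same 2x2 table."""
    hits = x + y  # total occurrences of w across both corpora

    def weight(k):
        # Number of ways to place k of the hits in corpus 1 (exact integer).
        return comb(n, k) * comb(m, hits - k)

    observed = weight(x)
    lowest, highest = max(0, hits - m), min(n, hits)
    # Sum the weights of every table at most as likely as the observed one.
    extreme = sum(
        weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed
    )
    return extreme / comb(n + m, hits)


# Toy counts chosen only so the arithmetic is easy to check by hand.
chi2 = chi_square_2x2(3, 10, 1, 10)
p_value = fisher_exact_2x2(3, 10, 1, 10)
print(chi2, p_value)
```

With one degree of freedom, a chi-square value above about 3.84 corresponds to p < 0.05. The toy counts are far too small for the chi-square approximation to be trusted, which is exactly the case where the exact test is the safer choice.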

Exercise (part 2)
- Compare your word frequency list with that of the BNC
- Anything interesting?
- Run the chi-square test and Fisher’s Exact Test on some interesting words

Interpreting corpus data (cont.)
- Collocational analysis: How strongly are X and Y associated?
  - Mutual information
    - Measures difference between observed and expected frequencies of (X,Y)
    - Higher MI, stronger association
    - Doesn’t work well for low frequencies
  - T-test
    - Measures confidence with which to claim strong association between X and Y
    - Higher t-score, higher association
  - Online calculations
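
The slide does not give formulas for either measure. One common formulation in corpus work, assumed here, takes the expected co-occurrence frequency as E = f(X)·f(Y)/N and then computes MI = log2(O/E) and t = (O − E)/√O, where O is the observed co-occurrence frequency and N the corpus size in tokens. The function names and the toy counts below are hypothetical, and tools differ in details such as the collocation span, so treat this as a sketch rather than the definition used by any particular program.

```python
from math import log2, sqrt


def expected_cooccurrence(freq_x, freq_y, corpus_size):
    """Expected co-occurrences if X and Y were distributed independently."""
    return freq_x * freq_y / corpus_size


def mutual_information(freq_xy, freq_x, freq_y, corpus_size):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    expected = expected_cooccurrence(freq_x, freq_y, corpus_size)
    return log2(freq_xy / expected)


def t_score(freq_xy, freq_x, freq_y, corpus_size):
    """t-score: (observed - expected) scaled by the square root of observed."""
    expected = expected_cooccurrence(freq_x, freq_y, corpus_size)
    return (freq_xy - expected) / sqrt(freq_xy)


# Toy counts: X occurs 100 times, Y 40 times, together 16 times, N = 1000.
mi = mutual_information(16, 100, 40, 1000)
t = t_score(16, 100, 40, 1000)
print(mi, t)
```

Because MI divides by the expected frequency, a pair seen only once or twice can score very high, which is the low-frequency weakness noted on the slide; the t-score is less sensitive to this because it rewards pairs with more observed evidence.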

Exercise (part 3)
- Generate a concordance for a target word
- Find a word that co-occurs frequently with the target word
- Test if the word is strongly associated with the target word

Note on research project design
- Purpose of project
- Corpus compilation and annotation
- Corpus analysis
  - Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  - Top-down: start with given categories and search for evidence of use and variance
- Caution on generalizability

Future courses on corpus linguistics
- Spring 2007: APLING 597E: Introduction to Corpus Linguistics
  - Hands-on course on principles and tools for corpus compilation, annotation, processing, and analysis
- Spring 2008: APLING 597: Seminar on Corpus Linguistics
  - Advanced seminar on using corpora for serious research projects