Published by Lisa Harper. Modified over 9 years ago.
Corpora: Annotating and Searching. LING 5200 Computational Corpus Linguistics. Martha Palmer
LING 5200, 2006, based on Kevin Cohen’s LING 5200
Features of corpora. Size (little/big/huge); plasticity (finite/monitor); metadata (none/lots); annotation (none, …, lots); balance.
Features: size. Relative over time: 1960s: 1M words (Brown); 1990s: 4.5M words (Penn Treebank); 2000s: 415M words (BOE); 2000s: 1000M words (English Gigaword). Currently: micro/small/large/massive.
Features. Finite: size established in advance; sample sizes adjusted accordingly; doesn't change over time. Monitor: allows diachronic analysis; grows over time.
Metadata: (practically) none. Language, at least; document boundaries; some document attributes (title, body, author, date). Example record:
PMID- 6509398
DP - 1984 Nov
TI - The natural history of Machado-Joseph disease. An analysis of 138 personally examined cases.
PG - 510-25
AB - We have examined 138 cases of a disorder previously described in people of Portuguese origin and which has received many names. By computer analysis of 46 different items of a standardized neurological examination carried out in each patient, we have been able to delineate the main components of
Metadata: lots. Author characteristics: gender, age, mother tongue(s), dialect, educational level. Genre classification: news, scientific, personal. Topic relevance. Example record fields:
MH - Aged
MH - Azores/ethnology
MH - Cerebellar Ataxia/diagnosis
MH - Gene Frequency
MH - Human
MH - Phenotype
MH - Portugal/ethnology
MH - Support, Non-U.S. Gov't
MH - Syndrome
MH - United States
MH - Variation (Genetics)
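The tagged fields in the example records (PMID, DP, TI, MH) lend themselves to a simple tag-to-values structure. The following is a minimal sketch, not a full parser for the format: the function name `parse_record` and the regular expression are assumptions made for illustration, and wrapped continuation lines, which real records may contain, are not handled.

```python
import re
from collections import defaultdict

# Assumed line shape: a short upper-case tag, optional padding, a hyphen, then the value,
# e.g. "MH - Aged" or "PMID- 6509398".
FIELD = re.compile(r"^([A-Z]{2,4})\s*-\s*(.*)$")


def parse_record(text):
    """Collect tagged fields into {tag: [values]}; repeated tags such as MH accumulate."""
    fields = defaultdict(list)
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            tag, value = match.groups()
            fields[tag].append(value)
    return dict(fields)


# A shortened copy of the record shown on the slides.
record = """\
PMID- 6509398
DP - 1984 Nov
TI - The natural history of Machado-Joseph disease. An analysis of 138 personally examined cases.
MH - Aged
MH - Azores/ethnology
MH - Cerebellar Ataxia/diagnosis
"""

fields = parse_record(record)
print(fields["DP"])       # publication date field
print(len(fields["MH"]))  # number of subject headings
```

Keeping every value in a list, even for tags that occur once, means repeated tags like MH need no special case when filtering documents by topic.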
Balanced corpora. What are you balancing? Most common: genre. Authors: gender, age, education, dialect.
Balanced corpora. Figure: Composition of the International Corpus of English (adapted from Meyer 2002). Node labels: speech, writing; unpublished, published; non-fiction, fiction; informative, instructional, persuasive; academic, popular, news.
Balanced corpora. Figure: Composition of the International Corpus of English (adapted from Meyer 2002). Node labels: speech, writing; dialogue, monologue; scripted, unscripted; talks, news, speeches.
Corpus length. Overall length. Sample size. Partial: 2,000 words (Brown, LOB, ICE); 5,000 words (London-Lund). Full: takes up space; copyright permission issues are harder.
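A partial sample of fixed length can be sketched as follows. This is only an illustration: the function name `take_sample`, the whitespace tokenization, and the `start` parameter are assumptions, and the corpora named above applied their own sampling rules, which this does not reproduce.

```python
def take_sample(text, size=2000, start=0):
    """Return up to `size` whitespace-separated words of `text`, beginning at word index `start`."""
    words = text.split()
    return words[start:start + size]


# Hypothetical stand-in for a full document of 5,000 words.
document = " ".join(f"w{i}" for i in range(5000))

sample = take_sample(document)                   # first 2,000 words
later = take_sample(document, size=2000, start=3500)  # runs out of text after 1,500 words
print(len(sample), len(later))
```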
Sample size. Motivating assumption: it is more important to maximize the number of authors/genres than the length of text from each.
By purpose. Linguistic-y: lexicon vs. other. NLP: general purpose; information retrieval; information extraction. Foreign language instruction: native L2; "learner" L2.
Is there a corpus… http://www.ldc.upenn.edu/
Annotation: none/some/lots.
Annotation. None: "collection". Some: POS; lemmas, e.g. lemma(be) = {be, am, is, are, were, being, been}
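Lemma annotation of this kind amounts to a mapping from inflected forms back to a lemma. A minimal sketch, assuming a toy lexicon: the names `LEMMA_FORMS`, `FORM_TO_LEMMA`, and `lemmatize` are hypothetical, the "be" entry is the set from the slide with "was" added here as an assumption for completeness, and unknown tokens are simply returned lower-cased.

```python
# Toy lexicon: lemma -> set of surface forms. Only the "be" entry is based on the slide;
# "was" is an addition made for this sketch.
LEMMA_FORMS = {
    "be": {"be", "am", "is", "are", "was", "were", "being", "been"},
}

# Invert the lexicon so each surface form points to its lemma.
FORM_TO_LEMMA = {
    form: lemma
    for lemma, forms in LEMMA_FORMS.items()
    for form in forms
}


def lemmatize(token):
    """Return the lemma for a known form; otherwise fall back to the lower-cased token."""
    lowered = token.lower()
    return FORM_TO_LEMMA.get(lowered, lowered)


print([lemmatize(t) for t in "Dogs are happy".split()])
```

A lemmatized corpus lets a single query for "be" retrieve every inflected form, which a plain string search would miss.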
Annotation. Lots: syntax (treebank, "bracketing"); semantics (predicate/argument structure, ontological). Example sentence: Dogs make me happy.
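Syntactic "bracketing" stores a parse as nested, labeled parentheses. The sketch below reads such a string into nested lists and recovers the words. The bracketing given for "Dogs make me happy." is an assumed analysis written for this illustration, not one taken from an annotated corpus, and the function names `parse_brackets` and `leaves` are hypothetical.

```python
def parse_brackets(text):
    """Parse a bracketed tree into nested lists of the form [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1  # consume "("
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # consume ")"
            return node
        word = tokens[pos]
        pos += 1
        return word

    return read()


def leaves(tree):
    """Return the words of a parsed tree in order, skipping the node labels."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the label
        words.extend(leaves(child))
    return words


# Assumed bracketing, for illustration only.
bracketed = "(S (NP (NNS Dogs)) (VP (VBP make) (S (NP (PRP me)) (ADJP (JJ happy)))) (. .))"
tree = parse_brackets(bracketed)
print(tree[0])
print(leaves(tree))
```

Nested lists are enough to search a treebank for structural patterns, for example every VP whose children include an embedded S.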
Diachronic. Historical (OE, ME, …); later sampling of earlier balanced corpus; monitor.
Spoken. Phonetically motivated (elicited); other ("natural").
Multilingual. Parallel: L1 contents == L2 contents, e.g. parliamentary proceedings in English & French, Shakespeare in English and German. Translation/comparable: two L1's; genre == genre, e.g. weather reports.
Penn Treebank. Treebank: corpus of syntactically annotated data. First release: 4.5 million words, 3 years' work. Currently 4.9M.
Penn Treebank.
POS-tagged Switchboard data: http://www.cis.upenn.edu/~treebank/switch-samp-pos.html
Dysfluency-annotated Switchboard data: http://www.cis.upenn.edu/~treebank/switch-samp-dfl.html
Syntactically annotated Switchboard data: http://www.cis.upenn.edu/~treebank/switch-samp-bkt.html
GENIA. 2000 abstracts; red blood cell transcription factors; POS-tagged (HW2, #16); semantic annotation with molecular biology ontology.
Corpora/resources. Dictionaries, ontologies, ...: CELEX, WordNet.
Corpora/resources. Dictionaries, ontologies, ... "Discovery procedure". Phonology: contrasts, phonotactics. Morphology: term formation, inflectional.
McEnery & Wilson's definition of "corpus": sampled & representative; finite size; machine-readable; "standard reference" ???