Corpora: Annotating and Searching
LING 5200 Computational Corpus Linguistics
Martha Palmer
LING 5200, 2006 BASED on Kevin Cohen’s LING
Features of corpora
  - Size (little/big/huge)
  - Plasticity (finite/monitor)
  - Metadata (none/lots)
  - Annotation (none, …, lots)
  - Balance
Features: size
  - Relative over time
      - 1960s: 1M words (Brown)
      - 1990s: 4.5M words (Penn Treebank)
      - 2000s: 415M words (BOE)
      - 2000s: 1,000M words (English Gigaword)
  - Currently: micro/small/large/massive
Features: plasticity
  - Finite
      - size established in advance
      - sample sizes adjusted accordingly
      - doesn't change over time
  - Monitor
      - allows diachronic analysis
      - grows over time
Metadata
  - (practically) none
      - language, at least
      - document boundaries
  - some
      - document attributes: title, body, author, date
  - Example record:
      PMID
      DP Nov
      TI - The natural history of Machado-Joseph disease. An analysis of 138 personally examined cases.
      PG
      AB - We have examined 138 cases of a disorder previously described in people of Portuguese origin and which has received many names. By computer analysis of 46 different items of a standardized neurological examination carried out in each patient, we have been able to delineate the main components of
Metadata
  - Lots
      - author characteristics: gender, age, mother tongue(s), dialect, educational level
      - genre classification: news, scientific, personal
      - topic relevance
  - Example record fields:
      MH - Aged
      MH - Azores/ethnology
      MH - Cerebellar Ataxia/diagnosis
      MH - Gene Frequency
      MH - Human
      MH - Phenotype
      MH - Portugal/ethnology
      MH - Support, Non-U.S. Gov't
      MH - Syndrome
      MH - United States
      MH - Variation (Genetics)
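The tagged fields in the record excerpts (TI, MH) follow a tag, dash, value layout. The following is a minimal sketch of reading such fields into a dictionary, assuming one field per line in the form `TAG - value`. The function name `parse_fields` is hypothetical, and real MEDLINE exports also use continuation lines, which this sketch ignores.

```python
from collections import defaultdict

def parse_fields(record: str) -> dict[str, list[str]]:
    """Group 'TAG - value' lines by tag; repeated tags (e.g. MH) accumulate."""
    fields = defaultdict(list)
    for line in record.splitlines():
        # Split on the first "- " only, so hyphens inside the value survive.
        tag, sep, value = line.partition("- ")
        if sep and tag.strip().isalpha():
            fields[tag.strip()].append(value.strip())
    return dict(fields)

# A few lines copied from the slide's example record.
record = """MH - Aged
MH - Azores/ethnology
MH - Gene Frequency
TI - The natural history of Machado-Joseph disease."""

fields = parse_fields(record)
print(fields["MH"])  # ['Aged', 'Azores/ethnology', 'Gene Frequency']
print(fields["TI"])  # ['The natural history of Machado-Joseph disease.']
```

Storing every tag as a list keeps repeated fields such as MH together, which is what makes "lots" of metadata searchable by topic.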
Balanced corpora
  - What are you balancing?
  - Most common: genre
  - Authors
      - gender
      - age
      - education
      - dialect
Balanced corpora
  [Figure: Composition of the International Corpus of English (adapted from Meyer 2002). Node labels: speech, writing; unpublished, published; non-fiction, fiction; informative, instructional, persuasive; academic, popular, news]
Balanced corpora
  [Figure: Composition of the International Corpus of English (adapted from Meyer 2002). Node labels: speech, writing; dialogue, monologue; scripted, unscripted; talks, news, speeches]
Corpus length
  - Overall length
  - Sample size
      - partial
          - 2,000 words (Brown, LOB, ICE)
          - 5,000 words (London-Lund)
      - full
          - takes up space
          - copyright permission is harder to obtain
Sample size
  - Motivating assumption: it is more important to maximize the number of authors/genres than the length of text from each
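The trade-off behind fixed-size samples can be worked through with simple arithmetic: for a fixed overall length, shorter samples leave room for more texts. The sketch below assumes whitespace tokenization and sampling from the start of each text, both simplifications; the function names are hypothetical.

```python
def take_sample(text: str, sample_size: int = 2000) -> list[str]:
    """Return the first `sample_size` whitespace-separated tokens of a text."""
    return text.split()[:sample_size]

def samples_in_budget(corpus_size: int, sample_size: int) -> int:
    """How many fixed-size samples fit into a corpus of `corpus_size` words."""
    return corpus_size // sample_size

# A 1M-word corpus with Brown-style 2,000-word samples vs. 5,000-word samples.
print(samples_in_budget(1_000_000, 2_000))  # 500
print(samples_in_budget(1_000_000, 5_000))  # 200
```

With the same one-million-word budget, 2,000-word samples cover 500 texts, while 5,000-word samples cover only 200.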
By purpose
  - Linguistic-y
      - lexicon vs. other
  - NLP
      - general purpose
      - information retrieval
      - information extraction
  - Foreign language instruction
      - Native L2
      - "Learner" L2
Is there a corpus…
Annotation
  - None/some/lots
Annotation
  - None
      - "collection"
  - Some
      - POS
      - lemmas: lemma(be) = {be, am, is, are, was, were, being, been}
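Lemma annotation amounts to mapping each inflected form back to its headword. Here is a minimal sketch using the inflected forms of "be", assuming a hand-built lookup table; the names `LEMMA_FORMS` and `lemmatize` are hypothetical, and a real lemmatizer would also need POS information to resolve ambiguous forms.

```python
# Lemma -> set of inflected forms (the paradigm of "be").
LEMMA_FORMS = {
    "be": {"be", "am", "is", "are", "was", "were", "being", "been"},
}

# Invert the table so each surface form points at its lemma.
FORM_TO_LEMMA = {
    form: lemma
    for lemma, forms in LEMMA_FORMS.items()
    for form in forms
}

def lemmatize(token: str) -> str:
    """Look up a token's lemma; unknown tokens fall back to their lowercase form."""
    lowered = token.lower()
    return FORM_TO_LEMMA.get(lowered, lowered)

print([lemmatize(t) for t in "Dogs are happy".split()])  # ['dogs', 'be', 'happy']
```

Searching a lemmatized corpus for "be" then retrieves every inflected form in one query.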
Annotation
  - Lots
      - syntax (treebank, "bracketing")
      - semantics
          - predicate/argument structure
          - ontological
  - Example sentence: Dogs make me happy.
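Syntactic "bracketing" stores a parse as nested labeled parentheses. The sketch below reads one such string into nested tuples and recovers the words. The bracketing given for "Dogs make me happy." is an illustrative Penn-Treebank-style analysis supplied here as an assumption, not taken from the slide, and `parse_bracketed` and `leaves` are hypothetical helper names.

```python
import re

def parse_bracketed(s: str):
    """Parse '(LABEL child ...)' into (label, [children]); leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()

def leaves(tree) -> list[str]:
    """Collect the words of a tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

# Assumed, illustrative bracketing of the example sentence.
bracketed = "(S (NP (NNS Dogs)) (VP (VBP make) (S (NP (PRP me)) (ADJP (JJ happy)))) (. .))"
tree = parse_bracketed(bracketed)
print(tree[0])       # S
print(leaves(tree))  # ['Dogs', 'make', 'me', 'happy', '.']
```

Once the brackets are parsed into a tree, structural searches (for example, every NP directly under S) become simple tree traversals.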
Diachronic
  - Historical (OE, ME, …)
  - Later sampling of earlier balanced corpus
  - Monitor
Spoken
  - Phonetically motivated (elicited)
  - Other ("natural")
Multilingual
  - Parallel
      - L1 contents == L2 contents
      - Parliamentary proceedings in English & French
      - Shakespeare in English and German
  - Translation/comparable
      - two L1's; genre == genre
      - e.g., weather reports
Penn Treebank
  - treebank: corpus of syntactically annotated data
  - first release: 4.5 million words, 3 years' work
  - currently 4.9M words
Penn Treebank
  - POS-tagged Switchboard data
  - Dysfluency-annotated Switchboard data
  - Syntactically-annotated Switchboard data
GENIA
  - 2000 abstracts
  - red blood cell transcription factors
  - POS-tagged (HW2, #16)
  - semantic annotation with molecular biology ontology
Corpora/resources
  - Dictionaries, ontologies, ...
      - CELEX
      - WordNet
Corpora/resources
  - Dictionaries, ontologies, ...
  - "discovery procedure"
      - phonology
          - contrasts
          - phonotactics
      - morphology
          - term formation
          - inflectional
McEnery & Wilson's definition of "corpus"
  - sampled & representative
  - finite size
  - machine-readable
  - "standard reference" ???