1
CS224N Section 3: Corpora, etc.
Helen Kwong, April 24, 2009 (Thanks to Bill MacCartney and Pi-Chuan Chang for these materials)
2
Final Project
  Proposal due in 2 weeks: Wed. 5/6
  Project ideas
    Final project guide
    Go through the syllabus
    Projects from previous years:
3
Corpora
  Corpora@Stanford
    Some are on AFS (/afs/ir/data/linguistic-data/); some are available on DVD/CDs in the Linguistics department
  LDC (Linguistic Data Consortium)
  Links to many resources
  Previous years’ notes:
4
Treebanks
  Most widely used: Penn Treebank
    There's PTB2 and PTB3. Use PTB3, i.e. Treebank-3
    Contains:
      50,000 sentences (1,000,000 words) of WSJ text from 1989
      30,000 sentences (400,000 words) of Brown corpus
    Parsed WSJ trees: /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/
    See Bill’s notes for more details
  BLLIP: like PTB, WSJ text, but 30m words, parsed automatically by Charniak
  Switchboard: telephone conversations
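The parsed trees are stored as bracketed text, so a small reader is often the first thing a project needs. The following is a minimal sketch, assuming one tree per string and the usual Penn Treebank convention of an extra unlabeled outer bracket; the sample sentence is made up for illustration and is not taken from the corpus.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves are plain strings. An unlabeled outer bracket gets the label "".
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def leaves(tree):
    """Return the words of a parsed tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Hypothetical example tree in Penn Treebank bracket notation.
SAMPLE = "( (S (NP (DT The) (NN cat)) (VP (VBD sat)) (. .)) )"
tree = parse_tree(SAMPLE)
print(leaves(tree))
```

A real .mrg file holds many trees back to back, so it would first need to be split into one string per tree before calling parse_tree.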
5
Parsed corpora in other languages
  Penn Arabic Treebank Corpus
    734 stories (140,000 words)
  Penn Chinese Treebank Corpus
    50,000 sentences
  German (newspaper text):
    NEGRA
    TIGER
    Tueba-D/Z
6
Part-of-speech tagged corpora
  POS tags from treebanks
  British National Corpus (BNC)
    100m words
    wide sample of British English: newspapers, books, letters
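Getting POS tags from a treebank amounts to reading off the preterminal nodes of each bracketed tree. The sketch below assumes Penn Treebank style brackets, where each word sits in its own (TAG word) pair and empty elements carry the tag -NONE-; the sample tree is a made-up illustration, not corpus data.

```python
import re

# A preterminal is a bracket holding exactly a tag and a word, e.g. (NN cat).
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def tagged_words(tree_text):
    """Return (word, tag) pairs from one bracketed tree, skipping empty elements."""
    return [
        (word, tag)
        for tag, word in PRETERMINAL.findall(tree_text)
        if tag != "-NONE-"
    ]


# Hypothetical example tree.
SAMPLE = "( (S (NP (DT The) (NN cat)) (VP (VBD sat)) (. .)) )"
print(tagged_words(SAMPLE))
```

The regular expression only works because words and tags in this notation never contain whitespace or raw parentheses; a full tree reader is the safer choice if anything beyond the tags is needed.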
7
Named Entity Recognition (NER)
  Message Understanding Conference (MUC)
    We have MUC-6 and MUC-7
    Example: /afs/ir/data/linguistic-data/MUC_7/muc_7/data/training.ne.eng.keys
  CoNLL shared tasks: Language-Independent Named Entity Recognition (I), (II)
    2002:
    2003:
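NER data of the CoNLL kind is usually consumed as a sequence of tokens, each with an entity tag. The sketch below assumes the tags have already been read into (token, tag) pairs and follow an IOB-style scheme (B-TYPE or I-TYPE inside an entity, O outside); the exact column layout differs between data sets and should be checked against the files, and the sample sentence is invented.

```python
def entity_spans(tagged):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in tagged:
        if tag == "O":
            prefix, etype = "O", None
        else:
            prefix, _, etype = tag.partition("-")
        # A B- prefix or a change of type starts a new entity.
        starts_new = etype is not None and (prefix == "B" or etype != current_type)
        if current_type is not None and (etype is None or starts_new):
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if etype is not None:
            if current_type is None:
                current_type = etype
            current_tokens.append(token)
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical tagged sentence.
SAMPLE = [
    ("Pam", "B-PER"), ("Smith", "I-PER"), ("visited", "O"),
    ("Palo", "B-LOC"), ("Alto", "I-LOC"),
]
print(entity_spans(SAMPLE))
```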
8
Anaphora resolution
  Data: MUC-6 and MUC-7
  Example: Pam went home because she felt sick.
  Demo:
  Unsolved problem
  Harder example:
    We gave the bananas to the monkeys because they were hungry.
    We gave the bananas to the monkeys because they were ripe.
9
Semantics
  WordNet
    Website: http://wordnet.princeton.edu/
    Browse online:
    150,000 nouns, verbs, adjectives, adverbs
    Groups words into “synsets” with short, general definitions, and records various relations between synsets, e.g. hypernym (kind-of) hierarchy.
    Good tutorial:
    Neat visual interface:
    Problems with WordNet:
      fine-grained senses
      sense ordering sometimes funny (see "airline")
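To make the hypernym (kind-of) hierarchy concrete, here is a toy sketch of walking up such a hierarchy and finding the lowest shared ancestor of two words. The HYPERNYM table is a hand-written stand-in, not actual WordNet data, and it assumes a single parent per entry, whereas real synsets can have several hypernyms and a word can belong to several synsets.

```python
# Toy kind-of table: child -> parent. Hypothetical data for illustration only.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Return the word followed by its successive hypernyms, most specific first."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def lowest_common_hypernym(a, b):
    """Return the most specific node that both words fall under, or None."""
    ancestors = set(hypernym_chain(a))
    for node in hypernym_chain(b):
        if node in ancestors:
            return node
    return None


print(hypernym_chain("dog"))
print(lowest_common_hypernym("dog", "cat"))
```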
10
Semantic Role Labeling
  Detection of semantic arguments associated with each verb in a sentence
  Example: “I [agent] sold you [patient] a book [theme]”
  CoNLL shared task 2004, 2005
  PropBank
    Adds predicate-argument relations to PTB syntax trees
  FrameNet:
  Demo from UIUC:
11
More corpora for specific tasks
  Word Sense Disambiguation (WSD)
    Senseval:
  Question Answering
    e.g. "What film introduced Jar Jar Binks?"
    TREC competition, Question Answering track
  Textual Entailment
    Recognizing Textual Entailment (RTE) challenges
  Events, temporal relations
    TimeBank corpus:
12
More corpora for specific tasks
  Topic Detection and Tracking
    Given documents, separate into different topics
13
Speech & Dialogue
  Speech
  Dialogue
    BNC: 10m words
    Switchboard corpus
      Conversations of two speakers recorded over the phone
      Transcriptions of their speech, with speakers labeled
      Example:
14
Email/Spam
  Enron corpus
    /afs/ir/data/linguistic-data/Enron- -Corpus/maildir/skilling-j/
    Annotated subsets:
  TREC Spam track
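A maildir-style layout generally means one plain-text message per file, with headers followed by a blank line and the body. Assuming that holds for the files in question, Python's standard email package can pull out the fields; the sketch below works on a message already read into a string, and the sample message is invented rather than taken from the corpus.

```python
from email import message_from_string

# Hypothetical message text in the usual header/blank-line/body layout.
RAW = """From: alice@example.com
To: bob@example.com
Subject: Meeting

See you at 3pm.
"""


def summarize(raw):
    """Return the main header fields and body of a single-part message."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


print(summarize(RAW))
```

For a whole mailbox, the same function can be applied to each file's contents after walking the directory; multipart messages would need extra handling because get_payload then returns a list of parts.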
15
Tools
  Many links to tools on the StatNLP page
  Parsers
    Stanford Parser (English, Chinese, German and Arabic)
    Online parser:
    Collins’ parser, Charniak’s parser, MiniPar, etc.
  POS taggers
  Named entity recognizers
  Language modeling toolkits
16
Machine learning tools
  Stanford classifier
    conditional loglinear (aka maximum entropy) model
  Weka
    Java library containing (nearly) every machine learning algorithm: Naive Bayes, perceptron, decision tree, MaxEnt, SVM, etc.
  Mallet
    Java; useful for statistical NLP, document classification, clustering, topic modeling, information extraction…
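Before reaching for one of these toolkits, it can help to see how small the simplest listed algorithm is. The following is a from-scratch sketch of multinomial Naive Bayes with add-one smoothing for document classification. It does not use or imitate the API of Weka, Mallet, or the Stanford classifier, and the training documents are invented.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect class and word counts from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify_nb(model, tokens):
    """Return the label with the highest log posterior; unseen words are ignored."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
TRAIN = [
    ("win money now".split(), "spam"),
    ("cheap money offer".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
model = train_nb(TRAIN)
print(classify_nb(model, "money offer now".split()))
print(classify_nb(model, "lunch meeting".split()))
```

The toolkits above add what this sketch lacks: feature extraction, other model families, and evaluation utilities.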