CS224N Section 3: Project, Corpora

Shrey Gupta
January 28, 2011
(Thanks to Bill MacCartney, Helen Kwong, and Pi-Chuan Chang for these materials)

Agenda
- Go through administrative details regarding the final project
- Presentations by research groups
- Resources for final project

Final Project
- Please read the final project guide
- Projects from previous years:
- Proposal due in 2 weeks: Wed. 2/9
  - Intended as a sanity check and to make sure that the topic is relevant to the course
- Other details
  - 34% of your grade
  - Team size: 1-3 member(s)
  - Reports and code due on 3/9 (late days allowed)
  - Project presentations on 3/17

Project Ideas
- Topics from the syllabus
- Ideas listed in the project guide
- Papers from NLP conferences
- Collaboration with research groups at Stanford
- Something you are really interested in!

Presentations by research groups

Topics
- Relation Extraction in the Knowledge Base Population (KBP) context
- BioNLP Event Extraction
- Predicting U.S. Elections with Twitter
- Litigation Analysis: Outcome Prediction, Field Classification, Attorney Recommendation, Entity Resolution
- Document classification to identify outbreak-related web content

Resources

Corpora
- Corpora@Stanford
  - Some are on AFS (/afs/ir/data/linguistic-data/); some are available on DVDs/CDs in the linguistics department
- LDC (Linguistic Data Consortium)
- Links to many resources

Treebanks
- Most widely used: Penn Treebank
  - There's PTB2 and PTB3. Use PTB3, i.e. Treebank-3
  - Contains:
    - 50,000 sentences (1,000,000 words) of WSJ text from 1989
    - 30,000 sentences (400,000 words) of Brown corpus
  - Parsed WSJ trees: /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/
- BLLIP: like PTB, WSJ text, but 30m words, parsed automatically by Charniak
- Switchboard: telephone conversations
- PTB WSJ contains sections 0 through 24, ~2400 sentences each, but section 24 is half size. ~50,000 sentences total.
- Convention in parsing world:
  - sections 2-21: training (39,832 sentences)
  - section 0 or 22 or 24: development testing
  - section 23: final test data
- Sections 0 and 1 perceived to be less reliable (annotators warming up)
- PTB3 adds some new stuff vs. PTB2, but NO BUG FIXES
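
The section convention above can be written down as a small helper. This is only a sketch: the function names are made up for illustration, and it assumes each section lives in a two-digit subdirectory (00 through 24) under the WSJ path, which you should check against the actual directory listing.

```python
def wsj_split(section: int) -> str:
    """Map a WSJ section number (0-24) to its conventional role."""
    if not 0 <= section <= 24:
        raise ValueError(f"WSJ sections run from 0 to 24, got {section}")
    if 2 <= section <= 21:
        return "train"
    if section == 23:
        return "test"
    if section in (0, 22, 24):
        return "dev"  # the convention allows any one of these as dev data
    return "unused"   # section 1 has no role in the convention


def section_dir(root: str, section: int) -> str:
    """Build a section directory path (assumes two-digit subdirectory names)."""
    return f"{root.rstrip('/')}/{section:02d}"


WSJ_ROOT = "/afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/"
train_dirs = [section_dir(WSJ_ROOT, s) for s in range(25) if wsj_split(s) == "train"]
```

Keeping the split in one function makes it hard to accidentally tune on section 23.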

Parsed corpora in other languages
- Penn Arabic Treebank Corpus: 734 stories (140,000 words)
- Penn Chinese Treebank Corpus: 50,000 sentences
- German (newspaper text): NEGRA, TIGER, Tueba-D/Z

Part-of-speech tagged corpora
- POS tags from treebanks
- British National Corpus (BNC)
  - 100m words
  - wide sample of British English: newspapers, books, letters

Named Entity Recognition (NER)
- Message Understanding Conference (MUC)
  - We have MUC-6 and MUC-7
  - Example: /afs/ir/data/linguistic-data/MUC_7/muc_7/data/training.ne.eng.keys
- CoNLL shared tasks: Language-Independent Named Entity Recognition (I), (II)
  - 2002:
  - 2003:
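
For the CoNLL data, a reader along these lines may be a useful starting point. It is a sketch under assumptions not stated on the slide: one token per line, whitespace-separated columns with the entity tag in the last column, blank lines between sentences, and `-DOCSTART-` lines marking document boundaries. The sample text is made up for illustration, and `read_conll` is a hypothetical helper name.

```python
from typing import Iterable, Iterator, List, Tuple


def read_conll(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield sentences as lists of (token, entity_tag) pairs."""
    sentence: List[Tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        # Assumption: first column is the token, last column is the NE tag.
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


# Illustrative sample only, not copied from the corpus files.
sample = """\
-DOCSTART- -X- O O

Pam NNP I-NP I-PER
flew VBD I-VP O
to TO I-PP O
Paris NNP I-NP I-LOC

Done UH I-INTJ O
""".splitlines()

sentences = list(read_conll(sample))
```

Check the column order against the files you actually use before relying on this; the layout differs between shared-task years and languages.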

Anaphora resolution
- Data: MUC-6 and MUC-7
- Example: Pam went home because she felt sick
- Demo:
- Unsolved problem
- Harder example:
  - We gave the bananas to the monkeys because they were hungry.
  - We gave the bananas to the monkeys because they were ripe.

Semantics: WordNet
- Website: http://wordnet.princeton.edu/
- Browse online:
- 150,000 nouns, verbs, adjectives, adverbs
- Groups words into “synsets” with short, general definitions, and records various relations between synsets, e.g. hypernym (kind-of) hierarchy
- Neat visual interface:
- Problems with WordNet:
  - fine-grained senses
  - sense ordering sometimes funny (see "airline")
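
To make the hypernym (kind-of) hierarchy concrete, here is a toy sketch. The entries below are hand-written for illustration and are not copied from WordNet; it also simplifies by giving each node a single hypernym, whereas a real synset may have several. The function names are made up.

```python
# Toy data only: hand-written kind-of links, NOT actual WordNet content.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def hypernym_chain(node):
    """Follow kind-of links upward until reaching a node with no hypernym."""
    chain = [node]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def lowest_common_hypernym(a, b):
    """Return the first ancestor of b that is also an ancestor of a."""
    ancestors_of_a = set(hypernym_chain(a))
    for node in hypernym_chain(b):
        if node in ancestors_of_a:
            return node
    return None
```

The depth of the lowest common hypernym is one simple basis for word-similarity features in a project.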

Semantic Role Labeling
- Detection of semantic arguments associated with each verb in a sentence
- Example: “I [agent] sold you [patient] a book [theme]”
- CoNLL shared task 2004, 2005
- PropBank: adds predicate-argument relations to PTB syntax trees
- FrameNet:
- Demo from UIUC:

More corpora for specific tasks
- Word Sense Disambiguation (WSD)
  - Senseval:
- Question Answering
  - e.g. "What film introduced Jar Jar Binks?"
  - TREC competition, Question Answering track
- Textual Entailment
  - Recognizing Textual Entailment (RTE) challenges
- Events, temporal relations
  - TimeBank corpus:

More corpora for specific tasks
- Topic Detection and Tracking
  - Given documents, separate into different topics

Speech & Dialogue
- Speech
- Dialogue
- BNC: 10m words
- Switchboard corpus
  - Conversations of two speakers recorded over the phone
  - Transcriptions of their speech, with speakers labeled
  - Example:

Email/Spam
- Enron corpus
  - /afs/ir/data/linguistic-data/Enron- -Corpus/maildir/skilling-j/
  - Annotated subsets (for NER):
- TREC Spam track
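
A maildir-style tree like the one above can be walked with Python's standard library. This is a rough sketch, assuming one message per file with ordinary `From:`/`Subject:` headers; `iter_messages` and `count_by_sender` are hypothetical helper names, and reading as latin-1 is just a way to avoid decode errors on arbitrary bytes.

```python
import os
from email import message_from_string


def iter_messages(maildir_root):
    """Yield (path, message) for every file under a maildir-style directory."""
    for dirpath, _dirnames, filenames in os.walk(maildir_root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # latin-1 maps every byte to a character, so decoding never fails.
            with open(path, encoding="latin-1") as handle:
                yield path, message_from_string(handle.read())


def count_by_sender(maildir_root):
    """Count messages per From: header value."""
    counts = {}
    for _path, message in iter_messages(maildir_root):
        sender = message.get("From", "(unknown)")
        counts[sender] = counts.get(sender, 0) + 1
    return counts
```

Pass the directory of a single user (or the whole `maildir` root) as `maildir_root`; the message body is available via `message.get_payload()`.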

Tools
- Many links to tools on the StatNLP page
- Parsers
  - Stanford Parser (English, Chinese, German and Arabic)
  - Online parser:
  - Collins' parser, Charniak's parser, MiniPar, etc.
- POS taggers
- Named entity recognizers
- Language modeling toolkits

Machine learning tools
- Stanford classifier
  - conditional loglinear (aka maximum entropy) model
- Weka
  - Java library containing (nearly) every machine learning algorithm: Naive Bayes, perceptron, decision tree, MaxEnt, SVM, etc.
- Mallet
  - Java; useful for statistical NLP, document classification, clustering, topic modeling, information extraction…
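
For reference, a conditional loglinear (maximum entropy) model scores each label by a weighted sum of features and normalizes with a softmax. The sketch below is hand-written for illustration and is not the API of the Stanford classifier, Weka, or Mallet; the feature names and weights are made up.

```python
import math


def loglinear_probs(features, weights, labels):
    """Return P(label | features) under a conditional loglinear model.

    features: dict of feature name -> value for one input
    weights:  dict of (feature name, label) -> weight
    """
    scores = {
        label: sum(weights.get((name, label), 0.0) * value
                   for name, value in features.items())
        for label in labels
    }
    # Subtract the max score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exp_scores = {label: math.exp(s - max_score) for label, s in scores.items()}
    total = sum(exp_scores.values())
    return {label: e / total for label, e in exp_scores.items()}


# Hypothetical weights for a two-class email example.
weights = {("contains_free", "spam"): 1.5, ("contains_meeting", "ham"): 2.0}
probs = loglinear_probs({"contains_free": 1.0}, weights, ["spam", "ham"])
```

Training (choosing the weights to maximize conditional likelihood) is what the toolkits above provide; this only shows how predictions are computed once weights exist.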

Thank you! Any questions?
