CS224N Section 3: Project, Corpora
Shrey Gupta
January 28, 2011
(Thanks to Bill MacCartney, Helen Kwong and Pi-Chuan Chang for these materials)
Agenda
  Go through administrative details regarding the final project
  Presentations by research groups
  Resources for final project
Final Project
  Proposal due in 2 weeks: Wed. 2/9
  Other details
    Please read the final project guide
    Projects from previous years: http://nlp.stanford.edu/courses/cs224n/
    Proposal: intended as a sanity check and to make sure that the topic is relevant to the course
    34% of your grade
    Team size: 1-3 member(s)
    Reports and code due on 3/9 (late days allowed)
    Project presentations on 3/17
Project Ideas
  Topics from the syllabus
  Ideas listed in the project guide
  Papers from NLP conferences: http://www.cs.rochester.edu/~tetreaul/conferences.html
  Collaboration with research groups at Stanford
  Something you are really interested in!
Presentations by research groups
Topics
  Relation Extraction in the Knowledge Base Population (KBP) context
  BioNLP Event Extraction
  Predicting U.S. Elections with Twitter
  Litigation Analysis: Outcome Prediction, Field Classification, Attorney Recommendation, Entity Resolution
  Document classification to identify outbreak-related web content
Resources
Corpora
  Corpora@Stanford: http://www.stanford.edu/dept/linguistics/corpora/
    Some are on AFS (/afs/ir/data/linguistic-data/); some are available on DVDs/CDs in the Linguistics department
  LDC (Linguistic Data Consortium): http://www.ldc.upenn.edu/Catalog/
  Links to many resources: http://nlp.stanford.edu/links/statnlp.html
Treebanks
  Most widely used: Penn Treebank
    There's PTB2 and PTB3. Use PTB3, i.e. Treebank-3.
    Contains:
      50,000 sentences (1,000,000 words) of WSJ text from 1989
      30,000 sentences (400,000 words) of Brown corpus
    Parsed WSJ trees: /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/
  BLLIP: like PTB, WSJ text, but 30m words, parsed automatically by Charniak
  Switchboard: telephone conversations
  PTB WSJ contains sections 0 through 24, ~2,000 sentences each, but section 24 is half size. ~50,000 sentences total.
  Convention in the parsing world:
    sections 2-21: training (39,832 sentences)
    section 0 or 22 or 24: development testing
    section 23: final test data
  Sections 0 and 1 are perceived to be less reliable: the annotators were still warming up.
  PTB3 adds some new material vs. PTB2, but NO BUG FIXES.
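The section convention above can be written down as a small helper. This is only a sketch: the function names are made up for illustration, section 22 is picked arbitrarily from the three possible development sections, and the two-digit subdirectory names (00 through 24) under the wsj directory are an assumption to check against the actual AFS layout.

```python
import os

# Conventional roles of the WSJ sections, as listed on the slide.
# Using section 22 for development is an arbitrary pick among 0, 22 and 24.
TRAIN_SECTIONS = range(2, 22)   # sections 2-21
DEV_SECTIONS = (22,)
TEST_SECTIONS = (23,)


def split_for_section(section):
    """Return 'train', 'dev' or 'test', or None for a section left unused."""
    if section in TRAIN_SECTIONS:
        return "train"
    if section in DEV_SECTIONS:
        return "dev"
    if section in TEST_SECTIONS:
        return "test"
    return None


def section_dirs(wsj_root, split):
    """List the section directories under wsj_root that belong to a split.

    Assumes two-digit subdirectory names such as '02' and '23'.
    """
    return [os.path.join(wsj_root, "%02d" % s)
            for s in range(25) if split_for_section(s) == split]
```

For example, section_dirs(wsj_root, "train") yields the 20 training directories. Keep section 23 untouched until the final evaluation.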
Parsed corpora in other languages
  Penn Arabic Treebank Corpus: 734 stories (140,000 words)
  Penn Chinese Treebank Corpus: 50,000 sentences
  German (newspaper text):
    NEGRA: http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/
    TIGER: http://www.ims.uni-stuttgart.de/projekte/TIGER/
    Tueba-D/Z: http://www.sfs.uni-tuebingen.de/en/tuebadz.shtml
Part-of-speech tagged corpora
  POS tags from treebanks
  British National Corpus (BNC)
    100m words
    Wide sample of British English: newspapers, books, letters
    http://www.natcorp.ox.ac.uk/
Named Entity Recognition (NER)
  Message Understanding Conference (MUC)
    We have MUC-6 and MUC-7
    Example: /afs/ir/data/linguistic-data/MUC_7/muc_7/data/training.ne.eng.keys.980205
  CoNLL shared tasks: Language-Independent Named Entity Recognition (I), (II)
    2002: http://www.cnts.ua.ac.be/conll2002/ner/
    2003: http://www.cnts.ua.ac.be/conll2003/ner/
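If you work with the CoNLL data, a small reader gets you from files to (token, tag) pairs. The sketch below assumes the usual layout of these files: one token per line, whitespace-separated columns with the entity tag in the last column, and a blank line between sentences. The function name and the sample lines are made up for illustration, so check the task pages above for the exact column layout of each language.

```python
def read_conll(lines):
    """Group column-formatted lines into sentences.

    Each sentence is a list of (token, tag) pairs; the token is taken
    from the first column and the entity tag from the last column.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # A blank line (or a document marker) ends the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample in the assumed format, not copied from the corpus.
sample = """U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER

Baghdad NNP I-NP I-LOC
. . O O
"""
sentences = read_conll(sample.splitlines())
```

Reading only the first and last columns keeps the same function usable for files that carry fewer intermediate columns.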
Anaphora resolution
  Data: MUC-6 and MUC-7
  Example: Pam went home because she felt sick.
  Demo: http://lingpipe-demos.com:8080/lingpipe-demos/coref_en_news_muc6/textInput.html
  Unsolved problem
  Harder example:
    We gave the bananas to the monkeys because they were hungry.
    We gave the bananas to the monkeys because they were ripe.
Semantics
  WordNet
    Website: http://wordnet.princeton.edu/
    Browse online: http://wordnetweb.princeton.edu/perl/webwn
    150,000 nouns, verbs, adjectives, adverbs
    Groups words into "synsets" with short, general definitions, and records various relations between synsets, e.g. the hypernym (kind-of) hierarchy.
    Neat visual interface: http://www.visualthesaurus.com/?vt
  Problems with WordNet:
    Fine-grained senses
    Sense ordering is sometimes funny (see "airline")
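To make the hypernym (kind-of) hierarchy concrete, here is a toy sketch. The dictionary is hand-made for illustration and is not copied from WordNet; real synsets can have several hypernyms and words can have several senses, which this single-parent version ignores. The function names are also made up.

```python
# Hand-made toy hierarchy: each entry maps a concept to its single hypernym.
# Illustrative only; these entries are not taken from WordNet.
HYPERNYM = {
    "poodle": "dog",
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Follow kind-of links upward from word until no hypernym is left."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def lowest_common_hypernym(a, b):
    """Return the most specific concept that is an ancestor of both words."""
    ancestors_of_a = hypernym_chain(a)
    for node in hypernym_chain(b):
        if node in ancestors_of_a:
            return node
    return None
```

Walking up the hierarchy like this is the basis of many WordNet-based similarity measures: two words are treated as more similar the deeper their shared ancestor sits.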
Semantic Role Labeling
  Detection of the semantic arguments associated with each verb in a sentence
  Example: "I [agent] sold you [patient] a book [theme]"
  CoNLL shared task 2004, 2005: http://www.lsi.upc.es/~srlconll/
  PropBank: adds predicate-argument relations to PTB syntax trees
  FrameNet: http://framenet.icsi.berkeley.edu/
  Demo from UIUC: http://l2r.cs.uiuc.edu/~cogcomp/srl-demo.php
More corpora for specific tasks
  Word Sense Disambiguation (WSD)
    Senseval: http://www.senseval.org/
  Question Answering
    e.g. "What film introduced Jar Jar Binks?"
    TREC competition, Question Answering track: http://trec.nist.gov/data/qamain.html
  Textual Entailment
    Recognizing Textual Entailment (RTE) challenges: http://pascallin.ecs.soton.ac.uk/Challenges/RTE/
  Events, temporal relations
    TimeBank corpus: http://timeml.org/site/timebank/browser_1.2/
More corpora for specific tasks
  Topic Detection and Tracking
    Given documents, separate them into different topics
    http://projects.ldc.upenn.edu/TDT/
Speech & Dialogue
  BNC: 10m words
  Switchboard corpus
    Conversations between two speakers recorded over the phone
    Transcriptions of their speech, with speakers labeled
    Example: http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.readme.html#txt
Email/Spam
  Enron corpus
    /afs/ir/data/linguistic-data/Enron-Email-Corpus/maildir/skilling-j/
    Annotated subsets (for NER): http://www.cs.cmu.edu/~einat/datasets.html
  TREC Spam track
    http://trec.nist.gov/data/spam.html
Tools
  Many links to tools on the StatNLP page: http://nlp.stanford.edu/links/statnlp.html
  Parsers
    Stanford Parser (English, Chinese, German and Arabic): http://nlp.stanford.edu/software/lex-parser.shtml
    Online parser: http://josie.stanford.edu:8080/parser/
    Collins's parser, Charniak's parser, MiniPar, etc.: http://nlp.stanford.edu/fsnlp/probparse/
  POS taggers
  Named entity recognizers
  Language modeling toolkits
Machine learning tools
  Stanford classifier
    Conditional loglinear (aka maximum entropy) model
    http://nlp.stanford.edu/software/classifier.shtml
  Weka
    Java library containing (nearly) every machine learning algorithm: Naive Bayes, perceptron, decision tree, MaxEnt, SVM, etc.
    http://www.cs.waikato.ac.nz/ml/weka/
  Mallet
    Java; useful for statistical NLP, document classification, clustering, topic modeling, information extraction...
    http://mallet.cs.umass.edu/
Thank you! Any questions?