CS224N Section 3: Corpora, etc. Helen Kwong April 24, 2009 (Thanks to Bill MacCartney and Pi-Chuan Chang for these materials)
Final Project Proposal due in 2 weeks - Wed. 5/6 Project ideas Final project guide Go through the syllabus Projects from previous years: http://nlp.stanford.edu/courses/cs224n/
Corpora Corpora@Stanford LDC (Linguistic Data Consortium) http://www.stanford.edu/dept/linguistics/corpora/ Some are on AFS (/afs/ir/data/linguistic-data/); some are available on DVD/CDs in the linguistic department LDC (Linguistic Data Consortium) http://www.ldc.upenn.edu/Catalog/ Links to many resources http://nlp.stanford.edu/links/statnlp.html Previous years’ notes: http://www.stanford.edu/class/cs224n/handouts/cs224n-section3-corpora.txt http://www.stanford.edu/class/cs224n/handouts/section3-2008.pdf
Treebanks Most widely used: Penn Treebank There's PTB2 and PTB3. Use PTB3, i.e. Treebank-3 Contains: 50,000 sentences (1,000,000 words) of WSJ text from 1989 30,000 sentences (400,000 words) of Brown corpus Parsed WSJ trees: /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/ See Bill’s notes for more details BLLIP: like PTB, WSJ text, but 30m words, parsed automatically by Charniak Switchboard: telephone conversations
Parsed corpora in other languages Penn Arabic Treebank Corpus 734 stories (140,000 words) Penn Chinese Treebank Corpus 50,000 sentences German (newspaper text): NEGRA http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/ TIGER http://www.ims.uni-stuttgart.de/projekte/TIGER/ Tueba-D/Z http://www.sfs.uni-tuebingen.de/en_tuebadz.shtml
Part-of-speech tagged corpora POS tags from treebanks British National Corpus (BNC) 100m words wide sample of British English: newspapers, books, letters http://www.natcorp.ox.ac.uk/
Named Entity Recognition (NER) Message Understanding Conference (MUC) We have MUC-6 and MUC-7 Example: /afs/ir/data/linguistic-data/MUC_7/muc_7/data/training.ne.eng.keys.980205 CoNLL shared tasks: Language-Independent Named Entity Recognition (I), (II) 2002: http://www.cnts.ua.ac.be/conll2002/ner/ 2003: http://www.cnts.ua.ac.be/conll2003/ner/
Anaphora resolution Data: MUC-6 and MUC-7 Example: Pam went home because she felt sick Demo: http://lingpipe-demos.com:8080/lingpipe-demos/coref_en_news_muc6/textInput.html Unsolved problem Harder example: We gave the bananas to the monkeys because they were hungry We gave the bananas to the monkeys because they were ripe.
Semantics WordNet Website: http://wordnet.princeton.edu/ Browse online: http://wordnetweb.princeton.edu/perl/webwn 150,000 nouns, verbs, adjectives, adverbs Groups words into “synsets” with short, general definitions, and records various relations between synsets, e.g. hypernym (kind-of) hierarchy. Good tutorial: http://www.brians.org/Projects/Technology/Papers/Wordnet/ Neat visual interface: http://www.visualthesaurus.com/?vt Problems with WordNet: fine-grained senses sense ordering sometimes funny (see "airline")
Semantic Role Labeling Detection of semantic arguments associated with each verb in a sentence Example: “I [agent] sold you [patient] a book [theme]” CoNLL shared task 2004, 2005 http://www.lsi.upc.es/~srlconll/ PropBank Adds predicate-argument relations to PTB syntax trees FrameNet: http://framenet.icsi.berkeley.edu/ Demo from UIUC: http://l2r.cs.uiuc.edu/~cogcomp/srl-demo.php
More corpora for specific tasks Word Sense Disambiguation (WSD) Senseval: http://www.senseval.org/ Question Answering e.g. "What film introduced Jar Jar Binks?" TREC competition, Question Answering track http://trec.nist.gov/data/qamain.html Textual Entailment Recognizing Textual Entailment (RTE) challenges http://www.nist.gov/tac/tracks/2008/rte/ Events, temporal relations TimeBank corpus: http://corpora.dutchboy.net/timebank/
More corpora for specific tasks Topic Detection and Tracking Given documents, separate into different topics http://projects.ldc.upenn.edu/TDT/
Speech & Dialogue Speech Dialogue BNC: 10m words Switchboard corpus Conversations of two speakers recorded over the phone Transcriptions of their speech, with speakers labeled Example: http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.readme.html#txt
Email/Spam Enron corpus TREC Spam track /afs/ir/data/linguistic-data/Enron-Email-Corpus/maildir/skilling-j/ Annotated subsets: http://www.cs.cmu.edu/~einat/datasets.html TREC Spam track http://trec.nist.gov/data/spam.html
Tools Many links to tools on the StatNLP page Parsers POS taggers http://nlp.stanford.edu/links/statnlp.html Parsers Stanford Parser (English, Chinese, German and Arabic) http://nlp.stanford.edu/software/lex-parser.shtml Online parser: http://josie.stanford.edu:8080/parser/ Collin’s parser, Charniak’s parser, MiniPar, etc. http://nlp.stanford.edu/fsnlp/probparse/ POS taggers Named entity recognizers Language modeling toolkits
Machine learning tools Stanford classifier conditional loglinear (aka maximum entropy) model http://nlp.stanford.edu/software/classifier.shtml Weka Java library containing (nearly) every machine learning algorithm -Naive Bayes, perceptron, decision tree, MaxEnt, SVM, etc. http://www.cs.waikato.ac.nz/ml/weka/ Mallet Java; useful for statistical NLP, document classification, clustering, topic modeling, information extraction… http://mallet.cs.umass.edu/