CS224N Section 3: Corpora, etc.

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
Advertisements

ClearTK: A Framework for Statistical Biomedical Natural Language Processing Philip Ogren Philipp Wetzler Department of Computer Science University of Colorado.
Overview of the Hindi-Urdu Treebank Fei Xia University of Washington 7/23/2011.
INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING NLP-AI IIIT-Hyderabad CIIL, Mysore ICON DECEMBER, 2003.
For Friday No reading Homework –Chapter 23, exercises 1, 13, 14, 19 –Not as bad as it sounds –Do them IN ORDER – do not read ahead here.
Semantic Role Labeling Abdul-Lateef Yussiff
Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.
Methods in Computational Linguistics II Queens College Lecture 1: Introduction.
LING 581: Advanced Computational Linguistics Lecture Notes May 5th.
Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.
Tasks Talk: ULA08 Workshop March 18, 2007 A Talk about Tasks Unified Linguistic Annotation Workshop Adam Meyers New York University March 18, 2008.
Are Linguists Dinosaurs? 1.Statistical language processors seem to be doing away with the need for linguists. –Why do we need linguists when a machine.
Introduction to CL Session 1: 7/08/2011. What is computational linguistics? Processing natural language text by computers  for practical applications.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
تمرين شماره 1 درس NLP سيلابس درس NLP در دانشگاه هاي ديگر ___________________________ راحله مکي استاد درس: دکتر عبدالله زاده پاييز 85.
Machine Learning in Natural Language Processing Noriko Tomuro November 16, 2006.
NATURAL LANGUAGE TOOLKIT(NLTK) April Corbet. Overview 1. What is NLTK? 2. NLTK Basic Functionalities 3. Part of Speech Tagging 4. Chunking and Trees 5.
Named Entity Recognition and the Stanford NER Software Jenny Rose Finkel Stanford University March 9, 2007.
Extracting Opinions, Opinion Holders, and Topics Expressed in Online News Media Text Soo-Min Kim and Eduard Hovy USC Information Sciences Institute 4676.
MAchine Learning for LanguagE Toolkit
ELN – Natural Language Processing Giuseppe Attardi
Automatic Extraction of Opinion Propositions and their Holders Steven Bethard, Hong Yu, Ashley Thornton, Vasileios Hatzivassiloglou and Dan Jurafsky Department.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
A Survey of NLP Toolkits Jing Jiang Mar 8, /08/20072 Outline WordNet Statistics-based phrases POS taggers Parsers Chunkers (syntax-based phrases)
A Web Application for Customized Corpus Delivery Nancy Ide, Keith Suderman, Brian Simms Department of Computer Science Vassar College USA.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
MinorThird 서울시립대학교 인공지능연구실 곽별샘
MASC The Manually Annotated Sub- Corpus of American English Nancy Ide, Collin Baker, Christiane Fellbaum, Charles Fillmore, Rebecca Passonneau.
Semiautomatic domain model building from text-data Petr Šaloun Petr Klimánek Zdenek Velart Petr Šaloun Petr Klimánek Zdenek Velart SMAP 2011, Vigo, Spain,
11 Chapter 19 Lexical Semantics. 2 Lexical Ambiguity Most words in natural languages have multiple possible meanings. –“pen” (noun) The dog is in the.
Natural language processing tools Lê Đức Trọng 1.
For Monday Read chapter 24, sections 1-3 Homework: –Chapter 23, exercise 8.
For Friday Finish chapter 24 No written homework.
For Monday Read chapter 26 Last Homework –Chapter 23, exercise 7.
For Friday Finish chapter 23 Homework –Chapter 23, exercise 15.
Tools for Linguistic Analysis. Overview of Linguistic Tools  Dictionaries  Linguistic Inquiry and Word Count (LIWC) Linguistic Inquiry and Word Count.
1 Fine-grained and Coarse-grained Word Sense Disambiguation Jinying Chen, Hoa Trang Dang, Martha Palmer August 22, 2003.
Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC.
SALSA-WS 09/05 Approximating Textual Entailment with LFG and FrameNet Frames Aljoscha Burchardt, Anette Frank Computational Linguistics Department Saarland.
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Problem Solving with NLTK MSE 2400 EaLiCaRA Dr. Tom Way.
COSC 6336: Natural Language Processing
Measuring Monolinguality
Coarse-grained Word Sense Disambiguation
SENSEVAL: Evaluating WSD Systems
An overview of the Natural Language Toolkit
Google SyntaxNet “Parsey McParseface and other SyntaxNet models are some of the most complex networks that we have trained with the TensorFlow framework.
Tools for Natural Language Processing Applications
Natural Language Processing (NLP)
LING/C SC/PSYC 438/538 Lecture 20 Sandiway Fong.
Text Analytics Giuseppe Attardi Università di Pisa
Social Knowledge Mining
Machine Learning in Natural Language Processing
LING 388: Computers and Language
LING/C SC 581: Advanced Computational Linguistics
WordNet: A Lexical Database for English
Stanford CoreNLP
WordNet WordNet, WSD.
LING/C SC/PSYC 438/538 Lecture 23 Sandiway Fong.
Automatic Detection of Causal Relations for Question Answering
Computational Linguistics: New Vistas
Text Mining & Natural Language Processing
Using Uneven Margins SVM and Perceptron for IE
Natural Language Processing (NLP)
CS565: Intelligent Systems and Interfaces
CS224N Section 3: Project,Corpora
Artificial Intelligence 2004 Speech & Natural Language Processing
Information Retrieval
Natural Language Processing (NLP)
Presentation transcript:

CS224N Section 3: Corpora, etc. Helen Kwong April 24, 2009 (Thanks to Bill MacCartney and Pi-Chuan Chang for these materials)

Final Project Proposal due in 2 weeks - Wed. 5/6 Project ideas Final project guide Go through the syllabus Projects from previous years: http://nlp.stanford.edu/courses/cs224n/

Corpora Corpora@Stanford LDC (Linguistic Data Consortium) http://www.stanford.edu/dept/linguistics/corpora/ Some are on AFS (/afs/ir/data/linguistic-data/); some are available on DVD/CDs in the linguistic department LDC (Linguistic Data Consortium) http://www.ldc.upenn.edu/Catalog/ Links to many resources http://nlp.stanford.edu/links/statnlp.html Previous years’ notes: http://www.stanford.edu/class/cs224n/handouts/cs224n-section3-corpora.txt http://www.stanford.edu/class/cs224n/handouts/section3-2008.pdf

Treebanks Most widely used: Penn Treebank There's PTB2 and PTB3. Use PTB3, i.e. Treebank-3 Contains: 50,000 sentences (1,000,000 words) of WSJ text from 1989 30,000 sentences (400,000 words) of Brown corpus Parsed WSJ trees: /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/ See Bill’s notes for more details BLLIP: like PTB, WSJ text, but 30m words, parsed automatically by Charniak Switchboard: telephone conversations

Parsed corpora in other languages Penn Arabic Treebank Corpus 734 stories (140,000 words) Penn Chinese Treebank Corpus 50,000 sentences German (newspaper text): NEGRA http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/ TIGER http://www.ims.uni-stuttgart.de/projekte/TIGER/ Tueba-D/Z http://www.sfs.uni-tuebingen.de/en_tuebadz.shtml

Part-of-speech tagged corpora POS tags from treebanks British National Corpus (BNC) 100m words wide sample of British English: newspapers, books, letters http://www.natcorp.ox.ac.uk/

Named Entity Recognition (NER) Message Understanding Conference (MUC) We have MUC-6 and MUC-7 Example: /afs/ir/data/linguistic-data/MUC_7/muc_7/data/training.ne.eng.keys.980205 CoNLL shared tasks: Language-Independent Named Entity Recognition (I), (II) 2002: http://www.cnts.ua.ac.be/conll2002/ner/ 2003: http://www.cnts.ua.ac.be/conll2003/ner/

Anaphora resolution Data: MUC-6 and MUC-7 Example: Pam went home because she felt sick Demo: http://lingpipe-demos.com:8080/lingpipe-demos/coref_en_news_muc6/textInput.html Unsolved problem Harder example: We gave the bananas to the monkeys because they were hungry We gave the bananas to the monkeys because they were ripe.

Semantics WordNet Website: http://wordnet.princeton.edu/ Browse online: http://wordnetweb.princeton.edu/perl/webwn 150,000 nouns, verbs, adjectives, adverbs Groups words into “synsets” with short, general definitions, and records various relations between synsets, e.g. hypernym (kind-of) hierarchy. Good tutorial: http://www.brians.org/Projects/Technology/Papers/Wordnet/ Neat visual interface: http://www.visualthesaurus.com/?vt Problems with WordNet: fine-grained senses sense ordering sometimes funny (see "airline")

Semantic Role Labeling Detection of semantic arguments associated with each verb in a sentence Example: “I [agent] sold you [patient] a book [theme]” CoNLL shared task 2004, 2005 http://www.lsi.upc.es/~srlconll/ PropBank Adds predicate-argument relations to PTB syntax trees FrameNet: http://framenet.icsi.berkeley.edu/ Demo from UIUC: http://l2r.cs.uiuc.edu/~cogcomp/srl-demo.php

More corpora for specific tasks Word Sense Disambiguation (WSD) Senseval: http://www.senseval.org/ Question Answering e.g. "What film introduced Jar Jar Binks?" TREC competition, Question Answering track http://trec.nist.gov/data/qamain.html Textual Entailment Recognizing Textual Entailment (RTE) challenges http://www.nist.gov/tac/tracks/2008/rte/ Events, temporal relations TimeBank corpus: http://corpora.dutchboy.net/timebank/

More corpora for specific tasks Topic Detection and Tracking Given documents, separate into different topics http://projects.ldc.upenn.edu/TDT/

Speech & Dialogue Speech Dialogue BNC: 10m words Switchboard corpus Conversations of two speakers recorded over the phone Transcriptions of their speech, with speakers labeled Example: http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.readme.html#txt

Email/Spam Enron corpus TREC Spam track /afs/ir/data/linguistic-data/Enron-Email-Corpus/maildir/skilling-j/ Annotated subsets: http://www.cs.cmu.edu/~einat/datasets.html TREC Spam track http://trec.nist.gov/data/spam.html

Tools Many links to tools on the StatNLP page Parsers POS taggers http://nlp.stanford.edu/links/statnlp.html Parsers Stanford Parser (English, Chinese, German and Arabic) http://nlp.stanford.edu/software/lex-parser.shtml Online parser: http://josie.stanford.edu:8080/parser/ Collin’s parser, Charniak’s parser, MiniPar, etc. http://nlp.stanford.edu/fsnlp/probparse/ POS taggers Named entity recognizers Language modeling toolkits

Machine learning tools Stanford classifier conditional loglinear (aka maximum entropy) model http://nlp.stanford.edu/software/classifier.shtml Weka Java library containing (nearly) every machine learning algorithm -Naive Bayes, perceptron, decision tree, MaxEnt, SVM, etc. http://www.cs.waikato.ac.nz/ml/weka/ Mallet Java; useful for statistical NLP, document classification, clustering, topic modeling, information extraction… http://mallet.cs.umass.edu/