CS224N Section 3: Project, Corpora


CS224N Section 3: Project, Corpora
Shrey Gupta
January 28, 2011
(Thanks to Bill MacCartney, Helen Kwong and Pi-Chuan Chang for these materials)

Agenda
- Go through administrative details regarding the final project
- Presentations by research groups
- Resources for the final project

Final Project
- Proposal due in 2 weeks: Wed. 2/9
- Other details
- Please read the final project guide
- Projects from previous years: http://nlp.stanford.edu/courses/cs224n/
- Proposal: intended as a sanity check and to make sure that the topic is relevant to the course
- 34% of your grade
- Team size: 1-3 member(s)
- Reports and code due on 3/9 (late days allowed)
- Project presentations on 3/17

Project Ideas
- Topics from the syllabus
- Ideas listed in the project guide
- Papers from NLP conferences: http://www.cs.rochester.edu/~tetreaul/conferences.html
- Collaboration with research groups at Stanford
- Something you are really interested in!

Presentations by research groups

Topics
- Relation Extraction in the Knowledge Base Population (KBP) context
- BioNLP Event Extraction
- Predicting U.S. Elections with Twitter
- Litigation Analysis: Outcome Prediction, Field Classification, Attorney Recommendation, Entity Resolution
- Document classification to identify outbreak-related web content

Resources

Corpora
- Corpora@Stanford: http://www.stanford.edu/dept/linguistics/corpora/
  - Some are on AFS (/afs/ir/data/linguistic-data/); some are available on DVDs/CDs in the linguistics department
- LDC (Linguistic Data Consortium): http://www.ldc.upenn.edu/Catalog/
- Links to many resources: http://nlp.stanford.edu/links/statnlp.html

Treebanks
- Most widely used: Penn Treebank (PTB)
  - There are PTB2 and PTB3. Use PTB3, i.e. Treebank-3.
  - Contains:
    - 50,000 sentences (1,000,000 words) of WSJ text from 1989
    - 30,000 sentences (400,000 words) of the Brown corpus
  - Parsed WSJ trees: /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/
- BLLIP: like PTB, WSJ text, but 30m words, parsed automatically by Charniak
- Switchboard: telephone conversations
- PTB WSJ contains sections 0 through 24, ~2400 sentences each, but section 24 is half size. ~50,000 sentences total.
- Convention in the parsing world:
  - sections 2-21: training (39,832 sentences)
  - section 0 or 22 or 24: development testing
  - section 23: final test data
  - Sections 0 and 1 are perceived to be less reliable (annotators warming up).
- PTB3 adds some new material vs. PTB2, but NO BUG FIXES.
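
The section convention above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, section 22 is picked as the development set although the slide allows 0, 22 or 24, and the directory layout (two-digit section folders holding .mrg files under the wsj/ path) is an assumption about how the Treebank-3 files are organized.

```python
from pathlib import Path

# Conventional parsing splits over WSJ sections, as listed on the slide.
TRAIN_SECTIONS = range(2, 22)   # sections 2-21
DEV_SECTIONS = (22,)            # assumption: the slide also allows 0 or 24
TEST_SECTIONS = (23,)           # final test data


def split_for_section(section: int) -> str:
    """Map a WSJ section number to 'train', 'dev', 'test' or 'unused'."""
    if section in TRAIN_SECTIONS:
        return "train"
    if section in DEV_SECTIONS:
        return "dev"
    if section in TEST_SECTIONS:
        return "test"
    return "unused"


def files_for_split(wsj_root: str, split: str) -> list:
    """Collect .mrg files for one split.

    Assumes wsj_root contains one directory per section, named with the
    section number (e.g. 02, 23); other directory names are skipped.
    """
    root = Path(wsj_root)
    return sorted(
        path
        for section_dir in root.iterdir()
        if section_dir.is_dir()
        and section_dir.name.isdigit()
        and split_for_section(int(section_dir.name)) == split
        for path in section_dir.glob("*.mrg")
    )
```

For example, files_for_split("/afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/", "train") would gather the training trees if the layout matches the assumption. Keeping section 23 out of every development loop is the point of the convention.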

Parsed corpora in other languages
- Penn Arabic Treebank Corpus: 734 stories (140,000 words)
- Penn Chinese Treebank Corpus: 50,000 sentences
- German (newspaper text):
  - NEGRA: http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/
  - TIGER: http://www.ims.uni-stuttgart.de/projekte/TIGER/
  - Tueba-D/Z: http://www.sfs.uni-tuebingen.de/en/tuebadz.shtml

Part-of-speech tagged corpora
- POS tags from treebanks
- British National Corpus (BNC)
  - 100m words
  - Wide sample of British English: newspapers, books, letters
  - http://www.natcorp.ox.ac.uk/

Named Entity Recognition (NER)
- Message Understanding Conference (MUC)
  - We have MUC-6 and MUC-7
  - Example: /afs/ir/data/linguistic-data/MUC_7/muc_7/data/training.ne.eng.keys.980205
- CoNLL shared tasks: Language-Independent Named Entity Recognition (I), (II)
  - 2002: http://www.cnts.ua.ac.be/conll2002/ner/
  - 2003: http://www.cnts.ua.ac.be/conll2003/ner/
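
For the CoNLL data, a reader along the following lines may be a useful starting point. It is a sketch under assumptions that the slide does not spell out: one token per line with whitespace-separated columns, the entity tag in the last column, blank lines between sentences, and -DOCSTART- lines marking document boundaries. The function names and the sample sentence are illustrative only.

```python
def read_conll(lines):
    """Yield sentences as lists of (token, ner_tag) pairs."""
    sentence = []
    for line in lines:
        line = line.strip()
        # Blank lines and -DOCSTART- markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        sentence.append((fields[0], fields[-1]))  # first and last column
    if sentence:
        yield sentence


def entity_spans(sentence):
    """Group adjacent tokens into (text, entity_type) spans.

    A token continues the current span only if its tag is I-<type> with
    the same type; a B- prefix, a type change, or O closes the span.
    """
    spans, tokens, etype = [], [], None
    for token, tag in sentence:
        prefix, _, tag_type = tag.partition("-")
        inside = tag != "O" and prefix == "I" and tag_type == etype
        if not inside:
            if tokens:
                spans.append((" ".join(tokens), etype))
            tokens, etype = [], None
            if tag != "O":
                etype = tag_type
        if tag != "O":
            tokens.append(token)
    if tokens:
        spans.append((" ".join(tokens), etype))
    return spans


# Illustrative sample in the assumed four-column layout.
SAMPLE = """\
-DOCSTART- -X- O O

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
New NNP I-NP I-LOC
York NNP I-NP I-LOC
. . O O
""".splitlines()

sentences = list(read_conll(SAMPLE))
spans = entity_spans(sentences[0])
```

Working on spans rather than single tags matters for evaluation, since NER is usually scored on whole entities.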

Anaphora resolution
- Data: MUC-6 and MUC-7
- Example: Pam went home because she felt sick.
- Demo: http://lingpipe-demos.com:8080/lingpipe-demos/coref_en_news_muc6/textInput.html
- Unsolved problem
- Harder example:
  - We gave the bananas to the monkeys because they were hungry.
  - We gave the bananas to the monkeys because they were ripe.

Semantics
- WordNet
  - Website: http://wordnet.princeton.edu/
  - Browse online: http://wordnetweb.princeton.edu/perl/webwn
  - 150,000 nouns, verbs, adjectives, adverbs
  - Groups words into “synsets” with short, general definitions, and records various relations between synsets, e.g. the hypernym (kind-of) hierarchy.
  - Neat visual interface: http://www.visualthesaurus.com/?vt
  - Problems with WordNet:
    - fine-grained senses
    - sense ordering sometimes funny (see "airline")
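
To make the hypernym (kind-of) hierarchy concrete, here is a toy sketch. The entries below are made-up illustrative data, not taken from WordNet, and each node has a single hypernym, whereas a real WordNet synset can have several. The function names are hypothetical.

```python
# Toy kind-of links: each entry maps a node to its (single) hypernym.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def hypernym_chain(node):
    """Return the path from a node up to the root of the toy hierarchy."""
    chain = [node]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def lowest_common_hypernym(a, b):
    """Return the most specific node that both a and b are a kind of."""
    ancestors_of_a = hypernym_chain(a)
    for node in hypernym_chain(b):
        if node in ancestors_of_a:
            return node
    return None
```

Here hypernym_chain("dog") walks dog, canine, carnivore, mammal, animal, and lowest_common_hypernym("dog", "cat") gives "carnivore". Path-based similarity measures over WordNet build on this kind of traversal.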

Semantic Role Labeling
- Detection of the semantic arguments associated with each verb in a sentence
  - Example: “I [agent] sold you [patient] a book [theme]”
- CoNLL shared task 2004, 2005: http://www.lsi.upc.es/~srlconll/
- PropBank: adds predicate-argument relations to PTB syntax trees
- FrameNet: http://framenet.icsi.berkeley.edu/
- Demo from UIUC: http://l2r.cs.uiuc.edu/~cogcomp/srl-demo.php
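
One possible way to hold the output of a semantic role labeler in code is sketched below. The class and field names are hypothetical and do not follow the PropBank or FrameNet file formats; the role labels are the ones from the slide's example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Argument:
    role: str   # e.g. "agent", "patient", "theme"
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)


@dataclass
class Proposition:
    tokens: list
    predicate: int                      # index of the verb token
    arguments: list = field(default_factory=list)

    def text_of(self, role):
        """Return the token span labeled with the given role, or None."""
        for arg in self.arguments:
            if arg.role == role:
                return " ".join(self.tokens[arg.start:arg.end])
        return None


def render(prop):
    """Write the proposition in the slide's bracket notation."""
    role_at_end = {arg.end - 1: arg.role for arg in prop.arguments}
    out = []
    for i, token in enumerate(prop.tokens):
        out.append(token)
        if i in role_at_end:
            out.append(f"[{role_at_end[i]}]")
    return " ".join(out)


example = Proposition(
    tokens=["I", "sold", "you", "a", "book"],
    predicate=1,
    arguments=[
        Argument("agent", 0, 1),
        Argument("patient", 2, 3),
        Argument("theme", 3, 5),
    ],
)
```

Storing token offsets rather than strings keeps the arguments aligned with a parse tree, which is how PropBank-style annotation attaches to PTB trees.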

More corpora for specific tasks
- Word Sense Disambiguation (WSD)
  - Senseval: http://www.senseval.org/
- Question Answering, e.g. "What film introduced Jar Jar Binks?"
  - TREC competition, Question Answering track: http://trec.nist.gov/data/qamain.html
- Textual Entailment
  - Recognizing Textual Entailment (RTE) challenges: http://pascallin.ecs.soton.ac.uk/Challenges/RTE/
- Events, temporal relations
  - TimeBank corpus: http://timeml.org/site/timebank/browser_1.2/

More corpora for specific tasks
- Topic Detection and Tracking
  - Given documents, separate them into different topics
  - http://projects.ldc.upenn.edu/TDT/

Speech & Dialogue
- BNC: 10m words
- Switchboard corpus
  - Conversations between two speakers recorded over the phone
  - Transcriptions of their speech, with speakers labeled
  - Example: http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.readme.html#txt

Email/Spam
- Enron corpus
  - /afs/ir/data/linguistic-data/Enron-Email-Corpus/maildir/skilling-j/
  - Annotated subsets (for NER): http://www.cs.cmu.edu/~einat/datasets.html
- TREC Spam track: http://trec.nist.gov/data/spam.html

Tools
- Many links to tools on the StatNLP page: http://nlp.stanford.edu/links/statnlp.html
- Parsers
  - Stanford Parser (English, Chinese, German and Arabic): http://nlp.stanford.edu/software/lex-parser.shtml
  - Online parser: http://josie.stanford.edu:8080/parser/
  - Collins's parser, Charniak's parser, MiniPar, etc.: http://nlp.stanford.edu/fsnlp/probparse/
- POS taggers
- Named entity recognizers
- Language modeling toolkits

Machine learning tools
- Stanford classifier: http://nlp.stanford.edu/software/classifier.shtml
  - Conditional loglinear (aka maximum entropy) model
- Weka: http://www.cs.waikato.ac.nz/ml/weka/
  - Java library containing (nearly) every machine learning algorithm: Naive Bayes, perceptron, decision tree, MaxEnt, SVM, etc.
- Mallet: http://mallet.cs.umass.edu/
  - Java; useful for statistical NLP, document classification, clustering, topic modeling, information extraction…
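
As a reminder of what a conditional loglinear (maximum entropy) model computes, the sketch below scores each label with a weighted sum of features and normalizes with a softmax. It is not the API of the Stanford classifier, Weka or Mallet; the function name, feature names and weights are made up for illustration.

```python
import math


def predict_proba(weights, features, labels):
    """Return P(label | features) under a conditional loglinear model.

    weights:  dict mapping (label, feature_name) -> weight
    features: dict mapping feature_name -> value for one example
    labels:   iterable of candidate labels
    """
    scores = {
        label: sum(weights.get((label, name), 0.0) * value
                   for name, value in features.items())
        for label in labels
    }
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical hand-set weights for a two-class email example.
WEIGHTS = {
    ("spam", "contains_free"): 2.0,
    ("ham", "contains_meeting"): 1.5,
}
probs = predict_proba(WEIGHTS, {"contains_free": 1.0}, ["spam", "ham"])
```

In practice the weights are learned by maximizing the conditional likelihood of labeled training data, which is what the toolkits listed above do.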

Thank you! Any questions?