An Introduction to Edison. Vivek Srikumar, 17th April 2012.


Curator gives us easy access to several layers of annotation over text. What can we do with these?

Outline
What is Edison?
Installing Edison
Using Edison
  – Creating Edison objects
  – Accessing the Curator
  – Adding and using views

What is Edison?
1. A uniform representation of diverse NLP annotations
2. A library of NLP data structures
3. A Java client to the Curator

NLP Annotations
Sentence: John Smith bought the car.
Part-of-speech: NNP John, NNP Smith, VBD bought, DT the, NN car
Named entities: PER John Smith
Shallow parse: NP John Smith, VP bought, NP the car
Semantic roles: Predicate buy, A0 John Smith, A1 the car
Parse tree: (S (NP (NNP John) (NNP Smith)) (VP (VBD bought) (NP (DT the) (NN car))))
And many others…

A uniform representation
Main ideas
  – All the annotations over text are graphs
  – Nodes: Labeled spans of text
      Spans are indexed by tokens in the text
  – Edges: Relations between the nodes
Edison terminology
  – TextAnnotation: A container of tokens and views
  – View: A graph that denotes a specific annotation
  – Constituent: A labeled span of text (nodes)
  – Relation: A labeled directed edge between Constituents
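As a rough illustration of the graph idea above, here is a minimal sketch in plain Java. The classes ToyGraphDemo, Constituent, Relation and View below are hypothetical stand-ins written for this example; they are not Edison's own classes and only mirror the terminology. The data is the semantic-role annotation of the example sentence "John Smith bought the car."

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model (not Edison's API): a view is a graph whose nodes
// are labeled token spans and whose edges are labeled, directed relations.
class ToyGraphDemo {
    static class Constituent {
        final String label;
        final int start; // index of the first token
        final int end;   // index of the last token, plus one

        Constituent(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    static class Relation {
        final String name;
        final Constituent source;
        final Constituent target;

        Relation(String name, Constituent source, Constituent target) {
            this.name = name;
            this.source = source;
            this.target = target;
        }
    }

    static class View {
        final String name;
        final List<Constituent> constituents = new ArrayList<>();
        final List<Relation> relations = new ArrayList<>();

        View(String name) {
            this.name = name;
        }
    }

    // Semantic roles over tokens {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
    static View buildSemanticRoleView() {
        View view = new View("TOY_SRL");
        Constituent predicate = new Constituent("buy", 2, 3);
        Constituent buyer = new Constituent("John Smith", 0, 2);
        Constituent thing = new Constituent("the car", 3, 5);
        view.constituents.add(predicate);
        view.constituents.add(buyer);
        view.constituents.add(thing);
        // Each argument hangs off the predicate by a labeled edge.
        view.relations.add(new Relation("A0", predicate, buyer));
        view.relations.add(new Relation("A1", predicate, thing));
        return view;
    }

    public static void main(String[] args) {
        for (Relation r : buildSemanticRoleView().relations) {
            System.out.println(r.source.label + " --" + r.name + "--> "
                    + r.target.label + " [" + r.target.start + ", " + r.target.end + ")");
        }
    }
}
```

The same three classes are enough to hold any of the annotation layers listed earlier; only the labels and the presence or absence of edges change.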

A uniform representation
TextAnnotation
  Raw text: John Smith bought the car.
  Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
  Views
    Name: SENTENCE        Constituents: {…}  Relations: {…}
    Name: POS             Constituents: {…}  Relations: {…}
    Name: PARSE_CHARNIAK  Constituents: {…}  Relations: {…}
    and other views…

Getting started with Edison
Download the jar from
  – Click the download link and follow instructions
  – Add the Edison jar and its dependencies to your classpath
Dependencies
  – Cogcomp core utilities
  – Apache commons libraries
  – Thrift (to communicate with the Curator)
  – Porter stemmer
  – LBJ Library
  – Java WordNet interface
Javadoc available under “User Guide”

Edison using Maven
Add the following repository definition to your pom.xml file
  – Repository: CogcompSoftware
Add Edison as a dependency
  – groupId: edu.illinois.cs.cogcomp
  – artifactId: edison
  – type: jar
  – scope: compile

So far…
1. What is Edison?
2. Installing Edison
3. Creating a TextAnnotation
4. Adding views from the Curator
5. Using views
6. …??
7. Profit!


Three ways to create TextAnnotations
1. When you don’t know the tokenization
  – Use this for raw text, if you don’t want to use the Curator
2. When you know the tokenization
  – Use this for pre-tokenized text
3. Using the Curator
  – Use this for raw text
  – If your text is pre-tokenized, you can still use the Curator for adding views

Creating TextAnnotations (1)
When to use this approach
  – If you don’t know the tokenization (i.e. the words)
  – If you want to use the LBJ tokenizer and sentence splitter
Note: Every TextAnnotation has a textId and a corpusId. These could be used in the future for book-keeping.

Creating TextAnnotations (1)

    String corpus = "2001_ODYSSEY";
    String textId = "001";
    String text1 = "Good afternoon, gentlemen. I am a HAL-9000 computer.";

    TextAnnotation ta1 = new TextAnnotation(corpus, textId, text1);

    System.out.println(ta1.getText());
    System.out.println(ta1.getTokenizedText());

    // Print the sentences. The `Sentence` class has the same
    // methods as a `TextAnnotation`.
    List<Sentence> sentences = ta1.sentences();
    System.out.println(sentences.size() + " sentences found.");
    for (int i = 0; i < sentences.size(); i++) {
        Sentence sentence = sentences.get(i);
        System.out.println(sentence);
    }

Creating TextAnnotations (2)
When to use this approach
  – When you know the tokenization
      That is, when some external source specifies the tokens of the text
After creating the TextAnnotation, you can use it as before

Creating TextAnnotations (2)

    String corpus = "2001_ODYSSEY";
    String textId = "002";
    List<String> tokenizedSentences = Arrays.asList(
        "Good afternoon, gentlemen.",
        "I am a HAL-9000 computer.");

    TextAnnotation ta2 = new TextAnnotation(corpus, textId, tokenizedSentences);

    System.out.println(ta2.getText());
    System.out.println(ta2.getTokenizedText());

    // Print the sentences. The `Sentence` class has the same
    // methods as a `TextAnnotation`.
    List<Sentence> sentences = ta2.sentences();
    System.out.println(sentences.size() + " sentences found.");
    for (int i = 0; i < sentences.size(); i++) {
        Sentence sentence = sentences.get(i);
        System.out.println(sentence);
    }
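To make the idea of pre-tokenized input concrete, the following sketch shows one way a list of sentences could be turned into a single token sequence with sentence boundaries. It assumes that each sentence string is already split into tokens by whitespace; that is an assumption of this example, not a statement about what the TextAnnotation constructor does internally. The class PreTokenizedDemo and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Edison): flattens whitespace-tokenized
// sentences into one token list and records where each sentence ends.
class PreTokenizedDemo {
    // Returns all tokens in order. For every sentence, appends to
    // sentenceEnds the index one past that sentence's last token.
    static List<String> tokens(List<String> sentences, List<Integer> sentenceEnds) {
        List<String> all = new ArrayList<>();
        for (String sentence : sentences) {
            all.addAll(Arrays.asList(sentence.trim().split("\\s+")));
            sentenceEnds.add(all.size());
        }
        return all;
    }

    public static void main(String[] args) {
        List<Integer> ends = new ArrayList<>();
        List<String> toks = tokens(
                Arrays.asList("Good afternoon, gentlemen.", "I am a HAL-9000 computer."),
                ends);
        for (int i = 0; i < toks.size(); i++) {
            System.out.println(i + ":" + toks.get(i));
        }
        System.out.println("Sentence ends: " + ends);
    }
}
```

Under this whitespace assumption, punctuation stays attached to the preceding word ("gentlemen."), which is exactly the kind of decision an external tokenizer would make for you.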

Connecting to the Curator (1)
If you don’t know anything about your text, the Curator can tokenize it for you.

    String text = "Good afternoon, gentlemen. I am a HAL-9000 "
        + "computer. I was born in Urbana, Il. in 1992";

    String corpus = "2001_ODYSSEY";
    String textId = "001";

    // Create a curator client. We need to specify a host and a port
    // where the curator server is running.
    String curatorHost = "my-curator-server.cs.uiuc.edu";
    int curatorPort = 9090;
    CuratorClient client = new CuratorClient(curatorHost, curatorPort);

    // Should the curator's cache be forcibly updated?
    boolean forceUpdate = false;

    // Create a TextAnnotation: get the text annotation object from the
    // curator, which splits the text into sentences and tokenizes it.
    TextAnnotation ta = client.getTextAnnotation(corpus, textId, text, forceUpdate);

Connecting to the Curator (2)
If you know the tokenization and want all the Curator’s annotators to respect this tokenization

    // Create your TextAnnotation as before.
    String corpus = "2001_ODYSSEY";
    String textId = "002";
    List<String> tokenizedSentences = Arrays.asList(
        "Good afternoon, gentlemen.",
        "I am a HAL-9000 computer.");
    TextAnnotation ta2 = new TextAnnotation(corpus, textId, tokenizedSentences);

    // We need to specify a host and a port where the curator server is
    // running.
    String curatorHost = "my-curator-server.cs.uiuc.edu";
    int curatorPort = 9090;

    // The third argument (true) tells the Curator to respect the tokenization.
    CuratorClient client = new CuratorClient(curatorHost, curatorPort, true);

Note: A CuratorClient in this mode cannot create TextAnnotations. Doing so will trigger an exception!


Views
Views are graphs: Constituents are nodes and Relations are edges
Every TextAnnotation can be seen as a container for views, indexed by their name
View is a Java class that represents any graph over constituents
  – Specializations of the View class deal with specific types
      TokenLabelView, SpanLabelView, TreeView, PredicateArgumentView, CoreferenceView
  – You can create your own views or specializations too!

Example: Part-of-speech
Sentence: John Smith bought the car.
Part-of-speech: NNP John, NNP Smith, VBD bought, DT the, NN car
Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
Constituents: 0-1 NNP, 1-2 NNP, 2-3 VBD, 3-4 DT, 4-5 NN, 5-6 .
No Relations!
Each constituent is associated with a span. The convention is to denote a span by its first token and the (last + 1)th token.
This specialization of the View class is called a TokenLabelView, where each constituent assigns a label to a token and there are no relations. Use it for part-of-speech, stem/lemma, etc.
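The span convention above (first token, last token + 1) can be sketched in a few lines of plain Java. TokenLabelDemo and its methods are hypothetical names used only for this illustration; they show the idea of one label per token and are not how Edison's TokenLabelView is implemented.

```java
// Hypothetical illustration (not Edison's API): one label per token, with
// each token's span written as (first token, last token + 1).
class TokenLabelDemo {
    static final String[] TOKENS = {"John", "Smith", "bought", "the", "car", "."};
    static final String[] POS = {"NNP", "NNP", "VBD", "DT", "NN", "."};

    // A single token i covers the half-open span [i, i + 1).
    static int[] spanOfToken(int tokenId) {
        return new int[] {tokenId, tokenId + 1};
    }

    // Formats a token's constituent as "start-end LABEL token".
    static String describe(int tokenId) {
        int[] span = spanOfToken(tokenId);
        return span[0] + "-" + span[1] + " " + POS[tokenId] + " " + TOKENS[tokenId];
    }

    public static void main(String[] args) {
        for (int i = 0; i < TOKENS.length; i++) {
            System.out.println(describe(i));
        }
    }
}
```

Because the end index is exclusive, the length of a span is simply end minus start, and adjacent spans such as 0-1 and 1-2 share no tokens.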

Adding part-of-speech from the Curator

    // Suppose we have a CuratorClient called 'client' and a TextAnnotation
    // called 'ta'.

    // Should the Curator forcibly update the part-of-speech annotation?
    boolean forceUpdate = false;

    // Curator call: add the part-of-speech view from the Curator
    client.addPOSView(ta, forceUpdate);

    // Get the part-of-speech view from the TextAnnotation. This view will
    // be filed under the name 'ViewNames.POS'. Also, we know that
    // this view will be a TokenLabelView.
    TokenLabelView posView = (TokenLabelView) ta.getView(ViewNames.POS);

    // Iterate through the text and get the POS label for each token.
    // getLabel(tokenId) is available for TokenLabelViews.
    for (int tokenId = 0; tokenId < ta.size(); tokenId++) {
        String token = ta.getToken(tokenId);
        String posLabel = posView.getLabel(tokenId);
        System.out.println(token + "\t" + posLabel);
    }

Example: Shallow parse
Sentence: John Smith bought the car.
Shallow parse: NP John Smith, VP bought, NP the car
Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
Constituents: 0-2 NP, 2-3 VP, 3-5 NP
No Relations!
Each constituent is associated with a span. The convention is to denote a span by its first token and the (last + 1)th token.
This specialization of the View class is called a SpanLabelView, where each constituent assigns a label to a span of text and there are no relations. Use it for named entities, shallow parse, Wikifier, etc.
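A span-labeled annotation like the shallow parse can be queried by containment. The sketch below is a hypothetical plain-Java illustration (SpanLabelDemo, Chunk and labelsWithin are names invented here, not Edison's API). It uses the (first, last + 1) convention, so "the car", which covers tokens 3 and 4, is the span 3-5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration (not Edison's API): labeled spans with no
// relations, queried by span containment.
class SpanLabelDemo {
    static class Chunk {
        final String label;
        final int start; // first token
        final int end;   // last token + 1

        Chunk(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    // Shallow parse of {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
    static List<Chunk> exampleChunks() {
        return Arrays.asList(
                new Chunk("NP", 0, 2),
                new Chunk("VP", 2, 3),
                new Chunk("NP", 3, 5));
    }

    // Labels of all chunks whose span lies entirely inside [start, end).
    static List<String> labelsWithin(List<Chunk> chunks, int start, int end) {
        List<String> labels = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.start >= start && c.end <= end) {
                labels.add(c.label);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(labelsWithin(exampleChunks(), 0, 2));
        System.out.println(labelsWithin(exampleChunks(), 2, 5));
    }
}
```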

Adding shallow parse from the Curator

    // Suppose we have a CuratorClient called 'client' and a TextAnnotation
    // called 'ta'.

    // Should the Curator forcibly update the shallow parse annotation?
    boolean forceUpdate = false;

    // Curator call: add the shallow parse/chunk view from the Curator
    client.addChunkView(ta, forceUpdate);

    // Get the shallow parse view from the TextAnnotation. This view will
    // be filed under the name 'ViewNames.SHALLOW_PARSE'. Also, we know that
    // this view will be a SpanLabelView.
    SpanLabelView chunkView = (SpanLabelView) ta.getView(ViewNames.SHALLOW_PARSE);

    // Get all constituents whose span is contained in the span (0, 2).
    // getSpanLabels is available for SpanLabelView.
    List<Constituent> constituents = chunkView.getSpanLabels(0, 2);

    // Iterate over them and print their labels
    for (Constituent c : constituents) {
        String label = c.getLabel();
        System.out.println(label);
    }

Other SpanLabel views in the Curator
Shallow parse
  – ViewNames.SHALLOW_PARSE
  – Use ‘client.addChunkView(ta, forceUpdate)’
Named entities
  – ViewNames.NER
  – Use ‘client.addNamedEntityView(ta, forceUpdate)’
Wikifier
  – ViewNames.WIKIFIER
  – Use ‘client.addWikifierView(ta, forceUpdate)’
Note: For these function calls to work, the corresponding annotator should exist in your instance of the Curator. Otherwise, an exception will be triggered.

Example: Parse view
Sentence: John Smith bought the car.
Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
Parse tree: (S (NP (NNP John) (NNP Smith)) (VP (VBD bought) (NP (DT the) (NN car))))
Constituents: 0-5 S, 0-2 NP, 2-5 VP, 0-1 NNP (rest of the tree not shown)
Relations: ParentOf
This specialization of the View class is called a TreeView, where the graph represents a tree. Use it for full parse and dependency trees.
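To show how a tree turns into constituents and ParentOf relations, here is a small sketch in plain Java. It is a hypothetical illustration: TreeSpanDemo, Node and the output format are invented for this example and are not Edison's TreeView. Spans follow the (first, last + 1) convention, and the final period is left out of the tree, as in the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration (not Edison's API): derive token spans for the
// nodes of a parse tree and list the ParentOf edges between them.
class TreeSpanDemo {
    static class Node {
        final String label;
        final List<Node> children;
        int start; // first token covered
        int end;   // last token covered + 1

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // (S (NP (NNP John) (NNP Smith)) (VP (VBD bought) (NP (DT the) (NN car))))
    static Node exampleTree() {
        return new Node("S",
                new Node("NP",
                        new Node("NNP", new Node("John")),
                        new Node("NNP", new Node("Smith"))),
                new Node("VP",
                        new Node("VBD", new Node("bought")),
                        new Node("NP",
                                new Node("DT", new Node("the")),
                                new Node("NN", new Node("car")))));
    }

    // A childless node is a word and takes the next token index; any other
    // node covers all of its children. Returns the node's end index.
    static int assignSpans(Node node, int nextToken) {
        node.start = nextToken;
        int cursor = nextToken;
        if (node.children.isEmpty()) {
            cursor = nextToken + 1;
        } else {
            for (Node child : node.children) {
                cursor = assignSpans(child, cursor);
            }
        }
        node.end = cursor;
        return cursor;
    }

    static String describe(Node n) {
        return n.start + "-" + n.end + " " + n.label;
    }

    // Collects "parent ParentOf child" edges, skipping the word leaves.
    static void collectParentOf(Node node, List<String> edges) {
        for (Node child : node.children) {
            if (child.children.isEmpty()) {
                continue;
            }
            edges.add(describe(node) + " ParentOf " + describe(child));
            collectParentOf(child, edges);
        }
    }

    public static void main(String[] args) {
        Node root = exampleTree();
        assignSpans(root, 0);
        List<String> edges = new ArrayList<>();
        collectParentOf(root, edges);
        for (String edge : edges) {
            System.out.println(edge);
        }
    }
}
```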

Adding Charniak parse from the Curator

    // Suppose we have a CuratorClient called 'client' and a TextAnnotation
    // called 'ta'.

    // Should the Curator forcibly update the parse annotation?
    boolean forceUpdate = false;

    // Curator call: add the Charniak parse view from the Curator
    client.addCharniakParse(ta, forceUpdate);

    // Get the Charniak parse view from the TextAnnotation. This view will
    // be filed under the name 'ViewNames.PARSE_CHARNIAK'. Also, we know
    // that this view will be a TreeView.
    TreeView parseView = (TreeView) ta.getView(ViewNames.PARSE_CHARNIAK);

    // Do interesting things with the view:

    // get all parse nodes
    List<Constituent> treeNodes = parseView.getConstituents();

    // get the tree structure for the first sentence (i.e. sentence #0)
    Tree<String> parseTree = parseView.getTree(0);

    // Get path between parse tree nodes (common feature)
    String parsePath = PathFeatureHelper.getFullParsePathString(
        treeNodes.get(0), treeNodes.get(1), 400);
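The path between two parse tree nodes is a common feature in NLP systems. As a rough sketch of what such a path is, the plain-Java example below walks up from one node to the lowest common ancestor and then down to the other node. ParsePathDemo, its Node class and the output format ("^" for up, ">" for down) are assumptions made for this illustration; they are not the implementation or the output format of PathFeatureHelper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not Edison's PathFeatureHelper): the label
// path between two tree nodes through their lowest common ancestor.
class ParsePathDemo {
    static class Node {
        final String label;
        final Node parent; // null for the root

        Node(String label, Node parent) {
            this.label = label;
            this.parent = parent;
        }
    }

    // Returns the chain node, parent, grandparent, ..., root.
    static List<Node> ancestors(Node node) {
        List<Node> chain = new ArrayList<>();
        for (Node cur = node; cur != null; cur = cur.parent) {
            chain.add(cur);
        }
        return chain;
    }

    static String path(Node from, Node to) {
        List<Node> up = ancestors(from);
        List<Node> down = ancestors(to);
        // Drop the shared tail of both chains; the last node dropped is
        // the lowest common ancestor.
        Node common = null;
        while (!up.isEmpty() && !down.isEmpty()
                && up.get(up.size() - 1) == down.get(down.size() - 1)) {
            common = up.remove(up.size() - 1);
            down.remove(down.size() - 1);
        }
        if (common == null) {
            throw new IllegalArgumentException("nodes are not in the same tree");
        }
        StringBuilder sb = new StringBuilder();
        for (Node n : up) {
            sb.append(n.label).append('^');
        }
        sb.append(common.label);
        for (int i = down.size() - 1; i >= 0; i--) {
            sb.append('>').append(down.get(i).label);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node s = new Node("S", null);
        Node np = new Node("NP", s);
        Node nnp = new Node("NNP", np);
        Node vp = new Node("VP", s);
        Node vbd = new Node("VBD", vp);
        System.out.println(path(nnp, vbd));
    }
}
```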

Tree views from the Curator
Charniak parser
  – ViewNames.PARSE_CHARNIAK
  – client.addCharniakParse(ta, forceUpdate)
Easy-first dependency parser
  – ViewNames.DEPENDENCY
  – client.addEasyFirstDependencyView(ta, forceUpdate)
Stanford parser
  – ViewNames.PARSE_STANFORD
  – client.addStanfordParse(ta, forceUpdate)
Stanford dependency parser
  – ViewNames.DEPENDENCY_STANFORD
  – client.addStanfordDependencyView(ta, forceUpdate)

Other Curator calls
Verb semantic roles
  – View name: ViewNames.SRL
  – client.addSRLView(ta, forceUpdate)
      Adds a view of type PredicateArgumentView, which is a subclass of the View class
Nominal semantic roles
  – View name: ViewNames.NOM
  – client.addNOMView(ta, forceUpdate)
      Adds a view of type PredicateArgumentView
Coreference
  – View name: ViewNames.COREF
  – client.addCorefView(ta, forceUpdate)
      Adds a view of type CoreferenceView, which is a subclass of the View class


Using views
All views provide access to
  – Constituents: getConstituents, getConstituentsCoveringToken, getConstituentsCoveringSpan
  – Relations: getRelations
This allows us to manipulate several different views together
  – E.g. get the parse tree nodes that contain each named entity constituent whose label is “PER”:

    for (Constituent c : namedEntityView.getConstituents()) {
        if (c.getLabel().equals("PER")) {
            List<Constituent> parseConstituents = parseView.getConstituentsCovering(c);
            // do something with these
        }
    }
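The cross-view query above boils down to a span comparison: a constituent covers another when it starts no later and ends no earlier. The following sketch shows that test in plain Java. CoveringDemo, Span and covering are hypothetical names for this example and do not reproduce Edison's implementation of getConstituentsCovering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration (not Edison's API): find the spans in one
// annotation layer that cover a span taken from another layer.
class CoveringDemo {
    static class Span {
        final String label;
        final int start; // first token
        final int end;   // last token + 1

        Span(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    // Some parse nodes for {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}
    static List<Span> exampleParseNodes() {
        return Arrays.asList(
                new Span("S", 0, 5),
                new Span("NP", 0, 2),
                new Span("NNP", 0, 1),
                new Span("NNP", 1, 2),
                new Span("VP", 2, 5));
    }

    // Labels of all candidates whose span fully covers the target span.
    static List<String> covering(List<Span> candidates, Span target) {
        List<String> labels = new ArrayList<>();
        for (Span c : candidates) {
            if (c.start <= target.start && c.end >= target.end) {
                labels.add(c.label);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        Span person = new Span("PER", 0, 2); // named entity "John Smith"
        System.out.println(covering(exampleParseNodes(), person));
    }
}
```

Here the named entity PER over tokens 0-2 is covered by the S and NP nodes but not by either single-token NNP node.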

Using constituents and relations
Each constituent belongs to a view
Constituents provide the following methods:
  – getLabel(): gets the label of the constituent
  – getSpan(): gets the span of the constituent
  – getIncomingRelations(): gets the list of Relations that are incident to this constituent in this view
  – getOutgoingRelations(): gets the list of Relations whose source is this constituent in this view
Relations provide the following accessors:
  – getRelationName(), getSource(), getTarget()

Other useful functionality
Supports
  – Top-K views
  – Custom views, for your application
Provides helper functions for common tasks
  – Look at the functions in the classes in the package edu.illinois.cs.cogcomp.edison.features.helpers
Provides an interface to WordNet
  – WordNetManager
Collins' head-finding rules
Several feature extraction utilities
  – Look at the classes in edu.illinois.cs.cogcomp.edison.features


Links
  – Edison download
  – Example code
  – API documentation