A Kernel-based Approach to Learning Semantic Parsers


A Kernel-based Approach to Learning Semantic Parsers
Rohit J. Kate
Doctoral Dissertation Proposal
Supervisor: Raymond J. Mooney
November 21, 2005

Outline Semantic Parsing Related Work Background on Kernel-based Methods Completed Research Proposed Research Conclusions

Semantic Parsing
Semantic parsing: transforming natural language (NL) sentences into complete, computer-executable meaning representations (MRs)
Importance of semantic parsing:
- Natural language communication with computers
- Insights into human language acquisition
Example application domains:
- CLang: RoboCup Coach Language
- Geoquery: a database query application

CLang: RoboCup Coach Language
In the RoboCup Coach competition, teams compete to coach simulated soccer players
The coaching instructions are given in a formal language called CLang
Example (coach utterance mapped to CLang by semantic parsing):
  "If our player 4 has the ball, our player 4 should shoot."
  ((bowner our {4}) (do our {4} shoot))

Geoquery: A Database Query Application
Query application for a U.S. geography database containing about 800 facts [Zelle & Mooney, 1996]
Example (user question mapped to a query by semantic parsing):
  "Which rivers run through the states bordering Texas?"
  answer(traverse_2(next_to(stateid('texas'))))

Learning Semantic Parsers
Assume meaning representation languages (MRLs) have deterministic context-free grammars:
- True for almost all computer languages
- MRs can be parsed unambiguously

NL: Which rivers run through the states bordering Texas?
MR: answer(traverse_2(next_to(stateid('texas'))))
Parse tree of the MR:
  ANSWER → answer(RIVER)
    RIVER → TRAVERSE_2(STATE)
      TRAVERSE_2 → traverse_2
      STATE → NEXT_TO(STATE)
        NEXT_TO → next_to
        STATE → STATEID
          STATEID → 'texas'
Non-terminals: ANSWER, RIVER, TRAVERSE_2, STATE, NEXT_TO, STATEID
Terminals: answer, traverse_2, next_to, stateid, 'texas'
Productions: ANSWER → answer(RIVER), RIVER → TRAVERSE_2(STATE), STATE → NEXT_TO(STATE), STATE → STATEID, TRAVERSE_2 → traverse_2, NEXT_TO → next_to, STATEID → 'texas'

Learning Semantic Parsers
Assume meaning representation languages (MRLs) have deterministic context-free grammars:
- True for almost all computer languages
- MRs can be parsed unambiguously
Training data consists of NL sentences paired with their MRs
Induce a semantic parser which can map novel NL sentences to their correct MRs
The learning problem differs from that of syntactic parsing, where the training data has trees annotated over the NL sentences

Outline Semantic Parsing Related Work Background on Kernel-based Methods Completed Research Proposed Research Conclusions

Related Work: CHILL [Zelle & Mooney, 1996]
- Uses Inductive Logic Programming (ILP) to induce a semantic parser
- Learns rules to control the actions of a deterministic shift-reduce parser
- Processes the sentence one word at a time, making a hard parsing decision each time
- Brittle, and ILP techniques do not scale to large corpora

Related Work: SILT [Kate, Wong & Mooney, 2005]
- Transformation rules associate NL patterns with MRL templates
- NL patterns matched in the sentence are replaced by the MRL templates
- By the end of parsing, the NL sentence gets transformed into its MR
- Two versions: string patterns and syntactic tree patterns
Example: NL pattern "our left [3] penalty area" with MRL template AREA → (left (penalty-area our))

Related Work: SILT contd.
Weaknesses of SILT:
- Hard-matching transformation rules are brittle: e.g., consider the NL pattern "our left [3] penalty area" against natural variants such as:
  "our left penalty area"
  "our left side of penalty area"
  "left of our penalty area"
  "our ah.. left penalty area"
- Parsing is done deterministically, which is less robust than probabilistic parsing

Related Work: WASP [Wong, 2005]
- Based on synchronous context-free grammars
- Uses the machine translation technique of statistical word alignment to find good transformation rules
- Builds a maximum entropy model for parsing
- The transformation rules are hard-matching

Related Work: SCISSOR [Ge & Mooney, 2005]
- Based on a fairly standard approach to compositional semantics [Jurafsky and Martin, 2000]
- A statistical parser is used to generate a semantically augmented parse tree (SAPT)
- Augments Collins' head-driven model 2 (Bikel's implementation, 2004) to incorporate semantic labels
- Translates the SAPT into a complete formal meaning representation
Example SAPT for "our player 2 has the ball": S-bowner over NP-player (PRP$-team our, NN-player player, CD-unum 2) and VP-bowner (VB-bowner has, NP-null (DT-null the, NN-null ball))

Related Work: Zettlemoyer & Collins [2005]
- Uses the Combinatory Categorial Grammar (CCG) formalism to learn a statistical semantic parser
- Generates a CCG lexicon relating NL words to semantic types through general hand-built template rules
- Uses a maximum entropy model for compacting this lexicon and doing probabilistic CCG parsing

Outline Semantic Parsing Related Work Background on Kernel-based Methods Completed Research Proposed Research Conclusions

Traditional Machine Learning with Structured Data
Examples → feature engineering (information loss) → feature vectors → machine learning algorithm

Kernel-based Machine Learning with Structured Data
Examples → kernel computations → kernelized machine learning algorithm
(implicit mapping to a potentially infinite number of features)

Kernel Functions
A kernel K is a similarity function over a domain X which maps any two objects x, y in X to their similarity score K(x,y)
If for any x1, x2, ..., xn in X the n-by-n Gram matrix (K(xi,xj))ij is symmetric and positive semidefinite, then the kernel function computes the dot product of implicit feature vectors in some high-dimensional feature space
Machine learning algorithms which use the data only to compute similarities can be kernelized (e.g., Support Vector Machines, Nearest Neighbor, etc.)
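
A minimal sketch (not from the proposal, assuming numpy and a toy word-overlap kernel) of what this condition means in practice: the Gram matrix over any sample must be symmetric and positive semidefinite.

```python
import numpy as np

def word_overlap_kernel(x, y):
    """Toy kernel: number of word types shared by two sentences.
    Valid because it is the dot product of binary bag-of-word-type vectors."""
    return len(set(x.split()) & set(y.split()))

sentences = ["our left penalty area",
             "left side of our penalty area",
             "our right midfield"]

# Build the n-by-n Gram matrix K[i][j] = K(x_i, x_j)
K = np.array([[word_overlap_kernel(s, t) for t in sentences] for s in sentences],
             dtype=float)

assert np.allclose(K, K.T)                     # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)  # positive semidefinite (up to rounding)
```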

String Subsequence Kernel
Define the kernel between two strings as the number of common subsequences between them [Lodhi et al., 2002]
All possible subsequences become the implicit features, and the kernel computes the dot product of these feature vectors
s = "left side of our penalty area"
t = "our left penalty area"
Counting the common subsequences u one at a time: left, our, penalty, area, left penalty, left area, our penalty, our area, penalty area, left penalty area, our penalty area
K(s,t) = 11
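
A brute-force sketch of this word-level subsequence kernel; it is exponential in sentence length, not the efficient dynamic program of Lodhi et al. [2002], but it reproduces K(s,t) = 11 for the slide's example.

```python
from itertools import combinations

def subsequences(words):
    """All non-empty word subsequences (order-preserving, gaps allowed)."""
    return {tuple(words[i] for i in idx)
            for r in range(1, len(words) + 1)
            for idx in combinations(range(len(words)), r)}

def subseq_kernel(s, t):
    """Number of distinct word subsequences shared by the two strings."""
    return len(subsequences(s.split()) & subsequences(t.split()))

s = "left side of our penalty area"
t = "our left penalty area"
print(subseq_kernel(s, t))  # 11
```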

Normalized String Subsequence Kernel
Normalize the kernel (to the range [0,1]) to remove any bias due to different string lengths
Lodhi et al. [2002] give an O(n|s||t|) dynamic-programming algorithm for computing the string subsequence kernel
Used for text categorization [Lodhi et al., 2002] and information extraction [Bunescu & Mooney, 2005b]
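
The normalization here is the usual cosine-style one; a short sketch, reusing subseq_kernel from the previous sketch:

```python
import math

def normalized_subseq_kernel(s, t):
    # K_norm(s, t) = K(s, t) / sqrt(K(s, s) * K(t, t)), always in [0, 1]
    return subseq_kernel(s, t) / math.sqrt(subseq_kernel(s, s) * subseq_kernel(t, t))

# For the example above: K(s,s) = 2**6 - 1 = 63, K(t,t) = 2**4 - 1 = 15,
# so K_norm = 11 / sqrt(63 * 15) ≈ 0.358
```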

Support Vector Machines Mapping data to high-dimensional feature spaces can lead to overfitting of training data (“curse of dimensionality”) Support Vector Machines (SVMs) are known to be resistant to this overfitting

SVMs: Maximum Margin
Given positive and negative examples, SVMs find a separating hyperplane such that the margin ρ between the closest examples is maximized
Maximizing the margin is good according to both intuition and PAC theory
[Figure: separating hyperplane with margin ρ]

SVMs: Probability Estimates Probability estimate of a point belonging to a class can be obtained using its distance from the hyperplane [Platt, 1999]
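
A sketch (assuming scikit-learn, with illustrative training phrases) combining the two ideas above: an SVM over a precomputed string-kernel Gram matrix, with Platt-style probability estimates enabled. It reuses subseq_kernel from the earlier sketch.

```python
import numpy as np
from sklearn.svm import SVC

pos = ["our left penalty area", "left side of our penalty area",
       "our left side of penalty area", "left of our penalty area",
       "our penalty area towards the left side"]
neg = ["our right midfield", "right side of our penalty area",
       "opponent's right penalty area", "our half", "the opponent's goal"]
train = pos + neg
y = [1] * len(pos) + [0] * len(neg)

# Precomputed Gram matrix over the training strings
gram = np.array([[subseq_kernel(a, b) for b in train] for a in train], dtype=float)

clf = SVC(kernel="precomputed", probability=True)  # probability=True enables Platt scaling
clf.fit(gram, y)

test = ["left of our penalty area"]
gram_test = np.array([[subseq_kernel(a, b) for b in train] for a in test], dtype=float)
print(clf.predict_proba(gram_test))  # columns follow clf.classes_, here [P(neg), P(pos)]
```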

Why a Kernel-based Approach to Learning Semantic Parsers?
- Natural language sentences are structured
- Natural languages are flexible; there are various ways to express the same semantic concept
CLang MR: (left (penalty-area our))
NL paraphrases:
- our left penalty area
- our left side of penalty area
- left side of our penalty area
- left of our penalty area
- our penalty area towards the left side
- our ah.. left penalty area

Why a Kernel-based Approach to Learning Semantic Parsers? contd.
[Cloud of related phrases: right side of our penalty area; left of our penalty area; opponent's right penalty area; our left side of penalty area; our ah.. left penalty area; our left penalty area; our right midfield; left side of our penalty area; our penalty area towards the left side]
Kernel methods can robustly capture the range of NL contexts.

Outline Semantic Parsing Related Work Background on Kernel-based Methods Completed Research Proposed Research Conclusions

KRISP: Kernel-based Robust Interpretation by Semantic Parsing
- Learns a semantic parser from NL sentences paired with their respective MRs, given the MRL grammar
- Productions of the MRL are treated like semantic concepts
- An SVM classifier is trained for each production, using the string subsequence kernel
- These classifiers are used to compositionally build the MRs of the sentences

Overview of KRISP
Training: MRL grammar + NL sentences with MRs → collect positive and negative examples → train string-kernel-based SVM classifiers → best semantic derivations (correct and incorrect) feed back into example collection → semantic parser
Testing: novel NL sentences → best MRs

Overview of KRISP's Semantic Parsing
- We first define the semantic derivation of an NL sentence
- We define the probability of a semantic derivation
- Semantic parsing of an NL sentence involves finding its most probable semantic derivation
- It is straightforward to obtain the MR from a semantic derivation

Semantic Derivation of an NL Sentence
MR parse with non-terminals on the nodes:
  ANSWER (answer)
    RIVER
      TRAVERSE_2 (traverse_2)
      STATE
        NEXT_TO (next_to)
        STATE
          STATEID (stateid 'texas')
Which rivers run through the states bordering Texas?

Semantic Derivation of an NL Sentence
MR parse with productions on the nodes:
  ANSWER → answer(RIVER)
    RIVER → TRAVERSE_2(STATE)
      TRAVERSE_2 → traverse_2
      STATE → NEXT_TO(STATE)
        NEXT_TO → next_to
        STATE → STATEID
          STATEID → 'texas'
Which rivers run through the states bordering Texas?

Semantic Derivation of an NL Sentence
Semantic derivation: each node contains a production and the substring of the NL sentence it covers:
  (ANSWER → answer(RIVER), [1..9])
    (RIVER → TRAVERSE_2(STATE), [1..9])
      (TRAVERSE_2 → traverse_2, [1..4])
      (STATE → NEXT_TO(STATE), [5..9])
        (NEXT_TO → next_to, [5..7])
        (STATE → STATEID, [8..9])
          (STATEID → 'texas', [8..9])
Which rivers run through the states bordering Texas?
  1      2     3      4     5    6       7        8   9
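
A small sketch of the node structure these slides describe; the names are illustrative, not KRISP's actual code.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class DerivationNode:
    production: str          # e.g. "NEXT_TO -> next_to"
    span: Tuple[int, int]    # 1-based word indices, inclusive, e.g. (5, 7)
    children: tuple = ()     # child nodes covering disjoint sub-spans

leaf = DerivationNode("STATEID -> 'texas'", (8, 9))
node = DerivationNode("STATE -> STATEID", (8, 9), (leaf,))
```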

Semantic Derivation of an NL Sentence
Substrings in the NL sentence may be in a different order, so nodes are allowed to permute the children productions from the original MR parse:
  (ANSWER → answer(RIVER), [1..10])
    (RIVER → TRAVERSE_2(STATE), [1..10])
      (STATE → NEXT_TO(STATE), [1..6])
        (NEXT_TO → next_to, [1..5])
        (STATE → STATEID, [6..6])
          (STATEID → 'texas', [6..6])
      (TRAVERSE_2 → traverse_2, [7..10])
Through the states that border Texas which rivers run?
   1      2     3     4     5      6     7     8    9  10

Probability of a Semantic Derivation
Let Pπ(s[i..j]) be the probability that production π covers the substring s[i..j], e.g., P_{NEXT_TO → next_to}("the states bordering") for the node (NEXT_TO → next_to, [5..7]) covering words 5-7
Obtained from the string-kernel-based SVM classifier trained for each production π
The probability of a semantic derivation D is the product of its node probabilities:
  P(D) = ∏_{(π,[i..j]) ∈ D} Pπ(s[i..j])

Computing the Most Probable Semantic Derivation
The task of semantic parsing is to find the most probable semantic derivation
Let E_{n,s[i..j]}, a partial derivation, denote any subtree of a derivation tree with n as the LHS non-terminal of the root production, covering sentence s from index i to j
Example of E_{STATE,s[5..9]}:
  (STATE → NEXT_TO(STATE), [5..9])
    (NEXT_TO → next_to, [5..7])
    (STATE → STATEID, [8..9])
      (STATEID → 'texas', [8..9])
  covering "the states bordering Texas?" (words 5-9)
The full derivation D is then E_{ANSWER,s[1..|s|]}

Computing the Most Probable Semantic Derivation contd.
Let E*_{STATE,s[5..9]} denote the most probable partial derivation among all E_{STATE,s[5..9]}
This is computed recursively: for the production STATE → NEXT_TO(STATE) covering [5..9], try every split point, combining E*_{NEXT_TO,s[5..5]} with E*_{STATE,s[6..9]}, E*_{NEXT_TO,s[5..6]} with E*_{STATE,s[7..9]}, ..., up to E*_{NEXT_TO,s[5..8]} with E*_{STATE,s[9..9]}, and also the same splits with the children permuted (E*_{STATE,s[i..j]} before E*_{NEXT_TO,s[i..j]}); keep the combination with the highest probability
  the states bordering Texas?
   5    6      7        8   9
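
A simplified sketch of this recursion as a memoized span dynamic program. It handles leaf, unary, and binary productions and tries every split point; KRISP's actual parser extends Earley's algorithm (next slide) and also permutes children, which is omitted here. P and grammar are assumed given, and the grammar is assumed to have no unary cycles.

```python
from functools import lru_cache

def most_probable_derivation(P, grammar, root, length):
    """P(prod, i, j): that production's classifier probability on words i..j.
    grammar[n]: list of (prod, rhs_nonterminals). Spans are 1-based, inclusive."""

    @lru_cache(maxsize=None)
    def E(n, i, j):
        best_p, best_d = 0.0, None
        for prod, rhs in grammar[n]:
            if not rhs:                              # leaf, e.g. NEXT_TO -> next_to
                p, d = P(prod, i, j), (prod, i, j)
            elif len(rhs) == 1:                      # unary, e.g. STATE -> STATEID
                cp, cd = E(rhs[0], i, j)
                p, d = P(prod, i, j) * cp, (prod, i, j, cd)
            else:                                    # binary RHS: try every split point
                p, d = 0.0, None
                for k in range(i, j):
                    p1, d1 = E(rhs[0], i, k)
                    p2, d2 = E(rhs[1], k + 1, j)
                    cand = P(prod, i, j) * p1 * p2
                    if cand > p:
                        p, d = cand, (prod, i, j, d1, d2)
            if p > best_p:
                best_p, best_d = p, d
        return best_p, best_d

    return E(root, 1, length)
```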

Computing the Most Probable Semantic Derivation contd.
Implemented by extending Earley's [1970] context-free grammar parsing algorithm:
- Predicts subtrees top-down and completes them bottom-up
- A dynamic programming algorithm which generates and compactly stores each subtree once
Extended because:
- The probability of a production depends on which substring of the sentence it covers
- Leaves are not terminals but substrings of words

Computing the Most Probable Semantic Derivation contd.
Performs a greedy approximate search with beam width ω and returns the ω most probable derivations it finds
Uses a threshold θ to prune low-probability trees

Overview of KRISP
Training: MRL grammar + NL sentences with MRs → collect positive and negative examples → train string-kernel-based SVM classifiers Pπ(s[i..j]) → best semantic derivations (correct and incorrect) feed back into example collection → semantic parser
Testing: novel NL sentences → best MRs

KRISP's Training Algorithm
- Takes NL sentences paired with their respective MRs as input
- Obtains the MR parses
- Proceeds in iterations
- In the first iteration, for every production π: call those sentences positives whose MR parses use that production, and call the remaining sentences negatives

KRISP's Training Algorithm contd.
First iteration, production STATE → NEXT_TO(STATE):
Positives:
- which rivers run through the states bordering texas?
- what is the most populated state bordering oklahoma ?
- what is the largest city in states that border california ?
- ...
Negatives:
- what state has the highest population ?
- what states does the delaware river run through ?
- which states have cities named austin ?
- what is the lowest point of the state with the largest area ?
These train the string-kernel-based SVM classifier P_{STATE → NEXT_TO(STATE)}(s[i..j])
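
A sketch (assuming scikit-learn and the subseq_kernel sketch from earlier) of this first iteration: one binary SVM per production, trained against a shared precomputed Gram matrix. The corpus format is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def train_production_classifiers(corpus, productions):
    """corpus: list of (sentence, set_of_productions_used_in_its_MR_parse).
    Returns one Platt-calibrated SVM per production that has both classes.
    Note: stable probability estimates need enough examples of each class."""
    classifiers = {}
    sents = [s for s, _ in corpus]
    gram = np.array([[subseq_kernel(a, b) for b in sents] for a in sents],
                    dtype=float)
    for prod in productions:
        y = [1 if prod in prods else 0 for _, prods in corpus]
        if 0 < sum(y) < len(y):                 # need positives and negatives
            clf = SVC(kernel="precomputed", probability=True)
            classifiers[prod] = clf.fit(gram, y)
    return classifiers
```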

KRISP's Training Algorithm contd.
- Using these classifiers Pπ(s[i..j]), obtain the ω best semantic derivations of each training sentence
- Some of these derivations give the correct MR (correct derivations); some give incorrect MRs (incorrect derivations)
- For the next iteration, collect positives from the most probable correct derivation
- Collect negatives from incorrect derivations with higher probability than the most probable correct derivation

KRISP's Training Algorithm contd.
Most probable correct derivation (positive examples are collected from its nodes):
  (ANSWER → answer(RIVER), [1..9])
    (RIVER → TRAVERSE_2(STATE), [1..9])
      (TRAVERSE_2 → traverse_2, [1..4])
      (STATE → NEXT_TO(STATE), [5..9])
        (NEXT_TO → next_to, [5..7])
        (STATE → STATEID, [8..9])
          (STATEID → 'texas', [8..9])
Which rivers run through the states bordering Texas?
  1      2     3      4     5    6       7        8   9

KRISP's Training Algorithm contd.
An incorrect derivation with probability greater than the most probable correct derivation:
  (ANSWER → answer(RIVER), [1..9])
    (RIVER → TRAVERSE_2(STATE), [1..9])
      (TRAVERSE_2 → traverse_2, [1..7])
      (STATE → STATEID, [8..9])
        (STATEID → 'texas', [8..9])
Which rivers run through the states bordering Texas?
Incorrect MR: answer(traverse_2(stateid('texas')))

KRISP's Training Algorithm contd.
Comparing the most probable correct derivation with the more probable incorrect derivation shown above:
1. Traverse both trees in breadth-first order till the first nodes where their productions differ are found
2. Mark the words under these nodes
3. Consider all the productions covering the marked words; collect negatives for productions which cover any marked word in the incorrect derivation but not in the correct derivation

KRISP's Training Algorithm contd.
Next iteration, production STATE → NEXT_TO(STATE):
Positives:
- the states bordering texas?
- state bordering oklahoma ?
- states that border california ?
- states which share border
- next to state of iowa
- ...
Negatives:
- what state has the highest population ?
- what states does the delaware river run through ?
- which states have cities named austin ?
- what is the lowest point of the state with the largest area ?
- which rivers run through states bordering
- ...
These train the string-kernel-based SVM classifier P_{STATE → NEXT_TO(STATE)}(s[i..j])

KRISP's Training Algorithm contd.
In the next iteration, the SVM classifiers are trained with the new positive examples and the accumulated negative examples
This process is iterated a specified number of times

Experimental Corpora
CLang:
- 300 randomly selected pieces of coaching advice from the log files of the 2003 RoboCup Coach Competition
- 22.52 words on average in the NL sentences
- 13.42 tokens on average in the MRs
Geo250 [Zelle & Mooney, 1996]:
- 250 queries for the given U.S. geography database
- 6.76 words on average in the NL sentences
- 6.20 tokens on average in the MRs
Geo880 [Tang & Mooney, 2001]:
- Superset of Geo250 with 880 queries
- 7.48 words on average in the NL sentences
- 6.47 tokens on average in the MRs

Experimental Methodology
Evaluated using standard 10-fold cross validation
Correctness:
- CLang: the output exactly matches the correct representation
- Geoquery: the resulting query retrieves the same answer as the correct representation
Metrics:
- Precision: percentage of output MRs that are correct
- Recall: percentage of all NL sentences for which correct MRs are obtained

Experimental Methodology contd.
Compared systems:
- SILT [Kate, Wong & Mooney, 2005]
- WASP [Wong, 2005]
- SCISSOR [Ge & Mooney, 2005]
- CHILL with the COCKTAIL ILP algorithm [Tang & Mooney, 2001]
- Zettlemoyer & Collins (2005): different experimental setup (600 training, 280 testing examples); results available only for the Geo880 corpus
- Geobase: hand-built NL interface [Borland International, 1988]; results available only for Geo250

Experimental Methodology contd.
- KRISP gives probabilities for its semantic derivations, which are taken as confidences in the MRs
- We plot precision-recall curves by sorting the best MR for each sentence by confidence and then finding the precision at every recall value
- WASP and SCISSOR also output confidences, so we show their precision-recall curves
- Results of the other systems are shown as points on the precision-recall graphs
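
A sketch of how such a precision-recall curve can be traced from confidences; `results` and `total` are illustrative inputs, one (confidence, correct?) pair per sentence that produced an MR.

```python
def pr_curve(results, total):
    """results: list of (confidence, is_correct) pairs, one per output MR.
    total: number of test sentences. Returns (recall, precision) points."""
    points = []
    correct = produced = 0
    for conf, ok in sorted(results, reverse=True):  # sweep threshold downward
        produced += 1
        correct += ok
        points.append((correct / total,        # recall
                       correct / produced))    # precision
    return points
```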

Results on CLang
[Precision-recall graph; an annotation on the graph notes that some compared systems require more annotation on the corpus]
CHILL gives 49.2% precision at 12.67% recall with 160 training examples, and cannot be run beyond that

Results on Geo250

Results on Geo880

Results on Multilingual Geo250
We have the Geo250 corpus translated into Japanese, Spanish, and Turkish
KRISP is directly applicable to other natural languages

Results on Multilingual Geo250

Outline Semantic Parsing Related Work Background on Kernel-based Methods Completed Research Proposed Research (Short-term, Long-term) Conclusions

Short Term: Exploiting Natural Language Syntax
- KRISP currently uses only the word order of the sentence
- Semantic interpretation depends largely on NL syntax; exploiting it should help semantic parsing
- We already have syntactic annotations on our corpora, used in SILT-tree and SCISSOR
- Existing syntactic parsers can be trained on our corpora in addition to the WSJ [Bikel, 2004]

Exploiting Natural Language Syntax contd.
The most natural extension of KRISP is to use a syntactic-tree kernel instead of the string kernel
Syntactic-tree kernel, introduced by Collins & Duffy [2001]: K(x,y) = number of subtrees common between x and y
Example: for the parse trees of "left side of our penalty area" and "left side of the midfield", counting the common subtrees one at a time gives K(x,y) = 8
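
A sketch of the Collins & Duffy [2001] subtree-counting recursion, with trees encoded as nested tuples; the slide's full parse trees are omitted, and the small demo below uses toy trees.

```python
def nodes(t):
    """All internal (non-word) nodes of a tuple-encoded tree."""
    if isinstance(t, str):
        return []
    return [t] + [n for c in t[1:] for n in nodes(c)]

def production(t):
    """Node label plus the labels/words of its immediate children."""
    return (t[0],) + tuple(c if isinstance(c, str) else c[0] for c in t[1:])

def C(n1, n2):
    """Number of common subtrees rooted at this node pair."""
    if production(n1) != production(n2):
        return 0
    if all(isinstance(c, str) for c in n1[1:]):   # preterminal over the same word
        return 1
    result = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str):               # word children already matched
            result *= 1 + C(c1, c2)
    return result

def tree_kernel(x, y):
    return sum(C(n1, n2) for n1 in nodes(x) for n2 in nodes(y))

x = ("NP", ("JJ", "left"), ("NN", "side"))
y = ("NP", ("JJ", "left"), ("NN", "area"))
print(tree_kernel(x, y))  # 3: [JJ left], [NP JJ NN], [NP [JJ left] NN]
```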

Exploiting Natural Language Syntax contd.
- Often the syntactic information needed is present in dependency trees; full syntactic trees are not necessary
- Dependency trees capture the most important functional relationships between words
- Various dependency-tree kernels have been used successfully for information extraction [Zelenko, Aone & Richardella, 2003], [Cumby & Roth, 2003], [Culotta & Sorenson, 2004], [Bunescu & Mooney, 2005a]

Short Term: Noisy NL Sentences
If users interact with the semantic parser through speech, noise can enter in many ways [Zue & Glass, 2000]:
- Speech recognition errors
- Interjections (um's and ah's)
- Environment noise (door slams, phone rings, etc.)
- Out-of-domain words and ill-formed utterances
In KRISP, the presence of extra or corrupted words may decrease kernel values but will not cause hard parsing failures
KRISP should hence be more robust to noise than systems with hard-matching rules like SILT and WASP, or systems doing complete syntactic-semantic parsing like SCISSOR

Noisy NL Sentences contd.
We plan to first do preliminary experiments by artificially corrupting our existing corpora, and then to obtain and experiment with a real-world noisy corpus
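
A sketch of one way the artificial corruption could look (interjection insertion and word dropping; the rates are illustrative, and word substitution is omitted for brevity):

```python
import random

INTERJECTIONS = ["um", "ah"]

def corrupt(sentence, p_drop=0.1, p_interject=0.1, rng=random):
    out = []
    for word in sentence.split():
        if rng.random() < p_interject:
            out.append(rng.choice(INTERJECTIONS))  # inject an interjection
        if rng.random() >= p_drop:                 # keep word with prob 1 - p_drop
            out.append(word)
    return " ".join(out)

print(corrupt("our left penalty area"))
```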

Short Term: Committees of Semantic Parsers

System               Correct CLang MRs out of 300
KRISP                178
WASP                 185
SCISSOR              232

Committee            Upper bound on correct MRs
KRISP+WASP           223
KRISP+SCISSOR        253
WASP+SCISSOR         246
KRISP+WASP+SCISSOR   259

A good indication that forming a committee of these parsers will improve performance.

Committees of Semantic Parsers contd.
Two general approaches to combining parse trees [Henderson & Brill, 1999]:
- Parser switching: learn which parser works best on which types of sentences
- Parse hybridization: look into the output MRs and combine their best components; particularly useful when none of the parsers generates complete MRs
Prior work is specific to combining syntactic parses; we plan to explore these two general approaches for combining MRs

Long Term: Non-parallel Training Corpus
- So far, the training data contained NL sentences aligned with their respective MRs
- In some domains, many NL sentences and semantic MRs may be available but not aligned
- For example, in the RoboCup commentary task [Binsted et al., 2000], NL sentences and symbolic descriptions of events are available but not aligned
- Referential ambiguity: which NL description refers to which symbolic description?
- In our present work we resolve which portion of the sentence refers to which production of the MR parse; the same approach could be extended one level higher

Non-parallel Training Corpus contd.
- Let the training corpus be {(Mi, Si) | i = 1..N}, where each Mi is a set of MRs and each Si is a set of NL sentences
- Align every MR in Mi with every NL sentence in Si, for i = 1..N
- Use KRISP's training algorithm to learn classifiers
- Find the best alignment between the MRs and NL sentences in each (Mi, Si) by semantic parsing using these classifiers
- Repeat till the alignments don't change
We plan to first do preliminary experiments by artificially making our corpora non-parallel and recovering the alignments, and then test on a real-world corpus
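
A sketch of this alternating alignment-and-training loop; train and parse_confidence stand in for KRISP's training algorithm and parser, so those function names are assumptions.

```python
def learn_from_nonparallel(corpus, train, parse_confidence, max_iters=10):
    """corpus: list of (M_i, S_i) pairs of MR sets and sentence sets."""
    # Start fully aligned: every MR in M_i paired with every sentence in S_i
    pairs = [(m, s) for M, S in corpus for m in M for s in S]
    for _ in range(max_iters):
        parser = train(pairs)
        # Re-align: each sentence keeps the MR its parse is most confident in
        new_pairs = [(max(M, key=lambda m: parse_confidence(parser, s, m)), s)
                     for M, S in corpus for s in S]
        if new_pairs == pairs:   # alignments stable: stop
            break
        pairs = new_pairs
    return parser
```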

Long Term: Complex Relation Extraction
- Bunescu & Mooney [2005b] use a string-based kernel to extract the binary "protein-protein interaction" relation from text
- This can be viewed as learning for an MRL grammar with only one production: INTERACTION → PROTEIN PROTEIN
- A complex relation is an n-ary relation among n typed entities [McDonald et al., 2005]
- For example, (person, job, company): NL sentence "John Smith is the CEO of Inc. Corp." yields the extraction (John Smith, CEO, Inc. Corp.)

Complex Relation Extraction contd.
The ternary relation (person, job, company) decomposes into the binary subrelations (person, job) and (job, company), e.g., over "John Smith is the CEO of Inc. Corp."
KRISP should be applicable to extracting complex relations by treating the n-ary relation like a higher-level production composed of lower-level productions.

Conclusions
- KRISP: a new kernel-based approach to learning semantic parsers
- String-kernel-based SVM classifiers are trained for each MRL production
- The classifiers are used to compositionally build complete MRs of NL sentences
- Evaluated on two real-world corpora: performs better than deterministic rule-based systems and comparably to recent statistical systems
- Proposed work: exploit NL syntax, form committees, and broaden application domains

Thank You! Questions??

Extra: Dealing with Constants
The MRL grammar may contain productions corresponding to constants in the domain:
  STATEID → 'new york'   RIVERID → 'colorado'   NUM → '2'   STRING → 'DR4C10'
- The user can specify these as constant productions, giving their NL substrings
- Classifiers are not learned for these productions
- A matching substring's probability is taken as 1
- If n constant productions share the same substring, then each gets probability 1/n, e.g.: STATEID → 'colorado', RIVERID → 'colorado'
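
A tiny sketch of this rule; the dictionary format is illustrative.

```python
def constant_probability(substring, constant_productions):
    """constant_productions: e.g. {"STATEID -> 'colorado'": "colorado",
                                   "RIVERID -> 'colorado'": "colorado"}.
    A matched constant gets probability 1, split evenly among the n
    constant productions that share the same NL substring."""
    matches = [p for p, text in constant_productions.items() if text == substring]
    return {p: 1.0 / len(matches) for p in matches} if matches else {}
```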

Extra: Better String Subsequence Kernel
- Subsequences with gaps should be downweighted
- A decay factor λ in the range (0,1] penalizes gaps
- All subsequences are the implicit features, and the penalties are the feature values
s = "left side of our penalty area"
t = "our left penalty area"
For u = left penalty: a gap of 3 in s gives λ^3, a gap of 0 in t gives λ^0, so the subsequence contributes λ^3 * λ^0
Altogether, K(s,t) = 4 + 3λ + 3λ^3 + λ^5
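
A brute-force sketch of one reasonable formalization of this weighting: each occurrence of a subsequence is weighted by λ raised to the number of skipped words, and the kernel sums the products over shared subsequences. With λ = 1 it recovers the unweighted count of 11 for the example; the slide's exact grouping of terms may differ slightly from this bookkeeping.

```python
from itertools import combinations

def weighted_occurrences(words, lam):
    """Map each subsequence to the sum of lam**gaps over its occurrences,
    where gaps = skipped positions inside the occurrence's span."""
    occ = {}
    for r in range(1, len(words) + 1):
        for idx in combinations(range(len(words)), r):
            u = tuple(words[i] for i in idx)
            gaps = (idx[-1] - idx[0] + 1) - r
            occ[u] = occ.get(u, 0.0) + lam ** gaps
    return occ

def gap_kernel(s, t, lam):
    fs = weighted_occurrences(s.split(), lam)
    ft = weighted_occurrences(t.split(), lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)

s = "left side of our penalty area"
t = "our left penalty area"
print(gap_kernel(s, t, 1.0))  # 11.0, the unweighted common-subsequence count
```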

Extra: KRISP's Training Algorithm contd.
What if none of the ω most probable derivations of a sentence is correct? The extended Earley's algorithm can be forced to derive only correct derivations by making sure that all the subtrees it generates exist in the correct MR parse.

Extra: N-best MRs for Geo880

Extra: KRISP's Average Running Times

Corpus   Average training time (minutes)   Average testing time (minutes)
Geo250    1.44                              0.05
Geo880   18.1                               0.65
CLang    58.85                              3.18

Average running times per fold, in minutes, taken by KRISP.

Extra: KRISP’s Learning PR Curves on CLang

Extra: KRISP’s Learning PR Curves on Geo250

Extra: KRISP’s Learning PR Curves on Geo880

Extra: Experimental Methodology
Correctness:
- CLang: the output exactly matches the correct representation
- Geoquery: the resulting query retrieves the same answer as the correct representation
Example: "If the ball is in our penalty area, all our players except player 4 should stay in our half."
Correct: ((bpos (penalty-area our)) (do (player-except our{4}) (pos (half our))))
Output:  ((bpos (penalty-area opp)) (do (player-except our{4}) (pos (half our))))

Extra: Formal Language Grammar
NL: If our player 4 has the ball, our player 4 should shoot.
CLang: ((bowner our {4}) (do our {4} shoot))
CLang parse:
- Non-terminals: RULE, CONDITION, ACTION, ...
- Terminals: bowner, our, 4, ...
- Productions: RULE → CONDITION DIRECTIVE, DIRECTIVE → do TEAM UNUM ACTION, ACTION → shoot, ...
  RULE
    CONDITION: bowner our 4
    DIRECTIVE → do TEAM UNUM ACTION: do our 4 shoot