
1 Introduction to Computational Natural Language Learning
Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)
The Graduate School of the City University of New York, Fall 2001
William Gregory Sakas
Hunter College, Department of Computer Science
Graduate Center, PhD Programs in Computer Science and Linguistics
The City University of New York

2 Syntax acquisition can be viewed as a state space search:
— nodes represent grammars, including a start state and a target state G_targ
— arcs represent a possible change from one hypothesized grammar to another
[Figure: a possible state space for a parameter space with 3 parameters; G_targ marks the target.]

3 Local state space for the TLA (not the TLA⁻):
Error-driven: the learner shifts grammars only if s ∉ L(G_curr).
Greediness: a new hypothesis is kept only if s ∈ L(G_attempt).
SVC (Single Value Constraint): if G_curr = 010, then G_attempt = a random G ∈ {000, 110, 011}, i.e. exactly one parameter is flipped.
Example arcs: s ∉ L(G_curr) ∧ G_attempt = 110 ∧ s ∈ L(G_110); s ∉ L(G_curr) ∧ G_attempt = 011 ∧ s ∈ L(G_011); s ∉ L(G_curr) ∧ G_attempt = 000 ∧ s ∈ L(G_000).

4 A new probabilistic formulation of TLA performance:
α_i denotes the ambiguity factor: α_i = Pr(s ∈ L(G_targ) ∩ L(G_i))
β_{i,j} denotes the overlap factor: β_{i,j} = Pr(s ∈ L(G_i) ∩ L(G_j))
γ_i denotes the probability of picking, or "looking ahead" at, a new hypothesis grammar G_i

5 The probability that the learner moves from state G_curr to state G_new is
Pr(G_curr → G_new) = (1 − α_curr) · γ_new · Pr(G_new can parse s | G_curr can't parse s)
(The three factors reflect error-drivenness, the SVC choice of hypothesis, and greediness.)
After some algebra:
Pr(G_curr → G_new) = γ_new · (α_new − β_{curr,new})
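As an illustration of how the closed-form transition probability above can be evaluated, here is a minimal Python sketch. The function name and the numeric values are purely illustrative assumptions, not figures from the experiments.

```python
def tla_transition_prob(alpha_new, beta_curr_new, gamma_new):
    """Pr(G_curr -> G_new) = gamma_new * (alpha_new - beta_curr_new).

    alpha_new     : Pr(s in L(G_targ) and in L(G_new))  -- ambiguity factor of G_new
    beta_curr_new : Pr(s in L(G_curr) and in L(G_new))  -- overlap factor of the pair
    gamma_new     : Pr(the learner picks / "looks ahead" at G_new)
    """
    return gamma_new * (alpha_new - beta_curr_new)

# Hypothetical values: G_new can parse 40% of target sentences, 10% of target
# sentences are parsable by both G_curr and G_new, and G_new is one of three
# single-value neighbours the TLA may try, so gamma_new = 1/3.
print(tla_transition_prob(alpha_new=0.4, beta_curr_new=0.1, gamma_new=1/3))  # 0.1
```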

6 [Figure: parameter space H_4 with target grammar G_targ = 1111, showing the G-Rings G_2 and G_4 of grammars around the target.]
Each ring, or G-Ring, contains exactly those grammars at a certain Hamming distance from the target. For example, ring G_2 contains 0011, 0101, 1100, 1010, 1001 and 0110, all of which differ from the target grammar 1111 by 2 bits.
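The G-Rings of the example space H_4 are easy to enumerate programmatically. The short Python sketch below (the function name is ours) groups all 16 grammars by Hamming distance from the target 1111 and reproduces the ring G_2 listed above.

```python
from itertools import product

def g_rings(target="1111"):
    """Group every grammar of the parameter space by Hamming distance from the target."""
    n = len(target)
    rings = {d: [] for d in range(n + 1)}
    for bits in product("01", repeat=n):
        grammar = "".join(bits)
        distance = sum(a != b for a, b in zip(grammar, target))
        rings[distance].append(grammar)
    return rings

print(g_rings("1111")[2])
# ['0011', '0101', '0110', '1001', '1010', '1100']  -- the six grammars of ring G_2
```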

7 Smoothness: there exists a correlation between the similarity of grammars and the similarity of the languages that they generate.
Weak Smoothness Requirement: all the members of a G-Ring can parse s with equal probability.
Strong Smoothness Requirement: the parameter space is weakly smooth, and the probability that s can be parsed by a member of a G-Ring increases monotonically as distance from the target decreases.

8 Experimental setup (goal: find the 'sweet spot' for TLA performance):
1) Adapt the formulas for the transition probabilities to work with G-Rings.
2) Build a generic transition matrix into which varying values of α and β can be plugged.
3) Use standard Markov techniques to calculate the expected number of inputs consumed by the system (construct the fundamental matrix).
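Step 3 is the standard absorbing-Markov-chain computation: if Q is the transition matrix restricted to the non-target grammars, the fundamental matrix N = (I − Q)⁻¹ gives the expected number of visits to each transient state, and its row sums give the expected number of input sentences consumed before the target is reached. A minimal NumPy sketch, with an invented 3-state toy matrix rather than one of the experimental matrices:

```python
import numpy as np

def expected_inputs(P, absorbing):
    """Expected number of inputs consumed from each transient (non-target) state.

    P         : full one-step transition matrix (rows sum to 1)
    absorbing : indices of absorbing states (here, the target grammar)
    """
    transient = [i for i in range(len(P)) if i not in absorbing]
    Q = P[np.ix_(transient, transient)]       # transitions among non-target grammars
    N = np.linalg.inv(np.eye(len(Q)) - Q)     # fundamental matrix N = (I - Q)^-1
    return N.sum(axis=1)                      # expected #sentences to convergence

# Toy example: states 0 and 1 are non-target grammars, state 2 is G_targ.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
print(expected_inputs(P, absorbing=[2]))      # roughly [5.71, 4.29]
```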

9 Three experiments:
1) G-Rings are equally likely to parse an input sentence (uniform domain)
2) G-Rings are strongly smooth (smooth domain)
3) Anything-goes domain
Problem: how to find the combinations of α and β that are optimal?
Solution: use an optimization algorithm: GRG2 (Lasdon and Waren, 1978).

10 Result 1: The TLA performs worse than blind guessing in a uniform domain: an exponential increase in the number of sentences.
[Chart, logarithmic scale; results obtained employing optimal values of α and β.]

11 Result 2: The TLA performs extremely well in a smooth domain, but the increase in the number of sentences is still nonlinear.
[Chart, linear scale; results obtained employing optimal values of α and β.]

12 Result 3: The TLA performs a bit better still in the anything-goes scenario; the optimizer chooses an 'accelerating' strong smoothness.
[Chart, linear scale; results obtained employing optimal values of α and β.]

13 In summary:
The TLA is an infeasible learner: with cross-language ambiguity uniformly distributed across the domain of languages, the number of sentences consumed by the TLA is exponential in the number of parameters.
The TLA is a feasible learner: in strongly smooth domains, the number of sentences increases at a rate much closer to linear as the number of parameters increases (i.e. as the number of grammars increases exponentially).

14 A second case study: The Structural Triggers Learner (Fodor 1998)

15 The Parametric Principle (Fodor 1995, 1998; Sakas and Fodor, 2001): set individual parameters; do not evaluate whole grammars.
Each successful learning event halves the size of the grammar pool. E.g. when 5 of 30 parameters have been set, only 2⁻⁵ ≈ 3% of the grammar pool remains.

16 Problem: the Parametric Principle requires certainty. But how can the learner know when a sentence may be parametrically ambiguous?
Solution: the Structural Triggers Learner (STL), Fodor (1995, 1998). For the STL, a parameter value = a structural trigger = a "treelet".
[Diagram: the VO/OV ("V before O") parameter: on = VP[V O] (e.g. English), off = VP[O V] (e.g. German).]

17 STL Algorithm:
1) Take an input sentence.
2) Parse it with the current grammar G_curr.
   Success → keep G_curr.
   Failure → parse with G_curr plus all parametric treelets, and adopt the treelets that contributed to the parse.

18 So, the STL
— uses the parser to decode the parametric signatures of sentences,
— can detect parametric ambiguity (the waiting-STL variant does not learn from sentences that contain a choice point),
— and thus can abide by the Parametric Principle.
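The loop described on the last two slides can be sketched as follows. This is only a schematic rendering: the parser interface (parse, parse_with_supergrammar) and the representation of treelets as set elements are placeholders introduced for illustration, not part of the original formulation.

```python
def waiting_stl(sentences, all_treelets, parse, parse_with_supergrammar):
    """Schematic waiting-STL: adopt treelets only from parametrically unambiguous inputs.

    parse(grammar, s) -> True if s is parsable with the treelets adopted so far.
    parse_with_supergrammar(grammar, all_treelets, s) -> a list of parses of s,
        each given as the set of parametric treelets that contributed to it.
    """
    grammar = set()                          # treelets (parameter values) adopted so far
    for s in sentences:
        if parse(grammar, s):                # success: keep G_curr, learn nothing
            continue
        parses = parse_with_supergrammar(grammar, all_treelets, s)
        if len(parses) == 1:                 # no choice point: parametrically unambiguous
            grammar |= parses[0]             # adopt the treelets that contributed
        # otherwise a choice point was detected; the waiting-STL learns nothing from s
    return grammar
```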

19 Computationally modeling STL performance:
— nodes represent the current number of parameters that have been set (not grammars)
— arcs represent a possible change in the number of parameters that have been set
[Figure: a state space for the STL performing in a 3-parameter domain; here, each input may express 0, 1 or 2 new parameters.]

20 Transition probabilities for the waiting-STL depend on:
— the number of parameters that have been set (t), i.e. the learner's state
— the number of relevant parameters (r)
— the expression rate (e)
— the ambiguity rate (a)
— the "effective" expression rate (e′)
(r, e, a and e′ formalize the input space.)

21 Transition probabilities for the STL-minus: the Markov transition probability of shifting from state S_t to S_{t+w} is the product of
— the probability of encountering a sentence s that expresses w "new" (as yet unset) parameters, and
— the probability that the parameters expressed by s are expressed unambiguously,
i.e. the probability of setting w "new" parameters, given that t have been set (and given values of r, e, e′).
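One concrete way to instantiate these two factors, purely as an assumption for illustration (the original analysis may parameterize them differently), is to treat the e parameters expressed by a sentence as a uniform draw from the r relevant ones and to let each expressed parameter be ambiguous independently with probability a. Under those assumptions the transition matrix over states t = 0 ... r can be built as follows:

```python
from math import comb

def p_new_params(w, t, r, e):
    """Pr(a sentence expresses exactly w not-yet-set parameters), assuming its e
    expressed parameters are a uniform (hypergeometric) draw from the r relevant
    ones, t of which are already set.  This form is an assumption."""
    if w > e or w > r - t or e - w > t:
        return 0.0
    return comb(r - t, w) * comb(t, e - w) / comb(r, e)

def p_unambiguous(e, a):
    """Pr(all e expressed parameters are expressed unambiguously), assuming each
    is ambiguous independently with probability a (also an assumption)."""
    return (1.0 - a) ** e

def transition_matrix(r, e, a):
    """Markov chain over states t = 0..r (number of parameters set so far)."""
    P = [[0.0] * (r + 1) for _ in range(r + 1)]
    for t in range(r):
        for w in range(1, min(e, r - t) + 1):
            P[t][t + w] = p_new_params(w, t, r, e) * p_unambiguous(e, a)
        P[t][t] = 1.0 - sum(P[t][t + 1:])   # no usable learning event on this input
    P[r][r] = 1.0                           # all parameters set: absorbing state
    return P
```

Feeding such a matrix (converted to a NumPy array) into the fundamental-matrix sketch shown earlier then yields the expected number of inputs to convergence for given r, e and a.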

22 Results after Markov analysis for the STL-minus: exponential in the percentage of ambiguity, but seemingly linear in the number of parameters.

23 Results for the STL (NOT minus): exponential in the percentage of ambiguity, but seemingly linear in the number of parameters.

24 Striking effect of ambiguity (r fixed).
[Chart, logarithmic scale: 20 parameters to be set, 10 parameters expressed per input.]

25 Subtle effect of ambiguity on efficiency with respect to r: as ambiguity increases, the cost of the Parametric Principle skyrockets as the domain scales up (r increases).
[Chart, linear scale: x axis = number of parameters in the domain; 10 parameters expressed per input.]

26 The effect of ambiguity (interacting with e and r): how and where is the cost incurred? By far the greatest damage inflicted by ambiguity occurs at the very earliest stages of learning: the wait for the first fully unambiguous trigger, plus a small additional wait for sentences that express the last few parameters unambiguously.

27 The logarithm of the expected number of sentences consumed by the waiting-STL in each state after learning has started, moving closer and closer to convergence.
[Chart, logarithmic scale: e = 10, r = 30, e′ = 0.2 (a′ = 0.8).]

28 STL — Bad News: ambiguity is damaging even to a parametrically principled learner. Abiding by the Parametric Principle does not, in and of itself, guarantee a merely linear increase in the complexity of the learning task as the number of parameters increases.

29 STL — Good News Part 1: the learning task might be manageable if there are at least some sentences with low expression to get learning off the ground (Sakas and Fodor, 1998).
[Figure: regions of 'can learn' vs 'can't learn' as a function of the number of parameters expressed per sentence.]

30 Add a distribution factor to the transition probabilities: the probability that i parameters are expressed by a sentence, given a distribution D on the input text I.
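Continuing the hypothetical sketch above (and reusing its p_new_params and p_unambiguous helpers), the distribution factor can be folded in by summing the fixed-e transition probabilities over e, weighted by D; here D is taken to be the uniform distribution on 0 ... e_max used in the results that follow.

```python
def transition_matrix_mixed(r, a, e_max):
    """Transition matrix when the expression rate e is drawn per sentence
    from the uniform distribution D on 0..e_max (the distribution factor)."""
    P = [[0.0] * (r + 1) for _ in range(r + 1)]
    d = 1.0 / (e_max + 1)                            # Pr(e) under uniform D
    for t in range(r):
        for e in range(e_max + 1):
            for w in range(1, min(e, r - t) + 1):
                P[t][t + w] += d * p_new_params(w, t, r, e) * p_unambiguous(e, a)
        P[t][t] = 1.0 - sum(P[t][t + 1:])            # inputs that set no new parameters
    P[r][r] = 1.0
    return P
```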

31 Average number of inputs consumed by the waiting-STL when the expression rate is not fixed per sentence but e varies uniformly from 0 to e_max: still exponential in the percentage of ambiguity, but manageable.
For comparison: with e varying from 0 to 10, 430 sentences are required; with e fixed at 5, 3,466 sentences are required.

32 The effect of ambiguity is still exponential, but not as bad as for fixed e.
[Chart, logarithmic scale: r = 20, e uniformly distributed from 0 to 10.]

33 Effect of high ambiguity rates with a varying, uniformly distributed rate of expression: still exponential in a, but manageable (a larger domain than in the previous tables).

34 STL — Good News Part 2: with a uniformly distributed expression rate, the cost of the Parametric Principle is linear (in r) and doesn't skyrocket.
[Chart, linear scale.]

35 In summary: with a uniformly distributed expression rate, the number of sentences required by the STL falls in a manageable range (though still exponential in the percentage of ambiguity), and the number of sentences increases only linearly as the number of parameters increases (i.e. as the number of grammars increases exponentially).

36 No Best Strategy Conjecture (roughly in the spirit of Schaffer, 1994): algorithms may be extremely efficient in specific domains but not in others; there is generally no best learning strategy. The upshot: we have to know the specific facts about the distribution, or shape, of ambiguity in natural language.

37 Research agenda: a three-fold approach to building a cognitive computational model of human language acquisition:
1) formulate a framework to determine what distributions of ambiguity make for feasible learning;
2) conduct a psycholinguistic study to determine whether the facts of human (child-directed) language are in line with the conducive distributions;
3) conduct a computer simulation to check for performance nuances and potential obstacles (e.g. local maxima based on defaults, or Subset Principle violations).