1 Empirical Learning Methods in Natural Language Processing
Ido Dagan, Bar Ilan University, Israel

2 Introduction
Motivations for learning in NLP:
1. NLP requires huge amounts of diverse types of knowledge – learning makes knowledge acquisition more feasible, automatically or semi-automatically
2. Much of language behavior is preferential in nature, so we need to acquire both quantitative and qualitative knowledge

3 Introduction (cont.)
Apparently, empirical modeling has (so far) obtained mainly a "first-degree" approximation of linguistic behavior:
– Often, more complex models improve results only to a modest extent
– Often, several simple models obtain comparable results
Ongoing goal – deeper modeling of language behavior within empirical models

4 Linguistic Background (?)
Morphology
Syntax – tagging, parsing
Semantics:
– Interpretation – usually out of scope
– "Shallow" semantics: ambiguity, semantic classes and similarity, semantic variability

5 Information Units of Interest – Examples
Explicit units:
– Documents
– Lexical units: words, terms (surface/base form)
Implicit (hidden) units:
– Word senses, name types
– Document categories
– Lexical-syntactic units: part-of-speech tags
– Syntactic relationships between words – parsing
– Semantic relationships

6 Data and Representations
Frequencies of units
Co-occurrence frequencies:
– Between all relevant types of units (term-doc, term-term, term-category, sense-term, etc.)
Different representations and modeling:
– Sequences
– Feature sets/vectors (sparse)
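
A minimal sketch (not from the original slides) of collecting sparse term-term co-occurrence counts within a fixed window; the tokenization, window size, and example sentences are illustrative assumptions only:

```python
from collections import Counter

def cooccurrence_counts(documents, window=3):
    """Count term-term co-occurrence frequencies within a sliding window (sparse)."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()          # naive whitespace tokenization
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + 1 + window]:
                counts[(w, v)] += 1           # keep only observed pairs (sparse)
    return counts

docs = ["the judge read the sentence in court",
        "the first sentence of the paragraph is long"]
print(cooccurrence_counts(docs).most_common(3))
```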

7 Tasks and Applications
Supervised/classification: identify hidden units (concepts) of explicit units
– Syntactic analysis, word sense disambiguation, name classification, relations, categorization, …
Unsupervised: identify relationships and properties of explicit units (terms, docs)
– Association, topicality, similarity, clustering
Combinations

8 Using Unsupervised Methods within Supervised Tasks
– Extraction and scoring of features
– Clustering explicit units to discover hidden concepts and to reduce labeling effort
– Generalization of learned weights or triggering rules from known features to similar ones (similarity- or class-based)
– Similarity/distance to training examples as the basis for classification (nearest neighbor)

9 Characteristics of Learning in NLP
– Very high dimensionality
– Sparseness of data and relevant features
– Addressing the basic problems of language:
  Ambiguity – of concepts and features – one way to say many things
  Variability – many ways to say the same thing

10 Supervised Classification
Hidden concept is defined by a set of labeled training examples (category, sense)
Classification is based on entailment of the hidden concept by related elements/features
– Example: two senses of "sentence" – features such as word, paragraph, description entail Sense 1 (linguistic), while features such as judge, court, lawyer entail Sense 2 (legal)
Single or multiple concepts per example
– Word sense vs. document categories

11 Supervised Tasks and Features
Typical classification tasks:
– Lexical: word sense disambiguation, target word selection in translation, name-type classification, accent restoration, text categorization (notice task similarity)
– Syntactic: POS tagging, PP-attachment, parsing
– Complex: anaphora resolution, information extraction
Features ("feature engineering"):
– Adjacent context: words, POS – in various relationships (distance, syntactic), possibly generalized to classes
– Other: morphological, orthographic, syntactic
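
A minimal sketch of such feature engineering – extracting adjacent-context word features around a target token; the window size and feature names are illustrative assumptions:

```python
def context_features(tokens, position, window=2):
    """Build a sparse feature set from the words adjacent to the target token."""
    features = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue                          # skip the target word itself
        i = position + offset
        if 0 <= i < len(tokens):
            features[f"word[{offset:+d}]"] = tokens[i].lower()
    return features

tokens = "The judge read the sentence aloud in court".split()
print(context_features(tokens, tokens.index("sentence")))
# {'word[-2]': 'read', 'word[-1]': 'the', 'word[+1]': 'aloud', 'word[+2]': 'in'}
```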

12 Learning to Classify
Two possibilities for acquiring the "entailment" relationships:
– Manually, by an expert: time consuming and difficult – the "expert system" approach
– Automatically: the concept is defined by a set of training examples; training quantity/quality matters
Training: learn entailment of the concept by features of training examples (a model)
Classification: apply the model to new examples

13 Supervised Learning Scheme
[Diagram: "labeled" examples feed a training algorithm, which produces a classification model; the classification algorithm applies that model to new examples to produce classifications]
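
A minimal sketch of this scheme, reusing the two senses of "sentence" from slide 10; the count-based model below is an illustrative stand-in, not any specific algorithm from the course:

```python
from collections import defaultdict

def train(labeled_examples):
    """Training algorithm: labeled examples -> classification model (feature-label counts)."""
    model = defaultdict(lambda: defaultdict(int))
    for features, label in labeled_examples:
        for f in features:
            model[f][label] += 1
    return model

def classify(model, features):
    """Classification algorithm: apply the model to a new example."""
    scores = defaultdict(int)
    for f in features:
        for label, count in model[f].items():
            scores[label] += count
    return max(scores, key=scores.get) if scores else None

training = [({"word", "paragraph", "description"}, "sense1_linguistic"),
            ({"judge", "court", "lawyer"}, "sense2_legal")]
model = train(training)
print(classify(model, {"judge", "court"}))    # sense2_legal
```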

14 Avoiding/Reducing Manual Labeling
Basic supervised setting – examples are annotated manually with labels (sense, text category, part of speech)
Settings in which labeled data can be obtained without manual annotation:
– Anaphora, target word selection: "The system displays the file on the monitor and prints it."
– Bootstrapping approaches (a self-training sketch follows below) – sometimes referred to as unsupervised learning, though it actually addresses a supervised task of identifying an externally imposed class ("unsupervised" training)
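
A sketch of a simple bootstrapping (self-training) loop, assuming a small labeled seed set, a pool of unlabeled examples, and externally supplied train_fn / predict_fn functions (e.g. the count-based sketch above, adapted to also return a confidence score); the threshold and number of rounds are arbitrary illustrative choices:

```python
def bootstrap(seed_labeled, unlabeled, train_fn, predict_fn, threshold=2.0, rounds=3):
    """Self-training: repeatedly label the unlabeled pool and keep confident predictions.

    train_fn(labeled)          -> model
    predict_fn(model, example) -> (label, confidence_score)
    """
    labeled, pool = list(seed_labeled), list(unlabeled)
    for _ in range(rounds):
        model = train_fn(labeled)
        confident, remaining = [], []
        for example in pool:
            label, score = predict_fn(model, example)
            if score >= threshold:
                confident.append((example, label))    # treat as newly labeled data
            else:
                remaining.append(example)
        if not confident:
            break                                     # nothing confident left to add
        labeled.extend(confident)
        pool = remaining
    return train_fn(labeled)
```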

15 Learning Approaches
Model-based: define entailment relations and their strengths by a training algorithm
– Statistical/probabilistic: the model is composed of probabilities (scores) computed from training statistics
– Iterative feedback/search (neural network): start from some model, classify training examples, and correct the model according to errors
Memory-based: no training algorithm and no model – classify by matching to the raw training data (compare with unsupervised tasks)
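
A minimal memory-based (nearest-neighbor) sketch: no training step beyond storing the labeled examples, and classification by matching a new example against them; feature overlap is used here as a deliberately crude similarity measure:

```python
def knn_classify(stored_examples, features, k=3):
    """Memory-based classification: majority vote among the k most similar stored examples."""
    def similarity(example_features):
        return len(example_features & features)       # number of shared features
    neighbors = sorted(stored_examples, key=lambda ex: similarity(ex[0]), reverse=True)[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

memory = [({"judge", "court"}, "legal"),
          ({"word", "paragraph"}, "linguistic"),
          ({"lawyer", "court"}, "legal")]
print(knn_classify(memory, {"court", "trial"}, k=1))  # legal
```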

16 Evaluation
Evaluation is mostly based on (subjective) human judgment of relevancy/correctness
– In some cases the task is objective (e.g. OCR), or mathematical criteria apply (likelihood)
Basic measure for classification – accuracy
In many tasks (extraction, multiple classes per instance, …) most instances are "negative"; therefore recall/precision measures are used, following the information retrieval (IR) tradition
Cross validation – different training/test splits

17 Evaluation: Recall/Precision
Recall: #correct extracted / total correct
Precision: #correct extracted / total extracted
Recall/precision curve – obtained by varying the number of extracted items, assuming the items are sorted by decreasing score
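
These definitions translate directly into a small computation; the extracted and gold-standard item lists below are made up for illustration:

```python
def precision_recall(extracted, correct):
    """Precision = #correct extracted / total extracted; Recall = #correct extracted / total correct."""
    hits = len(set(extracted) & set(correct))
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall

print(precision_recall(extracted=["a", "b", "c", "d"], correct=["a", "b", "e"]))
# (0.5, 0.666...)
```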

18 Micro/Macro Averaging
Often results are evaluated over multiple tasks:
– Many categories, many ambiguous words
Macro-averaging: compute results separately for each category and average them
Micro-averaging (common): treat all classification instances, from all categories, as one pool and compute results over it
– Gives more weight to common categories
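
A sketch of the difference, assuming per-category counts of correct extractions and total extractions; note how micro-averaging lets the large category dominate (the counts are illustrative only):

```python
def macro_micro_precision(per_category):
    """per_category: {category: (num_correct_extracted, num_extracted)}."""
    precisions = [correct / extracted for correct, extracted in per_category.values()]
    macro = sum(precisions) / len(precisions)                 # average of per-category results
    total_correct = sum(c for c, _ in per_category.values())
    total_extracted = sum(e for _, e in per_category.values())
    micro = total_correct / total_extracted                   # one pooled computation
    return macro, micro

counts = {"politics": (90, 100), "sports": (4, 10)}           # illustrative counts only
print(macro_micro_precision(counts))                          # macro = 0.65, micro ≈ 0.855
```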

19 Course Organization
Material is organized mostly by types of learning approaches, demonstrating applications as we go along
Emphasis on showing how computational linguistics tasks can be modeled (with simplifications) as statistical/learning problems
Some sections cover the lecturer's personal work perspective

20 Course Outline
Sequential modeling
– POS tagging
– Parsing
Supervised (instance-based) classification
– Simple statistical models
– Naïve Bayes classification
– Perceptron/Winnow (one-layer NN)
– Improving supervised classification
Unsupervised learning – clustering

21 Course Outline (1)
Supervised classification
Basic/earlier models: PP-attachment, decision list, target word selection
Confidence interval
Naive Bayes classification
Simple smoothing -- add-constant
Winnow
Boosting

22 Course Outline (2)
Part-of-speech tagging
Hidden Markov Models and the Viterbi algorithm
Smoothing -- Good-Turing, back-off
Unsupervised parameter estimation with the Expectation Maximization (EM) algorithm
Transformation-based learning
Shallow parsing
– Transformation based
– Memory based
Statistical parsing and PCFG (2 hours)
Full parsing – Probabilistic Context Free Grammar (PCFG)

23 Course Outline (3)
Reducing training data
– Selective sampling for training
– Bootstrapping
Unsupervised learning
– Word association
– Information theory measures
– Distributional word similarity, similarity-based smoothing
– Clustering

24 Misc.
Major literature sources:
– Foundations of Statistical Natural Language Processing, by Manning & Schütze, MIT Press
– Articles
Additional slide credits:
– Prof. Shlomo Argamon, Chicago
– Some slides from the book web site