Introduction LING 572 Fei Xia Week 1: 1/3/06

Outline
Course overview
Problems and methods
Mathematical foundation
–Probability theory
–Information theory

Course overview

Course objective
Focus on statistical methods that produce state-of-the-art results.
Questions to ask for each algorithm:
–How does the algorithm work: input, output, steps?
–What kinds of tasks can the algorithm be applied to?
–How much data is needed (labeled data, unlabeled data)?

General info
Course website:
–Syllabus (incl. slides and papers): updated every week
–Message board
–ESubmit
Office hour: W 3-5pm
Prerequisites:
–Ling570 and Ling571
–Programming: C, C++, or Java; Perl is a plus
–Introduction to probability and statistics

Expectations
Reading:
–Papers are online: who doesn't have access to a printer?
–Reference book: Manning & Schutze (MS)
–Finish the reading before class and bring your questions to class.
Grade:
–Homework (3): 30%
–Project (6 parts): 60%
–Class participation: 10%
–No quizzes or exams

Assignments
Hw1: FSA and HMM
Hw2: DT, DL, and TBL
Hw3: Boosting
No coding. Bring the finished assignments to class.

Project
P1: Method 1 (baseline): trigram
P2: Method 2: TBL
P3: Method 3: MaxEnt
P4: Method 4: choose one of four tasks
P5: Presentation
P6: Final report
Methods 1-3 are supervised methods. Method 4 is one of bagging, boosting, semi-supervised learning, or system combination.
P1 is an individual task; P2-P6 are group tasks. A group should have no more than three people.
Use ESubmit.
You will need to use others' code and write your own code.

Summary of Ling570
Overview: corpora, evaluation
Tokenization
Morphological analysis
POS tagging
Shallow parsing
N-grams and smoothing
WSD
NE tagging
HMM

Summary of Ling571
Parsing
Semantics
Discourse
Dialogue
Natural language generation (NLG)
Machine translation (MT)

570/571 vs. 572
572 focuses more on statistical approaches.
570/571 are organized by tasks; 572 is organized by learning methods.
I assume that you know:
–The basics of each task: POS tagging, parsing, …
–The basic concepts: PCFG, entropy, …
–Some learning methods: HMM, FSA, …

An example
570/571:
–POS tagging: HMM
–Parsing: PCFG
–MT: Model 1-4 training
572:
–HMM: forward-backward algorithm
–PCFG: inside-outside algorithm
–MT: EM algorithm
All are special cases of the EM algorithm, a method of unsupervised learning.

Course layout
Supervised methods:
–Decision tree
–Decision list
–Transformation-based learning (TBL)
–Bagging
–Boosting
–Maximum entropy (MaxEnt)

Course layout (cont.)
Semi-supervised methods:
–Self-training
–Co-training
Unsupervised methods:
–EM algorithm: forward-backward algorithm, inside-outside algorithm, EM for PM models

Outline
Course overview
Problems and methods
Mathematical foundation
–Probability theory
–Information theory

Problems and methods

Types of ML problems
Classification problem
Estimation problem
Clustering
Discovery
…
A learning method can be applied to one or more types of ML problems.
We will focus on the classification problem.

Classification problem
Given a set of classes and data x, decide which class x belongs to.
Labeled data:
–{(x_i, y_i)} is a set of labeled data.
–x_i is a list of attribute values.
–y_i is a member of a pre-defined set of classes.

Examples of classification problems
Disambiguation:
–Document classification
–POS tagging
–WSD
–PP attachment given a set of other phrases
Segmentation:
–Tokenization / word segmentation
–NP chunking

Learning methods
Modeling: represent the problem as a formula and decompose the formula into a function of parameters.
Training stage: estimate the parameters.
Test (decoding) stage: find the answer given the parameters.

Modeling
Joint vs. conditional models:
–P(data, model)
–P(model | data)
–P(data | model)
Decomposition:
–Which variable conditions on which variable?
–What independence assumptions are made?

An example of different modeling
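For instance, POS tagging can be modeled jointly or conditionally. A sketch of the contrast, assuming a bigram HMM as the joint model and a history-based model as the conditional one:

P(w_1, \ldots, w_n, t_1, \ldots, t_n) \approx \prod_i P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)   (joint, HMM-style)

P(t_1, \ldots, t_n \mid w_1, \ldots, w_n) \approx \prod_i P(t_i \mid t_{i-1}, w_i)   (conditional)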

Training
Objective functions:
–Maximize likelihood
–Minimize error rate
–Maximum entropy
–…
Supervised, semi-supervised, unsupervised:
–Ex: maximize likelihood. Supervised: simple counting. Unsupervised: EM.
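In symbols, maximizing likelihood means picking the parameters that make the training data most probable; in the fully supervised case this reduces to simple relative-frequency counts. A sketch, using a hypothetical bigram tag model as the example:

\theta^{*} = \arg\max_{\theta} \prod_{i} P(x_i ; \theta)

P(t_i \mid t_{i-1}) = \frac{count(t_{i-1}, t_i)}{count(t_{i-1})}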

Decoding
DP algorithms:
–CYK for PCFG
–Viterbi for HMM
–…
Pruning (sketched below):
–TopN: keep the topN hyps at each node
–Beam: keep hyps whose weights >= beam * max_weight
–Threshold: keep hyps whose weights >= threshold
–…
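A minimal Python sketch of the three pruning strategies, assuming a hypothetical list of (hypothesis, weight) pairs at a node; the function and variable names are illustrative only:

def prune(hyps, topN=None, beam=None, threshold=None):
    """Return the hypotheses that survive TopN, beam, and threshold pruning."""
    if not hyps:
        return hyps
    hyps = sorted(hyps, key=lambda h: h[1], reverse=True)   # best weight first
    if topN is not None:
        hyps = hyps[:topN]                                  # keep the topN hyps
    if beam is not None and hyps:
        max_weight = hyps[0][1]
        hyps = [h for h in hyps if h[1] >= beam * max_weight]   # weights >= beam * max_weight
    if threshold is not None:
        hyps = [h for h in hyps if h[1] >= threshold]           # weights >= threshold
    return hyps

# Example: keep at most two hypotheses whose weights are within half of the best weight.
print(prune([("NP", 0.9), ("VP", 0.5), ("PP", 0.1)], topN=2, beam=0.5))

With the inputs above it prints [('NP', 0.9), ('VP', 0.5)].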

Outline
Course overview
Problems and methods
Mathematical foundation
–Probability theory
–Information theory

Probability Theory

Probability theory
Sample space, event, event space
Random variable and random vector
Conditional probability, joint probability, marginal probability (prior)

Sample space, event, event space
Sample space (Ω): a collection of basic outcomes.
–Ex: toss a coin twice: {HH, HT, TH, TT}
Event: an event is a subset of Ω.
–Ex: {HT, TH}
Event space (2^Ω): the set of all possible events.

Random variable
The outcome of an experiment need not be a number, but we often want to represent outcomes as numbers.
A random variable is a function that associates a unique numerical value with every outcome of an experiment; that is, a random variable is a function X: Ω → R.
Ex: toss a coin once: X(H) = 1, X(T) = 0

Two types of random variables
Discrete random variable: X takes on only a countable number of distinct values.
–Ex: toss a coin 10 times; X is the number of tails that are noted.
Continuous random variable: X takes on an uncountable number of possible values.
–Ex: X is the lifetime (in hours) of a light bulb.

Probability function
The probability function of a discrete random variable X is a function that gives the probability p(x_i) that the random variable equals x_i; that is, p(x_i) = P(X = x_i).

Random vector
A random vector is a finite-dimensional vector of random variables: X = [X_1, …, X_k].
P(x) = P(x_1, x_2, …, x_n) = P(X_1 = x_1, …, X_n = x_n)
Ex: P(w_1, …, w_n, t_1, …, t_n)

Three types of probability
Joint prob: P(x,y) = prob of x and y happening together
Conditional prob: P(x|y) = prob of x given a specific value of y
Marginal prob: P(x) = prob of x, summed over all possible values of y

Common equations
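In LaTeX form, the identities usually listed here are the product rule, marginalization, and Bayes' rule (a sketch of the standard set):

P(x, y) = P(y) \, P(x \mid y) = P(x) \, P(y \mid x)

P(x) = \sum_{y} P(x, y)

P(y \mid x) = \frac{P(x \mid y) \, P(y)}{P(x)}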

More general cases
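For more than two variables, the same identities generalize (again a sketch): the chain rule and marginalization over the remaining variables:

P(x_1, \ldots, x_n) = P(x_1) \, P(x_2 \mid x_1) \cdots P(x_n \mid x_1, \ldots, x_{n-1})

P(x_1) = \sum_{x_2} \cdots \sum_{x_n} P(x_1, x_2, \ldots, x_n)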

Information Theory

Information theory
Information theory is the use of probability theory to quantify and measure “information”.
Basic concepts:
–Entropy
–Joint entropy and conditional entropy
–Cross entropy and relative entropy
–Mutual information and perplexity

Entropy
Entropy is a measure of the uncertainty associated with a distribution.
It is the lower bound on the average number of bits it takes to transmit messages.
An example:
–Display the results of horse races.
–Goal: minimize the number of bits to encode the results.
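The definition, for a discrete random variable X with distribution p:

H(X) = - \sum_{x} p(x) \log_2 p(x)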

An example
Uniform distribution: p_i = 1/8 for each of the 8 horses.
Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
An optimal code: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)
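Working out the entropy in each case: 3 bits for the uniform distribution and 2 bits for the non-uniform one, which also equals the expected length of the code above:

H_{uniform} = - \sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = 3 \text{ bits}

H_{non\text{-}uniform} = \frac{1}{2}(1) + \frac{1}{4}(2) + \frac{1}{8}(3) + \frac{1}{16}(4) + 4 \cdot \frac{1}{64}(6) = 2 \text{ bits}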

Entropy of a language
The entropy of a language L:
If we make certain assumptions that the language is “nice”, then the entropy can be calculated as:
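A sketch of the usual formulas, following M&S Ch 2, where x_{1n} denotes a sequence x_1 … x_n:

H(L) = - \lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 p(x_{1n})

and, under the “nice” (stationary, ergodic) assumptions,

H(L) = - \lim_{n \to \infty} \frac{1}{n} \log_2 p(x_{1n})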

Joint and conditional entropy Joint entropy: Conditional entropy:
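In standard form, the two quantities referred to above are:

H(X, Y) = - \sum_{x} \sum_{y} p(x, y) \log_2 p(x, y)

H(Y \mid X) = - \sum_{x} \sum_{y} p(x, y) \log_2 p(y \mid x) = H(X, Y) - H(X)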

Cross entropy
Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).
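Side by side, entropy and cross entropy are:

H(p) = - \sum_{x} p(x) \log_2 p(x)

H(p, q) = - \sum_{x} p(x) \log_2 q(x)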

Cross entropy of a language
The cross entropy of a language L:
If we make certain assumptions that the language is “nice”, then the cross entropy can be calculated as:
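Again a sketch of the usual formulas, with q the model being evaluated:

H(L, q) = - \lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 q(x_{1n})

and, under the “nice” assumptions,

H(L, q) = - \lim_{n \to \infty} \frac{1}{n} \log_2 q(x_{1n})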

Relative entropy
Also called Kullback-Leibler (KL) distance: another distance measure between prob functions p and q.
KL distance is asymmetric (not a true distance):
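The definition, its asymmetry, and its link to cross entropy:

D(p \| q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}

D(p \| q) \neq D(q \| p) \quad \text{in general}

H(p, q) = H(p) + D(p \| q)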

Relative entropy is non-negative
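One standard argument uses Jensen's inequality applied to the concave log function (a sketch):

- D(p \| q) = \sum_{x} p(x) \log_2 \frac{q(x)}{p(x)}
            \le \log_2 \sum_{x} p(x) \, \frac{q(x)}{p(x)}
            = \log_2 \sum_{x} q(x) \le \log_2 1 = 0

Hence D(p \| q) \ge 0, with equality iff p = q.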

Mutual information
It measures how much information X and Y have in common: I(X;Y) = KL(p(x,y) || p(x)p(y))
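Expanding the definition and relating it to entropy:

I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x) \, p(y)} = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)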

Perplexity
Perplexity is 2^H.
Perplexity is the weighted average number of choices a random variable has to make.
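For the horse-race distributions above:

PP_{uniform} = 2^{3} = 8 \qquad PP_{non\text{-}uniform} = 2^{2} = 4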

Summary
Course overview
Problems and methods
Mathematical foundation
–Probability theory
–Information theory
M&S Ch 2

Next time
FSA
HMM: M&S Ch 9.1 and 9.2