Decision List LING 572 Fei Xia 1/12/06

Outline Basic concepts and properties Case study

Definitions A decision list (DL) is an ordered list of conjunctive rules. –Rules can overlap, so the order is important. A k-DL: a DL in which every rule has length at most k. A decision list determines an example's class by using the first matching rule.

An example A simple DL: 1. If X1=v11 && X2=v21 then c1 2. If X2=v21 && X3=v34 then c2 This is a 2-DL. Classifying the example x=(v11, v21, v34): the first rule matches, so the class is c1.
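A minimal sketch of how such a DL classifies an example; the dictionary representation of rules is my own, but the rules and values mirror the toy 2-DL above.

```python
# A rule is a (condition, class) pair; the condition maps attributes to required values.

def classify(decision_list, example, default=None):
    """Return the class of the first rule whose condition matches the example."""
    for condition, label in decision_list:
        if all(example.get(attr) == value for attr, value in condition.items()):
            return label
    return default

toy_dl = [
    ({"X1": "v11", "X2": "v21"}, "c1"),
    ({"X2": "v21", "X3": "v34"}, "c2"),
]

example = {"X1": "v11", "X2": "v21", "X3": "v34"}
print(classify(toy_dl, example))  # -> c1: the first matching rule wins
```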

Rivest’s paper It assumes that all attributes (including the goal attribute) are binary. It shows that DLs are easily learnable from examples.

Assignment and formula Input attributes: x1, …, xn. An assignment gives each input attribute a value (1 or 0), e.g., x = (1, 0, …, 1). A boolean formula (function) maps each assignment to a value (1 or 0): f: {0,1}^n → {0,1}.

Two formulae are equivalent if they give the same value for every assignment. Total number of different formulae: 2^(2^n). → Classification problem: learn a formula given a partial table (a sample of assignments with their values).

CNF and DNF Literal: a variable or its negation. Term: conjunction (“and”) of literals. Clause: disjunction (“or”) of literals. CNF (conjunctive normal form): a conjunction of clauses. DNF (disjunctive normal form): a disjunction of terms. k-CNF and k-DNF: every clause (respectively, term) contains at most k literals.

A slightly different definition of DT A decision tree (DT) is a binary tree where each internal node is labeled with a variable, and each leaf is labeled with 0 or 1. k-DT: a DT whose depth is at most k. A DT defines a boolean formula: take the disjunction of the paths whose leaf node is labeled 1.

Decision list A decision list is a list of pairs (f1, v1), …, (fr, vr), where the fi are terms and fr = true. A decision list defines a boolean function: given an assignment x, DL(x) = vj, where j is the least index s.t. fj(x) = 1.

Relations among different representations CNF, DNF, DT, DL; k-CNF, k-DNF, k-DT, k-DL –For any k < n, k-DL is a proper superset of the other three. –Compared to a DT, a DL has a simpler structure, but the decisions allowed at each node are more complex.

k-CNF and k-DNF are proper subsets of k-DL k-DNF is a subset of k-DL: –Each term t of a DNF is converted into a decision rule (t, 1). k-CNF is a subset of k-DL: –Every k-CNF is the complement of a k-DNF: k-CNF and k-DNF are duals of each other. –The complement of a k-DL is also a k-DL. Neither k-CNF nor k-DNF is a subset of the other. –Ex: the 1-DNF x1 ∨ x2 ∨ … ∨ xn is not a 1-CNF (it is a single clause of length n).
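As a concrete illustration of the term-to-rule conversion (my own example, not taken from Rivest's paper):

```latex
% Illustrative example: converting a 2-DNF into an equivalent 2-DL (requires amsmath).
\begin{align*}
  f &= (x_1 \wedge x_2) \vee (\neg x_1 \wedge x_3) && \text{a 2-DNF}\\
  \mathrm{DL} &= \big[(x_1 \wedge x_2,\ 1),\ (\neg x_1 \wedge x_3,\ 1),\ (\text{true},\ 0)\big]
      && \text{each term becomes a rule } (t, 1)\text{; the default rule outputs } 0
\end{align*}
```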

k-DT is a proper subset of k-DL k-DT is a subset of k-DNF –Each leaf labeled with “1” maps to a term in the k-DNF. k-DT is a subset of k-CNF –Each leaf labeled with “0” maps to a clause in the k-CNF. → k-DT is a subset of k-CNF ∩ k-DNF.

k-DT, k-CNF, k-DNF and k-DL [Venn diagram: k-DT lies inside both k-CNF and k-DNF, and all three are contained in k-DL]

Learnability Positive examples vs. negative examples of the concept being learned. –In some domains, positive examples are easier to collect. A sample is a set of examples. A boolean function is consistent with a sample if it does not contradict any example in the sample.

Two properties of a learning algorithm A learning algorithm is economical if it requires few examples to identify the correct concept. A learning algorithm is efficient if it requires little computational effort to identify the correct concept.  We prefer algorithms that are both economical and efficient.

Hypothesis space Hypothesis space F: a set of concepts that are being considered. Hopefully, the concept being learned should be in the hypothesis space of a learning algorithm. The goal of a learning algorithm is to select the right concept from F given the training data.

Discrepancy between two functions f and g: error(f, g) = Pn({x : f(x) ≠ g(x)}), the probability under the example distribution Pn that f and g disagree. Ideally, we want error(f, g) to be as small as possible: error(f, g) ≤ ε for an accuracy parameter ε. To deal with ‘bad luck’ in drawing examples according to Pn, we also define a confidence parameter δ: the learner may fail with probability at most δ.

“Polynomially learnable” A set F of Boolean functions is polynomially learnable if there exists an algorithm A and a polynomial function m(n, 1/ε, 1/δ) such that, when given a sample of f ∈ F of size m(n, 1/ε, 1/δ) drawn according to Pn, A will with probability at least 1 − δ output a g ∈ F s.t. error(f, g) ≤ ε. Furthermore, A’s running time is polynomially bounded in n and m. k-DL is polynomially learnable.

How to build a decision list Decision tree → decision list conversion. A greedy, iterative algorithm that builds DLs directly.

The algorithm in (Rivest, 1987) 1. If the example set S is empty, halt. 2. Examine each term of length at most k until a term t is found s.t. all examples in S which make t true are of the same type v. 3. Add (t, v) to the decision list and remove those examples from S. 4. Repeat steps 1–3.
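A rough Python rendering of this procedure for binary attributes (my own sketch; the representation of terms as lists of (attribute, value) literals is an assumption, not Rivest's notation):

```python
from itertools import combinations, product

def terms_up_to_k(n_attrs, k):
    """Yield every term of length at most k as a list of (attribute, value) literals."""
    for length in range(k + 1):
        for attrs in combinations(range(n_attrs), length):
            for values in product((0, 1), repeat=length):
                yield list(zip(attrs, values))

def rivest_k_dl(examples, n_attrs, k):
    """examples: list of (x, y), where x is a tuple of 0/1 values and y is 0 or 1."""
    S = list(examples)
    dl = []
    while S:                                      # step 1: halt when S is empty
        for term in terms_up_to_k(n_attrs, k):
            covered = [(x, y) for x, y in S if all(x[i] == v for i, v in term)]
            labels = {y for _, y in covered}
            if covered and len(labels) == 1:      # step 2: all covered examples share a type v
                dl.append((term, labels.pop()))   # step 3: add (t, v) to the list
                S = [e for e in S if e not in covered]
                break
        else:
            return None                           # no term works: no consistent k-DL for this sample
    return dl
```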

The general greedy algorithm RuleList=[], E=training_data Repeat until E is empty or gain is small –f = Find_best_feature(E) –Let E’ be the examples covered by f –Let c be the most common class in E’ –Add (f, c) to RuleList –E=E – E’
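A sketch of this greedy loop in Python (illustrative only; it uses class entropy over the covered examples as the quality measure, one of the options mentioned later, and assumes a feature is an identifier that is either present or absent in an example):

```python
import math
from collections import Counter

def class_entropy(labels):
    """Entropy of the class distribution of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def greedy_dl(examples, candidate_features):
    """examples: list of (features, label), where features is a set of feature ids."""
    rule_list = []
    E = list(examples)
    while E:
        # Find_best_feature: the feature whose covered examples have minimum class entropy.
        # (The "gain is small" stopping test from the slide is omitted for brevity.)
        scored = [(class_entropy([y for x, y in E if f in x]), f)
                  for f in candidate_features if any(f in x for x, _ in E)]
        if not scored:
            break
        best_f = min(scored, key=lambda s: s[0])[1]
        covered = [(x, y) for x, y in E if best_f in x]            # E'
        c = Counter(y for _, y in covered).most_common(1)[0][0]    # most common class in E'
        rule_list.append((best_f, c))                              # add (f, c) to RuleList
        E = [e for e in E if best_f not in e[0]]                   # E = E - E'
    return rule_list
```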

Problems with the greedy algorithm The interpretation of a rule depends on the preceding rules. Each iteration reduces the number of training examples. Poor rule choices at the beginning of the list can significantly reduce the accuracy of the learned DL. → Several papers propose alternative algorithms.

Summary of (Rivest, 1987) Gives a formal definition of DL. Shows the relation between k-DL, k-CNF, k-DNF and k-DT. Proves that k-DL is polynomially learnable. Gives a simple greedy algorithm to build a k-DL.

Outline Basic concepts and properties Case study

In practice Input attributes and the goal are not necessarily binary. –Ex: the previous word A term → a feature (not necessarily a conjunction of literals) –Ex: the word appears in a k-word window Only some feature types are considered, instead of all possible features: –Ex: previous word and next word Greedy algorithm: needs a quality measure –Ex: choose the feature with minimum entropy

Case study: accent restoration Task: restore accents in Spanish and French → a special case of WSD Ex: ambiguous de-accented forms: –cesse → cesse, cessé –cote → côté, côte, cote, coté Algorithm: build a DL for each ambiguous de-accented form: e.g., one for cesse, another one for cote Attributes: words within a window

The algorithm Training: –Find the list of de-accented forms that are ambiguous. –For each ambiguous form, build a decision list. Testing: check each word in a sentence –If it is ambiguous, restore the accented form according to its DL.

Step 1: Identify forms that are ambiguous

Step 2: Collecting training contexts Context: the previous three and the next three words. Strip the accents from the training data. Why? Because the test input is unaccented, so the training contexts should look like the test contexts.

Step 3: Measure collocational distributions Feature types are pre-defined.

Collocations

Step 4: Rank decision rules by log-likelihood: abs(log(P(class1 | collocation) / P(class2 | collocation))). There are many alternative ranking measures.
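A sketch of one such score (my own simplification, with add-alpha smoothing to avoid zero counts; the paper's exact smoothing differs):

```python
import math

def log_likelihood_score(count_class1, count_class2, alpha=0.1):
    """abs(log(P(class1 | collocation) / P(class2 | collocation))),
    estimated from counts with add-alpha smoothing.
    Rules with higher scores are placed earlier in the decision list."""
    total = count_class1 + count_class2 + 2 * alpha
    p1 = (count_class1 + alpha) / total
    p2 = (count_class2 + alpha) / total
    return abs(math.log(p1 / p2))

# Example: a collocation seen 48 times with class 1 and twice with class 2.
print(round(log_likelihood_score(48, 2), 2))  # a strong, highly-ranked rule
```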

Step 5: Pruning DLs Pruning: –Cross-validation –Remove redundant rules: e.g., if the more general “WEEKDAY” rule precedes the “domingo” rule, the latter never fires and can be dropped.

Building a DL For a de-accented form w, find all possible accented forms. Collect training contexts: –collect k words on each side of w –strip the accents from the data Measure collocational distributions: –use pre-defined attribute combinations –Ex: “-1w”, “+1w, +2w” Rank decision rules by log-likelihood. Optional pruning and interpolation.
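A sketch of the context-collection and feature-extraction step (the feature-type names follow the slide; the tokenization, padding token, and window size are my own assumptions):

```python
def extract_features(tokens, i, window=3):
    """Collocational features for the ambiguous token at position i."""
    def word(offset):
        j = i + offset
        return tokens[j] if 0 <= j < len(tokens) else "<PAD>"

    features = {
        "-1w=" + word(-1),                       # previous word
        "+1w=" + word(+1),                       # next word
        "+1w,+2w=" + word(+1) + "_" + word(+2),  # pair of following words
    }
    # words appearing anywhere within +/- window words
    for w in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
        features.add("win=" + w)
    return features

tokens = "il marche a cote de la maison".split()
print(sorted(extract_features(tokens, 3)))  # features around the ambiguous form "cote"
```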

Experiments Prior (baseline): choose the most common form.

Global probabilities vs. residual probabilities Two ways to calculate the log-likelihood: –Global probabilities: using the full data set. –Residual probabilities: using the residual training data (the examples not covered by earlier rules); more relevant, but there is less data and it is more expensive to compute. Interpolation: use both. In practice, global probabilities work better.

Combining vs. not combining evidence Each decision is based on a single piece of evidence. –Run-time efficiency and easy modeling. –It works well, at least for this task, but why? Combining all available evidence rarely produces a different result. “The gross exaggeration of probability from combining all of these non-independent log-likelihoods is avoided.”

Summary of case study It allows a wider context (compared to n-gram methods). It allows the use of multiple, highly non-independent evidence types (compared to Bayesian methods). → a “kitchen-sink” approach of the best kind

Advanced topics

Probabilistic DL DL: a rule is (f, v) Probabilistic DL: a rule is (f, v1/p1, v2/p2, …, vn/pn)
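A minimal sketch of how a probabilistic rule could be represented and applied (my own representation; the first rule encodes the "if A & B then (c1, 0.8) (c2, 0.2)" example used later in the summary):

```python
# Each rule pairs a condition (a set of required features) with a class distribution.
prob_dl = [
    ({"A", "B"}, {"c1": 0.8, "c2": 0.2}),  # if A & B then (c1, 0.8) (c2, 0.2)
    (set(),      {"c1": 0.5, "c2": 0.5}),  # default rule: empty condition always fires
]

def classify_prob(decision_list, features):
    """Return the class distribution of the first rule whose condition is satisfied."""
    for condition, distribution in decision_list:
        if condition <= features:          # all required features are present
            return distribution
    return None

print(classify_prob(prob_dl, {"A", "B", "C"}))  # -> {'c1': 0.8, 'c2': 0.2}
```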

Entropy of a feature q [table: the examples on which q fired, divided by class into S1, …, Sn; the entropy of q is the entropy of this class distribution]

Algorithms for building DL AQ algorithm (Michalski, 1969) CN2 algorithm (Clark and Niblett, 1989) Segal and Etzioni (1994) Goodman (2002) …

Summary of decision lists Rules are easily understood by humans (but remember that the order matters). DLs tend to be relatively small, and fast and easy to apply in practice. DL is related to DT, CNF, DNF, and TBL. Learning: greedy algorithm and other improved algorithms. Extension: probabilistic DL –Ex: if A & B then (c1, 0.8) (c2, 0.2)