1
Introduction to Machine Learning Algorithms in Bioinformatics: Part II
Byoung-Tak Zhang
Center for Bioinformation Technology (CBIT) & Biointelligence Laboratory
School of Computer Science and Engineering, Seoul National University
2
Outline
Part I
  Concept of Machine Learning (ML)
  Machine Learning Algorithms and Applications
  Applications in Bioinformatics
Part II
  Version Space Learning
  Decision Tree Learning
3
Exercise 1: EnjoySport Problem
4
Training Example for EnjoySport
Example  Sky    Temp  Humid   Wind    Water  Forecast  EnjoySport
1        Sunny  Warm  Normal  Strong  Warm   Same      Yes
2        Sunny  Warm  High    Strong  Warm   Same      Yes
3        Rainy  Cold  High    Strong  Warm   Change    No
4        Sunny  Warm  High    Strong  Cool   Change    Yes

What is the general concept?
5
Representing Hypotheses
Many possible representations. Here, h is a conjunction of constraints on the attributes. Each constraint can be:
  A specific value (e.g., Water = Warm)
  Don't care (e.g., Water = ?)
  No value allowed (e.g., Water = ∅)
For example (Sky, Temp, Humid, Wind, Water, Forecast):
  ⟨Sunny, ?, ?, Strong, ?, Same⟩
6
Prototypical Concept Learning Task
Given:
  Instances X: possible days, each described by the attributes Sky, Temp, Humidity, Wind, Water, Forecast
  Target function c: EnjoySport : X → {0, 1}
  Hypotheses H: conjunctions of literals, e.g., ⟨?, Cold, High, ?, ?, ?⟩
  Training examples D: positive and negative examples of the target function, ⟨x1, c(x1)⟩, ..., ⟨xm, c(xm)⟩
Determine:
  A hypothesis h in H such that h(x) = c(x) for all x in D.
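To make the representation concrete, here is a minimal Python sketch (added here; the function name and tuple layout are illustrative, not from the slides) of checking whether a conjunctive hypothesis covers a training instance:

```python
# A hypothesis is a tuple of constraints over
# (Sky, Temp, Humid, Wind, Water, Forecast):
#   a specific value, "?" (don't care), or None (no value allowed).

def matches(hypothesis, instance):
    """Return True if the conjunctive hypothesis covers the instance."""
    for constraint, value in zip(hypothesis, instance):
        if constraint is None:                      # no value allowed -> never matches
            return False
        if constraint != "?" and constraint != value:
            return False
    return True

h = ("Sunny", "?", "?", "Strong", "?", "Same")
x = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
print(matches(h, x))   # True: h(x) = 1 for this day
```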
7
Example Concept Space

S: {⟨Sunny, Warm, ?, Strong, ?, ?⟩}

Between S and G:
  ⟨Sunny, ?, ?, Strong, ?, ?⟩   ⟨Sunny, Warm, ?, ?, ?, ?⟩   ⟨?, Warm, ?, Strong, ?, ?⟩

G: {⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩}
8
Exercise 2: Janitor Robot Problem
10
Attributes, Features, and Dimensions
An attribute is a variable used to describe the objects in X. An attribute together with all its possible values is called a dimension of X. An attribute together with one of its values is called a feature. In the robot example, there are four dimensions.
11
Examples

Janitorial robot problem. Examples of concepts:
  faculty ∧ cs ∧ fifth : conjunction of positive literals
  large ∨ student : purely disjunctive concept
  student ∨ (faculty ∧ cs) : disjunctive normal form

Attribute                        Possible values
Status of office occupants       faculty, staff, or student
Location of the office (floor)   three, four, or five
Department of the occupants      EE or CS
Size of the office               large, medium, or small
12
Concept Space for the Robot Problem
13
Version Space Learning
A concept C1 is a specialization of a concept C2 if C1 ⊆ C2. If C1 is a specialization of C2, then C2 is a generalization of C1.
Example: "an office on the fifth floor" is more general than the concept "an office on the fifth floor belonging to an assistant professor."
C1 is an immediate specialization of C2 if there is no other concept that is both a specialization of C2 and a generalization of C1.
14
Version Spaces
A version space defines a graph whose nodes are concepts and whose arcs specify that one concept is an immediate specialization of another.
The version space learning method maintains general and specific bounds on the set of consistent hypotheses, using the ordering induced by the specialization relationship.
The version space method is optimal for conjunctions of positive literals, in the sense that no method does less work in implementing the bias that restricts attention to conjunctions of positive literals.
15
Simple Version Space for the Robot Problem
The most general concept: ( ), the empty conjunction, which covers every office.
Immediate specializations of ( ): (cs), (ee), (faculty), (staff), (four), (five)
Immediate specializations at the next level: (cs ∧ faculty), (cs ∧ staff), (cs ∧ four), (cs ∧ five), (staff ∧ four), (staff ∧ five), ...
And their specializations, the most specific concepts: (cs ∧ faculty ∧ four), (cs ∧ faculty ∧ five), (cs ∧ staff ∧ four), (cs ∧ staff ∧ five), ...
16
Version Space Learning Algorithm
Procedure VersionSpaceLearning()
1. Initialize the general and specific boundaries:
   GSET := {( )}   (the most general concept, the empty conjunction)
   SSET := {all conjunctions of three literals}
2. Modify the boundaries after each training instance I is presented:
   2.1 If the example is positive:
       (1) Eliminate all concepts in GSET that are not consistent with I.
       (2) Generalize each concept in SSET until it is consistent with I, using the generalization operators.
   2.2 If the example is negative:
       (1) Eliminate all concepts in SSET that are consistent with I.
       (2) Specialize each concept in GSET until it is not consistent with I, using the specialization operators.
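A compact Python sketch of these boundary updates for the simplified robot domain (added here; the function names and the seeding of SSET from the first positive example are illustrative choices, not from the slides). Concepts are represented as sets of literals, so generalization drops literals and specialization adds them:

```python
def covers(concept, instance):
    """A conjunction of positive literals covers an instance iff
    every literal of the concept appears in the instance."""
    return concept <= instance

def update(gset, sset, instance, positive, all_literals):
    """Revise the general (gset) and specific (sset) boundaries
    after one training instance (a frozenset of literals)."""
    if positive:
        # Drop general concepts that miss the positive example;
        # minimally generalize specific concepts by dropping conflicting literals.
        gset = {g for g in gset if covers(g, instance)}
        sset = {frozenset(s & instance) for s in sset}
    else:
        # Drop specific concepts that cover the negative example;
        # minimally specialize general concepts by adding one literal the
        # negative example lacks, keeping only specializations that are
        # still more general than some member of sset.
        sset = {s for s in sset if not covers(s, instance)}
        new_gset = set()
        for g in gset:
            if not covers(g, instance):
                new_gset.add(g)
            else:
                for lit in all_literals - instance:
                    new_gset.add(frozenset(g | {lit}))
        gset = {g for g in new_gset if any(covers(g, s) for s in sset)}
    return gset, sset

LITERALS = frozenset({"cs", "ee", "faculty", "staff", "four", "five"})
G = {frozenset()}                               # ( ): the most general concept
S = {frozenset({"cs", "faculty", "four"})}      # seeded from the first positive example

G, S = update(G, S, frozenset({"cs", "staff", "five"}), False, LITERALS)
G, S = update(G, S, frozenset({"cs", "faculty", "five"}), True, LITERALS)
G, S = update(G, S, frozenset({"ee", "faculty", "four"}), False, LITERALS)
print(G, S)   # both boundaries converge to {frozenset({'cs', 'faculty'})}
```

Run on the four training instances traced on the later "Convergence of Boundaries" slide, this sketch reproduces the same GSET and SSET at each step.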
17
Generalization Operators
Dropping conjuncts: (cs ∧ faculty) → (faculty)
Adding disjunctions: (cs ∧ faculty) → (cs ∧ faculty) ∨ (cs ∧ staff)
Replacing constants with variables: (cs ∧ faculty) → (cs ∧ ?x)
18
Specialization Operators
Adding conjuncts: (faculty) → (cs ∧ faculty)
Dropping disjunctions: (cs ∧ faculty) ∨ (cs ∧ staff) → (cs ∧ faculty)
Replacing variables with constants: (cs ∧ ?x) → (cs ∧ faculty)
19
Convergence of Boundaries in a Version Space (1/3)
20
Convergence of Boundaries in a Version Space (2/3)
21
Convergence of Boundaries in a Version Space (3/3)
1. After presenting a "+" example (cs ∧ faculty ∧ four):
   GSET = {( )}, SSET = {(cs ∧ faculty ∧ four)}
2. After presenting a "−" example (cs ∧ staff ∧ five):
   GSET = {(faculty), (four)}, SSET = {(cs ∧ faculty ∧ four)}
3. After presenting a "+" example (cs ∧ faculty ∧ five):
   GSET = {(faculty)}, SSET = {(cs ∧ faculty)}
4. After presenting a "−" example (ee ∧ faculty ∧ four):
   GSET = {(cs ∧ faculty)}, SSET = {(cs ∧ faculty)}: the two boundaries have converged.
22
Decision Tree Learning
23
An Illustrative Example: PlayTennis
Day   Outlook   Temp  Humidity  Wind    PlayTennis
D1    Sunny     Hot   High      Weak    No
D2    Sunny     Hot   High      Strong  No
D3    Overcast  Hot   High      Weak    Yes
D4    Rain      Mild  High      Weak    Yes
D5    Rain      Cool  Normal    Weak    Yes
D6    Rain      Cool  Normal    Strong  No
D7    Overcast  Cool  Normal    Strong  Yes
D8    Sunny     Mild  High      Weak    No
D9    Sunny     Cool  Normal    Weak    Yes
D10   Rain      Mild  Normal    Weak    Yes
D11   Sunny     Mild  Normal    Strong  Yes
D12   Overcast  Mild  High      Strong  Yes
D13   Overcast  Hot   Normal    Weak    Yes
D14   Rain      Mild  High      Strong  No
24
Decision Trees: Example
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
25
Decision Trees
Decision trees classify instances by sorting them down the tree from the root to some leaf node. An instance is classified by starting at the root, testing the attribute specified by this node, and then moving down the branch corresponding to the value of that attribute in the given example.
Node: a test of some attribute of the instance
Leaf node: the classification of the instance
Branch: one of the possible values of the attribute
26
Decision Tree Learning
Decision tree (DT) learning is a method for approximating discrete-valued target functions that is robust to noisy data and capable of learning disjunctive expressions.
Learned trees can be re-represented as sets of if-then rules to improve human readability.
Decision tree learning methods search a completely expressive hypothesis space. The inductive bias is a preference for smaller trees.
27
Decision Tree Representation
Decision tree represents disjunctions of conjunctions of constraints on the attribute values of instances. Each path from the root to a leaf corresponds to a conjunction of attribute tests. The tree itself represents a disjunction of these conjunctions.
28
Appropriate Problems for DT Learning
Instances are represented by attribute-value pairs. The target function has discrete output values. Disjunctive descriptions may be required. The training data may contain errors. The training data may contain missing attribute values.
29
ID3: A Decision Tree Learning Algorithm
ID3 learns decision trees by constructing them top-down:
  Begin with the question "Which attribute should be tested at the root of the tree?"
  The best attribute is selected and used as the test at the root.
  A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node.
  The entire process is repeated using the subset of examples associated with each descendant node.
The result is a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices.
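A minimal Python sketch of this top-down construction (added here; identifiers such as `id3` and `info_gain` are illustrative, and ties, missing values, and pruning are ignored):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr, target):
    """Expected reduction in entropy from splitting on attr."""
    labels = [e[target] for e in examples]
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attrs, target):
    """Grow a decision tree top-down, greedily choosing the best attribute."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not attrs:                        # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, [a for a in attrs if a != best], target)
    return tree
```

Called with the PlayTennis examples above (each day as a dict) and attrs = ["Outlook", "Temp", "Humidity", "Wind"], this should reproduce the tree shown on the earlier slide.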
31
Which Attribute to Choose?
We would like to select the attribute that is most useful for classifying examples. What is a good measure of the worth of an attribute? ID3 uses a statistical property, called information gain, to select among the candidate attributes at each step while growing the tree. Information gain is based on the concept of entropy.
32
Entropy
Entropy characterizes the impurity (disorder) of an arbitrary collection of examples.
Entropy is a measure of the expected encoding length in bits (−log2 pi is the length, in bits, of the optimal code for an outcome with probability pi).
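In standard notation (the formula on the original slide is not in the transcript), for a collection S whose examples fall into c classes in proportions p_i:

```latex
\mathrm{Entropy}(S) \;=\; \sum_{i=1}^{c} -p_i \log_2 p_i
\qquad\text{boolean case: } -p_{\oplus}\log_2 p_{\oplus} \;-\; p_{\ominus}\log_2 p_{\ominus}
```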
33
[Figure: the entropy function Entropy(S), ranging from 0.0 to 1.0, plotted against the proportion of positive examples]
34
Another Interpretation
Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S (i.e., a member of S drawn at random with uniform probability).
Entropy(S) = 0 if all members belong to the same class.
Entropy(S) = 1 if |positive examples| = |negative examples|.
Entropy is a measure of uncertainty.
35
Information Gain
The information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is the expected reduction in entropy caused by partitioning the examples according to attribute A; equivalently, it is the expected reduction in entropy caused by knowing the value of A.
In the formula below, the first term is the entropy of the original collection S, and the second term is the expected value of the entropy after S is partitioned using attribute A.
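In standard notation (the slide's formula was shown as an image and is not in the transcript):

```latex
\mathrm{Gain}(S, A) \;=\; \mathrm{Entropy}(S)
\;-\; \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```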
36
Interpretations of Gain(S,A)
Statistical property that measures how well a given attribute separates the training examples according to their target classification. Information provided about the target function value, given the value of some other attribute A. The number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of A.
37
Partially Learned Decision Tree
38
Computation of Gain(S, A)
39
Which Attribute is the Best Classifier? (1)
S: [9+, 5−], E = 0.940
Split on Humidity:
  High:   [3+, 4−], E = 0.985
  Normal: [6+, 1−], E = 0.592
40
Which Attribute is the Best Classifier? (2)
S: [9+, 5−], E = 0.940
Split on Wind:
  Weak:   [6+, 2−], E = 0.811
  Strong: [3+, 3−], E = 1.000
Classifying examples by Humidity provides more information gain than by Wind.
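Plugging these numbers into the gain formula confirms this (arithmetic added here for completeness):
  Gain(S, Humidity) = 0.940 − (7/14)(0.985) − (7/14)(0.592) ≈ 0.151
  Gain(S, Wind)     = 0.940 − (8/14)(0.811) − (6/14)(1.000) ≈ 0.048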
41
Hypothesis Space Search by ID3
ID3 searches a space of hypotheses for one that best fits the training examples The hypothesis space of ID3 is the set of possible decision trees ID3 is a simple-to-complex, hill-climbing search process, beginning with the empty tree. The evaluation function that guides the hill climbing search is the information gain measure.
42
Hypothesis Space
43
Properties of ID3
ID3's hypothesis space is a complete space of finite discrete-valued functions (in contrast to the conjunctive hypotheses of candidate elimination, and to continuous-valued functions, cf. neural nets).
ID3 maintains only a single current hypothesis. In contrast, the version space method maintains all consistent hypotheses. As a result, ID3 cannot determine how many alternative DTs are consistent with D.
44
Properties of ID3 (cont’d)
ID3 performs no backtracking in its search: it risks converging to a locally optimal solution that is not globally optimal (remedy: post-pruning, described below).
ID3 uses all training examples at each step of the search. In contrast, version space learning is incremental.
ID3 makes statistically based decisions and is thus less sensitive to errors in individual training examples.
45
Issues in DT Learning
(1) How deeply to grow the decision tree
(2) Handling continuous attributes
(3) Choosing an appropriate attribute selection measure
(4) Handling missing attribute values
(5) Handling attributes with differing costs
46
(1) Avoiding Overfitting the Data
ID3 grows each branch of the tree just deeply enough to perfectly classify the training examples. What if there is noise in the data, or the training set is too small?
Definition of overfitting:
  H: hypothesis space; h, h′: elements of H
  D: set of training examples; X: set of all possible examples
  A hypothesis h is said to overfit the training data if there exists h′ such that Error(D; h) < Error(D; h′) and Error(X; h) > Error(X; h′).
47
Overfitting Phenomenon
48
Overfitting Avoidance: Three Approaches
Use a separate set of examples, a validation set, to evaluate the utility of post-pruning nodes from the tree.
Use all available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set.
Use an explicit measure of the complexity of encoding the training examples and the DT, halting growth of the tree when this encoding size is minimized.
49
(2) Continuous-Valued Attributes
ID3: both the target attribute and the attribute tested by decision nodes must be discrete valued. The second restriction can easily be removed by dynamically defining new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals. If A is continuous valued, create a new boolean attribute Ac that is true if A < c and false otherwise.
50
Example
Consider the continuous-valued attribute Temperature in the table below. Sort the examples according to Temperature, identify adjacent examples that differ in their target classification, and generate candidate thresholds midway between the corresponding values: (48+60)/2 = 54 and (80+90)/2 = 85. Choose the candidate threshold that maximizes the information gain.

Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No
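A small Python sketch of this discretization step (added here; the function name is illustrative):

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent (sorted) values whose labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

print(candidate_thresholds([40, 48, 60, 72, 80, 90],
                           ["No", "No", "Yes", "Yes", "Yes", "No"]))
# [54.0, 85.0]
```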
51
(3) Alternative Measures for Selecting Attributes
The information gain measure favors attributes with many values over those with few values.
Example: the attribute Date (e.g., March …). Date alone would perfectly predict the target attribute over the training data, but would fare poorly on unseen examples.
One solution is to use a different measure, such as the gain ratio, which penalizes attributes such as Date by incorporating a term called split information.
52
Alternative Measures: SplitInformation
SplitInformation is the entropy of S with respect to the values of attribute A.
If an attribute's n values each pick out exactly one of the n examples (a perfect one-per-value split), SplitInformation = log2 n.
If an attribute has two values that split the data exactly in half, SplitInformation = 1.
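In standard notation (the slide's formula is not in the transcript), where S_1, ..., S_c are the c subsets of examples produced by partitioning S on the c-valued attribute A:

```latex
\mathrm{SplitInformation}(S, A) \;=\; -\sum_{i=1}^{c} \frac{|S_i|}{|S|}\,\log_2 \frac{|S_i|}{|S|}
```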
53
Alternative Measures: GainRatio
GainRatio penalizes attributes such as Date by incorporating the split information term, which is sensitive to how broadly and uniformly the attribute splits the data. The SplitInformation term discourages the selection of attributes with many uniformly distributed values, as shown below.
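In standard notation (reconstructed here, since the slide's formula is not in the transcript):

```latex
\mathrm{GainRatio}(S, A) \;=\; \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}
```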
54
(4) Missing Attribute Values
Suppose that ⟨x, c(x)⟩ is in S and the value A(x) is unknown.
One possibility: assign the missing attribute the value that is most common among the training examples at node n that have the classification c(x).
Another method: assign a probability to each of the possible values of A.
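A tiny sketch of the first strategy (added here; names are illustrative, examples are dicts, and None marks a missing value):

```python
from collections import Counter

def fill_missing(example, attr, node_examples, target):
    """Replace a missing value of attr with the most common value of attr
    among the node's training examples that share this example's class."""
    if example[attr] is not None:
        return example
    same_class = [e[attr] for e in node_examples
                  if e[target] == example[target] and e[attr] is not None]
    filled = dict(example)
    filled[attr] = Counter(same_class).most_common(1)[0][0]
    return filled
```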
55
(5) Attributes with Differing Costs
Classification of medical diseases: attributes differ in monetary cost and in cost to patient comfort.
ID3 can be modified to bias the search in favor of low-cost attributes.
Two application examples:
  Robot perception task: different sonar readings (attributes) require different times for sensing and processing.
  Medical diagnosis: different symptoms and lab tests have different costs.
56
Information Gain with Costs
Tan and Schlimmer (1990)
Nunez (1988)
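One common presentation of these two cost-sensitive selection measures (reconstructed here, since the slide's formulas are not in the transcript; w ∈ [0, 1] controls the relative importance of cost):

```latex
\text{Tan and Schlimmer: } \frac{\mathrm{Gain}^2(S, A)}{\mathrm{Cost}(A)}
\qquad
\text{Nunez: } \frac{2^{\mathrm{Gain}(S, A)} - 1}{\bigl(\mathrm{Cost}(A) + 1\bigr)^{w}}
```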
57
More information on biological data mining and related research can be found at