Artificial Intelligence Knowledge in Learning: Dr. Shahriar Bijani Shahed University Fall 2017
Slides’ Reference S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Chapter 19, Prentice Hall, 2010, 3rd Edition.
Introduction
The idea in all of the learning approaches described so far: construct a function that has the input–output behavior observed in the data.
The learning methods ≈ searching a hypothesis space to find a suitable function, starting from only very basic assumptions about the form of the function, e.g. “second-degree polynomial” or “decision tree”, perhaps with a preference for simpler hypotheses.
This means that before you can learn something new, you must first forget (almost) everything you know.
Introduction
Now we study learning methods that can take advantage of prior knowledge about the world.
The prior knowledge is represented as general first-order logical theories, bringing together the work on knowledge representation and learning.
A logical formulation of learning
The hypothesis is represented by a set of logical sentences.
Example descriptions and classifications are also logical sentences.
A new example can be classified by inferring a classification sentence from the hypothesis and the example description.
A logical formulation of learning
Goal and hypotheses:
Goal predicate Q: e.g. WillWait.
Learning: find an equivalent logical expression with which we can classify examples.
Each hypothesis proposes such an expression, a candidate definition of Q:
∀r WillWait(r) ⇔ Pat(r, Some) ∨ (Pat(r, Full) ∧ Hungry(r) ∧ Type(r, French)) ∨ …
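As an illustrative sketch, such a candidate definition can be written as a Python predicate over a restaurant described by a dict. The attribute names (Patrons, Hungry, Type) follow the AIMA restaurant example; the dict encoding and the exact disjuncts shown are assumptions of this sketch, not the book's code.

```python
def will_wait(r):
    """Candidate hypothesis:
    WillWait(r) <=> Patrons(r, Some)
                 or (Patrons(r, Full) and Hungry(r) and Type(r, French))"""
    return (r["Patrons"] == "Some"
            or (r["Patrons"] == "Full" and r["Hungry"] and r["Type"] == "French"))

x1 = {"Patrons": "Some", "Hungry": True,  "Type": "Thai"}
x2 = {"Patrons": "Full", "Hungry": False, "Type": "French"}
print(will_wait(x1))  # True: Patrons = Some
print(will_wait(x2))  # False: Full but not hungry
```

Each such predicate picks out an extension: the set of examples it classifies as positive.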
A logical formulation of learning
Hypothesis space: the set of all hypotheses the learning algorithm is designed to entertain.
One of the hypotheses is correct: H1 ∨ H2 ∨ … ∨ Hn.
Each Hi predicts a certain set of examples: the extension of the goal predicate.
Two hypotheses with different extensions are logically inconsistent with each other; otherwise, they are logically equivalent.
What are Examples
An example: an object of some logical description to which the goal concept may or may not apply.
Alternate(X1) ∧ ¬Bar(X1) ∧ ¬Fri/Sat(X1) ∧ …
Ideally, we want to find a hypothesis that agrees with all the examples.
The classification of the examples: WillWait(X1) or ¬WillWait(X1).
Each hypothesis hj has the form ∀x Goal(x) ⇔ Cj(x), where Cj(x) is a candidate definition.
What are Examples The above decision tree expresses the following logical definition
What are Examples
The possible relations between the true function f and a hypothesis h are: ++, −−, +− (false negative), −+ (false positive).
If either of the last two occurs, the example and h are logically inconsistent.
An example is a false negative for the hypothesis if the hypothesis says it should be negative but in fact it is positive.
E.g. a new example X13 (description omitted here) would be a false negative for hr.
Current-best hypothesis search Maintain a single hypothesis Adjust it as new examples arrive to maintain consistency Generalization for positive examples Specialization for negative examples
Current-best hypothesis search
Algorithm: each time a new example arrives, check the adjusted hypothesis for consistency with all existing examples.
The current-best-hypothesis learning algorithm searches for a consistent hypothesis that fits all the examples and backtracks when no consistent specialization/generalization can be found.
To start the algorithm, any hypothesis can be passed in; it will be specialized or generalized as needed.
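A minimal sketch of the idea, assuming hypotheses are simple conjunctions of attribute–value pairs. The backtracking step of the real algorithm is omitted, and the example data is illustrative:

```python
def satisfies(x, h):
    """h: a conjunction of (attribute, value) constraints."""
    return all(x[a] == v for a, v in h)

def current_best_learning(examples):
    """Current-best-hypothesis sketch for purely conjunctive hypotheses.
    The real algorithm also backtracks when no consistent change exists."""
    seed = next(x for x, label in examples if label)
    h = set(seed.items())            # most specific: the first + example
    seen = []
    for x, label in examples:
        seen.append((x, label))
        if label and not satisfies(x, h):
            # false negative: generalize minimally (drop violated conjuncts)
            h = {(a, v) for a, v in h if x[a] == v}
        elif not label and satisfies(x, h):
            # false positive: specialize minimally (re-add a conjunct from
            # the seed that this negative violates), staying consistent
            # with all positive examples seen so far
            for a, v in seed.items():
                if x.get(a) != v and all(satisfies(p, h | {(a, v)})
                                         for p, pl in seen if pl):
                    h |= {(a, v)}
                    break
    return h

examples = [
    ({"Patrons": "Some", "Hungry": True},  True),
    ({"Patrons": "Some", "Hungry": False}, True),   # forces a generalization
    ({"Patrons": "Full", "Hungry": True},  False),
]
h = current_best_learning(examples)
print(h)  # {('Patrons', 'Some')}
```

The second positive example forces the Hungry conjunct to be dropped; the negative example is already excluded, so no specialization is needed.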
Current-best hypothesis search Generalization and specialization are defined as operations that change the extension of a hypothesis. If hypothesis h1, with definition C1, is a generalization of hypothesis h2 with definition C2: ∀ x C2 (x) ⇒ C1(x) Therefore in order to construct a generalization of h2 , we need to find a definition C1 that is logically implied by C2
Current-best hypothesis search
The CURRENT-BEST-LEARNING algorithm and its variants have been used in many machine learning systems (since 1970).
Some difficulties:
Very expensive: checking all the previous examples over again for each modification.
Backtracking: the search process may involve a great deal of backtracking; the hypothesis space can be a doubly exponentially large place.
Problems: nondeterministic; no guarantee of finding the simplest or a correct hypothesis; needs backtracking.
Least-commitment search
Keeping only one hypothesis as the best guess is the problem → can we keep as many as possible?
What we can do instead is keep around all and only those hypotheses that are consistent with all the data so far; each new example either has no effect or gets rid of some hypotheses.
Version space (candidate elimination) algorithm: incremental, least-commitment.
From intervals to boundary sets: the G-set and the S-set.
S0, the most specific boundary, contains nothing: ⟨0, 0, …, 0⟩.
G0, the most general boundary, covers everything: ⟨?, ?, …, ?⟩.
Everything between is guaranteed to be consistent with the examples.
The version space algorithm incrementally generalizes S0 and specializes G0.
Version space
Updating the boundaries: Sj can only be generalized, and Gj can only be specialized.
False positive for Si: too general, discard it.
False negative for Si: too specific, generalize it minimally.
False positive for Gi: too general, specialize it minimally.
False negative for Gi: too specific, discard it.
When to stop:
One concept left (Si = Gi).
The version space collapses (G becomes more specific than S, or G or S becomes empty).
We run out of examples.
One major problem: it cannot handle noise; there is no completely successful solution to the problem of noise.
Drawbacks:
If the domain contains noise or insufficient attributes for exact classification, the version space will always collapse.
If we allow unlimited disjunction in the hypothesis space, the S-set will always contain a single most-specific hypothesis: the disjunction of the descriptions of the positive examples seen to date. Similarly, the G-set will contain just the negation of the disjunction of the descriptions of the negative examples.
For some hypothesis spaces, the number of elements in the S-set or G-set may grow exponentially in the number of attributes, even though efficient learning algorithms exist for those hypothesis spaces.
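The update rules above can be sketched for conjunctive hypotheses over discrete attributes, where "?" matches any value. This is a simplified, Mitchell-style encoding for illustration, not the book's exact algorithm, and the weather-style data is made up:

```python
def covers(h, x):
    """Hypothesis h (a tuple with '?' wildcards) matches instance x."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

def candidate_elimination(examples, domains):
    """Version-space (candidate elimination) sketch."""
    n = len(domains)
    S = None              # most specific boundary, seeded by 1st positive
    G = [("?",) * n]      # most general boundary
    for x, label in examples:
        if label:                          # positive example
            G = [g for g in G if covers(g, x)]          # prune G
            if S is None:
                S = tuple(x)
            else:                          # minimal generalization of S
                S = tuple(sv if sv == xv else "?" for sv, xv in zip(S, x))
        else:                              # negative example
            if S is not None and covers(S, x):
                S = None                   # version space collapses
            newG = []
            for g in G:
                if not covers(g, x):
                    newG.append(g)
                    continue
                # minimal specializations of g that exclude x and
                # stay more general than S
                for i in range(n):
                    if g[i] == "?":
                        for v in domains[i]:
                            if v != x[i]:
                                s = g[:i] + (v,) + g[i + 1:]
                                if S is None or covers(s, S):
                                    newG.append(s)
            G = newG
    return S, G

domains = (("sunny", "rainy"), ("warm", "cold"))
examples = [
    (("sunny", "warm"), True),
    (("rainy", "cold"), False),
    (("sunny", "cold"), True),
]
S, G = candidate_elimination(examples, domains)
print(S, G)   # ('sunny', '?') [('sunny', '?')] -- the boundaries have met
```

Here the boundaries converge to a single concept (Si = Gi), the first stopping condition above.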
Using prior knowledge For Decision Tree and logical description learning, we assume no prior knowledge We do have some prior knowledge, so how can we use it? We need a logical formulation as opposed to the function learning. To understand the role of prior knowledge, we need to talk about the logical relationships among hypotheses, example descriptions, and classifications.
Inductive learning in the logical setting
The objective is to find a hypothesis that explains the classifications of the examples, given their descriptions:
Hypothesis ∧ Descriptions |= Classifications
Hypothesis: unknown; it explains the observations.
Descriptions: the conjunction of all the example descriptions.
Classifications: the conjunction of all the example classifications.
Knowledge-free learning (e.g. decision trees) uses only this constraint: the hypothesis alone must take the Descriptions to the Classifications, with no background knowledge.
A cumulative learning process The new approach is to design agents that already know something and are trying to learn some more. Intuitively, this should be faster and better than without using knowledge, assuming what’s known is always correct. How to implement this cumulative learning with increasing knowledge?
Some examples of using knowledge
One can jump to general conclusions after only one observation.
Traveling to Brazil: after meeting one Brazilian who speaks Portuguese, a traveler concludes that Brazilians speak Portuguese, yet does not conclude that all Brazilians share his name.
A pharmacologically ignorant but diagnostically sophisticated medical student …
These are all cases in which the use of background knowledge allows much faster learning than one might expect from a pure induction program.
Some general schemes
Explanation-based learning (EBL):
Hypothesis ∧ Descriptions |= Classifications
Background |= Hypothesis
Does not learn anything factually new from the instance.
Relevance-based learning (RBL):
Hypothesis ∧ Descriptions |= Classifications
Background ∧ Descriptions ∧ Classifications |= Hypothesis
Deductive in nature.
Knowledge-based inductive learning (KBIL):
Background ∧ Hypothesis ∧ Descriptions |= Classifications
Inductive logic programming (ILP)
ILP can formulate hypotheses in general first-order logic; other methods such as decision trees use more restricted languages.
Prior knowledge is used to reduce the complexity of learning:
prior knowledge further reduces the hypothesis space;
prior knowledge helps find shorter hypotheses.
Again, we assume the prior knowledge is correct.
Explanation-based learning (EBL) Hypothesis^Description |= Classifications Background |= Hypothesis A method to extract general rules from individual observations The goal is to solve a similar problem faster next time. Memoization - speed up by saving results and avoiding solving a problem from scratch EBL does it one step further - from observations to rules e.g. differentiating and simplifying algebraic expressions
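Memoization itself can be shown in a few lines of standard Python (nothing specific to EBL): results of earlier calls are cached, so a problem is never solved from scratch twice.

```python
from functools import lru_cache

@lru_cache(maxsize=None)      # cache every (n -> fib(n)) result
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040, computed in linear time thanks to the cache
```

EBL goes one step further: instead of caching the answer to one instance, it extracts a general rule that applies to a whole class of instances.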
Explanation-based learning (EBL) The “explanation” can be a logical proof, but more generally it can be any reasoning or problem-solving process. The key is to be able to identify the necessary conditions for those same steps to apply to another case.
Why EBL?
Explaining why something is a good idea is much easier than coming up with the idea.
Once something is understood, it can be generalized and reused in other circumstances.
Extracting general rules from examples: EBL constructs two proof trees simultaneously, by variabilizing the constants in the first tree.
See Fig 19.7.
Basic EBL Given an example, construct a proof tree using the background knowledge In parallel, construct a generalized proof tree for the variabilized goal Construct a new rule (leaves => the root) Drop any conditions that are true regardless of the variables in the goal
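The variabilization step can be sketched as follows, assuming the ground proof's leaves and root are encoded as nested tuples of the form (predicate, args...). The family facts used here are illustrative:

```python
def variabilize(term, mapping):
    """Replace each constant in a ground term with a variable, reusing
    the same variable for repeated constants - the core of building the
    generalized proof tree in EBL."""
    if isinstance(term, tuple):                     # (predicate, args...)
        return (term[0],) + tuple(variabilize(a, mapping) for a in term[1:])
    if term not in mapping:
        mapping[term] = "?x" + str(len(mapping))    # fresh variable
    return mapping[term]

# Ground proof for one instance: the leaves prove the root.
leaves = [("Parent", "Mum", "Elizabeth"), ("Parent", "Elizabeth", "Charles")]
root = ("Grandparent", "Mum", "Charles")

m = {}                                              # one consistent mapping
rule = ([variabilize(l, m) for l in leaves], variabilize(root, m))
print(rule)
# ([('Parent', '?x0', '?x1'), ('Parent', '?x1', '?x2')],
#  ('Grandparent', '?x0', '?x2'))
```

The extracted rule (leaves ⇒ root) now applies to any bindings of ?x0, ?x1, ?x2, not just the original constants.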
Efficiency of EBL Choosing a general rule too many rules -> slow inference aim for gain - significant increase in speed as general as possible Operationality - A subgoal is operational means it is easy to solve Trade-off between Operationality and Generality Empirical analysis of efficiency in EBL study
Learning using Relevant Information (RBL)
Hypothesis ∧ Descriptions |= Classifications
Background ∧ Descriptions ∧ Classifications |= Hypothesis
Prior knowledge: people in a country usually speak the same language:
Nat(x, n) ∧ Nat(y, n) ∧ Lang(x, l) ⇒ Lang(y, l)
Observation: given nationality, language is fully determined.
Given that Sima is Iranian and speaks Persian:
Nat(Sima, I) ∧ Lang(Sima, P)
we can logically conclude:
Nat(y, I) ⇒ Lang(y, P)
Functional dependencies We have seen a form of relevance: determination - language (Persian) is a function of nationality (Iranian) Determination is really a relationship between the predicates The corresponding generalization follows logically from the determinations and descriptions.
We can generalize from Sima to all Iranians, but not to all nations.
So, determinations can limit the hypothesis space to be considered.
Determinations specify a sufficient basis vocabulary from which to construct hypotheses concerning the target predicate.
A reduction in the hypothesis-space size should make it easier to learn the target predicate.
For n Boolean features, if the determination contains d features, what is the saving in the required number of examples?
|H| = O(2^(2^n)), so the number of examples needed is O(2^n); with the determination, the learner requires only O(2^d) examples, a reduction of O(2^(n−d)).
Learning using Relevant Information
A determination P ≻ Q says that if any examples match on P, they must also match on Q.
Find the simplest determination consistent with the observations.
Search through the space of determinations: first those with one predicate, then two predicates, and so on.
Algorithm: Fig 19.8. The number of candidate determinations with p of the n predicates is C(n, p) ("n choose p").
Feature selection amounts to finding a determination.
Feature selection is an active research area in machine learning, pattern recognition, and statistics.
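A minimal sketch of this search for the smallest consistent determination, assuming examples are attribute dicts paired with a class label. The nationality/language data is illustrative, echoing the Sima example above:

```python
from itertools import combinations

def consistent_det(attrs, examples):
    """attrs determines the class iff no two examples agree on all
    attributes in attrs but differ in class."""
    seen = {}
    for x, label in examples:
        key = tuple(x[a] for a in attrs)
        if seen.setdefault(key, label) != label:
            return False
    return True

def minimal_consistent_det(examples, all_attrs):
    """Search attribute subsets of increasing size (cf. Fig 19.8)."""
    for size in range(len(all_attrs) + 1):
        for attrs in combinations(all_attrs, size):
            if consistent_det(attrs, examples):
                return attrs
    return tuple(all_attrs)

examples = [
    ({"Nat": "Iranian", "Name": "Sima"}, "Persian"),
    ({"Nat": "Iranian", "Name": "Reza"}, "Persian"),
    ({"Nat": "French",  "Name": "Sima"}, "French"),
]
print(minimal_consistent_det(examples, ["Nat", "Name"]))  # ('Nat',)
```

Nationality alone already determines the language on this data, so the single-attribute subset is returned before any larger ones are tried.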
Combining relevance-based learning with decision-tree learning → RBDTL:
it reduces the hypothesis space and the required training data, so its learning performance (as a function of training-set size) improves.
Gains: time saving and less chance of overfitting.
The time complexity of this algorithm depends on the size of the smallest consistent determination.
Obviously, in cases where all the available attributes are relevant, RBDTL will show no advantage.
Other issues in relevance-based learning:
noise handling;
using other kinds of prior knowledge;
semi-supervised learning;
expert knowledge as constraints;
moving from attribute-based representations to full first-order logic.
Inductive Logic Programming (ILP)
Knowledge-based inductive learning (KBIL):
Background ∧ Hypothesis ∧ Descriptions |= Classifications
solved for the unknown Hypothesis, given the Background knowledge and examples described by Descriptions and Classifications.
ILP combines inductive methods with first-order logic and represents theories as logic programs.
ILP offers complete algorithms for inducing general, first-order theories from examples.
It can learn successfully in domains where attribute-based algorithms fail completely.
Inductive Logic Programming: example
Problem: learning family relationships from examples; attribute-based learning algorithms are incapable of learning relational predicates.
Descriptions:
Father(Philip, Charles), Father(Philip, Anne), …
Mother(Mum, Margaret), Mother(Mum, Elizabeth), …
Married(Diana, Charles), Married(Elizabeth, Philip), …
Male(Philip), Male(Charles), …
Female(Beatrice), Female(Margaret), …
The sentences in Classifications depend on the target concept being learned, e.g. Grandparent.
The complete set of Classifications contains 20 × 20 = 400 conjuncts:
Grandparent(Mum, Charles), Grandparent(Elizabeth, Beatrice), …
¬Grandparent(Mum, Harry), ¬Grandparent(Spencer, Peter), …
The goal of an inductive learning program is to come up with a set of sentences for the Hypothesis such that the entailment constraint is satisfied.
Inductive Logic Programming: example Notice: an attribute-based learning algorithm, such as Decision-Tree-Learning, will not solve this problem. Why? In order to express Grandparent as an attribute (i.e., a unary predicate), we would need to make pairs of people into objects: Grandparent (Mum,Charles ) ... Then we get stuck in trying to represent the example descriptions. The only possible attributes are horrible things such as FirstElementIsMotherOfElizabeth(Mum,Charles). By applying ILP, Background can be: Parent(x,y) ⇔ [Mother(x, y) ∨ Father (x, y)], Grandparent(x,y)⇔[∃z Parent(x,z) ∧ Parent(z,y)].
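The Background and Hypothesis above can be checked against a few facts in plain Python. The fact set is a small fragment of the AIMA family tree; facts not listed on the slide, such as Mother(Elizabeth, Charles), are assumptions added for illustration:

```python
# Ground facts (a fragment; Mother(Elizabeth, Charles) is assumed here).
father = {("Philip", "Charles"), ("Philip", "Anne")}
mother = {("Mum", "Margaret"), ("Mum", "Elizabeth"), ("Elizabeth", "Charles")}

def parent(x, y):
    # Background: Parent(x, y) <=> Mother(x, y) v Father(x, y)
    return (x, y) in mother or (x, y) in father

def grandparent(x, y):
    # Hypothesis: Grandparent(x, y) <=> exists z Parent(x, z) ^ Parent(z, y)
    people = {p for pair in father | mother for p in pair}
    return any(parent(x, z) and parent(z, y) for z in people)

print(grandparent("Mum", "Charles"))   # True, via z = Elizabeth
print(grandparent("Philip", "Anne"))   # False: Philip is Anne's parent
```

Note how the intermediate variable z in the hypothesis is exactly what an attribute-based representation cannot express.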
Inductive Logic Programming: example An example - a typical family tree
Inductive Logic Programming (ILP) Constructive induction algorithms: Algorithms that can generate new predicates a necessary part of the cumulative learning one of the hardest problems in machine learning some ILP techniques provide effective mechanisms for it Two principal approaches to ILP: Top-down inductive learning methods: using a generalization of decision tree methods Inductive learning with inverse deduction: using techniques based on inverting a resolution proof.
Inverse resolution
If Classifications follow from Background ∧ Hypothesis ∧ Descriptions, then we can prove this by resolution with refutation (because resolution is complete).
A normal resolution step takes clauses C1 and C2 and produces the resolvent C.
If we run the proof backwards, we can find a Hypothesis such that the proof goes through:
from C, find C1 and C2; or from C and C2, find C1.
The key, then, is to find a way to invert the resolution process.
Background ∧ Hypothesis ∧ Descriptions |= Classifications
Inverse resolution Generating inverse proofs A family tree example Background ∧ Hypothesis ∧ Descriptions |= Classification
Inverse resolution involves search Each inverse resolution step is nondeterministic For any C and C1, there can be many C2 Discovering new knowledge with IR It’s not easy - a monkey and a typewriter Discovering new predicates with IR The ability to use background knowledge provides significant advantages
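For propositional clauses, the nondeterminism of one inverse step can be made concrete: given the resolvent C and one parent C1, several second parents C2 may be possible. This sketch (clauses as sets of signed literals like "+p"/"-p") returns only the minimal candidates; any superset drawn from C ∪ {¬l} would also resolve to C:

```python
def inverse_resolve(C, C1):
    """Given resolvent C and one parent C1, return the minimal candidate
    second parents C2.  Resolution on literal l gives
    C = (C1 - {l}) | (C2 - {neg(l)}), so any l in C1 whose removal leaves
    C1 - {l} inside C could have been resolved away; C2 must then contain
    neg(l) plus the part of C that C1 does not supply."""
    def neg(l):
        return ("-" if l[0] == "+" else "+") + l[1:]
    candidates = []
    for l in C1:
        rest = C1 - {l}
        if rest <= C:                 # C1 - {l} must survive into C
            candidates.append((C - rest) | {neg(l)})
    return candidates

C  = {"+p", "+r"}      # resolvent
C1 = {"+p", "+q"}      # one known parent
print(inverse_resolve(C, C1))   # one minimal candidate: {'-q', '+r'}
```

Here only one literal of C1 can have been resolved away, but in general each choice of l (and each admissible superset) gives a different C2, which is exactly why inverse resolution involves search.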
Top-down learning (FOIL)
A generalization of decision-tree induction to the first-order case, by Quinlan, the author of C4.5.
Start with a general rule and specialize it to fit the data.
Now we use first-order literals instead of attributes, and the hypothesis is a set of clauses instead of a decision tree.
Example: ⇒ Grandfather(x, y) (page 701), with positive and negative examples; literals are added one at a time to the left-hand side, e.g. Father(x, y) ⇒ Grandfather(x, y).
How to choose the literal? (Algorithm on page 702.) The rule should agree with some positive examples and none of the negative examples.
FOIL removes the covered positive examples and repeats.
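A toy, propositionalized sketch of FOIL's greedy specialization loop. Attribute tests stand in for first-order literals, a crude p − n score replaces FOIL's information-gain measure, and the restaurant-style data is illustrative:

```python
def learn_rule(pos, neg, literals):
    """Greedy FOIL-style specialization sketch: start with an empty body
    and add the literal that best separates + from - until no negatives
    remain.  Assumes some literal always removes negatives; otherwise
    this loop would not terminate."""
    body = []
    while neg:
        best = max(literals,
                   key=lambda lit: sum(map(lit, pos)) - sum(map(lit, neg)))
        body.append(best)
        pos = [x for x in pos if best(x)]   # keep covered positives
        neg = [x for x in neg if best(x)]   # keep covered negatives
    return body

def patrons_some(x): return x["Patrons"] == "Some"
def hungry(x):       return x["Hungry"]

pos = [{"Patrons": "Some", "Hungry": True},
       {"Patrons": "Some", "Hungry": False}]
neg = [{"Patrons": "Full", "Hungry": True}]
rule = learn_rule(pos, neg, [patrons_some, hungry])
print([lit.__name__ for lit in rule])   # ['patrons_some']
```

One literal suffices here; FOIL would then remove the positives this clause covers and learn further clauses for any that remain.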
Summary Using prior knowledge in cumulative learning Prior knowledge allows for shorter H’s. Prior knowledge plays different logical roles as in entailment constraints EBL, RBL, KBIL ILP generates new predicates so that concise new theories can be expressed.