1 Data Classification and Segmentation: Bayesian Methods III
HK Book: Section 7.4, Domingos' Paper (online)
Instructor: Qiang Yang, Hong Kong University of Science and Technology
Thanks: Dan Weld, Eibe Frank


3 First, Review Naïve Bayesian

4 Naïve Bayesian is Surprisingly Good. Why?

5 Independence Test
Is Naïve Bayesian's good performance due to the attributes being independent?
To check, model the dependence between two attributes A_m and A_n given the class with a measure D(A_m, A_n | C):
D(...) is zero when the attributes are completely independent given the class
D(...) is large when they are dependent

6 How to measure independence?
H(A|C): once the class C is given, how much uncertainty remains in A? In other words, how random is A once C is known?
Randomness can be measured by entropy: -p*log(p), summed over all possible values.
Thus, to measure the dependence of A on C, look at each value C_i of C:
Find the entropy of A among the examples with C = C_i
Then sum over all possible C_i, weighting each term by Pr(C_i)

7 Measuring dependency
Example: C = Play attribute, A = Windy attribute
C1 = yes, C2 = no; Pr(C = yes) = 9/14, Pr(C = no) = 5/14
When C = C1 (Play = yes): A = True has 3 counts, A = False has 6 counts
Pr(A = True and C = C1) = 3/14, so Pr(A = True | C1) = 3/9; Pr(A = False and C = C1) = 6/14, so Pr(A = False | C1) = 6/9
When C = C2 (Play = no): A = True has 3 counts, A = False has 2 counts
Pr(A = True and C = C2) = 3/14, so Pr(A = True | C2) = 3/5; Pr(A = False and C = C2) = 2/14, so Pr(A = False | C2) = 2/5
H(A|C) = 9/14 * [-(3/9)log(3/9) - (6/9)log(6/9)] + 5/14 * [-(3/5)log(3/5) - (2/5)log(2/5)]

Windy  Play
FALSE  no
TRUE   no
FALSE  yes
FALSE  yes
FALSE  yes
TRUE   no
TRUE   yes
FALSE  no
FALSE  yes
FALSE  yes
TRUE   yes
TRUE   yes
FALSE  yes
TRUE   no
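To make the computation concrete, here is a minimal sketch in plain Python (the data list and the conditional_entropy helper are illustrative names, not part of the slides) that reproduces H(Windy | Play) from the 14 instances above:

from collections import Counter
from math import log2

data = [  # (Windy, Play) pairs from the table above
    ("FALSE", "no"), ("TRUE", "no"), ("FALSE", "yes"), ("FALSE", "yes"),
    ("FALSE", "yes"), ("TRUE", "no"), ("TRUE", "yes"), ("FALSE", "no"),
    ("FALSE", "yes"), ("FALSE", "yes"), ("TRUE", "yes"), ("TRUE", "yes"),
    ("FALSE", "yes"), ("TRUE", "no"),
]

def conditional_entropy(pairs):
    """H(A|C) = sum over c of Pr(c) * sum over a of -Pr(a|c) * log2 Pr(a|c)."""
    n = len(pairs)
    class_counts = Counter(c for _, c in pairs)
    h = 0.0
    for c, nc in class_counts.items():
        a_counts = Counter(a for a, cc in pairs if cc == c)
        h_a_given_c = -sum((k / nc) * log2(k / nc) for k in a_counts.values())
        h += (nc / n) * h_a_given_c
    return h

print(conditional_entropy(data))  # H(Windy | Play)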

8 Property of H(A|C)
If A is completely dependent on C, what is H(A|C)? In the extreme case, whenever C = yes, A = true, and whenever C = no, A = false:
Pr(A = true | C = yes) = 1, Pr(A = false | C = yes) = 0
H(A|C) = 0: no randomness is left in A once C is known
The other extreme: if A is completely independent of C, what is H(A|C)?
The conditional probabilities lie strictly between 0 and 1, and H(A|C) = H(A), its maximum for the given distribution of A

9 Extending to two attributes
For A1 and A2, simply consider the Cartesian product of their values. This gives a single combined attribute.
Then measure the dependency of A1A2 (the new attribute) on the class C.
Example: consider A1 = Humidity and A2 = Windy, Class = Play
Humidity = {high, normal}, Windy = {True, False}
A1A2 = HumidityWindy = {high-true, high-false, normal-true, normal-false}

10
Humidity  Windy  Play
high      FALSE  no
high      TRUE   no
high      FALSE  yes
high      FALSE  yes
normal    FALSE  yes
normal    TRUE   no
normal    TRUE   yes
high      FALSE  no
normal    FALSE  yes
normal    FALSE  yes
normal    TRUE   yes
high      TRUE   yes
normal    FALSE  yes
high      TRUE   no

11 Putting them together
Define D(A_m, A_n | C) = H(A_m | C) + H(A_n | C) - H(A_mA_n | C).
When the two attributes A_m and A_n are independent given C, H(A_mA_n | C) = H(A_m | C) + H(A_n | C), so D(...) = 0.
When they are dependent, D(...) is a large value.
Thus D(...) is a measure of the dependency between attributes, given the class.
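A short sketch of this dependency measure, reusing the conditional_entropy helper from the earlier sketch and the Humidity/Windy/Play table above (the variable names are illustrative):

rows = [  # (Humidity, Windy, Play), copied from the table on slide 10
    ("high", "FALSE", "no"),  ("high", "TRUE", "no"),    ("high", "FALSE", "yes"),
    ("high", "FALSE", "yes"), ("normal", "FALSE", "yes"), ("normal", "TRUE", "no"),
    ("normal", "TRUE", "yes"), ("high", "FALSE", "no"),   ("normal", "FALSE", "yes"),
    ("normal", "FALSE", "yes"), ("normal", "TRUE", "yes"), ("high", "TRUE", "yes"),
    ("normal", "FALSE", "yes"), ("high", "TRUE", "no"),
]

h1 = conditional_entropy([(h, c) for h, w, c in rows])       # H(Humidity | Play)
h2 = conditional_entropy([(w, c) for h, w, c in rows])       # H(Windy | Play)
h12 = conditional_entropy([(h + w, c) for h, w, c in rows])  # H(HumidityWindy | Play), Cartesian-product attribute

D = (h1 + h2) - h12   # 0 if Humidity and Windy are independent given Play; larger if dependent
print(h1, h2, h12, D)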

12 Most domains are not independent

13 Why Perform So Well? (Section 4 of Domingos' paper)
Assume three attributes A, B, C and two classes: + and - (say, play = + means yes).
Assume A and B are identical, i.e., completely dependent (B = A).
Assume Pr(+) = Pr(-) = 0.5.
Assume A and C are independent given the class, so Pr(A,C|+) = Pr(A|+)*Pr(C|+).
Optimal decision: if Pr(+)*Pr(A,B,C|+) > Pr(-)*Pr(A,B,C|-), then answer = +; else answer = -.
Since B = A, Pr(A,B,C|+) = Pr(A,A,C|+) = Pr(A,C|+) = Pr(A|+)*Pr(C|+), and likewise for -.
Thus (using Pr(+) = Pr(-)) the optimal rule is: answer = + iff Pr(A|+)*Pr(C|+) > Pr(A|-)*Pr(C|-).

14 Analysis
If we use the Naïve Bayesian method:
If Pr(+)*Pr(A|+)*Pr(B|+)*Pr(C|+) > Pr(-)*Pr(A|-)*Pr(B|-)*Pr(C|-), then answer = +; else answer = -.
Since B = A and Pr(+) = Pr(-), this reduces to:
Pr(A|+)^2 * Pr(C|+) > Pr(A|-)^2 * Pr(C|-)

15 Simplify the Optimal Formula
Let Pr(+|A) = p and Pr(+|C) = q. By Bayes' rule:
Pr(A|+) = Pr(+|A)*Pr(A)/Pr(+) = p*Pr(A)/Pr(+)
Pr(A|-) = (1-p)*Pr(A)/Pr(-)
Pr(C|+) = Pr(+|C)*Pr(C)/Pr(+) = q*Pr(C)/Pr(+)
Pr(C|-) = (1-q)*Pr(C)/Pr(-)
Thus the optimal rule "if Pr(+)*Pr(A|+)*Pr(C|+) > Pr(-)*Pr(A|-)*Pr(C|-), then answer = +; else answer = -" becomes
p*q > (1-p)*(1-q)    (Eq 1)
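A compact restatement of that algebra (a sketch in LaTeX, using only the slide's p, q, and the equal-prior assumption Pr(+) = Pr(-)):

\[
\begin{aligned}
\Pr(+)\Pr(A\mid +)\Pr(C\mid +) &> \Pr(-)\Pr(A\mid -)\Pr(C\mid -) \\
\frac{p\,q\,\Pr(A)\Pr(C)}{\Pr(+)} &> \frac{(1-p)(1-q)\,\Pr(A)\Pr(C)}{\Pr(-)} \\
p\,q &> (1-p)(1-q) \qquad \text{(Eq 1)}
\end{aligned}
\]

The factors Pr(A) and Pr(C) and the equal priors cancel from both sides, which is why only p and q survive.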

16 Simplify the NB Formula
The Naïve Bayesian rule
Pr(A|+)^2 * Pr(C|+) > Pr(A|-)^2 * Pr(C|-)
becomes, after the same substitution,
p^2 * q > (1-p)^2 * (1-q)    (Eq 2)
Thus, to understand why Naïve Bayesian performs so well, we ask: when does the optimal decision agree (or differ) with the Naïve Bayesian decision? That is, where do formulas (Eq 1) and (Eq 2) agree or disagree?
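One way to see where the two rules agree is to scan a grid of (p, q) values; the sketch below is illustrative only (the grid size is arbitrary), and the fraction it prints is simply whatever the grid yields:

steps = 200
disagree = total = 0
for i in range(steps + 1):
    for j in range(steps + 1):
        p, q = i / steps, j / steps
        optimal = p * q > (1 - p) * (1 - q)                    # Eq 1
        naive = p * p * q > (1 - p) * (1 - p) * (1 - q)        # Eq 2
        total += 1
        disagree += (optimal != naive)
print("fraction of (p, q) grid where NB and optimal disagree:", disagree / total)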

17 [Figure: the (p, q) region where the Naïve Bayesian decision disagrees with the optimal decision]

18 Conclusion
In most cases, Naïve Bayesian performs the same as the optimal classifier; that is, its error rate is minimal.
This has been confirmed in many practical applications.

19 Applications of Bayesian Methods
Gene analysis: Nir Friedman, Iftach Nachman, Dana Pe'er, Institute of Computer Science, Hebrew University
Text analysis: spam filter (Microsoft); news classification for personal news delivery on the Web; user profiles
Credit analysis in the financial industry: analyze the probability of repayment of a loan

20 Gene Interaction Analysis
DNA: a double-stranded molecule in which hereditary information is encoded (complementation rules)
Gene: a segment of DNA that contains the information required to make a protein

21 Gene Interaction Result: Example of interaction between proteins for gene SVS1. The width of edges corresponds to the conditional probability.

22 Spam Killer
Bayesian methods are used to weed out spam e-mails.

23 Spam Killer

24 Construct Your Training Data
Each e-mail is one record. E-mails are classified by the user into:
Spam: + class
Non-spam: - class
An e-mail M is classified as spam if Pr(+|M) > Pr(-|M).
Features: words (values = {1, 0} or {frequency}), phrases, attachment {yes, no}.
How accurate: TP rate > 90%. We want the FP rate to be as low as possible; those are the e-mails that are non-spam but are classified as spam.
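As an illustration of this setup, here is a minimal Naïve Bayesian spam filter with binary word features and Laplace smoothing; the messages and vocabulary are made-up toy examples, not real training data, and this is not any particular product's implementation:

from collections import Counter
from math import log

training = [  # (set of words in the message, label) -- toy examples only
    ({"cheap", "pills", "buy"}, "+"), ({"win", "prize", "buy"}, "+"),
    ({"meeting", "tomorrow", "agenda"}, "-"), ({"lunch", "tomorrow"}, "-"),
]
vocab = set().union(*(words for words, _ in training))
labels = ["+", "-"]
prior = Counter(label for _, label in training)
word_counts = {c: Counter() for c in labels}
for words, label in training:
    word_counts[label].update(words)

def classify(words):
    scores = {}
    for c in labels:
        score = log(prior[c] / len(training))          # log Pr(c)
        n_c = prior[c]
        for w in vocab:
            p_w = (word_counts[c][w] + 1) / (n_c + 2)  # Laplace-smoothed Pr(w present | c)
            score += log(p_w) if w in words else log(1 - p_w)
        scores[c] = score
    return max(scores, key=scores.get)                 # spam if Pr(+|M) beats Pr(-|M)

print(classify({"buy", "cheap", "prize"}))   # likely "+"
print(classify({"agenda", "lunch"}))         # likely "-"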

25 Naïve Bayesian in Oracle9i
What is the target market? Oracle9i Data Mining is best suited for companies that have lots of data, are committed to the Oracle platform, and want to automate and operationalize their extraction of business intelligence. The initial end user is a Java application developer, although the end user of the application enhanced by data mining could be a customer service rep, marketing manager, customer, business manager, or just about any other imaginable user.
What algorithms does Oracle9i Data Mining support? Oracle9i Data Mining provides programmatic access to two data mining algorithms embedded in Oracle9i Database through a Java-based API. Data mining algorithms are machine-learning techniques for analyzing data for specific categories of problems. Different algorithms are good at different types of analysis. Oracle9i Data Mining provides two algorithms: Naive Bayes for classifications and predictions, and Association Rules for finding patterns of co-occurring events. Together, they cover a broad range of business problems.
Naive Bayes: Oracle9i Data Mining's Naive Bayes algorithm can predict binary or multi-class outcomes. In binary problems, each record either will or will not exhibit the modeled behavior. For example, a model could be built to predict whether a customer will churn or remain loyal. Naive Bayes can also make predictions for multi-class problems where there are several possible outcomes. For example, a model could be built to predict which class of service will be preferred by each prospect.
Binary model example: Q: Is this customer likely to become a high-profit customer? A: Yes, with 85% probability.
Multi-class model example: Q: Which one of five customer segments is this customer most likely to fit into: Grow, Stable, Defect, Decline, or Insignificant? A: Stable, with 55% probability.