15-781 Machine Learning (Recitation 1) By Jimeng Sun 9/15/05.



Outline: Entropy and Information Gain; Decision Tree; Probability.

Entropy. Suppose X can have one of m values V_1, V_2, ..., V_m, with P(X=V_1) = p_1, P(X=V_2) = p_2, ..., P(X=V_m) = p_m. What is the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It is H(X) = -p_1 log2 p_1 - p_2 log2 p_2 - ... - p_m log2 p_m = -Σ_j p_j log2 p_j, the entropy of X. "High entropy" means X comes from a near-uniform (boring) distribution; "low entropy" means X comes from a varied distribution with peaks and valleys.
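
A minimal Python sketch of this definition (not from the original recitation; the function name is just illustrative):

```python
# Entropy of a discrete distribution, H(X) = -sum_j p_j log2 p_j.
# Minimal sketch, not the recitation's code.
import math

def entropy(probs):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25] * 4))   # uniform over 4 values: 2.0 bits (high entropy)
print(entropy([0.9, 0.1]))   # peaked distribution: ~0.469 bits (low entropy)
```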

Entropy H(X), H(Y). Example data, used for the rest of the recitation: X = College Major, Y = Likes "XBOX".

    X        Y
    Math     Yes
    History  No
    CS       Yes
    Math     No
    Math     No
    CS       Yes
    History  No
    Math     Yes

H(X) = 1.5 and H(Y) = 1. (X is Math for 4 of the 8 records, History for 2, and CS for 2, so H(X) = -(0.5 log2 0.5 + 0.25 log2 0.25 + 0.25 log2 0.25) = 1.5 bits; Y is Yes for 4 records and No for 4, so H(Y) = 1 bit.)
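
The same numbers can be checked directly from the data; a sketch with an illustrative helper name:

```python
# Recompute H(X) = 1.5 and H(Y) = 1 from the 8-record table above.
import math
from collections import Counter

def entropy(values):
    """Entropy in bits of the empirical distribution of a list of labels."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

X = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
Y = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

print(entropy(X))  # 1.5
print(entropy(Y))  # 1.0
```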

Specific Conditional Entropy H(Y|X=v). Definition: H(Y|X=v) = the entropy of Y among only those records in which X has value v. (X = College Major, Y = Likes "XBOX", using the same 8-record table as above.)

Specific Conditional Entropy H(Y|X=v). Definition: H(Y|X=v) = the entropy of Y among only those records in which X has value v. Example (same table as above): H(Y|X=Math) = 1, H(Y|X=History) = 0, H(Y|X=CS) = 0. (Among the four Math records, Y is Yes twice and No twice, so the entropy is 1 bit; every History record has Y = No and every CS record has Y = Yes, so those entropies are 0.)
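
A sketch of the restrict-then-compute-entropy step (helper names are illustrative, not from the slides):

```python
# Specific conditional entropy H(Y | X = v): entropy of Y restricted to
# the records where X == v.
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def specific_cond_entropy(X, Y, v):
    """Entropy of Y among only those records in which X has value v."""
    return entropy([y for x, y in zip(X, Y) if x == v])

X = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
Y = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

print(specific_cond_entropy(X, Y, "Math"))     # 1.0
print(specific_cond_entropy(X, Y, "History"))  # 0.0
print(specific_cond_entropy(X, Y, "CS"))       # 0.0
```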

Conditional Entropy H(Y|X). Definition: H(Y|X) = the average specific conditional entropy of Y = the expected conditional entropy of Y given the value of X for a record chosen at random = the expected number of bits needed to transmit Y if both sides of the line know the value of X = Σ_j P(X=v_j) H(Y|X=v_j). (X = College Major, Y = Likes "XBOX", same table as above.)

Conditional Entropy. Definition: H(Y|X) = the average conditional entropy of Y = Σ_j P(X=v_j) H(Y|X=v_j). Example (X = College Major, Y = Likes "XBOX", same table as above):

    v_j      P(X=v_j)  H(Y|X=v_j)
    Math     0.5       1
    History  0.25      0
    CS       0.25      0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
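
A sketch of the weighted average (again with illustrative helper names):

```python
# Average conditional entropy H(Y|X) = sum_j P(X = v_j) * H(Y | X = v_j).
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def cond_entropy(X, Y):
    """Weighted average of the specific conditional entropies of Y given X."""
    n = len(X)
    return sum((c / n) * entropy([y for x, y in zip(X, Y) if x == v])
               for v, c in Counter(X).items())

X = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
Y = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

print(cond_entropy(X, Y))  # 0.5 = 0.5*1 + 0.25*0 + 0.25*0
```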

Information Gain. Definition: IG(Y|X) = I must transmit Y; how many bits, on average, would it save me if both ends of the line knew X? IG(Y|X) = H(Y) - H(Y|X). Example (same table as above): H(Y) = 1 and H(Y|X) = 0.5, so IG(Y|X) = 1 - 0.5 = 0.5.
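
Putting the pieces together, a sketch that reproduces IG(Y|X) = 0.5 (helper names are illustrative):

```python
# Information gain IG(Y|X) = H(Y) - H(Y|X) for the College Major / "XBOX" data.
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def cond_entropy(X, Y):
    n = len(X)
    return sum((c / n) * entropy([y for x, y in zip(X, Y) if x == v])
               for v, c in Counter(X).items())

def info_gain(X, Y):
    """Bits saved, on average, when transmitting Y if both sides know X."""
    return entropy(Y) - cond_entropy(X, Y)

X = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
Y = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

print(info_gain(X, Y))  # 0.5
```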

Decision Tree
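
The decision-tree slides in the original deck are figures. As a rough sketch of the step they illustrate, the code below picks the root split as the attribute with the highest information gain; the second attribute, HasJob, and its values are made up purely for illustration and are not from the recitation:

```python
# Choosing the best split by information gain, the core step of ID3-style
# decision tree growth. Sketch only; HasJob is a hypothetical attribute.
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(column, labels):
    n = len(labels)
    cond = sum((c / n) * entropy([y for x, y in zip(column, labels) if x == v])
               for v, c in Counter(column).items())
    return entropy(labels) - cond

# Records: (Major, HasJob) -> Likes "XBOX".
rows = [("Math", "Yes", "Yes"), ("History", "No", "No"),  ("CS", "No", "Yes"),
        ("Math", "Yes", "No"),  ("Math", "No", "No"),     ("CS", "Yes", "Yes"),
        ("History", "Yes", "No"), ("Math", "No", "Yes")]
labels = [r[2] for r in rows]
attributes = {"Major": [r[0] for r in rows], "HasJob": [r[1] for r in rows]}

gains = {a: round(info_gain(col, labels), 3) for a, col in attributes.items()}
print(gains)                       # {'Major': 0.5, 'HasJob': 0.0}
best = max(gains, key=gains.get)
print("split the root on:", best)  # Major
```

Growing the full tree amounts to repeating this choice recursively on each branch until a node is pure (or no attributes remain), and then, as the next slide's topic suggests, pruning branches that do not generalize.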

Tree pruning

Probability

Test your understanding: does it hold all the time? Only when X and Y are independent? Or can it fail even if X and Y are independent?