Budgeted Distribution Learning of Belief Net Parameters
Liuyang Li, Barnabas Poczos, Csaba Szepesvari, Russell Greiner
ICML 2010, the 27th International Conference on Machine Learning

Presentation transcript:

Context

BGL: Budgeted Generative (-) Learning. The learner is given nothing about the training instances and must pay for every feature value it observes; there are no free "labels" or "attributes", and no variable is singled out as a target. The goal is to produce a generative model, subject to a fixed total budget. In the taxonomy {Active, Budgeted} x {Discriminative, Generative} x {Label, Attribute, -}, BGL occupies the budgeted, generative, "-" cell: budgeted parameter learning.

Foundations

- Nodes correspond to random variables X; arcs correspond to dependencies.
- The parameters quantify the conditional distribution of each variable given its parents, and are drawn from a distribution θ ~ p(·).
- Here the variables are discrete, with (independent) Beta/Dirichlet priors.

Typical Parameter Learning: Motivation

Bayesian networks (BNs) model a joint distribution and are used in many applications. How do we GET the BN? Hand-engineering one requires a person who KNOWS the structure and the parameters, so instead we learn BNs from data. Most methods assume the data are available up front, but they might not be: data can be costly to obtain (time, energy, ...), and medical tests, for example, can be very expensive.

Loss Function

- A random variable X ~ Pr(· | θ): the parameter tuple θ induces a distribution over X. Recall θ ~ p(·); a single point estimate is the mean E[θ].
- For any parameter value θ', the loss of using θ' when the true parameters are θ is
  KL(θ || θ') = Σ_x Pr(x | θ) ln[ Pr(x | θ) / Pr(x | θ') ].
- We do not know θ, so we average: J(p(·)) = E_{θ~p(·)}[ KL(θ || E[θ]) ].
- A set of probes A = {(A,1), (D,1), (D,2), (A,2), (E,1)} (attribute, instance pairs) returns values X_A (a random variable), e.g. (A,1,b), (D,1,x), (D,2,+), (A,2,a), (E,1,0).
- θ_{X_A} = E[θ | X_A] is the posterior-mean parameter tuple given X_A.
- J(A) = loss of the posterior based on the probes A = E_{θ~p(·)} E_{X_A ~ Pr(·|θ)}[ KL(θ || θ_{X_A}) ].

Problem Definition

Given:
- the structure of the belief net and a prior over its parameters;
- no initial data, but a budget B ∈ ℝ+ to spend acquiring data.

Collect data by performing a sequence of probes: probe d_ij obtains the value of attribute i for patient j, at cost c_i, so the cost of K probes is the sum of their individual costs. After spending B, use the collected data to estimate the model parameters.

Goal: an accurate generative model of the joint distribution over the features.

[Figure: worked example. Two patients plus a response variable, with probe costs of $4, $18, $5, and $12 (response) and a total budget of $30. The learner buys probes one at a time (observing values b, x, +, a, 0); the remaining budget falls from $30 to $0, and the Beta posteriors on the parameters (e.g. θ_{A=1}: Beta(5, 6), θ_{B=1}: Beta(3, 4), θ_{D=1|B=b}: Beta(4, 1)) are updated after each probe.]

Related Work

IAL: Interventional Active Learning of generative BNs [Tong & Koller, 2001a,b; Murphy, 2001; Steck & Jaakkola, 2002]:
- the learner sets the values of certain features (interventions) over all training instances;
- the learner then receives ALL of the remaining values of each specified instance;
- it seeks the BN parameters optimizing the generative distribution.

Differences:
- BGL cannot see any values for free: it must pay for each probe.
- BGL has an explicit budget.
- BGL can pay for only some features of an instance.

[Figure: the data obtained by IAL versus BGL on the example network (partial rows of the instance-by-feature matrix over A, B, C, D, E and the corresponding Beta posteriors), and their positions in the Active/Budgeted, Discriminative/Generative, Label/Attribute taxonomy.]

Proposition: when the variables are independent, the loss function decomposes as
  J(A) = Σ_i J_i(A_i),
where A_i is the set of probes of variable X_i, and J_i(A_i) depends only on the size of A_i: if |A_i| = |A_i'| then J_i(A_i) = J_i(A_i'). So we can write J_i(A_i) = J_i(|A_i|) = J_i(k).
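Before stating the monotonicity and supermodularity properties of J_i(·), here is a minimal Monte Carlo sketch of this per-variable loss for a Bernoulli variable with a Beta prior. This is my own illustration, not code from the poster: it assumes NumPy, the helper names (bernoulli_kl, J_i, delta_i) are invented, and it estimates J_i(k) by sampling rather than via the O(k) closed-form update mentioned below.

```python
import numpy as np

def bernoulli_kl(theta, theta_hat, eps=1e-12):
    """KL( Bernoulli(theta) || Bernoulli(theta_hat) )."""
    t = np.clip(theta, eps, 1 - eps)
    h = np.clip(theta_hat, eps, 1 - eps)
    return t * np.log(t / h) + (1 - t) * np.log((1 - t) / (1 - h))

def J_i(k, alpha, beta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of J_i(k) = E_theta E_{X^k}[ KL(theta || E[theta | X^k]) ]
    for X_i ~ Bernoulli(theta) with theta ~ Beta(alpha, beta)."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(alpha, beta, size=n_samples)          # true parameters drawn from the prior
    successes = rng.binomial(k, theta)                     # outcomes of k probes of X_i
    post_mean = (alpha + successes) / (alpha + beta + k)   # posterior mean E[theta | X^k]
    return float(np.mean(bernoulli_kl(theta, post_mean)))

def delta_i(k, alpha, beta):
    """Marginal reduction Delta_i(k) = J_i(k) - J_i(k+1) from buying one more probe."""
    return J_i(k, alpha, beta) - J_i(k + 1, alpha, beta)

if __name__ == "__main__":
    # The reductions should come out positive and shrinking in k, up to Monte Carlo noise,
    # in line with the monotonicity and supermodularity stated next.
    print([round(delta_i(k, 1.0, 1.0), 4) for k in range(5)])
```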
For X_i ~ Bernoulli(θ_i) with θ_i ~ Beta(α_i, β_i), J_i(·) is monotone non-increasing and supermodular:
- Monotone: A ⊆ A' implies J(A) ≥ J(A').
- Supermodular: A ⊆ A' implies J(A) - J(A ∪ {v}) ≥ J(A') - J(A' ∪ {v}).

Let Δ_i(k) = J_i(k) - J_i(k+1). These reductions are always positive and shrink as k increases; this holds for Beta/Bernoulli variables. Computing Δ_i(k+1) from Δ_i(k) takes O(k) time for Beta priors.

Case: independent structure, non-adaptive probing (-Ad, -De): an allocation algorithm.

IGA(budget B; costs <c_i>; reductions <Δ_i(k)>):
  s = 0; a_1 = a_2 = ... = a_n = 0
  while s < B do
    j* := arg max_j { Δ_j(a_j) / c_j }
    a_{j*} += 1; s += c_{j*}
  return [a_1, ..., a_n]

(A runnable sketch of this rule appears at the end of this transcript.) IGA requires O((n' + B) B ln n') time and O(n') space, where c_min = min_i {c_i} and n' = n / c_min.

Theorem: if all the c_i are equal and all the J_i(·) are monotone and supermodular, then IGA computes the optimal allocation.

Theorem: it is NP-hard to compute the budgeted allocation policy that minimizes J(·), even when the variables are independent, when the costs can differ.

Baselines and the adaptive case (+Ad, -De):
- RoundRobin: instance #1: f_1, then f_2, ..., then f_n; instance #2: f_1, then f_2, ..., then f_n; and so on.
- Adaptive Greedy Algorithm (AGA): at each step, probe arg max_j { Δ_j(1) / c_j }, recomputing the reductions from the current posterior after each observed value.
- The optimal adaptive policy can be a factor of O(n) better than AGA.

[Figure: two-coin example comparing the policies as probe trees over A and B. IGA: one flip of A and one flip of B; AGA: flip B, then flip A; the optimal adaptive policy is shown as well.]

Empirical studies (independent structure): IGA and AGA beat Random and RoundRobin (Wilcoxon signed rank, p < 0.05). [Plots: results under constant costs and under non-constant costs.]

Case: dependent structure (-Ad, +De and +Ad, +De).

Given a prior that is a product of Dirichlet distributions and COMPLETE training tuples, the posterior is again a product of Dirichlet distributions. With incomplete training tuples (r values missing), the posterior is a MIXTURE of O(2^r) products of Dirichlet distributions.

"Collect complete tuples" fails to be optimal if the graph is not complete, if the priors are not BDe, or if the costs are not unit.

GreedyAlg, given partial data: approximate J(·), then choose probes greedily using this approximation.

Conjecture: if the BN is a complete graph over d variables, the parameters have BDe priors, and probes have unit cost, then the optimal allocation strategy is to collect COMPLETE tuples. This holds for d = 2.

Empirical studies (S = 5, B = 10): Greedy beats Random and RoundRobin (Wilcoxon signed rank, p < 0.05).
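To make the allocation procedure concrete, here is a minimal runnable sketch of the IGA rule from the pseudocode above, for the non-adaptive, independent-variable case. It is my own rendering, not the authors' code: the names, the precomputed table of reductions, and the check that stops once no probe fits in the remaining budget are assumptions on my part; the reductions could be supplied by the delta_i helper sketched earlier.

```python
def iga(budget, costs, delta):
    """Greedy budgeted allocation (IGA).

    budget: total budget B.
    costs:  costs[i] is the cost c_i of one probe of variable X_i.
    delta:  delta[i][k] is the reduction Delta_i(k) = J_i(k) - J_i(k+1) gained from
            the (k+1)-th probe of X_i (precomputed, covering as many probes per
            variable as the budget could buy).
    Returns the allocation [a_1, ..., a_n]: the number of probes bought per variable.
    """
    n = len(costs)
    alloc = [0] * n
    spent = 0.0
    while True:
        # Probes that still fit in the remaining budget.
        affordable = [i for i in range(n) if spent + costs[i] <= budget]
        if not affordable:
            return alloc
        # Pick the best "bang per buck": largest Delta_j(a_j) / c_j.
        j = max(affordable, key=lambda i: delta[i][alloc[i]] / costs[i])
        alloc[j] += 1
        spent += costs[j]

# Example with hypothetical numbers: two Bernoulli variables with unequal costs,
# using the delta_i helper from the earlier sketch to fill the reduction table.
# delta = [[delta_i(k, 1, 1) for k in range(20)],
#          [delta_i(k, 3, 4) for k in range(20)]]
# print(iga(budget=10, costs=[1, 2], delta=delta))
```

With equal costs this matches the setting of the optimality theorem above; with unequal costs it is only a heuristic, consistent with the NP-hardness result.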