Presentation on theme: "Budgeted Distribution Learning of Belief Net Parameters" — Presentation transcript:

Budgeted Distribution Learning of Belief Net Parameters
Liuyang Li, Barnabas Poczos, Csaba Szepesvari, Russell Greiner
ICML 2010, the 27th International Conference on Machine Learning

BGL: Budgeted Generative Learning
Given nothing about the training instances, pay for any feature value (there are no free "labels" or "attributes") to produce a generative model, subject to a fixed total budget.
[Figure: taxonomy axes (Active/Budgeted, Discriminative/Generative, Label/Attribute learning), locating BGL as budgeted generative attribute learning.]

Budgeted Parameter Learning: Foundations
- Nodes ↔ random variables X
- Arcs ↔ dependencies
- Parameters ↔ quantify the conditional distribution of each variable given its parents, drawn from a distribution θ ~ p(·)
- Here: variables are discrete; parameters are distributed as (independent) Beta/Dirichlet distributions

Motivation: Typical Parameter Learning
Bayesian networks (BNs) model a joint distribution and are used in many applications. How do we GET the BN?
- Hand-engineer it? That requires a person who KNOWS the structure and the parameters…
- ⇒ Learn the BN from data.
Most models assume the data are available initially, but they might not be: data can be costly to obtain (time, energy, …). Medical tests can be very expensive!

Loss Function
- R.v. X ~ Pr(· | θ): the parameter-tuple θ induces the distribution of X. Recall θ ~ p(·).
- Single estimate: use the mean, θ̂ = E[θ].
- For any parameter value θ, the loss of using the estimate θ̂ when the true value is θ is
    KL(θ || θ̂) = Σ_{x ∈ X} Pr(x | θ) ln[ Pr(x | θ) / Pr(x | θ̂) ]
- We don't know θ, so average over it:
    J(p(·)) = E_{θ ~ p(·)}[ KL(θ || E[θ]) ]

Probes and Posterior Loss
- A set of probes: A = { (A,1), (D,1), (D,2), (A,2), (E,1) }
- X_A = the values returned (a random variable), e.g. (A,1,b), (D,1,x), (D,2,+), (A,2,a), (E,1,0)
- θ_{X_A} = E[θ | X_A], the "mean parameter values" with respect to X_A
- J(A) = the loss of the posterior, based on the probes A
       = E_{θ ~ p(·)} E_{X_A ~ Pr(· | θ)}[ KL(θ || θ_{X_A}) ]

Problem Definition
Given: the structure of the belief net; a prior over the parameters; no initial data, but a budget B ∈ ℝ+ to use to acquire data.
- Collect data by performing a sequence of probes: probe d_ij obtains the value of attribute i for patient j, at cost c_i.
- The cost of K probes d_{i_1 j_1}, …, d_{i_K j_K} is Σ_k c_{i_k}.
- After spending B, use the data to estimate the model parameters.
Goal: an accurate generative model of the joint distribution over the features.

[Figure: worked example. Feature costs $4, $18, $5 and response cost $12; total budget $30. Priors: θ_A ~ Beta(5, 6), θ_B ~ Beta(3, 4), θ_{D|B=1} ~ Beta(4, 1), θ_{D|B=0} ~ Beta(4, 6). The learner issues five probes across Patient 1, Patient 2, and the response R, receiving the values b, x, +, a, 0 as the remaining budget falls $30 → $26 → $21 → $16 → $12 → $0.]

Related Work
IAL: Interventional Active Learning of generative BNs [Tong & Koller, 2001a,b]; [Murphy, 2001]; [Steck & Jaakkola, 2002]
- The learner sets the values of certain features (interventions) over all training instances.
- The learner receives ALL of the remaining values of each specified instance.
- IAL seeks the BN parameters optimizing the generative distribution.
Differences:
- BGL cannot see any values for free: it must pay for each probe.
- BGL has an explicit budget.
- BGL can pay for only some features of an instance.
[Figure: example over attributes A–E with priors θ_A ~ Beta(3, 3), θ_B ~ Beta(1, 1), θ_{D|B=1} ~ Beta(1, 1), θ_{D|B=0} ~ Beta(4, 1).]

Independent Variables: the Loss Decomposes
Proposition: when the variables are independent, the loss function decomposes as
    J(A) = Σ_i J_i(A_i),
and each J_i(A_i) depends only on the size of A_i:
    |A_i| = |A_i′| ⇒ J_i(A_i) = J_i(A_i′).
So view J_i(A_i) = J_i(|A_i|) = J_i(k); a Monte Carlo sketch of J_i(k) follows below.
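The per-variable loss J_i(k) in the Beta/Bernoulli case can be illustrated with a minimal Monte Carlo sketch (this is not from the slides; the function names, sample count, and seed are illustrative choices): draw θ from the prior, simulate k probe outcomes, and average KL(θ || E[θ | outcomes]). At k = 0 this is exactly J(p(·)) = E_θ[KL(θ || E[θ])] from the Loss Function slide.

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL( Bernoulli(p) || Bernoulli(q) )."""
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def J_i(k, alpha, beta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of J_i(k) = E_theta E_data[ KL(theta || E[theta | data]) ]
    for X_i ~ Bernoulli(theta) with theta ~ Beta(alpha, beta), after k probes of X_i."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(alpha, beta, size=n_samples)     # true parameter, drawn from the prior
    heads = rng.binomial(k, theta)                    # outcomes of k probes, per draw
    post_mean = (alpha + heads) / (alpha + beta + k)  # conjugate Beta posterior mean
    return float(np.mean(kl_bernoulli(theta, post_mean)))

# J_i depends only on the probe count k, and is monotone non-increasing in k:
for k in range(6):
    print(k, round(J_i(k, alpha=5, beta=6), 5))
```

Differencing these values gives the reductions Δ_i(k) = J_i(k) − J_i(k+1) that drive the allocation algorithms below; the slides note an exact O(k) recurrence for Beta priors, so the Monte Carlo version is for illustration only.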
Properties of J_i(·)
For X_i ~ Bernoulli(θ) with θ ~ Beta(α, β), J_i(·) is monotone non-increasing and supermodular:
- Monotone: A ⊆ A′ ⇒ J(A) ≥ J(A′)
- Supermodular: A ⊆ A′ ⇒ J(A) − J(A ∪ {v}) ≥ J(A′) − J(A′ ∪ {v})
Let Δ_i(k) = J_i(k) − J_i(k+1). These reductions are always positive, and get smaller as k increases; true for Beta/Bernoulli variables!
Computing Δ_i(k+1) from Δ_i(k) requires O(k) time for Beta priors. IGA requires O((n′ + B) B ln n′) time and O(n′) space, where c_min = min_i { c_i } and n′ = n / c_min.

The Allocation Algorithm IGA (roadmap: independent structure, non-adaptive: the −Ad, −De cell)

    IGA( budget B; costs ⟨c_i⟩; reductions ⟨Δ_i(k)⟩ ):
      s = 0; a_1 = a_2 = … = a_n = 0
      while s < B do
        j* := arg max_j { Δ_j(a_j) / c_j }
        a_{j*} += 1; s += c_{j*}
      return [a_1, …, a_n]

(A runnable sketch of IGA appears at the end of the transcript.)

Theorem: if all c_i are equal, and all J_i(·) are monotone and supermodular, then IGA computes the optimal allocation.
Theorem: it is NP-hard to compute the budgeted allocation policy that minimizes J(·) when costs can differ, even if the variables are independent.

Baselines and the Adaptive Variant
- RoundRobin: instance #1: f_1, then f_2, …, then f_n; instance #2: f_1, then f_2, …, then f_n; and so on.
- Adaptive Greedy Algorithm (AGA): repeatedly probe arg max_j { Δ_j(1) / c_j }, recomputing the reductions from the current posterior after each response.
- The optimal adaptive policy can be O(n) better than AGA. Example with two coins A and B: IGA allocates 1 flip to A and 1 flip to B; AGA flips B, and then flips A; the optimal adaptive policy chooses its second flip based on the outcome of the first. [Figure: decision trees over the flips of A and B.]

Empirical Studies (independent structure, non-adaptive and adaptive)
[Figure: learning curves under constant costs and non-constant costs.]
IGA, AGA > Random, RoundRobin (Wilcoxon signed rank, p < 0.05).

Dependent Structure
Given a prior that is a product of Dirichlet distributions and COMPLETE training tuples, the posterior is again a product of Dirichlet distributions.
With incomplete training tuples (r values missing), the posterior is a MIXTURE of O(2^r) products of Dirichlet distributions.
"Complete tuples" can fail to be optimal: if the graph is not complete; if the priors are not BDe; if costs are not unit.
GreedyAlg, given partial data: approximate J(·), then run greedy using this approximation (roadmap: dependent structure, the −Ad, +De and +Ad, +De cells).

Conjecture: if the BN is a complete graph over d variables, the parameters have BDe priors, and probes have unit cost c_i, then the optimal allocation strategy is to collect COMPLETE tuples. Holds for d = 2, S = 5, B = 10.
Greedy > Random, RoundRobin (Wilcoxon signed rank, p < 0.05).
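Below is a minimal runnable sketch of IGA as given in the pseudocode above, reusing the J_i estimator sketched earlier for the reductions (the names iga and delta, the demo priors, and the costs are illustrative, not from the paper). One deliberate design choice: unlike the raw pseudocode's "while s < B", this version only takes probes it can still afford, so the spend never exceeds B.

```python
def iga(budget, costs, delta):
    """Greedy budgeted allocation (IGA) from the slides' pseudocode.

    budget: total budget B
    costs:  per-probe costs c_1..c_n
    delta:  delta(j, k) -> Delta_j(k) = J_j(k) - J_j(k+1), the expected loss
            reduction from the (k+1)-th probe of variable j
    Returns the number of probes a_j allocated to each variable.
    """
    n = len(costs)
    a = [0] * n     # probes allocated so far, per variable
    spent = 0.0     # budget spent so far
    while True:
        # only consider probes we can still afford
        affordable = [j for j in range(n) if spent + costs[j] <= budget]
        if not affordable:
            return a
        # greedily pick the best loss reduction per unit cost
        j_star = max(affordable, key=lambda j: delta(j, a[j]) / costs[j])
        a[j_star] += 1
        spent += costs[j_star]

# Demo with three independent Beta/Bernoulli variables. Monte Carlo noise can
# perturb the ranking; the slides' exact O(k) recurrence for Beta priors avoids this.
priors = [(5, 6), (3, 4), (1, 1)]
costs = [4, 5, 12]

def delta(j, k):
    a_, b_ = priors[j]
    return J_i(k, a_, b_) - J_i(k + 1, a_, b_)

print(iga(budget=30, costs=costs, delta=delta))  # per-variable probe counts
```

For the adaptive variant (AGA), delta would instead be recomputed from the current posterior (α plus observed heads, β plus observed tails) after every probe response, rather than fixed up front from the prior.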

