Published by Tyrone Johns. Modified over 9 years ago.
1
Describing Data
The canonical descriptive strategy is to describe the data in terms of their underlying distribution.
As usual, we have a p-dimensional data matrix with variables X_1, …, X_p.
The joint distribution is P(X_1, …, X_p).
The joint gives us complete information about the variables.
Given the joint distribution, we can answer any question about the relationships among any subset of variables:
–are X_2 and X_5 independent?
–generating approximate answers to queries for large databases, or selectivity estimation
Given a query (conditions that observations must satisfy), estimate the fraction of rows that satisfy this condition (the selectivity of the query).
These estimates are needed during query optimization.
If we have a good approximation for the joint distribution of the data, we can use it to efficiently compute approximate selectivities.
2
Graphical Models In the next 3-4 lectures, we will be studying graphical models e.g. Bayesian networks, Bayes nets, Belief nets, Markov networks, etc. We will study: –representation –reasoning –learning Materials based on upcoming book by Nir Friedman and Daphne Koller. Slides courtesy of Nir Friedman.
3
Probability Distributions
Let X_1,…,X_p be random variables.
Let P be a joint distribution over X_1,…,X_p.
If the variables are binary, then we need O(2^p) parameters to describe P.
Can we do better?
Key idea: use properties of independence.
4
Independent Random Variables
Two variables X and Y are independent if
–P(X = x|Y = y) = P(X = x) for all values x, y
–That is, learning the value of Y does not change the prediction of X
If X and Y are independent then
–P(X,Y) = P(X|Y)P(Y) = P(X)P(Y)
In general, if X_1,…,X_p are independent, then
–P(X_1,…,X_p) = P(X_1)...P(X_p)
–Requires O(p) parameters
5
Conditional Independence
Unfortunately, most random variables of interest are not independent of each other.
A more suitable notion is that of conditional independence.
Two variables X and Y are conditionally independent given Z if
–P(X = x|Y = y, Z = z) = P(X = x|Z = z) for all values x, y, z
–That is, learning the value of Y does not change the prediction of X once we know the value of Z
–notation: I(X, Y | Z)
6
Example: Naïve Bayesian Model
A common model in early diagnosis:
–Symptoms are conditionally independent given the disease (or fault)
Thus, if
–X_1,…,X_p denote the symptoms exhibited by the patient (headache, high fever, etc.) and
–H denotes the hypothesis about the patient's health
then P(X_1,…,X_p,H) = P(H)P(X_1|H)…P(X_p|H).
This naïve Bayesian model allows compact representation
–It does embody strong independence assumptions
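To make the compactness claim concrete, here is a minimal sketch that counts free parameters for binary symptoms and a binary hypothesis. The function names are hypothetical, and the binary restriction is an assumption made only for this illustration.

```python
def full_joint_params(p):
    """Free parameters of an unrestricted joint over p binary symptoms and a binary H."""
    return 2 ** (p + 1) - 1


def naive_bayes_params(p):
    """One parameter for P(H), plus one per symptom for each of the two values of H."""
    return 1 + 2 * p


if __name__ == "__main__":
    for p in (5, 10, 20):
        print(p, full_joint_params(p), naive_bayes_params(p))
```

The full joint grows exponentially in p, while the naïve Bayesian model grows linearly, which is the price paid for the strong independence assumptions.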
7
Example: Family trees
Noisy stochastic process. Example: Pedigree
–A node represents an individual's genotype
Modeling assumptions:
–Ancestors can affect descendants' genotype only by passing genetic materials through intermediate generations
[Figure: pedigree graph over Homer, Marge, Bart, Lisa, Maggie]
8
Markov Assumption
We now make this independence assumption more precise for directed acyclic graphs (DAGs).
Each random variable X is independent of its non-descendants, given its parents Pa(X).
Formally, I(X, NonDesc(X) | Pa(X))
[Figure: DAG around a node X with nodes labeled ancestor, parent, descendant and non-descendant, including Y_1 and Y_2]
9
Markov Assumption Example
In this example:
–I( E, B )
–I( B, {E, R} )
–I( R, {A, B, C} | E )
–I( A, R | B, E )
–I( C, {B, E, R} | A )
[Figure: DAG over Earthquake (E), Radio (R), Burglary (B), Alarm (A), Call (C)]
10
I-Maps
A DAG G is an I-Map of a distribution P if all the Markov assumptions implied by G are satisfied by P
(assuming G and P both use the same set of random variables).
Examples:
[Figure: two example graphs over X and Y]
11
Factorization
Given that G is an I-Map of P, can we simplify the representation of P?
Example: [Figure: graph over X and Y with no edge]
Since I(X,Y), we have that P(X|Y) = P(X).
Applying the chain rule, P(X,Y) = P(X|Y) P(Y) = P(X) P(Y).
Thus, we have a simpler representation of P(X,Y).
12
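Factorization Theorem
Thm: if G is an I-Map of P, then
$P(X_1, \ldots, X_p) = \prod_i P(X_i \mid Pa(X_i))$
Proof:
By the chain rule: $P(X_1, \ldots, X_p) = \prod_i P(X_i \mid X_1, \ldots, X_{i-1})$
wlog. X_1,…,X_p is an ordering consistent with G.
From this assumption: $Pa(X_i) \subseteq \{X_1, \ldots, X_{i-1}\}$ and $\{X_1, \ldots, X_{i-1}\} - Pa(X_i) \subseteq NonDesc(X_i)$
Since G is an I-Map, I(X_i, NonDesc(X_i) | Pa(X_i))
Hence, $I(X_i, \{X_1, \ldots, X_{i-1}\} - Pa(X_i) \mid Pa(X_i))$
We conclude, P(X_i | X_1,…,X_{i-1}) = P(X_i | Pa(X_i))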
14
Factorization Example
By the chain rule alone:
P(C,A,R,E,B) = P(B)P(E|B)P(R|E,B)P(A|R,B,E)P(C|A,R,B,E)
versus, using the independencies of the graph:
P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
[Figure: DAG over Earthquake, Radio, Burglary, Alarm, Call]
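The second factorization can be checked numerically. The sketch below builds the joint from local conditional probabilities and confirms that it sums to one; every probability value is a hypothetical number chosen only for illustration, not taken from the slides.

```python
from itertools import product

# Hypothetical CPT entries (illustration only): each is P(var = 1 | parents).
P_B = 0.01
P_E = 0.02
P_R = {0: 0.0001, 1: 0.9}                                        # keyed by e
P_A = {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95}    # keyed by (b, e)
P_C = {0: 0.05, 1: 0.8}                                          # keyed by a


def bern(p_one, value):
    """Probability of a binary value when P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(b, e, r, a, c):
    """P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_R[e], r)
            * bern(P_A[(b, e)], a) * bern(P_C[a], c))


total = sum(joint(*values) for values in product((0, 1), repeat=5))
n_local_params = 1 + 1 + len(P_R) + len(P_A) + len(P_C)   # parameters in the factored form
n_full_params = 2 ** 5 - 1                                # parameters in an unrestricted joint

if __name__ == "__main__":
    print(total, n_local_params, n_full_params)
```

For five binary variables the factored form needs 10 numbers instead of 31, and the gap widens quickly as variables are added.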
15
Consequences
We can write P in terms of “local” conditional probabilities.
If G is sparse,
–that is, |Pa(X_i)| < k,
then each conditional probability can be specified compactly
–e.g. for binary variables, these require O(2^k) params.
and the representation of P is compact
–linear in the number of variables
16
Pause…Summary
We defined the following concepts:
The Markov independencies of a DAG G
–I(X_i, NonDesc(X_i) | Pa_i)
G is an I-Map of a distribution P
–If P satisfies the Markov independencies implied by G
We proved the factorization theorem:
if G is an I-Map of P, then $P(X_1, \ldots, X_p) = \prod_i P(X_i \mid Pa_i)$
17
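Conditional Independencies
Let Markov(G) be the set of Markov independencies implied by G.
The factorization theorem shows:
G is an I-Map of P ⇒ $P(X_1, \ldots, X_p) = \prod_i P(X_i \mid Pa_i)$
We can also show the opposite:
Thm: $P(X_1, \ldots, X_p) = \prod_i P(X_i \mid Pa_i)$ ⇒ G is an I-Map of P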
18
Proof (Outline) Example: X Y Z
19
Implied Independencies
Does a graph G imply additional independencies as a consequence of Markov(G)?
We can define a logic of independence statements.
Some axioms:
–I( X ; Y | Z ) ⇒ I( Y ; X | Z )
–I( X ; Y_1, Y_2 | Z ) ⇒ I( X ; Y_1 | Z )
20
d-separation
A procedure d-sep(X; Y | Z, G) that, given a DAG G and sets X, Y, and Z, returns either yes or no.
Goal: d-sep(X; Y | Z, G) = yes iff I(X;Y|Z) follows from Markov(G)
21
Paths
Intuition: dependency must “flow” along paths in the graph.
A path is a sequence of neighboring variables.
Examples:
–R, E, A, B
–C, A, E, R
[Figure: DAG over Earthquake, Radio, Burglary, Alarm, Call]
22
Paths
We want to know when a path is
–active -- creates dependency between end nodes
–blocked -- cannot create dependency between end nodes
We want to classify situations in which paths are active.
23
Path Blockage
Three cases:
–Common cause: R ← E → A. The path is blocked when E is observed and active (unblocked) otherwise.
–Intermediate cause: E → A → C. The path is blocked when A is observed and active otherwise.
–Common effect: E → A ← B, with C a descendant of A. The path is blocked when neither A nor C is observed, and active when A or C is observed.
26
Path Blockage -- General Case
A path is active, given evidence Z, if
–whenever we have the configuration A → B ← C on the path, B or one of its descendants is in Z
–no other nodes in the path are in Z
A path is blocked, given evidence Z, if it is not active.
27
Example
–d-sep(R,B) = yes
–d-sep(R,B|A) = no
–d-sep(R,B|E,A)?
[Figure: DAG over E, B, A, C, R]
30
d-Separation
X is d-separated from Y, given Z, if all paths from a node in X to a node in Y are blocked, given Z.
Checking d-separation can be done efficiently (linear time in the number of edges):
–Bottom-up phase: mark all nodes whose descendants are in Z
–X to Y phase: traverse (BFS) all edges on paths from X to Y and check if they are blocked
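The two phases above can be sketched as follows for single nodes x and y. This is one possible reachability-style implementation under stated assumptions, not necessarily the procedure used in the lecture: the `parents` dictionary format and the function name are choices made for this illustration, and y is assumed not to be in z.

```python
from collections import deque


def d_separated(parents, x, y, z):
    """Return True if node x is d-separated from node y given the set z.

    parents maps every node of the DAG to the list of its parents.
    """
    z = set(z)
    children = {n: [] for n in parents}
    for child, pas in parents.items():
        for pa in pas:
            children[pa].append(child)

    # Bottom-up phase: mark every node that is in z or has a descendant in z.
    marked, stack = set(), list(z)
    while stack:
        n = stack.pop()
        if n not in marked:
            marked.add(n)
            stack.extend(parents[n])

    # Search phase: BFS over (node, direction) pairs. "up" means we arrived
    # from a child of the node, "down" means we arrived from a parent.
    queue, visited = deque([(x, "up")]), set()
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == y:
            return False                      # an active path reaches y
        if direction == "up" and node not in z:
            queue.extend((pa, "up") for pa in parents[node])
            queue.extend((ch, "down") for ch in children[node])
        elif direction == "down":
            if node not in z:                 # chain: keep moving downward
                queue.extend((ch, "down") for ch in children[node])
            if node in marked:                # common effect is activated
                queue.extend((pa, "up") for pa in parents[node])
    return True


# The Earthquake / Radio / Burglary / Alarm / Call network.
NETWORK = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

if __name__ == "__main__":
    print(d_separated(NETWORK, "R", "B", []))
    print(d_separated(NETWORK, "R", "B", ["A"]))
    print(d_separated(NETWORK, "R", "B", ["E", "A"]))
```

On this network the three queries from the earlier example come out as yes, no, yes: observing A activates the common effect, and additionally observing E blocks the path through the common cause.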
31
Soundness Thm: If –G is an I-Map of P –d-sep( X; Y | Z, G ) = yes then –P satisfies I( X; Y | Z ) Informally, Any independence reported by d-separation is satisfied by underlying distribution
32
Completeness Thm: If d-sep( X; Y | Z, G ) = no then there is a distribution P such that –G is an I-Map of P –P does not satisfy I( X; Y | Z ) Informally, Any independence not reported by d-separation might be violated by the underlying distribution We cannot determine this by examining the graph structure alone
33
I-Maps revisited
The fact that G is an I-Map of P might not be that useful.
For example, complete DAGs:
–A DAG G is complete if we cannot add an arc without creating a cycle
These DAGs do not imply any independencies.
Thus, they are I-Maps of any distribution.
[Figure: two complete DAGs over X_1, X_2, X_3, X_4]
34
Minimal I-Maps
A DAG G is a minimal I-Map of P if
–G is an I-Map of P
–If G' ⊂ G, then G' is not an I-Map of P
Removing any arc from G introduces (conditional) independencies that do not hold in P.
35
Minimal I-Map Example
If the first graph is a minimal I-Map, then the remaining graphs are not I-Maps.
[Figure: five DAGs over X_1, X_2, X_3, X_4; the first is the minimal I-Map, the other four are not I-Maps]
36
Constructing minimal I-Maps
The factorization theorem suggests an algorithm:
Fix an ordering X_1,…,X_n
For each i,
–select Pa_i to be a minimal subset of {X_1,…,X_{i-1}} such that I(X_i ; {X_1,…,X_{i-1}} - Pa_i | Pa_i)
Clearly, the resulting graph is a minimal I-Map.
37
Non-uniqueness of minimal I-Map
Unfortunately, there may be several minimal I-Maps for the same distribution
–Applying the I-Map construction procedure with different orders can lead to different structures
[Figure: two DAGs over E, B, A, C, R: the original I-Map, and the I-Map obtained with the order C, R, A, E, B]
38
Choosing Ordering & Causality The choice of order can have drastic impact on the complexity of minimal I-Map Heuristic argument: construct I-Map using causal ordering among variables Justification? –It is often reasonable to assume that graphs of causal influence should satisfy the Markov properties.
39
P-Maps A DAG G is P-Map (perfect map) of a distribution P if –I(X; Y | Z) if and only if d-sep(X; Y |Z, G) = yes Notes: A P-Map captures all the independencies in the distribution P-Maps are unique, up to DAG equivalence
40
P-Maps
Unfortunately, some distributions do not have a P-Map.
Example:
A minimal I-Map: [Figure: DAG over A, B, C]
This is not a P-Map since I(A;C) but d-sep(A;C) = no
41
Bayesian Networks
A Bayesian network specifies a probability distribution via two components:
–A DAG G
–A collection of conditional probability distributions P(X_i|Pa_i)
The joint distribution P is defined by the factorization
$P(X_1, \ldots, X_n) = \prod_i P(X_i \mid Pa_i)$
Additional requirement: G is a minimal I-Map of P
42
Summary We explored DAGs as a representation of conditional independencies: –Markov independencies of a DAG –Tight correspondence between Markov(G) and the factorization defined by G –d-separation, a sound & complete procedure for computing the consequences of the independencies –Notion of minimal I-Map –P-Maps This theory is the basis for defining Bayesian networks
43
Markov Networks We now briefly consider an alternative representation of conditional independencies Let U be an undirected graph Let N i be the set of neighbors of X i Define Markov(U) to be the set of independencies I( X i ; {X 1,…,X n } - N i - {X i } | N i ) U is an I-Map of P if P satisfies Markov(U)
44
Example
This graph implies that
–I(A; C | B, D)
–I(B; D | A, C)
Note: this example does not have a directed P-Map.
[Figure: undirected cycle A -- B -- C -- D -- A]
45
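Markov Network Factorization
Thm: if
–P is strictly positive, that is, P(x_1, …, x_n) > 0 for all assignments
then
–U is an I-Map of P if and only if there is a factorization
$P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{i=1}^{k} f_i(C_i)$
where C_1, …, C_k are the maximal cliques in U and Z is a normalizing constant
Alternative form:
$P(x_1, \ldots, x_n) = \frac{1}{Z} \exp\left( \sum_{i=1}^{k} \log f_i(C_i) \right)$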
46
Bayesian Networks to Markov Networks
We've seen that Pa_i separates X_i from its non-descendants.
What separates X_i from the rest of the nodes?
Markov Blanket: minimal set Mb_i such that I(X_i ; {X_1,…,X_n} - Mb_i - {X_i} | Mb_i)
To construct the Markov blanket we need to consider all paths from X_i to other nodes.
47
Markov Blanket (cont)
Three types of paths:
–“Upward” paths: blocked by parents
–“Downward” paths: blocked by children
–“Sideways” paths: blocked by “spouses”
[Figure: node X with its parents, children and spouses]
50
Markov Blanket (cont)
We define the Markov Blanket for a DAG G.
Mb_i consists of
–Pa_i
–X_i's children
–Parents of X_i's children (excluding X_i)
Easy to see: if X_j is in Mb_i then X_i is in Mb_j
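The definition translates directly into code. A minimal sketch, assuming the DAG is given as a dictionary from each node to its list of parents (a representation chosen for this illustration):

```python
def markov_blanket(parents, x):
    """Parents of x, children of x, and the other parents of x's children."""
    children = [n for n, pas in parents.items() if x in pas]
    blanket = set(parents[x]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(x)          # x itself is excluded from its own blanket
    return blanket


# The Earthquake / Radio / Burglary / Alarm / Call network.
NETWORK = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

if __name__ == "__main__":
    for node in NETWORK:
        print(node, sorted(markov_blanket(NETWORK, node)))
```

Here Burglary is in Earthquake's blanket only as a "spouse" through their common child Alarm, and the symmetry property from the slide can be checked for every pair of nodes.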
51
Moralized Graphs
Given a DAG G, we define the moralized graph of G to be an undirected graph U such that
–if X → Y in G, then X -- Y in U
–if X → Y ← Z in G, then X -- Z in U
–no other edges are in U
In other words: X -- Y in U if X is in Y's Markov blanket.
If G is an I-Map of P, then U is also an I-Map of P.
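The two edge rules can be sketched as below. The parent-dictionary representation and the use of frozensets for undirected edges are assumptions made for this illustration.

```python
from itertools import combinations


def moralize(parents):
    """Return the set of undirected edges of the moralized graph."""
    edges = set()
    for child, pas in parents.items():
        for pa in pas:
            edges.add(frozenset((pa, child)))    # X -> Y becomes X -- Y
        for u, v in combinations(pas, 2):
            edges.add(frozenset((u, v)))         # parents of a common child are connected
    return edges


NETWORK = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}
MORAL_EDGES = moralize(NETWORK)

if __name__ == "__main__":
    print(sorted(tuple(sorted(edge)) for edge in MORAL_EDGES))
```

In the alarm network the only added edge is B -- E, which connects the two parents of Alarm.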
52
Markov Networks vs. Bayesian Networks
The transformation to a moral graph loses information about independencies.
It is easy to show that undirected graphs satisfy:
–I( X ; Y | Z ) ⇒ I( X ; Y | Z, Z' )
Adding more evidence does not create dependencies.
Thus, Markov networks cannot model “explaining away”.
53
Example
I( E ; B ) is not satisfied by the moralized graph.
[Figure: the DAG over E, B, A, C, R and its moralized graph]
54
Relationship between Directed & Undirected Models Chain Graphs Directed Graphs Undirected Graphs
55
CPDs
So far, we focused on how to represent independencies using DAGs.
The “other” component of a Bayesian network is the specification of the conditional probability distributions (CPDs).
We start with the simplest representation of CPDs and then discuss additional structure.
56
Tabular CPDs
When the variables of interest are all discrete, the common representation is as a table.
For example, P(C|A,B) can be represented by

A  B  P(C = 0 | A, B)  P(C = 1 | A, B)
0  0  0.25             0.75
0  1  0.50             0.50
1  0  0.12             0.88
1  1  0.33             0.67
57
Tabular CPDs
Pros:
–Very flexible, can capture any CPD of discrete variables
–Can be easily stored and manipulated
Cons:
–Representation size grows exponentially with the number of parents!
–Unwieldy to assess probabilities for more than a few parents
58
Structured CPD To avoid the exponential blowup in representation, we need to focus on specialized types of CPDs This comes at a cost in terms of expressive power We now consider several types of structured CPDs
59
Deterministic CPDs
The simplest form of CPD is one where P(X|Y_1,…,Y_k) is defined as
$P(x \mid y_1, \ldots, y_k) = 1$ if $f(y_1, \ldots, y_k) = x$, and $0$ otherwise
where f is some function.
In this case X is determined by the values of Y_1,…,Y_k.
Depending on the class of functions we are willing to consider, this representation can be compact.
60
Deterministic CPDs and d-separation
Deterministic relations can induce additional independencies in a graph.
Example:
–Suppose that C is determined by A, B
–In the standard DAG we have d-sep(D; E | A, B) = no
–However, observing A and B implies that we also know the value of C
–Thus, we can conclude that Ind(D; E | A, B)
[Figure: DAG over A, B, C, D, E]
61
Deterministic CPDs and d-separation
General solution:
–Given a query d-sep(X ; Y | Z)
–While there is an X_i such that P(X_i | Pa_i) is a deterministic CPD and Pa_i ⊆ Z, set Z ← Z ∪ {X_i}
–Run d-sep(X ; Y | Z)
62
Causal Independence
Consider the following situation.
[Figure: Disease 1, Disease 2 and Disease 3 are parents of Fever]
In a tabular CPD, we need to assess the probability of fever in eight cases.
These involve all possible interactions between diseases.
For three diseases, this might be feasible…
For ten diseases, not likely…
63
Causal Independence
Simplifying assumption:
–Each disease attempts to cause fever, independently of the other diseases
–The patient has fever if one of the diseases “succeeds”
We can model this using a Bayesian network fragment.
[Figure: Disease 1, Disease 2, Disease 3 each point to a hypothetical variable Fever 1, Fever 2, Fever 3 (“Fever caused by Disease i”); these and Spontaneous Fever feed an OR gate, F = or(SF, F1, F2, F3), which determines Fever]
64
Noisy-Or CPD
Models P(X|Y_1,…,Y_k), where X, Y_1,…,Y_k are all binary
Parameters:
–p_i -- probability of X = 1 due to Y_i = 1
–p_0 -- probability of X = 1 due to other causes
Plugging these into the model we get
$P(X = 0 \mid y_1, \ldots, y_k) = (1 - p_0) \prod_{i : y_i = 1} (1 - p_i)$
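A small sketch of the noisy-or CPD, using the standard form in which X = 0 only if the leak and every active cause all fail. The function name and the example numbers are hypothetical.

```python
def noisy_or(p0, p, y):
    """P(X = 1 | y) for leak probability p0 and per-cause probabilities p."""
    prob_x_is_0 = 1.0 - p0                 # the "other causes" fail
    for p_i, y_i in zip(p, y):
        if y_i == 1:
            prob_x_is_0 *= 1.0 - p_i       # active cause i also fails
    return 1.0 - prob_x_is_0


if __name__ == "__main__":
    # Two diseases with hypothetical strengths 0.8 and 0.5, leak 0.1.
    for y in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        print(y, noisy_or(0.1, [0.8, 0.5], y))
```

The CPD over k parents is specified by k + 1 numbers, while the corresponding table would need 2^k rows.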
65
Noisy-Or CPD Benefits of noisy-or –“Reasonable” assumptions in many domains e.g., medical domain –Few parameters. –Each parameter can be estimated independently of the others The same idea can be extended to other functions: noisy-max, noisy-and, etc. Frequently used in large medical expert systems
66
Context Specific Independence
Consider the following examples:
Alarm sound depends on
–Whether the alarm was set before leaving the house
–Burglary
–Earthquake
Arriving on time depends on
–Travel route
–The congestion on the two possible routes
[Figure: Set, Earthquake and Burglary are parents of Alarm; Travel Route, Route 1 traffic and Route 2 traffic are parents of Arrive on time]
67
Context-Specific Independence
In both of these examples we have context-specific independencies (CSI)
–Independencies that depend on a particular value of one or more variables
In our examples:
–Ind( A ; B, E | S = 0 ): alarm sound is independent of B and E when the alarm is not set
–Ind( A ; R_2 | T = 1 ): arrival time is independent of traffic on route 2 if we choose to travel on route 1
68
Representing CSI
When we have such CSI, P(X | Y_1,…,Y_k) is the same for several values of Y_1,…,Y_k.
There are many ways of representing these regularities.
A natural representation: decision trees
–Internal nodes: tests on parents
–Leaves: probability distributions on X
Evaluate P(X | Y_1,…,Y_k) by traversing the tree.
[Figure: decision tree with tests on S, B and E, branches labeled 0 and 1, and leaf values .0, .8, .1, .7]
69
Detecting CSI
Given evidence on some nodes, we can identify the “relevant” parts of the trees
–This consists of the paths in the tree that are consistent with the context
Example
–Context S = 0
–Only one path of the tree is relevant
A parent is independent given the context if it does not appear on one of the relevant paths.
[Figure: decision tree with tests on S, B and E, branches labeled 0 and 1, and leaf values .0, .8, .1, .7]
70
CSI and d-separation
Once we represent local CSI in CPDs, we can deduce additional CSI in the DAG.
Example:
–Context S = 0
–Ind(A ; B, E | S = 0 )
–Thus, two edges become inactive in the DAG
–Possible conclusion: Ind(C ; B | S = 0)
[Figure: DAG over Earthquake, Radio, Burglary, Alarm, Call, Set]
71
Decision Tree CPDs Benefits Decision trees offer a flexible and intuitive language to represent CSI Incorporated into several commercial tools for constructing Bayesian networks Comparison to noisy-or Noisy-or CPDs require full trees to represent General decision tree CPDs cannot be represented by noisy-or
72
Continuous CPDs
When X is a continuous variable, we need to represent the density of X, given any value of its parents.
We do not have a general representation that can capture all possible conditional densities.
73
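Gaussian Distribution
One of the most common representations
Unconditional density:
$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
[Figure: plot of a Gaussian density; x axis from -4 to 4, y axis from 0 to 0.4]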
74
Linear-Gaussian CPDs
Represent P(X | Y_1,…,Y_k) as a Gaussian
–Fixed variance
–Mean depends linearly on the values of Y_1,…,Y_k
$P(X \mid y_1, \ldots, y_k) = N\left( a_0 + \sum_{i=1}^{k} a_i y_i,\ \sigma^2 \right)$
75
Linear Gaussian CPDs Let B be a Bayesian network of continuous variables with linear-Gaussian CPDs Then B defines a multivariate Gaussian distribution
76
Conditional Gaussian CPDs
A model for networks that combine discrete and continuous variables
If X is continuous, with parents
–Y_1,…,Y_k continuous
–Z_1,…,Z_l discrete
Conditional Gaussian (CG) CPD: for each joint value of Z_1,…,Z_l, define different linear-Gaussian parameters
Resulting multivariate distribution: a mixture of multivariate Gaussians
–Each assignment of values to the discrete variables selects a multivariate Gaussian over the continuous variables
77
Summary Many choices for representing CPDs Any “statistical” model of conditional distribution can be used –e.g., any regression model Representing structure in CPDs can have implications on independencies among variables
78
Inference in Bayesian Networks
79
Inference We now have compact representations of probability distributions: –Bayesian Networks –Markov Networks Network describes a unique probability distribution P How do we answer queries about P ? We use inference as a name for the process of computing answers to such queries
80
Queries: Likelihood
There are many types of queries we might ask. Most of these involve evidence.
–Evidence e is an assignment of values to a set E of variables in the domain
–Without loss of generality E = { X_{k+1}, …, X_n }
Simplest query: compute the probability of the evidence
$P(e) = \sum_{x_1} \cdots \sum_{x_k} P(x_1, \ldots, x_k, e)$
This is often referred to as computing the likelihood of the evidence.
81
Queries: A posteriori belief
Often we are interested in the conditional probability of a variable given the evidence
$P(X \mid e) = \frac{P(X, e)}{P(e)}$
This is the a posteriori belief in X, given evidence e.
A related task is computing the term P(X, e)
–i.e., the likelihood of e and X = x for values of X
–we can recover the a posteriori belief by
$P(X = x \mid e) = \frac{P(X = x, e)}{\sum_{x'} P(X = x', e)}$
82
A posteriori belief
This query is useful in many cases:
Prediction: what is the probability of an outcome given the starting condition
–Target is a descendant of the evidence
Diagnosis: what is the probability of disease/fault given symptoms
–Target is an ancestor of the evidence
As we shall see, the direction of the edges between variables does not restrict the directions of the queries
–Probabilistic inference can combine evidence from all parts of the network
83
Queries: A posteriori joint In this query, we are interested in the conditional probability of several variables, given the evidence P(X, Y, … | e ) Note that the size of the answer to query is exponential in the number of variables in the joint
84
Queries: MAP
In this query we want to find the maximum a posteriori assignment for some variables of interest (say X_1,…,X_l).
That is, x_1,…,x_l maximize the probability P(x_1,…,x_l | e).
Note that this is equivalent to maximizing P(x_1,…,x_l, e).
85
Queries: MAP We can use MAP for: Classification –find most likely label, given the evidence Explanation –What is the most likely scenario, given the evidence
86
Queries: MAP Cautionary note: The MAP depends on the set of variables Example: –MAP of X –MAP of (X, Y)
87
Complexity of Inference Thm: Computing P(X = x) in a Bayesian network is NP-hard Not surprising, since we can simulate Boolean gates.
88
Hardness Hardness does not mean we cannot solve inference –It implies that we cannot find a general procedure that works efficiently for all networks –For particular families of networks, we can have provably efficient procedures
89
Approaches to inference Exact inference –Inference in Simple Chains –Variable elimination –Clustering / join tree algorithms Approximate inference –Stochastic simulation / sampling methods –Markov chain Monte Carlo methods –Mean field theory
90
Inference in Simple Chains
How do we compute P(X_2)?
$P(x_2) = \sum_{x_1} P(x_1, x_2) = \sum_{x_1} P(x_1) P(x_2 \mid x_1)$
[Figure: chain X_1 → X_2]
91
Inference in Simple Chains (cont.)
How do we compute P(X_3)?
$P(x_3) = \sum_{x_2} P(x_2, x_3) = \sum_{x_2} P(x_2) P(x_3 \mid x_2)$
We already know how to compute P(X_2).
[Figure: chain X_1 → X_2 → X_3]
92
Inference in Simple Chains (cont.)
How do we compute P(X_n)?
Compute P(X_1), P(X_2), P(X_3), …
We compute each term by using the previous one:
$P(x_{i+1}) = \sum_{x_i} P(x_i) P(x_{i+1} \mid x_i)$
Complexity: each step costs O(|Val(X_i)|·|Val(X_{i+1})|) operations.
Compare to naïve evaluation, which requires summing over joint values of n-1 variables.
[Figure: chain X_1 → X_2 → X_3 → … → X_n]
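The forward iteration can be sketched in a few lines. The list-based representation of the prior and the conditional tables, and the numbers in the example, are assumptions made for this illustration.

```python
def forward_marginals(prior, transitions):
    """Marginals P(X_1), P(X_2), ... of a chain X_1 -> X_2 -> ... -> X_n.

    prior[a] = P(X_1 = a); transitions[i][a][b] is the probability that the
    (i+2)-th variable equals b given that the (i+1)-th variable equals a.
    """
    marginals = [list(prior)]
    for cpd in transitions:
        prev = marginals[-1]
        # One step: P(x_next) = sum over x_prev of P(x_prev) * P(x_next | x_prev)
        nxt = [sum(prev[a] * cpd[a][b] for a in range(len(prev)))
               for b in range(len(cpd[0]))]
        marginals.append(nxt)
    return marginals


# Hypothetical numbers for a three-variable binary chain.
PRIOR = [0.6, 0.4]
STEP = [[0.7, 0.3], [0.2, 0.8]]
CHAIN_MARGINALS = forward_marginals(PRIOR, [STEP, STEP])

if __name__ == "__main__":
    for i, m in enumerate(CHAIN_MARGINALS, start=1):
        print(f"P(X_{i}) =", m)
```

Each step touches every (previous value, next value) pair once, which matches the per-step cost quoted on the slide.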
93
Inference in Simple Chains (cont.)
Suppose that we observe the value of X_2 = x_2.
How do we compute P(X_1|x_2)?
–Recall that it suffices to compute P(X_1, x_2)
$P(x_1, x_2) = P(x_2 \mid x_1) P(x_1)$
[Figure: chain X_1 → X_2]
94
Inference in Simple Chains (cont.)
Suppose that we observe the value of X_3 = x_3.
How do we compute P(X_1, x_3)?
$P(x_1, x_3) = P(x_1) P(x_3 \mid x_1)$
How do we compute P(x_3|x_1)?
$P(x_3 \mid x_1) = \sum_{x_2} P(x_2 \mid x_1) P(x_3 \mid x_2)$
[Figure: chain X_1 → X_2 → X_3]
95
Inference in Simple Chains (cont.)
Suppose that we observe the value of X_n = x_n.
How do we compute P(X_1, x_n)?
$P(x_1, x_n) = P(x_1) P(x_n \mid x_1)$
We compute P(x_n|x_{n-1}), P(x_n|x_{n-2}), … iteratively:
$P(x_n \mid x_i) = \sum_{x_{i+1}} P(x_{i+1} \mid x_i) P(x_n \mid x_{i+1})$
[Figure: chain X_1 → X_2 → X_3 → … → X_n]
96
Inference in Simple Chains (cont.)
Suppose that we observe the value of X_n = x_n.
We want to find P(X_k|x_n).
How do we compute P(X_k, x_n)?
$P(x_k, x_n) = P(x_k) P(x_n \mid x_k)$
We compute P(X_k) by forward iterations.
We compute P(x_n | X_k) by backward iterations.
[Figure: chain X_1 → X_2 → … → X_k → … → X_n]
97
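Elimination in Chains
We now try to understand the simple chain example from first principles.
Using the definition of probability, we have
$P(e) = \sum_d \sum_c \sum_b \sum_a P(a, b, c, d, e)$
[Figure: chain A → B → C → D → E]
98
Elimination in Chains
By chain decomposition, we get
$P(e) = \sum_d \sum_c \sum_b \sum_a P(a) P(b \mid a) P(c \mid b) P(d \mid c) P(e \mid d)$
99
Elimination in Chains
Rearranging terms...
$P(e) = \sum_d \sum_c \sum_b P(c \mid b) P(d \mid c) P(e \mid d) \sum_a P(a) P(b \mid a)$
100
Elimination in Chains
Now we can perform the innermost summation
$P(e) = \sum_d \sum_c \sum_b P(c \mid b) P(d \mid c) P(e \mid d) P(b)$
This summation is exactly the first step in the forward iteration we described before.
[Figure: chain A → B → C → D → E with A crossed out]
101
Elimination in Chains
Rearranging and then summing again, we get
$P(e) = \sum_d \sum_c P(d \mid c) P(e \mid d) \sum_b P(c \mid b) P(b) = \sum_d \sum_c P(d \mid c) P(e \mid d) P(c)$
[Figure: chain A → B → C → D → E with A and B crossed out]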
102
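Elimination in Chains with Evidence
Similarly, we understand the backward pass.
We write the query in explicit form
$P(a, e) = \sum_b \sum_c \sum_d P(a) P(b \mid a) P(c \mid b) P(d \mid c) P(e \mid d)$
[Figure: chain A → B → C → D → E]
103
Elimination in Chains with Evidence
Eliminating d, we get
$P(a, e) = \sum_b \sum_c P(a) P(b \mid a) P(c \mid b) \sum_d P(d \mid c) P(e \mid d) = \sum_b \sum_c P(a) P(b \mid a) P(c \mid b) P(e \mid c)$
[Figure: chain A → B → C → D → E with D crossed out]
104
Elimination in Chains with Evidence
Eliminating c, we get
$P(a, e) = \sum_b P(a) P(b \mid a) \sum_c P(c \mid b) P(e \mid c) = \sum_b P(a) P(b \mid a) P(e \mid b)$
[Figure: chain A → B → C → D → E with C and D crossed out]
105
Elimination in Chains with Evidence
Finally, we eliminate b
$P(a, e) = P(a) \sum_b P(b \mid a) P(e \mid b) = P(a) P(e \mid a)$
[Figure: chain A → B → C → D → E with B, C and D crossed out]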
106
Variable Elimination
General idea:
Write the query in the form
$P(X_1, e) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_2} \prod_i P(x_i \mid pa_i)$
Iteratively
–Move all irrelevant terms outside of the innermost sum
–Perform the innermost sum, getting a new term
–Insert the new term into the product
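The iteration above can be sketched with factors stored as (variables, table) pairs. This is a minimal illustration for binary variables only; the factor representation, function names and example numbers are assumptions, and no attention is paid to efficiency.

```python
from itertools import product


def multiply(f, g):
    """Pointwise product of two factors over binary variables."""
    (fv, ft), (gv, gt) = f, g
    out_vars = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for assignment in product((0, 1), repeat=len(out_vars)):
        env = dict(zip(out_vars, assignment))
        table[assignment] = (ft[tuple(env[v] for v in fv)]
                             * gt[tuple(env[v] for v in gv)])
    return out_vars, table


def sum_out(f, var):
    """Sum a variable out of a factor, producing a new, smaller factor."""
    fv, ft = f
    table = {}
    for assignment, value in ft.items():
        key = tuple(a for v, a in zip(fv, assignment) if v != var)
        table[key] = table.get(key, 0.0) + value
    return tuple(v for v in fv if v != var), table


def eliminate(factors, order):
    """Eliminate the variables in `order`; return the product of what remains."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        rest = [f for f in factors if var not in f[0]]   # irrelevant terms stay outside
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, var)]            # insert the new term
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result


# Hypothetical binary chain A -> B -> C.
STEP = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}
FACTORS = [(("A",), {(0,): 0.6, (1,): 0.4}), (("A", "B"), STEP), (("B", "C"), STEP)]
RESULT_VARS, RESULT_TABLE = eliminate(FACTORS, ["A", "B"])

if __name__ == "__main__":
    print(RESULT_VARS, RESULT_TABLE)
```

On a chain this reproduces the forward iteration: eliminating A yields a factor over B equal to P(B), and eliminating B yields P(C).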
107
A More Complex Example
“Asia” network:
[Figure: DAG over Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Abnormality in Chest, Bronchitis, X-Ray, Dyspnea]
108
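We want to compute P(d)
Need to eliminate: v,s,x,t,l,a,b
Initial factors:
$P(v) P(s) P(t \mid v) P(l \mid s) P(b \mid s) P(a \mid t, l) P(x \mid a) P(d \mid a, b)$
[Figure: Asia network over V, S, T, L, A, B, X, D]
109
We want to compute P(d)
Need to eliminate: v,s,x,t,l,a,b
Eliminate: v
Compute: $f_v(t) = \sum_v P(v) P(t \mid v)$
Remaining factors: $f_v(t) P(s) P(l \mid s) P(b \mid s) P(a \mid t, l) P(x \mid a) P(d \mid a, b)$
Note: f_v(t) = P(t)
In general, the result of elimination is not necessarily a probability term.
110
We want to compute P(d)
Need to eliminate: s,x,t,l,a,b
Eliminate: s
Compute: $f_s(b, l) = \sum_s P(s) P(b \mid s) P(l \mid s)$
Remaining factors: $f_v(t) f_s(b, l) P(a \mid t, l) P(x \mid a) P(d \mid a, b)$
Summing on s results in a factor with two arguments, f_s(b,l).
In general, the result of elimination may be a function of several variables.
111
We want to compute P(d)
Need to eliminate: x,t,l,a,b
Eliminate: x
Compute: $f_x(a) = \sum_x P(x \mid a)$
Remaining factors: $f_v(t) f_s(b, l) f_x(a) P(a \mid t, l) P(d \mid a, b)$
Note: f_x(a) = 1 for all values of a!
112
We want to compute P(d)
Need to eliminate: t,l,a,b
Eliminate: t
Compute: $f_t(a, l) = \sum_t f_v(t) P(a \mid t, l)$
Remaining factors: $f_s(b, l) f_x(a) f_t(a, l) P(d \mid a, b)$
113
We want to compute P(d)
Need to eliminate: l,a,b
Eliminate: l
Compute: $f_l(a, b) = \sum_l f_s(b, l) f_t(a, l)$
Remaining factors: $f_x(a) f_l(a, b) P(d \mid a, b)$
114
We want to compute P(d)
Need to eliminate: a,b
Eliminate: a,b
Compute: $f_a(b, d) = \sum_a f_l(a, b) f_x(a) P(d \mid a, b)$ and $f_b(d) = \sum_b f_a(b, d)$
Result: $P(d) = f_b(d)$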
115
Variable Elimination We now understand variable elimination as a sequence of rewriting operations Actual computation is done in elimination step Exactly the same computation procedure applies to Markov networks Computation depends on order of elimination
116
Dealing with evidence
How do we deal with evidence?
Suppose we get evidence V = t, S = f, D = t.
We want to compute P(L, V = t, S = f, D = t).
[Figure: Asia network over V, S, T, L, A, B, X, D]
117
Dealing with Evidence
We start by writing the factors:
$P(v) P(s) P(t \mid v) P(l \mid s) P(b \mid s) P(a \mid t, l) P(x \mid a) P(d \mid a, b)$
Since we know that V = t, we don't need to eliminate V.
Instead, we can replace the factors P(V) and P(T|V) with
$f_{P(V)} = P(V = t)$ and $f_{P(T \mid V)}(T) = P(T \mid V = t)$
These “select” the appropriate parts of the original factors given the evidence.
Note that f_{P(V)} is a constant, and thus does not appear in the elimination of other variables.
118
Dealing with Evidence
Given evidence V = t, S = f, D = t
Compute P(L, V = t, S = f, D = t)
Initial factors, after setting evidence:
$f_{P(v)} f_{P(s)} f_{P(t \mid v)}(t) f_{P(l \mid s)}(l) f_{P(b \mid s)}(b) P(a \mid t, l) P(x \mid a) f_{P(d \mid a, b)}(a, b)$
Eliminating x, we get
$f_{P(v)} f_{P(s)} f_{P(t \mid v)}(t) f_{P(l \mid s)}(l) f_{P(b \mid s)}(b) P(a \mid t, l) f_x(a) f_{P(d \mid a, b)}(a, b)$
Eliminating t, we get
$f_{P(v)} f_{P(s)} f_{P(l \mid s)}(l) f_{P(b \mid s)}(b) f_t(a, l) f_x(a) f_{P(d \mid a, b)}(a, b)$
Eliminating a, we get
$f_{P(v)} f_{P(s)} f_{P(l \mid s)}(l) f_{P(b \mid s)}(b) f_a(b, l)$
Eliminating b, we get
$f_{P(v)} f_{P(s)} f_{P(l \mid s)}(l) f_b(l)$
123
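Complexity of variable elimination
Suppose in one elimination step we compute
$f_x(y_1, \ldots, y_k) = \sum_x f'_x(x, y_1, \ldots, y_k)$, where $f'_x(x, y_1, \ldots, y_k) = \prod_{i=1}^{m} f_i(x, y_{i,1}, \ldots, y_{i,l_i})$
This requires
–$m \cdot |Val(X)| \cdot \prod_i |Val(Y_i)|$ multiplications: for each value of x, y_1, …, y_k, we do m multiplications
–$|Val(X)| \cdot \prod_i |Val(Y_i)|$ additions: for each value of y_1, …, y_k, we do |Val(X)| additions
Complexity is exponential in the number of variables in the intermediate factor!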
124
Understanding Variable Elimination We want to select “good” elimination orderings that reduce complexity We start by attempting to understand variable elimination via the graph we are working with This will reduce the problem of finding good ordering to graph-theoretic operation that is well-understood
125
Undirected graph representation
At each stage of the procedure, we have an algebraic term that we need to evaluate.
In general this term is of the form
$\sum_{y_1} \cdots \sum_{y_n} \prod_i f_i(Z_i)$
where Z_i are sets of variables.
We now plot a graph where there is an undirected edge X--Y if X, Y are arguments of some factor
–that is, if X, Y are in some Z_i
Note: this is the Markov network that describes the probability on the variables we did not eliminate yet.
126
Chordal Graphs
elimination ordering ⇒ undirected chordal graph
Graph:
–Maximal cliques are factors in elimination
–Factors in elimination are cliques in the graph
–Complexity is exponential in the size of the largest clique in the graph
[Figure: the Asia network over V, S, T, L, A, B, X, D and the chordal graph induced by an elimination ordering]
127
Induced Width The size of the largest clique in the induced graph is thus an indicator for the complexity of variable elimination This quantity is called the induced width of a graph according to the specified ordering Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph
128
General Networks
From graph theory:
Thm: Finding an ordering that minimizes the induced width is NP-hard.
However,
–There are reasonable heuristics for finding a “relatively” good ordering
–There are provable approximations to the best induced width
–If the graph has a small induced width, there are algorithms that find it in polynomial time
129
Elimination on Trees
Thm: Inference on trees is linear in the number of variables.
Formally, for any tree, there is an elimination ordering with induced width = 1.
130
PolyTrees
A polytree is a network where there is at most one path from one variable to another.
Thm: Inference in a polytree is linear in the representation size of the network
–This assumes tabular CPT representation
[Figure: polytree over A, B, C, D, E, F, G, H]
131
Approaches to inference Exact inference –Inference in Simple Chains –Variable elimination –Clustering / join tree algorithms Approximate inference –Stochastic simulation / sampling methods –Markov chain Monte Carlo methods –Mean field theory
132
Stochastic simulation Suppose you are given values for some subset of the variables, G, and want to infer values for unknown variables, U Randomly generate a very large number of instantiations from the BN –Generate instantiations for all variables – start at root variables and work your way “forward” Only keep those instantiations that are consistent with the values for G Use the frequency of values for U to get estimated probabilities Accuracy of the results depends on the size of the sample (asymptotically approaches exact results)
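The procedure described here (forward sampling, then discarding samples that disagree with the evidence) can be sketched as follows. The three-node network and all of its probabilities are hypothetical, chosen only so that the exact answer is easy to compute by hand.

```python
import random

# Hypothetical network: B -> A <- E, with P(A = true | b, e) given below.
P_B, P_E = 0.1, 0.2
P_A = {(False, False): 0.01, (False, True): 0.3,
       (True, False): 0.9, (True, True): 0.95}


def sample_network(rng):
    """Forward-sample (b, e, a): roots first, then the child."""
    b = rng.random() < P_B
    e = rng.random() < P_E
    a = rng.random() < P_A[(b, e)]
    return b, e, a


def estimate_b_given_a(n_samples, seed=0):
    """Estimate P(B = true | A = true) from the samples that match the evidence."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n_samples):
        b, _e, a = sample_network(rng)
        if a:                   # keep only samples consistent with A = true
            kept += 1
            hits += b
    return hits / kept


def exact_b_given_a():
    """Exact P(B = true | A = true) by summing out E, for comparison."""
    def p_a_and(b):
        p_b = P_B if b else 1 - P_B
        return p_b * ((1 - P_E) * P_A[(b, False)] + P_E * P_A[(b, True)])
    return p_a_and(True) / (p_a_and(True) + p_a_and(False))


if __name__ == "__main__":
    print(estimate_b_given_a(200_000), exact_b_given_a())
```

Only about 15% of the samples survive the evidence filter in this example, which illustrates why the method becomes wasteful when the evidence is unlikely.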
133
Markov chain Monte Carlo methods So called because –Markov chain – each instance generated in the sample is dependent on the previous instance –Monte Carlo – statistical sampling method Perform a random walk through variable assignment space, collecting statistics as you go –Start with a random instantiation, consistent with evidence variables –At each step, for some nonevidence variable, randomly sample its value, consistent with the other current assignments Given enough samples, MCMC gives an accurate estimate of the true distribution of values
134
Learning Bayesian Networks
135
Learning Bayesian networks
[Figure: an Inducer takes data plus prior information and outputs a Bayesian network over E, B, R, A, C together with its CPDs, e.g. a table for P(A | E,B)]
136
Known Structure -- Complete Data
Network structure is specified
–Inducer needs to estimate parameters
Data does not contain missing values
[Figure: the Inducer receives complete records over E, B, A and a network E → A ← B with unknown (“?”) CPD entries, and outputs the same network with an estimated table for P(A | E,B)]
137
Unknown Structure -- Complete Data
Network structure is not specified
–Inducer needs to select arcs & estimate parameters
Data does not contain missing values
[Figure: the Inducer receives complete records over E, B, A and outputs a network over E, B, A with an estimated table for P(A | E,B)]
138
Known Structure -- Incomplete Data
Network structure is specified
Data contains missing values
–We consider assignments to missing values
[Figure: the Inducer receives records over E, B, A with missing entries and a network E → A ← B with unknown (“?”) CPD entries, and outputs the same network with an estimated table for P(A | E,B)]
139
Known Structure / Complete Data
Given a network structure G
–And a choice of parametric family for P(X_i|Pa_i)
Learn parameters for the network
Goal: construct a network that is “closest” to the distribution that generated the data
140
Learning Parameters for a Bayesian Network E B A C Training data has the form:
141
Learning Parameters for a Bayesian Network [Figure: network over E, B, A, C.] Since we assume i.i.d. samples, the likelihood function is $L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)$
142
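Learning Parameters for a Bayesian Network [Figure: network over E, B, A, C.] By definition of the network, we get $L(\Theta : D) = \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P(A[m] \mid B[m], E[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)$ (this factorization assumes the arcs E → A, B → A and A → C)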
143
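Learning Parameters for a Bayesian Network [Figure: network over E, B, A, C.] Rewriting terms, we get one product per family: $L(\Theta : D) = \prod_m P(E[m] : \Theta) \cdot \prod_m P(B[m] : \Theta) \cdot \prod_m P(A[m] \mid B[m], E[m] : \Theta) \cdot \prod_m P(C[m] \mid A[m] : \Theta)$ (assuming the arcs E → A, B → A and A → C)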
144
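General Bayesian Networks Generalizing for any Bayesian network: $L(\Theta : D) = \prod_m P(x_1[m], \dots, x_n[m] : \Theta)$ (i.i.d. samples) $= \prod_m \prod_i P(x_i[m] \mid pa_i[m] : \Theta_i)$ (network factorization) $= \prod_i L_i(\Theta_i : D)$ The likelihood decomposes according to the structure of the network.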
145
General Bayesian Networks (Cont.) Decomposition ⇒ independent estimation problems If the parameters for each family are not related, then they can be estimated independently of each other.
146
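From Binomial to Multinomial For example, suppose X can have the values 1, 2, …, K We want to learn the parameters $\theta_1, \theta_2, \dots, \theta_K$ Sufficient statistics: $N_1, N_2, \dots, N_K$, the number of times each outcome is observed Likelihood function: $L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$ MLE: $\hat{\theta}_k = N_k / \sum_{l} N_l$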
147
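Likelihood for Multinomial Networks When we assume that P(X_i | Pa_i) is multinomial, we get further decomposition: $L_i(\Theta_i : D) = \prod_m P(x_i[m] \mid pa_i[m] : \Theta_i) = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}$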
148
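Likelihood for Multinomial Networks When we assume that P(X_i | Pa_i) is multinomial, we get further decomposition: $L_i(\Theta_i : D) = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}$ For each value pa_i of the parents of X_i we get an independent multinomial problem The MLE is $\hat{\theta}_{x_i \mid pa_i} = N(x_i, pa_i) / N(pa_i)$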
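As an illustration of how this decomposition turns parameter learning into counting, the sketch below estimates one conditional probability table by dividing the count N(x_i, pa_i) by N(pa_i). The record format (a list of dicts), the toy data, and the function name are assumptions of this example.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """MLE of P(child | parents) as N(x, pa) / N(pa), from a list of dict records."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in data)
    return {(pa, x): n / parent_counts[pa] for (pa, x), n in joint.items()}

# Toy complete data over E, B, A (values are made up).
data = [
    {"E": 0, "B": 0, "A": 0},
    {"E": 0, "B": 0, "A": 0},
    {"E": 0, "B": 1, "A": 1},
    {"E": 0, "B": 1, "A": 0},
    {"E": 1, "B": 0, "A": 1},
]
cpt = mle_cpt(data, "A", ["E", "B"])
```

Parent configurations that never occur in the data (here E=1, B=1) get no estimate at all, and configurations seen once get probabilities of exactly 0 or 1.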
149
Maximum Likelihood Estimation Consistency: the estimate converges to the best possible value as the number of examples grows To make this formal, we need to introduce some definitions
150
KL-Divergence Let P and Q be two distributions over X A measure of distance between P and Q is the Kullback-Leibler divergence $KL(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$ KL(P||Q) = 1 (when logs are in base 2) means: –The probability P assigns to an instance is, on average, twice the probability Q assigns to it (the average is of the log-ratio, taken under P) KL(P||Q) ≥ 0 KL(P||Q) = 0 iff P and Q are equal
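A small numerical sketch of the divergence for discrete distributions given as lists of probabilities. The function name and the two example distributions are assumptions made here for illustration.

```python
import math

def kl_divergence(p, q, base=2.0):
    """KL(P||Q) = sum_x P(x) log(P(x)/Q(x)); terms with P(x) = 0 contribute 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf  # P puts mass where Q puts none
            total += px * math.log(px / qx, base)
    return total

p = [0.5, 0.5]
q = [0.25, 0.75]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)
```

Comparing `kl_pq` with `kl_qp` shows that the divergence is not symmetric, which is why it is a measure of distance only in a loose sense.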
151
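Consistency Let $P(X \mid \theta)$ be a parametric family –We need to assume various regularity conditions that we won’t go into now Let $P^*(X)$ be the distribution that generates the data Let $\hat{\theta}_D$ be the MLE given a dataset D Thm: As $N \to \infty$, $\hat{\theta}_D \to \theta^*$ with probability 1, where $\theta^* = \arg\min_\theta KL(P^*(X) \| P(X \mid \theta))$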
152
Consistency -- Geometric Interpretation [Figure: the space of probability distributions, containing P* and the subset of distributions that can be represented by P(X | θ); the point P(X | θ*) is marked within that subset.]
153
Is MLE all we need? Suppose that after 10 observations, –ML estimates P(H) = 0.7 for the thumbtack –Would you bet on heads for the next toss? Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin Would you place the same bet?
154
Bayesian Inference Frequentist Approach: Assumes there is an unknown but fixed parameter θ Estimates θ with some confidence Prediction by using the estimated parameter value Bayesian Approach: Represents uncertainty about the unknown parameter Uses probability to quantify this uncertainty: –Unknown parameters as random variables Prediction follows from the rules of probability: –Expectation over the unknown parameters
155
Bayesian Inference (cont.) We can represent our uncertainty about the sampling process using a Bayesian network The values of X are independent given θ The conditional probabilities, P(x[m] | θ), are the parameters in the model Prediction is now inference in this network [Figure: θ is the parent of the observed data X[1], X[2], …, X[M] and of the query X[M+1].]
156
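Bayesian Inference (cont.) Prediction as inference in this network: $P(x[M+1] \mid x[1], \dots, x[M]) = \int P(x[M+1] \mid \theta)\, P(\theta \mid x[1], \dots, x[M])\, d\theta$ where the posterior is $P(\theta \mid x[1], \dots, x[M]) = \frac{P(x[1], \dots, x[M] \mid \theta)\, P(\theta)}{P(x[1], \dots, x[M])}$, that is, likelihood times prior divided by the probability of the data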
157
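Example: Binomial Data Revisited Prior: uniform for θ in [0,1] –P(θ) = 1 Then P(θ | D) is proportional to the likelihood L(θ : D) With (N_H, N_T) = (4, 1): MLE for P(X = H) is 4/5 = 0.8 Bayesian prediction is $P(x[M+1] = H \mid D) = \int \theta\, P(\theta \mid D)\, d\theta = 5/7 \approx 0.714$ [Figure: plot over θ in [0, 1].]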
158
Bayesian Inference and MLE In our example, MLE and Bayesian prediction differ But… If prior is well-behaved Does not assign 0 density to any “feasible” parameter value Then: both MLE and Bayesian prediction converge to the same value Both are consistent
159
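Dirichlet Priors Recall that the likelihood function is $L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$ A Dirichlet prior with hyperparameters $\alpha_1, \dots, \alpha_K$ is defined as $P(\Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$ for legal $\theta_1, \dots, \theta_K$ Then the posterior has the same form, with hyperparameters $\alpha_1 + N_1, \dots, \alpha_K + N_K$: $P(\Theta \mid D) \propto P(\Theta)\, P(D \mid \Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}$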
160
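Dirichlet Priors (cont.) We can compute the prediction on a new event in closed form: If P(Θ) is Dirichlet with hyperparameters $\alpha_1, \dots, \alpha_K$ then $P(X[1] = k) = \int \theta_k\, P(\Theta)\, d\Theta = \frac{\alpha_k}{\sum_l \alpha_l}$ Since the posterior is also Dirichlet, we get $P(X[M+1] = k \mid D) = \int \theta_k\, P(\Theta \mid D)\, d\Theta = \frac{\alpha_k + N_k}{\sum_l (\alpha_l + N_l)}$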
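The closed-form prediction amounts to adding the hyperparameters to the observed counts and normalizing. The sketch below applies it to the earlier coin example with (N_H, N_T) = (4, 1) and a uniform prior; the function name is an assumption of this example.

```python
def dirichlet_predictive(alpha, counts):
    """P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]

# Uniform prior Dirichlet(1, 1) with 4 heads and 1 tail.
pred_uniform = dirichlet_predictive([1, 1], [4, 1])
# A stronger prior, Dirichlet(5, 5), pulls the prediction closer to 0.5.
pred_strong = dirichlet_predictive([5, 5], [4, 1])
mle = 4 / 5
```

With the uniform prior the prediction for heads is 5/7, below the MLE of 0.8, and the stronger prior moves it further toward 0.5.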
161
Dirichlet Priors -- Example [Figure: densities of Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2) and Dirichlet(5,5) over θ in [0, 1].]
162
Prior Knowledge The hyperparameters $\alpha_1, \dots, \alpha_K$ can be thought of as “imaginary” counts from our prior experience Equivalent sample size = $\alpha_1 + \dots + \alpha_K$ The larger the equivalent sample size, the more confident we are in our prior
163
Effect of Priors Prediction of P(X = H) after seeing data with N_H = 0.25 N_T for different sample sizes [Figure: two panels plotting the prediction against sample size (0 to 100). One panel: different strength α_H + α_T, fixed ratio α_H / α_T. Other panel: fixed strength α_H + α_T, different ratio α_H / α_T.]
164
Effect of Priors (cont.) In real data, Bayesian estimates are less sensitive to noise in the data [Figure: P(X = 1 | D) as a function of the number of tosses N (5 to 50) for the MLE and for Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5) and Dirichlet(10,10) priors, shown together with the sequence of toss results.]
165
Conjugate Families The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy –Dirichlet prior is a conjugate family for the multinomial likelihood Conjugate families are useful since: –For many distributions we can represent them with hyperparameters –They allow for sequential update within the same representation –In many cases we have closed-form solution for prediction
166
Bayesian Networks and Bayesian Prediction Priors for each parameter group are independent Data instances are independent given the unknown parameters [Figure: network with parameter nodes θ_X and θ_{Y|X}, observed data X[1], Y[1], …, X[M], Y[M] and query X[M+1], Y[M+1]; the same model is also drawn in plate notation, with a plate over m containing X[m] and Y[m].]
167
Bayesian Networks and Bayesian Prediction (Cont.) We can also “read” from the network: complete data ⇒ posteriors on parameters are independent [Figure: network with parameter nodes θ_X and θ_{Y|X}, observed data X[1], Y[1], …, X[M], Y[M] and query X[M+1], Y[M+1]; the same model is also drawn in plate notation, with a plate over m containing X[m] and Y[m].]
168
Bayesian Prediction (cont.) Since posteriors on parameters for each family are independent, we can compute them separately Posteriors for parameters within families are also independent: complete data ⇒ independent posteriors on θ_{Y|X=0} and θ_{Y|X=1} [Figure: plate model with parameter nodes θ_X and θ_{Y|X} over X[m], Y[m], next to a refined model in which θ_{Y|X} is split into θ_{Y|X=0} and θ_{Y|X=1}.]
169
Bayesian Prediction (cont.) Given these observations, we can compute the posterior for each multinomial θ_{X_i | pa_i} independently –The posterior is Dirichlet with parameters $\alpha(X_i = 1 \mid pa_i) + N(X_i = 1 \mid pa_i), \dots, \alpha(X_i = k \mid pa_i) + N(X_i = k \mid pa_i)$ The predictive distribution is then represented by the parameters $\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$
170
Assessing Priors for Bayesian Networks We need the α(x_i, pa_i) for each node X_i We can use initial parameters Θ_0 as prior information –Need also an equivalent sample size parameter M_0 –Then, we let $\alpha(x_i, pa_i) = M_0 \cdot P(x_i, pa_i \mid \Theta_0)$ This allows us to update a network using new data
171
Learning Parameters: Case Study (cont.) Experiment: Sample a stream of instances from the alarm network Learn parameters using –MLE estimator –Bayesian estimator with uniform prior with different strengths
172
Learning Parameters: Case Study (cont.) [Figure: KL divergence (0 to 1.4) as a function of the number of instances M (0 to 5000) for the MLE and for Bayes with a uniform prior of strength M' = 5, 10, 20 and 50.]
173
Learning Parameters: Summary Estimation relies on sufficient statistics –For multinomials these are of the form N(x_i, pa_i) –Parameter estimation: MLE $\hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}$; Bayesian (Dirichlet) $\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$ Bayesian methods also require a choice of priors MLE and Bayesian estimates are asymptotically equivalent and consistent Both can be implemented in an on-line manner by accumulating sufficient statistics
174
Learning Structure from Complete Data
175
Benefits of Learning Structure Efficient learning -- more accurate models with less data –Compare: P(A) and P(B) vs. joint P(A,B) Discover structural properties of the domain –Ordering of events –Relevance Identifying independencies ⇒ faster inference Predict effect of actions –Involves learning causal relationships among variables
176
Why Struggle for Accurate Structure? Adding an arc: –Increases the number of parameters to be fitted –Wrong assumptions about causality and domain structure Missing an arc: –Cannot be compensated by accurate fitting of parameters –Also misses causality and domain structure [Figure: three versions of a network over Earthquake, Alarm Set, Sound and Burglary: the true network, one with an added arc, and one with a missing arc.]
177
Approaches to Learning Structure Constraint based –Perform tests of conditional independence –Search for a network that is consistent with the observed dependencies and independencies Pros & Cons Intuitive, follows closely the construction of BNs Separates structure learning from the form of the independence tests Sensitive to errors in individual tests
178
Approaches to Learning Structure Score based –Define a score that evaluates how well the (in)dependencies in a structure match the observations –Search for a structure that maximizes the score Pros & Cons Statistically motivated Can make compromises Takes the structure of conditional probabilities into account Computationally hard
179
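Likelihood Score for Structures First cut approach: –Use likelihood function Recall, the likelihood score for a network structure and parameters is $L(G, \Theta_G : D) = \prod_m \prod_i P(x_i[m] \mid pa_i^G[m] : G, \Theta_G)$ Since we know how to maximize over parameters, from now on we assume $L(G : D) = \max_{\Theta_G} L(G, \Theta_G : D)$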
180
Likelihood Score for Structure (cont.) Rearranging terms: $l(G : D) = \log L(G : D) = M \sum_i \big( I(X_i ; Pa_i^G) - H(X_i) \big)$ where H(X) is the entropy of X and I(X;Y) is the mutual information between X and Y –I(X;Y) measures how much “information” each variable provides about the other –I(X;Y) ≥ 0 –I(X;Y) = 0 iff X and Y are independent –I(X;Y) = H(X) iff X is totally predictable given Y
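The listed properties of I(X;Y) can be checked numerically on small samples. The sketch below computes entropy and mutual information under the empirical distribution of the data, using the identity I(X;Y) = H(X) + H(Y) - H(X,Y); the function names and toy sequences are assumptions of this example.

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical entropy in bits of a sequence of hashable values."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) under the empirical distribution."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

x = [0, 0, 1, 1]
y_copy = [0, 0, 1, 1]         # totally predictable from x
y_independent = [0, 1, 0, 1]  # empirically independent of x

mi_copy = mutual_information(x, y_copy)
mi_independent = mutual_information(x, y_independent)
```

Here `mi_copy` equals H(X), the largest value it can take, while `mi_independent` is zero.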
181
Likelihood Score for Structure (cont.) Good news: Intuitive explanation of likelihood score: –The larger the dependency of each variable on its parents, the higher the score Likelihood as a compromise among dependencies, based on their strength
182
Likelihood Score for Structure (cont.) Bad news: Adding arcs always helps –I(X;Y) ≤ I(X;Y,Z) –Maximal score attained by fully connected networks –Such networks can overfit the data --- parameters capture the noise in the data
183
Avoiding Overfitting “Classic” issue in learning. Approaches: Restricting the hypothesis space –Limits the overfitting capability of the learner –Example: restrict # of parents or # of parameters Minimum description length –Description length measures complexity –Prefer models that compactly describe the training data Bayesian methods –Average over all possible parameter values –Use prior knowledge
184
Bayesian Inference Bayesian Reasoning --- compute expectation over unknown G: $P(x[M+1] \mid D) = \sum_G P(x[M+1] \mid G, D)\, P(G \mid D)$ Assumption: the structures G are mutually exclusive and exhaustive We know how to compute P(x[M+1] | G, D) –Same as prediction with fixed structure How do we compute P(G | D)?
185
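Posterior Score Using Bayes rule: $P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$ where P(D | G) is the marginal likelihood, P(G) is the prior over structures and P(D) is the probability of the data P(D) is the same for all structures G, so it can be ignored when comparing structures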
186
Marginal Likelihood By introduction of variables, we have that $P(D \mid G) = \int P(D \mid G, \Theta)\, P(\Theta \mid G)\, d\Theta$ (likelihood times prior over parameters) This integral measures sensitivity to choice of parameters
187
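Marginal Likelihood: Binomial case Assume we observe a sequence of coin tosses x[1], …, x[M] By the chain rule we have: $P(x[1], \dots, x[M]) = P(x[1])\, P(x[2] \mid x[1]) \cdots P(x[M] \mid x[1], \dots, x[M-1])$ Recall that $P(x[m+1] = H \mid x[1], \dots, x[m]) = \frac{N_H^m + \alpha_H}{m + \alpha_H + \alpha_T}$ where $N_H^m$ is the number of heads in the first m examples.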
188
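Marginal Likelihood: Binomials (cont.) We simplify this by using $\alpha (\alpha + 1) \cdots (\alpha + N - 1) = \frac{\Gamma(\alpha + N)}{\Gamma(\alpha)}$ Thus $P(x[1], \dots, x[M]) = \frac{\Gamma(\alpha_H + \alpha_T)}{\Gamma(\alpha_H + \alpha_T + N_H + N_T)} \cdot \frac{\Gamma(\alpha_H + N_H)}{\Gamma(\alpha_H)} \cdot \frac{\Gamma(\alpha_T + N_T)}{\Gamma(\alpha_T)}$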
189
Binomial Likelihood: Example Idealized experiment with P(H) = 0.25 [Figure: (log P(D))/M as a function of M (0 to 50) for Dirichlet(.5,.5), Dirichlet(1,1) and Dirichlet(5,5) priors.]
190
Marginal Likelihood: Example (cont.) Actual experiment with P(H) = 0.25 [Figure: (log P(D))/M as a function of M (0 to 50) for Dirichlet(.5,.5), Dirichlet(1,1) and Dirichlet(5,5) priors.]
191
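Marginal Likelihood: Multinomials The same argument generalizes to multinomials with a Dirichlet prior P(Θ) is Dirichlet with hyperparameters $\alpha_1, \dots, \alpha_K$ D is a dataset with sufficient statistics $N_1, \dots, N_K$ Then $P(D) = \frac{\Gamma(\sum_l \alpha_l)}{\Gamma(\sum_l (\alpha_l + N_l))} \prod_l \frac{\Gamma(\alpha_l + N_l)}{\Gamma(\alpha_l)}$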
192
Marginal Likelihood: Bayesian Networks Network structure determines form of marginal likelihood Data (instances 1 to 7): X = H T T H T H H, Y = H T H H T T H Network 1 (X and Y with no arc): two Dirichlet marginal likelihoods, P(X[1],…,X[7]) and P(Y[1],…,Y[7])
193
Marginal Likelihood: Bayesian Networks Network structure determines form of marginal likelihood Data (instances 1 to 7): X = H T T H T H H, Y = H T H H T T H Network 2 (X → Y): three Dirichlet marginal likelihoods, P(X[1],…,X[7]), P(Y[1],Y[4],Y[6],Y[7]) for the instances with X = H, and P(Y[2],Y[3],Y[5]) for the instances with X = T
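The two decompositions can be evaluated on the seven tosses with the Dirichlet marginal likelihood for multinomials. The sketch below works in log space with `math.lgamma`; the choice of a Dirichlet(1,1) prior for every family and the helper names are assumptions made here, since the slides do not state which prior was used.

```python
from math import exp, lgamma

def log_dirichlet_marginal(alpha, counts):
    """log [ G(sum a) / G(sum a + sum N) * prod_k G(a_k + N_k) / G(a_k) ], G = Gamma."""
    a, n = sum(alpha), sum(counts)
    return (lgamma(a) - lgamma(a + n)
            + sum(lgamma(ak + nk) - lgamma(ak) for ak, nk in zip(alpha, counts)))

def counts(seq):
    return (seq.count("H"), seq.count("T"))

X = "HTTHTHH"
Y = "HTHHTTH"
PRIOR = (1.0, 1.0)  # assumed Dirichlet(1,1) for every family

log_px = log_dirichlet_marginal(PRIOR, counts(X))

# Network 1: no arc, so X and Y each contribute one marginal likelihood.
log_net1 = log_px + log_dirichlet_marginal(PRIOR, counts(Y))

# Network 2: X -> Y, so the Y values are split by the value of X.
y_given_h = "".join(y for x, y in zip(X, Y) if x == "H")  # Y[1], Y[4], Y[6], Y[7]
y_given_t = "".join(y for x, y in zip(X, Y) if x == "T")  # Y[2], Y[3], Y[5]
log_net2 = (log_px
            + log_dirichlet_marginal(PRIOR, counts(y_given_h))
            + log_dirichlet_marginal(PRIOR, counts(y_given_t)))
```

Under this assumed prior the Y part of the score is 1/280 for Network 1 and 1/20 · 1/12 = 1/240 for Network 2, so the network with the arc scores slightly higher on these seven instances.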
194
Idealized Experiment P(X = H) = 0.5, P(Y = H | X = H) = 0.5 + p, P(Y = H | X = T) = 0.5 - p [Figure: (log P(D))/M as a function of M (1 to 1000, log scale) for the independent model and for p = 0.05, 0.10, 0.15 and 0.20.]
195
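Marginal Likelihood for General Network The marginal likelihood has the form: $P(D \mid G) = \prod_i \prod_{pa_i^G} \frac{\Gamma(\alpha(pa_i^G))}{\Gamma(\alpha(pa_i^G) + N(pa_i^G))} \prod_{x_i} \frac{\Gamma(\alpha(x_i, pa_i^G) + N(x_i, pa_i^G))}{\Gamma(\alpha(x_i, pa_i^G))}$ where N(..) are the counts from the data and α(..) are the hyperparameters for each family given G Each inner factor is a Dirichlet marginal likelihood for the sequence of values of X_i when X_i’s parents have a particular value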
196
Priors We need: prior counts α(..) for each network structure G This can be a formidable task –There are exponentially many structures…
197
BDe Score Possible solution: The BDe prior Represent prior using two elements M_0, B_0 –M_0: equivalent sample size –B_0: network representing the prior probability of events
198
BDe Score Intuition: M_0 prior examples distributed by B_0 Set $\alpha(x_i, pa_i^G) = M_0 \cdot P(x_i, pa_i^G \mid B_0)$ –Note that pa_i^G are not the same as the parents of X_i in B_0 –Compute P(x_i, pa_i^G | B_0) using standard inference procedures Such priors have desirable theoretical properties –Equivalent networks are assigned the same score
199
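Bayesian Score: Asymptotic Behavior Theorem: If the prior P(Θ | G) is “well-behaved”, then $\log P(D \mid G) = l(G : D) - \frac{\log M}{2} \dim(G) + O(1)$ where l(G : D) is the maximized log-likelihood and dim(G) is the number of free parameters of G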
200
Asymptotic Behavior: Consequences Bayesian score is consistent –As M → ∞ the “true” structure G* maximizes the score (almost surely) –For sufficiently large M, the maximal scoring structures are equivalent to G* Observed data eventually overrides prior information –Assuming that the prior assigns positive probability to all cases
201
Asymptotic Behavior This score can also be justified by the Minimum Description Length (MDL) principle The asymptotic form of the score explicitly shows the tradeoff between –Fitness to data --- likelihood term –Penalty for complexity --- regularization term
202
Scores -- Summary Likelihood, MDL, (log) BDe have the form $Score(G : D) = \sum_i Score(X_i \mid Pa_i^G : D)$ BDe requires assessing a prior network. It can naturally incorporate prior knowledge and previous experience BDe is consistent and asymptotically equivalent (up to a constant) to MDL All are score-equivalent –G equivalent to G’ ⇒ Score(G) = Score(G’)
203
Optimization Problem Input: –Training data –Scoring function (including priors, if needed) –Set of possible structures Including prior knowledge about structure Output: –A network (or networks) that maximize the score Key Property: –Decomposability: the score of a network is a sum of terms.
204
Learning Trees Trees: –At most one parent per variable Why trees? –Elegant math ⇒ we can solve the optimization problem efficiently (with a greedy algorithm) –Sparse parameterization ⇒ avoid overfitting while adapting to the data
205
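Learning Trees (cont.) Let p(i) denote the parent of X_i, or 0 if X_i has no parents We can write the score as $Score(G : D) = \sum_i Score(X_i \mid Pa_i) = \sum_{i : p(i) > 0} \big( Score(X_i \mid X_{p(i)}) - Score(X_i) \big) + \sum_i Score(X_i)$ –First sum: improvement over the “empty” network –Second sum: score of the “empty” network Score = sum of edge scores + constant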
206
Learning Trees (cont.) Algorithm: Construct a graph with vertices 1, 2, … Set the weight w(i → j) to Score(X_j | X_i) - Score(X_j) Find the tree (or forest) with maximal weight –This can be done using standard algorithms in low-order polynomial time by building a tree in a greedy fashion (Kruskal’s maximum spanning tree algorithm) Theorem: This procedure finds the tree with maximal score When the score is likelihood, w(i → j) is proportional to I(X_i ; X_j); this is known as the Chow & Liu method
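A compact sketch of the likelihood-score case: edge weights are empirical mutual informations, and Kruskal's algorithm with a union-find structure picks the maximum-weight spanning forest. The data layout (a list of tuples, one entry per variable), the toy data, and the function names are assumptions of this example; the returned edges are undirected.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical I(X_i; X_j) in bits from a list of tuples."""
    n = len(data)
    ci = Counter(r[i] for r in data)
    cj = Counter(r[j] for r in data)
    cij = Counter((r[i], r[j]) for r in data)
    return sum(c / n * math.log2(c * n / (ci[a] * cj[b])) for (a, b), c in cij.items())

def chow_liu_edges(data):
    """Kruskal's algorithm over edges sorted by decreasing mutual information."""
    n_vars = len(data[0])
    root = list(range(n_vars))  # union-find parent pointers

    def find(u):
        while root[u] != u:
            root[u] = root[root[u]]  # path halving
            u = root[u]
        return u

    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    tree = []
    for weight, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj and weight > 0:  # skip edges that close a cycle or add nothing
            root[ri] = rj
            tree.append((i, j))
    return tree

# Toy data: X1 mostly copies X0 and X2 mostly copies X1.
data = [(0, 0, 0)] * 3 + [(0, 0, 1), (0, 1, 1)] + [(1, 1, 1)] * 3 + [(1, 1, 0), (1, 0, 0)]
tree = chow_liu_edges(data)
```

On this toy data the pairs (0, 1) and (1, 2) agree in 8 of 10 rows and the pair (0, 2) in only 6, so the chain 0 - 1 - 2 is recovered.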
207
Learning Trees: Example Tree learned from alarm data Not every edge in the tree is in the original network Tree direction is arbitrary --- we can’t learn about arc direction [Figure: the alarm network next to the learned tree, with correct arcs and spurious arcs marked.]
208
Difficulty Theorem: Finding a maximal scoring network structure with at most k parents for each variable is NP-hard for k > 1
209
Heuristic Search We address the problem by using heuristic search Define a search space: –nodes are possible structures –edges denote adjacency of structures Traverse this space looking for high-scoring structures Search techniques: –Greedy hill-climbing –Best first search –Simulated Annealing –...
210
Heuristic Search (cont.) Typical operations: –Add C → D –Reverse C → E –Delete C → E [Figure: a network over S, C, E, D and the three neighboring networks produced by these operations.]
211
Exploiting Decomposability in Local Search Caching: To update the score after a local change, we only need to re-score the families that were changed in the last move [Figure: four networks over S, C, E, D that differ by single-arc changes.]
212
Greedy Hill-Climbing Simplest heuristic local search –Start with a given network: the empty network, the best tree, or a random network –At each iteration: evaluate all possible changes, apply the change that leads to the best improvement in score, reiterate –Stop when no modification improves the score Each step requires evaluating approximately n new changes
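The loop above can be sketched as follows, starting from the empty network and using add, delete and reverse moves. Several choices here are assumptions for illustration only: a BIC/MDL-style family score that counts parameters from observed parent configurations, data given as a list of tuples, and re-scoring the whole network for every candidate rather than caching family scores.

```python
import math
from collections import Counter
from itertools import permutations

def family_score(data, child, parents):
    """Assumed BIC/MDL-style score: log-likelihood minus (log M / 2) * #parameters."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    pa_counts = Counter(tuple(r[p] for p in parents) for r in data)
    loglik = sum(c * math.log(c / pa_counts[pa]) for (pa, _), c in joint.items())
    n_params = (len({r[child] for r in data}) - 1) * len(pa_counts)
    return loglik - 0.5 * math.log(len(data)) * n_params

def total_score(data, parents):
    return sum(family_score(data, v, sorted(ps)) for v, ps in parents.items())

def is_acyclic(parents):
    """Depth-first search along parent links; state 1 = on stack, 2 = finished."""
    state = {}
    def visit(v):
        if state.get(v) == 1:
            return False
        if state.get(v) == 2:
            return True
        state[v] = 1
        ok = all(visit(p) for p in parents[v])
        state[v] = 2
        return ok
    return all(visit(v) for v in parents)

def neighbors(parents):
    """All acyclic structures one arc addition, deletion or reversal away."""
    for i, j in permutations(parents, 2):
        cand = {v: set(ps) for v, ps in parents.items()}
        if i in parents[j]:
            cand[j].discard(i)                      # delete i -> j
            yield cand
            rev = {v: set(ps) for v, ps in cand.items()}
            rev[i].add(j)                           # reverse to j -> i
            if is_acyclic(rev):
                yield rev
        else:
            cand[j].add(i)                          # add i -> j
            if is_acyclic(cand):
                yield cand

def hill_climb(data):
    current = {v: set() for v in range(len(data[0]))}  # start from the empty network
    best = total_score(data, current)
    while True:
        cand = max(neighbors(current), key=lambda g: total_score(data, g), default=None)
        if cand is None or total_score(data, cand) <= best + 1e-9:
            return current, best                   # no move improves the score
        current, best = cand, total_score(data, cand)

# Toy data: X1 copies X0, X2 is independent of both.
data = [(a, a, b) for a in (0, 1) for b in (0, 1)] * 10
structure, score = hill_climb(data)
edges = {(p, v) for v, ps in structure.items() for p in ps}
```

A practical implementation would exploit decomposability and re-score only the one or two families touched by each move.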
213
Greedy Hill-Climbing: Possible Pitfalls Greedy hill-climbing can get stuck in: –Local maxima: all one-edge changes reduce the score –Plateaus: some one-edge changes leave the score unchanged; this happens because equivalent networks receive the same score and are neighbors in the search space Both occur during structure search Standard heuristics can escape both –Random restarts –TABU search
214
Equivalence Class Search Idea: Search the space of equivalence classes Equivalence classes can be represented by PDAGs (partially directed acyclic graphs) Benefits: The space of PDAGs has fewer local maxima and plateaus There are fewer PDAGs than DAGs
215
Equivalence Class Search (cont.) Evaluating changes is more expensive These algorithms are more complex to implement [Figure: an original PDAG over X, Y, Z; applying Add Y---Z gives a new PDAG, from which a consistent DAG is built and scored.]
216
Learning in Practice: Alarm domain [Figure: KL divergence (0 to 2) as a function of the number of instances M (0 to 5000) for True Structure/BDe with M' = 10 and Unknown Structure/BDe with M' = 10.]
217
Model Selection So far, we focused on single model –Find best scoring model –Use it to predict next example Implicit assumption: –Best scoring model dominates the weighted sum Pros: –We get a single structure –Allows for efficient use in our tasks Cons: –We are committing to the independencies of a particular structure –Other structures might be as probable given the data
218
Model Averaging Recall, Bayesian analysis started with $P(x[M+1] \mid D) = \sum_G P(x[M+1] \mid G, D)\, P(G \mid D)$ –This requires us to average over all possible models
219
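Model Averaging (cont.) Full Averaging –Sum over all structures –Usually intractable --- there are exponentially many structures Approximate Averaging –Find the K largest scoring structures –Approximate the sum by averaging over their predictions –Weight of each structure determined by the Bayes factor $\frac{P(G \mid D)}{\sum_{G'} P(G' \mid D)} = \frac{P(G)\, P(D \mid G)}{\sum_{G'} P(G')\, P(D \mid G')}$, where the sums range over the K retained structures and P(G) P(D | G) is the actual score we compute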
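Since the scores are usually available as log P(G) + log P(D | G), the weights for approximate averaging are best normalized in log space. The sketch below does this for K retained structures; the function names and the example numbers are assumptions made here.

```python
import math

def posterior_weights(log_scores):
    """Normalize log P(G) + log P(D|G) over the retained structures (log-sum-exp)."""
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]  # avoids underflow
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

def averaged_prediction(log_scores, predictions):
    """Weighted average of per-structure predictions P(x[M+1] | G, D)."""
    return sum(w * p for w, p in zip(posterior_weights(log_scores), predictions))

# Two hypothetical structures; the second is three times as probable as the first.
log_scores = [-1000.0, -1000.0 + math.log(3)]
weights = posterior_weights(log_scores)
prediction = averaged_prediction(log_scores, [0.2, 0.6])
```

Subtracting the largest log score before exponentiating matters in practice, because raw values such as exp(-1000) underflow to zero in floating point.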
220
Search: Summary Discrete optimization problem In general, NP-Hard –Need to resort to heuristic search –In practice, search is relatively fast (~100 vars in ~10 min): Decomposability Sufficient statistics In some cases, we can reduce the search problem to an easy optimization problem –Example: learning trees