Probability and Statistics for Data Mining COMP5318
Question 1 Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?
Gender | % of credit card holders | % of gender who default
Male   | 60                       | 55
Female | 40                       | 35
Probability Probability is the mathematical language for reasoning about uncertainty. We need to make decisions in the presence of uncertainty, which is ever present. Example: The Earth is warming, a phenomenon known as Global Warming (GW). Is modern human activity the cause of GW? –Physics driven approach –Data driven approach
Experiments and Observation When an experiment is carried out we observe the outcome, which is often uncertain. –If not uncertain then why carry out the experiment? We look into a random shopping basket. Does it contain a packet of “Tofu”? We toss a coin. Does it land on “Heads”? We ask a question: “Is it raining in Broome, WA, right now?”
Building Blocks of Probability The space of all possible outcomes is called the sample space. –Choosing it is non-trivial and depends on the question being asked. Single Coin Toss. The space is {H,T}. Shopping Basket. The space of all possible combinations of all items sold in the store. Shopping Basket (a coarser alternative): {Tofu, Not-Tofu}.
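The contrast between the two shopping basket sample spaces can be sketched in Python. This is only an illustration: the three-item `catalogue` is a hypothetical stand-in for a real store, and modelling a basket as a `frozenset` of item names is an assumption made here.

```python
from itertools import combinations

# Sample space for a single coin toss.
coin_space = {"H", "T"}

# Hypothetical store catalogue (item names are illustrative only).
catalogue = ["Tofu", "Milk", "Bread"]

# Fine-grained sample space: every possible combination of items,
# including the empty basket. Each basket is modelled as a frozenset.
basket_space = {
    frozenset(combo)
    for size in range(len(catalogue) + 1)
    for combo in combinations(catalogue, size)
}

# Coarse sample space: only record whether the basket contains Tofu.
tofu_space = {"Tofu", "Not-Tofu"}

print(len(coin_space))    # 2
print(len(basket_space))  # 2 ** 3 = 8
print(len(tofu_space))    # 2
```

The fine-grained space doubles with every item added to the catalogue, which is one reason the choice of sample space matters in practice.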
Events Events are subsets of the sample space. Events are often defined in familiar terms. In the shopping basket scenario –A vegetarian shopping basket is an event: the set of all possible vegetarian item combinations. Throw of a die. The event we are looking for could be: Even Number = {2,4,6}, where the sample space = {1,2,3,4,5,6}
Events Let G be the set of all galaxies. Characterize each galaxy by three numbers –d: distance from Earth –a: major axis –b: minor axis Elliptic Galaxies (EG) –EG = {(a,b,d) | a/b > 1.5} Distant Spiral Galaxies (DSG) –DSG = {(a,b,d) | a/b < 1.5, d > 10}
Events Let G be the set of all genes. Each gene can be “on” or “off”. Let E correspond to the event: all genes which are “on” when the skin cells are “starved”.
Events are Sets At the most basic level events are sets. Therefore we can carry out set union, difference and intersection on events. For example: –E1: shopping baskets which contain Tofu –E2: shopping baskets which contain Milk –E1 ∪ E2: shopping baskets which contain Tofu or Milk (or both)
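Because events are sets, Python's built-in `set` type can illustrate these operations directly. The four baskets below are hypothetical, and identifying each outcome with a basket id is an assumption of this sketch.

```python
# Hypothetical baskets, keyed by an id; contents are illustrative only.
baskets = {
    1: {"Tofu", "Bread"},
    2: {"Milk"},
    3: {"Tofu", "Milk"},
    4: {"Beer"},
}

# An event is the set of outcomes (basket ids here) that satisfy a condition.
E1 = {i for i, items in baskets.items() if "Tofu" in items}
E2 = {i for i, items in baskets.items() if "Milk" in items}

print(E1 | E2)  # union: baskets with Tofu or Milk (or both)
print(E1 & E2)  # intersection: baskets with both Tofu and Milk
print(E1 - E2)  # difference: baskets with Tofu but not Milk
```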
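Probability Let S be the space of all possible elementary outcomes. Let F = Power(S) be the power set of S. Then a probability P is a function P : F → [0,1] that satisfies the following properties (axioms): –P(A) ≥ 0 for every event A in F –P(S) = 1 –If A1, A2, ... are pairwise disjoint events, then P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ...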
Interpretation of Probability Physical or Ontological: Long term frequency –50% chance that a coin will land on heads. –20% of all Woolworth shopping baskets are vegetarian. –22% of all Woolworth shopping baskets in Northbridge plaza are vegetarian. Epistemological: Degree of Belief –20% chance that my neighbours are watering their lawn on “dry” days. –99% chance that the green immovable object outside my house is a tree. –90% chance that Australia will win the cricket world cup.
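Consequences of Axioms The axioms imply, among other things: –P(Ø) = 0 –P(Ac) = 1 − P(A), where Ac is the complement of A –If A ⊆ B then P(A) ≤ P(B) –P(A ∪ B) = P(A) + P(B) − P(A ∩ B)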
Example Two coin tosses. Let H1 be the event that a heads occurs on toss 1 and H2 a heads on toss 2. All outcomes are equally likely. Sample space = {HH, HT, TH, TT} –H1 = {HH, HT} –H2 = {HH, TH} –P(H1 ∪ H2) = P(H1) + P(H2) − P(H1 ∩ H2) = ½ + ½ − ¼ = ¾
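The calculation can be checked by brute-force enumeration. The sketch below assumes equally likely outcomes, as the example does; the helper name `prob` is introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses; each outcome is assumed equally likely.
space = {"".join(t) for t in product("HT", repeat=2)}

def prob(event):
    """Probability of an event (a subset of `space`) under equal likelihood."""
    return Fraction(len(event & space), len(space))

H1 = {o for o in space if o[0] == "H"}  # heads on toss 1
H2 = {o for o in space if o[1] == "H"}  # heads on toss 2

direct = prob(H1 | H2)                              # count the union directly
via_formula = prob(H1) + prob(H2) - prob(H1 & H2)   # P(A) + P(B) - P(A ∩ B)
print(direct, via_formula)  # 3/4 3/4
```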
Example Two events A and B are independent if –P(A ∩ B) = P(A)P(B) P(A ∩ B) is also written as P(AB) and P(A,B). If A and B are disjoint events with P(A) > 0 and P(B) > 0, then A and B cannot be independent –P(A ∩ B) = 0, yet P(A)P(B) > 0 Except for this case you cannot determine independence by looking at a Venn diagram.
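A small Python check of the definition, reusing the two-coin-toss setting with equally likely outcomes (an assumption of this sketch). The function name `independent` is hypothetical.

```python
from fractions import Fraction

# Two fair coin tosses, all four outcomes assumed equally likely.
space = {"HH", "HT", "TH", "TT"}

def prob(event):
    return Fraction(len(event), len(space))

def independent(a, b):
    """True when P(A ∩ B) equals P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

H1 = {"HH", "HT"}   # heads on toss 1
H2 = {"HH", "TH"}   # heads on toss 2
T1 = {"TH", "TT"}   # tails on toss 1, disjoint from H1

print(independent(H1, H2))  # True
print(independent(H1, T1))  # False
```

H1 and T1 are disjoint and both have positive probability, so they fail the product test, matching the remark about disjoint events.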
Question A shopping basket can either be kosher or not. The probability that it will be kosher is 3/4. Examine 10 baskets at a checkout counter. What is the probability that there will be at least one kosher basket?
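Answer Let E be the event “At least one kosher basket.” Let NK_i be the event that the i-th basket is non-kosher. Then P(E) = 1 − P(NK_1 ∩ NK_2 ∩ ... ∩ NK_10) = 1 − P(NK_1)P(NK_2)...P(NK_10) (by independence) = 1 − (1/4)^10 ≈ 0.999999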
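The same answer in exact arithmetic, as a short Python sketch. It assumes the 10 baskets are independent, which is the step the answer relies on; the variable names are illustrative.

```python
from fractions import Fraction

p_kosher = Fraction(3, 4)   # from the question
n_baskets = 10

# Complement rule plus independence (assumed between baskets):
# P(at least one kosher) = 1 - P(all n baskets are non-kosher).
p_none = (1 - p_kosher) ** n_baskets
p_at_least_one = 1 - p_none

print(p_at_least_one)         # 1048575/1048576
print(float(p_at_least_one))
```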
Example For an Online Book Seller (OBS) the conversion rate is 1/100, i.e., on average one in every 100 visitors ends up making a purchase. What is the probability that at least one purchase will be made in 10 consecutive visits (by distinct customers)?
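One way to approach this example is the complement rule, 1 − (1 − p)^n, under the assumption that the 10 visits are independent. The sketch below computes that closed form and compares it with a seeded Monte Carlo estimate; the function names and the number of trials are choices made here, not part of the example.

```python
import random

def at_least_one(p, n):
    """Closed form: 1 - (1 - p)^n for n independent visits."""
    return 1 - (1 - p) ** n

def simulate(p, n, trials=100_000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < p for _ in range(n))
        for _ in range(trials)
    )
    return hits / trials

exact = at_least_one(0.01, 10)
estimate = simulate(0.01, 10)
print(round(exact, 4))  # 0.0956
print(estimate)
```

So roughly one run of 10 visits in ten is expected to contain a purchase, under the independence assumption.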
Example Two people take turns to sink a basketball. P1 succeeds with probability 1/3 and P2 with ¼. What is the probability that P1 succeeds before P2? Requires clever setting up of the events. –Let E be the event that P1 succeeds before P2. –Let A_i be the event that P1 succeeds before P2 on the i-th trial. –A_i ∩ A_j = Ø for i ≠ j, and E = ∪_{i=1}^{∞} A_i, so P(E) = Σ_{i=1}^{∞} P(A_i)
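A possible way to finish the calculation, sketched in Python. It assumes that P1 shoots first in every round and that all shots are independent; neither is stated explicitly in the example. Under those assumptions A_i requires both players to miss in the first i−1 rounds and P1 to score in round i, which gives a geometric series.

```python
from fractions import Fraction

p1 = Fraction(1, 3)  # P1's success probability per shot
p2 = Fraction(1, 4)  # P2's success probability per shot

# Assumption: P1 shoots first in every round and all shots are independent.
both_miss = (1 - p1) * (1 - p2)   # a round with no success

def prob_A(i):
    """P(A_i): the first i-1 rounds are all misses, then P1 scores in round i."""
    return both_miss ** (i - 1) * p1

# The A_i are disjoint, so P(E) is the sum of a geometric series.
closed_form = p1 / (1 - both_miss)
partial_sum = sum(prob_A(i) for i in range(1, 41))

print(closed_form)         # 2/3
print(float(partial_sum))
```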
Conditional Probability Very Important Concept P(A|B) is the “fraction of occurrences of B in which A also occurs” –P(A|B) = P(A ∩ B)/P(B); P(B) > 0 For a fixed B, P(.|B) is a probability –Therefore if A1 and A2 are disjoint then –P(A1 ∪ A2 | B) = P(A1|B) + P(A2|B) Note: in general P(A|B ∪ C) ≠ P(A|B) + P(A|C). Also, in general P(A|B) ≠ P(B|A).
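These properties can be tried out on a small finite sample space. The sketch below assumes a fair six-sided die; the helper `cond_prob` and the events `even` and `high` are illustrative names.

```python
from fractions import Fraction

space = {1, 2, 3, 4, 5, 6}  # one throw of a fair die (assumed)

def prob(event):
    return Fraction(len(event & space), len(space))

def cond_prob(a, b):
    """P(A|B) = P(A ∩ B) / P(B); undefined when P(B) = 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / prob(b)

even = {2, 4, 6}
high = {5, 6}

print(cond_prob(even, high))  # 1/2
print(cond_prob(high, even))  # 1/3

# Additivity for disjoint A1, A2 under a fixed conditioning event B.
A1, A2, B = {2}, {4}, even
assert cond_prob(A1 | A2, B) == cond_prob(A1, B) + cond_prob(A2, B)
```

The two printed values differ, which is a concrete instance of P(A|B) ≠ P(B|A).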
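Standard Example [Table: joint probabilities of test result (+, −) against disease status (D, Dc); values not recovered] Suppose a test is positive. What is the probability of disease? D denotes disease and Dc no disease; + and − denote a positive and a negative test.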
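A minimal Python sketch of how P(D | +) follows from a joint table. The four probabilities below are hypothetical values chosen for illustration; they are not the figures from this example.

```python
# Hypothetical joint probabilities P(test result, disease status).
# These numbers are illustrative only; they are not taken from the slide.
joint = {
    ("+", "D"): 0.009,
    ("+", "Dc"): 0.099,
    ("-", "D"): 0.001,
    ("-", "Dc"): 0.891,
}

# Marginal probability of a positive test: sum over disease status.
p_positive = joint[("+", "D")] + joint[("+", "Dc")]

# Conditional probability of disease given a positive test.
p_disease_given_positive = joint[("+", "D")] / p_positive

print(round(p_disease_given_positive, 3))  # 0.083
```

With these assumed numbers, most positive tests come from the much larger healthy group, so the probability of disease given a positive result stays below 10%.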
Standard Data Mining Example Suppose the transaction data closely resembles the behaviour of the population at large. What is the chance that those who buy Diapers will also buy Beer? P(Beer | Diaper) = P(Diaper ∩ Beer)/P(Diaper) = 0.6/0.8 = 0.75 Is Diaper an event?
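The calculation can be reproduced on a small transaction list. The five transactions below are hypothetical, chosen only so that P(Diaper) = 0.8 and P(Diaper ∩ Beer) = 0.6 as in the text; the helper name `support` is introduced here. In this sketch "Diaper" is treated as the event "the set of transactions that contain Diaper".

```python
from fractions import Fraction

# Hypothetical transactions, chosen so that P(Diaper) = 0.8 and
# P(Diaper ∩ Beer) = 0.6 as in the text; not the original data.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return Fraction(hits, len(transactions))

# P(Beer | Diaper) = P(Diaper ∩ Beer) / P(Diaper)
p_beer_given_diaper = support({"Diaper", "Beer"}) / support({"Diaper"})
print(p_beer_given_diaper)  # 3/4
```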
Conditional Independence If A and B are independent then P(A|B) = P(A). In general, P(AB) = P(A|B)P(B). Law of Total Probability: if A_1, ..., A_k partition the sample space, then P(B) = Σ_i P(B|A_i)P(A_i).
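Bayes Theorem If A_1, ..., A_k partition the sample space with each P(A_i) > 0, and P(B) > 0, then P(A_i|B) = P(B|A_i)P(A_i) / Σ_j P(B|A_j)P(A_j)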
Question 1 Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?
Gender | % of credit card holders | % of gender who default
Male   | 60                       | 55
Female | 40                       | 35
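Answer to Question 1 P(G=F | D=Y) = P(D=Y | G=F)P(G=F) / [P(D=Y | G=F)P(G=F) + P(D=Y | G=M)P(G=M)] = (0.35 × 0.40) / (0.35 × 0.40 + 0.55 × 0.60) = 0.14/0.47 ≈ 0.298 But what do G=F and D=Y mean? We have not even formally defined them.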
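The answer can be checked numerically with a short Python sketch that applies the law of total probability and then Bayes Theorem to the figures in the Question 1 table. The dictionary names `prior`, `default_rate` and `posterior` are illustrative.

```python
from fractions import Fraction

# Figures from the Question 1 table, as fractions.
prior = {"Male": Fraction(60, 100), "Female": Fraction(40, 100)}
default_rate = {"Male": Fraction(55, 100), "Female": Fraction(35, 100)}

# Law of total probability: P(default) summed over genders.
p_default = sum(default_rate[g] * prior[g] for g in prior)

# Bayes Theorem: P(gender | default) for each gender.
posterior = {g: default_rate[g] * prior[g] / p_default for g in prior}

print(posterior["Female"])         # 14/47
print(float(posterior["Female"]))
```

Although women make up 40% of card holders, they account for about 30% of defaulters because their default rate is lower.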