1 Mining surprising patterns using temporal description length
Soumen Chakrabarti (IIT Bombay), Sunita Sarawagi (IIT Bombay), Byron Dom (IBM Almaden)

2 Market basket mining algorithms
- Find prevalent rules that hold over large fractions of data
- Useful for promotions and store arrangement
- Intensively researched since 1990
("Milk and cereal sell together!")

3 Prevalent ≠ Interesting
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena
("Milk and cereal sell together!" now earns only a "Zzzz..." from the analyst.)

4 What makes a rule surprising?
- Does not match prior expectation
  - Correlation between milk and cereal remains roughly constant over time
- Cannot be trivially derived from simpler rules
  - Milk 10%, cereal 10%
  - Milk and cereal 10% … surprising (independence predicts only 1%)
  - Eggs 10%
  - Milk, cereal and eggs 0.1% … surprising! (expected 1%)
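
To make that arithmetic concrete, here is a minimal sketch in plain Python, using the numbers from this slide (the ratios printed are illustrative, not a score defined by the paper):

```python
# Expected joint support under independence is the product of marginals.
p_milk, p_cereal, p_eggs = 0.10, 0.10, 0.10

# Milk and cereal seen together in 10% of baskets, 1% expected: surprising.
expected_mc = p_milk * p_cereal            # 0.01
observed_mc = 0.10
print(observed_mc / expected_mc)           # 10x more than expected

# Milk, cereal and eggs seen in 0.1% of baskets, 1% expected: surprising.
expected_mce = observed_mc * p_eggs        # 0.01
observed_mce = 0.001
print(observed_mce / expected_mce)         # 10x less than expected
```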

5 Two views on data mining
[Diagram 1] Data → Mining Program → Discovery
[Diagram 2] Data + Model of Analyst's Knowledge of the Data → Mining Program → Discovery → Analyst

6 Our contributions
- A new notion of surprising patterns
  - Detect changes in correlation along time
  - Filter out steady, uninteresting correlations
- Algorithms to mine for surprising patterns
  - Encode data into bit streams using two models
  - Surprise = difference in number of bits needed
- Experimental results
  - Demonstrate superiority over prevalent patterns

7 A simpler problem: one item
- Milk-buying habits modeled by a biased coin
- Customer tosses this coin to decide whether to buy milk
  - Head or "1" denotes "basket contains milk"
  - Coin bias is Pr[milk]
- Analyst wants to study Pr[milk] along time
  - Single coin with fixed bias is not interesting
  - Changes in bias are interesting

8 The coin segmentation problem
- Players A and B
- A has a set of coins with different biases
- A repeatedly:
  - Picks an arbitrary coin
  - Tosses it an arbitrary number of times
- B observes H/T
- B guesses transition points and biases
[Figure: A picks, tosses, and returns coins; B watches only the outcomes]
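
The game is easy to simulate. Below is a small sketch of player A's side (the function name and segment choices are mine, not the paper's); it produces test data for the segmentation sketches on the later slides:

```python
import random

def generate_tosses(segments, seed=0):
    """Player A's side of the game: for each (bias, n_tosses) pair,
    pick a coin with that bias and toss it n_tosses times.
    Player B sees only the concatenated 0/1 outcomes."""
    rng = random.Random(seed)
    tosses = []
    for bias, n in segments:
        tosses.extend(1 if rng.random() < bias else 0 for _ in range(n))
    return tosses

# Three hidden coins, echoing the 1/4, 5/7, 1/3 example on the next slide.
data = generate_tosses([(0.25, 40), (5 / 7, 70), (1 / 3, 30)])
```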

9 How to explain the data
- Given n head/tail observations
  - Can assume n different coins with bias 0 or 1: data fits perfectly (with probability one), but many coins are needed
  - Or assume one coin: may fit the data poorly
- "Best explanation" is a compromise
[Figure: three segments with biases 1/4, 5/7, 1/3]

10 Coding examples
- Sequence of k zeroes
  - Naïve encoding takes k bits
  - Run length takes about log k bits
- 1000 bits, 10 randomly placed 1's, rest 0's
  - Posit a coin with bias 0.01
  - Data encoding cost (Shannon's theorem) is -10·log2(0.01) - 990·log2(0.99) ≈ 81 bits
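
As a sketch of that computation, assuming the standard Shannon code length (each observation costs -log2 of the probability the model assigns it):

```python
from math import log2

def data_cost_bits(n_ones, n_zeros, p_one):
    """Shannon code length for a Bernoulli(p_one) stream:
    each 1 costs -log2(p_one) bits, each 0 costs -log2(1 - p_one)."""
    return -n_ones * log2(p_one) - n_zeros * log2(1 - p_one)

# The slide's example: 1000 bits, 10 ones, posited bias 0.01.
print(data_cost_bits(10, 990, 0.01))   # ~80.8 bits, far below the naive 1000
```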

11 How to find optimal segments
[Figure: a sequence of 17 tosses and its derived graph with 18 nodes, one per toss boundary]
- Edge cost = model cost + data cost
- Model cost = one node ID + one Pr[head]
- Data cost for Pr[head] = 5/7: 5 heads, 2 tails
- The optimal segmentation is the shortest path through this graph
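
A direct way to realize this is dynamic programming over the toss-boundary graph. The sketch below is a simplification under stated assumptions: `param_bits` is a hypothetical stand-in for the slide's exact model cost (one node ID plus one stored Pr[head]), and each edge pays the data cost of its segment under that segment's maximum-likelihood bias:

```python
from math import log2

def ml_data_cost(ones, zeros):
    """Data cost of a segment under its maximum-likelihood bias."""
    n = ones + zeros
    cost = 0.0
    if ones:
        cost -= ones * log2(ones / n)
    if zeros:
        cost -= zeros * log2(zeros / n)
    return cost

def segment(tosses, param_bits=8.0):
    """Shortest path over the toss-boundary DAG: nodes 0..T, where edge
    (i, j) encodes tosses[i:j] with a single coin."""
    T = len(tosses)
    prefix = [0] * (T + 1)                 # prefix[i] = heads among tosses[:i]
    for i, t in enumerate(tosses):
        prefix[i + 1] = prefix[i] + t
    best = [0.0] + [float("inf")] * T      # best[j] = cheapest path 0 -> j
    back = [0] * (T + 1)
    for j in range(1, T + 1):
        for i in range(j):
            ones = prefix[j] - prefix[i]
            c = best[i] + param_bits + ml_data_cost(ones, (j - i) - ones)
            if c < best[j]:
                best[j], back[j] = c, i
    cuts, j = [], T                        # walk back-pointers for boundaries
    while j > 0:
        cuts.append(j)
        j = back[j]
    return best[T], sorted(cuts)

# e.g. with the simulated data above: cost, cuts = segment(data)
```

This exact version examines all O(T^2) edges, which motivates the approximation on the next slide.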

12 Approximate shortest path
- Suppose there are T tosses
- Make T^(1-ε) chunks, each with T^ε nodes (tune ε)
- Find shortest paths within chunks
- Some nodes are chosen in each chunk
- Solve a shortest path over all chosen nodes
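
A sketch of the two-level scheme, reusing `segment()` and `ml_data_cost()` from the previous snippet; the chunking policy and the `eps` default are my assumptions, not the paper's exact choices:

```python
def segment_approx(tosses, eps=0.5, param_bits=8.0):
    """Cut the T tosses into ~T**(1-eps) chunks of ~T**eps tosses,
    segment each chunk exactly, then run the exact DP once more over
    only the cut points the chunks proposed."""
    T = len(tosses)
    chunk = max(2, round(T ** eps))
    nodes = {0, T}
    for s in range(0, T, chunk):                    # pass 1: per-chunk cuts
        _, cuts = segment(tosses[s:s + chunk], param_bits)
        nodes.update(s + c for c in cuts)
    nodes = sorted(nodes)
    prefix = [0]
    for t in tosses:
        prefix.append(prefix[-1] + t)
    best = [0.0] + [float("inf")] * (len(nodes) - 1)
    for b in range(1, len(nodes)):                  # pass 2: restricted DP
        for a in range(b):
            i, j = nodes[a], nodes[b]
            ones = prefix[j] - prefix[i]
            c = best[a] + param_bits + ml_data_cost(ones, (j - i) - ones)
            best[b] = min(best[b], c)
    return best[-1]
```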

13 Two or more items
- "Unconstrained" segmentation
  - k items induce a 2^k-sided coin
  - "milk and cereal" = 11, "milk, not cereal" = 10, "neither" = 00, etc.
- Shortest path finds significant shifts in any of the coin face probabilities
- Problem: some of these shifts may be completely explained by lower-order marginals
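
Concretely, each basket maps to one face of the 2^k-sided coin by reading item presence as a bit vector; a hypothetical helper:

```python
def basket_to_face(basket, items):
    """One face of the 2**k-sided coin: item i present sets bit k-1-i,
    so with items = ("milk", "cereal") the faces match the slide:
    both present = 0b11, milk only = 0b10, neither = 0b00."""
    k = len(items)
    return sum(1 << (k - 1 - i) for i, item in enumerate(items) if item in basket)

faces = [basket_to_face(b, ("milk", "cereal"))
         for b in ({"milk"}, {"milk", "cereal"}, set())]
# faces == [0b10, 0b11, 0b00]
```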

14 Example
- A drop in the joint sale of milk and cereal is completely explained by a drop in the sale of milk
- Pr[milk & cereal] / (Pr[milk] · Pr[cereal]) remains constant over time
- Call this ratio ρ
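
A quick numeric check of why ρ filters out such shifts (the counts are invented for illustration):

```python
def rho(n_baskets, n_milk, n_cereal, n_both):
    """rho = Pr[milk & cereal] / (Pr[milk] * Pr[cereal])."""
    return (n_both / n_baskets) / ((n_milk / n_baskets) * (n_cereal / n_baskets))

# Halving milk sales halves the joint support, but rho is unchanged:
print(rho(1000, 100, 100, 30))   # 3.0
print(rho(1000, 50, 100, 15))    # 3.0
```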

15 Constant-ρ segmentation
- Compute a global ρ over all time
- All coins must have this common value of ρ
- Segment by constrained optimization
- Compare with unconstrained coding cost
ρ = observed support / support expected under independence
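
Under the constant-ρ constraint, each segment's four-sided coin is fully determined by its two marginals plus the single global ρ, so the constrained coding cost is straightforward to evaluate. A minimal sketch (function names are hypothetical, and the paper's constrained optimization is more involved than this):

```python
from math import log2

def constrained_cell_probs(p_milk, p_cereal, rho):
    """Face probabilities for one segment once the joint is pinned to
    rho * Pr[milk] * Pr[cereal]; only the marginals vary per segment."""
    p11 = rho * p_milk * p_cereal
    p10 = p_milk - p11               # milk without cereal
    p01 = p_cereal - p11             # cereal without milk
    p00 = 1.0 - p11 - p10 - p01      # neither
    return {0b11: p11, 0b10: p10, 0b01: p01, 0b00: p00}

def coding_cost(face_counts, face_probs):
    """Bits to encode the observed face counts under a face distribution."""
    return -sum(n * log2(face_probs[f]) for f, n in face_counts.items() if n)

# Surprise score: coding cost under the constant-rho model minus the
# unconstrained (per-segment maximum-likelihood) coding cost.
```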

16 Is all this really needed?
- Simpler alternative
  - Aggregate data into suitable time windows
  - Compute support, correlation, ρ, etc. in each window
  - Use a variance threshold to choose itemsets
- Pitfalls
  - Arbitrary choices: windows, thresholds
  - May miss fine detail
  - Over-sensitive to outliers
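
For comparison, this baseline is only a few lines; the window size and the variance score are exactly the arbitrary choices the slide warns about. A sketch over the bit-vector faces defined earlier:

```python
def windowed_rho_variance(faces, window):
    """Score an itemset by the variance of rho across fixed windows,
    using the face encoding above (0b10 = milk bit, 0b01 = cereal bit)."""
    rhos = []
    for s in range(0, len(faces) - window + 1, window):
        w = faces[s:s + window]
        n = len(w)
        p_milk = sum(1 for f in w if f & 0b10) / n
        p_cereal = sum(1 for f in w if f & 0b01) / n
        p_both = sum(1 for f in w if f == 0b11) / n
        if p_milk and p_cereal:
            rhos.append(p_both / (p_milk * p_cereal))
    if not rhos:
        return 0.0
    mean = sum(rhos) / len(rhos)
    return sum((r - mean) ** 2 for r in rhos) / len(rhos)
```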

17 … but no simpler
"Smoothing leads to an estimated trend that is descriptive rather than analytic or explanatory. Because it is not based on an explicit probabilistic model, the method cannot be treated rigorously in terms of mathematical statistics."
(T. W. Anderson, The Statistical Analysis of Time Series)

18 Experiments
- 2.8 million baskets over 7 years; average 2.62 items per basket
- Two algorithms
  - Complete MDL approach
  - MDL segmentation + statistical tests (MStat)
- Anecdotes
  - MDL effective at penalizing obvious itemsets

19 Quality of approximation

20 Little agreement in itemset ranks
- Simpler methods do not approximate MDL

21 MDL has high selectivity
- Scores of the best itemsets stand out from the rest under MDL

22 Three anecdotes
[Plots: ρ against time for three itemset pairs]
- High MStat score
  - Small marginals
  - Polo shirt & shorts
- High correlation
  - Small % variation
  - Bedsheets & pillow cases
- High MDL score
  - Significant gradual drift
  - Men's & women's shorts

23 Conclusion
- New notion of surprising patterns based on
  - Joint support expected from marginals
  - Variation of joint support along time
- Robust MDL formulation
- Efficient algorithms
  - Near-optimal segmentation using shortest path
  - Pruning criteria
- Successful application to real data