Computational and Statistical Issues in Data-Mining


1 Computational and Statistical Issues in Data-Mining
Yoav Freund, Banter Inc.

2 Plan of talk
Two large-scale classification problems.
Generative versus predictive modeling.
Boosting.
Applications of boosting.
Computational issues in data-mining.

3 AT&T customer classification
Freund, Mason, Rogers, Pregibon, Cortes (2000).
Distinguish business from residence customers.
Classification is unavailable for about 30% of known customers.
Calculate a "Buizocity" score using statistics from call-detail records.
Records contain: calling number, called number, time of day, length of call; kept for 3 years for fraud purposes.

4 Massive datasets
260 million calls per day; 230 million telephone numbers to be classified.
Call detail: originating, called, and terminating # (e.g., for 800 numbers or forwarding), start and end time, termination code (e.g., for wireless), quality; no rate information.
Statistics: number of calls, distribution within the day and the week, number of #'s called, type (800, toll, etc.), incoming vs. outgoing.

5 Paul Viola's face recognizer
(Slide shows example face and non-face training patches.)
Training data: 5,000 faces plus non-face examples.
This situation, where negative examples are essentially free, is actually quite common.

6 Application of face detector
Many uses: user interfaces, interactive agents, security systems, video compression, image database analysis.
Here is the basic problem: scan the image and find all faces, labelling them with their scale and location.
This is actually the output of our system; perhaps we have gotten a bit lucky, but this sort of performance is typical.
Note that the system is limited to frontal upright faces; more general poses are work in progress.

7 Generative vs. Predictive models

8 Toy Example
A computer receives a telephone call, measures the pitch of the voice, and decides the gender of the caller (male or female human voice).

9 Generative modeling
(Figure: two Gaussian class-conditional densities, with parameters mean1, var1 and mean2, var2, plotted as probability against voice pitch.)
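To make the toy example concrete, here is a minimal sketch (mine, not from the talk) of the generative approach: fit one Gaussian per class to voice-pitch samples and classify a new pitch by comparing class-conditional likelihoods. The sample values and variable names are purely illustrative.

```python
import math

def fit_gaussian(samples):
    """Estimate the mean and variance of one class's pitch samples."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

def log_likelihood(x, mean, var):
    """Log of the Gaussian density N(mean, var) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def classify_pitch(x, male_params, female_params):
    """Pick the class whose fitted Gaussian gives x the higher likelihood."""
    if log_likelihood(x, *male_params) >= log_likelihood(x, *female_params):
        return "male"
    return "female"

# Illustrative pitch measurements (Hz); real data would come from the calls.
male_pitches = [110, 120, 130, 125, 118]
female_pitches = [200, 210, 190, 205, 215]
male_params = fit_gaussian(male_pitches)
female_params = fit_gaussian(female_pitches)
print(classify_pitch(150, male_params, female_params))
```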

10 Discriminative approach
(Figure: number of training mistakes as a function of the voice-pitch threshold.)
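For contrast with the generative sketch above, a minimal sketch of the discriminative approach to the same toy problem: rather than modeling the two densities, simply pick the pitch threshold that makes the fewest mistakes on the training data. Names and sample values are illustrative.

```python
def best_threshold(samples):
    """samples: list of (pitch, label) with label +1 or -1.
    Return the threshold minimizing training mistakes for the rule
    'predict +1 if pitch > threshold'."""
    candidates = sorted({p for p, _ in samples})
    best_t, best_mistakes = None, len(samples) + 1
    for t in candidates:
        mistakes = sum(1 for p, y in samples
                       if (1 if p > t else -1) != y)
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t, best_mistakes

# Illustrative data: low pitches labeled -1 (male), high pitches +1 (female).
data = [(110, -1), (120, -1), (130, -1), (200, +1), (210, +1), (190, +1)]
print(best_threshold(data))
```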

11 Ill-behaved data
(Figure: ill-behaved pitch data for which the two fitted Gaussians, with means mean1 and mean2, summarize the probability of voice pitch poorly; the number-of-mistakes curve is shown for comparison.)

12 Traditional Statistics vs. Machine Learning
(Diagram: the traditional route goes Data -> Statistics -> Estimated world state -> Decision Theory -> Actions; machine learning aims to go from Data directly to Predictions/Actions.)

13 Comparison of methodologies

Model                 Generative              Discriminative
Goal                  Probability estimates   Classification rule
Performance measure   Likelihood              Misclassification rate
Mismatch problems     Outliers                Misclassifications

14 Boosting

15 A weak learner
A weak learner takes a weighted training set (x1,y1,w1), (x2,y2,w2), ..., (xn,yn,wn) and returns a weak rule h.
The instances x1, x2, ..., xn are feature vectors, the labels y1, y2, ..., yn are binary, and the weights are non-negative and sum to 1.
The weak requirement: h must have some advantage over random guessing on the weighted training set (weighted error strictly below 1/2).
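As one illustration of a weak learner that fits this interface, here is a minimal decision-stump sketch (my own, not the talk's): it searches one threshold per feature and returns the rule with the smallest weighted training error. It assumes feature vectors are lists of numbers and labels are +1 or -1.

```python
def stump_weak_learner(X, y, w):
    """X: list of feature vectors, y: list of +/-1 labels,
    w: non-negative weights summing to 1.
    Return ((feature index, threshold, sign), weighted error) for the rule
    h(x) = sign * (+1 if x[feature] > threshold else -1)."""
    n_features = len(X[0])
    best = (0, X[0][0], 1)
    best_err = float("inf")
    for j in range(n_features):
        for t in sorted({x[j] for x in X}):
            for s in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if s * (1 if xi[j] > t else -1) != yi)
                if err < best_err:
                    best, best_err = (j, t, s), err
    return best, best_err

def stump_predict(stump, x):
    """Evaluate a stump rule on a single feature vector."""
    j, t, s = stump
    return s * (1 if x[j] > t else -1)
```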

16 The boosting process
The weak learner is called repeatedly: first on the uniformly weighted set (x1,y1,1/n), ..., (xn,yn,1/n) to get h1, then on reweighted sets (x1,y1,w1), ..., (xn,yn,wn) to get h2, h3, ..., hT.
Final rule: Sign[ a1 h1 + a2 h2 + ... + aT hT ].
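The process sketched above corresponds to the usual AdaBoost recipe; here is a compact sketch (mine) that makes the coefficients a_t and the reweighting step explicit. It takes the weak learner as a parameter, so it can be used with the stump sketch from the previous slide.

```python
import math

def adaboost(X, y, rounds, weak_learner, predict):
    """Return a list of (coefficient, rule) pairs; the final classifier is
    the sign of the weighted sum of the weak rules' predictions."""
    n = len(X)
    w = [1.0 / n] * n                           # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        h, err = weak_learner(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against 0 or 1
        a = 0.5 * math.log((1 - err) / err)     # coefficient a_t
        ensemble.append((a, h))
        # Increase the weights of misclassified examples, decrease the rest.
        w = [wi * math.exp(-a * yi * predict(h, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]            # renormalize to sum to 1
    return ensemble

def final_rule(ensemble, predict, x):
    """Sign[ a1 h1(x) + ... + aT hT(x) ]."""
    s = sum(a * predict(h, x) for a, h in ensemble)
    return 1 if s >= 0 else -1
```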

17 Main properties of AdaBoost
If the advantages of the weak rules over random guessing are g1, g2, ..., gT, then the in-sample error of the final rule (w.r.t. the initial weights) is at most the exponential bound shown below.
Even after the in-sample error reaches zero, additional boosting iterations usually improve the out-of-sample error. [Schapire, Freund, Bartlett, Lee, Ann. Stat. 1998]
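For reference, the bound being referred to is the standard AdaBoost training-error bound; writing $g_t$ for the advantage of the $t$-th weak rule over random guessing, it reads:

```latex
\mathrm{err}_{\mathrm{train}}\!\left(\operatorname{sign}\sum_{t=1}^{T} a_t h_t\right)
\;\le\; \prod_{t=1}^{T}\sqrt{1-4g_t^{2}}
\;\le\; \exp\!\Big(\!-2\sum_{t=1}^{T} g_t^{2}\Big).
```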

18 What is a good weak learner?
The set of weak rules (features) should be:
Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
Small enough to allow exhaustive search for the rule with minimal weighted training error.
Small enough to avoid over-fitting.
Able to calculate the predicted label very efficiently.
Rules can be "specialists": they predict only on a small subset of the input space and abstain from predicting on the rest (output 0).

19 Unique Binary Features
For real problems, results are only as good as the features used; this is the main piece of ad-hoc (or domain) knowledge.
Rather than the raw pixels, we selected a very large set of simple functions that are sensitive to edges and other critical features of the image, at multiple scales.
Since the final classifier is a perceptron, it is important that the features be non-linear; otherwise the final classifier is itself a simple perceptron.
We introduce a threshold to yield binary features.
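To illustrate the kind of simple, edge-sensitive functions described here, a minimal sketch (assumptions mine, not the talk's exact features): precompute an integral image so any rectangle sum costs four lookups, take the difference of two adjacent rectangles as a feature, and threshold it to obtain a binary feature.

```python
def integral_image(img):
    """img: 2-D list of pixel intensities. Return the summed-area table ii
    with ii[y][x] = sum of img over rows <= y and columns <= x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle, using four lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total

def two_rectangle_feature(ii, top, left, height, width):
    """Difference between adjacent left/right rectangles (width >= 2):
    responds to vertical edges; varying height/width gives multiple scales."""
    mid = left + width // 2
    left_sum = rect_sum(ii, top, left, top + height - 1, mid - 1)
    right_sum = rect_sum(ii, top, mid, top + height - 1, left + width - 1)
    return right_sum - left_sum

def binary_feature(ii, top, left, height, width, threshold):
    """Threshold the real-valued feature to yield a binary (+/-1) feature."""
    return 1 if two_rectangle_feature(ii, top, left, height, width) > threshold else -1
```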

20 Example Classifier for Face Detection
A classifier with 200 rectangle features was learned using AdaBoost.
95% correct detection on the test set with 1 in 14084 false positives. Not quite competitive...
(Figure: ROC curve for the 200-feature classifier.)

21 Alternating Trees
Joint work with Llew Mason.

22 Decision Trees
(Figure: a small decision tree over features X and Y, with tests X>3 and Y>5 and leaf predictions of +1 or -1, shown alongside the corresponding partition of the X-Y plane.)

23 Decision tree as a sum
(Figure: the same tree redrawn as a sum of real-valued contributions such as +0.1, -0.1, +0.2, -0.2, -0.3 attached to the nodes for the tests X>3 and Y>5; the prediction is the sign of the sum accumulated along the path.)

24 An alternating decision tree
(Figure: an alternating decision tree over X and Y with decision nodes Y<1, X>3 and Y>5 and prediction values such as +0.7, +0.1, +0.2, -0.1, -0.3, 0.0; the output is the sign of the sum of all prediction values on the paths that apply.)
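A minimal sketch (mine) of how prediction works in an alternating decision tree: every prediction node contributes a real value, a decision node is only followed on the side of its test that holds, and the final label is the sign of the total. The node values below are illustrative, loosely following the figure.

```python
class ADTreeNode:
    """A prediction node: a real value plus a list of splitters, each a
    (test function, subtree if test true, subtree if test false) tuple."""
    def __init__(self, value, splitters=None):
        self.value = value
        self.splitters = splitters or []

def adtree_score(node, x):
    """Sum the prediction values along every path that x satisfies."""
    total = node.value
    for test, if_true, if_false in node.splitters:
        child = if_true if test(x) else if_false
        total += adtree_score(child, x)
    return total

def adtree_predict(root, x):
    return 1 if adtree_score(root, x) >= 0 else -1

# Illustrative tree with decision nodes testing Y<1 and X>3 (values made up).
root = ADTreeNode(0.5, [
    (lambda x: x["Y"] < 1, ADTreeNode(-0.5), ADTreeNode(0.2)),
    (lambda x: x["X"] > 3,
     ADTreeNode(0.1, [(lambda x: x["Y"] > 5, ADTreeNode(-0.3), ADTreeNode(0.2))]),
     ADTreeNode(-0.2)),
])
print(adtree_predict(root, {"X": 4, "Y": 2}))
```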

25 Example: Medical Diagnostics
"Cleve" dataset from the UC Irvine database.
Heart disease diagnostics (+1 = healthy, -1 = sick).
13 features from tests (real-valued and discrete).
303 instances.

26 ADtree for the Cleveland heart-disease diagnostics problem

27 Cross-validated accuracy

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                -
Boost Stumps         16                 16.5%                0.8%

28 Alternating tree for “buizocity”

29 Alternating Tree (Detail)

30 Precision/recall graphs
(Figure: accuracy plotted against score.)

31 "Drinking out of a fire hose"
Allan Wilks, 1997.

32 Massive distributed data streams
(Diagram: front-end systems (cashier's system, telephone switch, web server, web camera) feed data aggregation into a "data warehouse", which in turn feeds analytics.)

33 The database bottleneck
Physical limit: a disk "seek" takes 0.01 sec, the same time it takes to read/write 10^5 bytes or to perform 10^7 CPU operations.
Commercial DBMSs are optimized for varying queries and transactions.
Classification tasks require evaluation of fixed queries on massive data streams.

34 Working with large flat files
Sort the file according to X ("called telephone number"); this can be done very efficiently even for very large files.
Counting occurrences then becomes efficient, because all records for a given X appear in the same disk block.
Randomly permute records, done by sorting on a random number; reading k consecutive records then suffices to estimate a few statistics for a few decisions (e.g., splitting a node in a decision tree). A sketch of this trick follows below.
"Hancock": a system for efficient computation of statistical signatures for data streams.
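A minimal sketch of the random-permutation trick mentioned above (file names and record format are illustrative): attach a random key to each record, sort on it, and then any prefix of k consecutive records is a uniform random sample.

```python
import random

def randomly_permute(in_path, out_path):
    """Permute the records of a flat file by sorting on a random key.
    (For files too large for memory, the same idea works with an external
    sort utility; this sketch keeps everything in memory.)"""
    with open(in_path) as f:
        records = f.readlines()
    keyed = [(random.random(), rec) for rec in records]
    keyed.sort()                      # sorting on the random key = random permutation
    with open(out_path, "w") as f:
        f.writelines(rec for _, rec in keyed)

def head_sample(path, k):
    """After permutation, the first k consecutive records form a random sample."""
    sample = []
    with open(path) as f:
        for _ in range(k):
            line = f.readline()
            if not line:
                break
            sample.append(line.rstrip("\n"))
    return sample
```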

35 Working with data streams
"You get to see each record only once."
Example problem: identify the 10 most popular items for each retail-chain customer over the last 12 months (one possible approach is sketched below).
To learn more: Stanford's Stream Dream Team.
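The talk does not give an algorithm for this; one standard one-pass, bounded-memory way to approximate per-customer top-10 items is a Misra-Gries frequent-items sketch, shown below. The record format and parameter values are illustrative assumptions.

```python
from collections import defaultdict

def misra_gries_update(counters, item, capacity):
    """Keep at most `capacity` counters; frequently seen items survive."""
    if item in counters:
        counters[item] += 1
    elif len(counters) < capacity:
        counters[item] = 1
    else:
        # Table full: decrement every counter and drop those that reach zero.
        for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]

def top_items_per_customer(stream, capacity=50, k=10):
    """stream yields (customer_id, item_id) pairs, each seen only once.
    Return an approximate top-k item list per customer."""
    per_customer = defaultdict(dict)
    for customer, item in stream:
        misra_gries_update(per_customer[customer], item, capacity)
    return {c: sorted(cnts, key=cnts.get, reverse=True)[:k]
            for c, cnts in per_customer.items()}

# Example: transactions = [("cust1", "milk"), ("cust1", "bread"), ...]
```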

36 Analyzing at the source
(Diagram: the analytics side generates JAVA code, which the front-end systems download; the front ends aggregate statistics locally and upload them back to analytics.)

37 Learn Slowly, Predict Fast!
Buizocity: 10,000 instances are sufficient for learning, but 300,000,000 have to be labeled (weekly).
Generate the ADTree classifier in C, compile it, and run it using Hancock (a code-generation sketch follows below).
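The "generate the classifier in C" step can be viewed as straightforward code generation. Below is a hedged Python sketch (not the actual Banter/Hancock tooling), using a boosted-stump ensemble as a simpler stand-in for the ADTree: it emits a standalone C scoring function that can be compiled and run over the stream.

```python
def emit_c_scorer(ensemble, n_features):
    """ensemble: list of (coefficient, feature_index, threshold, sign) stumps.
    Emit C source for `int classify(const double x[])` returning +1 or -1."""
    lines = [
        "/* Auto-generated classifier: do not edit by hand. */",
        "int classify(const double x[%d]) {" % n_features,
        "    double score = 0.0;",
    ]
    for coef, j, t, s in ensemble:
        # Each stump contributes coef * (+sign if x[j] > threshold else -sign).
        lines.append("    score += %r * (x[%d] > %r ? %d : %d);"
                     % (coef, j, t, s, -s))
    lines.append("    return score >= 0.0 ? 1 : -1;")
    lines.append("}")
    return "\n".join(lines)

# Example with two illustrative stumps over an 8-feature record.
print(emit_c_scorer([(0.8, 3, 12.5, 1), (0.4, 0, 100.0, -1)], n_features=8))
```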

38 Paul Viola's face detector
Scan 50,000 location/scale boxes in each image, 15 images per sec., to detect a few faces.
The cascaded method minimizes average processing time (see the sketch below).
Training takes a day on a fast parallel machine.
(Diagram: each image box passes through Classifier 1, Classifier 2, Classifier 3, ...; a negative answer at any stage rejects the box as NON-FACE, and only boxes that pass every stage are labeled FACE.)
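A minimal sketch (mine) of the cascade idea in the diagram: cheap classifiers run first, and any stage that rejects a box stops further work on it, so the average cost per box stays small even though 50,000 boxes are scanned per image.

```python
def cascade_is_face(box_features, stages):
    """stages: list of (classifier, threshold) pairs ordered from cheapest to
    most expensive; each classifier maps the box's features to a score.
    A box is accepted as a face only if it passes every stage."""
    for classifier, threshold in stages:
        if classifier(box_features) < threshold:
            return False          # rejected early: most non-face boxes stop here
    return True

def detect_faces(boxes, stages):
    """Return the subset of candidate boxes accepted by the full cascade."""
    return [box for box in boxes if cascade_is_face(box, stages)]
```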

39 Summary
Generative vs. predictive methodology.
Boosting.
Alternating trees.
The database bottleneck.
Learning slowly, predicting fast.

40 Other work 1
Specialized data compression: when data is collected in small bins, most bins are empty; instead of storing the zeros, smart compression dramatically reduces data size.
Model averaging: boosting and bagging make classifiers more stable; we need theory that does not use Bayesian assumptions, which relates closely to margin-based analysis of boosting and of SVMs.
Zipf's law: the distribution of words in free text is extremely skewed; methods should scale exponentially in the entropy rather than linearly in the number of words.

41 Other work 2
Online methods: the data distribution changes with time; online refinement of the feature set; long-term learning.
Effective label collection: selective sampling to label only hard cases; comparing labels from different people to estimate reliability; co-training, where different channels train each other (Blum, Mitchell, McCallum).

42 Contact me!

