Computational and Statistical Issues in Data-Mining


1 Computational and Statistical Issues in Data-Mining
Yoav Freund, Banter Inc.

2 Plan of talk
Two large-scale classification problems.
Generative versus predictive modeling.
Boosting.
Applications of boosting.
Computational issues in data-mining.

3 AT&T customer classification
Freund, Mason, Rogers, Pregibon, Cortes (2000).
Distinguish business from residence customers.
Classification is unavailable for about 30% of known customers.
Calculate a "Buizocity" score using statistics from call-detail records.
Records contain: calling number, called number, time of day, length of call; kept for 3 years for fraud purposes.

4 Massive datasets
260 million calls per day; 230 million telephone numbers to be classified.
Call detail: originating, called, and terminating # (e.g., for 800 numbers or forwarding), start and end time, termination code (e.g., for wireless), quality; no rate information.
Statistics: number of calls, distribution within the day and the week, number of #'s called, type (800, toll, etc.), incoming vs. outgoing.

5 Paul Viola's face recognizer
(Slide shows example face and non-face training patches.)
Training data: 5,000 faces plus non-face examples.
This situation, where negative examples are essentially free, is actually quite common.

6 Application of face detector
Many uses: user interfaces, interactive agents, security systems, video compression, image database analysis.
Here is the basic problem: scan the image and find all faces, labelling them with their scale and location.
This is actually the output of our system; perhaps we have gotten a bit lucky, but this sort of performance is typical.
Note that the system is limited to frontal upright faces; more general poses are work in progress.

7 Generative vs. Predictive models

8 Toy Example
A computer receives a telephone call, measures the pitch of the voice, and decides the gender of the caller (male or female human voice).

9 Generative modeling
(Figure: two Gaussian class-conditional densities, with parameters mean1, var1 and mean2, var2, plotted as probability against voice pitch.)
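To make the toy example concrete, here is a minimal sketch (mine, not from the talk) of the generative approach: fit one Gaussian per class to voice-pitch samples and classify a new pitch by comparing class-conditional likelihoods. The sample values and variable names are purely illustrative.

```python
import math

def fit_gaussian(samples):
    """Estimate the mean and variance of one class's pitch samples."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

def log_likelihood(x, mean, var):
    """Log of the Gaussian density N(mean, var) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def classify_pitch(x, male_params, female_params):
    """Pick the class whose fitted Gaussian gives x the higher likelihood."""
    if log_likelihood(x, *male_params) >= log_likelihood(x, *female_params):
        return "male"
    return "female"

# Illustrative pitch measurements (Hz); real data would come from the calls.
male_pitches = [110, 120, 130, 125, 118]
female_pitches = [200, 210, 190, 205, 215]
male_params = fit_gaussian(male_pitches)
female_params = fit_gaussian(female_pitches)
print(classify_pitch(150, male_params, female_params))
```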

10 Discriminative approach
(Figure: number of training mistakes as a function of the voice-pitch threshold.)
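For contrast with the generative sketch above, a minimal sketch of the discriminative approach to the same toy problem: rather than modeling the two densities, simply pick the pitch threshold that makes the fewest mistakes on the training data. Names and sample values are illustrative.

```python
def best_threshold(samples):
    """samples: list of (pitch, label) with label +1 or -1.
    Return the threshold minimizing training mistakes for the rule
    'predict +1 if pitch > threshold'."""
    candidates = sorted({p for p, _ in samples})
    best_t, best_mistakes = None, len(samples) + 1
    for t in candidates:
        mistakes = sum(1 for p, y in samples
                       if (1 if p > t else -1) != y)
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t, best_mistakes

# Illustrative data: low pitches labeled -1 (male), high pitches +1 (female).
data = [(110, -1), (120, -1), (130, -1), (200, +1), (210, +1), (190, +1)]
print(best_threshold(data))
```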

11 Ill-behaved data
(Figure: ill-behaved pitch data for which the two fitted Gaussians, with means mean1 and mean2, summarize the probability of voice pitch poorly; the number-of-mistakes curve is shown for comparison.)

12 Traditional Statistics vs. Machine Learning
(Diagram: the traditional route goes Data -> Statistics -> Estimated world state -> Decision Theory -> Actions; machine learning aims to go from Data directly to Predictions/Actions.)

13 Comparison of methodologies

Model                 Generative              Discriminative
Goal                  Probability estimates   Classification rule
Performance measure   Likelihood              Misclassification rate
Mismatch problems     Outliers                Misclassifications

14 Boosting

15 A weak learner
A weak learner takes a weighted training set (x1,y1,w1), (x2,y2,w2), ..., (xn,yn,wn) and returns a weak rule h.
The instances x1, x2, ..., xn are feature vectors, the labels y1, y2, ..., yn are binary, and the weights are non-negative and sum to 1.
The weak requirement: h must have some advantage over random guessing on the weighted training set (weighted error strictly below 1/2).
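As one illustration of a weak learner that fits this interface, here is a minimal decision-stump sketch (my own, not the talk's): it searches one threshold per feature and returns the rule with the smallest weighted training error. It assumes feature vectors are lists of numbers and labels are +1 or -1.

```python
def stump_weak_learner(X, y, w):
    """X: list of feature vectors, y: list of +/-1 labels,
    w: non-negative weights summing to 1.
    Return ((feature index, threshold, sign), weighted error) for the rule
    h(x) = sign * (+1 if x[feature] > threshold else -1)."""
    n_features = len(X[0])
    best = (0, X[0][0], 1)
    best_err = float("inf")
    for j in range(n_features):
        for t in sorted({x[j] for x in X}):
            for s in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if s * (1 if xi[j] > t else -1) != yi)
                if err < best_err:
                    best, best_err = (j, t, s), err
    return best, best_err

def stump_predict(stump, x):
    """Evaluate a stump rule on a single feature vector."""
    j, t, s = stump
    return s * (1 if x[j] > t else -1)
```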

16 The boosting process
The weak learner is called repeatedly: first on the uniformly weighted set (x1,y1,1/n), ..., (xn,yn,1/n) to get h1, then on reweighted sets (x1,y1,w1), ..., (xn,yn,wn) to get h2, h3, ..., hT.
Final rule: Sign[ a1 h1 + a2 h2 + ... + aT hT ].
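The process sketched above corresponds to the usual AdaBoost recipe; here is a compact sketch (mine) that makes the coefficients a_t and the reweighting step explicit. It takes the weak learner as a parameter, so it can be used with the stump sketch from the previous slide.

```python
import math

def adaboost(X, y, rounds, weak_learner, predict):
    """Return a list of (coefficient, rule) pairs; the final classifier is
    the sign of the weighted sum of the weak rules' predictions."""
    n = len(X)
    w = [1.0 / n] * n                           # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        h, err = weak_learner(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against 0 or 1
        a = 0.5 * math.log((1 - err) / err)     # coefficient a_t
        ensemble.append((a, h))
        # Increase the weights of misclassified examples, decrease the rest.
        w = [wi * math.exp(-a * yi * predict(h, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]            # renormalize to sum to 1
    return ensemble

def final_rule(ensemble, predict, x):
    """Sign[ a1 h1(x) + ... + aT hT(x) ]."""
    s = sum(a * predict(h, x) for a, h in ensemble)
    return 1 if s >= 0 else -1
```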

17 Main properties of AdaBoost
If the advantages of the weak rules over random guessing are g1, g2, ..., gT, then the in-sample error of the final rule (w.r.t. the initial weights) is at most the exponential bound shown below.
Even after the in-sample error reaches zero, additional boosting iterations usually improve the out-of-sample error. [Schapire, Freund, Bartlett, Lee, Ann. Stat. 1998]
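For reference, the bound being referred to is the standard AdaBoost training-error bound; writing $g_t$ for the advantage of the $t$-th weak rule over random guessing, it reads:

```latex
\mathrm{err}_{\mathrm{train}}\!\left(\operatorname{sign}\sum_{t=1}^{T} a_t h_t\right)
\;\le\; \prod_{t=1}^{T}\sqrt{1-4g_t^{2}}
\;\le\; \exp\!\Big(\!-2\sum_{t=1}^{T} g_t^{2}\Big).
```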

18 What is a good weak learner?
The set of weak rules (features) should be:
Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
Small enough to allow exhaustive search for the rule with minimal weighted training error.
Small enough to avoid over-fitting.
Able to calculate the predicted label very efficiently.
Rules can be "specialists": they predict only on a small subset of the input space and abstain from predicting on the rest (output 0).

19 Unique Binary Features
For real problems, results are only as good as the features used; this is the main piece of ad-hoc (or domain) knowledge.
Rather than the raw pixels, we selected a very large set of simple functions that are sensitive to edges and other critical features of the image, at multiple scales.
Since the final classifier is a perceptron, it is important that the features be non-linear; otherwise the final classifier is itself a simple perceptron.
We introduce a threshold to yield binary features.
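To illustrate the kind of simple, edge-sensitive functions described here, a minimal sketch (assumptions mine, not the talk's exact features): precompute an integral image so any rectangle sum costs four lookups, take the difference of two adjacent rectangles as a feature, and threshold it to obtain a binary feature.

```python
def integral_image(img):
    """img: 2-D list of pixel intensities. Return the summed-area table ii
    with ii[y][x] = sum of img over rows <= y and columns <= x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle, using four lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total

def two_rectangle_feature(ii, top, left, height, width):
    """Difference between adjacent left/right rectangles (width >= 2):
    responds to vertical edges; varying height/width gives multiple scales."""
    mid = left + width // 2
    left_sum = rect_sum(ii, top, left, top + height - 1, mid - 1)
    right_sum = rect_sum(ii, top, mid, top + height - 1, left + width - 1)
    return right_sum - left_sum

def binary_feature(ii, top, left, height, width, threshold):
    """Threshold the real-valued feature to yield a binary (+/-1) feature."""
    return 1 if two_rectangle_feature(ii, top, left, height, width) > threshold else -1
```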

20 Example Classifier for Face Detection
A classifier with 200 rectangle features was learned using AdaBoost.
95% correct detection on the test set with 1 in 14084 false positives. Not quite competitive...
(Figure: ROC curve for the 200-feature classifier.)

21 Alternating Trees
Joint work with Llew Mason.

22 Decision Trees
(Figure: a small decision tree over features X and Y, with tests X>3 and Y>5 and leaf predictions of +1 or -1, shown alongside the corresponding partition of the X-Y plane.)

23 Decision tree as a sum
(Figure: the same tree redrawn as a sum of real-valued contributions such as +0.1, -0.1, +0.2, -0.2, -0.3 attached to the nodes for the tests X>3 and Y>5; the prediction is the sign of the sum accumulated along the path.)

24 An alternating decision tree
(Figure: an alternating decision tree over X and Y with decision nodes Y<1, X>3 and Y>5 and prediction values such as +0.7, +0.1, +0.2, -0.1, -0.3, 0.0; the output is the sign of the sum of all prediction values on the paths that apply.)
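A minimal sketch (mine) of how prediction works in an alternating decision tree: every prediction node contributes a real value, a decision node is only followed on the side of its test that holds, and the final label is the sign of the total. The node values below are illustrative, loosely following the figure.

```python
class ADTreeNode:
    """A prediction node: a real value plus a list of splitters, each a
    (test function, subtree if test true, subtree if test false) tuple."""
    def __init__(self, value, splitters=None):
        self.value = value
        self.splitters = splitters or []

def adtree_score(node, x):
    """Sum the prediction values along every path that x satisfies."""
    total = node.value
    for test, if_true, if_false in node.splitters:
        child = if_true if test(x) else if_false
        total += adtree_score(child, x)
    return total

def adtree_predict(root, x):
    return 1 if adtree_score(root, x) >= 0 else -1

# Illustrative tree with decision nodes testing Y<1 and X>3 (values made up).
root = ADTreeNode(0.5, [
    (lambda x: x["Y"] < 1, ADTreeNode(-0.5), ADTreeNode(0.2)),
    (lambda x: x["X"] > 3,
     ADTreeNode(0.1, [(lambda x: x["Y"] > 5, ADTreeNode(-0.3), ADTreeNode(0.2))]),
     ADTreeNode(-0.2)),
])
print(adtree_predict(root, {"X": 4, "Y": 2}))
```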

25 Example: Medical Diagnostics
"Cleve" dataset from the UC Irvine database.
Heart disease diagnostics (+1 = healthy, -1 = sick).
13 features from tests (real-valued and discrete).
303 instances.

26 ADtree for the Cleveland heart-disease diagnostics problem

27 Cross-validated accuracy

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                -
Boost Stumps         16                 16.5%                0.8%

28 Alternating tree for “buizocity”

29 Alternating Tree (Detail)

30 Precision/recall graphs
(Figure: accuracy plotted against score.)

31 "Drinking out of a fire hose"
Allan Wilks, 1997.

32 Massive distributed data streams
(Diagram: front-end systems (cashier's system, telephone switch, web server, web camera) feed data aggregation into a "data warehouse", which in turn feeds analytics.)

33 The database bottleneck
Physical limit: a disk "seek" takes 0.01 sec, the same time it takes to read/write 10^5 bytes or to perform 10^7 CPU operations.
Commercial DBMSs are optimized for varying queries and transactions.
Classification tasks require evaluation of fixed queries on massive data streams.

34 Working with large flat files
Sort the file according to X ("called telephone number"); this can be done very efficiently even for very large files.
Counting occurrences then becomes efficient, because all records for a given X appear in the same disk block.
Randomly permute records, done by sorting on a random number; reading k consecutive records then suffices to estimate a few statistics for a few decisions (e.g., splitting a node in a decision tree). A sketch of this trick follows below.
"Hancock": a system for efficient computation of statistical signatures for data streams.
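A minimal sketch of the random-permutation trick mentioned above (file names and record format are illustrative): attach a random key to each record, sort on it, and then any prefix of k consecutive records is a uniform random sample.

```python
import random

def randomly_permute(in_path, out_path):
    """Permute the records of a flat file by sorting on a random key.
    (For files too large for memory, the same idea works with an external
    sort utility; this sketch keeps everything in memory.)"""
    with open(in_path) as f:
        records = f.readlines()
    keyed = [(random.random(), rec) for rec in records]
    keyed.sort()                      # sorting on the random key = random permutation
    with open(out_path, "w") as f:
        f.writelines(rec for _, rec in keyed)

def head_sample(path, k):
    """After permutation, the first k consecutive records form a random sample."""
    sample = []
    with open(path) as f:
        for _ in range(k):
            line = f.readline()
            if not line:
                break
            sample.append(line.rstrip("\n"))
    return sample
```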

35 Working with data streams
"You get to see each record only once."
Example problem: identify the 10 most popular items for each retail-chain customer over the last 12 months (one possible approach is sketched below).
To learn more: Stanford's Stream Dream Team.
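The talk does not give an algorithm for this; one standard one-pass, bounded-memory way to approximate per-customer top-10 items is a Misra-Gries frequent-items sketch, shown below. The record format and parameter values are illustrative assumptions.

```python
from collections import defaultdict

def misra_gries_update(counters, item, capacity):
    """Keep at most `capacity` counters; frequently seen items survive."""
    if item in counters:
        counters[item] += 1
    elif len(counters) < capacity:
        counters[item] = 1
    else:
        # Table full: decrement every counter and drop those that reach zero.
        for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]

def top_items_per_customer(stream, capacity=50, k=10):
    """stream yields (customer_id, item_id) pairs, each seen only once.
    Return an approximate top-k item list per customer."""
    per_customer = defaultdict(dict)
    for customer, item in stream:
        misra_gries_update(per_customer[customer], item, capacity)
    return {c: sorted(cnts, key=cnts.get, reverse=True)[:k]
            for c, cnts in per_customer.items()}

# Example: transactions = [("cust1", "milk"), ("cust1", "bread"), ...]
```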

36 Analyzing at the source
(Diagram: the analytics side generates JAVA code, which the front-end systems download; the front ends aggregate statistics locally and upload them back to analytics.)

37 Learn Slowly, Predict Fast!
Buizocity: 10,000 instances are sufficient for learning, but 300,000,000 have to be labeled (weekly).
Generate the ADTree classifier in C, compile it, and run it using Hancock (a code-generation sketch follows below).
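The "generate the classifier in C" step can be viewed as straightforward code generation. Below is a hedged Python sketch (not the actual Banter/Hancock tooling), using a boosted-stump ensemble as a simpler stand-in for the ADTree: it emits a standalone C scoring function that can be compiled and run over the stream.

```python
def emit_c_scorer(ensemble, n_features):
    """ensemble: list of (coefficient, feature_index, threshold, sign) stumps.
    Emit C source for `int classify(const double x[])` returning +1 or -1."""
    lines = [
        "/* Auto-generated classifier: do not edit by hand. */",
        "int classify(const double x[%d]) {" % n_features,
        "    double score = 0.0;",
    ]
    for coef, j, t, s in ensemble:
        # Each stump contributes coef * (+sign if x[j] > threshold else -sign).
        lines.append("    score += %r * (x[%d] > %r ? %d : %d);"
                     % (coef, j, t, s, -s))
    lines.append("    return score >= 0.0 ? 1 : -1;")
    lines.append("}")
    return "\n".join(lines)

# Example with two illustrative stumps over an 8-feature record.
print(emit_c_scorer([(0.8, 3, 12.5, 1), (0.4, 0, 100.0, -1)], n_features=8))
```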

38 Paul Viola's face detector
Scan 50,000 location/scale boxes in each image, 15 images per sec., to detect a few faces.
The cascaded method minimizes average processing time (see the sketch below).
Training takes a day on a fast parallel machine.
(Diagram: each image box passes through Classifier 1, Classifier 2, Classifier 3, ...; a negative answer at any stage rejects the box as NON-FACE, and only boxes that pass every stage are labeled FACE.)
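A minimal sketch (mine) of the cascade idea in the diagram: cheap classifiers run first, and any stage that rejects a box stops further work on it, so the average cost per box stays small even though 50,000 boxes are scanned per image.

```python
def cascade_is_face(box_features, stages):
    """stages: list of (classifier, threshold) pairs ordered from cheapest to
    most expensive; each classifier maps the box's features to a score.
    A box is accepted as a face only if it passes every stage."""
    for classifier, threshold in stages:
        if classifier(box_features) < threshold:
            return False          # rejected early: most non-face boxes stop here
    return True

def detect_faces(boxes, stages):
    """Return the subset of candidate boxes accepted by the full cascade."""
    return [box for box in boxes if cascade_is_face(box, stages)]
```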

39 Summary
Generative vs. predictive methodology.
Boosting.
Alternating trees.
The database bottleneck.
Learning slowly, predicting fast.

40 Other work 1
Specialized data compression: when data is collected in small bins, most bins are empty; instead of storing the zeros, smart compression dramatically reduces data size.
Model averaging: boosting and bagging make classifiers more stable; we need theory that does not use Bayesian assumptions, which relates closely to margin-based analysis of boosting and of SVMs.
Zipf's law: the distribution of words in free text is extremely skewed; methods should scale exponentially in the entropy rather than linearly in the number of words.

41 Other work 2
Online methods: the data distribution changes with time; online refinement of the feature set; long-term learning.
Effective label collection: selective sampling to label only hard cases; comparing labels from different people to estimate reliability; co-training, where different channels train each other (Blum, Mitchell, McCallum).

42 Contact me!

