Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions

Presentation transcript:

Presenter: Prof. Longbin Cao
Wei Fan, Kun Zhang, and Xiaojing Yuan

What is the business problem, and what broad area does it belong to?
Problem: ozone pollution day detection
- Ground-level ozone formation is a sophisticated chemical and physical process that is "stochastic" in nature.
- Ozone levels above certain thresholds are harmful to human health and daily life.
- Two standards are used: 8-hour peak and 1-hour peak.
  - 8-hour average > 80 ppb (parts per billion)
  - 1-hour average > 120 ppb
- Ozone days occur on 5 to 15 days per year.
Broad area: environmental pollution detection and protection
Drawbacks of alternative approaches
- Simulation: consumes high computational power and is customized for a particular location, so solutions are not portable to other places.
- Physical model approach: it is hard to come up with good equations when there are many parameters, and the equations change from place to place.
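As a rough illustration of the two standards on this slide, the sketch below labels a day from hourly ozone readings. The slide only gives the limits (80 and 120, in parts per billion); reading "8-hour peak" as the largest rolling 8-hour mean, and the names `peak_rolling_mean` and `is_ozone_day`, are assumptions made for this example.

```python
# Limits as stated on the slide, in parts per billion (ppb).
EIGHT_HOUR_LIMIT_PPB = 80
ONE_HOUR_LIMIT_PPB = 120


def peak_rolling_mean(hourly_ppb, window):
    """Return the largest mean over any `window` consecutive hourly readings."""
    if len(hourly_ppb) < window:
        raise ValueError("need at least `window` readings")
    total = sum(hourly_ppb[:window])
    best = total
    for i in range(window, len(hourly_ppb)):
        # Slide the window forward by one hour.
        total += hourly_ppb[i] - hourly_ppb[i - window]
        best = max(best, total)
    return best / window


def is_ozone_day(hourly_ppb, standard="8-hour"):
    """Apply one of the two standards (assumed interpretation, see lead-in)."""
    if standard == "8-hour":
        return peak_rolling_mean(hourly_ppb, 8) > EIGHT_HOUR_LIMIT_PPB
    if standard == "1-hour":
        return peak_rolling_mean(hourly_ppb, 1) > ONE_HOUR_LIMIT_PPB
    raise ValueError("standard must be '8-hour' or '1-hour'")


# Hypothetical day: an 8-hour plateau at 90 ppb between calmer periods.
readings = [50] * 8 + [90] * 8 + [50] * 8
exceeds_8h = is_ozone_day(readings, "8-hour")
exceeds_1h = is_ozone_day(readings, "1-hour")
```

On this made-up day the 8-hour standard is exceeded while the 1-hour standard is not, which shows that the two standards can disagree.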

What are the research challenges that cannot be handled by the state of the art?
- The dataset is sparse, skewed, stochastic, biased, and streaming at the same time:
  - High dimensional
  - Very few positives
  - Under similar conditions, sometimes an ozone day happens and sometimes it doesn't
  - P(x) differs between training and testing
  - Training data come from the past; predictions are for the future
- The physical model is not well understood and cannot be customized easily from location to location

What is the main idea of your approach?
- Non-parametric models are easier to use when the "physical or generative mechanism" is unknown.
- Reliable "conditional probability" estimation under skewed, biased, high-dimensional data with possibly irrelevant features.
- Estimate a "decision threshold" to predict on the unknown distribution of the future.
- Random Decision Tree
  - Super-fast implementation
  - Formal analysis:
    - Bound analysis
    - MSE reduction
    - Bias and bias reduction
    - P(y|x) order correctness proof

A CV-based procedure for decision threshold selection
[Figure: the training set is passed to the algorithm under 10-fold cross-validation (10CV). The estimated probability values from fold 1, fold 2, ..., fold 10 are concatenated into a "Probability-TrueLabel" file whose rows pair P(y="ozoneday"|x,θ) with the true label (Normal or Ozone) for each day. A precision-recall plot (axes: Recall, Precision; labels Ma and Mb) built from that file gives the decision threshold V_E.]
Decision threshold when P(x) is different and P(y|x) is non-deterministic
[Figure: training distribution versus testing distribution.]
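The cross-validation procedure on this slide can be sketched roughly as follows: collect out-of-fold probability estimates, concatenate them with the true labels, compute precision and recall at each candidate threshold, and pick one. The slide does not say how the threshold is read off the precision-recall plot, so maximizing F1 is an assumption here. The interleaved fold assignment, the toy estimator `fit_binned`, and the synthetic data are likewise hypothetical stand-ins.

```python
def kfold_splits(n, k):
    """Yield (train_indices, test_indices); fold f tests on rows f, f+k, f+2k, ..."""
    for fold in range(k):
        test = list(range(fold, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def probability_label_pairs(X, y, fit, k=10):
    """Concatenate out-of-fold (estimated probability, true label) pairs."""
    pairs = []
    for train, test in kfold_splits(len(X), k):
        estimate = fit([X[i] for i in train], [y[i] for i in train])
        pairs.extend((estimate(X[i]), y[i]) for i in test)
    return pairs


def precision_recall_points(pairs):
    """(threshold, precision, recall) for 'predict positive when score >= threshold'."""
    total_pos = sum(label for _, label in pairs)
    points = []
    for t in sorted({score for score, _ in pairs}, reverse=True):
        tp = sum(1 for score, label in pairs if score >= t and label == 1)
        fp = sum(1 for score, label in pairs if score >= t and label == 0)
        precision = tp / (tp + fp)  # tp + fp >= 1 because t is an observed score
        recall = tp / total_pos if total_pos else 0.0
        points.append((t, precision, recall))
    return points


def select_threshold(points):
    """Assumed selection rule: the threshold with the highest F1."""
    def f1(point):
        _, p, r = point
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(points, key=f1)[0]


def fit_binned(X_train, y_train, bin_width=10):
    """Toy estimator: Laplace-smoothed positive rate per bin of one integer feature."""
    pos, tot = {}, {}
    for x, label in zip(X_train, y_train):
        b = x // bin_width
        tot[b] = tot.get(b, 0) + 1
        pos[b] = pos.get(b, 0) + label
    return lambda x: (pos.get(x // bin_width, 0) + 1) / (tot.get(x // bin_width, 0) + 2)


# Synthetic, skewed data: 11 positives out of 100 rows.
X = list(range(100))
y = [1 if (x >= 90 or x == 85) else 0 for x in X]
pairs = probability_label_pairs(X, y, fit_binned, k=10)
points = precision_recall_points(pairs)
threshold = select_threshold(points)
```

Any estimator that returns a probability could replace `fit_binned`. The selected threshold would then be applied to predictions on future days.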

Random Decision Tree
[Figure: an example random decision tree over features B1: {0,1}, B2: {0,1}, and B3: continuous. B1, B2, and B3 are each chosen randomly; the continuous feature B3 is split at random thresholds (0.3 and 0.6 in the example).]
RDT vs Random Forest
1. Original data vs. bootstrap
2. Random pick vs. random subset + info gain
3. Probability averaging vs. voting
4. RDT: super fast
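A minimal sketch of the ideas on this slide, not the authors' implementation: each tree picks its splitting feature at random with no information gain, continuous features get a random threshold, every tree sees the original data rather than a bootstrap sample, and predictions average the leaf probabilities across trees. Two details are assumptions: a binary feature is used at most once per path, and growth stops at a fixed depth. The dictionary-based tree layout and the synthetic data are also hypothetical.

```python
import random


def build_rdt(rows, labels, kinds, max_depth, rng, used=frozenset()):
    """Grow one random tree; kinds[j] is 'binary' or 'continuous'."""
    node = {"pos": sum(labels), "n": len(labels)}
    # Binary features are used at most once per path; continuous ones may repeat.
    candidates = [j for j, kind in enumerate(kinds)
                  if kind == "continuous" or j not in used]
    if max_depth == 0 or len(labels) < 2 or not candidates:
        return node
    j = rng.choice(candidates)  # random pick, no information gain
    if kinds[j] == "continuous":
        values = [row[j] for row in rows]
        t = rng.uniform(min(values), max(values))  # random threshold
    else:
        t = 0.5  # separates the values 0 and 1
        used = used | {j}
    left = [i for i, row in enumerate(rows) if row[j] <= t]
    right = [i for i, row in enumerate(rows) if row[j] > t]
    if not left or not right:
        return node  # degenerate split: keep this node as a leaf
    node["feature"], node["threshold"] = j, t
    node["left"] = build_rdt([rows[i] for i in left], [labels[i] for i in left],
                             kinds, max_depth - 1, rng, used)
    node["right"] = build_rdt([rows[i] for i in right], [labels[i] for i in right],
                              kinds, max_depth - 1, rng, used)
    return node


def tree_probability(node, x):
    """Positive-class rate of the leaf that x falls into."""
    while "feature" in node:
        side = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    return node["pos"] / node["n"]


def rdt_probability(trees, x):
    """Probability averaging across trees (rather than voting)."""
    return sum(tree_probability(tree, x) for tree in trees) / len(trees)


# Synthetic data: B1 and B2 are binary noise, B3 is continuous and drives the label.
rng = random.Random(0)
rows, labels = [], []
for b1 in (0, 1):
    for b2 in (0, 1):
        for k in range(10):
            rows.append((b1, b2, k / 10))
            labels.append(1 if k >= 8 else 0)
kinds = ["binary", "binary", "continuous"]
# Every tree is trained on the full original data, not a bootstrap sample.
trees = [build_rdt(rows, labels, kinds, 3, rng) for _ in range(30)]
p_low = rdt_probability(trees, (1, 0, 0.0))
p_high = rdt_probability(trees, (1, 0, 0.9))
```

In this toy setup the averaged estimate for the high-B3 point comes out above that for the low-B3 point, even though no single tree was optimized against the labels.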

Optimal Decision Boundary from Tony Liu’s thesis (supervised by Kai Ming Ting)

What is the main advantage of your approach, and how do you evaluate it?
- Fast and reliable
- Compared with state-of-the-art data mining algorithms:
  - Decision tree
  - NB
  - Logistic regression
  - SVM (linear and RBF kernel)
  - Boosted NB and decision tree
  - Bagging
  - Random Forest
- Compared with a physical equation-based model
- Evaluated in an actual streaming environment on a daily basis

What impact has been made, in particular in changing the real-world business?
- In 4-year studies on actual data, the proposed data mining approach consistently outperforms the physical model-based method.

Can your approach be widely expanded to other areas, and how easy would it be?
- Other known applications of the proposed approach:
  - Fraud detection
  - Manufacturing process control
  - Congestion prediction
  - Marketing
  - Social tagging
- The proposed method is general enough and doesn't need any tuning or re-configuration.