Weka solution for the 2004 KDD Cup Protein Homology Prediction task
Bernhard Pfahringer, Weka Group, University of Waikato, New Zealand

The problem
– Detect homologous protein sequences
– 153 training sequences × ~1000 candidate sequences each ==> ~150,000 pairs, each classified as match or not
– Very skewed: only 1296 matches (< 1%!)
– BUT: excellent attributes

The attributes

Algorithms doing well
2-fold cross-validation, looking only at predictive accuracy:
– Linear SVM (with a logistic model on its output for better probability estimates; Platt 1999)
– 10 AdaBoosted unpruned decision trees
– Random rules (similar to RandomForest; ECML 2004 Rule Learning Workshop)
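For concreteness, here is a minimal sketch (not the author's original code) of how the first two base learners could be set up with the Weka API; the ARFF file name is a placeholder, and any options not listed on the slide are left at Weka's defaults:

import weka.classifiers.functions.SMO;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaseLearners {
    public static void main(String[] args) throws Exception {
        // Placeholder file name; the KDD Cup data must first be converted to ARFF.
        Instances train = DataSource.read("protein_train.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // Linear SVM: SMO's default kernel is a degree-1 PolyKernel.
        // Fitting a logistic model to the SVM output yields calibrated
        // probability estimates (Platt 1999).
        SMO svm = new SMO();
        svm.setBuildLogisticModels(true);
        svm.buildClassifier(train);

        // 10 AdaBoosted unpruned J48 decision trees.
        J48 tree = new J48();
        tree.setUnpruned(true);
        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(tree);
        boost.setNumIterations(10);
        boost.buildClassifier(train);
    }
}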

Performance criteria
– Top1: fraction of blocks with a homologous sequence ranked first (maximize)
– RMSE: root mean squared error (minimize)
– RKL: average rank of the lowest ranked homologous sequence (minimize)
– APR: average of the average precision in each block (maximize)
Only RMSE depends on absolute values; for all other criteria a good ranking is sufficient.
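Since the slide only names the criteria, here is a hedged sketch of the two rank-based ones, Top1 and RKL; the per-block score and label arrays and the method names are assumptions, and the official evaluation software may break ties differently:

import java.util.Arrays;
import java.util.Comparator;

public class RankingMetrics {

    // Top1: fraction of blocks whose top-ranked sequence is a true homolog.
    static double top1(double[][] scores, boolean[][] isHomolog) {
        int hits = 0;
        for (int b = 0; b < scores.length; b++) {
            Integer[] order = rankByScore(scores[b]);
            if (isHomolog[b][order[0]]) hits++;
        }
        return (double) hits / scores.length;
    }

    // RKL: average over blocks of the (1-based) rank of the lowest
    // ranked true homolog.
    static double rkl(double[][] scores, boolean[][] isHomolog) {
        double sum = 0;
        for (int b = 0; b < scores.length; b++) {
            Integer[] order = rankByScore(scores[b]);
            int worst = 0;
            for (int r = 0; r < order.length; r++)
                if (isHomolog[b][order[r]]) worst = r + 1;
            sum += worst;
        }
        return sum / scores.length;
    }

    // Indices of one block's sequences, sorted by descending score.
    static Integer[] rankByScore(double[] s) {
        Integer[] idx = new Integer[s.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> s[i]).reversed());
        return idx;
    }
}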

Unique Solution
Voted ensemble of three classifiers:
– Linear SVM + logistic model on output
– 10 AdaBoosted unpruned J48 trees
– 10^5 random rules
Non-standard voting:
– If SVM and RandomRules agree ==> average their probabilities
– ELSE use the booster as tie-breaker
Lucky (first on Proteins, 18th on Physics)
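The voting rule fits in a few lines; a sketch, assuming "agree" means both models predict the same class at the 0.5 probability threshold (the slide does not define agreement precisely):

public class NonStandardVote {
    // Each argument is one model's estimated probability of the "match" class.
    static double vote(double pSvm, double pRandomRules, double pBoost) {
        boolean svmSaysMatch = pSvm >= 0.5;
        boolean rrSaysMatch = pRandomRules >= 0.5;
        if (svmSaysMatch == rrSaysMatch) {
            // SVM and RandomRules agree ==> average their probabilities.
            return (pSvm + pRandomRules) / 2.0;
        }
        // Otherwise the booster acts as tie-breaker.
        return pBoost;
    }
}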

Ensemble performance
[Table: Top1, RMSE, RKL and APR for Boost, SMOlin, RR 10^5 and the Voted ensemble; the numeric values did not survive transcription.]

Attribute ranks
RR:   A53  A63  A55  A58  A3   A34  A54
DS:   A53  A55  A58  A59  A60  A54  A3
SVM:  A53  A58  A3   A60  A59  A8   A45

What I should have done
– Optimize separately for each performance criterion
– Bagging for better probability estimates (see the sketch below)
– More data engineering (e.g. PCA, …)
– View it as an outlier detection problem
– Utilize block structure?
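The bagging idea is straightforward in Weka; a hypothetical sketch (the file name and the choice of 100 iterations are mine, not the slide's):

import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggedProbabilities {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("protein_train.arff");  // placeholder name
        train.setClassIndex(train.numAttributes() - 1);

        // Averaging class distributions over many bagged unpruned trees
        // smooths the probability estimates, which is what RMSE rewards.
        J48 tree = new J48();
        tree.setUnpruned(true);
        Bagging bagger = new Bagging();
        bagger.setClassifier(tree);
        bagger.setNumIterations(100);
        bagger.buildClassifier(train);
    }
}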

(Standard) Lessons
– Data engineering (good attributes) is essential
– Ensembles are more robust
– Weka is not just an educational tool
  – [at least some parts scale well]
– Java / open-source DM tools are competitive
– But: Weka could still be improved considerably (volunteers and/or sponsors, get in touch :-)

Finally
A big “THANK YOU” to the organizers of the KDD Cup 2004!