Feature Selection Benjamin Biesinger - Manuel Maly - Patrick Zwickl.

Presentation transcript:

Feature Selection Benjamin Biesinger - Manuel Maly - Patrick Zwickl

Agenda
Introduction: What is feature selection? What is our contribution?
Phases: What is the sequence of actions in our solution?
Solution: How does it work in particular?
Results: What is returned?
Analysis: What to do with it? What can we conclude from it?

Introduction
Not all features of a data set are useful for classification.
A large number of attributes negatively influences the computation time.
Only the most essential features should be used for classification.
Feature selection is one approach to identifying them.
Different search strategies and evaluations are available, but which is the best?
Automatic feature selection: several algorithms are run, compared, and analyzed for trends (implemented by us).

Phases
Phases: (I) meta-classification, (II) classification.
Before: file loading and preparation.
Afterwards: comparison and output generation.

Solution
Java command-line application utilizing the WEKA toolkit.
Command-line arguments: filename (of the dataset), classifier algorithm name, split (percentage dividing the data between feature selection and classification).
Example: "winequality-red.csv M5Rules 20"
Results are computed and printed to the console's standard output.
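The three arguments could be validated along the following lines. This is only a minimal sketch: the class name `CliArgs`, its field names, and the rule that the split must lie strictly between 0 and 100 are assumptions made for illustration, not details taken from the authors' application.

```java
/** Hypothetical holder for the three command-line arguments named on the slide. */
final class CliArgs {
    final String datasetFile;     // e.g. "winequality-red.csv"
    final String classifierName;  // e.g. "M5Rules"
    final double splitPercent;    // e.g. 20

    private CliArgs(String datasetFile, String classifierName, double splitPercent) {
        this.datasetFile = datasetFile;
        this.classifierName = classifierName;
        this.splitPercent = splitPercent;
    }

    /** Parses exactly three arguments: dataset file, classifier name, split percentage. */
    static CliArgs parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                    "usage: <dataset file> <classifier name> <split percentage>");
        }
        double percent;
        try {
            percent = Double.parseDouble(args[2]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("split must be a number: " + args[2], e);
        }
        // Assumption: a split of 0 or 100 would leave one of the two parts empty.
        if (percent <= 0 || percent >= 100) {
            throw new IllegalArgumentException("split must be between 0 and 100 (exclusive)");
        }
        return new CliArgs(args[0], args[1], percent);
    }
}
```

With the example from the slide, `CliArgs.parse(new String[] {"winequality-red.csv", "M5Rules", "20"})` would yield a split of 20.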

Solution (Flow 1)
1. Parse the dataset and create a WEKA-specific "Instances" object.
2. Split the Instances object into two parts, depending on the percentage entered by the user.
3. Combine all evaluation and search algorithms given in the properties files, apply them to the first Instances object, and store the results in dedicated objects (SData).
4. Classify all combinations from step 3 with the classifier entered by the user on the second Instances object, again storing the results in SData objects.
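Steps 2 and 3 can be sketched without WEKA using plain Java collections. The class and method names below are hypothetical, and the sketch assumes that the percentage entered by the user describes the size of the first part (the one used for feature selection); the slides do not say which part it refers to.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating the split (step 2) and the pairing of algorithms (step 3). */
final class FlowSketch {

    /** Index at which the data is cut; rows [0, cut) form the first part (assumed: feature selection). */
    static int cutIndex(int size, double splitPercent) {
        return (int) Math.round(size * splitPercent / 100.0);
    }

    /** Splits the rows into two consecutive parts at the computed cut index. */
    static <T> List<List<T>> split(List<T> rows, double splitPercent) {
        int cut = cutIndex(rows.size(), splitPercent);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(rows.subList(0, cut)));
        parts.add(new ArrayList<>(rows.subList(cut, rows.size())));
        return parts;
    }

    /**
     * Pairs every evaluation algorithm with every search algorithm.
     * A real implementation would additionally have to skip incompatible pairs.
     */
    static List<String> combinations(List<String> evaluators, List<String> searches) {
        List<String> pairs = new ArrayList<>();
        for (String evaluator : evaluators) {
            for (String search : searches) {
                pairs.add(evaluator + " / " + search);
            }
        }
        return pairs;
    }
}
```

In the actual application the rows would be WEKA instances and the names would be read from the properties files; here plain strings stand in for both.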

Solution (Flow 2)
5. Gain aggregate information on all results by iterating over the SData objects.
6. Print the trend analysis and information on the combined evaluation and search algorithms, plus the corresponding classification results (time and mean absolute error).
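The aggregation in step 5 can be pictured as a single pass over the stored results. The sketch below is an assumption-laden stand-in: `RunResult` is a hypothetical substitute for the SData objects, whose actual fields the slides do not show, and the runtime unit is left unspecified.

```java
import java.util.List;

/** Hypothetical stand-in for one stored result (the slides call these objects SData). */
final class RunResult {
    final String technique;     // e.g. "ConsistencySubsetEval / GreedyStepwise"
    final long runtime;         // unit as reported by the application (assumed comparable)
    final double meanAbsError;  // mean absolute error

    RunResult(String technique, long runtime, double meanAbsError) {
        this.technique = technique;
        this.runtime = runtime;
        this.meanAbsError = meanAbsError;
    }
}

final class Aggregation {

    /** Returns the result with the smallest mean absolute error (first one wins on ties). */
    static RunResult lowestError(List<RunResult> results) {
        requireNonEmpty(results);
        RunResult best = results.get(0);
        for (RunResult r : results) {
            if (r.meanAbsError < best.meanAbsError) {
                best = r;
            }
        }
        return best;
    }

    /** Returns the result with the smallest runtime (first one wins on ties). */
    static RunResult lowestRuntime(List<RunResult> results) {
        requireNonEmpty(results);
        RunResult best = results.get(0);
        for (RunResult r : results) {
            if (r.runtime < best.runtime) {
                best = r;
            }
        }
        return best;
    }

    private static void requireNonEmpty(List<RunResult> results) {
        if (results.isEmpty()) {
            throw new IllegalArgumentException("no results to aggregate");
        }
    }
}
```

Picking the minimum per criterion mirrors the "lowest error rate" and "lowest runtime" summaries shown later in the deck.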

Solution (Output of selected features)
Attribute: bottom-right-square has Count: 8
…
=============== Evaluation: ConsistencySubsetEval ===============
--- Search: GreedyStepwise ---
# of selected features: 1, selection time: 34, classification time: 36, mean abs. error:47,07%
# of selected features: 2, selection time: 35, classification time: 34, mean abs. error:43,16%
…
--- Search: RandomSearch ---
Automatic feature number (no influence by user): 5, selection time: 74, classification time: 118, mean abs. error:44,46%

Results
Tested on 3 different datasets: Tic Tac Toe, Wine Quality (red), Balance Scale.
2 comparisons were made per dataset: for each feature selection technique individually, and between different feature selection techniques.
Is there a trend in which features are selected by most techniques?

1st Comparison
Influence of the number of selected features on runtime and on classification accuracy (measured in MAE).
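For reference, the mean absolute error in its standard form is the average of the absolute differences between actual and predicted values. The sketch below implements that textbook definition; the class name is hypothetical, and how the percentages printed by the application were derived from it is not stated in the slides.

```java
/** Textbook mean absolute error: the average of |actual - predicted| over all instances. */
final class Mae {
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }
}
```

For actual values 3, 5, 2 and predictions 2, 5, 4 the absolute errors are 1, 0, and 2, so the result is 1.0.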

1st Comparison Result
Only search algorithms that implement the RankedOutputSearch interface were used, since they allow the number of features to select to be controlled.
The number of selected features and the MAE are directly proportional to each other; both are inversely proportional to runtime.

2nd Comparison
A feature selection technique consists of a search algorithm and an evaluation algorithm.
Not all combinations are possible!
Different feature selection techniques were compared to each other concerning runtime and performance (measured in MAE).

2nd Comparison Result
Different techniques select different numbers of attributes and, to some extent, different attributes too.
Some techniques are slower than others.
Runtime differences between search algorithms are huge.
Some techniques select too few attributes to give acceptable results.

Trend
In all tested datasets there was a trend in which features were selected.
A higher selection count implies a bigger influence on the output.
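The selection count behind this trend can be computed by tallying, for every feature, how many techniques selected it, and then ranking the features by that tally. The following is a minimal sketch with hypothetical class and method names; ties are broken alphabetically, which is an arbitrary choice made here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that counts how often each feature was selected across techniques. */
final class TrendCount {

    /** Each inner list holds the feature names selected by one technique. */
    static Map<String, Integer> countSelections(List<List<String>> selections) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> selected : selections) {
            for (String feature : selected) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Feature names ordered by descending count; ties are broken alphabetically. */
    static List<String> ranked(Map<String, Integer> counts) {
        List<String> names = new ArrayList<>(counts.keySet());
        names.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return names;
    }
}
```

The first entries of `ranked(...)` correspond to the features reported as first, second, and third in the trend summary.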

Analysis
Different feature selection techniques have different characteristics.
ClassifierSubsetEval / RaceSearch gives very good classification results.
Fewer attributes mean faster classification.
Algorithms that select fewer features are faster, e.g. GeneticSearch.

Lowest error rate
Columns: Dataset; Feature Selection Technique; Runtime; Mean absolute error
Tic Tac Toe; ClassifierSubsetEval / RaceSearch; 64215,25
Wine Quality (red); ClassifierSubsetEval / RaceSearch; ,8
Balance Scale; many; 9-3421,96

Lowest runtime
Columns: Dataset; Feature Selection Technique; Runtime; Mean absolute error
Tic Tac Toe; x / RankSearch; 1750,85
Wine Quality (red); WrapperSubsetEval / GeneticSearch; ,57
Balance Scale; many; 5-34-

Trend
Columns: Dataset; First; Second; Third
Tic Tac Toe; Top-left-square; Top-right-square; Top-middle-square
Wine Quality (red); Volatile acidity; Fixed acidity; Chlorides
Balance Scale; Right-weight; Right-distance; Left-distance

Feature Selection Benjamin Biesinger - Manuel Maly - Patrick Zwickl
Any questions? The essential features ;) Huh? Anything missed? Thanks