A Combinatorial Fusion Method for Feature Mining
Ye Tian, Gary Weiss, D. Frank Hsu, Qiang Ma
Fordham University
Presented by Gary Weiss

2 Introduction

Feature construction/engineering is often a critical step in the data mining process
– It can be very time-consuming and may require a lot of manual effort

Our approach is to use a combinatorial method to automatically construct new features
– We refer to this as "feature fusion"
– It is geared toward helping to predict rare classes
– For now it is restricted to numerical features, but it can be extended to other feature types

3 How does this relate to MMIS?

One MMIS category is local pattern analysis
– How to efficiently identify quality knowledge from a single data source
– It lists data preparation and selection as subtopics, and also mentions fusion

We acknowledge that this work is probably not what most people think of as MMIS

4 How can we view this work as MMIS?

Think of each feature as a piece of information
– Our fusion approach integrates these pieces

Fusion itself is a proper topic for MMIS, since it can also be used with multiple information sources
– The fusion method we employ does not really care whether the information (i.e., the features) comes from a single source

As the complexity of the constructed features increases, each can be viewed as a classifier
– Each fused feature is an information source
– This view is bolstered by other work on data fusion that uses ensembles to combine the fused features

5 Description of the Method

1. A data set is a collection of records where each feature has a score
   – We assume numerical features
2. We then replace scores by ranks
   – The ordering of ranks is determined by whether larger or smaller scores better predict the class
3. Compute the performance of each feature
4. Compute the performance of feature combinations
5. Decide which combinations to evaluate/use

6 Step 1: A data set

[Table of eight records A–H, each with a score for features F1–F5 and a binary Class label (1 = minority class). The score values did not survive transcription; the ranks derived from them appear on the next slide.]

7 Step 2: Scores replaced by Ranks

     F1  F2  F3  F4  F5
A     1   2   2   2   5
B     2   1   4   4   3
C     3   3   1   5   4
D     4   4   7   3   1
E     6   6   8   6   7
F     7   8   3   8   6
G     5   5   6   1   8
H     8   7   5   7   2
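As an illustration of step 2, here is a minimal Python sketch of the score-to-rank conversion (my own sketch, not code from the paper; the scores below are invented, chosen only so that they reproduce the F1 rank column above):

    import pandas as pd

    # Invented scores for one feature over records A-H (slide 6's actual
    # score values were not preserved). Here larger scores are assumed to
    # better predict the minority class.
    scores = pd.Series([9.1, 7.4, 6.8, 5.5, 2.0, 1.1, 3.9, 0.6],
                       index=list("ABCDEFGH"))

    def scores_to_ranks(scores, higher_is_better=True):
        # Step 2: replace scores by ranks, with rank 1 the most predictive.
        # Whether larger or smaller scores are better is set per feature.
        return scores.rank(ascending=not higher_is_better,
                           method="first").astype(int)

    print(scores_to_ranks(scores).to_dict())
    # {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 6, 'F': 7, 'G': 5, 'H': 8}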

8 Step 3: Compute Feature Performance

Performance measures how well a feature predicts the minority class.
We sort the rows by the feature's rank and measure performance on the top n% of rows, where n% of the records belong to the minority class.
In this case we evaluate the top 3 rows. Since 2 of the 3 are minority (Class = 1), performance = 2/3 ≈ 0.67.

     F2 rank  Class
B       1       0
A       2       1
C       3       1
D       4       0
G       5       1
E       6       0
H       7       0
F       8       0
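The same computation in a short Python sketch, using the F2 ranks and class labels above (the function name and pandas representation are mine, assuming this reading of the metric):

    import pandas as pd

    f2_rank = pd.Series([2, 1, 3, 4, 6, 8, 5, 7], index=list("ABCDEFGH"))
    labels = pd.Series([1, 0, 1, 0, 0, 0, 1, 0], index=list("ABCDEFGH"))

    def performance(rank, labels, minority=1):
        # Sort records best-rank-first and take the top n, where n is the
        # number of minority-class records (here n = 3); performance is the
        # fraction of those top-n records that are in the minority class.
        n = int((labels == minority).sum())
        top = rank.sort_values().index[:n]
        return float((labels.loc[top] == minority).mean())

    print(round(performance(f2_rank, labels), 2))  # 0.67 (2 of the top 3)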

9 Step 3 continued

Feature   Performance
F1           0.67
F2           0.67
F3           0.67
F4           0.67
F5           0.00

10 Step 4: Compute Performance of Feature Combinations

Let F6 be the fusion of F1, F2, F3, F4, and F5.
The rank combination function is the average of the ranks.
Compute the rank of F6 for each record, then compute the performance of F6 as in step 3.

     F6 rank
A      2.4
B      2.8
C      3.2
D      3.8
E      6.6
F      6.4
G      5.0
H      5.8
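In code, this rank-combination step is a row-wise mean over the component rank columns; a sketch using the rank table from slide 7 (again my own illustration, not the authors' code):

    import pandas as pd

    ranks = pd.DataFrame(
        {"F1": [1, 2, 3, 4, 6, 7, 5, 8],
         "F2": [2, 1, 3, 4, 6, 8, 5, 7],
         "F3": [2, 4, 1, 7, 8, 3, 6, 5],
         "F4": [2, 4, 5, 3, 6, 8, 1, 7],
         "F5": [5, 3, 4, 1, 7, 6, 8, 2]},
        index=list("ABCDEFGH"))

    # F6 fuses all five features: its rank for each record is the average
    # of the component ranks, e.g. A = (1 + 2 + 2 + 2 + 5) / 5 = 2.4.
    f6 = ranks.mean(axis=1)
    print(f6.round(1).to_dict())
    # {'A': 2.4, 'B': 2.8, 'C': 3.2, 'D': 3.8,
    #  'E': 6.6, 'F': 6.4, 'G': 5.0, 'H': 5.8}

F6 can then be scored against the class labels with the same performance function used in step 3.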

11 Step 5: What Combinations to Use?

Given n features there are 2^n − 1 possible combinations:
C(n,1) + C(n,2) + … + C(n,n)
– This "fully exhaustive" fusion strategy is practical for many values of n

We try other strategies for the cases where it is not feasible:
– The k-exhaustive strategy selects the k best features and tries all combinations of them
– The k-fusion strategy uses all n features but fuses at most k features at once
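The candidate-generation strategies are easy to express with itertools; a sketch (the strategy names follow the slide, while the function names and the `best` scoring callback are mine):

    from itertools import combinations

    features = ["F1", "F2", "F3", "F4", "F5"]

    def k_fusion(feats, k):
        # k-fusion: use all n features, but fuse at most k at a time.
        for r in range(1, k + 1):
            yield from combinations(feats, r)

    def k_exhaustive(feats, k, best):
        # k-exhaustive: keep only the k best features according to the
        # performance metric, then try all 2^k - 1 combinations of them.
        top_k = sorted(feats, key=best, reverse=True)[:k]
        return k_fusion(top_k, k)

    print(sum(1 for _ in k_fusion(features, len(features))))  # 31 = 2^5 - 1
    print(sum(1 for _ in k_fusion(features, 2)))  # 15 = C(5,1) + C(5,2)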

12 Combinatorial Fusion Table

[Table giving the number of fused features generated by the k-fusion strategy, per number of features, for several values of k. The values did not survive transcription.]

13 Combinatorial Fusion Algorithm

The combinatorial strategy generates candidate features
– The performance metric determines which are best
– It is used to determine which k features to use for k-fusion
– It is also used to determine the order in which features are added

We add a feature only if it leads to a statistically significant improvement (p ≤ 0.10)
– As measured on validation data
– This limits the number of features, but requires a lot of computation
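The selection loop might be organized along the following lines; a rough sketch under stated assumptions (auc_of and p_value_of_gain are hypothetical callbacks standing in for the validation-set AUC computation and the significance test, whose details are not given on the slides):

    def select_fused_features(candidates, auc_of, p_value_of_gain, alpha=0.10):
        # candidates: fused features ordered best-first by the performance
        # metric. A candidate is kept only if adding it yields a
        # statistically significant AUC improvement on validation data.
        selected = []
        for cand in candidates:
            improves = auc_of(selected + [cand]) > auc_of(selected)
            if improves and p_value_of_gain(selected, cand) <= alpha:
                selected.append(cand)
        return selected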

14 Example Run of Algorithm

[Table showing an example run: the base feature set {F1, F2, F3, F4, F5}, followed by the candidate fused features (e.g., F1F3) considered at each step, with validation AUC, test AUC, and p-value for each; a "+" marks candidates that were added. The numeric values did not survive transcription.]

15 Description of Experiments

We use Weka's DT, 1-NN, and Naïve Bayes methods.
Analyze performance on 10 data sets
– With and without fused features
Focus on AUC as the main metric
– More appropriate than accuracy, especially with skewed data
Use 3 combinatorial fusion strategies
– 2-fusion, 3-fusion, and 6-exhaustive
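For a rough sense of the evaluation protocol, here is a scikit-learn analogue (the paper used Weka; the synthetic data set, classifier settings, and cross-validation setup below are my assumptions, not details from the slides):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for one skewed data set (about 8% minority class).
    X, y = make_classification(n_samples=1000, weights=[0.92], random_state=0)

    for name, clf in [("DT", DecisionTreeClassifier(random_state=0)),
                      ("1-NN", KNeighborsClassifier(n_neighbors=1)),
                      ("Naive Bayes", GaussianNB())]:
        auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
        print(f"{name}: AUC = {auc:.3f}")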

16 Results Summary

[Two summary tables: results over all 10 data sets, and results over the 4 most skewed data sets (< 10% minority). The values did not survive transcription.]

17 Discussion of Results

No one of the 3 fusion schemes is clearly best.
The methods seem to help, but the biggest improvement is clearly with the DT method
– May be explained by traditional DT methods having limited expressive power: they consider only one feature at a time, so they can never perfectly learn a simple concept like F1 + F2 > 10, but they can with feature fusion
– The improvement is bigger for highly skewed data sets: identifying rare cases is difficult and may require looking at many features in parallel

18 Future Work

More comprehensive experiments
– More data sets, more skewed data sets, more combinatorial fusion strategies
Use heuristics to more intelligently choose fused features
– The performance measure is currently used only for ordering
– Use diversity measures
– Avoid building a classifier to determine which fused features to add
Handle non-numerical features

19 Conclusion

Showed how a method from information fusion can be applied to feature construction.
Results are encouraging, but more study is needed.
Extending the method should lead to further improvements.

20 Questions?

21 Detailed Results: Accuracy

22 Detailed Results: AUC