Classification Lecture 12. Topics Classification Frame Terminology and measures Using Classifications –In system use –In system development Creating Classifications.

Presentation transcript:

Classification Lecture 12

Topics Classification Frame Terminology and measures Using Classifications –In system use –In system development Creating Classifications –Card sorting

Classification Frame
Classification separates candidates into two or more classes
–classifying students by grade of degree
We will look at the simple case of two classes first:
–filtering email : Good or Spam
–retrieving documents : Relevant or Irrelevant
–classifying credit card transactions : Valid or Fraudulent
–detecting spelling mistakes : OK or mistake (red line)
–medical testing : normal or abnormal
–systems requirements : ambiguous or not ambiguous
METAPHOR : SYSTEM IS A SIEVE

Classification Errors (Information Retrieval)
                 Relevant                             Irrelevant
Retrieved        true positive (TP)                   false positive (FP, Type I error)
Not retrieved    false negative (FN, Type II error)   true negative (TN)
Precision = TP / (TP + FP) = TP / Retrieved
Recall = TP / (TP + FN) = TP / Relevant
Efficiency = (TP + TN) / (TP + TN + FP + FN) = (TP + TN) / Full Collection
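
The three measures on this slide can be computed directly from the four counts. The sketch below is only an illustration: the function name and the sample counts are assumptions, not part of the lecture, and "efficiency" is used in the slide's sense of (TP + TN) / full collection.

```python
def classification_measures(tp, fp, fn, tn):
    """Return (precision, recall, efficiency) from the four confusion-matrix counts."""
    retrieved = tp + fp          # everything the system selected
    relevant = tp + fn           # everything it should have selected
    total = tp + fp + fn + tn    # the full collection
    # Guard against empty denominators (e.g. nothing retrieved at all)
    precision = tp / retrieved if retrieved else 0.0
    recall = tp / relevant if relevant else 0.0
    efficiency = (tp + tn) / total if total else 0.0
    return precision, recall, efficiency

# Hypothetical counts, chosen only for illustration
p, r, e = classification_measures(tp=20, fp=5, fn=10, tn=65)
print(f"precision={p:.2f} recall={r:.2f} efficiency={e:.2f}")
# precision=0.80 recall=0.67 efficiency=0.85
```

Returning 0.0 for an empty denominator is a design choice made here for convenience; other conventions treat those cases as undefined.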

Example Calculation : filtering
            Good      Spam
reject      4 (FN)    6 (TN)
accept      3 (TP)    5 (FP)
Precision = TP / (TP + FP) = 3/8
Recall = TP / (TP + FN) = 3/7
Efficiency = (TP + TN) / (TP + TN + FP + FN) = 9/18 = 50%
Recall > Precision => not quite balanced

Trade-off
The two errors are usually in conflict
–we can decrease the risk of a False Positive (reject more spam)
–but we increase the risk of False Negatives (rejecting good email)
a TRADE-OFF
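
The trade-off can be seen by moving a single acceptance threshold. This is a minimal sketch under stated assumptions: the spam scores and labels are invented, and "positive" means "accepted as good", which is the convention the slide implies (a false positive is spam that gets through, a false negative is good email that is rejected).

```python
# (spam_score, is_good) pairs: hypothetical hand-labelled messages
messages = [(0.1, True), (0.2, True), (0.3, False), (0.4, True),
            (0.6, True), (0.7, False), (0.8, False), (0.9, False)]

def errors_at(threshold):
    """Accept a message when its spam score is below the threshold.

    Returns (false_positives, false_negatives), where a false positive is
    spam that was accepted and a false negative is good email that was rejected.
    """
    fp = sum(1 for score, good in messages if score < threshold and not good)
    fn = sum(1 for score, good in messages if score >= threshold and good)
    return fp, fn

for t in (0.25, 0.5, 0.75):
    fp, fn = errors_at(t)
    print(f"threshold={t}: false positives={fp}, false negatives={fn}")
# threshold=0.25: false positives=0, false negatives=2
# threshold=0.5: false positives=1, false negatives=1
# threshold=0.75: false positives=2, false negatives=0
```

A stricter filter (lower threshold) lets no spam through but rejects good email; a looser one does the opposite.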

Classification Errors
         Good student    Poor student
Pass
Fail
Write in the terms – relevant, retrieved, true positive, false positive etc.

Improved Precision
Precision = TP / (TP + FP) = TP / Retrieved
Recall = TP / (TP + FN) = TP / Relevant
[Diagram: 'relevant' and 'retrieved' sets, with regions labelled TP - True Positives, TN - True Negatives, FN - False Negatives, FP - False Positives]

Precision and Recall
Precision = TP / (TP + FP) = TP / Retrieved
Recall = TP / (TP + FN) = TP / Relevant
Efficiency = (TP + TN) / (TP + TN + FP + FN) = (TP + TN) / Full Collection
[Diagram: 'relevant' and 'retrieved' sets within the full collection, with regions labelled TP - True Positives, TN - True Negatives, FN - False Negatives, FP - False Positives]

Improved Recall
Precision = TP / (TP + FP) = TP / Retrieved
Recall = TP / (TP + FN) = TP / Relevant
[Diagram: 'relevant' and 'retrieved' sets, with regions labelled TP - True Positives, TN - True Negatives, FN - False Negatives, FP - False Positives]

Exercise: Precision and Recall in Assessment
Precision means ……
Recall means ……
Ideal values (as %)
–Precision =
–Recall =
–Efficiency =
Estimated values
–Precision =
–Recall =
–Efficiency =

Classification in the News
Criminal Justice as a Classifier
–Murder, Manslaughter or Innocent
What counts as ‘torture’?
Prisoners of war – US invents a new category for the Guantanamo Bay prisoners
Blood groups:
–A, B, AB, O
–Rh+, Rh-
Classification of Cloud types (Cumulus, Cirrus…) by Luke Howard 1802
Hip evaluation to determine priority for replacement
Programme classification – where does ‘Information Systems’ go?

Categories are Information Structures
Many systems require the user to classify things in the real world into categories in order to process them:
–Files and documents into a hierarchical directory structure
–Subject matter in a dissertation into sections
–Facilities in the University (helpdesk, reception, ...)
–Skills in a Placements system
–Budget headings, Nominal Ledger headings
In the computer system, categories can be clearly distinguished:
–Codes for each category
In the real world:
–categories don’t exist - the fallacy of misplaced concreteness
–multiple taxonomies are valid – classifying the same things in different ways for different purposes
Users typically have the tasks of
–mapping the real, complex things into the appropriate categories
–interpreting categorical information
Implications
–Users face a ‘matching’ problem – which category does the item fit best?
–IS designers have to devise support for these tasks as well.
–Users will not be consistent in their classification (e.g. IS books in Library)

Categories in IS theory
Much of IS theory is based on a taxonomy:
–Problem / Solution
–Method / methodology / technique ..
–ER model
–Data Flow Diagram
–Soft Systems Analysis - CATWOE
–Logical / Physical
–SWOT analysis : Strengths / Weaknesses / Opportunities / Threats
–Objective, Goal, Requirement, Constraint

Classification and Systems Design
Steps in Classification
–defining the domain (what kinds of things are to be classified)
–creating the taxonomy (the set of categories), its purpose and force
–defining the representation of individuals
–defining the mapping between individuals and categories
–coding the categories
–creating automatic classifiers
–assisting human classifiers
–assisting users to interpret categorical information
–evaluating classification performance
–supporting evolution of taxonomy and classifiers
“An early step towards understanding any set of phenomena is to learn what kinds of things there are in the set – to develop a taxonomy” Herbert Simon

A Poor Classification?
The Argentinean writer Jorge Luis Borges (‘Imaginary Beasts’, ‘Labyrinths’, ..) quotes a ‘certain Chinese encyclopaedia’ in which animals are divided into:
A) belonging to the Emperor
B) embalmed
C) tame
D) suckling pigs
E) sirens
F) fabulous
G) stray dogs
H) included in the present classification
I) frenzied
J) innumerable
K) drawn with a very fine camel hair brush
L) et cetera
M) having just broken the water pitcher
N) that from a long way off look like flies

[Diagram labels: A B C, Classifier (Machine / Human), Categories/Classes, Taxonomy]

Categories not Mutually Exclusive : an object can be put in any of several categories

Categories not Complete : some objects don’t belong anywhere

Categories not Balanced : some categories are much larger than others

Categories Inconsistent : categories lack a single organising principle

Characteristics of a good Taxonomy
Categories must be:
–Mutually exclusive : every object in at most one category
–Complete (exhaustive) : every object in at least one category
–Balanced : categories divide objects evenly
–Consistent : same characteristics used throughout
–Hierarchical integrity : categories at one level not confused with categories at another level
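
The first three characteristics can be checked mechanically once objects have been assigned to categories. The sketch below is an illustration only: the function name, the data layout (a dict from category name to a set of objects) and the library example are assumptions. Consistency and hierarchical integrity depend on what the categories mean, so they are not covered.

```python
from collections import Counter

def check_taxonomy(objects, categories):
    """Report objects that break mutual exclusivity or completeness, plus category sizes.

    `categories` maps a category name to the set of objects placed in it.
    """
    # Count how many categories each object appears in
    membership = Counter(obj for members in categories.values() for obj in members)
    overlapping = sorted(o for o in objects if membership[o] > 1)   # not mutually exclusive
    unplaced = sorted(o for o in objects if membership[o] == 0)     # not complete
    sizes = {name: len(members) for name, members in categories.items()}  # balance
    return overlapping, unplaced, sizes

# Hypothetical library example
books = ["databases", "hci", "networks", "ethics"]
shelves = {
    "Computing": {"databases", "hci", "networks"},
    "Psychology": {"hci"},
}
overlapping, unplaced, sizes = check_taxonomy(books, shelves)
print(overlapping)  # ['hci']
print(unplaced)     # ['ethics']
print(sizes)        # {'Computing': 3, 'Psychology': 1}
```

Here 'hci' sits in two categories, 'ethics' has no home, and the category sizes are uneven.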

Kinds of classification
Classical
–Classes defined by presence of features
 Square : 4 sides, equal length, equal angles
 Equilateral triangle : 3 sides, equal length, equal angles
Probabilistic
–Classes defined by weighted sum of features
 ‘bird’ : moves, winged, feathered, sings, lays eggs
 Is a robin a bird? Is an emu a bird?
Exemplar (prototype)
–Classes defined by one or more key examples
 Robin is a central example of ‘bird’
 Chicken is a more remote example
Which kind is used in IS Theory?
Which kind is used in IS Use?
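
The difference between the classical and the probabilistic kind can be sketched in a few lines. Everything specific here is an assumption made for illustration: the feature names, the integer weights and the threshold are invented, not taken from the lecture.

```python
def is_square(shape):
    """Classical: every defining feature must be present."""
    return shape["sides"] == 4 and shape["equal_sides"] and shape["equal_angles"]

# Hypothetical weights for 'bird'; the values and the threshold are assumptions
BIRD_WEIGHTS = {"moves": 1, "winged": 3, "feathered": 3, "sings": 1, "lays_eggs": 2}

def bird_score(features):
    """Probabilistic: weighted sum of the features that are present."""
    return sum(weight for name, weight in BIRD_WEIGHTS.items() if name in features)

def is_bird(features, threshold=6):
    return bird_score(features) >= threshold

robin = {"moves", "winged", "feathered", "sings", "lays_eggs"}
emu = {"moves", "winged", "feathered", "lays_eggs"}   # does not sing
bat = {"moves", "winged"}

print(is_square({"sides": 4, "equal_sides": True, "equal_angles": True}))  # True
print(bird_score(robin), bird_score(emu), bird_score(bat))                 # 10 9 4
print(is_bird(robin), is_bird(emu), is_bird(bat))                          # True True False
```

Under these weights both the robin and the emu count as birds, but the robin scores higher, which is one way to express that it is the more central member of the class.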

Automated Clustering
Clustering techniques find groups of similar objects
Used in data mining to identify customer groups with similar buying behaviour…
Mathematical Techniques
–k-nearest neighbour
–ID3 to create decision tree
Human Techniques
–Card sorting
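
As an illustration of the first technique listed, k-nearest neighbour assigns a new object the majority label among the k labelled examples closest to it, so it needs examples that already carry a label. The sketch below assumes two numeric features per customer; the data, the feature choice and the group names are hypothetical.

```python
from collections import Counter
import math

def knn_classify(point, labelled, k=3):
    """Label `point` by majority vote among its k nearest labelled examples."""
    # Sort the examples by Euclidean distance to the point and keep the k closest
    nearest = sorted(labelled, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical customers: (visits per month, average spend) -> group
customers = [((1, 10), "occasional"), ((2, 15), "occasional"), ((1, 20), "occasional"),
             ((8, 90), "regular"), ((9, 80), "regular"), ((10, 100), "regular")]

print(knn_classify((2, 12), customers))   # occasional
print(knn_classify((9, 95), customers))   # regular
```

In practice the features would usually be rescaled first so that one with a large range (spend) does not dominate the distance.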

Classifying
Learning Classifiers
–Based on sample of population
–Classified by hand
–Split into two parts
 The training set, used to compute the classifier
 The test set, used to test the ability of the classifier
–Many kinds of classifiers available, all need good understanding of statistics, e.g. Naïve Bayesian, Decision Tree, SVM
–Threshold set to balance recall and precision
Rule and example based for human classifier, but performance varies with experience and skill
–E.g. book classification, Yahoo directory classification, medical diagnosis
–Human classifiers need to be trained too
–If classification is done by end-users, it is likely to be inconsistent
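
The training/test procedure can be sketched with a deliberately trivial classifier: a single threshold on one score. This is an assumption-laden illustration, not one of the classifiers named on the slide. The sample is invented, the split simply alternates items (a real split would normally be random), and the "learning" step just takes the midpoint between the two class means.

```python
# Hand-classified sample: (score, is_relevant). Hypothetical data.
sample = [(0.9, True), (0.8, True), (0.7, True), (0.5, True),
          (0.4, False), (0.3, False), (0.2, False), (0.1, False)]

# Split into two parts; alternating items keeps both classes in each part
training, test = sample[::2], sample[1::2]

def train_threshold(examples):
    """'Compute the classifier': midpoint between the mean score of each class."""
    pos = [score for score, label in examples if label]
    neg = [score for score, label in examples if not label]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def evaluate(threshold, examples):
    """Count test items whose predicted class (score >= threshold) matches the label."""
    return sum(1 for score, label in examples if (score >= threshold) == label)

threshold = train_threshold(training)   # roughly 0.55 for this sample
correct = evaluate(threshold, test)
print(f"{correct} of {len(test)} test items classified correctly")
# 3 of 4 test items classified correctly
```

The one miss is the relevant item with score 0.5, which falls just below the learned threshold; the test set exposes an error that the training set alone would not.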

Review
3 tier web architecture – describe, explain, terminology, typical interactions
SQL & PHP
Extended ER models
Interaction in human and computer systems – sequence diagrams, stateful interaction
Alternative Development Processes
–Agile Development and Extreme Programming – description, application, comparison with SSADM, choice of appropriate development model
Frames – rationale, role in IS development, basic recognition in a problem description of simple frames and the following in detail
–Matching Frame – typical applications, fitness function, recognising nominal, ordinal, interval and ratio scales, use of weights
–Classification Frame – typical applications, terminology, calculation of recall and precision, guidelines for constructing a taxonomy

Preview
XML and XSLT
Business Processes and BPML
Scenarios and Use cases
Learning Frame