Some Key Questions about your Data. Damian Gordon, Brendan Tierney, Brian Mac Namee.

Presentation transcript:

Some Key Questions about your Data. Damian Gordon, Brendan Tierney, Brian Mac Namee

Some students will be using a dataset as part of their research, typically thousands of rows of data. We are not talking about the data you might collect from surveys and interviews, but rather a pre-existing set of data. If the data is a key consideration in your research (although not all projects will necessarily be concerned with large datasets), it is important to consider several questions. The Data

- How suitable is the data?
- What is the type of the data?
- Where will you get it from?
- What size is the dataset?
- What format is it in?
- How much cleaning is required?
- What is the quality of the data?
- How do you deal with missing data?
- How will you evaluate your analysis?
- etc.
Overview

Determining the suitability of the data is a vital consideration. It is not sufficient simply to locate a dataset that is thematically linked to your research question; it must be appropriate for exploring the questions that you want to ask. For example, just because you want to do Credit Card Fraud detection and you have a dataset that contains Credit Card transactions, or was used in another Credit Card Fraud project, does not mean that it will be suitable for your project. Suitability: Dataset

Is the data already labelled? This is very important for supervised learning problems. To take the credit card fraud example again, you can probably get as many credit card transactions as you like but you probably won't be able to get them marked up as fraudulent and non-fraudulent. Suitability: Labelling

The same thing goes for a lot of text analytics problems - can you get people to label thousands of documents as being interesting or non-interesting to them so that you can train a predictive model? The availability of labelled data is a key consideration for any supervised learning problem. The areas of semi-supervised learning and active learning try to address this problem and have some very interesting open research questions. Suitability: Labelling

Two important considerations:
- The Curse of Dimensionality: when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. In order to obtain a statistically sound result, the amount of data you need often grows exponentially with the dimensionality.
- The No Free Lunch Theorem: classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems.
Suitability: Labelling
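The sparsity effect behind the Curse of Dimensionality can be illustrated with a short Monte Carlo sketch (an invented example, not from the slides): the fraction of a hypercube occupied by the inscribed unit ball collapses as the dimension grows, so uniformly scattered points rapidly end up far from any fixed region.

```python
import random

def fraction_inside_unit_ball(dim, n_samples=20000, seed=42):
    """Estimate what fraction of uniform points in the hypercube
    [-1, 1]^dim fall inside the unit ball -- a simple illustration
    of how data becomes sparse as dimensionality grows."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        point = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if sum(x * x for x in point) <= 1.0:
            inside += 1
    return inside / n_samples

for d in (2, 5, 10):
    print(d, fraction_inside_unit_ball(d))
```

In 2 dimensions roughly three quarters of the cube lies in the ball; by 10 dimensions almost none of it does, which is why fixed-size samples "thin out" as features are added.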

Also remember for labelling, you might be aiming for one of three goals:
- Binary classification: assigning each data item to one of two categories.
- Multiclass classification: assigning each data item to exactly one of more than two categories.
- Multi-label classification: assigning each data item to multiple target labels.
Suitability: Labelling
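The practical difference between the three goals is the shape of the target variable. A minimal sketch with invented labels:

```python
# Hypothetical label structures for the three classification goals.
binary_labels = ["fraud", "legit", "legit", "fraud"]             # one of two categories per item
multiclass_labels = ["retail", "online", "atm", "online"]        # one of several categories per item
multilabel_labels = [{"urgent", "finance"}, {"finance"}, set()]  # zero or more labels per item

# Binary classification is just multiclass with exactly two classes;
# multi-label allows each item to carry a *set* of labels.
assert len(set(binary_labels)) == 2
assert len(set(multiclass_labels)) > 2
assert any(len(labels) > 1 for labels in multilabel_labels)
```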

- Federated data
- High dimensional data
- Descriptive data
- Longitudinal data
- Streaming data
- Web (scraped) data
- Numeric vs. categorical vs. text data
- etc.
Types of Data

e.g. ogspot.com/2011/11/dataset-sites.html Locating Datasets

What is a reasonable size of a dataset? Obviously it varies a lot from problem to problem, but in general we would recommend at least 10 features (columns) in the dataset, and we’d like to see thousands of instances (rows). Size of the Dataset

- TXT (Text file)
- MIME (Multipurpose Internet Mail Extensions)
- XML (Extensible Markup Language)
- CSV (Comma-Separated Values)
- ASCII (American Standard Code for Information Interchange)
- etc.
Format of the Data
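As a sketch of why format matters in practice, here is how a CSV file might be parsed with Python's standard library (the column names and values are invented for illustration). Note that every field arrives as a string and must be converted to the right type before analysis.

```python
import csv
import io

# A hypothetical three-column CSV, inlined here instead of read from disk.
raw = "id,amount,label\n1,12.50,ok\n2,900.00,fraud\n"

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))

# Fields are plain strings until explicitly converted.
amounts = [float(row["amount"]) for row in rows]
print(amounts)
```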

- Parsing
- Correcting
- Standardizing
- Matching
- Consolidating
Cleaning of Data
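A toy sketch of the standardizing, matching and consolidating steps, using invented name records; real cleaning pipelines are far more involved, but the shape is the same:

```python
def standardize(name):
    """Standardizing: collapse repeated whitespace and normalise case
    so that equivalent records compare equal."""
    return " ".join(name.split()).lower()

records = ["  Mary  Byrne", "mary byrne", "John O'Neill "]

# After standardizing, matching duplicates becomes a simple equality test,
# and consolidating reduces them to one record each.
cleaned = [standardize(r) for r in records]
consolidated = sorted(set(cleaned))
print(consolidated)
```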

- Frequency counts
- Descriptive statistics (mean, standard deviation, median)
- Normality (skewness, kurtosis, frequency histograms, normal probability plots)
- Associations (correlations, scatter plots)
Quality of the Data
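Several of these checks can be computed directly. Below is a sketch of the descriptive statistics plus a simple (biased, Fisher-Pearson) sample skewness on invented data with one outlier; a strongly positive value flags a right-skewed, non-normal column:

```python
import statistics

def skewness(xs):
    """Biased Fisher-Pearson sample skewness: mean cubed z-score.
    Roughly zero for symmetric data, positive for a right tail."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    n = len(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / n

data = [1, 2, 2, 3, 3, 3, 10]  # one large outlier on the right
print(statistics.fmean(data), statistics.median(data), skewness(data))
```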

- Imputation
- Partial imputation
- Partial deletion
- Full analysis
- Also consider database nullology
Missing Data?
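Two of the simplest strategies, deletion of incomplete records and mean imputation, might be sketched as follows (toy data; a real study should justify which strategy is appropriate for its missingness mechanism):

```python
import statistics

raw = [4.0, None, 6.0, None, 5.0]  # None marks a missing value

# Deletion: drop the incomplete records entirely.
deleted = [x for x in raw if x is not None]

# Mean imputation: replace each missing value with the observed mean.
mean = statistics.fmean(deleted)
imputed = [mean if x is None else x for x in raw]
print(deleted, imputed)
```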

- Training Dataset (Build dataset)
- Test Dataset
- Apply Dataset (Scoring Dataset)
Dataset types
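The build and test datasets are commonly obtained by randomly splitting the labelled data; the apply (scoring) dataset is the new, unlabelled data the model is eventually run on. A minimal sketch (the split fraction and seed are arbitrary choices):

```python
import random

def train_test_split(rows, train_frac=0.7, seed=1):
    """Shuffle the labelled rows and split them into a build (training)
    set and a held-out test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(10)))
print(len(train), len(test))
```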

How do we evaluate the research project? Evaluation

What about measures such as:
- Area under the Curve
- Misclassification Error
- Confusion Matrix
- N-fold Cross-Validation
- ROC Graph
- Log-Loss and Hinge-Loss
Evaluation
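As an illustration, a 2x2 confusion matrix and the misclassification error can be computed from scratch; the labels and predictions here are invented:

```python
def confusion_matrix(actual, predicted, positive="fraud"):
    """Return (TP, FP, FN, TN) counts for a binary problem."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

actual    = ["fraud", "ok", "ok", "fraud", "ok"]
predicted = ["fraud", "ok", "fraud", "ok", "ok"]

tp, fp, fn, tn = confusion_matrix(actual, predicted)
error = (fp + fn) / len(actual)  # misclassification error
print(tp, fp, fn, tn, error)
```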

These measures are good for evaluating the analysis: they tell you how good the model is on the dataset, and they are definitely part of the evaluation. But if you want to discuss the findings with respect to the real world (and to the research question), you must do the following: test predictions against the real world. Evaluation

Other questions? The Data