Classical Techniques: Statistics, Neighborhoods, and Clustering.



What is Statistics?
Statistics is a branch of mathematics concerned with the collection and description of data. Statistics was, in fact, born from the humble beginnings of real-world problems in business, biology, and gambling. Knowing statistics helps the average business person make better decisions by allowing them to assess risk and uncertainty when all the facts either aren’t known or can’t be collected. Today, data mining has been defined independently of statistics.

Data, Counting and Probability
One thing that is always true about statistics is that there is always data involved. Statistics can help greatly in making sense of that data by answering several important questions about it:
– What patterns are there in my database?
– What is the chance that an event will occur?
– Which patterns are significant?
– What is the high-level summary of the data that gives me some idea of what is contained in my database?
One of the great values of statistics is in presenting a high-level view of the database that provides some useful information without requiring every record to be understood in detail.

Histogram
The first step in understanding statistics is to understand how the data is collected into a higher-level form. One of the most notable ways of doing this is with the histogram.
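As a minimal sketch of this idea, a histogram counts how many records fall into each range (bin) of a predictor. The customer ages, the bin width, and the `histogram` helper below are hypothetical illustrations, not values or code from the original slides.

```python
from collections import Counter

# Hypothetical customer ages (illustrative only, not from the slides).
ages = [23, 27, 31, 35, 36, 42, 44, 47, 51, 58, 62, 65]
BIN_WIDTH = 10  # assumed bin width: one bin per decade of age


def histogram(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


bins = histogram(ages, BIN_WIDTH)

# Print a simple text histogram: one '#' per record in the bin.
for start, count in bins.items():
    print(f"{start}-{start + BIN_WIDTH - 1}: {'#' * count}")
```

The twelve individual records collapse into five counts, which is the kind of high-level summary the slide describes.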

Histogram (cont’d)

Figure 6-1: An example database of customers with different predictor types

Statistics for Prediction
Regression is a powerful and commonly used tool in statistics and is discussed here.
Linear Regression:
– The simplest form of regression is simple linear regression, which contains only one predictor and one prediction.
– The relationship between the two can be mapped in a two-dimensional space, with the records plotted by their prediction values along the Y-axis and their predictor values along the X-axis.
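Simple linear regression with one predictor can be sketched as fitting a straight line through the plotted records. The example below uses ordinary least squares, which is the standard way to fit such a line but is an assumption here, since the slides do not name a fitting method. The data points and the `fit_line` and `predict` helpers are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical records: predictor values (X-axis) and prediction values (Y-axis).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Predict the Y value for a new predictor value using the fitted line."""
    return intercept + slope * x


print(slope, intercept, predict(6.0))
```

Once the line is fitted, predicting for a new record only requires its predictor value.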

Statistics for Prediction (cont’d)

Nearest Neighbor
Clustering and the nearest neighbor prediction technique are among the oldest techniques used in data mining. Clustering means that like records are grouped, or clustered, together. Nearest neighbor is a prediction technique that is quite similar to clustering: to predict the prediction value for one record, look for records with similar predictor values in the historical database and use the prediction value from the record that is ‘nearest’ to the unclassified record.

How to use Nearest Neighbor for Prediction
One of the essential elements underlying the concept of clustering is that one particular object (whether it is a car, a food item, or a customer) can be closer to another object than some third object can. The nearest neighbor prediction algorithm, simply stated, is as follows:
– Objects that are ‘near’ each other will also have similar prediction values. Thus, if you know the prediction value of one object, you can predict it for its nearest neighbors.
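The algorithm stated above can be sketched as a lookup over historical records. The slides do not say how ‘nearness’ is measured, so Euclidean distance is an assumption here. The customer records (age in years, income in thousands) and the `nearest_neighbor_predict` helper are hypothetical.

```python
import math

# Hypothetical historical database: (predictor values, known prediction value).
# Predictors are (age in years, income in thousands); illustrative only.
history = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((40, 60), "yes"),
    ((55, 90), "yes"),
]


def nearest_neighbor_predict(records, predictors):
    """Return the prediction value of the historical record nearest to `predictors`.

    Nearness is measured with Euclidean distance (an assumed choice).
    """
    _, prediction = min(records, key=lambda rec: math.dist(rec[0], predictors))
    return prediction


print(nearest_neighbor_predict(history, (28, 33)))
print(nearest_neighbor_predict(history, (50, 80)))
```

In practice the predictors would usually be rescaled first, since a predictor with a large numeric range would otherwise dominate the distance.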