Data Mining ICCM - 2017.



Knowledge Discovery Process
(Figure: pipeline from source data to result.) Databases; Data Cleaning and Data Integration; Preprocessed Data; Selection; Task-relevant Data; Data transformations; Data Mining; Interpretation.

Remember: Domain Expertise
A domain expert has a strong understanding of the business problem, understands subtle relationships, and helps in reducing data dimensions.

Cleaning and Preparing Data
What you do with missing values depends on how many there are, and whether they’re missing randomly or systematically. When in doubt, assume that missing values are missing systematically.
Appropriate data transformations can make the data easier to understand and easier to model. Normalization and rescaling are important when relative changes are more important than absolute ones.
Data provenance records help reduce errors as you iterate over data collection, data treatment, and modeling.
(Zumel and Mount, Practical Data Science with R, 2014, Manning)
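
As a rough illustration of the points about missing values and rescaling, the sketch below counts missing entries and applies two common rescalings. The column names, the sample values, and the choice of `None` as the missing-value marker are assumptions made for this example, not part of the slides.

```python
import math

def missing_fraction(column):
    """Fraction of entries that are None (treated here as 'missing')."""
    return sum(v is None for v in column) / len(column)

def min_max_rescale(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center to mean 0 and scale by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical columns, for illustration only.
incomes = [20000, 35000, 50000, 80000]
ages = [25, None, 40, None]

rescaled = min_max_rescale(incomes)
standardized = z_score(incomes)
print(missing_fraction(ages), rescaled)
```

A high missing fraction is a cue to check whether the gaps are systematic before imputing or dropping rows.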

Data Mining
Objective: Fit data to a model.
Potential result: Higher-level meta information that may not be obvious when looking at raw data.
Similar terms: exploratory data analysis, data-driven discovery, inductive learning.

Query Examples
Database: Find all credit card applications with the last name “Smith”. Find customers who have purchased milk.
Data Mining: Find all credit card applications that are a poor risk (classification). Find all items that are frequently purchased with milk.
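
The contrast between the two kinds of question can be sketched in a few lines. The customer names and baskets below are hypothetical; the first expression is a plain lookup, while the second summarizes a pattern across records.

```python
from collections import Counter

# Hypothetical purchase records, for illustration only.
purchases = {
    "alice": {"milk", "bread"},
    "bob": {"milk", "cereal"},
    "carol": {"bread", "eggs"},
    "dave": {"milk", "cereal", "bread"},
}

# Database-style query: which customers purchased milk?
milk_buyers = sorted(name for name, basket in purchases.items() if "milk" in basket)

# Mining-style question: which items tend to appear alongside milk?
co_counts = Counter(
    item
    for basket in purchases.values()
    if "milk" in basket
    for item in basket
    if item != "milk"
)

print(milk_buyers)
print(co_counts.most_common())
```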

Data Mining Models and Tasks

Machine Learning Algorithms
Some algorithms are trained, which requires sample sets of the data; later predictions are based on how the model was trained. Unsupervised training does not require prior classification of the data. Supervised training requires that the data be labeled before training (e.g. records of people who did and did not purchase a product).

Basic Data Mining Tasks
Classification maps data into predefined groups or classes. Related terms: supervised learning, pattern recognition, prediction.
Regression is used to map a data item to a real-valued prediction variable.
Clustering groups similar data together into clusters. Related terms: unsupervised learning, segmentation, partitioning.

Basic Data Mining Tasks (cont’d)
Summarization maps data into subsets with associated simple descriptions. Related terms: characterization, generalization.
Link Analysis uncovers relationships among data. Related terms: affinity analysis, association rules.
Sequential Analysis determines sequential patterns.

Linear regression
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (predictor) variable X and the other the dependent (outcome) variable Y.

What is “Linear”?
Remember this: Y = mX + B? Here m is the slope of the line and B is its intercept.
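
A minimal sketch of fitting the line Y = mX + B by ordinary least squares, assuming a single predictor. The function name `fit_line` and the sample points are made up for this example.

```python
def fit_line(xs, ys):
    """Return (m, b) minimizing the sum of squared errors of y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b

# Hypothetical points lying exactly on Y = 2X + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)
print(m, b)
```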

Linear Correlation
(Figure: scatter plots of Y against X showing strong relationships and weak relationships.)
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
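
One common way to quantify how strong a linear relationship is, not named on the slide but assumed here, is the Pearson correlation coefficient r, which lies between -1 and +1. The two data sets below are hypothetical stand-ins for the strong and weak cases.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
strong_ys = [2.1, 3.9, 6.2, 8.0, 9.8]  # nearly on a straight line
weak_ys = [5, 1, 4, 2, 6]              # little linear structure

r_strong = pearson_r(xs, strong_ys)
r_weak = pearson_r(xs, weak_ys)
print(r_strong, r_weak)
```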

Support Vector Machine (SVM)
Max-Margin Classifier: Formalizes the notion of the best linear separator.
Lagrangian Multipliers: A way to convert a constrained optimization problem into one that is easier to solve.
Kernels: Projecting data into a higher-dimensional space can make it linearly separable.
Complexity: Depends only on the number of training examples, not on the dimensionality of the kernel space!

Linear Separators Which of the linear separators is optimal?

Tennis example
(Figure: scatter plot with axes Temperature and Humidity; one marker denotes play tennis, the other denotes do not play tennis.)

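Linear Support Vector Machines
Data: $\langle x_i, y_i \rangle$, $i = 1, \dots, l$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$.
(Figure: points in the $(x_1, x_2)$ plane, with separate markers for the classes $+1$ and $-1$.)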
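
For reference, the standard hard-margin formulation on this data can be written as follows. The weight vector $w$, offset $b$, and multipliers $\alpha_i$ are symbols introduced here; the slides do not state the formulas, so this is the textbook form rather than a transcription.

```latex
% Primal: the margin is 2 / \|w\|, so maximizing it means minimizing \|w\|^2.
\min_{w,\, b} \ \frac{1}{2} \|w\|^2
\quad \text{subject to} \quad
y_i \left( w^\top x_i + b \right) \ge 1, \qquad i = 1, \dots, l

% Dual, obtained with Lagrange multipliers \alpha_i:
\max_{\alpha} \ \sum_{i=1}^{l} \alpha_i
 - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j \, y_i y_j \, x_i^\top x_j
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0
```

The dual touches the inputs only through the inner products $x_i^\top x_j$, which is what allows a kernel function to be substituted later.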

Non-linear SVMs
Datasets that are linearly separable with some noise are OK. But what are we going to do if the dataset is like this? (Figure not reproduced.)
How about mapping the data to a higher-dimensional space? (Figure: the data plotted with axes $x$ and $x^2$.)
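
A small sketch of why mapping to a higher dimension can help, assuming one-dimensional inputs and the map from x to (x, x^2). The data points and the helper `threshold_separates` are hypothetical.

```python
def threshold_separates(values, labels):
    """True if a single threshold puts all +1 points on one side
    and all -1 points on the other."""
    pos = [v for v, y in zip(values, labels) if y == +1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)

# Hypothetical 1-D data: the +1 class sits on both sides of the -1 class.
xs = [-3, -2, -1, 0, 1, 2, 3]
labels = [+1, +1, -1, -1, -1, +1, +1]

# In the original space no threshold on x works.
separable_in_x = threshold_separates(xs, labels)

# After mapping x to (x, x^2), a threshold on the new x^2 coordinate works,
# which corresponds to a horizontal line in the (x, x^2) plane.
squares = [x * x for x in xs]
separable_in_x_squared = threshold_separates(squares, labels)

print(separable_in_x, separable_in_x_squared)
```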

Kernel Trick (Raise to higher dimension)
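
To make the idea concrete, the sketch below checks numerically that a degree-2 polynomial kernel equals an inner product in a higher-dimensional feature space. The specific kernel $(x \cdot z)^2$ and the map `phi` are a standard textbook pairing chosen for illustration; the slide does not say which kernel it had in mind.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def phi(x):
    """Explicit feature map from 2-D input to 3-D feature space."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Kernel computed in the original 2-D space, no mapping needed."""
    return dot(x, z) ** 2

x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(phi(x), phi(z))   # map first, then take the inner product
implicit = poly2_kernel(x, z)    # same value without ever forming phi
print(explicit, implicit)
```

The kernel side never builds the 3-D vectors, which is the saving the trick refers to when the feature space is large.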

Confusion Matrix
Measures the performance of a classification model. Type I error: false positives (FP). Type II error: false negatives (FN).
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
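
The four cells of a binary confusion matrix can be tallied directly from actual and predicted labels. The label vectors and the helper name `confusion_counts` below are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["TP"] += 1
        elif p == positive:
            counts["FP"] += 1   # Type I error
        elif a == positive:
            counts["FN"] += 1   # Type II error
        else:
            counts["TN"] += 1
    return counts

# Hypothetical labels: 1 = positive class, 0 = negative class.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

c = confusion_counts(actual, predicted)
accuracy = (c["TP"] + c["TN"]) / len(actual)
precision = c["TP"] / (c["TP"] + c["FP"])
recall = c["TP"] / (c["TP"] + c["FN"])
print(c, accuracy, precision, recall)
```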

Clustering: Partitioning Clustering Approach
A typical clustering analysis approach that iteratively partitions the training data set to learn a partition of the given data space. Learning a partition on a data set produces several non-empty clusters (usually, the number of clusters is given in advance). In principle, the optimal partition is achieved by minimizing the sum of squared distances (e.g., Euclidean distance) from each point to the “representative object” of its cluster.
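
Written out, the objective described above takes the following form, assuming Euclidean distance. The symbols $K$ (number of clusters), $C_k$ (the $k$-th cluster), and $\mu_k$ (its representative object) are notation introduced here.

```latex
\mathrm{SSE} = \sum_{k=1}^{K} \sum_{x \in C_k} \left\| x - \mu_k \right\|^2
```

When $\mu_k$ is taken to be the mean of the points in $C_k$, this is the quantity that K-means tries to reduce at each iteration.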

Illustrating Clustering
Intracluster distances are minimized. Intercluster distances are maximized.

K-means Clustering
The user sets the number of clusters they’d like (e.g. K=5).
Randomly guess K cluster centre locations.
Each data point finds out which centre it’s closest to. (Thus each centre “owns” a set of data points.)
Each centre finds the centroid of the points it owns, and jumps there.
Repeat the last two steps until terminated (for example, when no centre moves)!
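
The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the initial centres are drawn from the data points themselves, an empty cluster keeps its old centre, and the stopping rule is "no centre moved", all of which are assumptions of this sketch. The sample points are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # random initial guess
    clusters = []
    for _ in range(max_iter):
        # Each point finds the centre it is closest to.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centres[i]))
            clusters[nearest].append(p)
        # Each centre jumps to the centroid of the points it owns.
        new_centres = [centroid(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:           # nothing moved: terminate
            break
        centres = new_centres
    return centres, clusters

# Two hypothetical, well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centres, clusters = k_means(points, k=2)
print(sorted(centres))
```

Different seeds can lead to different results on harder data, which is why K-means is often run several times and the lowest-error run kept.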

Dendrogram
A tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes or samples.
https://en.wikipedia.org/wiki/Dendrogram
http://www.instantr.com/2013/02/12/performing-a-cluster-analysis-in-r/
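
A dendrogram records the order and distance at which clusters are merged. The sketch below lists those merges for one-dimensional points, assuming single linkage (distance between the closest members); the linkage choice, the function name, and the sample values are assumptions for illustration.

```python
def single_linkage(points):
    """Agglomerative clustering of 1-D points.

    Returns the merges as (cluster_a, cluster_b, distance), in the
    bottom-up order in which a dendrogram would join them.
    """
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

merges = single_linkage([1.0, 2.0, 6.0, 7.5, 20.0])
heights = [d for _, _, d in merges]
print(heights)
```

Each merge distance is the height of one junction in the tree; cutting the tree at a chosen height yields a flat clustering.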

There has been a considerable amount of research in the area of Market Basket Analysis. Its appeal comes from the clarity and utility of its results, which are expressed in the form of association rules.
Given: a database of transactions, where each transaction contains a set of items.
Example: When a customer buys bread and butter, they buy milk 85% of the time.
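
A percentage like the one in the example is usually called the confidence of a rule, and it is computed from how often itemsets occur (their support). The sketch below shows the arithmetic on a tiny hypothetical transaction list, so its numbers differ from the 85% quoted above.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Hypothetical transactions, for illustration only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]

rule_support = support({"bread", "butter", "milk"}, transactions)
rule_confidence = confidence({"bread", "butter"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```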

Market Basket (Association Rules)
Where should detergents be placed in the store to maximize their sales?
Are window cleaning products purchased when detergents and orange juice are bought together?
Is soda typically purchased with bananas? Does the brand of soda make a difference?
How are the demographics of the neighborhood affecting what customers are buying?