Introduction to Data Mining- CMPT 741 Instructor: Ke Wang

Introduction to Data Mining - CMPT 741
Instructor: Ke Wang, wangk@cs.sfu.ca

Teaching Resource
Teaching materials:
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, “Introduction to Data Mining”, Addison Wesley, 2006.
2. Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, “Mining of Massive Datasets”: http://i.stanford.edu/~ullman/mmds/book.pdf
3. Course website (slides): http://www.cs.sfu.ca/CourseCentral/741/wangk
Reference: Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”.
Grading: assignment/project 40%, midterm 20%, final 40%.
Instructor office hours: Tuesday 3-4pm, TASC1, 9235
TA: Jiax Tang, jiaxit@sfu.ca, Thursday 2-3pm.

Topics
- Data Mining Introduction (1,3)
- Classification (supervised learning) (1,3)
- Association Rule Mining (1,3)
- Clustering (unsupervised learning) (1,2,3)
- Big Data Deep Learning (3)
- Recommendation Systems (2,3)
- Mining User Behavior Data (3)
- Link Analysis (2,3)
(Numbers indicate the sources listed on the previous slide.)

Why Mine Data?
Commercial Viewpoint
- Lots of data is being collected and warehoused (web data, online shopping, social media data …)
- Computers are cheaper and more powerful
- Competitive pressure (profit driven, CRM, loyalty programs)
Scientific Viewpoint
- Data is collected at enormous speeds (remote sensors, telescopes scanning, microarrays generating gene expression data …)
- Data mining helps scientists with classifying and segmenting data, hypothesis formation, and summarizing

What is Data Mining?
Many definitions:
- Non-trivial extraction of implicit, previously unknown and potentially useful information from data
- Patterns, rules, trends, exceptions, etc.
- One of the many steps in knowledge discovery

Database vs Data Mining
Database: retrieve stored information
- Look up a phone number in a phone directory
- Query a Web search engine for information about “Amazon”
Data Mining: extract implied information
- Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
- Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)

Origins of Data Mining
Traditional techniques may be unsuitable due to:
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data
[Diagram: Data Mining drawing on Statistics/AI, Machine Learning/Pattern Recognition, and Database systems. Annotations: “Search complex model that fits small data”; “Model verification”; Data Mining: “Knowledge discovery from large, dynamic, and diverse data”; Database systems: “Large data but simple queries”.]

Data Mining Tasks
- Prediction: use observed variables to predict unknown or future values of other variables. E.g., classification, regression, recommendation.
- Description: find human-interpretable patterns that describe the observed data. E.g., clustering, segmenting, association rules.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Task 1: Classification (supervised learning)
Input: A collection of observed records (the training set) over a set of attributes, one being the class attribute (either discrete or continuous).
Output: A model (or classifier) for predicting the class of future records using the remaining attributes.
Assumption: Future records follow the same distribution as the records in the training set.
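
The input/output description above can be made concrete with a very small learner. The sketch below uses the 1R ("one rule") method, chosen here only because it fits in a few lines; the slides do not name a specific algorithm. The attribute names (`refund`, `marital`, `cheat`) and the records are hypothetical, and only a discrete class attribute is handled.

```python
from collections import Counter, defaultdict


def learn_one_rule(records, class_attr):
    """Learn a 1R model: the single attribute whose values best predict the class.

    Returns (attribute, {value: predicted_class}, default_class).
    """
    attributes = [a for a in records[0] if a != class_attr]
    # Fallback prediction for attribute values never seen during training.
    default = Counter(r[class_attr] for r in records).most_common(1)[0][0]
    best = None
    for attr in attributes:
        # Count class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for r in records:
            counts[r[attr]][r[class_attr]] += 1
        # Each value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[r[attr]] != r[class_attr] for r in records)
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    _, attr, rule = best
    return attr, rule, default


def predict(model, record):
    """Predict the class of a record that has no class attribute."""
    attr, rule, default = model
    return rule.get(record[attr], default)


# Hypothetical training set: two categorical attributes plus the class "cheat".
training_set = [
    {"refund": "yes", "marital": "married", "cheat": "no"},
    {"refund": "no", "marital": "married", "cheat": "no"},
    {"refund": "no", "marital": "single", "cheat": "yes"},
    {"refund": "yes", "marital": "single", "cheat": "no"},
    {"refund": "no", "marital": "divorced", "cheat": "yes"},
    {"refund": "no", "marital": "married", "cheat": "no"},
    {"refund": "no", "marital": "single", "cheat": "yes"},
]

model = learn_one_rule(training_set, "cheat")
future_record = {"refund": "no", "marital": "single"}
print(model[0], predict(model, future_record))
```

On this toy data the learner picks `marital`, which misclassifies one training record, over `refund`, which misclassifies two. Real classifiers combine several attributes, but the shape is the same: a training set goes in, and a model that labels future records comes out.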

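Classification Example
[Figure: a Training Set whose attributes are labeled categorical, categorical, continuous, and class; Learning produces a Model from the Training Set, and the Model is then used for prediction.]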

Classification: Application
- Direct marketing: use historical data to predict which consumers are likely to buy a new target product (e.g., a cell phone).
- Predict friendships: use a social network and user profiles to predict new friends for users.
- Classify articles into topics (e.g., the Yahoo hierarchy and the Open Directory).
- Predict the popularity of a blog or the citations of a paper (continuous class attribute).
- Detect fraudulent credit card transactions (real-time prediction).
- Predict a user’s rating of movies, books, POIs, etc.

Task 2: Clustering (unsupervised learning)
Input: A set of data points in d-dimensional space (i.e., records over d attributes), and a similarity (or distance) measure among data points.
Output: A set of clusters such that data points in the same cluster are more similar to one another, and data points in separate clusters are less similar to one another.
Assumption: Unlike classification, there is no class attribute, but a similarity measure is required instead.
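
As an illustration of this input/output contract, here is a minimal sketch of k-means (Lloyd's algorithm) with Euclidean distance. k-means is one common clustering method, used here as an assumption; the slide does not prescribe it. The points and the choice of initial centroids are hypothetical.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Cluster points with Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                mean = tuple(sum(xs) / len(members) for xs in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels


# Hypothetical 3-D data: two well-separated groups of three points each.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (9, 9, 9), (10, 9, 9), (9, 10, 9)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
```

There is no class attribute anywhere in the input. The only things the algorithm consumes are the points and the distance function, which is why the choice of similarity measure matters so much in clustering.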

Clustering Example
Euclidean distance based clustering in 3-D space.
[Figure: intracluster distances are minimized; intercluster distances are maximized.]

Clustering: Application
- Market segmentation: subdivide a market into distinct subsets of customers so that each subset can be targeted by a distinct marketing strategy.
- Document clustering: cluster documents into subsets by topic (i.e., documents containing similar terms likely have similar topics).
- Group URLs returned by a search by clustering their web pages.

Task 3: Association Rule Discovery
Input: A collection of sets of items (each set is called a transaction).
Output: Rules that describe relationships between subsets of items.
Assumption: No prior knowledge of which subsets may appear on either side of a rule.
Example rule: {Diaper} --> {Beer}, read “if a customer buys Diaper, he/she likely buys Beer (with 2/3 chance)”.
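
The "2/3 chance" attached to a rule is usually called its confidence: among the transactions that contain the left-hand side, the fraction that also contain the right-hand side. The sketch below computes support and confidence for {Diaper} --> {Beer}. The four transactions are hypothetical, chosen so that the confidence comes out to 2/3; they are not the data behind the slide.

```python
from fractions import Fraction


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(lhs, rhs, transactions):
    """Confidence of lhs --> rhs: support(lhs and rhs together) / support(lhs).

    Assumes lhs occurs in at least one transaction.
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return support(lhs | rhs, transactions) / support(lhs, transactions)


# Hypothetical transactions (not taken from the slides).
transactions = [
    frozenset({"Diaper", "Beer", "Milk"}),
    frozenset({"Diaper", "Beer"}),
    frozenset({"Diaper", "Bread"}),
    frozenset({"Milk", "Bread"}),
]

print(support({"Diaper", "Beer"}, transactions))       # 1/2
print(confidence({"Diaper"}, {"Beer"}, transactions))  # 2/3
```

Diaper appears in three transactions and two of those also contain Beer, hence 2/3. A rule miner enumerates candidate left- and right-hand sides itself, which is what the assumption above refers to; this sketch only scores one given rule.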

Association Rule Discovery: Application
- Sales promotion, {Diaper} --> {Beer}:
  - Run a sale on Diaper and raise the price of Beer.
  - Which products would be affected if the store discontinued selling Diaper?
- Keyword and query completion: recommend the rest of a keyword or query to a searcher.
- Web link or page recommendation.
- Item recommendation: Amazon’s success.

Challenges of Data Mining (3Vs and more)
- (Volume) Scalability: the number of objects; dimensionality: the number of variables.
- (Variety) Complex and heterogeneous data: text, set, string, graph.
- (Velocity) Dynamic data: streaming data, changing rapidly.
- Data quality: missing, incomplete, noisy data.
- Data privacy: privacy of sensitive information must be preserved.
- Meaningfulness/validation.

Meaningfulness of Answers
A big data-mining risk is that you will “discover” patterns that are meaningless.
Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

Rhine Paradox – (1)
A parapsychologist in the 1950s hypothesized that some people had Extra-Sensory Perception (ESP). He devised an experiment in which subjects were asked to guess 10 hidden cards, each red or blue. He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right! He told these people they had ESP and called them in for another test of the same type. This time he discovered that almost all of them had lost their ESP. What did he conclude? Answer on next slide.

Rhine Paradox – (2)
He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
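
The arithmetic behind the story is worth spelling out, since it shows why the "discovery" is what Bonferroni’s principle predicts. The sketch below assumes each guess is an independent fair coin flip between red and blue; the pool of 1000 subjects is a hypothetical round number, not a figure from the slides.

```python
from fractions import Fraction


def all_correct_probability(num_cards, num_colors=2):
    """Probability of guessing every card right by pure chance."""
    return Fraction(1, num_colors) ** num_cards


p = all_correct_probability(10)  # chance of 10 out of 10 by luck alone
num_subjects = 1000              # hypothetical number of people tested
expected_lucky = num_subjects * p

print(p)               # 1/1024, i.e. "almost 1 in 1000"
print(expected_lucky)  # 125/128, i.e. about one perfect score per 1000 subjects
print(1 - p)           # 1023/1024: chance a lucky subject fails the retest
```

Pure guessing already produces a perfect score about once per thousand subjects, and a subject who got lucky once has the same 1/1024 chance on the second test. So "almost all of them had lost their ESP" is the expected outcome when no one had it to begin with.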