Introduction to Data Mining - CMPT 741, Instructor: Ke Wang


1 Introduction to Data Mining - CMPT 741. Instructor: Ke Wang, wangk@cs.sfu.ca

2 Teaching Resources
Teaching materials:
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar, "Introduction to Data Mining", Addison Wesley, 2006.
- Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, "Mining of Massive Datasets":
Course website (slides):
Reference: Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques".
Grading: assignments/project 40%, midterm 20%, final 40%.
Instructor office hours: Tuesday 3-4pm, TASC1 9235.
TA: Jiax Tang, office hours Thursday 2-3pm.

3 Topics
- Data Mining Introduction (1,3)
- Classification (supervised learning) (1,3)
- Association Rule Mining (1,3)
- Clustering (unsupervised learning) (1,2,3)
- Big Data Deep Learning (3)
- Recommendation Systems (2,3)
- Mining User Behavior Data (3)
- Link Analysis (2,3)
Numbers in parentheses indicate the sources on the previous slide.

4 Why Mine Data?
Commercial viewpoint:
- Lots of data is being collected and warehoused (web data, online shopping, social media data, ...)
- Computers are cheaper and more powerful
- Competitive pressure (profit driven, CRM, loyalty programs)
Scientific viewpoint:
- Data is collected at enormous speeds (remote sensors, telescopes scanning the skies, microarrays generating gene expression data, ...)
- Mining helps scientists with classifying and segmenting data, hypothesis formation, and summarizing

5 What is Data Mining?
There are many definitions; a common one: the non-trivial extraction of implicit, previously unknown, and potentially useful information from data (patterns, rules, trends, exceptions, etc.). Data mining is one of the many steps in knowledge discovery.

6 Database vs. Data Mining
Database: retrieve stored information.
- Look up a phone number in a phone directory
- Query a Web search engine for information about "Amazon"
Data mining: extract implied information.
- Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly, ... in the Boston area)
- Group together similar documents returned by a search engine according to their context (e.g., the Amazon rainforest vs. Amazon.com)

7 Origins of Data Mining
Traditional techniques may be unsuitable due to:
- Enormity of the data
- High dimensionality of the data
- Heterogeneous, distributed nature of the data
Data mining draws on neighboring fields:
- Statistics / AI: search for a complex model that fits small data; model verification
- Machine learning / pattern recognition
- Database systems: large data but simple queries
Data mining itself: knowledge discovery from large, dynamic, and diverse data.

8 Data Mining Tasks
Prediction: use observed variables to predict unknown or future values of other variables. E.g., classification, regression, recommendation.
Description: find human-interpretable patterns that describe the observed data. E.g., clustering, segmentation, association rules.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996.

9 Task 1: Classification (supervised learning)
Input: a collection of observed records (the training set) over a set of attributes, one of which is the class attribute (either discrete or continuous).
Output: a model (or classifier) for predicting the class of future records using the remaining attributes.
Assumption: the class of future records follows the same distribution as the training set.

10 Classification Example
[Figure: a training set of records with two categorical attributes, one continuous attribute, and a class label; a model is learned from the training set and then applied to predict the class of new records.] A minimal code sketch of the same workflow follows.
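
As a concrete illustration of the task on slide 9, here is a minimal sketch, assuming Python with scikit-learn (the course does not prescribe a library); the toy records and the attribute encodings are made up for the example.

```python
# Minimal classification sketch (assumes scikit-learn is installed).
# Toy training set: two categorical attributes encoded as integers,
# one continuous attribute, and a discrete class attribute.
from sklearn.tree import DecisionTreeClassifier

# Each record: [refund (0/1), marital_status (0=single, 1=married,
#               2=divorced), taxable_income] -- hypothetical attributes.
X_train = [
    [1, 0, 125.0],
    [0, 1, 100.0],
    [0, 0, 70.0],
    [1, 1, 120.0],
    [0, 2, 95.0],
    [0, 1, 60.0],
]
y_train = ["No", "No", "No", "No", "Yes", "No"]  # the class attribute

# Learn a model (classifier) from the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Predict the class of a future record using the remaining attributes.
print(model.predict([[0, 2, 90.0]]))  # e.g. ['Yes']
```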

11 Classification: Applications
- Direct marketing: use historical data to predict which consumers are likely to buy a new target product (e.g., a cell phone).
- Friendship prediction: use a social network and user profiles to predict new friends for users.
- Classify articles into topics (e.g., the Yahoo hierarchy and the Open Directory).
- Predict the popularity of a blog or the citation count of a paper (continuous class attribute).
- Detect fraudulent credit card transactions (real-time prediction).
- Predict a user's rating of movies, books, POIs, etc.

12 Task 2: Clustering (unsupervised learning)
Input: a set of data points in d-dimensional space (i.e., records over d attributes) and a similarity (or distance) measure between data points.
Output: a set of clusters such that data points in the same cluster are more similar to one another, and data points in separate clusters are less similar to one another.
Assumption: unlike classification, there is no class attribute; a similarity measure is required instead.

13 Clustering Example
[Figure: Euclidean-distance-based clustering of points in 3-D space. Intracluster distances are minimized; intercluster distances are maximized.] A minimal code sketch follows.
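
A minimal sketch of the same idea, assuming Python with scikit-learn and NumPy (an assumption, not a course requirement): k-means with Euclidean distance on synthetic points in 3-D space.

```python
# Minimal clustering sketch (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs of points in 3-D space.
points = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 3))
    for center in ([0, 0, 0], [5, 5, 5], [0, 5, 0])
])

# k-means partitions the points so that intracluster (Euclidean)
# distances to each cluster center are minimized.
km = KMeans(n_clusters=3, n_init=10).fit(points)
print(km.labels_[:10])       # cluster assignment for the first 10 points
print(km.cluster_centers_)   # should lie near the three true centers
```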

14 Clustering: Applications
- Market segmentation: subdivide a market into distinct subsets of customers so that each subset can be targeted with a distinct marketing strategy.
- Document clustering: cluster documents into subsets by topic (documents containing similar terms likely have similar topics).
- Group URLs returned by a search by clustering their web pages.

15 Task 3: Association Rule Discovery
Input: a collection of sets of items (each set is called a transaction).
Output: rules that describe relationships between subsets of items.
Assumption: no prior knowledge of which subsets may appear on either end of a rule.
Example rule: {Diaper} --> {Beer}, read "if a customer buys Diaper, he/she likely buys Beer (with a 2/3 chance)".
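
The "2/3 chance" is the rule's confidence. A minimal sketch in plain Python; the three transactions below are made up so that the numbers come out to 2/3:

```python
# Computing support and confidence of the rule {Diaper} --> {Beer}.
# Hypothetical transactions, chosen so that confidence works out to 2/3.
transactions = [
    {"Milk", "Diaper", "Beer"},
    {"Bread", "Diaper", "Beer"},
    {"Diaper", "Coke"},
]

with_diaper = [t for t in transactions if "Diaper" in t]
with_both = [t for t in with_diaper if "Beer" in t]

support = len(with_both) / len(transactions)    # fraction with Diaper and Beer
confidence = len(with_both) / len(with_diaper)  # P(Beer | Diaper)
print(support, confidence)                      # 0.666..., 0.666...
```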

16 Association Rule Discovery: Applications
- Sales promotion: given {Diaper} --> {Beer}, run a sale on Diaper and raise the price of Beer. Which products would be affected if the store discontinued selling Diaper?
- Keyword and query completion: recommend the rest of a keyword or query to a searcher.
- Web link or page recommendation.
- Item recommendation: a key to Amazon's success.


18 Challenges of Data Mining (3Vs and more)
- (Volume) Scalability: the number of objects, and dimensionality (the number of variables)
- (Variety) Complex and heterogeneous data: text, sets, strings, graphs
- (Velocity) Dynamic data: streaming data, changing rapidly
- Data quality: missing, incomplete, noisy data
- Data privacy: privacy of sensitive information must be preserved
- Meaningfulness/validation of results

19 Meaningfulness of Answers
A big data-mining risk is that you will “discover” patterns that are meaningless. Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

20 Rhine Paradox (1) A parapsychologist in the 1950s hypothesized that some people had Extra-Sensory Perception (ESP). He devised an experiment where subjects were asked to guess 10 hidden cards, each red or blue. He discovered that almost 1 in 1000 subjects had "ESP": they were able to get all 10 right! He told these people they had ESP and called them in for another test of the same type. This time he discovered that almost all of them had lost their ESP. What did he conclude? Answer on the next slide.

21 Rhine Paradox (2) He concluded that you shouldn't tell people they have ESP; it causes them to lose it.
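
The arithmetic behind the paradox, as a quick sketch in plain Python (the subject count is hypothetical; the experiment's actual size isn't given here):

```python
# Why "almost 1 in 1000": guessing 10 red/blue cards correctly by
# pure chance has probability (1/2)**10 = 1/1024.
p_all_correct = 0.5 ** 10
print(p_all_correct)         # ~0.00098, i.e. almost 1 in 1000

# With n subjects and no ESP at all, the expected number of perfect
# scores by chance alone (n = 10_000 is a hypothetical subject count):
n = 10_000
print(n * p_all_correct)     # ~9.8 spurious "psychics"

# On a retest, each of those chance performers again has only a
# 1/1024 probability of a perfect score -- hence the vanishing "ESP".
```

This is Bonferroni's principle from slide 19 in miniature: enough looks at random data will always turn up apparent patterns.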

