Instructor: Jinze Liu. Spring 2014, CS 685: Special Topics in Data Mining.



Welcome!
Instructor: Jinze Liu
Homepage:
Office: 235 Hardymon Building

Overview
Time: TR 2pm-3:15pm
Office hours: by appointment
Credits: 3
Preferred prerequisites: data structures, algorithms, databases, AI, machine learning, statistics.

Overview
Textbook: Mining of Massive Datasets. Can be accessed for free online.
A collection of papers from recent conferences and journals.
Other references:
  Data Mining: Concepts and Techniques, by Han and Kamber, Morgan Kaufmann.
  Introduction to Data Mining, by Tan, Steinbach, and Kumar, Addison Wesley.
  Principles of Data Mining, by Hand, Mannila, and Smyth, MIT Press.
  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Hastie, Tibshirani, and Friedman.

Overview
Grading scheme:
  Homeworks: 30%
  1 exam: 20%
  1 presentation: 20%
  1 project: 30%

Overview
Project: individual project or team project (no more than 2 students).
Options:
  Development of original algorithms.
  Application of existing algorithms to solve a real-world problem.

Examples
MIT big data challenge
Data visualization: SAP Lumira 1
  Dataset examples: http://scn.sap.com/docs/DOC
Digging into reviews? view+Dataset

Overview
Paper presentation: one per student.
A talk on one of the latest research developments in data mining. It should also be related to your project.
Research paper(s): your own pick (upon approval).
Three parts:
  Motivation for the research
  Review of data mining methods
  Discussion
Questions and comments from the audience.
Class participation: one question/comment per student.
Order of presentation: will be arranged according to the topics.

Introduction to Data Mining

Why Mine Data?
Lots of data is being collected and warehoused:
  Public web data
  Social networks
  Reviews
  Purchases at department/grocery stores
  Bank/credit card transactions
  ...
Storage and computing have become cheaper and more powerful.

The Growth of Data

The Big Data Economy
The basis for competitive advantage:
  Customer profiling and targeting, as well as predictive analytics
  Optimized service and maintenance (e.g., GE)
  New products, features, and value-adding services (e.g., LinkedIn, Facebook, and others)

Data Scientist
Data scientist: "the sexiest job of the 21st century" (Harvard Business Review).
The goal: to find insight in data.
Skills needed:
  Writing code
  Being curious about discovery
  A solid foundation in math, statistics, probability, and computer science
  Communication

What is Data Mining?
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

Cultures
Databases: concentrate on large-scale (non-main-memory) data.
AI (machine learning): concentrate on complex methods, small data.
Statistics: concentrate on models.

Models vs. Analytic Processing
To a database person, data mining is an extreme form of analytic processing: queries that examine large amounts of data. The result is the query answer.
To a statistician, data mining is the inference of models. The result is the parameters of the model.

(Way too Simple) Example
Given a billion numbers, a DB person would compute their average and standard deviation.
A statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation of that distribution.
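The contrast between the two views can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic (100,000 random draws rather than a billion numbers), and the function names `summarize` and `fit_gaussian` are hypothetical, not from the course. The numbers come out the same; what differs is whether the result is read as a query answer or as the parameters of a model that can answer further questions.

```python
import math
import random
import statistics


def summarize(numbers):
    """Database-style view: the aggregates themselves are the answer."""
    return statistics.fmean(numbers), statistics.pstdev(numbers)


def fit_gaussian(numbers):
    """Statistics-style view: maximum-likelihood Gaussian parameters."""
    n = len(numbers)
    mu = sum(numbers) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in numbers) / n)
    return statistics.NormalDist(mu, sigma)


# Synthetic stand-in for the "billion numbers" (assumption for illustration).
random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(100_000)]

avg, std = summarize(data)
model = fit_gaussian(data)

print("query answer:     ", avg, std)
print("model parameters: ", model.mean, model.stdev)
# Only the model view supports questions about values not in the data:
print("P(X <= 12) under the fitted model:", model.cdf(12.0))
```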

Examples
Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.
(b) Dividing the customers of a company according to their profitability.
(c) Predicting the future stock price of a company using historical records.

Examples
(a) Dividing the customers of a company according to their gender.
  No. This is a simple database query.
(b) Dividing the customers of a company according to their profitability.
  No. This is an accounting calculation, followed by the application of a threshold. However, predicting the profitability of a new customer would be data mining.
(c) Predicting the future stock price of a company using historical records.
  Yes. We would attempt to create a model that can predict the continuous value of the stock price. This is an example of the area of data mining known as predictive modelling. We could use regression for this modelling, although researchers in many fields have developed a wide variety of techniques for predicting time series.
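As a rough sketch of what "using regression" for case (c) could look like, the following fits an ordinary least-squares line to a short price history and extrapolates one day ahead. The prices are made up for illustration, `fit_line` is a hypothetical helper, and a straight-line fit is far simpler than the time-series techniques the slide alludes to.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up historical records: day index and closing price (illustrative only).
days = [0, 1, 2, 3, 4]
prices = [10.0, 10.4, 11.1, 11.5, 12.0]

slope, intercept = fit_line(days, prices)
next_day_prediction = intercept + slope * 5  # extrapolate to day 5

print("slope:", slope, "intercept:", intercept)
print("predicted price on day 5:", next_day_prediction)
```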

Meaningfulness of Answers
A big data-mining risk is that you will "discover" patterns that are meaningless.
Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

Examples of Bonferroni's Principle
A big objection to TIA (Total Information Awareness) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
The Rhine Paradox: a great example of how not to conduct scientific research.

The "TIA" Story
Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.

The Details
$10^9$ people being tracked for 1000 days.
Each person stays in a hotel 1% of the time (10 days out of 1000).
Hotels hold 100 people (so $10^5$ hotels).
If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?

Calculations (1)
Probability that given persons p and q will be at the same hotel on given day d:
  $1/100 \times 1/100 \times 10^{-5} = 10^{-9}$
  (p at some hotel, q at some hotel, same hotel)
Probability that p and q will be at the same hotel on given days $d_1$ and $d_2$:
  $10^{-9} \times 10^{-9} = 10^{-18}$
Pairs of days: $5 \times 10^{5}$

Calculations (2)
Probability that p and q will be at the same hotel on some two days:
  $5 \times 10^{5} \times 10^{-18} = 5 \times 10^{-13}$
Pairs of people: $5 \times 10^{17}$
Expected number of "suspicious" pairs of people:
  $5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$
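The arithmetic of the hotel example can be checked with a short script. This is a sketch under the slide's own assumptions ($10^9$ people, 1000 days, a 1% chance of being in a hotel on any day, $10^5$ hotels); the constant and function names are hypothetical. It uses exact fractions, and it compares the slide's rounded pair counts with the exact binomial counts.

```python
from fractions import Fraction
from math import comb

PEOPLE = 10**9
DAYS = 1000
P_IN_HOTEL = Fraction(1, 100)  # each person is in some hotel 1% of the time
HOTELS = 10**5

# p in some hotel, q in some hotel, and both in the same one of the hotels.
p_same_hotel_one_day = P_IN_HOTEL * P_IN_HOTEL * Fraction(1, HOTELS)
# Same hotel on two specific days, treating days as independent.
p_same_hotel_two_given_days = p_same_hotel_one_day ** 2


def expected_suspicious_pairs(pairs_of_days, pairs_of_people):
    """Expected number of pairs of people flagged under purely random behavior.

    Multiplying by pairs_of_days is the slide's approximation for
    "on some two days"; it is accurate here because the probability is tiny.
    """
    p_pair = pairs_of_days * p_same_hotel_two_given_days
    return pairs_of_people * p_pair


# The slide's rounded counts of pairs.
slide_estimate = expected_suspicious_pairs(5 * 10**5, 5 * 10**17)
# Exact counts of pairs: C(1000, 2) and C(10^9, 2).
exact_counts = expected_suspicious_pairs(comb(DAYS, 2), comb(PEOPLE, 2))

print("slide estimate:   ", int(slide_estimate))
print("with exact counts:", float(exact_counts))
```

The rounding barely matters: either way, roughly a quarter of a million pairs look "suspicious" even though nobody is plotting anything.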

Conclusion
Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
Analysts have to sift through 250,010 candidates to find the 10 real cases.
Not gonna happen.
But how can we improve the scheme?

Moral
When looking for a property (e.g., "two people stayed at the same hotel twice"), make sure that the property does not allow so many possibilities that random data will surely produce facts "of interest."

Rhine Paradox (1)
Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
He devised (something like) an experiment where subjects were asked to guess 10 hidden cards, each red or blue.
He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

Rhine Paradox (2)
He told these people they had ESP and called them in for another test of the same type.
Alas, he discovered that almost all of them had lost their ESP.
What did he conclude? Answer on next slide.

Rhine Paradox (3)
He concluded that you shouldn't tell people they have ESP; it causes them to lose it.
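The "almost 1 in 1000" figure is what pure guessing predicts: with two equally likely colors, the chance of getting all 10 cards right is $(1/2)^{10} = 1/1024$. The simulation below is a sketch with assumed numbers (100,000 subjects and a fixed random seed, neither of which comes from the slides); `count_perfect_guessers` is a hypothetical helper. It shows that random guessers produce a crop of "ESP" subjects in round one, and that those subjects almost all fail the retest.

```python
import random

CARDS = 10
p_all_right = 0.5 ** CARDS  # chance of guessing all 10 red/blue cards: 1/1024


def count_perfect_guessers(num_subjects, rng):
    """Count subjects who guess every card correctly by pure chance."""
    perfect = 0
    for _ in range(num_subjects):
        # Each guess is right with probability 1/2, independently.
        if all(rng.random() < 0.5 for _ in range(CARDS)):
            perfect += 1
    return perfect


rng = random.Random(42)  # fixed seed so the run is repeatable
subjects = 100_000       # assumed population size, for illustration only

first_round = count_perfect_guessers(subjects, rng)
# Retest only the subjects who "had ESP" in the first round.
second_round = count_perfect_guessers(first_round, rng)

print("chance of a perfect score:", p_all_right)
print("perfect in round 1:", first_round, "of", subjects)
print("still perfect in round 2:", second_round, "of", first_round)
```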

Moral
Understanding Bonferroni's Principle will help you look a little less stupid than a parapsychologist.

Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of other variables.
Description methods: find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Regression [Predictive]
Semi-supervised Learning
  Semi-supervised Clustering
  Semi-supervised Classification

Data Mining Tasks Covered in This Course
Classification [Predictive]
Association Rule Discovery [Descriptive]
Clustering [Descriptive]
Deviation Detection [Predictive]
Semi-supervised Learning
  Semi-supervised Clustering
  Semi-supervised Classification

Survey
Why are you taking this course?
What would you like to gain from this course?
What topics are you most interested in learning about from this course?
Any other suggestions?

KDD References
Data mining and KDD (SIGKDD: CDROM)
  Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  Journals: Data Mining and Knowledge Discovery, KDD Explorations
Database systems (SIGMOD: CD ROM)
  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
AI & Machine Learning
  Conferences: Machine Learning (ICML), AAAI, IJCAI, COLT (Learning Theory), etc.
  Journals: Machine Learning, Artificial Intelligence, etc.

KDD References
Statistics
  Conferences: Joint Stat. Meeting, etc.
  Journals: Annals of Statistics, etc.
Bioinformatics
  Conferences: ISMB, RECOMB, PSB, CSB, BIBE, etc.
  Journals: J. of Computational Biology, Bioinformatics, etc.
Visualization
  Conference proceedings: InfoVis, CHI, ACM-SIGGraph, etc.
  Journals: IEEE Trans. Visualization and Computer Graphics, etc.