CS345 --- Data Mining: Introductions, What Is It?, Cultures of Data Mining

Presentation transcript:

1 CS345 --- Data Mining
- Introductions
- What Is It?
- Cultures of Data Mining

2 Course Staff
- Instructors:
  - Anand Rajaraman
  - Jeff Ullman
- TA:
  - Jeff Klingner

3 Requirements
- Homework (Gradiance and other): 20%
  - Gradiance class code DD
- Project: 40%
- Final Exam: 40%

4 Project
- Software implementation related to course subject matter.
- Should involve an original component or experiment.
- More later about available data and computing resources.

5 Team Projects
- Working in pairs OK, but …
  1. We will expect more from a pair than from an individual.
  2. The effort should be roughly evenly distributed.

6 What is Data Mining?
- Discovery of useful, possibly unexpected, patterns in data.
- Subsidiary issues:
  - Data cleansing: detection of bogus data (e.g., age = 150) and entity resolution.
  - Visualization: something better than megabyte files of output.
  - Warehousing of data (for retrieval).

7 Typical Kinds of Patterns
1. Decision trees: succinct ways to classify by testing properties.
2. Clusters: another succinct classification by similarity of properties.
3. Bayes models, hidden-Markov models, frequent itemsets: expose important associations within data.

8 Example: Clusters
[Figure: scatter plot of points, each drawn as an "x", falling into visible clusters.]

9 Example: Frequent Itemsets
- A common marketing problem: examine what people buy together to discover patterns.
  1. What pairs of items are unusually often found together at Safeway checkout? Answer: diapers and beer.
  2. What books are likely to be bought by the same Amazon customer?
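The pair-counting question on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's algorithm: the baskets are invented toy data, and `frequent_pairs` and `min_support` are hypothetical names introduced here. It counts in how many baskets each pair of items appears together and keeps the pairs that reach a threshold.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets, invented purely for illustration.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
    {"bread", "chips"},
]


def frequent_pairs(baskets, min_support):
    """Return the item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair a single canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


result = frequent_pairs(baskets, min_support=3)
print(result)
```

Counting every pair in memory works only for small data; with many distinct items the number of candidate pairs grows quadratically, which is why large-scale methods prune candidates first.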

10 Applications (Among Many)
- Intelligence-gathering.
  - Tracking terrorists, e.g.
- Web Analysis.
  - PageRank, spam detection.
- Marketing.
  - Run a sale on diapers; raise the price of beer.

11 Cultures
- Databases: concentrate on large-scale (non-main-memory) data.
- AI (machine-learning): concentrate on complex methods, small data.
- Statistics: concentrate on models.

12 Models vs. Analytic Processing
- To a database person, data-mining is an extreme form of analytic processing --- queries that examine large amounts of data.
  - Result is the data that answers the query.
- To a statistician, data-mining is the inference of models.
  - Result is the parameters of the model.

13 (Way too Simple) Example
- Given a billion numbers, a DB person would compute their average.
- A statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.
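The contrast on this slide can be made concrete with a small Python sketch. The assumptions are ours: a synthetic sample of 100,000 numbers stands in for the billion, and "best Gaussian" is taken to mean the maximum-likelihood fit, whose parameters are the sample mean and the population standard deviation.

```python
import random
import statistics

random.seed(0)
# Stand-in for the "billion numbers": a much smaller synthetic sample (assumption).
data = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# Database view: the result is the data that answers the query, here one value.
average = sum(data) / len(data)

# Statistics view: the result is the parameters of a fitted model.
# For a Gaussian, the maximum-likelihood estimates are the sample mean
# and the population (divide-by-n) standard deviation.
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)

print(f"query answer: average = {average:.3f}")
print(f"model parameters: mu = {mu:.3f}, sigma = {sigma:.3f}")
```

The two views agree on the mean; the difference is that the fitted model also makes a claim about how the data is distributed, which can then be checked or used for prediction.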

14 Meaningfulness of Answers
- A big risk when data mining is that you will “discover” patterns that are meaningless.
- Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

15 Examples
- A big objection to TIA (Total Information Awareness) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents’ privacy.
- The Rhine Paradox: a great example of how not to conduct scientific research.

16 Story Behind the Story
- I gave these two examples last year.
- The “hotels” example got picked up by a newspaper reporter who spun it as
  - STANFORD PROFESSOR PROVES TRACKING TERRORISTS IS IMPOSSIBLE
- I was also corrected in the story about Joseph Rhine (whom I called David).

17 Rhine Paradox --- (1)
- Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
- He devised (something like) an experiment where subjects were asked to guess 10 hidden cards --- red or blue.
- He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!

18 Rhine Paradox --- (2)
- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- What did he conclude?
  - Answer on next slide.

19 Rhine Paradox --- (3)
- He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
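Pure chance explains both rounds of the experiment: guessing 10 red-or-blue cards correctly has probability (1/2)^10 = 1/1024, which is the "almost 1 in 1000" on the earlier slide. The simulation below is a rough sketch under our own assumptions (100,000 subjects, fair independent guesses, a hypothetical helper `count_perfect_guessers`); it is not a reconstruction of Rhine's actual protocol.

```python
import random


def count_perfect_guessers(num_subjects, num_cards=10, rng=None):
    """Count subjects who guess every card right when each guess is a coin flip."""
    if rng is None:
        rng = random.Random()
    perfect = 0
    for _ in range(num_subjects):
        # Each guess is correct with probability 1/2, independently of the others.
        if all(rng.random() < 0.5 for _ in range(num_cards)):
            perfect += 1
    return perfect


p_perfect = 0.5 ** 10  # chance of 10 correct guesses out of 10: 1/1024

rng = random.Random(42)
first_round = count_perfect_guessers(100_000, rng=rng)
# Retest only the apparent "ESP" subjects; each again succeeds with chance 1/1024.
second_round = count_perfect_guessers(first_round, rng=rng)

print(f"P(all 10 right by luck) = {p_perfect:.6f}")
print(f"perfect scores in round 1: {first_round} of 100000")
print(f"still perfect in round 2: {second_round} of {first_round}")
```

With about a hundred lucky subjects in the first round, the expected number who repeat the feat is roughly 0.1, so "losing" the ability on retest is exactly what random guessing predicts.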

20 Example: Bonferroni’s Principle
- This example illustrates a problem with intelligence-gathering.
- Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
- We want to find people who at least twice have stayed at the same hotel on the same day.

21 The Details
- $10^{9}$ people being tracked.
- 1000 days.
- Each person stays in a hotel 1% of the time (10 days out of 1000).
- Hotels hold 100 people (so $10^{5}$ hotels).
- If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?

22 Calculations --- (1)
- Probability that persons p and q will be at the same hotel on day d:
  - $1/100 \times 1/100 \times 10^{-5} = 10^{-9}$
- Probability that p and q will be at the same hotel on two given days:
  - $10^{-9} \times 10^{-9} = 10^{-18}$
- Pairs of days:
  - $5 \times 10^{5}$

23 Calculations --- (2)
- Probability that p and q will be at the same hotel on some two days:
  - $5 \times 10^{5} \times 10^{-18} = 5 \times 10^{-13}$
- Pairs of people:
  - $5 \times 10^{17}$
- Expected number of suspicious pairs of people:
  - $5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$
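The arithmetic on these two slides can be checked with a short Python script. The variable names are ours, and the script keeps the slides' simplifications: days are treated as independent, and the small chance of a pair meeting more than twice is ignored. It computes the estimate twice, once with exact pair counts and once with the rounded figures used on the slides.

```python
from math import comb, isclose

people = 10**9      # people being tracked
days = 1000         # days of observation
hotels = 10**5      # number of hotels
p_in_hotel = 0.01   # chance a given person is in some hotel on a given day

# Both in a hotel that day, and both in the same one of the 10^5 hotels.
p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels      # 1e-9
# The same coincidence on two specific days.
p_same_hotel_two_given_days = p_same_hotel_one_day ** 2      # 1e-18

pairs_of_days = comb(days, 2)      # 499,500, rounded to 5e5 on the slide
pairs_of_people = comb(people, 2)  # about 5e17

expected_exact = pairs_of_people * pairs_of_days * p_same_hotel_two_given_days
expected_rounded = 5e17 * 5e5 * 1e-18

print(f"exact pair counts:   {expected_exact:,.0f} suspicious pairs")
print(f"slide approximation: {expected_rounded:,.0f} suspicious pairs")
```

The exact pair counts give a value just under the slide's 250,000; the difference comes only from rounding 499,500 pairs of days up to 5 × 10^5.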

24 Conclusion
- Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
- Analysts have to sift through 250,010 candidates to find the 10 real cases.
  - Not gonna happen.
  - But how can we improve the scheme?
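To put the sifting problem in numbers, here is the fraction of flagged pairs that are real, using only the figures on this slide:

```latex
\[
\frac{\text{real cases}}{\text{candidates}}
  = \frac{10}{250{,}010}
  \approx 4 \times 10^{-5}
\]
```

That is roughly one genuine pair for every 25,000 pairs an analyst would have to examine.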

25 Moral
- When looking for a property (e.g., “two people stayed at the same hotel twice”), make sure that the property does not allow so many possibilities that random data will surely produce facts “of interest.”