1 CS345A: Data Mining on the Web Course Introduction Issues in Data Mining Bonferroni’s Principle.

Slides:



Advertisements
Similar presentations
Testing Relational Database
Advertisements

The Subject-Matter of Ethics
Copyright 2003, Christine L. Abela, M.Ed. I’m failing… help! Straight facts to help you try to rebound!
1 CPS : Information Management and Mining Association Rules and Frequent Itemsets.
1 Data Mining Introductions What Is It? Cultures of Data Mining.
Advanced Data Mining: Introduction
Mining of Massive Datasets: Course Introduction
G. Alonso, D. Kossmann Systems Group
THERE IS NO GENERAL METHOD OR FORMULA WHICH IS ‘CORRECT’. YOU CAN PROBABLY IGNORE SOME OF THIS ADVICE AND STILL WRITE A GOOD ESSAY… BUT FOLLOWING IT MAY.
Sampling Distributions
Lecture 2 Page 1 CS 236, Spring 2008 Security Principles and Policies CS 236 On-Line MS Program Networks and Systems Security Peter Reiher Spring, 2008.
CS 197 Computers in Society Fall, Welcome, Freshmen!
1 Introduction to Communications Professor R. C. T. Lee Dept. of Information Management Dept. of Computer Science Department of Communications Department.
1 CS Data Mining Introductions What Is It? Cultures of Data Mining.
Lecture 2: Data Mining.
1 CS Data Mining Introductions What Is It? Cultures of Data Mining.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
1 Mining Data Streams The Stream Model Sliding Windows Counting 1’s.
1 Still More Stream-Mining Frequent Itemsets Elephants and Troops Exponentially Decaying Windows.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.
Introduction 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 1: Introduction Mining Massive Datasets.
READING FOR HIGH SCHOOL AND BEYOND. WHAT I DO NOW: *BEFORE I READ *WHEN I READ *AFTER I FINISHBEFORE I READWHEN I READAFTER I FINISH Reading Actively.
Data Mining and Application Part 1: Data Mining Fundamentals Part 2: Tools for Knowledge Discovery Part 3: Advanced Data Mining Techniques Part 4: Intelligent.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 2.
Two Way ANOVA ©2005 Dr. B. C. Paul. ANOVA Application ANOVA allows us to review data and determine whether a particular effect is changing our results.
Length- The length for this genre depends on the author’s preference. The topic of the story impacts how long it will be. A story that has a lot of.
1 CS345A: Data Mining on the Web Course Introduction Issues in Data Mining Bonferroni’s Principle.
Instructor: Jinze Liu Spring 2014 CS 685 Special Topics in Data mining.
ITGS Databases.
Research Paper Outline 1. Parameters are LARGE SECTIONS of your paper that contain like information to prove your thesis. Don’t be confused – parameters.
March 23 & 28, Csci 2111: Data and File Structures Week 10, Lectures 1 & 2 Hashing.
Web-Mining …searching for the knowledge on the Internet… Marko Grobelnik Institut Jožef Stefan.
What is Physics? BBranch of science concerning with the interaction of matter, space, time and energy ○d○definition provided by 
1 CPS216: Advanced Database Systems Data Mining Slides created by Jeffrey Ullman, Stanford.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Security in Outsourced Association Rule Mining. Agenda  Introduction  Approximate randomized technique  Encryption  Summary and future work.
1 Welcome to CptS 317 Background Course Outline Textbook Syllabus (see class web site to important information on disabilities, cheating and safety) Grades.
Internet Literacy Evaluating Web Sites. Objective The Student will be able to evaluate internet web sites for accuracy and reliability The Student will.
SAMPLING DISTRIBUTION OF MEANS & PROPORTIONS. SAMPLING AND SAMPLING VARIATION Sample Knowledge of students No. of red blood cells in a person Length of.
Mining of Massive Datasets Edited based on Leskovec’s from
Introduction to Arrays. Learning Objectives By the end of this lecture, you should be able to: – Understand what an array is – Know how to create an array.
Chance We will base on the frequency theory to study chances (or probability).
Hypothesis Tests for 1-Proportion Presentation 9.
Big Data Javad Azimi May First of All… Sorry about the language  Feel free to ask any question Please share similar experiences.
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
Book web site:
Assessment.
What is the Scientific Method?
A guide to scoring well on Free Response Questions
Assessment.
The Stream Model Sliding Windows Counting 1’s
Introduction to Data Mining- CMPT 741 Instructor: Ke Wang
PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Jeffrey D. Ullman Stanford University.
Introduction C.Eng 714 Spring 2010.
Exam Review Session William Cohen.
CPS216: Advanced Database Systems Data Mining
Data Mining Modified from
Sequence comparison: Multiple testing correction
HITS Hypertext Induced Topic Selection
Why Academic Integrity Matters
Internet Literacy Evaluating Web Sites.
Finding Trends with Visualizations
HITS Hypertext Induced Topic Selection
CS5112: Algorithms and Data Structures for Applications
Graph and Link Mining.
Intro to Epidemiology - Investigation 2-6: The Journey
Why Academic Integrity Matters
Research Paper Step-by-step Process.
CAP6778: Advanced Data Mining Fall 2010 Dr
Presentation transcript:

1 CS345A: Data Mining on the Web Course Introduction Issues in Data Mining Bonferroni’s Principle

2 Course Staff uInstructors: wAnand Rajaraman wJeff Ullman uReach us as lists.stanford.edu. uMore info on

3 Requirements uHomework (Gradiance and other) 20% wGo to wEnter class code 83769DC9. wIf you took CS145 or CS245 in the past year, you should have free access; otherwise you will have to purchase access from Pearson Ed. uProject 40% uFinal Exam 40%

4 Project uSoftware implementation related to course subject matter. uShould involve an original component or experiment. uMore later about available data and computing resources.

5 Possible Projects uMany past projects have dealt with collaborative filtering (advice based on what similar people do). wE.g., Netflix Challenge. uOthers have dealt with engineering solutions to “machine-learning” problems.

6 ML-Replacement Projects uML generally requires a large “training set” of correctly classified data. wExample: classifying Web pages by topic. uHard to find well-classified data. wException: Open Directory works for page topics, because work is collaborative and shared by many. wOther good exceptions?

7 ML-Replacement – (2) uMany problems require thought rather than ML: 1.Tell important pages from unimportant (PageRank). 2.Tell real news from publicity (how?). 3.Distinguish positive from negative product reviews (how?). 4.Etc., etc.

8 Team Projects uWorking in pairs OK, but … 1.No more than two per project. 2.We will expect more from a pair than from an individual. 3.The effort should be roughly evenly distributed.

9 What is Data Mining? uDiscovery of useful, possibly unexpected, patterns in data. uSubsidiary issues: wData cleaning: detection of bogus data. E.g., age = 150. Entity resolution. wVisualization: something better than megabyte files of output.

10 Cultures uDatabases: concentrate on large-scale (non-main-memory) data. uAI (machine-learning): concentrate on complex methods, small data. uStatistics: concentrate on models.

11 Models vs. Analytic Processing uTo a database person, data-mining is an extreme form of analytic processing – queries that examine large amounts of data. wResult is the query answer. uTo a statistician, data-mining is the inference of models. wResult is the parameters of the model.

12 (Way too Simple) Example uGiven a billion numbers, a DB person would compute their average and standard deviation. uA statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation of that distribution.

13 Outline of Course uMap-Reduce and Hadoop. uAssociation rules, frequent itemsets. uPageRank and related measures of importance on the Web (link analysis ). wSpam detection. wTopic-specific search. uRecommendation systems. wCollaborative filtering.

14 Outline – (2) uFinding similar sets. wMinhashing, Locality-Sensitive hashing. uExtracting structured data (relations) from the Web. uClustering data. uManaging Web advertisements. uMining data streams.

15 Meaningfulness of Answers uA big data-mining risk is that you will “discover” patterns that are meaningless. uStatisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

16 Examples of Bonferroni’s Principle 1.A big objection to TIA was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents’ privacy. 2.The Rhine Paradox: a great example of how not to conduct scientific research.

17 Stanford Professor Proves Tracking Terrorists Is Impossible! uThree years ago, the example I am about to give you was picked up from my class slides by a reporter from the LA Times. uDespite my talking to him at length, he was unable to grasp the point that the story was made up to illustrate Bonferroni’s Principle, and was not real.

18 The “TIA” Story uSuppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil. uWe want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.

19 The Details u10 9 people being tracked. u1000 days. uEach person stays in a hotel 1% of the time (10 days out of 1000). uHotels hold 100 people (so 10 5 hotels). uIf everyone behaves randomly (I.e., no evil-doers) will the data mining detect anything suspicious?

20 Calculations – (1) uProbability that given persons p and q will be at the same hotel on given day d : w1/100  1/100  = uProbability that p and q will be at the same hotel on given days d 1 and d 2 : w10 -9  = uPairs of days: w5  p at some hotel q at some hotel Same hotel

21 Calculations – (2) uProbability that p and q will be at the same hotel on some two days: w5  10 5  = 5  uPairs of people: w5  uExpected number of “suspicious” pairs of people: w5   5  = 250,000.

22 Conclusion uSuppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice. uAnalysts have to sift through 250,010 candidates to find the 10 real cases. wNot gonna happen. wBut how can we improve the scheme?

23 Moral uWhen looking for a property (e.g., “two people stayed at the same hotel twice”), make sure that the property does not allow so many possibilities that random data will surely produce facts “of interest.”

24 Rhine Paradox – (1) uJoseph Rhine was a parapsychologist in the 1950’s who hypothesized that some people had Extra-Sensory Perception. uHe devised (something like) an experiment where subjects were asked to guess 10 hidden cards – red or blue. uHe discovered that almost 1 in 1000 had ESP – they were able to get all 10 right!

25 Rhine Paradox – (2) uHe told these people they had ESP and called them in for another test of the same type. uAlas, he discovered that almost all of them had lost their ESP. uWhat did he conclude? wAnswer on next slide.

26 Rhine Paradox – (3) uHe concluded that you shouldn’t tell people they have ESP; it causes them to lose it.

27 Moral uUnderstanding Bonferroni’s Principle will help you look a little less stupid than a parapsychologist.