1 CS345A: Data Mining on the Web
- Course Introduction
- Issues in Data Mining
- Bonferroni’s Principle

2 Course Staff
- Instructors:
  - Anand Rajaraman
  - Jeff Ullman
- TA:
  - Babak Pahlavan

3 Requirements
- Homework (Gradiance and other): 20%
  - Gradiance class code: B0E9AA66
  - Note URL for class: (not /pearson).
- Project: 40%
- Final Exam: 40%

4 Project
- Software implementation related to course subject matter.
- Should involve an original component or experiment.
- More later about available data and computing resources.

5 Team Projects
- Working in pairs OK, but …
  1. We will expect more from a pair than from an individual.
  2. The effort should be roughly evenly distributed.

6 What is Data Mining?
- Discovery of useful, possibly unexpected, patterns in data.
- Subsidiary issues:
  - Data cleansing: detection of bogus data. E.g., age = 150. Entity resolution.
  - Visualization: something better than megabyte files of output.
  - Warehousing of data (for retrieval).

7 Cultures
- Databases: concentrate on large-scale (non-main-memory) data.
- AI (machine learning): concentrate on complex methods, small data.
- Statistics: concentrate on models.

8 Models vs. Analytic Processing
- To a database person, data mining is an extreme form of analytic processing: queries that examine large amounts of data.
  - Result is the data that answers the query.
- To a statistician, data mining is the inference of models.
  - Result is the parameters of the model.

9 (Way too Simple) Example
- Given a billion numbers, a DB person would compute their average and standard deviation.
- A statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.
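The contrast on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set is synthetic and far smaller than a billion points, "best Gaussian" is read as the maximum-likelihood fit, and the names `data`, `avg`, `sd`, and `model` are made up for the sketch.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

# Synthetic stand-in for the "billion numbers" (assumption: 100,000 Gaussian draws).
random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# Database view: the answer to an aggregate query over the data.
avg = fmean(data)
sd = pstdev(data)  # population standard deviation

# Statistics view: parameters of the maximum-likelihood Gaussian model.
# For a Gaussian, the MLE mean is the sample mean and the MLE variance
# is the mean squared deviation from that mean.
mu = sum(data) / len(data)
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))
model = NormalDist(mu=mu, sigma=sigma)

print(f"query result: avg={avg:.3f}, sd={sd:.3f}")
print(f"fitted model: mean={model.mean:.3f}, stdev={model.stdev:.3f}")
```

The two views produce the same two numbers here, which is why the slide calls the example way too simple. The difference is in what the result is taken to be: an answer to a query, or the parameters of a model.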

10 Web Mining
- Much of the course will be devoted to ways to do data mining on the Web.
  1. Mining to discover things about the Web.
     - E.g., PageRank, finding spam sites.
  2. Mining data from the Web itself.
     - E.g., analysis of click streams, similar products at Amazon.

11 Outline of Course
- PageRank and related measures of importance on the Web (link analysis).
  - Spam detection.
  - Topic-specific search.
- Association rules, frequent itemsets.
- Recommendation systems.
  - E.g., what should Amazon suggest you buy?

12 Outline – (2)
- Minhashing/Locality-Sensitive Hashing.
  - Finding similar Web pages, e.g.
- Extracting structured data (relations) from the Web.
- Clustering data.
- Managing Web advertisements.
- Mining data streams.

13 Relationship to CME 340
- CME340 is taught by Sep Kamvar.
  - Time will be Monday afternoons before CS345A in Rm
- Title is very similar to CS345A, but overlap is actually PageRank and extensions.

14 Regarding CME 340 – (2)
- Styles are very different:
  - CS345A: conventional course.
  - CME340: reading papers + optional project.
- By agreement among the instructors:
  - You can take both, but register for 1 unit of CME340 and do the project for CS345A.

15 Meaningfulness of Answers
- A big risk when data mining is that you will “discover” patterns that are meaningless.
- Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

16 Examples: Bonferroni’s Principle
1. A big objection to TIA was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents’ privacy.
2. The Rhine Paradox: a great example of how not to conduct scientific research.

17 Stanford Professor Proves Tracking Terrorists Is Impossible!
- Two years ago, the example I am about to give you was picked up from my class slides by a reporter from the LA Times.
- Despite my talking to him at length, he was unable to grasp the point that the story was made up to illustrate Bonferroni’s Principle, and was not real.

18 Example: Bonferroni’s Principle
- This example illustrates a problem with intelligence-gathering.
- Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
- We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.

19 The Details
- $10^{9}$ people being tracked.
- 1000 days.
- Each person stays in a hotel 1% of the time (10 days out of 1000).
- Hotels hold 100 people (so $10^{5}$ hotels).
- If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?

20 Calculations – (1)
- Probability that persons $p$ and $q$ will be at the same hotel on day $d$:
  - $\frac{1}{100} \times \frac{1}{100} \times 10^{-5} = 10^{-9}$
- Probability that $p$ and $q$ will be at the same hotel on given days $d_1$ and $d_2$:
  - $10^{-9} \times 10^{-9} = 10^{-18}$
- Pairs of days:
  - $5 \times 10^{5}$

21 Calculations – (2)
- Probability that $p$ and $q$ will be at the same hotel on some two days:
  - $5 \times 10^{5} \times 10^{-18} = 5 \times 10^{-13}$
- Pairs of people:
  - $5 \times 10^{17}$
- Expected number of “suspicious” pairs of people:
  - $5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$
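The arithmetic on the two calculation slides can be checked with a short Python sketch. The inputs are the ones from "The Details" ($10^{9}$ people, 1000 days, $10^{5}$ hotels, a 1% chance of being in a hotel on any day). The function name and parameter names are hypothetical, and the sketch makes the same simplification the slides do: coincidences are treated as rare and independent.

```python
from math import comb


def expected_suspicious_pairs(people=10**9, days=1000, hotels=10**5, p_stay=0.01):
    """Expected number of pairs who share a hotel on two days purely by chance."""
    # Both people are in some hotel that day, and they pick the same one.
    p_same_hotel_one_day = p_stay * p_stay / hotels
    # The same coincidence on two specific days, assumed independent.
    p_same_hotel_two_days = p_same_hotel_one_day ** 2
    # Scale by the number of pairs of days and pairs of people.
    return comb(people, 2) * comb(days, 2) * p_same_hotel_two_days


expected = expected_suspicious_pairs()
print(f"expected suspicious pairs: {expected:,.0f}")
```

With exact pair counts the result comes out slightly below the 250,000 on the slide, because the slide rounds the pair counts up to $5 \times 10^{5}$ and $5 \times 10^{17}$.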

22 Conclusion
- Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
- Analysts have to sift through 250,010 candidates to find the 10 real cases.
  - Not gonna happen.
  - But how can we improve the scheme?

23 Moral
- When looking for a property (e.g., “two people stayed at the same hotel twice”), make sure that there are not so many possibilities that random data will surely produce facts “of interest.”

24 Rhine Paradox – (1)
- Joseph Rhine was a parapsychologist in the 1950’s who hypothesized that some people had Extra-Sensory Perception.
- He devised (something like) an experiment where subjects were asked to guess 10 hidden cards, each red or blue.
- He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

25 Rhine Paradox – (2)
- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- What did he conclude?
  - Answer on next slide.

26 Rhine Paradox – (3)
- He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
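Chance alone accounts for both rounds of the experiment, as the following Python sketch suggests. It assumes every subject guesses each of the 10 cards with a fair coin flip. The number of subjects, the seed, and the helper name `count_perfect_scores` are all made up for illustration.

```python
import random

GUESSES = 10
# Probability of guessing all 10 red/blue cards by luck: 1/1024,
# which is the "almost 1 in 1000" on the first Rhine slide.
p_all_correct = 0.5 ** GUESSES


def count_perfect_scores(n_subjects, rng):
    """Count subjects who get every card right when guessing at random."""
    return sum(
        all(rng.random() < 0.5 for _ in range(GUESSES))
        for _ in range(n_subjects)
    )


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_subjects = 100_000    # hypothetical number of people tested
first_round = count_perfect_scores(n_subjects, rng)
# Only the "ESP" subjects are retested, and they are still just guessing.
second_round = count_perfect_scores(first_round, rng)

print(f"chance of a perfect score: {p_all_correct:.6f}")
print(f"perfect in round 1: {first_round} of {n_subjects}")
print(f"perfect again in round 2: {second_round} of {first_round}")
```

Under these assumptions about 1 subject in 1024 passes the first test, and each of those has the same 1-in-1024 chance of passing again, so almost all of them "lose" their ESP without any need for an explanation.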

27 Moral
- Understanding Bonferroni’s Principle will help you look a little less stupid than a parapsychologist.