Introduction 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 1: Introduction Mining Massive Datasets.

Presentation transcript:

Introduction 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 1: Introduction Mining Massive Datasets

Introduction 2 Outline  Data intensive scalable computing (DISC)  Data mining 2

Introduction 3 Examples of Massive Data Sources  Wal-Mart  267 million items/day, sold at 6,000 stores  HP building them 4PB data warehouse  Mine data to manage supply chain, understand market trends, formulate pricing strategies  Sloan Digital Sky Survey  New Mexico telescope captures 200 GB image data / day  Latest dataset release: 10 TB, 287 million celestial objects  SkyServer provides SQL access DISC

Introduction 4 Our Data-Driven World  Science  Data bases from astronomy, genomics, natural languages, seismic modeling, …  Humanities  Scanned books, historic documents, …  Commerce  Corporate sales, stock market transactions, census, airline traffic, …  Entertainment  Internet images, Hollywood movies, MP3 files, …  Medicine  MRI & CT scans, patient records, … DISC

Introduction 5 Why So Much Data?  We Can Get It  Automation + Internet  We Can Keep It  1 TB for $159 (16¢ / GB)  We Can Use It  Scientific breakthroughs  Business process efficiencies  Realistic special effects  Better health care  Could We Do More?  Apply more computing power to this data DISC

Introduction 6 Google’s Computing Infrastructure  200+ processors  200+ terabyte database  total clock cycles  0.1 second response time  5¢ average advertising revenue DISC

Introduction 7 Google’s Computing Infrastructure  System  ~3 million processors in clusters of ~2000 processors each  Commodity parts  x86 processors, IDE disks, Ethernet communications  Gain reliability through redundancy & software management  Partitioned workload  Data: Web pages, indices distributed across processors  Function: crawling, index generation, index search, document retrieval, ad placement  A Data-Intensive Scalable Computer (DISC)  Large-scale computer centered around data  Collecting, maintaining, indexing, computing  Similar systems at Microsoft & Yahoo Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture,” IEEE Micro 2003 DISC

Introduction 8 DISC: Beyond Web Search  Data-Intensive Application Domains  Rely on large, ever-changing data sets  Collecting & maintaining data is major effort  Many possibilities  Computational Requirements  From simple queries to large-scale analyses  Require parallel processing  Want to program at abstract level  Hypothesis  Can apply DISC to many other application domains DISC

Introduction 9 Data-Intensive System Challenge  For Computation That Accesses 1 TB in 5 minutes  Data distributed over 100+ disks  Assuming uniform data partitioning  Compute using 100+ processors  Connected by gigabit Ethernet (or equivalent)  System Requirements  Lots of disks  Lots of processors  Located in close proximity  Within reach of fast, local-area network DISC
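
The disk and processor counts on this slide follow from simple bandwidth arithmetic. The sketch below works it through, assuming decimal units (1 TB = 10^12 bytes) and exactly 100 disks, the lower bound the slide gives; the function name is illustrative only.

```python
# Back-of-the-envelope check of the slide's numbers (decimal units assumed).
TOTAL_BYTES = 10**12      # 1 TB
WINDOW_SECONDS = 5 * 60   # 5 minutes
NUM_DISKS = 100           # lower bound from the slide ("100+ disks")


def required_bandwidth(total_bytes, seconds, num_disks):
    """Return (aggregate, per-disk) bandwidth in MB/s under uniform partitioning."""
    aggregate_mb_s = total_bytes / seconds / 10**6
    return aggregate_mb_s, aggregate_mb_s / num_disks


aggregate, per_disk = required_bandwidth(TOTAL_BYTES, WINDOW_SECONDS, NUM_DISKS)
print(f"aggregate: {aggregate:.0f} MB/s, per disk: {per_disk:.1f} MB/s")
```

Under these assumptions the system must sustain roughly 3.3 GB/s in aggregate, or about 33 MB/s per disk. That per-disk rate fits within the 125 MB/s theoretical peak of a gigabit Ethernet link, which is consistent with the slide's choice of interconnect.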

Introduction 10 Desiderata for DISC Systems  Focus on Data  Terabytes, not tera-FLOPS  Problem-Centric Programming  Platform-independent expression of data parallelism  Interactive Access  From simple queries to massive computations  Robust Fault Tolerance  Component failures are handled as routine events  Contrast to existing supercomputer / HPC systems DISC

Introduction 11 Topics of DISC  Architecture  Cloud computing  Operating Systems  Hadoop  Apsara (飞天) by Aliyun  Programming Models  MapReduce  Data Analysis (Data Mining) DISC

Introduction 12 What is Data Mining?  Non-trivial discovery of implicit, previously unknown, and useful knowledge from massive data. Data Mining

Introduction 13 Cultures  Databases:  concentrate on large-scale (non-main-memory) data.  AI (machine-learning):  concentrate on complex methods, small data.  Statistics:  concentrate on models. [Figure labels: Databases, Statistics, AI/Machine Learning, Data Mining] Data Mining

Introduction 14 Models vs. Analytic Processing  To a database person, data-mining is an extreme form of analytic processing – queries that examine large amounts of data.  Result is the query answer.  To a statistician, data-mining is the inference of models.  Result is the parameters of the model. Data Mining

Introduction 15 (Way too Simple) Example  Given a billion numbers, a DB person would compute their average and standard deviation.  A statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation of that distribution. Data Mining
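
As a rough illustration of the database-style computation, the sketch below computes the mean and standard deviation in a single pass with constant memory, which matters once the data no longer fits in main memory. It uses Welford's update, a technique chosen here for illustration and not named in the slides; the function name and the synthetic sample are assumptions.

```python
import math
import random


def running_mean_std(values):
    """One pass, O(1) memory: Welford's update for mean and population std dev."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # accumulates the sum of squared deviations
    if count == 0:
        raise ValueError("need at least one value")
    return mean, math.sqrt(m2 / count)


# Synthetic stand-in for the "billion numbers": a generator, never stored in full.
random.seed(0)
sample = (random.gauss(10.0, 2.0) for _ in range(100_000))
mean, std = running_mean_std(sample)
print(f"mean={mean:.2f}, std={std:.2f}")
```

The maximum-likelihood Gaussian fit has exactly this sample mean and this population standard deviation (dividing by n) as its parameters, so in this deliberately simple case both cultures report the same two numbers and differ only in how they interpret them.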

Introduction 16 Data Mining Tasks  Association rule discovery  Classification  Clustering  Recommendation systems  Collaborative filtering  Link analysis and graph mining  Managing Web advertisements  … … Data Mining

Introduction 17 Association Rule Discovery Data Mining

Introduction 18 Classification Government Science Arts Data Mining

Introduction 19 Clustering Data Mining

Introduction 20 Recommender Systems  Netflix  Movie recommendation  Amazon  Book recommendation Data Mining

Introduction 21 Link Analysis and Graph mining  PageRank  Link prediction  Community detection Data Mining

Introduction 22 Meaningfulness of Answers  A big data-mining risk is that you will “discover” patterns that are meaningless.  Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap. Data Mining

Introduction 23 Examples of Bonferroni’s Principle 1. A big objection to Total Information Awareness (TIA) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents’ privacy. 2. The Rhine Paradox: a great example of how not to conduct scientific research. Data Mining

Introduction 24 The “TIA” Story  Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.  We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day. Data Mining

Introduction 25 The “TIA” Story  10^9 people being tracked.  1000 days.  Each person stays in a hotel 1% of the time (10 days out of 1000).  Hotels hold 100 people (so 10^5 hotels).  If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious? Data Mining

Introduction 26 The “TIA” Story  Probability that p and q will be at the same hotel on one specific day:  (1/100) × (1/100) × (1/10^5) = 10^{-9}  Probability that p and q will be at the same hotel on some two days:  5 × 10^5 × (10^{-9} × 10^{-9}) = 5 × 10^{-13}  (Pairs of days is 5 × 10^5)  Pairs of people:  5 × 10^{17}  Expected number of “suspicious” pairs of people:  5 × 10^{17} × 5 × 10^{-13} = 250,000. Data Mining
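
The arithmetic on this slide can be re-derived mechanically. The sketch below is one way to do it, using exact rational arithmetic so no rounding creeps in; the constant and variable names are illustrative, and the model is the slide's own (everyone behaves independently and at random).

```python
from fractions import Fraction
from math import comb

PEOPLE = 10**9                 # people being tracked
DAYS = 1000                    # days observed
P_IN_HOTEL = Fraction(1, 100)  # chance a given person is in a hotel on a given day
HOTELS = 10**5                 # hotels to choose from

# Both p and q are in a hotel that day, and q picks the same hotel as p.
p_same_hotel_one_day = P_IN_HOTEL * P_IN_HOTEL * Fraction(1, HOTELS)

# The same coincidence happening on two specific, different days.
p_two_given_days = p_same_hotel_one_day ** 2

# The slide's rounded counts: 5 x 10^17 pairs of people, 5 x 10^5 pairs of days.
rounded_estimate = Fraction(5 * 10**17) * (5 * 10**5) * p_two_given_days

# The same estimate with exact pair counts, n * (n - 1) / 2.
exact_estimate = comb(PEOPLE, 2) * comb(DAYS, 2) * p_two_given_days

print(float(rounded_estimate), float(exact_estimate))
```

With the rounded pair counts the expected number of suspicious pairs is exactly 250,000; with exact pair counts it is about 249,750. Either way, purely random behavior produces a quarter of a million false leads.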

Introduction 27 Conclusion  Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.  Analysts have to sift through 250,010 candidates to find the 10 real cases.  Not gonna happen.  But how can we improve the scheme? Data Mining

Introduction 28 Moral  When looking for a property (e.g., “two people stayed at the same hotel twice”), make sure that the property does not allow so many possibilities that random data will surely produce facts “of interest.” Data Mining

Introduction 29 Rhine Paradox – (1)  Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception (ESP).  He devised (something like) an experiment where subjects were asked to guess 10 hidden cards – red or blue.  He discovered that almost 1 in 1000 had ESP – they were able to get all 10 right! Data Mining

Introduction 30 Rhine Paradox – (2)  He told these people they had ESP and called them in for another test of the same type.  Alas, he discovered that almost all of them had lost their ESP.  What did he conclude?  Answer on next slide. Data Mining

Introduction 31 Rhine Paradox – (3)  He concluded that you shouldn’t tell people they have ESP; it causes them to lose it. Data Mining

Introduction 32 Moral  Understanding Bonferroni’s Principle will help you look a little less stupid than a parapsychologist. Data Mining
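
The Rhine result needs no ESP to explain it, only chance. The sketch below computes the probability of guessing all 10 two-colour cards correctly and the expected number of "gifted" subjects in a pool; the pool size of 1000 is a hypothetical figure chosen for illustration, since the slides do not say how many people were tested.

```python
from fractions import Fraction


def p_all_correct(num_cards, num_colors=2):
    """Probability of guessing every card right by pure chance."""
    return Fraction(1, num_colors) ** num_cards


def expected_lucky(num_subjects, num_cards):
    """Expected number of subjects with a perfect score, by linearity of expectation."""
    return num_subjects * p_all_correct(num_cards)


p_first_test = p_all_correct(10)
print(f"chance of a perfect score: {p_first_test}")
print(f"expected perfect scorers among 1000 subjects: {float(expected_lucky(1000, 10)):.2f}")
print(f"chance of a perfect score on the retest as well: {float(p_first_test):.4f}")
```

A perfect score has probability 1/1024, which matches the slide's "almost 1 in 1000". Each apparent ESP holder then has the same 1/1024 chance on the retest, so nearly all of them "losing" the ability is exactly what random guessing predicts.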

Introduction 33 Applications  Banking: loan/credit card approval  Predict good customers based on old customers  Customer relationship management  Identify those who are likely to leave for a competitor  Targeted marketing  Identify likely responders to promotions  Fraud detection:  From an online stream of events, identify the fraudulent ones  Manufacturing and production  Automatically adjust knobs when process parameters change Data Mining

Introduction 34 Applications (continued)  Medicine: disease outcome, effectiveness of treatments  Analyze patient disease history: find relationships between diseases  Scientific data analysis  Gene analysis  Web site/store design and promotion  Find affinity of visitors to pages and modify layout Data Mining

Introduction 35 Questions?

Introduction 36 Acknowledgement  Some slides are from:  Prof. Jeffrey D. Ullman  Dr. Jure Leskovec  Prof. Randal E. Bryant