CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-1 Large Graph Mining: Power Tools and a Practitioner’s guide Task 8: hadoop and Tera/Peta byte graphs.

Slides:



Advertisements
Similar presentations
MapReduce Simplified Data Processing on Large Clusters
Advertisements

 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.
Overview of MapReduce and Hadoop
LIBRA: Lightweight Data Skew Mitigation in MapReduce
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
CMU SCS : Multimedia Databases and Data Mining Extra: intro to hadoop C. Faloutsos.
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
CS 345A Data Mining MapReduce. Single-node architecture Memory Disk CPU Machine Learning, Statistics “Classical” Data Mining.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
MapReduce Simplified Data Processing On large Clusters Jeffery Dean and Sanjay Ghemawat.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
MapReduce.
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
Jeffrey D. Ullman Stanford University. 2 Chunking Replication Distribution on Racks.
Map Reduce and Hadoop S. Sudarshan, IIT Bombay
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
CMU SCS KDD'09Faloutsos, Miller, Tsourakakis P0-1 Large Graph Mining: Power Tools and a Practitioner’s guide Christos Faloutsos Gary Miller Charalampos.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
MapReduce and Hadoop 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 2: MapReduce and Hadoop Mining Massive.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
CSE 486/586 CSE 486/586 Distributed Systems Data Analytics Steve Ko Computer Sciences and Engineering University at Buffalo.
MapReduce How to painlessly process terabytes of data.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) Introduction to MapReduce and Hadoop Shivnath Babu.
Benchmarking MapReduce-Style Parallel Computing Randal E. Bryant Carnegie Mellon University.
CMU SCS Big (graph) data analytics Christos Faloutsos CMU.
A Map-Reduce System with an Alternate API for Multi-Core Environments Wei Jiang, Vignesh T. Ravi and Gagan Agrawal.
CMU SCS Mining Billion Node Graphs Christos Faloutsos CMU.
SLIDE 1IS 257 – Fall 2014 NewSQL and VoltDB University of California, Berkeley School of Information IS 257: Database Management.
CS 345A Data Mining MapReduce. Single-node architecture Memory Disk CPU Machine Learning, Statistics “Classical” Data Mining.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Hung-chih Yang 1, Ali Dasdan 1 Ruey-Lung Hsiao 2, D. Stott Parker 2
CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P5-1 Large Graph Mining: Power Tools and a Practitioner’s guide Task 5: Graphs over time & tensors Faloutsos,
Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.
SLIDE 1IS 240 – Spring 2013 MapReduce, HBase, and Hive University of California, Berkeley School of Information IS 257: Database Management.
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
MapReduce : Simplified Data Processing on Large Clusters P 謝光昱 P 陳志豪 Operating Systems Design and Implementation 2004 Jeffrey Dean, Sanjay.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
C-Store: MapReduce Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 22, 2009.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: simplified data processing on large clusters Jeffrey Dean and Sanjay Ghemawat.
CMU SCS Mining Large Social Networks: Patterns and Anomalies Christos Faloutsos CMU.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
CMU SCS KDD'09Faloutsos, Miller, Tsourakakis P9-1 Large Graph Mining: Power Tools and a Practitioner’s guide Christos Faloutsos Gary Miller Charalampos.
CMU SCS Panel: Social Networks Christos Faloutsos CMU.
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
By: Joel Dominic and Carroll Wongchote 4/18/2012.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
MapReduce using Hadoop Jan Krüger … in 30 minutes...
Large Graph Mining: Power Tools and a Practitioner’s guide
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Map Reduce.
15-826: Multimedia Databases and Data Mining
Introduction to MapReduce and Hadoop
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
New Ideas Track: Testing MapReduce-Style Programs
MapReduce Simplied Data Processing on Large Clusters
湖南大学-信息科学与工程学院-计算机与科学系
Large Graph Mining: Power Tools and a Practitioner’s guide
Map reduce use case Giuseppe Andronico INFN Sez. CT & Consorzio COMETA
Christos Faloutsos CMU
CS 345A Data Mining MapReduce This presentation has been altered.
Large Graph Mining: Power Tools and a Practitioner’s guide
Presentation transcript:

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-1 Large Graph Mining: Power Tools and a Practitioner’s guide Task 8: hadoop and Tera/Peta byte graphs Faloutsos, Miller, Tsourakakis CMU

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-2 Outline Introduction – Motivation Task 1: Node importance Task 2: Community detection Task 3: Recommendations Task 4: Connection sub-graphs Task 5: Mining graphs over time Task 6: Virus/influence propagation Task 7: Spectral graph theory Task 8: Tera/peta graph mining: hadoop Observations – patterns of real graphs Conclusions

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-3 Scalability How about if graph/tensor does not fit in core? How about handling huge graphs?

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-4 Scalability How about if graph/tensor does not fit in core? [‘MET’: Kolda, Sun, ICMD’08, best paper award] How about handling huge graphs?

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-5 Scalability Google: > 450,000 processors in clusters of ~2000 processors each [ Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003 ] Yahoo: 5Pb of data [Fayyad, KDD’07] Problem: machine failures, on a daily basis How to parallelize data mining tasks, then?

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-6 Scalability Google: > 450,000 processors in clusters of ~2000 processors each [ Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003 ] Yahoo: 5Pb of data [Fayyad, KDD’07] Problem: machine failures, on a daily basis How to parallelize data mining tasks, then? A: map/reduce – hadoop (open-source clone)

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-7 2’ intro to hadoop master-slave architecture; n-way replication (default n=3) ‘group by’ of SQL (in parallel, fault-tolerant way) e.g, find histogram of word frequency –compute local histograms –then merge into global histogram select course-id, count(*) from ENROLLMENT group by course-id

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-8 2’ intro to hadoop master-slave architecture; n-way replication (default n=3) ‘group by’ of SQL (in parallel, fault-tolerant way) e.g, find histogram of word frequency –compute local histograms –then merge into global histogram select course-id, count(*) from ENROLLMENT group by course-id map reduce

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-9 User Program Reducer Master Mapper fork assign map assign reduce read local write remote read, sort Output File 0 Output File 1 write Split 0 Split 1 Split 2 Input Data (on HDFS) By default: 3-way replication; Late/dead machines: ignored, transparently (!)

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-10 D.I.S.C. ‘Data Intensive Scientific Computing’ [R. Bryant, CMU] –‘big data’ – 128.pdf

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-11 ~200Gb (Yahoo crawl) - Degree Distribution: in 12 minutes with 50 machines Many link spams at out-degree 1200 Analysis of a large graph

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-12 Conclusions Hadoop: promising architecture for Tera/Peta scale graph mining Resources: Higher-level language for data processing

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-13 References Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04Jeffrey DeanSanjay Ghemawat Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig latin: a not-so-foreign language for data processing. SIGMOD 2008: Benjamin ReedUtkarsh SrivastavaRavi KumarAndrew TomkinsSIGMOD 2008