Introducing MapReduce to High End Computing Grant Mackey, Julio Lopez, Saba Sehrish, John Bent, Salman Habib, Jun Wang University of Central Florida, Carnegie.

Slides:



Advertisements
Similar presentations
 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.
Advertisements

Digital Library Service – An overview Introduction System Architecture Components and their functionalities Experimental Results.
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
UNCLASSIFIED: LA-UR Data Infrastructure for Massive Scientific Visualization and Analysis James Ahrens & Christopher Mitchell Los Alamos National.
Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.
 Need for a new processing platform (BigData)  Origin of Hadoop  What is Hadoop & what it is not ?  Hadoop architecture  Hadoop components (Common/HDFS/MapReduce)
Bin Fu Eugene Fink, Julio López, Garth Gibson Carnegie Mellon University Astronomy application of Map-Reduce: Friends-of-Friends algorithm A distributed.
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
ETM Hadoop. ETM IDC estimate put the size of the “digital universe” at zettabytes in forecasting a tenfold growth by 2011 to.
Distributed MapReduce Team B Presented by: Christian Bryan Matthew Dailey Greg Opperman Nate Piper Brett Ponsler Samuel Song Alex Ostapenko Keilin Bickar.
Copyright © 2012 Cleversafe, Inc. All rights reserved. 1 Combining the Power of Hadoop with Object-Based Dispersed Storage.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
CERN IT Department CH-1211 Geneva 23 Switzerland t XLDB 2010 (Extremely Large Databases) conference summary Dawid Wójcik.
Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
Data Mining on the Web via Cloud Computing COMS E6125 Web Enhanced Information Management Presented By Hemanth Murthy.
Ch 4. The Evolution of Analytic Scalability
Analytics Map Reduce Query Insight Hive Pig Hadoop SQL Map Reduce Business Intelligence Predictive Operational Interactive Visualization Exploratory.
Tyson Condie.
Cloud Distributed Computing Environment Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
COMP 2903 A34s – Google and the Wisdom of Clouds Danny Silver JSOCS, Acadia University.
Face Detection And Recognition For Distributed Systems Meng Lin and Ermin Hodžić 1.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
W HAT IS H ADOOP ? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.
Hadoop Ali Sharza Khan High Performance Computing 1.
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.
MARISSA: MApReduce Implementation for Streaming Science Applications 作者 : Fadika, Z. ; Hartog, J. ; Govindaraju, M. ; Ramakrishnan, L. ; Gunter, D. ; Canon,
Alastair Duncan STFC Pre Coffee talk STFC July 2014 The Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Big Data Analytics Large-Scale Data Management Big Data Analytics Data Science and Analytics How to manage very large amounts of data and extract value.
How Companies are Using Spark And where the Edge in Big Data will be Matei Zaharia.
Apache Hadoop Daniel Lust, Anthony Taliercio. What is Apache Hadoop? Allows applications to utilize thousands of nodes while exchanging thousands of terabytes.
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Hadoop IT Services Hadoop Users Forum CERN October 7 th,2015 CERN IT-D*
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Matthew Winter and Ned Shawa
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Big Data Directions Greg.
Site Technology TOI Fest Q Celebration From Keyword-based Search to Semantic Search, How Big Data Enables That?
What we know or see What’s actually there Wikipedia : In information technology, big data is a collection of data sets so large and complex that it.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
MapReduce. Google and MapReduce Google searches billions of web pages very, very quickly How? It uses a technique called “MapReduce” to distribute the.
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
PARALLEL AND DISTRIBUTED PROGRAMMING MODELS U. Jhashuva 1 Asst. Prof Dept. of CSE om.
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
An Introduction To Big Data For The SQL Server DBA.
BIG DATA BIGDATA, collection of large and complex data sets difficult to process using on-hand database tools.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Grid Technology CERN IT Department CH-1211 Geneva 23 Switzerland t DBCF GT Our experience with NoSQL and MapReduce technologies Fabio Souto.
Big Data Analytics on Large Scale Shared Storage System
SAS users meeting in Halifax
Hadoop Aakash Kag What Why How 1.
Tutorial: Big Data Algorithms and Applications Under Hadoop
Central Florida Business Intelligence User Group
Introduction to Spark.
CS110: Discussion about Spark
MapReduce.
Overview of big data tools
MAPREDUCE TYPES, FORMATS AND FEATURES
L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher
Presentation transcript:

Introducing MapReduce to High End Computing Grant Mackey, Julio Lopez, Saba Sehrish, John Bent, Salman Habib, Jun Wang University of Central Florida, Carnegie Melon University, Los Alamos National Laboratory

As the computational scale of scientific applications grows, so does the amount of data. Dealing with that amount of data becomes difficult. Data analytics become difficult The data becomes too large to move Applications become resource intensive More difficult to program for Do the older existing solutions scale? Scientific Applications

Bioinformatics (Basic Local Alignment Search Tool) Genomics machines generate large datasets (GB~TB) Data is manually distributed in parallel through an adhoc job manager script The method of parallelizing BLAST is conceptually a manual MapReduce operation Using Hadoop would abstract away the manual parallelization of tasks and would provide task resiliency Scientific Applications

Cyber-Security: Real-time network analysis In a massively multi-user network environment, petabytes of information can pass of the network in a matter of months Need a scalable FS that can accommodate the large streaming datasets Network events are data independent A programming model that abstracts parallelization from the user is convenient Scientific Applications

Astrophysics: Halo Finding Scientific Applications Current issues Ad Hoc: The approach is unique Too much data movement Parallel halo finding tasks are unreliable Hadoop Solutions Provides a standard approach No data movement Hadoop ensures task resilience

Method used to find clusters of particles in large astrophysics datasets. Halo Finding

Algorithm used to perform halo finding Friends of Friends

MapReduce model for Halo-Finding M R FoF HDFS

There is a reason why people think that Hadoop and is only good for data mining applications There exists little to no functionality for data types beyond text Learning curve for the language is steep for applications that deal with different data types such as binary The programmer has to deal with the new programming model and write their own input classes The Hadoop community is very active and incredibly helpful/prompt with responding to issues/bugs Experiences

Hadoop can be used as a viable resource for large data intensive computing Hadoop runs on an inexpensive commodity computing platform, but provides powerful tools for large scale data analytics The Hadoop architecture provides for task resiliency that other scientific computing methods cannot Hadoop allows for a strict model in which to parallelize a task and the parallelization has been shown to scale to node cluster environments (Amazon’s S3 cluster) Hadoop needs more functionality in its API for other data formats Conclusion

Grant Mackey: Julio Lopez: Saba Sehrish: John Bent: Jun Wang: Contact

Insert text here HEADER

Insert text here HEADER

Insert text here HEADER