GSICS Annual Meeting Laurie Rokke, DPhil NOAA/NESDIS/STAR March 26, 201.

Similar presentations
Distributed and Parallel Processing Technology Chapter2. MapReduce

MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map-reduce: sort/merge based distributed processing. Best for batch-oriented processing. Sort/merge is primitive.
• Open source software framework designed for storage and processing of large-scale data on clusters of commodity hardware • Created by Doug Cutting and.
MapReduce.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Hadoop: The Definitive Guide Chap. 2 MapReduce
7/14/2015 EECS 584, Fall. MapReduce: Simplified Data Processing on Large Clusters. Yunxing Dai, Huan Feng.
Undergraduate Poster Presentation, March 31, 2015. Department of CSE, BUET, Dhaka, Bangladesh. Wireless Sensor Network Integration With Cloud Computing H.M.A.
Hadoop & Cheetah. Key words Cluster  data center – Lots of machines thousands Node  a server in a data center – Commodity device fails very easily Slot.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Presented by CH.Anusha. • Apache Hadoop framework • HDFS and MapReduce • Hadoop distributed file system • JobTracker and TaskTracker • Apache Hadoop NextGen.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
Apache Hadoop MapReduce What is it ? Why use it ? How does it work Some examples Big users.
WHAT IS HADOOP? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
FACTS • Data-intensive applications with petabytes of data • Web pages: billion web pages x 20KB = 400+ terabytes • One computer can read
Cloud Distributed Computing Platform 2. Content of this lecture is primarily from the book "Hadoop: The Definitive Guide" (2/e).
The exponential growth of data –Challenges for Google,Yahoo,Amazon & Microsoft in web search and indexing The volume of data being made publicly available.
MapReduce. 黃威凱 (first-year CS master's student). Outline: Purpose, Example, Method, Advanced.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
What we know or see What’s actually there Wikipedia : In information technology, big data is a collection of data sets so large and complex that it.
MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
Next Generation of Apache Hadoop MapReduce Owen
Student schedule: Wei Li, Nov 30, 2015, Monday 9:00-9:25am; Shubbhi Taneja, Nov 30, 2015, Monday 9:25-9:50am; Rodrigo Sanandan, Dec 2, 2015, Wednesday 9:00-9:25am.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
Hadoop Introduction. Audience Introduction of students – Name – Years of experience – Background – Do you know Java? – Do you know linux? – Any exposure.
Image taken from: slideshare
Big Data is a Big Deal!.
MapReduce Compiler RHadoop
Sushant Ahuja, Cassio Cristovao, Sameep Mohta
Hadoop Aakash Kag What Why How 1.
By Chris immanuel, Heym Kumar, Sai janani, Susmitha
An Open Source Project Commonly Used for Processing Big Data Sets
How to download, configure and run a MapReduce program in a Cloudera VM. Presented By: Mehakdeep Singh, Amrit Singh Chaggar, Ranjodh Singh.
Tutorial: Big Data Algorithms and Applications Under Hadoop
Large-scale file systems and Map-Reduce
Hadoop MapReduce Framework
Chapter 14 Big Data Analytics and NoSQL
Hadoop Clusters Tess Fulkerson.
Extraction, aggregation and classification at Web Scale
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Ministry of Higher Education
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
Distributed Systems CS
MapReduce: Simplified Data Processing on Large Clusters
The Basics of Apache Hadoop
Cloud Distributed Computing Environment Hadoop
Big Data - in Performance Engineering
Map reduce use case Giuseppe Andronico INFN Sez. CT & Consorzio COMETA
Ch 4. The Evolution of Analytic Scalability
Overview of big data tools
Lecture 16 (Intro to MapReduce and Hadoop)
CS 345A Data Mining MapReduce This presentation has been altered.
Distributed Systems CS
Zoie Barrett and Brian Lam
MAPREDUCE TYPES, FORMATS AND FEATURES
Apache Hadoop and Spark
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:


Big Data: a phenomenon defined by the rapid acceleration in the expanding volume of high-velocity, complex, and diverse types of data.
DATA VOLUME: KB, MB, GB, TB, PB; roughly 1.8 zettabytes (10^21 GB)
DATA VELOCITY: batch, periodic, near real time, real time
DATA VARIETY: tables, databases, web, video, audio, reports, social tech, SMS, automated demand, mobile apps, sensors

Characteristics of Big Data
Volume: sheer amount ingested, analyzed, managed
Velocity: speed at which data is produced/changed, received, understood, processed
Variety: new sources; integration, governance, architecture
Veracity: quality and provenance; traceability, justification

Free and Open Source Software (FOSS)
Hadoop: provides a reliable shared storage and analysis system
- Storage provided by HDFS
- Analysis provided by MapReduce
R: software package to perform statistical analysis on data (regression, clustering, classification, text analysis)
- Released under the GNU General Public License

USE CASE: Mining Weather Data
Data: National Climatic Data Center (NCDC)
- Data files organized by date and weather station
- One directory for each year
- Each directory contains gzipped files for every weather station with records for that year

% ls raw/1900 | head
(one .gz file per weather station)

Data Record
Line-oriented ASCII format; each line is a record.
Many meteorological elements: optional variables, highly variable record length.
Fields include:
- USAF weather station identifier
- WBAN weather station identifier
- Observation date and observation time
- Air temperature (degrees Celsius x 10), with a quality code
- Dew point temperature (degrees Celsius x 10), with a quality code
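As a sketch of that layout in Python: the temperature and quality-code offsets below match the awk extraction shown later in the talk (substr($0, 88, 5) and substr($0, 93, 1), 1-based), while the year position and the synthetic sample line are assumptions for illustration.

```python
def parse_record(line):
    """Parse one fixed-width NCDC record line.

    Temperature (chars 88-92, 1-based) and quality code (char 93)
    follow the awk script later in the talk; the year position
    (chars 16-19) is an assumption for illustration.
    """
    year = line[15:19]
    temp = int(line[87:92])   # degrees Celsius x 10, signed
    quality = line[92]
    return year, temp, quality

# A synthetic record (not real NCDC data): year 1950, temp +2.2 C, quality 1.
sample = "0043011990999991950051507004" + "." * 59 + "+0022" + "1"
year, temp, quality = parse_record(sample)
# (year, temp, quality) == ("1950", 22, "1")
```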

Question: what is the highest recorded global temperature for each year in the database?
Example first with awk, then with Hadoop.
Tens of thousands of files, concatenated by year.

Find Max Temp per Year from NCDC Data

#!/usr/bin/env bash
for year in all/*
do
  echo -ne "$(basename "$year" .gz)\t"
  gunzip -c "$year" | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done

Speeding up processing (about 40 minutes serially): run parts in parallel.
A) Divide the work by year, one process per year
- File sizes differ widely
- The whole run is dominated by the longest file
B) Divide the input into fixed-size chunks
- Assign each chunk to a process

Combining Results from Independent Processes
A) Per year: the result for each year is independent of the other years
- Combine by concatenating all results, then sorting by year
B) Fixed-size chunks: compute the max temperature for each chunk
- Final step: look for the highest temperature per year across chunks
Still limited by the processing capacity of a single machine, and the data set may grow beyond the capacity of a single machine.
Problems: coordination and reliability. Who runs the overall job? How do you deal with failed processes?
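Sketching scheme (B) in Python makes the combine step concrete, and also makes the missing pieces obvious: nothing here schedules processes or handles failures. The sample chunks are hypothetical, using 9999 as the missing-value sentinel from the awk script.

```python
def max_temp(records):
    """Per-chunk step: highest valid temperature per year in one chunk
    of already-parsed (year, temp) records; 9999 marks a missing reading."""
    best = {}
    for year, temp in records:
        if temp != 9999 and (year not in best or temp > best[year]):
            best[year] = temp
    return best

def combine(partials):
    """Final step: highest temperature per year across all chunk results."""
    final = {}
    for part in partials:
        for year, temp in part.items():
            if year not in final or temp > final[year]:
                final[year] = temp
    return final

# Three hypothetical fixed-size chunks; in the scheme described above,
# each max_temp call would run in a separate process.
chunks = [[("1949", 111), ("1950", 0)],
          [("1949", 78), ("1950", 22)],
          [("1950", 9999)]]
result = combine([max_temp(chunk) for chunk in chunks])
# result == {"1949": 111, "1950": 22}
```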

Solution: Apache Hadoop
HDFS (Hadoop Distributed File System)
- Data files replicated; rack-aware file system
- Partitions data and computation across hosts
MapReduce: the heart of Hadoop
- Map: applies an operational function to (key, value) pairs
- Reduce: collects (key, value) pairs and writes the results
Impact: large-scale computing. Enables scalable, flexible, fault-tolerant, cost-effective processing.

Map NCDC Data
The map function is simple: pull out the year and the air temperature.
- Data preparation phase: input keys are line offsets in the file (0, 106, 212, 318, 424 in the example) and are ignored in the map function
- Extract the year and the air temperature (an integer)
- Filter out temperatures that are missing, suspect, or erroneous
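A minimal Python sketch of this map function, mirroring the filtering rules of the awk script (9999 as the missing-value sentinel, quality codes 0/1/4/5/9 accepted); make_line and its records are hypothetical helpers, not real NCDC data.

```python
def map_ncdc(offset, line):
    """Map one NCDC record to a (year, temperature) pair.

    The input key (the line's byte offset) is ignored; missing (9999)
    and poor-quality readings are filtered out, as on the slide.
    """
    year = line[15:19]
    temp = int(line[87:92])
    quality = line[92]
    if temp != 9999 and quality in "01459":
        yield (year, temp)

def make_line(year, temp, quality):
    """Hypothetical helper: build a record with only the fields used above."""
    return " " * 15 + year + " " * 68 + "%+05d" % temp + quality

pairs = list(map_ncdc(0, make_line("1950", 22, "1")))
# pairs == [("1950", 22)]
```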

MapReduce Framework
The framework sorts and groups the map output key-value pairs by key; the reduce function iterates through each group's list and picks the maximum reading.
Map output: (1950, 0) (1950, 22) (1950, -11) (1949, 111) (1949, 78)
Grouped: (1949, [111, 78]) (1950, [0, 22, -11])
Reduce output: (1949, 111) (1950, 22)

MapReduce Logical Data Flow
Input → Map: (1950, 0) (1950, 22) (1949, 111) ...
Shuffle: (1949, [111, 78]) (1950, [0, 22, -11])
Reduce → Output: (1949, 111) (1950, 22)
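The logical data flow above can be simulated end to end; this is a toy sketch of the framework's shuffle and reduce phases, not Hadoop's implementation.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group map-output values by key, as the framework's sort and
    shuffle phase does, and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_max(year, temps):
    """Reduce one group: pick the maximum temperature for the year."""
    return (year, max(temps))

# Map output from the slide (temperatures in tenths of a degree Celsius).
map_output = [("1950", 0), ("1950", 22), ("1950", -11),
              ("1949", 111), ("1949", 78)]
reduced = [reduce_max(year, temps) for year, temps in shuffle(map_output)]
# reduced == [("1949", 111), ("1950", 22)]
```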

Hadoop divides a MapReduce job into fixed-size pieces called splits and runs each map task where its HDFS block resides when possible:
A) Data-local
B) Rack-local
C) Off-rack
(Slide figure legend: map task, HDFS block, node, rack.)

Data Flow with Multiple Reduce Tasks
- Each input split (split 0, split 1, split 2) is read from HDFS and processed by a map task
- Map output is sorted and copied to the reducers, where the pieces are merged
- Each reduce task writes one output partition (part 0, part 1) back to HDFS, with replication
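With more than one reducer, the framework must also decide which reducer receives each key; here is a toy sketch of hash partitioning (the spirit of Hadoop's default behavior, not its actual implementation).

```python
def partition(key, num_reducers):
    """Send a key to a reducer by hashing it; every occurrence of a key
    lands on the same reducer, so each year is reduced in one place."""
    return hash(key) % num_reducers

# Both 1950 readings go to the same reducer, whichever one that is.
assert partition("1950", 2) == partition("1950", 2)
```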

R for Big Data Analytics
- Statistics: linear and non-linear modeling, predictive modeling
- Machine learning: time-series analysis, classification, clustering
- Visualization

Understanding the Architecture of R and Hadoop
- The client uses R to initialize a MapReduce job and submits it to the JobTracker
- The JobTracker allocates resources and submits tasks to the TaskTrackers
- Tasks read from and write to HDFS

Practical Road Map Forward
Event log/trending; define a potential use case where big data adds unique value.
Tactical phases (time to value):
1. Assess: 3V requirements; current capability; data and sources; technical requirements to support accessing, managing, and analyzing the data
2. Plan: identify the entry point (3V); success criteria; roadmap; use case; policy and security issues; iterative phased deployment
3. Execute: deploy the current phase; build out the big data platform as the plan requires (flexibility); deploy scalable technologies
4. Continual review: review progress, adjust deployment plans, test processes

Fall 2014 initiative: a Data Science Commons
- Improve cross-organizational information sharing
- Discussion topics: DOI, provenance, visualization
Questions? More information?