Distributed and Parallel Processing Technology
Chapter 1. Meet Hadoop
Sun Jo
Data!
- We live in the data age. Estimates put the digital universe at 0.18 ZB in 2006, with a tenfold growth to 1.8 ZB forecast by 2011
  - 1 ZB = 10^21 bytes = 1,000 EB = 1,000,000 PB = 1,000,000,000 TB
- The flood of data is coming from many sources
  - The New York Stock Exchange generates 1 TB of new trade data per day
  - Facebook hosts about 10 billion photos, taking up 1 PB (= 1,000 TB) of storage
  - The Internet Archive stores around 2 PB, and is growing at a rate of 20 TB per month
- 'Big Data' can affect smaller organizations and individuals too
  - Digital photos and an individual's interactions (phone calls, emails, documents) are captured and stored for later access
- The amount of data generated by machines will be even greater than that generated by people
  - Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions
Data!
- Data can be shared for anyone to download and analyze
  - Public Data Sets on Amazon Web Services, Infochimps.org, theinfo.org
- The Astrometry.net project
  - Watches the astrometry group on Flickr for new photos of the night sky
  - Analyzes each image and identifies which part of the sky it is from
  - The project shows what is possible when data is made available and used for something that was not anticipated by its creator
- Big Data is here, and we are struggling to store and analyze it
Data Storage and Analysis
- Storage capacities have increased, but access speeds haven't kept up (and writing is even slower)

                                  1990         2010
  One drive stores                1,370 MB     1 TB
  Transfer speed                  4.4 MB/s     100 MB/s
  Time to read the full drive     ~5 minutes   ~2.5 hours

- Solution: read and write data in parallel to/from multiple disks (see the timing sketch after this slide)
- Problems to solve
  - Hardware failure: keep redundant copies of the data in case of failure (replication, as in RAID)
  - Combining the data on one disk with data from the others for analysis
- What Hadoop provides
  - Reliable shared storage (HDFS)
  - Efficient analysis (MapReduce)
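As a back-of-the-envelope check of the table above, the sketch below computes how long a full-drive read takes at each era's transfer speed, and how reading in parallel shrinks that time. The drive sizes and speeds come from the table; the 100-disk figure is an illustrative assumption, not from the source.

```python
# Back-of-the-envelope timings for reading a full drive, using the
# figures from the table above (sizes in MB, speeds in MB/s).
def read_time(size_mb, speed_mb_per_s):
    """Seconds needed to stream size_mb off a disk at the given speed."""
    return size_mb / speed_mb_per_s

print(read_time(1_370, 4.4) / 60)        # 1990 drive: ~5.2 minutes
print(read_time(1_000_000, 100) / 3600)  # 2010 drive: ~2.8 hours

# Spreading the same 1 TB over 100 disks (an illustrative number, not
# from the source) and reading them all in parallel cuts the time to:
print(read_time(1_000_000 / 100, 100) / 60)  # ~1.7 minutes
```

This arithmetic is the whole motivation for the multiple-disk approach: capacity grew roughly 700x over the period while transfer speed grew only about 20x, so a single drive takes ever longer to read in full.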
Comparison with Other Systems - RDBMS
- RDBMS
  - B-Tree index: optimized for accessing and updating a small proportion of records
- MapReduce
  - Efficient when updating a large proportion of the data, since it uses Sort/Merge to rebuild the database
  - Good when the need is to analyze the whole dataset in a batch fashion
- Structured vs. semi- or unstructured data
  - Structured data (a particular, predefined schema): RDBMS
  - Semi- or unstructured data (a looser internal structure, or none at all): MapReduce
- Normalization
  - Relational data is often normalized to retain integrity and remove redundancy
  - MapReduce performs high-speed streaming reads and writes, so records that are not normalized are well suited to analysis with MapReduce (see the sketch after this slide)
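To make the normalization point concrete, here is a minimal sketch. The log format and field names are hypothetical, not from the source: a denormalized record carries everything the analysis needs on one line, so a job can stream through the file in a single pass instead of joining against a separate user table.

```python
# Hypothetical denormalized web-server log: each record already carries
# the user's name, so no join against a separate user table is needed.
log_lines = [
    "2010-01-01T10:00:00 alice /index.html",
    "2010-01-01T10:00:05 bob /about.html",
    "2010-01-01T10:00:09 alice /index.html",
]

# Whole-dataset batch analysis is then a single streaming pass:
hits_per_user = {}
for line in log_lines:
    _timestamp, user, _path = line.split()
    hits_per_user[user] = hits_per_user.get(user, 0) + 1

print(hits_per_user)  # {'alice': 2, 'bob': 1}
```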
Comparison with Other Systems - RDBMS
- RDBMS vs. MapReduce: the two kinds of systems are co-evolving
  - Relational databases are starting to incorporate some of the ideas from MapReduce
  - Higher-level query languages built on MapReduce are making MapReduce systems more approachable to traditional database programmers
Comparison with Other Systems – Grid Computing
- Grid computing
  - The High Performance Computing (HPC) and Grid Computing communities have long been doing large-scale data processing
  - They use APIs such as the Message Passing Interface (MPI)
- HPC
  - Distributes the work across a cluster of machines, which access a shared filesystem hosted by a SAN
  - Works well for compute-intensive jobs
  - Hits a problem when nodes need to access larger data volumes (hundreds of GB), since the network bandwidth becomes the bottleneck and compute nodes sit idle
- Data locality, the heart of MapReduce
  - MapReduce collocates the data with the compute node, so data access is fast because it is local
- MPI vs. MapReduce
  - MPI programmers need to handle the mechanics of the data flow explicitly
  - MapReduce programmers think in terms of functions of key and value pairs, and the data flow is implicit (see the sketch after this slide)
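A minimal sketch of the "functions of key and value pairs" model, written as a plain-Python word count rather than with real Hadoop APIs: the programmer writes only map and reduce, while a few driver lines stand in for the framework's implicit data flow (map, then shuffle, then reduce).

```python
from itertools import groupby
from operator import itemgetter

# The programmer writes only these two functions.
def map_fn(_key, line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all the counts emitted for this word.
    yield (word, sum(counts))

# These driver lines simulate the framework's hidden data flow:
# map every record, shuffle (sort/group by key), then reduce each group.
lines = ["hadoop stores data", "hadoop processes data"]
mapped = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
grouped = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))
for word, group in grouped:
    for result in reduce_fn(word, (count for _w, count in group)):
        print(result)
# ('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)
```

In real Hadoop the shuffle, partitioning, and distribution across machines are handled by the framework; only the two functions above are the programmer's concern.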
Comparison with Other Systems – Grid Computing
- Partial failure
  - MapReduce is a shared-nothing architecture: tasks have no dependence on one another, so the order in which they run doesn't matter, and a failed task can simply be rerun (simulated in the sketch after this slide)
  - MPI programs have to manage their own checkpointing and recovery
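Because a shared-nothing task depends only on its input, recovery is just re-execution; no checkpoint or coordination is needed. The sketch below simulates that rescheduling behavior; the flaky task and the retry loop are illustrative assumptions, not Hadoop code.

```python
import random

def run_task(task_input):
    # Simulate a node failing partway through a task.
    if random.random() < 0.2:
        raise RuntimeError("node failed")
    return task_input * 2  # the task's actual work

def run_with_retries(task_input, max_attempts=10):
    # A shared-nothing task depends only on its input, so recovery
    # is simply re-execution, possibly on a different node.
    for _attempt in range(max_attempts):
        try:
            return run_task(task_input)
        except RuntimeError:
            continue  # reschedule the task
    raise RuntimeError("task failed on every attempt")

# Tasks are independent, so they could run in any order, anywhere.
print([run_with_retries(x) for x in range(5)])  # [0, 2, 4, 6, 8]
```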
Comparison with Other Systems – Volunteer Computing
- Volunteer computing projects
  - Break the problem into chunks called work units and send them to computers around the world to be analyzed
  - The results are sent back to the server when the analysis is completed, and the client gets another work unit
  - Example: SETI@home analyzes radio telescope data for signs of intelligent life outside Earth
- SETI@home vs. MapReduce
  - SETI@home is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world, since volunteers donate CPU cycles, not bandwidth
  - It runs a perpetual computation on untrusted machines on the Internet, with highly variable connection speeds and no data locality
  - MapReduce is designed to run jobs that last minutes or hours on hardware running in a single data center with very high aggregate bandwidth interconnects
A Brief History of Hadoop
- Hadoop
  - Created by Doug Cutting, the creator of Apache Lucene, the text search library
  - Has its origins in Apache Nutch, an open source web search engine that was itself a part of the Lucene project
  - 'Hadoop' was the name Doug's kid gave to a stuffed yellow elephant toy
- History
  - 2002: Nutch was started; a working crawler and search system quickly emerged, but its architecture wouldn't scale to the billions of pages on the Web
  - 2003: Google published a paper describing the architecture of Google's distributed filesystem, GFS
  - 2004: the Nutch project implemented the GFS ideas as the Nutch Distributed Filesystem (NDFS)
  - 2004: Google published the paper introducing MapReduce
  - 2005: Nutch had a working MapReduce implementation; by the middle of that year, all the major Nutch algorithms had been ported to run using MapReduce and NDFS
A Brief History of Hadoop
- History (continued)
  - Jan. 2006: Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale
  - Feb. 2006: Hadoop was moved out of Nutch to form an independent subproject of Lucene
  - Feb. 2008: Yahoo! announced that its search index was being generated by a 10,000-core Hadoop cluster
  - Apr. 2008: Hadoop broke a world record to become the fastest system to sort a terabyte of data
  - Nov. 2008: Google reported that its MapReduce implementation sorted one terabyte in 68 seconds
  - May 2009: Yahoo! used Hadoop to sort one terabyte in 62 seconds
Apache Hadoop and the Hadoop Ecosystem
- The Hadoop projects covered in this book are the following:
  - Common – a set of components and interfaces for filesystems and I/O
  - Avro – a serialization system for RPC and persistent data storage
  - MapReduce – a distributed data processing model
  - HDFS – a distributed filesystem running on large clusters of machines
  - Pig – a data flow language and execution environment for large datasets
  - Hive – a distributed data warehouse providing a SQL-like query language
  - HBase – a distributed, column-oriented database
  - ZooKeeper – a distributed, highly available coordination service
  - Sqoop – a tool for efficiently moving data between relational databases and HDFS