Distributed and Parallel Processing Technology
Chapter 1. Meet Hadoop
Sun Jo
Data!
- We live in the data age. Estimates put the digital universe at 0.18 ZB in 2006, with a tenfold growth to 1.8 ZB forecast by 2011
  - 1 ZB = 10^21 bytes = 1,000 EB = 1,000,000 PB = 1,000,000,000 TB
- The flood of data is coming from many sources
  - The New York Stock Exchange generates 1 TB of new trade data per day
  - Facebook hosts about 10 billion photos, taking up 1 PB (= 1,000 TB) of storage
  - The Internet Archive stores around 2 PB, and is growing at a rate of 20 TB per month
- 'Big Data' can affect smaller organizations and individuals too
  - Digital photos and an individual's interactions (phone calls, emails, documents) are captured and stored for later access
- The amount of data generated by machines will be even greater than that generated by people
  - Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions
Data!
- Data can be shared for anyone to download and analyze
  - Public Data Sets on Amazon Web Services, Infochimps.org, theinfo.org
- The Astrometry.net project
  - Watches the astrometry group on Flickr for new photos of the night sky
  - Analyzes each image and identifies which part of the sky it is from
  - The project shows what is possible when data is made available and used for something that was not anticipated by its creator
- Big Data is here, and we are struggling to store and analyze it
Data Storage and Analysis
- Storage capacities have increased, but access speeds haven't kept up (and writing is even slower)

                                  1990         2010
  One drive stores                1,370 MB     1 TB
  Transfer speed                  4.4 MB/s     100 MB/s
  Time to read the full drive     ~5 minutes   ~2.5 hours

- Solution: read and write data in parallel to/from multiple disks (see the timing sketch after this slide)
- Problems to solve
  - Hardware failure: keep redundant copies of the data in case of failure (replication, as in RAID)
  - Combining the data on one disk with data from the others for analysis
- What Hadoop provides
  - Reliable shared storage (HDFS)
  - Efficient analysis (MapReduce)
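As a back-of-the-envelope check of the table above, the sketch below computes how long a full-drive read takes at each era's transfer speed, and how reading in parallel shrinks that time. The drive sizes and speeds come from the table; the 100-disk figure is an illustrative assumption, not from the source.

```python
# Back-of-the-envelope timings for reading a full drive, using the
# figures from the table above (sizes in MB, speeds in MB/s).
def read_time(size_mb, speed_mb_per_s):
    """Seconds needed to stream size_mb off a disk at the given speed."""
    return size_mb / speed_mb_per_s

print(read_time(1_370, 4.4) / 60)        # 1990 drive: ~5.2 minutes
print(read_time(1_000_000, 100) / 3600)  # 2010 drive: ~2.8 hours

# Spreading the same 1 TB over 100 disks (an illustrative number, not
# from the source) and reading them all in parallel cuts the time to:
print(read_time(1_000_000 / 100, 100) / 60)  # ~1.7 minutes
```

This arithmetic is the whole motivation for the multiple-disk approach: capacity grew roughly 700x over the period while transfer speed grew only about 20x, so a single drive takes ever longer to read in full.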
Comparison with Other Systems - RDBMS
- RDBMS
  - B-Tree index: optimized for accessing and updating a small proportion of records
- MapReduce
  - Efficient when updating a large proportion of the data, since it uses Sort/Merge to rebuild the database
  - Good when the need is to analyze the whole dataset in a batch fashion
- Structured vs. semi- or unstructured data
  - Structured data (a particular, predefined schema): RDBMS
  - Semi- or unstructured data (a looser internal structure, or none at all): MapReduce
- Normalization
  - Relational data is often normalized to retain integrity and remove redundancy
  - MapReduce performs high-speed streaming reads and writes, so records that are not normalized are well suited to analysis with MapReduce (see the sketch after this slide)
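To make the normalization point concrete, here is a minimal sketch. The log format and field names are hypothetical, not from the source: a denormalized record carries everything the analysis needs on one line, so a job can stream through the file in a single pass instead of joining against a separate user table.

```python
# Hypothetical denormalized web-server log: each record already carries
# the user's name, so no join against a separate user table is needed.
log_lines = [
    "2010-01-01T10:00:00 alice /index.html",
    "2010-01-01T10:00:05 bob /about.html",
    "2010-01-01T10:00:09 alice /index.html",
]

# Whole-dataset batch analysis is then a single streaming pass:
hits_per_user = {}
for line in log_lines:
    _timestamp, user, _path = line.split()
    hits_per_user[user] = hits_per_user.get(user, 0) + 1

print(hits_per_user)  # {'alice': 2, 'bob': 1}
```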
Comparison with Other Systems - RDBMS
- RDBMS vs. MapReduce: the two kinds of systems are co-evolving
  - Relational databases are starting to incorporate some of the ideas from MapReduce
  - Higher-level query languages built on MapReduce are making MapReduce systems more approachable to traditional database programmers
Comparison with Other Systems – Grid Computing
- Grid computing
  - The High Performance Computing (HPC) and Grid Computing communities have long been doing large-scale data processing
  - They use APIs such as the Message Passing Interface (MPI)
- HPC
  - Distributes the work across a cluster of machines, which access a shared filesystem hosted by a SAN
  - Works well for compute-intensive jobs
  - Hits a problem when nodes need to access larger data volumes (hundreds of GB), since the network bandwidth becomes the bottleneck and compute nodes sit idle
- Data locality, the heart of MapReduce
  - MapReduce collocates the data with the compute node, so data access is fast because it is local
- MPI vs. MapReduce
  - MPI programmers need to handle the mechanics of the data flow explicitly
  - MapReduce programmers think in terms of functions of key and value pairs, and the data flow is implicit (see the sketch after this slide)
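A minimal sketch of the "functions of key and value pairs" model, written as a plain-Python word count rather than with real Hadoop APIs: the programmer writes only map and reduce, while a few driver lines stand in for the framework's implicit data flow (map, then shuffle, then reduce).

```python
from itertools import groupby
from operator import itemgetter

# The programmer writes only these two functions.
def map_fn(_key, line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all the counts emitted for this word.
    yield (word, sum(counts))

# These driver lines simulate the framework's hidden data flow:
# map every record, shuffle (sort/group by key), then reduce each group.
lines = ["hadoop stores data", "hadoop processes data"]
mapped = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
grouped = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))
for word, group in grouped:
    for result in reduce_fn(word, (count for _w, count in group)):
        print(result)
# ('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)
```

In real Hadoop the shuffle, partitioning, and distribution across machines are handled by the framework; only the two functions above are the programmer's concern.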
Comparison with Other Systems – Grid Computing
- Partial failure
  - MapReduce is a shared-nothing architecture: tasks have no dependence on one another, so the order in which they run doesn't matter, and a failed task can simply be rerun (simulated in the sketch after this slide)
  - MPI programs have to manage their own checkpointing and recovery
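Because a shared-nothing task depends only on its input, recovery is just re-execution; no checkpoint or coordination is needed. The sketch below simulates that rescheduling behavior; the flaky task and the retry loop are illustrative assumptions, not Hadoop code.

```python
import random

def run_task(task_input):
    # Simulate a node failing partway through a task.
    if random.random() < 0.2:
        raise RuntimeError("node failed")
    return task_input * 2  # the task's actual work

def run_with_retries(task_input, max_attempts=10):
    # A shared-nothing task depends only on its input, so recovery
    # is simply re-execution, possibly on a different node.
    for _attempt in range(max_attempts):
        try:
            return run_task(task_input)
        except RuntimeError:
            continue  # reschedule the task
    raise RuntimeError("task failed on every attempt")

# Tasks are independent, so they could run in any order, anywhere.
print([run_with_retries(x) for x in range(5)])  # [0, 2, 4, 6, 8]
```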
Comparison with Other Systems – Volunteer Computing
- Volunteer computing projects
  - Break the problem into chunks called work units and send them to computers around the world to be analyzed
  - The results are sent back to the server when the analysis is completed, and the client gets another work unit
  - Example: SETI@home analyzes radio telescope data for signs of intelligent life outside Earth
- SETI@home vs. MapReduce
  - SETI@home is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world, since volunteers donate CPU cycles, not bandwidth
  - It runs a perpetual computation on untrusted machines on the Internet, with highly variable connection speeds and no data locality
  - MapReduce is designed to run jobs that last minutes or hours on hardware running in a single data center with very high aggregate bandwidth interconnects
A Brief History of Hadoop
- Hadoop
  - Created by Doug Cutting, the creator of Apache Lucene, the text search library
  - Has its origins in Apache Nutch, an open source web search engine that was itself a part of the Lucene project
  - 'Hadoop' was the name Doug's kid gave to a stuffed yellow elephant toy
- History
  - 2002: Nutch was started; a working crawler and search system quickly emerged, but its architecture wouldn't scale to the billions of pages on the Web
  - 2003: Google published a paper describing the architecture of Google's distributed filesystem, GFS
  - 2004: the Nutch project implemented the GFS ideas as the Nutch Distributed Filesystem (NDFS)
  - 2004: Google published the paper introducing MapReduce
  - 2005: Nutch had a working MapReduce implementation; by the middle of that year, all the major Nutch algorithms had been ported to run using MapReduce and NDFS
A Brief History of Hadoop
- History (continued)
  - Jan. 2006: Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale
  - Feb. 2006: Hadoop was moved out of Nutch to form an independent subproject of Lucene
  - Feb. 2008: Yahoo! announced that its search index was being generated by a 10,000-core Hadoop cluster
  - Apr. 2008: Hadoop broke a world record to become the fastest system to sort a terabyte of data
  - Nov. 2008: Google reported that its MapReduce implementation sorted one terabyte in 68 seconds
  - May 2009: Yahoo! used Hadoop to sort one terabyte in 62 seconds
Apache Hadoop and the Hadoop Ecosystem
- The Hadoop projects covered in this book are the following:
  - Common – a set of components and interfaces for filesystems and I/O
  - Avro – a serialization system for RPC and persistent data storage
  - MapReduce – a distributed data processing model
  - HDFS – a distributed filesystem running on large clusters of machines
  - Pig – a data flow language and execution environment for large datasets
  - Hive – a distributed data warehouse providing a SQL-like query language
  - HBase – a distributed, column-oriented database
  - ZooKeeper – a distributed, highly available coordination service
  - Sqoop – a tool for efficiently moving data between relational databases and HDFS