Hadoop: The Definitive Guide Chap. 4 Hadoop I/O


Hadoop: The Definitive Guide Chap. 4 Hadoop I/O Kisung Kim

Contents
- Data Integrity
- Compression
- Serialization
- File-based Data Structures

Data Integrity
- When the volumes of data flowing through the system are as large as the ones Hadoop can handle, the chance of data corruption is high
- Checksums
  - The usual way of detecting corrupted data
  - Error detection only: a checksum cannot repair the corrupted data
  - CRC-32 (cyclic redundancy check) computes a 32-bit integer checksum for input of any size
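To make the idea concrete, here is a minimal sketch of checksum-based error detection using the JDK's java.util.zip.CRC32 (HDFS uses its own internal CRC-32 implementation; this standalone example is only illustrative):

import java.util.zip.CRC32;

public class Crc32Demo {
  public static void main(String[] args) {
    byte[] data = "some block of data".getBytes();

    // Compute a 32-bit checksum over the data when it is written
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    long stored = crc.getValue();

    // On read, recompute and compare: a mismatch signals corruption,
    // but the checksum alone cannot repair the data
    CRC32 check = new CRC32();
    check.update(data, 0, data.length);
    if (check.getValue() != stored) {
      System.err.println("Data is corrupt");
    } else {
      System.out.println("Checksum OK: " + Long.toHexString(stored));
    }
  }
}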

Data Integrity in HDFS
- HDFS transparently checksums all data written to it and, by default, verifies checksums when reading data
- io.bytes.per.checksum
  - The amount of data per checksum
  - Default is 512 bytes
- Datanodes are responsible for verifying the data they receive before storing the data and its checksum
  - If an error is detected, the client receives a ChecksumException, a subclass of IOException
- When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanode
- Checksum verification log
  - Each datanode keeps a persistent log of when each of its blocks was last verified
  - When a client successfully verifies a block, it tells the datanode that sent the block, and the datanode updates its log

Data Integrity in HDFS
- DataBlockScanner
  - Background thread that periodically verifies all the blocks stored on the datanode
  - Guards against corruption due to "bit rot" in the physical storage media
- Healing corrupted blocks
  - If a client detects an error when reading a block, it reports the bad block and the datanode to the namenode
  - The namenode marks the block replica as corrupt
  - The namenode schedules a copy of the block to be replicated on another datanode
  - The corrupt replica is then deleted
- Disabling checksum verification
  - Pass false to the setVerifyChecksum() method on FileSystem
  - Use the -ignoreCrc option with the filesystem shell
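A minimal sketch of disabling client-side verification programmatically (the file path is hypothetical; the shell equivalent is hadoop fs -get -ignoreCrc):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWithoutChecksum {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Skip checksum verification for all subsequent reads on this FileSystem
    fs.setVerifyChecksum(false);

    // Hypothetical path: a file suspected of being corrupt
    FSDataInputStream in = fs.open(new Path("/data/possibly-corrupt-file"));
    in.close();
  }
}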

Data Integrity in HDFS
- LocalFileSystem
  - Performs client-side checksumming
  - When you write a file called filename, the filesystem client transparently creates a hidden file, .filename.crc, in the same directory, containing the checksums for each chunk of the file
- RawLocalFileSystem
  - Disables checksums
  - Use it when you don't need checksums
- ChecksumFileSystem
  - Wrapper around FileSystem that makes it easy to add checksumming to other (non-checksummed) filesystems
  - The underlying filesystem is called the raw filesystem
    FileSystem rawFs = ...
    FileSystem checksummedFs = new ChecksumFileSystem(rawFs);

Compression
- Two major benefits of file compression
  - Reduces the space needed to store files
  - Speeds up data transfer across the network
- When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop

Compression Formats
- The "Splittable" column indicates whether the compression format supports splitting: whether you can seek to any point in the stream and start reading from some point further on
- Splittable compression formats are especially suitable for MapReduce

  Compression format | Tool  | Algorithm | Filename extension | Multiple files | Splittable
  DEFLATE            | N/A   | DEFLATE   | .deflate           | No             | No
  gzip               | gzip  | DEFLATE   | .gz                | No             | No
  ZIP                | zip   | DEFLATE   | .zip               | Yes            | Yes, at file boundaries
  bzip2              | bzip2 | bzip2     | .bz2               | No             | Yes
  LZO                | lzop  | LZO       | .lzo               | No             | No

Codecs
- A codec is an implementation of a compression-decompression algorithm
- The LZO libraries are GPL-licensed and may not be included in Apache distributions
- CompressionCodec
  - createOutputStream(OutputStream out): creates a CompressionOutputStream to which you write your uncompressed data to have it written in compressed form to the underlying stream
  - createInputStream(InputStream in): obtains a CompressionInputStream, which allows you to read uncompressed data from the underlying stream

  Compression format | Hadoop CompressionCodec
  DEFLATE            | org.apache.hadoop.io.compress.DefaultCodec
  gzip               | org.apache.hadoop.io.compress.GzipCodec
  bzip2              | org.apache.hadoop.io.compress.BZip2Codec
  LZO                | com.hadoop.compression.lzo.LzopCodec

Example
- finish() tells the compressor to finish writing to the compressed stream, but doesn't close the stream

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
  public static void main(String[] args) throws Exception {
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec =
        (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish();
  }
}

% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec \
  | gunzip -
Text
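The example above only compresses; as a counterpart, here is a sketch in the spirit of the book's FileDecompressor, which infers the codec from the file extension with CompressionCodecFactory and decompresses to a file without the suffix (the command-line argument is assumed to be a path such as file.gz):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path inputPath = new Path(uri);
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inputPath);  // chosen by filename extension
    if (codec == null) {
      System.err.println("No codec found for " + uri);
      System.exit(1);
    }

    // Strip the compression suffix (e.g. ".gz") to form the output filename
    String outputUri =
        CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

    InputStream in = null;
    OutputStream out = null;
    try {
      in = codec.createInputStream(fs.open(inputPath));
      out = fs.create(new Path(outputUri));
      IOUtils.copyBytes(in, out, conf);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}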

Compression and Input Splits
- When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting
- Example of the problem with a non-splittable compression format
  - Suppose a file is a gzip-compressed file whose compressed size is 1 GB
  - Creating a split for each block won't work, since it is impossible to start reading at an arbitrary point in the gzip stream, and therefore impossible for a map task to read its split independently of the others

Serialization
- The process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage
- Deserialization is the reverse process
- Requirements
  - Compact: makes efficient use of storage space
  - Fast: the overhead in reading and writing data is minimal
  - Extensible: we can transparently read data written in an older format
  - Interoperable: we can read or write persistent data using different languages

Writable Interface
- The Writable interface defines two methods
  - write(): writes the object's state to a DataOutput binary stream
  - readFields(): reads the object's state from a DataInput binary stream
- Example: IntWritable

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

IntWritable writable = new IntWritable();
writable.set(163);

public static byte[] serialize(Writable writable) throws IOException {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  DataOutputStream dataOut = new DataOutputStream(out);
  writable.write(dataOut);
  dataOut.close();
  return out.toByteArray();
}

byte[] bytes = serialize(writable);
assertThat(bytes.length, is(4));
assertThat(StringUtils.byteToHexString(bytes), is("000000a3"));
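For completeness, here is a self-contained sketch of the reverse helper (the method name deserialize() is illustrative): it repopulates a Writable from the bytes produced by serialize() above.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableRoundTrip {

  // Capture a Writable's state as a byte array (same as the slide's helper)
  public static byte[] serialize(Writable writable) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);
    writable.write(dataOut);
    dataOut.close();
    return out.toByteArray();
  }

  // Repopulate a Writable from a byte array (the reverse of serialize)
  public static void deserialize(Writable writable, byte[] bytes) throws IOException {
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    DataInputStream dataIn = new DataInputStream(in);
    writable.readFields(dataIn);
    dataIn.close();
  }

  public static void main(String[] args) throws IOException {
    IntWritable src = new IntWritable(163);
    byte[] bytes = serialize(src);      // 4 bytes: 00 00 00 a3

    IntWritable dst = new IntWritable();
    deserialize(dst, bytes);
    System.out.println(dst.get());      // prints 163
  }
}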

WritableComparable and Comparators
- IntWritable implements the WritableComparable interface
- Comparison of types is crucial for MapReduce
- Optimization: RawComparator
  - Compares records read from a stream without deserializing them into objects
- WritableComparator is a general-purpose implementation of RawComparator
  - Provides a default implementation of the raw compare() method that deserializes the objects and invokes the object compare() method
  - Acts as a factory for RawComparator instances

public interface WritableComparable<T> extends Writable, Comparable<T> {
}

RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);

IntWritable w1 = new IntWritable(163);
IntWritable w2 = new IntWritable(67);
assertThat(comparator.compare(w1, w2), greaterThan(0));

byte[] b1 = serialize(w1);
byte[] b2 = serialize(w2);
assertThat(comparator.compare(b1, 0, b1.length, b2, 0, b2.length), greaterThan(0));

Writable Classes
- Writable class hierarchy (org.apache.hadoop.io)
  - Interfaces: Writable, WritableComparable
  - Primitives: BooleanWritable, ByteWritable, IntWritable, VIntWritable, FloatWritable, LongWritable, VLongWritable, DoubleWritable, NullWritable
  - Others: Text, BytesWritable, MD5Hash, ObjectWritable, GenericWritable, ArrayWritable, TwoDArrayWritable, AbstractMapWritable, MapWritable, SortedMapWritable

Writable Wrappers for Java Primitives
- There are Writable wrappers for all the Java primitive types except short and char (both of which can be stored in an IntWritable)
- get() for retrieving and set() for storing the wrapped value
- Variable-length formats
  - If a value is between -112 and 127, only a single byte is used
  - Otherwise, the first byte indicates whether the value is positive or negative and how many bytes follow
  - Example: 163 serialized as a VIntWritable is the two bytes 8f a3; the first byte (0x8f = 1000 1111, i.e. -113 in two's complement) signals a positive value encoded in one following byte, and the second byte is 0xa3 = 163

  Java primitive | Writable implementation | Serialized size (bytes)
  boolean        | BooleanWritable         | 1
  byte           | ByteWritable            | 1
  int            | IntWritable             | 4
  int            | VIntWritable            | 1-5
  float          | FloatWritable           | 4
  long           | LongWritable            | 8
  long           | VLongWritable           | 1-9
  double         | DoubleWritable          | 8
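A small sketch of the variable-length encoding described above, written directly with WritableUtils; serializing 163 should produce the two bytes 8f a3:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.util.StringUtils;

public class VIntDemo {
  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);

    WritableUtils.writeVInt(dataOut, 163);   // too large to fit in a single byte
    dataOut.close();

    // First byte 0x8f says "positive value, 1 byte follows"; second byte is 0xa3 = 163
    System.out.println(StringUtils.byteToHexString(out.toByteArray()));  // prints 8fa3
  }
}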

Text
- Writable for UTF-8 sequences
  - Can be thought of as the Writable equivalent of java.lang.String
  - Replacement for the deprecated org.apache.hadoop.io.UTF8 class
- Maximum size is 2 GB
- Uses standard UTF-8 (org.apache.hadoop.io.UTF8 used Java's modified UTF-8)
- Indexing for the Text class is in terms of position in the encoded byte sequence
- Text is mutable (like all Writable implementations, except NullWritable)
  - You can reuse a Text instance by calling one of the set() methods

Text t = new Text("hadoop");
t.set("pig");
assertThat(t.getLength(), is(3));
assertThat(t.getBytes().length, is(3));

Etc.
- BytesWritable
  - Wrapper for an array of binary data
- NullWritable
  - Zero-length serialization
  - Used as a placeholder: a key or a value can be declared as a NullWritable when you don't need to use that position
- ObjectWritable
  - General-purpose wrapper for Java primitives, String, enum, Writable, null, or arrays of any of these types
  - Useful when a field can be of more than one type
- Writable collections
  - ArrayWritable, TwoDArrayWritable, MapWritable, SortedMapWritable

Serialization Frameworks
- Using Writables is not mandated by the MapReduce API
- The only requirement is a mechanism that translates to and from a binary representation of each type
- Hadoop has an API for pluggable serialization frameworks
  - A serialization framework is represented by an implementation of Serialization (in the org.apache.hadoop.io.serializer package)
  - A Serialization defines a mapping from types to Serializer instances and Deserializer instances
  - Set the io.serializations property to a comma-separated list of classnames to register Serialization implementations
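As a sketch, registering an additional framework is a matter of setting io.serializations; here Hadoop's bundled JavaSerialization is added alongside the default WritableSerialization (normally this would be set in core-site.xml or the job configuration rather than hard-coded):

import org.apache.hadoop.conf.Configuration;

public class RegisterSerializations {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Comma-separated list of Serialization implementations to use
    conf.set("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization,"
      + "org.apache.hadoop.io.serializer.JavaSerialization");
    System.out.println(conf.get("io.serializations"));
  }
}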

SequenceFile
- Persistent data structure for binary key-value pairs
- Usage examples
  - Binary log file: key is a timestamp, value is the log record
  - Container for smaller files
- The keys and values stored in a SequenceFile do not necessarily need to be Writable
  - Any types that can be serialized and deserialized by a Serialization may be used

Writing a SequenceFile
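The original slide showed the writer code as an image; in its place, here is a minimal sketch of writing key-value pairs with SequenceFile.Writer (the output path and sample records are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("numbers.seq");   // illustrative output path

    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path,
          key.getClass(), value.getClass());
      for (int i = 0; i < 100; i++) {
        key.set(i);
        value.set("record-" + i);
        writer.append(key, value);          // append one key-value pair
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}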

Reading a SequenceFile
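Likewise, a minimal sketch of iterating over a SequenceFile with SequenceFile.Reader; the key and value types are discovered from the file's header:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("numbers.seq");   // file written by the sketch above

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf);
      // The key and value classes are recorded in the SequenceFile header
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {     // returns false at end of file
        System.out.printf("[%s]\t%s\t%s%n", reader.getPosition(), key, value);
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}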

Sync Point
- A point in the stream which can be used to resynchronize with a record boundary if the reader is "lost" (for example, after seeking to an arbitrary position in the stream)
- sync(long position)
  - Positions the reader at the next sync point after position
  - Not to be confused with the sync() method defined by the Syncable interface for synchronizing buffers to the underlying device

SequenceFile Format
- The header contains the version number, the names of the key and value classes, compression details, user-defined metadata, and the sync marker
- Record formats
  - No compression
  - Record compression
  - Block compression

MapFile
- A sorted SequenceFile with an index to permit lookups by key
- Keys must be instances of WritableComparable and values must be Writable
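A minimal sketch of writing a MapFile (the directory name is illustrative); note that keys must be appended in sorted order, otherwise the writer rejects them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dirName = "numbers.map";   // a MapFile is a directory holding data and index files

    IntWritable key = new IntWritable();
    Text value = new Text();
    MapFile.Writer writer = null;
    try {
      writer = new MapFile.Writer(conf, fs, dirName, IntWritable.class, Text.class);
      for (int i = 0; i < 100; i++) {
        key.set(i);                   // keys are appended in increasing (sorted) order
        value.set("record-" + i);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}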

Reading a MapFile
- Sequential access: call the next() method until it returns false
- Random access: call the get() method
  - Reads the index file into memory
  - Performs a binary search on the in-memory index
- Very large MapFile indexes
  - Re-index to change the index interval
  - Load only a fraction of the index keys into memory by setting the io.map.index.skip property
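A minimal sketch of both access patterns with MapFile.Reader, reading the MapFile written by the sketch above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapFileReadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dirName = "numbers.map";   // MapFile directory written by the sketch above

    MapFile.Reader reader = null;
    try {
      reader = new MapFile.Reader(fs, dirName, conf);
      IntWritable key = new IntWritable();
      Text value = new Text();

      // Sequential scan: next() returns false at the end of the file
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }

      // Random access: get() binary-searches the in-memory index, then seeks
      // into the data file; it returns null if the key is not present
      key.set(42);
      Writable found = reader.get(key, value);
      System.out.println(found == null ? "not found" : key + "\t" + value);
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}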