Data Formats CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.

Agenda: Apache Avro, Apache Parquet

APACHE AVRO

Overview: Avro is a data serialization system, implemented in C, C++, C#, Java, Perl, PHP, Python, and Ruby.

Avro Provides: rich data structures; a compact, fast, binary data format; a container file to store persistent data; Remote Procedure Call (RPC); simple integration with dynamic languages.

Schema Declaration: a schema is declared as one of
A JSON string
A JSON object
  – {"type": "typeName" ...attributes...}
A JSON array, representing a union of types
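
The three declaration forms can be told apart by the JSON type they parse to. The following is a minimal sketch using only Python's standard json module; the function name schema_form is made up for illustration and is not part of any Avro library.

```python
import json

def schema_form(text):
    """Classify an Avro schema declaration by its top-level JSON type."""
    node = json.loads(text)
    if isinstance(node, str):
        return "named or primitive type"
    if isinstance(node, dict):
        return "type definition with attributes"
    if isinstance(node, list):
        return "union"
    raise ValueError("not a valid schema declaration")

print(schema_form('"string"'))
print(schema_form('{"type": "array", "items": "string"}'))
print(schema_form('["string", "null"]'))
```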

Primitive Types: null, boolean, int, long, float, double, bytes, string

Complex Types: records, enums, arrays, maps, unions, fixed

Record Example - LinkedList
  {
    "type": "record",
    "name": "LongList",
    // old name for this
    "aliases": ["LinkedLongs"],
    "fields" : [
      // each element has a long
      {"name": "value", "type": "long"},
      // optional next element
      {"name": "next", "type": ["LongList", "null"]}
    ]
  }
Comments are here for descriptive purposes only – there are no comments in JSON
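
To make the recursive "next" field concrete, here is a small sketch that builds an in-memory LongList value as nested Python dicts and walks it back. The helper names to_long_list and from_long_list are hypothetical, and the dicts are only an in-memory picture of the record, not Avro's JSON or binary encoding.

```python
import json

def to_long_list(values):
    """Build a LongList datum (nested dicts) from a non-empty list of ints."""
    if not values:
        raise ValueError("LongList needs at least one value")
    node = None  # the last element has next = null
    for v in reversed(values):
        node = {"value": v, "next": node}
    return node

def from_long_list(node):
    """Follow the 'next' links and collect the values in order."""
    out = []
    while node is not None:
        out.append(node["value"])
        node = node["next"]
    return out

datum = to_long_list([1, 2, 3])
print(json.dumps(datum))
print(from_long_list(datum))
```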

Enum Example – Playing Cards { "type": "enum", "name": "Suit", "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"] }

Array { "type": "array", "items": "string" }

Maps { "type": "map", "values": "long" }

Unions
Represented using JSON arrays
  – ["string", "null"] declares a schema which may be a string or null
May not contain more than one schema with the same type, except in the case of named types like record, fixed, and enum
  – Two arrays or maps? No. But two record types with different names? Yes!
Cannot immediately contain other unions
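
The union rules above can be sketched as a small checker. This is an illustration only: check_union and union_key are hypothetical names, schemas are assumed to be already parsed from JSON, and namespaces are ignored.

```python
NAMED = {"record", "enum", "fixed"}

def union_key(branch):
    """Return the identity used to detect duplicate branches in a union."""
    if isinstance(branch, list):
        raise ValueError("a union may not immediately contain another union")
    if isinstance(branch, str):
        return branch                   # primitive name or reference to a named type
    kind = branch["type"]
    if kind in NAMED:
        return (kind, branch["name"])   # named types are told apart by name
    return kind                         # e.g. "array" or "map"

def check_union(branches):
    """Raise ValueError if the union breaks a rule, otherwise return True."""
    seen = set()
    for branch in branches:
        key = union_key(branch)
        if key in seen:
            raise ValueError(f"duplicate branch in union: {key!r}")
        seen.add(key)
    return True

print(check_union(["string", "null"]))
```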

Fixed { "type": "fixed", "size": 16, "name": "md5" }
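
A fixed value is simply a byte string of exactly the declared size. As a sketch, an MD5 digest is 16 bytes long, so it fits the schema above; the helper fits_fixed is a hypothetical name used only for this illustration.

```python
import hashlib

MD5_SCHEMA = {"type": "fixed", "size": 16, "name": "md5"}

def fits_fixed(schema, data):
    """A fixed value must be bytes of exactly the declared size."""
    return isinstance(data, bytes) and len(data) == schema["size"]

digest = hashlib.md5(b"hello avro").digest()
print(len(digest), fits_fixed(MD5_SCHEMA, digest))
```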

A Bit on Naming
Records, enums, and fixed types are all named
The full name is composed of the name and a namespace
  – Names start with [A-Za-z_] and can only contain [A-Za-z0-9_]
  – Namespaces are dot-separated sequences of names
Named types can be aliased to map a writer's schema to a reader's schema
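
The naming rules translate directly into a regular expression. A minimal sketch, with hypothetical helper names; it treats a name that already contains a dot as a full name and does not handle the empty namespace.

```python
import re

NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_valid_name(name):
    """Names start with [A-Za-z_] and continue with [A-Za-z0-9_]."""
    return NAME_RE.fullmatch(name) is not None

def is_valid_namespace(namespace):
    """A namespace is a dot-separated sequence of names."""
    return all(is_valid_name(part) for part in namespace.split("."))

def full_name(name, namespace=None):
    """Join namespace and name; a dotted name is already a full name."""
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"

print(full_name("User", "example.avro"))
```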

Encodings! Binary and JSON. Binary is more readable by the machines; JSON is more readable by the humans. Details of how they are encoded can be found at http://avro.apache.org/docs/current/spec.html
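
As a taste of the binary encoding, the specification encodes int and long values as zig-zag variable-length integers, and a string as its byte length (a long) followed by the UTF-8 bytes. The sketch below hand-rolls both with no Avro library; the function names are hypothetical and only values in the 64-bit range are assumed.

```python
import json

def encode_long(n):
    """Avro int/long: zig-zag, then base-128 groups, least significant first."""
    z = (n << 1) ^ (n >> 63)               # zig-zag: small magnitudes become small values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)      # high bit set means more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s):
    """Avro string: length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

print(encode_long(64).hex())               # 8001
print(encode_string("foo").hex())          # 06666f6f
print(json.dumps("foo"))                   # the JSON encoding of the same string
```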

Compression codecs: null (no compression), deflate, snappy (optional)
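
A rough sketch of the null and deflate codecs applied to one block of bytes, using Python's zlib. The Avro specification describes the deflate codec as raw RFC 1951 data, which zlib produces when given a negative wbits value. The CODECS table and function names are made up for this illustration, and snappy is left out because it needs a third-party library.

```python
import zlib

def deflate_block(data, level=6):
    """Raw DEFLATE: negative wbits omits the zlib header and checksum."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()

def inflate_block(data):
    """Inverse of deflate_block."""
    return zlib.decompress(data, -15)

# codec name -> (compress, decompress); "null" passes bytes through unchanged
CODECS = {
    "null": (lambda b: b, lambda b: b),
    "deflate": (deflate_block, inflate_block),
}

block = b"avro " * 100
for name, (compress, decompress) in CODECS.items():
    packed = compress(block)
    print(name, len(block), "->", len(packed), decompress(packed) == block)
```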

Other Features
RPC via Protocols
  – Message passing between readers and writers
Schema Resolution
  – When schema and data don't align
Parsing Canonical Form
  – Transform schemas into PCF to determine "sameness" between schemas
Schema Fingerprints
  – To "uniquely" identify schemas
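
The idea behind canonical forms and fingerprints can be sketched as follows. This is a deliberately simplified stand-in for Parsing Canonical Form, not the full algorithm: it keeps only the attributes that matter to the reader (name, type, fields, symbols, items, values, size) in a fixed order, collapses {"type": "string"} to "string", and removes whitespace, but it does not resolve namespaces into full names. The function names are hypothetical, and SHA-256 is used as the hash.

```python
import hashlib
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
KEPT = ["name", "type", "fields", "symbols", "items", "values", "size"]

def canonical(schema):
    """Simplified canonical form of a parsed schema (see assumptions above)."""
    if isinstance(schema, str):
        return schema
    if isinstance(schema, list):                       # union
        return [canonical(branch) for branch in schema]
    if isinstance(schema["type"], str) and schema["type"] in PRIMITIVES:
        return schema["type"]                          # {"type": "int"} -> "int"
    out = {}
    for key in KEPT:                                   # fixed attribute order
        if key not in schema:
            continue
        value = schema[key]
        if key == "fields":
            value = [{"name": f["name"], "type": canonical(f["type"])} for f in value]
        elif key in ("type", "items", "values"):
            value = canonical(value)
        out[key] = value
    return out

def fingerprint(schema_text):
    """SHA-256 hex digest of the simplified canonical form."""
    form = json.dumps(canonical(json.loads(schema_text)),
                      separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(form.encode("utf-8")).hexdigest()

a = '{"type":"record","name":"User","doc":"a user","fields":[{"name":"name","type":"string"}]}'
b = '{ "fields": [ {"type": {"type": "string"}, "name": "name", "default": "x"} ], "name": "User", "aliases": ["Person"], "type": "record" }'
print(fingerprint(a) == fingerprint(b))
```

Schemas a and b differ only in whitespace, attribute order, and attributes that do not affect how data is read, so they reduce to the same form.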

Code Generation!
  [shadam1@491vm ~]$ cat user.avsc
  {
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
      {"name": "name", "type": "string"},
      {"name": "favorite_number", "type": ["int", "null"]},
      {"name": "favorite_color", "type": ["string", "null"]}
    ]
  }
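
Before generating code, it can help to see what a valid User value looks like. The sketch below checks plain Python dicts against the schema using only the standard library; the matches function is hypothetical, covers just the types this schema uses (string, int, null, unions, records), and is not how the Avro libraries validate data.

```python
import json

USER_SCHEMA = json.loads("""
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
""")

def matches(schema, datum):
    """Return True if datum fits schema (only the types used by User)."""
    if isinstance(schema, list):                      # union: any branch may match
        return any(matches(branch, datum) for branch in schema)
    if schema == "null":
        return datum is None
    if schema == "string":
        return isinstance(datum, str)
    if schema == "int":                               # 32-bit signed; bool is excluded
        return (isinstance(datum, int) and not isinstance(datum, bool)
                and -2**31 <= datum < 2**31)
    if isinstance(schema, dict) and schema["type"] == "record":
        return isinstance(datum, dict) and all(
            f["name"] in datum and matches(f["type"], datum[f["name"]])
            for f in schema["fields"])
    raise NotImplementedError(f"type not covered by this sketch: {schema!r}")

print(matches(USER_SCHEMA, {"name": "Alyssa", "favorite_number": 256, "favorite_color": None}))
print(matches(USER_SCHEMA, {"name": "Ben", "favorite_number": "seven", "favorite_color": "red"}))
```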

Code Generation!
  [shadam1@491vm ~]$ java -jar avro-tools-1.7.6.jar compile \
      schema user.avsc .
  Input files to compile:
    user.avsc
  [shadam1@491vm ~]$ vi example/avro/User.java

Java and Python Demo! See my VM AWS https://github.com/adamjshook/hadoop-demos

APACHE PARQUET

Overview: Parquet is an Apache open-source columnar storage format for Hadoop. It is based on the Google Dremel paper and was created largely by Twitter and Cloudera. It supports very efficient compression and encoding schemes.

Serialization: Objects are written to and read from Parquet format by WriteSupport and ReadSupport implementations. Support exists for Avro, Thrift, Pig, Hive SerDe, and MapReduce. You can write your own, but it's easier to leverage what exists today.

File Hierarchy
Row Group – logical horizontal partitioning of data into rows
Column Chunk – chunk of the data for a particular column, living in a row group and contiguous in the file
Page – Column Chunks are divided up into pages
One or more Row Groups per file, and exactly one Column Chunk per column in each Row Group
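
The hierarchy can be pictured with plain Python lists. This is only a toy model under loud assumptions: the function names are hypothetical, and rows and values are counted here, whereas real Parquet writers size row groups and pages in bytes.

```python
def to_row_groups(rows, columns, rows_per_group):
    """Split rows horizontally into row groups, then lay each group out by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]
        # exactly one column chunk per column in every row group
        groups.append({col: [row[col] for row in group_rows] for col in columns})
    return groups

def to_pages(column_chunk, values_per_page):
    """Divide one column chunk into pages."""
    return [column_chunk[i:i + values_per_page]
            for i in range(0, len(column_chunk), values_per_page)]

rows = [{"id": i, "name": f"u{i}"} for i in range(5)]
groups = to_row_groups(rows, ["id", "name"], rows_per_group=2)
print(groups)
print(to_pages(groups[0]["id"], values_per_page=1))
```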

File Format
  4-byte magic number "PAR1"
  ... (data: the column chunks of each row group)
  File Metadata
  4-byte length in bytes of file metadata
  4-byte magic number "PAR1"
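
Because the metadata length sits just before the trailing magic number, a reader can find the file metadata by working backwards from the end of the file. The sketch below demonstrates that on a fake in-memory file. The helper names are hypothetical, and the metadata is treated as opaque bytes; in a real file it is a Thrift-encoded structure, and the length is stored little-endian.

```python
import struct

MAGIC = b"PAR1"

def build_file(body, metadata):
    """Assemble a fake file with the layout: magic, body, metadata, length, magic."""
    return MAGIC + body + metadata + struct.pack("<I", len(metadata)) + MAGIC

def read_footer(data):
    """Return the raw file-metadata bytes located via the trailing length field."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: magic number missing")
    (meta_len,) = struct.unpack("<I", data[-8:-4])   # 4-byte little-endian length
    meta_start = len(data) - 8 - meta_len
    if meta_start < 4:
        raise ValueError("corrupt footer length")
    return data[meta_start:len(data) - 8]

fake = build_file(b"column chunks go here", b"fake-metadata")
print(read_footer(fake))
```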

Data Types: Boolean; Int32, Int64, Int96; Float; Double; Byte Array

Parquet Example - Avro: See my VM AWS https://github.com/adamjshook/hadoop-demos

References http://avro.apache.org http://parquet.io
