Lab #2 - Create a movies dataset

Lab #2 - Create a movies dataset
Lab #1 - VM setup: http://tiny.cloudera.com/StrataLab1
Lab #2 - Create a movies dataset: http://tiny.cloudera.com/StrataLab2
Use approximately 10 minutes for this lab, but get started early to consume slip time if people need help.

Strata New York 2015
Building an Apache Hadoop Data Application
Ryan Blue and Tom White

Content for today's tutorial:
The Hadoop Ecosystem
Storage on Hadoop
Movie ratings app: Data ingest
Movie ratings app: Data analysis

The Hadoop Ecosystem
Time: 1 hour (including lab #1 and 10 minutes of slip)

A Hadoop Stack

Processing frameworks
Code: MapReduce, Crunch, Spark, Flink
SQL: Hive, Impala, Phoenix, Trafodion, Drill, Presto
Tuples: Cascading, Pig
Streaming: Spark Streaming (micro-batch), Flink streaming, Storm, Samza
I'm not an expert on Cascading or stream processing. Mention Samza?

Coding frameworks
Crunch: a layer around MR (or Spark) that simplifies writing pipelines
Spark: a completely new framework for processing pipelines; takes advantage of memory and runs a DAG without extra map phases
Note that the last example is done in Crunch. It is similar to Spark, but can run on MR or Spark.
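For comparison with the Crunch example later in this deck, here is a minimal sketch of an average-rating-per-movie pipeline written against Spark's Java API. The input path and the tab-separated field layout (userId, movieId, rating) are assumptions for illustration, not the lab's actual data format.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class AverageRatingsSpark {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("avg-ratings"));

    // Parse (movieId, rating) pairs, group by movie, and average the ratings.
    JavaPairRDD<Long, Double> averages = sc.textFile("hdfs:///data/ratings.tsv")
        .mapToPair(line -> {
          String[] fields = line.split("\t");
          return new Tuple2<Long, Double>(Long.parseLong(fields[1]),
                                          Double.parseDouble(fields[2]));
        })
        .groupByKey()
        .mapValues(ratings -> {
          double sum = 0;
          long count = 0;
          for (double r : ratings) {
            sum += r;
            count++;
          }
          return sum / count;
        });

    averages.saveAsTextFile("hdfs:///out/movie-average-ratings");
    sc.stop();
  }
}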

SQL on Hadoop
Hive for batch processing
Impala for low-latency queries
SparkSQL to use SQL with Spark applications
Phoenix and Trafodion for transactional queries on HBase
Run queries with Hive and Impala over the dataset loaded in the lab and compare the execution times.
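As a concrete version of the exercise in the note above, a query can be submitted from Java over JDBC. HiveServer2 listens on port 10000 by default, and Impala speaks the same HiveServer2 protocol (typically on a different port). The host, database, table, and column names below are assumptions, not the lab's actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopRatedMovies {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 URL for a quickstart-style VM.
    String url = "jdbc:hive2://quickstart.cloudera:10000/default";
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(url, "", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT movie_id, COUNT(*) AS num_ratings " +
             "FROM ratings GROUP BY movie_id ORDER BY num_ratings DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1) + "\t" + rs.getLong(2));
      }
    }
  }
}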

Ingest tools
Relational: Sqoop, Sqoop2
Record channel: Kafka, Flume, NiFi
Files: NiFi
Numerous commercial options

Ingest tools
Relational: Sqoop, Sqoop2
Record channel: Kafka, Flume
Files: NiFi
[Diagram: an application and a database feeding HDFS in Hadoop through Flume]

Relational DB to Hadoop
Sqoop: CLI to run MR-based import jobs
Sqoop2: fixes Sqoop's configuration problems with a credentials service; more flexible, can run on non-MR frameworks; under development

Ingest tools
Relational: Sqoop, Sqoop2
Record channel: Kafka, Flume, NiFi
Files: NiFi
[Diagram: the application writing through Flume channels into HDFS, alongside the database, inside Hadoop]

Record streams to Hadoop
Flume: source, channel, sink architecture; well-established and integrated with other tools; no ordering guarantee, duplicates are possible
Kafka: pub-sub model for low latencies; partitioned, provides ordering guarantees, easier to eliminate duplicates; more resilient to node failure with consumer groups
Flume is used in the example because it is the more established and better supported route.
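For contrast with the Flume route used in the labs, a Kafka producer keyed by movie id would look roughly like the sketch below. The broker address, topic name, and record encoding are assumptions for illustration.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RatingProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Hypothetical broker; serializers send plain strings.
    props.put("bootstrap.servers", "broker1:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Keying by movie id keeps all ratings for a movie in one partition,
      // which is where Kafka's per-partition ordering guarantee applies.
      producer.send(new ProducerRecord<>("ratings", "movie-42", "user=7,rating=5"));
    }
  }
}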

Files to Hadoop
NiFi: web GUI for drag & drop configuration of a data flow
Enterprise features: back-pressure, monitoring, lineage, etc.
Read from local disk, HTTP, FTP, SFTP, HDFS, . . .
Originally for files, also handles record-based streams

NiFi

Data storage in Hadoop
Go through the solution to Lab #2.

HDFS Blocks
Blocks:
Increase parallelism
Balance work
Replicated
Configured by dfs.blocksize (a client-side setting)
[Diagram: data1.avro ... data2.avro split into replicated blocks]
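Because dfs.blocksize is a client-side setting, the writer (not the NameNode) decides the block size of the files it creates. A minimal sketch with the Hadoop FileSystem API, assuming a 256 MB block size and an illustrative output path:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Client-side setting: files created by this writer use 256 MB blocks.
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

    FileSystem fs = FileSystem.get(conf);
    try (FSDataOutputStream out = fs.create(new Path("/data/movies/data1.avro"))) {
      // A real writer would stream Avro content here.
      out.write("placeholder".getBytes(StandardCharsets.UTF_8));
    }
  }
}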

Splittable File Formats
Splittable: able to process part of a file independently of others
Process blocks in parallel
Avro is splittable
Gzipped content is not splittable
CSV is effectively not splittable
Ask about splittability: which of the following would be splittable? XML? JSON?

File formats
Existing formats: XML, JSON, Protobuf, Thrift
Designed for Hadoop: SequenceFile, RCFile, ORC
Never use: delimited text
Recommended: Avro or Parquet
Note that ORC is a well-designed format as well, but not on the recommended list because it is Hive-specific.

Avro
Recommended row-oriented format
Broken into blocks with sync markers for splitting
Binary encoding with block-level compression
Avro schema: required to read any binary-encoded data! Written in the file header
Flexible object models
Time: 10 minutes for Avro overview and tools lab

Avro in-memory object models
generic: object model that can be used with any schema
specific: compile schema to Java objects; generates type-safe runtime objects
reflect: Java objects to schema; uses existing classes and objects
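A small sketch of the generic object model: build a record against a schema and write it to an Avro data file. The Rating schema here is hypothetical; the lab dataset's real schema may differ.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteRatingsAvro {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema for illustration only.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rating\",\"fields\":[" +
        "{\"name\":\"movie_id\",\"type\":\"long\"}," +
        "{\"name\":\"user_id\",\"type\":\"long\"}," +
        "{\"name\":\"rating\",\"type\":\"int\"}]}");

    // Generic object model: no generated classes required.
    GenericRecord rating = new GenericData.Record(schema);
    rating.put("movie_id", 42L);
    rating.put("user_id", 7L);
    rating.put("rating", 5);

    // The writer embeds the schema in the file header and adds sync markers
    // between blocks, which is what makes the file splittable.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("ratings.avro"));
      writer.append(rating);
    }
  }
}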

Optional Lab #3 - Using avro-tools http://tiny.cloudera.com/StrataLab3 Inspect one of the Avro files in the dataset created for lab #2

Row- and column-oriented formats
Able to reduce I/O when projecting columns
Better encoding and compression
[Diagram: the same records (a1 b1 c1 d1, a2 b2 c2 d2, ...) laid out by rows versus by columns A, B, C, D]
Images © Twitter, Inc. https://blog.twitter.com/2013/dremel-made-simple-with-parquet

Parquet
Recommended column-oriented format
Splittable by organizing into row groups
Efficient binary encoding, supports compression
Uses other object models: record construction API rather than its own object model
parquet-avro: use Avro schemas with generic or specific records
parquet-protobuf, parquet-thrift, parquet-hive, etc.
Time: 10 minutes for Parquet overview and tools lab

Parquet trade-offs
Rows are buffered into groups that target a final size (the row group size)
Memory consumption grows with row group size
Larger groups get more I/O benefit and better encoding
Memory consumption grows for each open file
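A sketch of writing the same kind of records as Parquet through parquet-avro, with the row group size set explicitly. The builder-style API shown assumes a reasonably recent parquet-avro version, and the schema is the same hypothetical one as in the Avro sketch above.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriteRatingsParquet {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema for illustration only.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rating\",\"fields\":[" +
        "{\"name\":\"movie_id\",\"type\":\"long\"}," +
        "{\"name\":\"rating\",\"type\":\"int\"}]}");

    GenericRecord rating = new GenericData.Record(schema);
    rating.put("movie_id", 42L);
    rating.put("rating", 5);

    // Row group size is the main knob: larger groups improve scan I/O and
    // encoding, but raise memory use for every open file (128 MB here).
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
            .<GenericRecord>builder(new Path("ratings.parquet"))
            .withSchema(schema)
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .withRowGroupSize(128 * 1024 * 1024)
            .build()) {
      writer.write(rating);
    }
  }
}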

Optional Lab #4 - Using parquet-tools
http://tiny.cloudera.com/StrataLab4
Copy the Avro dataset from lab #2 to a new Parquet dataset and inspect one of its files. Things to note:
Parquet schema, not Avro schema
Row group info
Encoding and compression results
Page-level information

Partitioning
Splittable file formats aren't enough: not processing data at all is better than processing it in parallel
Organize data to avoid processing: partitioning
Use HDFS paths as a coarse index: data/y=2015/m=03/d=14/

Partitioning
[Diagram: the events dataset partitioned by year= (2014, 2015), month= (November, December, January, February), and day= (30, 31, 1, 2, 3, ...)]

Partitioning Caution
Partitioning in HDFS is the primary index to data
Should reflect the most common access pattern
Test partition strategies for multiple workloads
Should balance file size with workload:
Lots of small files are bad for HDFS - partitioning should be more coarse
Larger files take longer to find data - partitioning should be more specific

Implementing partitioning
Build your own - not recommended
Hive and Impala managed: partitions are treated as data columns; insert statements must include partition calculations
Kite managed: partition strategy configuration file; compatible with Hive and Impala

Kite
High-level data API for Hadoop
Built around datasets, not files
Tasks like partitioning are done internally
Tools built around the data API: command line; integration in Flume, Sqoop, NiFi, etc.
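A rough sketch of creating a Kite dataset with a managed partition strategy (year/month/day derived from a timestamp field). The dataset URI, schema, and field name are illustrative assumptions, not the lab's actual configuration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.PartitionStrategy;

public class CreatePartitionedRatings {
  public static void main(String[] args) {
    // Hypothetical schema with a long "timestamp" field to partition on.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rating\",\"fields\":[" +
        "{\"name\":\"timestamp\",\"type\":\"long\"}," +
        "{\"name\":\"movie_id\",\"type\":\"long\"}," +
        "{\"name\":\"rating\",\"type\":\"int\"}]}");

    // year=/month=/day= directories are derived from the timestamp field.
    PartitionStrategy strategy = new PartitionStrategy.Builder()
        .year("timestamp")
        .month("timestamp")
        .day("timestamp")
        .build();

    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schema(schema)
        .partitionStrategy(strategy)
        .build();

    // Kite lays out Hive-compatible partition directories; writers never
    // deal with HDFS paths directly.
    Dataset<GenericRecord> ratings =
        Datasets.create("dataset:hive:ratings", descriptor, GenericRecord.class);
    System.out.println("Created dataset: " + ratings.getName());
  }
}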

Lab #5 - Create a partitioned dataset http://tiny.cloudera.com/StrataLab5

Movie ratings app: Data ingest pipeline
Time: 1 hour including Lab #6, 10 minutes of slip

Movie ratings scenario
Your company runs a web application where users can rate movies
You want to use Hadoop to analyze ratings over time
Avoid scraping the production database for changes
Instead, you want to log every rating submitted
[Diagram: the ratings app backed by a database]
Time: 20 minutes to set up the scenario and describe the Hadoop architecture
Also cover:
Where other tools might be used
Why the dataset is Avro and not Parquet
Why there is a division between Hadoop and the application
Security and impersonation with Flume

Movie ratings app
Log ratings to Flume
Otherwise unchanged
[Diagram: the app and database unchanged; ratings flow through Flume into an HDFS dataset in Hadoop]

Why Flume?
Database and application constraints: Can the application change? Can the database handle exports?
Why not Kafka? Kafka is a channel or queue; a listener and an HDFS writer still need to run somewhere
[Diagram: the ratings app sending events to Flume agents]
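One way an application can hand rating events to a Flume Avro source is the flume-ng-sdk RpcClient, sketched below. The real lab application may use a different client (for example a log4j appender), and the host, port, and record encoding here are assumptions.

import java.nio.charset.StandardCharsets;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class RatingsFlumeLogger {
  private final RpcClient client;

  public RatingsFlumeLogger(String host, int port) {
    // Connects to a Flume Avro source; host and port are assumptions.
    this.client = RpcClientFactory.getDefaultInstance(host, port);
  }

  public void logRating(long userId, long movieId, int rating) throws Exception {
    // Tab-separated encoding chosen only for illustration.
    String line = userId + "\t" + movieId + "\t" + rating;
    client.append(EventBuilder.withBody(line.getBytes(StandardCharsets.UTF_8)));
  }

  public void close() {
    client.close();
  }
}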

Lab #6 - Create a Flume pipeline
http://tiny.cloudera.com/StrataLab6
Time: 10 minutes for users to get through the application setup
5 minutes to run through the demo up front
5 minutes to let everyone submit queries to the application and discuss
10 minutes to go over relevant code and configuration, discuss trade-offs, and note what could be changed

Movie ratings app: Analyzing ratings data

Movie ratings analysis
Now you have several months of data
You can query it in Hive and Impala for most cases
Some questions are difficult to formulate as SQL: Are there any movies that people either love or hate?
Time: 5 minutes to set up and describe the task

Analyzing ratings
Map: extract movie_id as the key and rating as the value
Reduce: groups all of the ratings by movie_id
Count the number of ratings for each movie, per rating value
If there are two peaks, output the movie_id and counts
Peak detection: the difference between consecutive counts goes from negative to positive
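A sketch of the reduce-side peak detection described above, assuming ratings are small integers (for example 1-5). This is illustrative logic, not the exact code from the lab.

import java.util.TreeMap;

public class PeakDetection {
  // Returns true if the per-rating counts have two peaks, i.e. the difference
  // between consecutive counts goes from negative back to positive somewhere
  // in the middle ("love it or hate it" movies).
  public static boolean hasTwoPeaks(Iterable<Integer> ratings) {
    // Count how many times each rating value occurs, in rating order.
    TreeMap<Integer, Long> counts = new TreeMap<>();
    for (int r : ratings) {
      counts.merge(r, 1L, Long::sum);
    }

    Long previous = null;
    boolean sawDecrease = false;
    for (long count : counts.values()) {
      if (previous != null) {
        long diff = count - previous;
        if (diff < 0) {
          sawDecrease = true;
        } else if (diff > 0 && sawDecrease) {
          return true; // counts dipped and rose again: two peaks
        }
      }
      previous = count;
    }
    return false;
  }
}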

Crunch background
Stack up functions until a group-by operation to make a map phase
Similarly, stack up functions after a group-by to make a reduce phase
Additional group-by operations set up more MR rounds automatically

PTable<Long, Double> table = collection
    .by(new GetMovieID(), Avros.longs())
    .mapValues(new GetRating(), Avros.ints())
    .groupByKey()
    .mapValues(new AverageRating(), Avros.doubles());

Time: 10 minutes to discuss Crunch background and show the application

Lab #7 - Analyze ratings with Crunch
http://tiny.cloudera.com/StrataLab7
Time: 10 minutes for students to run the lab
5 minutes to run the lab up front
10 minutes to discuss the results, add an output dataset

Thank you
blue@cloudera.com
tom@cloudera.com