Presentation on theme: "Lab #2 - Create a movies dataset"— Presentation transcript:

1 Lab #2 - Create a movies dataset
Lab #1 - VM setup Lab #2 - Create a movies dataset Use approximately 10 minutes for this lab, but get started early to consume slip time if people need help

2 Strata New York 2015 Building an Apache Hadoop Data Application
Ryan Blue and Tom White

3 Content for today’s tutorial
The Hadoop Ecosystem Storage on Hadoop Movie ratings app: Data ingest Movie ratings app: Data analysis

4 The Hadoop Ecosystem Time: 1 hour (including lab #1 and 10 minutes slip)

5 A Hadoop Stack

6 Processing frameworks
Code: MapReduce, Crunch, Spark, Flink SQL: Hive, Impala, Phoenix, Trafodion, Drill, Presto Tuples: Cascading, Pig Streaming: Spark Streaming (micro-batch), Flink streaming, Storm, Samza I’m not an expert on Cascading or stream processing. Mention Samza?

7 Coding frameworks Crunch
A layer around MR (or Spark) that simplifies writing pipelines Spark A completely new framework for processing pipelines Takes advantage of memory, runs a DAG without extra map phases Note that the last example is done in Crunch. It is similar to Spark, but can run on MR or Spark.

8 SQL on Hadoop Hive for batch processing Impala for low-latency queries
SparkSQL to use SQL with Spark applications Phoenix and Trafodion for transactional queries on HBase Run queries with Hive and Impala over the dataset loaded in the lab and compare the execution times

9 Ingest tools Relational: Sqoop, Sqoop2
Record channel: Kafka, Flume, NiFi Files: NiFi Numerous commercial options

10 Ingest tools Relational: Sqoop, Sqoop2 Record channel: Kafka, Flume
Files: NiFi (Diagram: an application and a database feeding Flume agents, which deliver data into HDFS on Hadoop)

11 Relational DB to Hadoop
Sqoop CLI to run MR-based import jobs Sqoop2 Fixes Sqoop's configuration problems with a credentials service More flexible to run on non-MR frameworks Under development Run queries with Hive and Impala over the dataset loaded in the lab and compare the execution times

12 Ingest tools Relational: Sqoop, Sqoop2
Record channel: Kafka, Flume, NiFi Files: NiFi (Diagram: an app-side Flume agent and Flume channels moving records from the application and database into HDFS on Hadoop)

13 Record streams to Hadoop
Flume - source, channel, sink architecture Well-established and integrated with other tools No order guarantee, duplicates are possible Kafka - pub-sub model for low latencies Partitioned, provides ordering guarantees, easier to eliminate duplicates More resilient to node failure with consumer groups Flume is used in the example - because it is the established and more supported route

14 Files to Hadoop NiFi Web GUI for drag & drop configuration of a data flow Enterprise features: back-pressure, monitoring, lineage, etc. Read from local disk, HTTP, FTP, SFTP, HDFS, . . . Originally for files, also handles record-based streams

15 NiFi

16 Data storage in Hadoop Go through the solution to Lab #2

17 HDFS Blocks
Blocks increase parallelism, balance work, and are replicated. Block size is configured by dfs.blocksize, a client-side setting. (Diagram: data1.avro and data2.avro split into replicated blocks)
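Since dfs.blocksize is a client-side setting, each writer can request its own block size. A minimal sketch using the standard Hadoop Configuration API (the path and 256 MB value are illustrative, not from the tutorial):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask for 256 MB blocks for files written by this client; the value is
    // sent with the create request rather than configured on the cluster.
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

    FileSystem fs = FileSystem.get(conf);
    try (FSDataOutputStream out = fs.create(new Path("data/data1.avro"))) {
      out.writeBytes("placeholder");  // real data would be Avro-encoded
    }
  }
}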

18 Splittable File Formats
Splittable: Able to process part of a file independently of others Process blocks in parallel Avro is splittable Gzipped content is not splittable CSV is effectively not splittable Ask about splittability: Which of the following would be splittable? XML JSON

19 File formats Existing formats: XML, JSON, Protobuf, Thrift
Designed for Hadoop: SequenceFile, RCFile, ORC Never use: Delimited text Recommended: Avro or Parquet Note that ORC is a well-designed format as well, but not on the recommended list because it is Hive-specific.

20 Avro Recommended row-oriented format
Broken into blocks with sync markers for splitting Binary encoding with block-level compression Avro schema Required to read any binary-encoded data! Written in the file header Flexible object models Time: 10 minutes for Avro overview and tools lab
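A minimal sketch of writing an Avro data file with the Java generic API; the two-field Rating schema and file name are illustrative, not the lab's dataset:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rating\",\"fields\":["
        + "{\"name\":\"movie_id\",\"type\":\"long\"},"
        + "{\"name\":\"rating\",\"type\":\"int\"}]}");

    GenericRecord rating = new GenericData.Record(schema);
    rating.put("movie_id", 42L);
    rating.put("rating", 5);

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.deflateCodec(6));    // block-level compression
      writer.create(schema, new File("ratings.avro"));  // the schema is written into the file header
      writer.append(rating);                            // records are buffered into sync-marked blocks
    }
  }
}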

21 Avro in-memory object models
generic - an object model that can be used with any schema
specific - compiles a schema to Java classes; generates type-safe runtime objects
reflect - derives a schema from existing Java classes and objects
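A short sketch contrasting the reflect and specific models; the Rating class here is hypothetical, not the lab's generated class:

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class ObjectModelExample {
  // An existing application class; reflect derives an Avro schema from its fields.
  public static class Rating {
    public long movieId;
    public int rating;
  }

  public static void main(String[] args) {
    Schema reflected = ReflectData.get().getSchema(Rating.class);
    System.out.println(reflected.toString(true));
    // The specific model goes the other way: a schema is compiled (avro-tools or the
    // avro-maven-plugin) into generated classes with type-safe getters and setters.
  }
}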

22 Optional Lab #3 - Using avro-tools http://tiny.cloudera.com/StrataLab3
Inspect one of the Avro files in the dataset created for lab #2

23 Row- and column-oriented formats
Able to reduce I/O when projecting columns Better encoding and compression (Diagram: the same fields A-D of rows 1 and 2 laid out row-by-row vs. column-by-column; images © Twitter, Inc.)

24 Parquet Recommended column-oriented format
Splittable by organizing into row groups Efficient binary encoding, supports compression Uses other object models Record construction API rather than object model parquet-avro - Use Avro schemas with generic or specific records parquet-protobuf, parquet-thrift, parquet-hive, etc. Time: 10 minutes for Parquet overview and tools lab
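A sketch of writing the same kind of records through parquet-avro, assuming a Parquet version with the builder API; the schema, codec, row group size, and path are illustrative:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rating\",\"fields\":["
        + "{\"name\":\"movie_id\",\"type\":\"long\"},"
        + "{\"name\":\"rating\",\"type\":\"int\"}]}");

    GenericRecord rating = new GenericData.Record(schema);
    rating.put("movie_id", 42L);
    rating.put("rating", 5);

    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("ratings.parquet"))
                 .withSchema(schema)                               // Avro schema, converted to a Parquet schema
                 .withCompressionCodec(CompressionCodecName.SNAPPY)
                 .withRowGroupSize(128 * 1024 * 1024)              // rows are buffered until this target size
                 .build()) {
      writer.write(rating);
    }
  }
}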

25 Parquet trade-offs
Rows are buffered into groups that target a final size. Row group size: memory consumption grows with row group size; larger groups get more I/O benefit and better encoding; memory consumption also grows for each open file.

26 Optional Lab #4 Using parquet-tools
Copy the Avro dataset from lab #2 to a new Parquet dataset and inspect one of its files. Things to note: Parquet schema, not Avro schema Row group info Encoding and compression results Page-level information

27 Partitioning Splittable file formats aren’t enough
Not processing data is better than processing in parallel Organize data to avoid processing: Partitioning Use HDFS paths for a coarse index: data/y=2015/m=03/d=14/
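A tiny illustration of the "paths as a coarse index" idea (not Kite's implementation; the layout matches the slide's example):

import java.time.LocalDate;

public class PartitionPathExample {
  // Build an HDFS-style partition directory such as data/y=2015/m=03/d=14/
  static String partitionPath(LocalDate date) {
    return String.format("data/y=%d/m=%02d/d=%02d/",
        date.getYear(), date.getMonthValue(), date.getDayOfMonth());
  }

  public static void main(String[] args) {
    System.out.println(partitionPath(LocalDate.of(2015, 3, 14)));
    // A query for March 2015 then only needs to read files under data/y=2015/m=03/
  }
}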

28 Partitioning
(Diagram: an events dataset partitioned into directories by year=2014, 2015; month=November, December, January, February; day=30, 31, 1, 2, 3, ...)

29 Partitioning Caution Partitioning in HDFS is the primary index to data
Should reflect the most common access pattern Test partition strategies for multiple workloads Should balance file size with workload Lots of small files are bad for HDFS, so partitioning should be more coarse Very large files take longer to find data in, so partitioning should be more specific

30 Implementing partitioning
Build your own - not recommended Hive and Impala managed Partitions are treated as data columns Insert statements must include partition calculations Kite managed Partition strategy configuration file Compatible with Hive and Impala

31 Kite High-level data API for Hadoop Built around datasets, not files
Tasks like partitioning are done internally Tools built around the data API Command-line Integration in Flume, Sqoop, NiFi, etc.
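A hedged sketch of Kite-managed partitioning; the dataset URI, schema location, and timestamp field are assumptions, and the same strategy can also be supplied as a JSON partition strategy file:

import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.PartitionStrategy;

public class KitePartitionExample {
  public static void main(String[] args) throws Exception {
    // Partition by year/month/day derived from each record's "timestamp" field.
    PartitionStrategy strategy = new PartitionStrategy.Builder()
        .year("timestamp")
        .month("timestamp")
        .day("timestamp")
        .build();

    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schemaUri("resource:rating.avsc")   // assumed Avro schema on the classpath
        .partitionStrategy(strategy)
        .build();

    // Kite lays out the partition directories itself; Hive and Impala see them as columns.
    Dataset<GenericRecord> ratings =
        Datasets.create("dataset:hive:ratings", descriptor, GenericRecord.class);
  }
}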

32 Lab #5 - Create a partitioned dataset

33 Movie ratings app: Data ingest pipeline
Time: 1 hour including Lab #6, 10 minutes slip

34 Movie ratings scenario
Your company runs a web application where users can rate movies You want to use Hadoop to analyze ratings over time Avoid scraping the production database for changes Instead, you want to log every rating submitted Ratings App Database Time: 20 minutes to setup the scenario and describe the Hadoop architecture Also cover: Where other tools might be used Why the dataset is Avro and not Parquet Why there is a division between Hadoop and the application Security and impersonation with Flume

35 Movie ratings app
Log ratings to Flume; otherwise unchanged. (Diagram: the ratings app and database, with Flume writing ratings into an HDFS dataset on Hadoop)

36 Why Flume? Database and application constraints
Can the application change? Can the database handle exports? Why not Kafka? Kafka is a channel or queue; a listener and HDFS writer still need to run somewhere (Diagram: the ratings app sending events to Flume agents) Time: 20 minutes to setup the scenario and describe the Hadoop architecture Also cover: Where other tools might be used Why the dataset is Avro and not Parquet Why there is a division between Hadoop and the application Security and impersonation with Flume
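One way the ratings app can hand events to a Flume agent from Java is the Flume SDK RPC client. A hedged sketch; the host, port, and JSON event body are assumptions rather than the tutorial's actual configuration:

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class RatingLogger {
  public static void main(String[] args) throws Exception {
    // Connect to a local Flume agent's Avro source.
    RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
    try {
      Event event = EventBuilder.withBody(
          "{\"movie_id\": 42, \"rating\": 5}", StandardCharsets.UTF_8);
      client.append(event);  // one rating per event; the agent's sink delivers it to HDFS
    } finally {
      client.close();
    }
  }
}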

37 Lab #6 - Create a Flume pipeline http://tiny.cloudera.com/StrataLab6
Time: 10 minutes for users to get through the application setup 5 minutes to run through the demo up front 5 minutes to let everyone submit queries to the application and discuss 10 minutes to go over relevant code and configuration, discuss trade-offs, and note what could be changed

38 Movie ratings app: Analyzing ratings data

39 Movie ratings analysis
Now you have several months of data You can query it in Hive and Impala for most cases Some questions are difficult to formulate as SQL Are there any movies that people either love or hate? Time: 5 minutes to setup and describe task

40 Analyzing ratings
Map: extract movie_id as the key and rating as the value. Reduce: group all of the ratings by movie_id, count how many times each rating value was given for each movie, and if the counts show two peaks, output the movie_id and counts. Peak detection: the difference between consecutive counts goes from negative to positive.
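A minimal sketch of the peak-detection rule described above, applied to one movie's histogram of rating counts (the 1-5 scale is an assumption):

public class PeakDetection {
  // counts[i] = number of times rating value (i + 1) was given for one movie.
  // A "love it or hate it" movie has two peaks with a valley between them.
  static boolean hasTwoPeaks(long[] counts) {
    long prevDiff = 0;
    for (int i = 1; i < counts.length; i++) {
      long diff = counts[i] - counts[i - 1];
      // A valley between two peaks is where the difference between
      // consecutive counts goes from negative to positive.
      if (prevDiff < 0 && diff > 0) {
        return true;
      }
      if (diff != 0) {
        prevDiff = diff;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(hasTwoPeaks(new long[] {80, 20, 5, 25, 90}));  // true: peaks at ratings 1 and 5
    System.out.println(hasTwoPeaks(new long[] {5, 10, 40, 30, 15}));  // false: single peak at rating 3
  }
}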

41 Crunch background Stack up functions until a group-by operation to make a map phase Similarly, stack up functions after a group-by to make a reduce phase Additional group-by operations set up more MR rounds automatically PTable<Long, Double> table = collection .by(new GetMovieID(), Avros.longs()) .mapValues(new GetRating(), Avros.ints()) .groupByKey() .mapValues(new AverageRating(), Avros.doubles()); Time: 10 minutes to discuss crunch background and show application

42 Lab #7 - Analyze ratings with Crunch
Time: 10 minutes for students to run the lab 5 minutes to run the lab up front 10 minutes to discuss the results, add an output dataset

43 blue@cloudera.com tom@cloudera.com
Thank you

