Lab #1 - VM setup http://tiny.cloudera.com/StrataLab1 Lab #2 - Create a movies dataset http://tiny.cloudera.com/StrataLab2 Use approximately 10 minutes for this lab, but get started early to consume slip time if people need help
Strata New York 2015 Building an Apache Hadoop Data Application Ryan Blue and Tom White
Content for today’s tutorial The Hadoop Ecosystem Storage on Hadoop Movie ratings app: Data ingest Movie ratings app: Data analysis
The Hadoop Ecosystem Time: 1 hour (including lab #1 and 10 minutes slip)
A Hadoop Stack
Processing frameworks Code: MapReduce, Crunch, Spark, Flink SQL: Hive, Impala, Phoenix, Trafodion, Drill, Presto Tuples: Cascading, Pig Streaming: Spark Streaming (micro-batch), Flink streaming, Storm, Samza I'm not an expert on Cascading or stream processing. Mention Samza?
Coding frameworks Crunch A layer around MR (or Spark) that simplifies writing pipelines Spark A completely new framework for processing pipelines Takes advantage of memory, runs a DAG without extra map phases Note that the last example is done in Crunch. It is similar to Spark, but can run on MR or Spark.
SQL on Hadoop Hive for batch processing Impala for low-latency queries SparkSQL to use SQL with Spark applications Phoenix and Trafodion for transactional queries on HBase Run queries with Hive and Impala over the dataset loaded in the lab and compare the execution times
Ingest tools Relational: Sqoop, Sqoop2 Record channel: Kafka, Flume, NiFi Files: NiFi Numerous commercial options
Ingest tools Relational: Sqoop, Sqoop2 Record channel: Kafka, Flume Files: NiFi [Diagram: an App and a Database as sources, with Flume agents delivering data into HDFS on Hadoop]
Relational DB to Hadoop Sqoop CLI to run MR-based import jobs Sqoop2 Fixes Sqoop's configuration problems with a credentials service More flexible to run on non-MR frameworks Under development
Ingest tools Relational: Sqoop, Sqoop2 Record channel: Kafka, Flume, NiFi Files: NiFi [Diagram: the App and Database feeding Flume agents and channels, which deliver data into HDFS on Hadoop]
Record streams to Hadoop Flume - source, channel, sink architecture Well-established and integrated with other tools No order guarantee, duplicates are possible Kafka - pub-sub model for low latencies Partitioned, provides ordering guarantees, easier to eliminate duplicates More resilient to node failure with consumer groups Flume is used in the example because it is the more established and better supported route
Files to Hadoop NiFi Web GUI for drag & drop configuration of a data flow Enterprise features: back-pressure, monitoring, lineage, etc. Read from local disk, HTTP, FTP, SFTP, HDFS, ... Originally for files, also handles record-based streams
NiFi
Data storage in Hadoop Go through the solution to Lab #2
HDFS Blocks Blocks increase parallelism and balance work Replicated Configured by dfs.blocksize (a client-side setting) [Diagram: data1.avro and data2.avro, each split into replicated HDFS blocks]
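Because dfs.blocksize is read on the client, an application can override it per write. A minimal sketch, assuming a Hadoop client on the classpath (the path and the 256 MB value are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // dfs.blocksize is a client-side setting: the value in this Configuration,
    // not the NameNode's default, decides the block size of files this client creates
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/data/example/file1"));
    out.write("example bytes".getBytes("UTF-8"));
    out.close();
  }
}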
Splittable File Formats Splittable: Able to process part of a file independently of others Process blocks in parallel Avro is splittable Gzipped content is not splittable CSV is effectively not splittable Ask about splittability: Which of the following would be splittable? XML JSON
File formats Existing formats: XML, JSON, Protobuf, Thrift Designed for Hadoop: SequenceFile, RCFile, ORC Never use: Delimited text Recommended: Avro or Parquet Note that ORC is a well-designed format as well, but not on the recommended list because it is Hive-specific.
Avro Recommended row-oriented format Broken into blocks with sync markers for splitting Binary encoding with block-level compression Avro schema Required to read any binary-encoded data! Written in the file header Flexible object models Time: 10 minutes for Avro overview and tools lab
Avro in-memory object models generic - object model that can be used with any schema specific - compile the schema to a Java object Generates type-safe runtime objects reflect - Java object to schema Uses existing classes and objects
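As a minimal sketch of the generic object model, the Java below parses a schema at runtime and writes one record to an Avro data file. The Rating schema with movie_id and rating fields is illustrative, not necessarily the schema used in the labs:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class GenericAvroExample {
  public static void main(String[] args) throws Exception {
    // The generic model needs no generated classes: parse the schema at runtime
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Rating\", \"fields\": ["
        + "{\"name\": \"movie_id\", \"type\": \"long\"},"
        + "{\"name\": \"rating\", \"type\": \"int\"}]}");

    GenericRecord rating = new GenericData.Record(schema);
    rating.put("movie_id", 42L);
    rating.put("rating", 5);

    // The schema is written into the file header, so any reader can decode the binary data
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("ratings.avro"));
      writer.append(rating);
    }
  }
}

With the specific model the same record would instead be an instance of a generated Rating class; reflect derives the schema from an existing Java class.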
Optional Lab #3 - Using avro-tools http://tiny.cloudera.com/StrataLab3 Inspect one of the Avro files in the dataset created for lab #2
Row- and column-oriented formats Able to reduce I/O when projecting columns Better encoding and compression [Diagram: records with fields A, B, C, D (a1 b1 c1 d1, a2 b2 c2 d2, ...) laid out row-wise vs. column-wise] Images © Twitter, Inc. https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Parquet Recommended column-oriented format Splittable by organizing into row groups Efficient binary encoding, supports compression Uses other object models Record construction API rather than object model parquet-avro - Use Avro schemas with generic or specific records parquet-protobuf, parquet-thrift, parquet-hive, etc. Time: 10 minutes for Parquet overview and tools lab
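A minimal parquet-avro sketch, assuming Parquet 1.8+ package names (org.apache.parquet) and the same illustrative Rating schema as above; it writes Avro generic records into a Parquet file and sets compression and the row group size discussed on the next slide:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetAvroExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Rating\", \"fields\": ["
        + "{\"name\": \"movie_id\", \"type\": \"long\"},"
        + "{\"name\": \"rating\", \"type\": \"int\"}]}");

    GenericRecord rating = new GenericData.Record(schema);
    rating.put("movie_id", 42L);
    rating.put("rating", 5);

    // parquet-avro accepts Avro schemas and records, but writes Parquet's own format
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("ratings.parquet"))
                 .withSchema(schema)
                 .withCompressionCodec(CompressionCodecName.SNAPPY)
                 .withRowGroupSize(128 * 1024 * 1024) // larger groups: better I/O, more memory
                 .build()) {
      writer.write(rating);
    }
  }
}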
Parquet trade-offs Rows are buffered into groups that target a final size Row group size Memory consumption grows with row group size Larger groups get more I/O benefit and better encoding Memory consumption grows for each open file
Optional Lab #4 Using parquet-tools http://tiny.cloudera.com/StrataLab4 Copy the Avro dataset from lab #2 to a new Parquet dataset and inspect one of its files. Things to note: Parquet schema, not Avro schema Row group info Encoding and compression results Page-level information
Partitioning Splittable file formats aren’t enough Not processing data is better than processing in parallel Organize data to avoid processing: Partitioning Use HDFS paths for a coarse index: data/y=2015/m=03/d=14/
Partitioning [Diagram: an events dataset partitioned into directories by year= (2014, 2015), month= (November, December, January, February), and day= (30, 31, 1, 2, 3, ...)]
Partitioning Caution Partitioning in HDFS is the primary index to the data Should reflect the most common access pattern Test partition strategies for multiple workloads Should balance file size with workload Too fine creates lots of small files, which are bad for HDFS - partitioning should be more coarse Too coarse creates large files where finding data takes longer - partitioning should be more specific
Implementing partitioning Build your own - not recommended Hive and Impala managed Partitions are treated as data columns Insert statements must include partition calculations Kite managed Partition strategy configuration file Compatible with Hive and Impala
Kite High-level data API for Hadoop Built around datasets, not files Tasks like partitioning are done internally Tools built around the data API Command-line Integration in Flume, Sqoop, NiFi, etc.
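As a hedged sketch of the Kite Java API (the dataset URI dataset:hive:ratings and the rating.avsc schema resource are illustrative, not the labs' exact names), a partition strategy is plain configuration attached to the dataset descriptor:

import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.PartitionStrategy;

public class KitePartitionExample {
  public static void main(String[] args) throws Exception {
    // Derive year/month/day partitions from a timestamp field in the records
    PartitionStrategy strategy = new PartitionStrategy.Builder()
        .year("timestamp")
        .month("timestamp")
        .day("timestamp")
        .build();

    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schemaUri("resource:rating.avsc") // Avro schema on the classpath
        .partitionStrategy(strategy)
        .build();

    // Kite lays out the HDFS directories and registers Hive metadata,
    // so Hive and Impala can query the partitioned data directly
    Dataset<GenericRecord> ratings =
        Datasets.create("dataset:hive:ratings", descriptor, GenericRecord.class);
  }
}

The same strategy can also be expressed as a JSON partition strategy configuration file for the Kite command-line tools.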
Lab #5 - Create a partitioned dataset http://tiny.cloudera.com/StrataLab5
Movie ratings app: Data ingest pipeline Time: 1 hour including Lab #6, 10 minutes slip
Movie ratings scenario Your company runs a web application where users can rate movies You want to use Hadoop to analyze ratings over time Avoid scraping the production database for changes Instead, you want to log every rating submitted [Diagram: the Ratings App backed by a production Database] Time: 20 minutes to set up the scenario and describe the Hadoop architecture Also cover: Where other tools might be used Why the dataset is Avro and not Parquet Why there is a division between Hadoop and the application Security and impersonation with Flume
Movie ratings app Log ratings to Flume Otherwise unchanged [Diagram: the app logs ratings to Flume, which writes them to an HDFS dataset on Hadoop; the database is unchanged]
Why Flume? Database and application constraints Can the application change? Can the database handle exports? Why not Kafka? Kafka is a channel or queue Listener and HDFS writer need to run somewhere [Diagram: the Ratings App sending events through Flume agents]
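One way the application could hand rating events to a Flume Avro source is Flume's RPC client API. A hedged sketch (host, port, and the tab-separated event body are illustrative; the lab may wire this up differently, for example through a log4j appender):

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class RatingLogger {
  private final RpcClient client;

  public RatingLogger(String flumeHost, int flumePort) {
    // Connects to a Flume Avro source listening on the given host and port
    this.client = RpcClientFactory.getDefaultInstance(flumeHost, flumePort);
  }

  public void logRating(long movieId, int rating) throws Exception {
    // The Flume sink on the other end is responsible for writing to HDFS
    Event event = EventBuilder.withBody(movieId + "\t" + rating, StandardCharsets.UTF_8);
    client.append(event);
  }

  public void close() {
    client.close();
  }
}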
Lab #6 - Create a Flume pipeline http://tiny.cloudera.com/StrataLab6 Time: 10 minutes for users to get through the application setup 5 minutes to run through the demo up front 5 minutes to go let everyone submit queries to the application and discuss 10 minutes to go over relevant code and configuration, discuss trade-offs, and note what could be changed
Movie ratings app: Analyzing ratings data
Movie ratings analysis Now you have several months of data You can query it in Hive and Impala for most cases Some questions are difficult to formulate as SQL Are there any movies that people either love or hate? Time: 5 minutes to setup and describe task
Analyzing ratings Map: extract the key (movie_id) and the value (rating) Reduce: group all of the ratings by movie_id Count how many times each rating value was given for the movie If the counts show two peaks, output the movie_id and counts Peak detection: the difference between adjacent counts goes from negative to positive (see the sketch below)
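A minimal sketch of that peak test in Java, assuming 1-5 star ratings and a per-movie array of counts (illustrative, not the lab's exact code):

// counts[i] is the number of ratings with value i+1 for one movie.
// Two peaks show up as the adjacent differences changing from
// negative (falling off a first peak) to positive (rising toward a second).
public static boolean hasTwoPeaks(long[] counts) {
  boolean wasFalling = false;
  for (int i = 1; i < counts.length; i++) {
    long diff = counts[i] - counts[i - 1];
    if (diff < 0) {
      wasFalling = true;
    } else if (diff > 0 && wasFalling) {
      return true; // counts rise again after falling: a second peak
    }
  }
  return false;
}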
Crunch background Stack up functions until a group-by operation to make a map phase Similarly, stack up functions after a group-by to make a reduce phase Additional group-by operations set up more MR rounds automatically

PTable<Long, Double> table = collection
    .by(new GetMovieID(), Avros.longs())
    .mapValues(new GetRating(), Avros.ints())
    .groupByKey()
    .mapValues(new AverageRating(), Avros.doubles());

Time: 10 minutes to discuss Crunch background and show application
Lab #7 - Analyze ratings with Crunch http://tiny.cloudera.com/StrataLab7 Time: 10 minutes for students to run the lab 5 minutes to run the lab up front 10 minutes to discuss the results, add an output dataset
Thank you blue@cloudera.com tom@cloudera.com