CPS 216: Data-intensive Computing Systems
Information about Project 1
Shivnath Babu
Project 1: Overview Project 1 (Sept to late Nov): 1.Processing collections of records: Systems like Pig, Hive, Jaql, Cascading, Cascalog, HadoopDB 2.Matrix and graph computations: Systems like Rhipe, Ricardo, SystemML, Mahout, Pregel, Hama 3.Data stream processing: Systems like Flume, FlumeJava, S4, STREAM, Scribe, STORM 4.Data serving systems: Systems like BigTable/HBase, Dynamo/Cassandra, CouchDB, MongoDB, Riak, VoltDB Project 1 will have regular milestones. The final report will include: 1.What are properties of the data encountered? 2.What are concrete examples of workloads that are run? Develop a benchmark workload that you will implement and use in Step 5. 3.What are typical goals and requirements? 4.What are typical systems used, and how do they compare with each other? 5.Install some of these systems and do an experimental evaluation of 1, 2, 3, & 4 Project 2 (Late Nov to end of class). Of your own choosing. Could be a significant new feature added to Project 1
Group 1: Processing Collections of Records
1. Workloads:
   1. See the paper “The Case for Evaluating MapReduce Performance Using Workload Suites” for pointers to a number of possible MapReduce workloads
   2. Citation 12 in that paper: Pavlo, Paulson, and others (comes with data)
   3. TPC-H (comes with data)
   4. If things work out: A real Hadoop+HBase workload that Akamai uses
2. Systems:
   1. Hadoop
   2. Pig
   3. Hive
   4. A hybrid system like HadoopDB
Group 2: Matrix and Graph Computations
1. Workloads:
   1. Matrix computations, e.g., PLSA
   2. Graph computations, e.g., PageRank
2. Machine-learning workloads (of interest to Groups 1 and 2)
3. Systems:
   1. Hadoop
   2. Spark / Twister
   3. RHIPE
   4. (Mahout)
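As a rough illustration of the kind of graph computation named above, here is a minimal single-machine PageRank sketch using power iteration. It is not the implementation any of the listed systems use, and the function name `pagerank`, the adjacency-dict input format, and the defaults `damping=0.85` and `iterations=50` are assumptions made for this example. It also assumes every node that appears as a neighbor is itself a key in the dict.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency dict {node: [out-neighbors]}.

    Hypothetical helper for illustration; assumes every neighbor is also a key.
    """
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(ranks[u] for u in nodes if not graph[u])
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {node: base for node in nodes}
        for u in nodes:
            out = graph[u]
            if out:
                # Each node passes an equal share of its rank to every out-neighbor.
                share = damping * ranks[u] / len(out)
                for v in out:
                    new_ranks[v] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    # Toy graph, invented for this example.
    toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for node, rank in sorted(pagerank(toy).items()):
        print(node, round(rank, 4))
```

On a cluster, each iteration would typically be expressed as one pass over the edge list (for example, one MapReduce job per iteration), which is part of what makes PageRank a useful workload for comparing systems.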
Group 3: Data Stream Processing
1. Workloads:
   1. Behavioral Targeting
   2. Linear Road Benchmark
2. Systems:
   1. Hadoop
   2. Flume and FlumeBase
   3. Hadoop + HBase
Group 4: Data Serving Systems
1. Workloads:
   1. YCSB
   2. YCSB++
2. Systems (no need to do them all):
   1. HDFS (not the full Hadoop) or MapR
   2. HBase (original design comes from Google BigTable)
   3. Cassandra / Riak (original design comes from Amazon Dynamo)
   4. VoltDB (parallel in-memory database)
   5. CouchDB / MongoDB (document stores)
Upcoming Milestones
1. Read about the workloads, performance goals, etc. Discuss within your group. Pick one workload or come up with your own. Write a report by Sept 23. You can write it as a group or on your own.
2. One part of programming assignment 2 will involve writing and running the workload using Hadoop/HDFS/MapR. This assignment will be done on Amazon EC2. It is done individually, but group discussion is fine.
3. Later in Project 1, you will compare the performance seen on Hadoop/HDFS/MapR in Step 2 with the performance of the other systems you use.