CPS 216: Data-intensive Computing Systems Information about Project 1 Shivnath Babu.



Project 1: Overview

Project 1 (Sept to late Nov):
1. Processing collections of records: Systems like Pig, Hive, Jaql, Cascading, Cascalog, HadoopDB
2. Matrix and graph computations: Systems like Rhipe, Ricardo, SystemML, Mahout, Pregel, Hama
3. Data stream processing: Systems like Flume, FlumeJava, S4, STREAM, Scribe, STORM
4. Data serving systems: Systems like BigTable/HBase, Dynamo/Cassandra, CouchDB, MongoDB, Riak, VoltDB

Project 1 will have regular milestones. The final report will include:
1. What are properties of the data encountered?
2. What are concrete examples of workloads that are run? Develop a benchmark workload that you will implement and use in Step 5.
3. What are typical goals and requirements?
4. What are typical systems used, and how do they compare with each other?
5. Install some of these systems and do an experimental evaluation of 1, 2, 3, & 4

Project 2 (Late Nov to end of class): Of your own choosing. Could be a significant new feature added to Project 1.

Group 1: Processing Collections of Records

1. Workloads:
   1. See “The Case for Evaluating MapReduce Performance Using Workload Suites” for pointers to a number of possible MapReduce workloads
   2. Citation 12 in the paper: Pavlo, Paulson, and others (comes with data)
   3. TPC-H (comes with data)
   4. If things work out: A real Hadoop+HBase workload that Akamai uses
2. Systems:
   1. Hadoop
   2. Pig
   3. Hive
   4. A hybrid system like HadoopDB
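To make the kind of record-processing workload in this group concrete, here is a minimal single-process sketch of a MapReduce-style grouped aggregation in Python. It is not Hadoop, Pig, or Hive code; the function names (map_phase, shuffle, reduce_phase, run_job) and the toy LOG records are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical input records: (site, bytes_served). Not taken from any real dataset.
LOG = [("a.com", 120), ("b.com", 300), ("a.com", 80)]


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def emit_bytes(record):
    """Mapper: emit the site as key and the bytes served as value."""
    site, nbytes = record
    yield site, nbytes


def total(key, values):
    """Reducer: sum all values seen for one key."""
    return sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    return reduce_phase(shuffle(map_phase(records, emit_bytes)), total)


if __name__ == "__main__":
    print(run_job(LOG))
```

The same total-per-key query is what a GROUP BY with SUM would express in a system such as Pig or Hive, which is one way to compare the systems in this group on an identical workload.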

Group 2: Matrix and Graph Computations

1. Workloads:
   1. Matrix computations, e.g., PLSA
   2. Graph computations, e.g., PageRank
2. Machine-learning workloads (of interest to Groups 1 and 2)
3. Systems:
   1. Hadoop
   2. Spark / Twister
   3. RHIPE
   4. (Mahout)
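As an illustration of the PageRank workload named above, the following is a small in-memory sketch using plain power iteration. The damping factor of 0.85, the fixed iteration count, the handling of nodes without out-links, and the TOY_GRAPH data are all assumptions chosen for the example; a distributed implementation on Hadoop or Spark would partition the same computation across machines.

```python
# Hypothetical toy graph: each node maps to the list of nodes it links to.
# Every node must appear as a key, even if it has no out-links.
TOY_GRAPH = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(graph, damping=0.85, iterations=50):
    """Compute PageRank scores by repeated rank propagation (power iteration)."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # start from a uniform distribution

    for _ in range(iterations):
        # Every node receives the same "random jump" share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out_links = graph[v]
            if out_links:
                # Split this node's damped rank evenly across its out-links.
                share = damping * rank[v] / len(out_links)
                for w in out_links:
                    new_rank[w] += share
            else:
                # Node with no out-links: spread its damped rank over all nodes.
                share = damping * rank[v] / n
                for w in nodes:
                    new_rank[w] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    for node, score in sorted(pagerank(TOY_GRAPH).items()):
        print(node, round(score, 4))
```

Each iteration is one pass over all edges, which is why PageRank is a common benchmark for comparing iterative graph processing on different systems.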

Group 3: Data Stream Processing

1. Workloads:
   1. Behavioral Targeting
   2. Linear Road Benchmark
2. Systems:
   1. Hadoop
   2. Flume and FlumeBase
   3. Hadoop + HBase

Group 4: Data Serving Systems

1. Workloads:
   1. YCSB
   2. YCSB++
2. Systems (no need to do them all):
   1. HDFS (not the full Hadoop) or MapR
   2. HBase (Original design comes from Google BigTable)
   3. Cassandra / Riak (Original design comes from Amazon Dynamo)
   4. VoltDB (Parallel in-memory database)
   5. CouchDB / MongoDB (Document Stores)
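A data serving benchmark is typically driven by a stream of read and update operations against keys. The sketch below generates such a stream in the spirit of YCSB, but it does not use the YCSB tool or its API: generate_ops, run_workload, the uniform key choice, and the plain dict standing in for a data store are all assumptions made for illustration.

```python
import random


def generate_ops(num_ops, num_keys, read_fraction, seed=0):
    """Build a reproducible list of (operation, key, value) tuples.

    read_fraction is the probability that an operation is a read;
    all other operations are updates. Keys are chosen uniformly here.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    ops = []
    for _ in range(num_ops):
        key = f"user{rng.randrange(num_keys)}"
        if rng.random() < read_fraction:
            ops.append(("read", key, None))
        else:
            ops.append(("update", key, rng.randrange(1000)))
    return ops


def run_workload(store, ops):
    """Replay the operations against a dict-like store and count outcomes."""
    counts = {"read": 0, "update": 0, "miss": 0}
    for op, key, value in ops:
        if op == "read":
            counts["read"] += 1
            if key not in store:
                counts["miss"] += 1  # read of a key never written
        else:
            store[key] = value
            counts["update"] += 1
    return counts


if __name__ == "__main__":
    operations = generate_ops(num_ops=1000, num_keys=100, read_fraction=0.95)
    print(run_workload({}, operations))
```

To compare systems, the dict would be replaced by a thin client wrapper for each store under test, with the same operation list replayed against each one while latency and throughput are recorded.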

Upcoming Milestones

1. Read about the workloads, performance goals, etc. Discuss within your group. Pick one workload or come up with your own. Write a report by Sept 23. You can do it as part of a group or on your own.
2. One part of programming assignment 2 will involve writing and running the workload using Hadoop/HDFS/MapR. This assignment will be done on Amazon EC2. It is done individually, but group discussion is fine.
3. As part of Project 1 later on, you will compare the performance on Hadoop/HDFS/MapR seen in Step 2 with that of the other systems you will use.