Project
Mid-term report due on 25th October at midnight.
Format: up to 4 pages in sig-alternate format, addressing:
What is the problem?
Why is it important?
Why is it hard?
What is your key idea or approach?
Any initial results? If not, provide a sketch of the idea/algorithm and an architecture diagram of the system; the more detailed the better.
What is the important related work, and why is it insufficient for your problem?
Counts for 30% of your project grade.

Dremel

Some History
Parallel DB systems had already been around for the prior 20-30 years.
Historical DB companies supporting parallelism include: Teradata, Tandem, Informix, Oracle, RedBrick, Sybase, DB2.

NoSQL
Along came NoSQL (early-to-mid 2000s), built on the idea that databases are slow.
Complaints included:
Too slow
Too much loading time
Too monolithic and complex (instruction manuals of ~500 pages)
Too much heft for “internet scale” applications
Too expensive
Too hard to understand

NoSQL
The story of NoSQL and its intimate relationship with Google.
This is the OLAP story, not the OLTP story.
OLTP story: BigTable (2006) => MegaStore (2011) => Spanner, F1 (2012)
Less consistency => More consistency
Contemporaries: PNUTS, Cassandra, HBase, CouchDB, Dynamo

NoSQL
OLAP story: MapReduce (2004) => Dremel (2010)
Less use of parallel-DB principles => More use of parallel-DB principles
By 2010, Google had restricted MapReduce to complex batch processing, with Dremel handling interactive analytics.
Contemporaries:
MapReduce: Hadoop (Yahoo)
PSQL-on-MapReduce: Pig (Yahoo), Hive (Facebook)
PSQL-not-on-MapReduce: Impala
Newest “in-memory” parallel analytics platforms: Scuba (Facebook), PowerDrill (Google)
Memory is the new disk.

Map-Reduce
2004: Google published MapReduce, a parallel programming paradigm.
Pros:
Fast
Imperative
Many real use cases
Cons:
Checkpoints all intermediate results
No real logic or optimization
Very “rigid”, no room for improvement
Many bottlenecks
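To make the paradigm concrete, here is a minimal single-process sketch of the classic word-count computation written in the MapReduce style; the function names and the in-memory shuffle are illustrative only, not Google's or Hadoop's actual API.

```python
from collections import defaultdict

# Minimal single-process sketch of the MapReduce word-count pattern.
# A real system runs map and reduce tasks in parallel across machines
# and checkpoints the intermediate results to disk.

def map_phase(doc_id, text):
    """Emit (word, 1) pairs for every word in one document."""
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Sum the counts for one word."""
    return (word, sum(counts))

docs = {1: "the quick brown fox", 2: "the lazy dog"}
intermediate = [pair for doc_id, text in docs.items()
                for pair in map_phase(doc_id, text)]
result = dict(reduce_phase(w, c) for w, c in shuffle(intermediate).items())
print(result)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```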

Along comes Dremel
2010: Still not a full-fledged parallel database; supports PROJECT-SELECT-AGGREGATE queries.
What does it lack?

Along comes Dremel
2010: Still not a full-fledged parallel database.
What does it lack?
Support for joins
Support for transactions (it is read-only)
Support for intelligent partitioning?
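As a rough illustration of the query class (the table and column names below are made up, and this is plain Python rather than Dremel's SQL dialect): a PROJECT-SELECT-AGGREGATE query scans a single table with a filter, a projection, and a grouped aggregate; a join across tables falls outside that class.

```python
from collections import defaultdict

# Illustrative only: a hypothetical "requests" table, not Dremel's API.
requests = [
    {"country": "US", "latency_ms": 120},
    {"country": "DE", "latency_ms": 80},
    {"country": "US", "latency_ms": 95},
]

# Roughly: SELECT country, AVG(latency_ms) FROM requests
#          WHERE latency_ms < 200 GROUP BY country
# -- a single-table select/project/aggregate, the shape Dremel (2010) targets.
groups = defaultdict(list)
for r in requests:
    if r["latency_ms"] < 200:                         # selection
        groups[r["country"]].append(r["latency_ms"])  # projection + grouping
avg_latency = {c: sum(v) / len(v) for c, v in groups.items()}  # aggregation
print(avg_latency)  # {'US': 107.5, 'DE': 80.0}

# By contrast, "join requests with a users table on user_id" needs a
# distributed join, which the 2010 Dremel system does not support.
```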

Column Stores
For OLAP, column stores are a lot better than row stores.
The idea dates from the 1980s; it was commercialized as Vertica in 2005.
Key idea: store the values for a single column together.
Why is this better for aggregation/OLAP?

Column Stores
For OLAP, column stores are a lot better than row stores.
Key idea: store the values for a single column together.
Why is this better for aggregation?
Better compression: similar values pack together well
Can skip over unnecessary columns
Much less data read from disk
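A minimal sketch of the layout difference, using a made-up sales table: a row store keeps whole records together, while a column store keeps each attribute in its own array, so an aggregate over one column touches only that array (and similar values sit next to each other, which helps compression).

```python
# Hypothetical sales table, illustrative only.
rows = [
    ("2024-01-01", "widget", 3, 9.99),
    ("2024-01-01", "gadget", 1, 24.50),
    ("2024-01-02", "widget", 2, 9.99),
]

# Row layout: each record stored contiguously; summing one column still
# scans every field of every record.
total_qty_row_store = sum(r[2] for r in rows)

# Column layout: one array per attribute; the same aggregate reads only
# the 'qty' array, and similar values within a column compress well.
columns = {
    "date":  [r[0] for r in rows],
    "item":  [r[1] for r in rows],
    "qty":   [r[2] for r in rows],
    "price": [r[3] for r in rows],
}
total_qty_col_store = sum(columns["qty"])

assert total_qty_row_store == total_qty_col_store == 6
```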

Column Stores
When can column stores suffer relative to row stores?

Column Stores
When can column stores suffer relative to row stores?
Point lookups on a single data item (e.g., find the year in which company XXX was established)
Transactions can be bad: insertions and deletions can be quite costly
Writes require multiple accesses, one per column
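Continuing the made-up layouts from the previous sketch: inserting one record into the column store touches every per-column array (in a real system, a separate file or page per column), whereas a row store appends the record in one place; reconstructing a single full record likewise requires a read from every column.

```python
# Continuing the illustrative row/column layouts from the previous sketch.
rows = []
columns = {"date": [], "item": [], "qty": [], "price": []}

new_record = ("2024-01-03", "widget", 5, 9.99)

# Row store: one append, one contiguous write.
rows.append(new_record)

# Column store: one append (i.e., one separate file/page touched) per column.
for col_name, value in zip(("date", "item", "qty", "price"), new_record):
    columns[col_name].append(value)

# Similarly, reconstructing a single full record requires reading the same
# position from every column.
i = 0
record = tuple(columns[c][i] for c in ("date", "item", "qty", "price"))
assert record == new_record
```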

Dremel: Column Encoding
It turns out this encoding has been open-sourced by Twitter as “Parquet”.
Are there cases where the proposed column encoding scheme doesn’t make much sense?

Dremel: Column Encoding
It turns out this encoding has been open-sourced by Twitter as “Parquet”.
Would this column encoding make sense if:
All records have a rigid schema?
Not all records obey the schema? (Often the case with JSON/XML, due to mistakes in data generation.)
Most data “looks the same”, with a few exceptions?
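For a concrete feel of the encoding, here is a minimal sketch of Dremel-style definition levels for a single optional nested field; repetition levels for repeated fields are omitted, and the record shapes are made up rather than taken from the paper's schema.

```python
# Minimal sketch of Dremel-style definition levels for the optional path
# name.language (repetition levels for repeated fields are omitted).
# Records are made up; this is not Parquet's actual API.
records = [
    {"name": {"language": "en"}},   # field present: definition level 2, value stored
    {"name": {}},                   # "name" present, "language" missing: level 1
    {},                             # "name" missing entirely: level 0
]

def encode_optional(rec):
    """Return (definition_level, value) for the path name.language."""
    if "name" not in rec:
        return (0, None)
    if "language" not in rec["name"]:
        return (1, None)
    return (2, rec["name"]["language"])

encoded = [encode_optional(r) for r in records]
print(encoded)  # [(2, 'en'), (1, None), (0, None)]

# With a rigid schema every level would be 2, so the levels carry no
# information; the encoding pays off when fields may be absent.
```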

Hierarchical Trees
What factors would you take into account when deciding the fanout for the hierarchical trees?

Hierarchical Trees
What factors would you take into account when deciding the fanout for the hierarchical trees?
Too small a fanout means a deeper tree, spending extra network bandwidth for too little gain
Too large a fanout may end up overwhelming a single node
Relevant resources: network bandwidth, CPU capability, local memory, local disk
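A small back-of-the-envelope sketch of the trade-off, with illustrative numbers: for a fixed number of leaf servers, the fanout determines both the depth of the aggregation tree (extra network hops and intermediate transfers) and how many partial results each internal node must merge.

```python
import math

def tree_cost(num_leaves, fanout):
    """Depth of the aggregation tree and per-node merge load for a given fanout."""
    depth = math.ceil(math.log(num_leaves, fanout))  # levels above the leaves
    return depth, fanout  # each internal node merges roughly 'fanout' partial results

for fanout in (2, 10, 100, 3000):
    depth, merge_load = tree_cost(num_leaves=3000, fanout=fanout)
    print(f"fanout={fanout:5d}  depth={depth:2d}  inputs per node={merge_load}")

# Small fanout -> deep tree: many hops and extra intermediate transfers.
# Huge fanout  -> shallow tree, but the root must merge thousands of partial
#                 results, limited by its CPU, memory, and network bandwidth.
```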