Presentation transcript:

MAD SKILLS: NEW ANALYSIS PRACTICES FOR BIG DATA
JEFF COHEN (GREENPLUM), BRIAN DOLAN (FOX AUDIENCE NETWORK), MARK DUNLAP (EVERGREEN TECHNOLOGIES), JOE HELLERSTEIN (UC BERKELEY), CALEB WELTON (GREENPLUM)

MADGENDA: Warehousing and the New Practitioners; Getting MAD; A Taste of Some Data-Parallel Statistics; Engine Design Priorities

IN THE DAYS OF KINGS AND PRIESTS: Computers and data are the crown jewels. Executives depend on computers, but cannot work with them directly. Hence the DBA "priesthood", with its acronymia: EDW, BI, OLAP.

THE ARCHITECTED EDW: Rational behavior … for a bygone era. "There is no point in bringing data … into the data warehouse environment without integrating it." — Bill Inmon, Building the Data Warehouse, 2005

NEW REALITIES: TB disks < $100. Everything is data. The rise of data-driven culture, very publicly espoused by Google, Wired, etc., and by projects like the Sloan Digital Sky Survey and TerraServer. "The quest for knowledge used to begin with grand theories. Now it begins with massive amounts of data. Welcome to the Petabyte Age."

MAD SKILLS: Magnetic: attract data and practitioners. Agile: rapid iteration (ingest, analyze, productionalize). Deep: sophisticated analytics in Big Data.

MAD SKILLS FOR ANALYTICS

THE NEW PRACTITIONERS: Hal Varian, UC Berkeley, Chief Economist at Google: "Looking for a career where your services will be in high demand? … Provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what's ubiquitous and cheap? Data. And what is complementary to data? Analysis. … The sexy job in the next ten years will be statisticians."

THE NEW PRACTITIONERS: Aggressively datavorous, statistically savvy, and diverse in training and tools.

FOX AUDIENCE NETWORK: Greenplum DB on 42 Sun X4500s ("Thumper"), each with GB drives, 16GB RAM, and 2 dual-core Opterons. Big and growing: 200 TB of data (mirrored); a fact table of 1.5 trillion rows; growing 5 TB per day, 4-7 billion rows per day. Variety of data: ad logs, CRM, user data. Research & reporting: a diversity of users, from sales account managers to research scientists; MicroStrategy to command-line SQL; also extensive use of R and Hadoop. As reported by FAN, Feb. 2009.

MADGENDA: Warehousing and the New Practitioners; Getting MAD; A Taste of Some Data-Parallel Statistics; Engine Design Priorities

VIRTUOUS CYCLE OF ANALYTICS (Figure 1: A Healthy Organization): acquire new data to be analyzed, run analytics to improve performance, change practices to suit, and around again. Analysts trump DBAs: they are data magnets; they tolerate and clean dirty data; they like all the data (no samples/extracts); they produce data.

MAD MODELING

MADGENDA: Warehousing and the New Practitioners; Getting MAD; A Taste of Some Data-Parallel Statistics; Engine Design Priorities

A SCENARIO FROM FAN: An open-ended question about statistical densities (distributions). How many female WWF fans under the age of 30 visited the Toyota community over the last 4 days and saw a Class A ad? How are these people similar to those that visited Nissan?
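
The head-count half of such a question is expressible as ordinary SQL over the warehouse; a minimal sketch follows, assuming a purely hypothetical schema of users(user_id, gender, age, is_wwf_fan), page_views(user_id, community, view_date), and ad_impressions(user_id, ad_class, imp_date). The talk does not describe FAN's actual tables at this level.

-- Hypothetical schema; table and column names are illustrative only.
SELECT COUNT(DISTINCT u.user_id)
FROM users u
JOIN page_views pv ON pv.user_id = u.user_id
JOIN ad_impressions ai ON ai.user_id = u.user_id
WHERE u.gender = 'F'
  AND u.age < 30
  AND u.is_wwf_fan
  AND pv.community = 'Toyota'
  AND pv.view_date >= CURRENT_DATE - 4
  AND ai.ad_class = 'A';

The "how are these people similar" half is what pushes beyond counting and into the density methods discussed next.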

DOLAN'S VOCABULARY OF STATISTICS: Data mining focuses on individual items; statistical analysis needs more. Focus on density methods! We need to be able to utter statistical sentences, and run them massively parallel on Big Data: 1. (Scalar) arithmetic. 2. Vector arithmetic, i.e. linear algebra. 3. Functions, e.g. probability densities. 4. Functionals, i.e. functions on functions, e.g. A/B testing: a functional over densities. 5. Miscellaneous statistical methods, e.g. resampling. (May all your sequences converge.)
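
As a minimal sketch of how item 2 (vector arithmetic) can be uttered in SQL, assume vectors are stored one component per row in a hypothetical table vectors(vec_id, idx, val). A dot product is then just a join plus an aggregate, and it parallelizes like any other join:

-- Hypothetical layout: one vector component per row.
-- Dot product of vectors 'a' and 'b'; the same join-and-SUM pattern
-- extends to matrix-vector and matrix-matrix products.
SELECT SUM(x.val * y.val) AS dot_product
FROM vectors x, vectors y
WHERE x.vec_id = 'a'
  AND y.vec_id = 'b'
  AND x.idx = y.idx;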

ANALYTICS IN FAN: The paper includes parallelizable, statistical SQL for: linear algebra (vectors/matrices); ordinary least squares (multiple linear regression); conjugate gradient (iterative optimization, e.g. for SVM classifiers); functionals, including the Mann-Whitney U test and log-likelihood ratios; and resampling techniques, e.g. bootstrapping. Encapsulated as stored procedures or UDFs, these significantly enhance the vocabulary of the DBMS. These are examples; related work (NIPS '06) does similar things in MapReduce syntax. Plenty of research to do here!
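
As one hedged illustration of how such methods reduce to parallel aggregates: ordinary least squares needs only the sufficient statistics X'X and X'y, which are plain SUMs over the data. The sketch below assumes a hypothetical two-predictor table obs(x1, x2, y); the paper's actual implementation wraps the same idea in vector and matrix UDFs.

-- Hypothetical table obs(x1, x2, y): two predictors and a response per row.
-- Accumulate X'X and X'y in one scan; these SUMs parallelize trivially.
SELECT SUM(x1*x1) AS xtx_11, SUM(x1*x2) AS xtx_12, SUM(x2*x2) AS xtx_22,
       SUM(x1*y)  AS xty_1,  SUM(x2*y)  AS xty_2,
       COUNT(*)   AS n
FROM obs;
-- The small 2x2 system (X'X) beta = X'y is then solved on the client or in a
-- UDF; only the aggregates ever touch the big table.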

MADGENDA: Warehousing and the New Practitioners; Getting MAD; A Taste of Some Data-Parallel Statistics; Engine Design Priorities

PARALLELISM AND PLURALISM: MAD scale and efficiency are achievable only via parallelism, plus pluralism for the new practitioners: multilingual development, flexible storage, commodity hardware. Greenplum is a leader in both dimensions.

ANOTHER EXAMPLE: Greenplum DB on 96 nodes; 4.5 petabytes of storage; 6.5 petabytes of user data (70% compression); 17 trillion records; 150 billion new records/day. As reported by Curt Monash, dbms2.com, April 2009.

PLURALISTIC STORAGE IN GREENPLUM: Internal storage: 1. Standard "heap" tables. 2. Greenplum "append-only" tables, optimized for fast scans, with multiple levels of compression supported. 3. Column-oriented tables. 4. Partitioned tables: combinations of the above storage types. Plus external data sources.
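
Roughly how these storage choices are expressed in Greenplum DDL; the syntax below is taken from later public Greenplum releases and the table is hypothetical, so treat it as a sketch rather than the configuration discussed in the talk.

CREATE TABLE ad_facts (
    imp_date  date,
    user_id   bigint,
    ad_id     bigint,
    clicks    int
)
WITH (appendonly=true, orientation=column,   -- append-only, column-oriented
      compresstype=zlib, compresslevel=5)    -- compressed storage
DISTRIBUTED BY (user_id)                     -- hash-distributed across segments
PARTITION BY RANGE (imp_date)
    (START (date '2009-01-01') INCLUSIVE
     END   (date '2010-01-01') EXCLUSIVE
     EVERY (INTERVAL '1 month'));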

SG ("SCATTER/GATHER") STREAMING: Parallel many-to-many loading architecture. Automatic repartitioning of data from external sources. Performance scales with the number of nodes. Negligible impact on concurrent database operations. Transformation in flight using SQL or other languages. 4 TB/hour on the FAN production system.
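
A sketch of what parallel loading looks like from the SQL side, using Greenplum's gpfdist external tables to feed the hypothetical ad_facts table above; hostnames, paths, and columns are made up, and option details vary by release.

CREATE EXTERNAL TABLE ad_log_staging (
    imp_time  timestamp,
    user_id   bigint,
    ad_id     bigint,
    url       text
)
LOCATION ('gpfdist://etl-host1:8081/adlogs/*.tsv',
          'gpfdist://etl-host2:8081/adlogs/*.tsv')
FORMAT 'TEXT' (DELIMITER E'\t');

-- Every segment pulls from the gpfdist servers in parallel; "transformation
-- in flight" is just SQL over the external table.
INSERT INTO ad_facts
SELECT imp_time::date, user_id, ad_id, 1
FROM ad_log_staging
WHERE url LIKE '%toyota%';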

MULTILINGUAL DEVELOPMENT: SQL or MapReduce, plus sequential code in a variety of languages: Perl, Python, Java, R. Mix and match! SE HABLA MAPREDUCE. SQL SPOKEN HERE. QUI SI PARLA PYTHON. HIER WIRD JAVA GESPROCHEN. R PARLÉ ICI.
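
Mixing languages inside the database mostly means user-defined functions in procedural languages. A minimal sketch, assuming the PL/Python language (plpythonu) is installed; PL/R, PL/Perl, and PL/Java work the same way, and model_scores is a hypothetical table:

CREATE FUNCTION logit(x double precision) RETURNS double precision AS $$
    # plain Python, executed inside the database
    import math
    return 1.0 / (1.0 + math.exp(-x))
$$ LANGUAGE plpythonu;

SELECT user_id, logit(score) AS p FROM model_scores;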

SQL & MAPREDUCE: Unified execution of SQL and MapReduce on a common parallel execution engine. Analyze structured or unstructured data, inside or outside the database. Scale-out parallelism on commodity hardware. (Architecture diagram: ODBC/JDBC clients and MapReduce code in Perl, Python, etc. feed the Query Planner and Optimizer (SQL) and the Parallel Dataflow Engine, backed by the Transaction Manager & Log Files, Database Storage, and External Storage.)

BACKUP

TIME FOR ONE? BOOTSTRAPPING: A resampling technique: sample k out of N items with replacement and compute an aggregate statistic θ1; resample another k items (with replacement) and compute an aggregate statistic θ2; … repeat for t trials. The resulting set of θi values is normally distributed, and their mean is a good approximation of the true θ. Avoids overfitting: good for small groups of data, or for masking outliers.

BOOTSTRAP IN PARALLEL SQL: Tricks: given dense row_IDs on the table to be sampled, identify all data to be sampled during bootstrapping. The view Design(trial_id, row_id) is easy to construct using SQL functions; join Design to the table to be sampled, group by trial_id, and compute the estimate. All resampling steps are performed in one parallel query, and the estimator is an aggregation query over the join: a dozen lines of SQL that parallelize beautifully.

SQL BOOTSTRAP: HERE YOU GO!

1. Identify the rows to sample (N = table size, t = number of trials, k = subsample size):

CREATE VIEW design AS
SELECT a.trial_id, floor(N * random()) AS row_id
FROM generate_series(1, t) AS a (trial_id),
     generate_series(1, k) AS b (subsample_id);

2. Join to the table T(row_id, values) and compute the estimator theta per trial:

CREATE VIEW trials AS
SELECT d.trial_id, theta(T.values) AS avg_value
FROM design d, T
WHERE d.row_id = T.row_id
GROUP BY d.trial_id;

3. Summarize across trials:

SELECT AVG(avg_value), STDDEV(avg_value) FROM trials;
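
One way to make the sketch concrete, assuming T(row_id, values) has dense row_ids starting at 0, AVG stands in for the estimator theta, and (arbitrarily) t = 1000 trials of k = 100 rows each; the row count is computed inline rather than hard-coding N:

CREATE VIEW design AS
SELECT a.trial_id, floor(c.row_cnt * random()) AS row_id
FROM generate_series(1, 1000) AS a (trial_id),
     generate_series(1, 100)  AS b (subsample_id),
     (SELECT count(*) AS row_cnt FROM T) AS c;
-- Steps 2 and 3 are unchanged, with AVG(T.values) in place of theta(T.values).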