D4M-1 Jeremy Kepner & Vijay Gadepally IPDPS Graph Algorithm Building Blocks Adjacency Matrices, Incidence Matrices, Database Schemas, and Associative Arrays.

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

1 Copyright © 2011, Oracle and/or its affiliates. All rights reserved.
D4M-1 Jeremy Kepner MIT Lincoln Laboratory 3 October 2012 Transforming Big Data with D4M This work is sponsored by the Department of the Air Force under.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Matthew Hubbell, Jeremy Kepner, Andrew McCabe, Peter Michaleas, Julie Mullen, David O’Gwynn,
Fundamentals, Design, and Implementation, 9/e Chapter 12 ODBC, OLE DB, ADO, and ASP.
Introduction to Structured Query Language (SQL)
Chapter 3. 2 Chapter 3 - Objectives Terminology of relational model. Terminology of relational model. How tables are used to represent data. How tables.
Chapter 14 The Second Component: The Database.
Introduction to Structured Query Language (SQL)
1 Chapter 2 Reviewing Tables and Queries. 2 Chapter Objectives Identify the steps required to develop an Access application Specify the characteristics.
Concepts of Database Management Sixth Edition
Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.
D4M – Signal Processing On Databases
Ch 4. The Evolution of Analytic Scalability
Overview of SQL Server Alka Arora.
Chapter 4 The Relational Model Pearson Education © 2014.
Chapter 4 The Relational Model.
Chapter 3 The Relational Model Transparencies Last Updated: Pebruari 2011 By M. Arief
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Database Performance Tuning and Query Optimization.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Cloud Computing Other High-level parallel processing languages Keke Chen.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Query optimization in relational DBs Leveraging the mathematical formal underpinnings of the relational model.
Chapter 3 The Relational Model. 2 Chapter 3 - Objectives u Terminology of relational model. u How tables are used to represent data. u Connection between.
Slide 1 MIT Lincoln Laboratory Toward Mega-Scale Computing with pMatlab Chansup Byun and Jeremy Kepner MIT Lincoln Laboratory Vipin Sachdeva and Kirk E.
Professor Michael J. Losacco CIS 1110 – Using Computers Database Management Chapter 9.
Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,
Database Management COP4540, SCS, FIU Physical Database Design (ch. 16 & ch. 3)
Introduction to Database Systems1. 2 Basic Definitions Mini-world Some part of the real world about which data is stored in a database. Data Known facts.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
1 Biometric Databases. 2 Overview Problems associated with Biometric databases Some practical solutions Some existing DBMS.
Haney - 1 HPEC 9/28/2004 MIT Lincoln Laboratory pMatlab Takes the HPCchallenge Ryan Haney, Hahn Kim, Andrew Funk, Jeremy Kepner, Charles Rader, Albert.
Chapter 10 Designing the Files and Databases. SAD/CHAPTER 102 Learning Objectives Discuss the conversion from a logical data model to a physical database.
C-Store: RDF Data Management Using Column Stores Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Apr. 24, 2009.
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
1 Melanie Alexander. Agenda Define Big Data Trends Business Value Challenges What to consider Supplier Negotiation Contract Negotiation Summary 2.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Relational Operator Evaluation. Overview Application Programmer (e.g., business analyst, Data architect) Sophisticated Application Programmer (e.g.,
Session 1 Module 1: Introduction to Data Integrity
Slide-1 Parallel MATLAB MIT Lincoln Laboratory Multicore Programming in pMatlab using Distributed Arrays Jeremy Kepner MIT Lincoln Laboratory This work.
CIS 601 Fall 2003 Introduction to MATLAB Longin Jan Latecki Based on the lectures of Rolf Lakaemper and David Young.
Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.
1. 2  Introduction  Array Operations  Number of Elements in an array One-dimensional array Two dimensional array Multi-dimensional arrays Representation.
What is GIS? “A powerful set of tools for collecting, storing, retrieving, transforming and displaying spatial data”
The Relational Model © Pearson Education Limited 1995, 2005 Bayu Adhi Tama, M.T.I.
Indexing Structures Database System Implementation CSE 507 Some slides adapted from Silberschatz, Korth and Sudarshan Database System Concepts – 6 th Edition.
CIS 595 MATLAB First Impressions. MATLAB This introduction will give Some basic ideas Main advantages and drawbacks compared to other languages.
LECTURE TWO Introduction to Databases: Data models Relational database concepts Introduction to DDL & DML.
1 CENG 351 CENG 351 Introduction to Data Management and File Structures Department of Computer Engineering METU.
Lustre, Hadoop, Accumulo Jeremy Kepner 1,2,3, William Arcand 1, David Bestor 1, Bill Bergeron 1, Chansup Byun 1, Lauren Edwards 1, Vijay Gadepally 1,2,
Group members: Phạm Hoàng Long Nguyễn Huy Hùng Lê Minh Hiếu Phan Thị Thanh Thảo Nguyễn Đức Trí 1 BIG DATA & NoSQL Topic 1:
Neo4j: GRAPH DATABASE 27 March, 2017
Image taken from: slideshare
Big Data is a Big Deal!.
Sushant Ahuja, Cassio Cristovao, Sameep Mohta
Chapter 14: System Protection
SQL Server 2017 Graph Database Inside-Out
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
Database Performance Tuning and Query Optimization
Evaluation of Relational Operations
Introduction to MATLAB
Chapter 11: Indexing and Hashing
Chapter 14: Protection.
Physical Database Design
Ch 4. The Evolution of Analytic Scalability
Overview of big data tools
Chapter 11 Database Performance Tuning and Query Optimization
Chapter 11: Indexing and Hashing
Slides based on those originally by : Parminder Jeet Kaur
Presentation transcript:

D4M-1 Jeremy Kepner & Vijay Gadepally IPDPS Graph Algorithm Building Blocks Adjacency Matrices, Incidence Matrices, Database Schemas, and Associative Arrays This work is sponsored by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract #FA C Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

D4M-2 Introduction Associative Arrays & Adjacency Matrices Database Schemas & Incidence Matrices Examples: Twitter & DNA Summary Outline

LLGrid- 3 Common Big Data Challenge Commanders OperatorsAnalysts Users Maritime Ground Space C2Cyber OSINT Data Air HUMINT Weather Data Users Gap & Beyond Rapidly increasing - Data volume - Data velocity - Data variety

LLGrid- 4 Common Big Data Architecture Commanders OperatorsAnalysts Users Maritime Ground Space C2Cyber OSINT Data Air HUMINT Weather Analytics Computing Web Files Scheduler Ingest & Enrichment Ingest Databases

LLGrid- 5 Common Big Data Architecture - Data Volume: Cloud Computing - Commanders OperatorsAnalysts Users Maritime Ground Space C2Cyber OSINT Data Air HUMINT Weather Analytics Computing Web Files Scheduler Ingest & Enrichment Ingest Databases OperatorsAnalysts MIT SuperCloud Enterprise Cloud Big Data CloudDatabase Cloud Compute Cloud MIT SuperCloud merges four clouds LLSuperCloud: Sharing HPC Systems for Diverse Rapid Prototyping, Reuther et al, IEEE HPEC 2013

LLGrid- 6 Commanders OperatorsAnalysts Users Maritime Ground Space C2Cyber OSINT Data Air HUMINT Weather Analytics Computing Web Files Scheduler Ingest & Enrichment Ingest Databases Lincoln benchmarking validated Accumulo performance Common Big Data Architecture - Data Velocity: Accumulo Database -

LLGrid- 7 Commanders OperatorsAnalysts Users Maritime Ground Space C2Cyber OSINT Data Air HUMINT Weather Analytics Computing Web Files Scheduler Ingest & Enrichment Ingest Databases D4M demonstrated a universal approach to diverse data columns rows Σ raw Common Big Data Architecture - Data Variety: D4M Schema - intel reports, DNA, health records, publication citations, web logs, social media, building alarms, cyber, … all handled by a common 4 table schema D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al, IEEE HPEC 2013

LLGrid- 8 Database Discovery Workshop 3 day hands-on workshop on: Systems Parse, ingest, query, analysis & display Optimization Files vs. database, chunking & query planning Fusion Integrating diverse data Technology selection Knowing what to use is as important as knowing how to use it Using state-of-the-art technologies: Reference & Database Workshop Hadoop Python SciDB

D4M-9 Introduction Associative Arrays & Adjacency Matrices Database Schemas & Incidence Matrices Examples: Twitter & DNA Summary Outline

D4M-10 High Level Language: D4M d4m.mit.edu Accumulo Distributed Database Query: Alice Bob Cathy David Earl Query: Alice Bob Cathy David Earl Associative Arrays Numerical Computing Environment D4M Dynamic Distributed Dimensional Data Model A C D E B A D4M query returns a sparse matrix or a graph… …for statistical signal processing or graph analysis in MATLAB or GNU Octave D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization Dynamic Distributed Dimensional Data Model (D4M) Database and Computation System, Kepner et al, ICASSP 2012

D4M-11 D4M Key Concept: Associative Arrays Unify Four Abstractions Extends associative arrays to 2D and mixed data types A('alice ','bob ') = 'cited ' or A('alice ','bob ') = 47.0 Key innovation: 2D is 1-to-1 with triple store ('alice ','bob ','cited ') or('alice ','bob ',47.0) xATxATx ATAT  alice bob alice carl bob carl cited

D4M-12 Key innovation: mathematical closure –All associative array operations return associative arrays Enables composable mathematical operations A + B A - B A & B A|B A*B Enables composable query operations via array indexing A('alice bob ',:) A('alice ',:) A('al* ',:) A('alice : bob ',:) A(1:2,:) A == 47.0 Simple to implement in a library (~2000 lines) in programming environments with: 1 st class support of 2D arrays, operator overloading, sparse linear algebra Composable Associative Arrays Complex queries with ~50x less effort than Java/SQL Naturally leads to high performance parallel implementation

D4M-13 What are Spreadsheets and Big Tables? Spreadsheets Big Tables Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?) Big Tables (Google, Amazon, …) store most of the analyzed data in the world (Exabytes?) Simultaneous diverse data: strings, dates, integers, reals, … Simultaneous diverse uses: matrices, functions, hash tables, databases, … No formal mathematical basis; Zero papers in AMA or SIAM

D4M- 14 First step axiomatization of the associative array Desirable features for our “axiomatic” abstract arrays –Accurately describe the tables and table operations from D4M –Matrix addition and multiplication are defined appropriately –As many matrix-like algebraic properties as possible A(B + C) = AB + AC A + B = B + A Definition An associative array is a map A:K n  S from a set of (possibly infinite keys) into a commutative semi-ring where A(k 1,…,k n ) = 0 for all but finitely many key tuples –Like an infinite matrix whose entries are “0 almost everywhere” –“Matrix-like” arrays are the maps A:K n  S –Addition [A + B](i,j) = A(i,j) + B(i,j) –Multiplication [AB](i,j) = ∑ k A(i,k) x A(k,j) An Algebraic Definition For Tables The Abstract Algebra of Big Data, Kepner & Chaidez, Union College Mathematics Conference 2013

D4M-15 Introduction Associative Arrays & Adjacency Matrices Database Schemas & Incidence Matrices Examples: Twitter & DNA Summary Outline

D4M-16 Generic D4M Triple Store Exploded Schema TimeCol1Col2Col aa bb cc Col1|aCol1|bCol2|bCol2|cCol3|aCol3|c Input Data Accumulo Table: T Tabular data expanded to create many type/value columns Transpose pairs allows quick look up of either row or column Flip time for parallel performance Col1|a1 Col1|b1 Col2|b1 Col2|c1 Col3|a1 Col3|c1 Accumulo Table: Ttranspose

D4M-17 Tables: SQL vs D4M+Accumulo log_idsrc_ipsrv_ip SQL Dense Table: T src_ip| src_ip| srv_ip| srv_ip| srv_ip| log_id|10011 log_id|20011 log_id| Accumulo D4M schema (aka NuWave) Tables: E and E T Use as row indices Create columns for each unique type/value pair Both dense and sparse tables stored the same data Accumulo D4M schema uses table pairs to index every unique string for fast access to both rows and columns (ideal for graph analysis)

D4M-18 Queries: SQL vs D4M Query OperationSQLD4M Select allSELECT * FROM T E(:,:) Select columnSELECT src_ip FROM T E(:,StartsWith('src_ip| ')) Select sub-columnSELECT src_ip FROM T WHERE src_ip= E(:,'src_ip| ') Select sub-matrixSELECT * FROM T WHERE src_ip= E(Row(E(:,'src_ip| '))),:) Queries are easy to represent in both SQL and D4M Pedigree (i.e., the source row ID) is always preserved since no information is lost

D4M-19 Analytics: SQL vs D4M Query OperationSQLD4M HistogramSELECT COUNT(src_ip) FROM T GROUP BY src_ip sum(E(:,StartsWith('src_ip| ')),2) Graph traversalSELECT * FROM T WHERE src_ip= v0 = 'src_ip| ' v1 = Col(E(Row(E(:,v0)),:)) v2 = Col(E(Row(E(:,v1)),:)) Graph construction… many lines …A = E(:,StartsWith('src_ip| ')). ’ * E(:,StartsWith('srv_ip| ')) Graph eigenvalues… many lines …eigs(Adj(A)) Analytics are easy to represent in D4M Pedigree (i.e., the source row ID) is usually lost since analytics are a projection of the data and some information is lost

D4M-20 Introduction Associative Arrays & Adjacency Matrices Database Schemas & Incidence Matrices Examples: Twitter & DNA Summary Outline

D4M-21 Assembled for Text REtrieval Conference (TREC 2011)* –Designed to be a reusable, representative sample of the twittersphere –Many languages Tweets2011 Corpus *McCreadie et al, “On building a reusable Twitter corpus,” ACM SIGIR ,141,812 million tweets sampled during to (16,951 from before) –11,595,844 undeleted tweets at time of scrape ( ) –161,735,518 distinct data entries –5,356,842 unique users –3,513,897 unique handles –519,617 unique hashtags (#) Ben Jabur et al, ACM SAC 2012

D4M-22 Twitter Input Data Mixture of structured (TweetID, User, Status, Time) and unstructured (Text) Fits well into standard D4M Exploded Schema Mixture of structured (TweetID, User, Status, Time) and unstructured (Text) Fits well into standard D4M Exploded Schema

D4M-23 Tweets2011 D4M Schema … Degree Tipo. Você fazer uma plaquinha pra mim, com o nome do FC pra você tirar uma foto, pode fazer isso? Wait :) null … stat|200stat|301stat|302 time|2011- stat|403 word|Tipo. user|Mich... user|bimo... user|Pen... user|… word|Você Accumulo Tables: Tedge/TedgeT Standard exploded schema indexes every unique string in data set TedgeDeg accumulate sums of every unique string TedgeTxt stores original text for viewing purposes time|2011- time|null Colum Key Row Key word|null word|Wait TedgeDeg t Row Key text Row Key TedgeTxt

D4M-24 Word Tallies D4M Code Tdeg = DB('TedgeDeg'); str2num(Tdeg(StartsWith('word|: '),:)) > D4M Result (1.2 seconds, N p = 1) word|: word|:( word|:) word|:D word|:P word|:p Sum table TedgeDeg allows tallies to be seen instantly

D4M-25 Users Who ReTweet the Most Problem Size D4M Code to check size of status codes Tdeg = DB('TedgeDeg'); Tdeg(StartsWith(’stat| '),:)) D4M Results (0.02 seconds, N p = 1) stat| OK stat| Moved permanently stat| ReTweet stat| Protected stat| Deleted tweet stat|408 1Request timeout Sum table TedgeDeg indicates 836K retweets (~5% of total) Small enough to hold all TweeetIDs in memory On boundary between database query and complete file scan

D4M-26 Users Who ReTweet the Most Parallel Database Query D4M Parallel Database Code T = DB('Tedge','TedgeT'); Ar = T(:,'stat|302 '); my = global_ind(zeros(size(Ar,2),1,map([Np 1],{},0:Np-1))); An = Assoc('','',''); N = 10000; for i=my(1):N:my(end) Ai = dblLogi(T(Row(Ar(i:min(i+N,my(end)),:)),:)); An = An + sum(Ai(:,StartsWith('user|,')),1); end Asum = gagg(Asum > 2); D4M Result (130 seconds, N p = 8) user|Puque , user| Say 113, user|carp_fans 115, user|habu_bot 111, user|kakusan_RT 135, user|umaitofu 116 Each processor queries all the retweet TweetIDs and picks a subset Processors each sum all users in their tweets and then aggregate

D4M-27 Users Who ReTweet the Most Parallel File Scan D4M Parallel File Scan Code Nfile = size(fileList); my = global_ind(zeros(Nfile,1,map([Np 1],{},0:Np-1))); An = Assoc('','',''); for i=my load(fileList{i}); An = An + sum(A(Row(A(:,'stat|302,')),StartsWith('user|,')),1); end An = gagg(An > 2); D4M Result (150 seconds, N p = 16) user|Puque , user| Say 113, user|carp_fans 114, user|habu_bot 109, user|kakusan_RT 135, user|umaitofu 114 Each processor picks a subset of files and scans them for retweets Processors each sum all users in their tweets and then aggregate

D4M-28 RNA Reference Set Collected Sample sequence word (10mer) reference sequence ID unknown sequence ID A1A1 A2A2 A 1 A 2 ' reference fungi unknown sample sequence word (10mer) reference sequence ID unknown sequence ID Sequence Matching  Graph  Sparse Matrix Multiply in D4M Associative arrays provide a natural framework for sequence matching Taming Biological Big Data with D4M, Kepner, Ricke & Hutchison, MIT Lincoln Laboratory Journal, Fall 2013

D4M-29 Leveraging “Big Data” Technologies for High Speed Sequence Matching D4M D4M + Accumulo BLAST (industry standard) 100x faster 100x smaller High performance triple store database trades computations for lookups Used Apache Accumulo database to accelerate comparison by 100x Used Lincoln D4M software to reduce code size by 100x

D4M-30 Computing on Masked Data Computing on masked data (CMD) raises the bar on data in the clear Uses lower over head approaches than Fully Homomorphic Encyption (FHE) such as deterministic (DET) encryption and order preserving encryption (OPE) Associative array (D4M) algebra is defined over sets (not real numbers); allows linear algebra to work on DET or OPE data RND: Semantically Secure DET: Deterministic OPE: Order Preserving Encryption CLEAR: No Masking (L=∞) Information Leakage ∞ Compute Overhead CLEARRNDDETOPE Big Data Today FHE CMD MPC

D4M-31 Big data is found across a wide range of areas –Document analysis –Computer network analysis –DNA Sequencing Non-traditional, relaxed consistency, triple store databases are the backbone of many web companies Adjacency matrices, incidence matrices, and associative arrays provides a general, high performance approach for harnessing the power of these databases Summary