D4M-1 Jeremy Kepner MIT Lincoln Laboratory 3 October 2012 Transforming Big Data with D4M This work is sponsored by the Department of the Air Force under.

Slides:

Advertisements

Similar presentations

Three-Step Database Design

Advertisements

Inner Architecture of a Social Networking System Petr Kunc, Jaroslav Škrabálek, Tomáš Pitner.

The IEEE International Conference on Big Data 2013 Arash Fard M. Usman Nisar Lakshmish Ramaswamy John A. Miller Matthew Saltz Computer Science Department.

D4M-1 Jeremy Kepner & Vijay Gadepally IPDPS Graph Algorithm Building Blocks Adjacency Matrices, Incidence Matrices, Database Schemas, and Associative Arrays.

Data Management in the Cloud Paul Szerlip. The rise of data Think about this o For the past two decades, the largest generator of data was humans -- now.

Andrew Prout, William Arcand, David Bestor, Chansup Byun, Bill Bergeron, Matthew Hubbell, Jeremy Kepner, Peter Michaleas, Julie Mullen, Albert Reuther,

BigData Tools Seyyed mohammad Razavi. Outline  Introduction  Hbase  Cassandra  Spark  Acumulo  Blur  MongoDB  Hive  Giraph  Pig.

FAST FORWARD WITH MICROSOFT BIG DATA Vinoo Srinivas M Solutions Specialist Windows Azure (Hadoop, HPC, Media)

Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Matthew Hubbell, Jeremy Kepner, Andrew McCabe, Peter Michaleas, Julie Mullen, David O’Gwynn,

Massive Graph Visualization: LDRD Final Report Sandia National Laboratories Sand Printed October 2007.

Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.

Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.

Jeremy Boyd Director – Mindscape MSDN Regional Director

Distribution Statement A. Approved for public release; distribution is unlimited. Test and Evaluation/Science and Technology Program Rapid Data Analyzer.

IST Databases and DBMSs Todd S. Bacastow January 2005.

D4M – Signal Processing On Databases

Ch 4. The Evolution of Analytic Scalability

Systems analysis and design, 6th edition Dennis, wixom, and roth

The McGraw-Hill Companies, Inc Information Technology & Management Thompson Cats-Baril Chapter 3 Content Management.

Web-Enabled Decision Support Systems

CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.

MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.

GUS: 0262 Fundamentals of GIS Lecture Presentation 3: Relational Data Model Jeremy Mennis Department of Geography and Urban Studies Temple University.

Module 5 Planning for SQL Server® 2008 R2 Indexing.

File Systems and Databases Lecture 1. Files and Databases File: A collection of records or documents dealing with one organization, person, area or subject.

Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,

C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.

5-1 McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.

Haney - 1 HPEC 9/28/2004 MIT Lincoln Laboratory pMatlab Takes the HPCchallenge Ryan Haney, Hahn Kim, Andrew Funk, Jeremy Kepner, Charles Rader, Albert.

CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P5-1 Large Graph Mining: Power Tools and a Practitioner’s guide Task 5: Graphs over time & tensors Faloutsos,

Database Concepts Track 3: Managing Information using Database.

1 Limitations of BLAST Can only search for a single query (e.g. find all genes similar to TTGGACAGGATCGA) What about more complex queries? “Find all genes.

By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.

CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.

Visualization Four groups Design pattern for information visualization

Web Technologies Lecture 13 Introduction to cloud computing.

Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies

Improving User Access to Metadata for Public and Restricted Use US Federal Statistical Files William C. Block Jeremy Williams Lars Vilhuber Carl Lagoze.

1 Geog 357: Data models and DBMS. Geographic Decision Making.

Chapter 9: Web Services and Databases Title: NiagaraCQ: A Scalable Continuous Query System for Internet Databases Authors: Jianjun Chen, David J. DeWitt,

NoSQL: Graph Databases. Databases Why NoSQL Databases?

Mining of Massive Datasets Edited based on Leskovec’s from

Lustre, Hadoop, Accumulo Jeremy Kepner 1,2,3, William Arcand 1, David Bestor 1, Bill Bergeron 1, Chansup Byun 1, Lauren Edwards 1, Vijay Gadepally 1,2,

CSC 212 – Data Structures Lecture 28: More Hash and Dictionaries.

Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit

Neo4j: GRAPH DATABASE 27 March, 2017

CPSC-310 Database Systems

Image taken from: slideshare

NoSQL: Graph Databases

Big Data is a Big Deal!.

SNS COLLEGE OF TECHNOLOGY

Large Graph Mining: Power Tools and a Practitioner’s guide

SQL Server 2017 Graph Database Inside-Out

Open Source distributed document DB for an enterprise

Every Good Graph Starts With

architecting the DIGITAL enterprise

Spark Presentation.

= Michael M. Wolf, Sandia National Laboratories

Relational Algebra Chapter 4, Part A

Matthew Lyon James Montgomery Lucas Scotta 13 October 2010

Ch 4. The Evolution of Analytic Scalability

Relational Algebra Chapter 4, Sections 4.1 – 4.2

Overview of big data tools

Lecture 1 File Systems and Databases.

Database Systems Summary and Overview

Course Introduction CSC 576: Data Mining.

Charles Tappert Seidenberg School of CSIS, Pace University

What's New in eCognition 9

D4M for Julia Alex Chen.

Presentation transcript:

D4M-1 Jeremy Kepner MIT Lincoln Laboratory 3 October 2012 Transforming Big Data with D4M This work is sponsored by the Department of the Air Force under Air Force Contract #FA C Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

D4M-2 Nicholas Arcolano Michelle Beard Bob Bond Josh Haines Matthew Schmidt Ben Miller Benjamin O’Gwynn Tamara Yu Bill Arcand Bill Bergeron Acknowledgements David Bestor Chansup Byun Matt Hubbell Pete Michaleas Julie Mullen Andy Prout Albert Reuther Tony Rosa Charles Yee Dylan Hutchinson

D4M-3 Introduction Theory Results Summary Outline

D4M-4 Cross-Mission Challenge: Detection of subtle patterns in massive multi-source noisy datasets Example Applications of Graph Analytics Cyber Graphs represent communication patterns of computers on a network 1,000,000s – 1,000,000,000s network events GOAL: Detect cyber attacks or malicious software Graphs represent communication patterns of computers on a network 1,000,000s – 1,000,000,000s network events GOAL: Detect cyber attacks or malicious software Social Graphs represent relationships between individuals or documents 10,000s – 10,000,000s individual and interactions GOAL: Identify hidden social networks Graphs represent relationships between individuals or documents 10,000s – 10,000,000s individual and interactions GOAL: Identify hidden social networks Graphs represent entities and relationships detected through multi-INT sources 1,000s – 1,000,000s tracks and locations GOAL: Identify anomalous patterns of life Graphs represent entities and relationships detected through multi-INT sources 1,000s – 1,000,000s tracks and locations GOAL: Identify anomalous patterns of life ISR

D4M-5 Four Ecosystems Dominate Cloud Computing Enterprise Big DataDBMS Big Compute Each ecosystem is at the center of a multi-$B market Pros/cons of each are numerous; diverging hardware/software Some missions can exist wholly in one ecosystem; some can’t Each ecosystem is at the center of a multi-$B market Pros/cons of each are numerous; diverging hardware/software Some missions can exist wholly in one ecosystem; some can’t - Interactive - On-demand - Elastic - High performance - Parallel Languages - Scientific computing - Java - Map/Reduce - Easy admin - Indexing - Search - Security

D4M-6 LLGrid MapReduce provides map/reduce interface in a big compute environment D4M provides an interactive parallel scientific computing environment to databases LLGrid MapReduce provides map/reduce interface in a big compute environment D4M provides an interactive parallel scientific computing environment to databases LLGridEnterprise Big DataDBMS - Interactive - On-demand - Elastic - High performance - Parallel Languages - Scientific computing - Java - Map/Reduce - Easy admin - Indexing - Search - Security Big Compute MapReduce Four Ecosystems Dominate Cloud Computing

D4M-7 Big Data + Big Compute Challenge Database Worldview “It’s the data!” Delivering data is the end Supercomputing Worldview “It’s the computer!” Delivering data is the start Shared Compute Separate DataSeparate Compute Shared Data Database and supercomputing views are fundamentally different Have never coexisted; do not know how to coexist Big Data “Analytics” are forcing them together Current standard practice duplicates hardware and data Database and supercomputing views are fundamentally different Have never coexisted; do not know how to coexist Big Data “Analytics” are forcing them together Current standard practice duplicates hardware and data

D4M-8 Big Data + Big Compute Stack High Level Composable API: D4M (“Databases for Matlab”) Weak Signatures, Noisy Data, Dynamics Novel Analytics for: Text, Cyber, Bio Interactive Super- computing High Performance Computing: LLGrid + Hadoop Distributed Database/ Distributed File System Distributed Database: Accumulo (triple store) A C E B Array Algebra Combining Big Compute and Big Data enables entirely new domains

D4M-9 High Level Language: D4M Distributed Database Query: Alice Bob Cathy David Earl Query: Alice Bob Cathy David Earl Associative Arrays Numerical Computing Environment D4M Dynamic Distributed Dimensional Data Model A C D E B A D4M query returns a sparse matrix or a graph… …for statistical signal processing or graph analysis in MATLAB D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization

D4M-10 Introduction Theory –Associate Arrays –Incidence Matrix Results Summary Outline

D4M-11 What are Spreadsheets and Big Tables? Spreadsheets Big Tables Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?) Big Tables (Google, Amazon, …) store most of the analyzed data in the world (Exabytes?) Simultaneous diverse data: strings, dates, integers, reals, … Simultaneous diverse uses: matrices, functions, hash tables, databases, … No formal mathematical basis; Zero papers in AMA or SIAM

D4M-12 D4M Key Concept: Associative Arrays Unify Four Abstractions Extends associative arrays to 2D and mixed data types A('alice ','bob ') = 'cited ' or A('alice ','bob ') = 47.0 Key innovation: 2D is 1-to-1 with triple store ('alice ','bob ','cited ') or('alice ','bob ',47.0) xATxATx ATAT  alice bob alice carl bob carl cited

D4M-13 Key innovation: mathematical closure –All associative array operations return associative arrays Enables composable mathematical operations A + B A - B A & B A|B A*B Enables composable query operations via array indexing A('alice bob ',:) A('alice ',:) A('al* ',:) A('alice : bob ',:) A(1:2,:) A == 47.0 Simple to implement in a library (~2000 lines) in programming environments with: 1 st class support of 2D arrays, operator overloading, sparse linear algebra Composable Associative Arrays Complex queries with ~50x less effort than Java/SQL Naturally leads to high performance parallel implementation

D4M-14 Associative Array Definitions High level usage dictated by these definitions Deeper algebraic properties set by the collision function f() Frequent switching between “algebras” (how spreadsheets are used)

D4M-15 Associative arrays can be constructed from a few definitions Similar to linear algebra, but applicable to a wider range of data Key questions –Which linear algebra properties do apply to associative arrays (intuitive) –Which linear algebra properties do not apply to associative arrays (watch out) –Which associative array properties do not apply to linear algebra (new) Theory Questions Linear Algebra watch out Associative Arrays intuitive new

D4M-16 Book: “Graph Algorithms in the Language of Linear Algebra” Editors: Kepner (MIT-LL) and Gilbert (UCSB) Contributors: –Bader (Ga Tech) –Bliss (MIT-LL) –Bond (MIT-LL) –Dunlavy (Sandia) –Faloutsos (CMU) –Fineman (CMU) –Gilbert (USCB) –Heitsch (Ga Tech) –Hendrickson (Sandia) –Kegelmeyer (Sandia) –Kepner (MIT-LL) –Kolda (Sandia) –Leskovec (CMU) –Madduri (Ga Tech) –Mohindra (MIT-LL) –Nguyen (MIT) –Radar (MIT-LL) –Reinhardt (Microsoft) –Robinson (MIT-LL) –Shah (USCB) References

D4M-17 Introduction Theory –Associate Arrays –Incidence Matrix Results Summary Outline

D4M-18 Digraphs are Black & White

D4M-19 The World is Color Artist: Ann Pibal; Painting: “XCRS”

D4M-20 5 Edge Colors Artist: Ann Pibal; Painting: “XCRS” Blue Silver Green Orange Pink

D4M Vertices Artist: Ann Pibal; Painting: “XCRS” V2 V1 V3 V4 V5 V6 V7 V8 V9 V10 V11 V13 V14V12 V15 V16 V17 V18 V19 V20

D4M-22 1 Isolated Standard Edge Artist: Ann Pibal; Painting: “XCRS” P4

D4M Multi Edges Artist: Ann Pibal; Painting: “XCRS” B1,S1,G1,O1,O2,P1 B2,S2,G2,O3,O4,P2

D4M Hyper Edges Artist: Ann Pibal; Painting: “XCRS” B1,S1,G1,O1,O2,P1 B2,S2,G2,O3,O4,P2 O5 P3 B1,S1,G1,O1,O2,P1 B2,S2,G2,O3,O4,P2 P5 P6 P7 P8

D4M Edge Orderings Artist: Ann Pibal; Painting: “XCRS” O5 P3 B1,S1,G1,O1,O2,P1 B2,S2,G2,O3,O4,P2 P5 P6 P7 P8 O5 < P3,P6,P7,P8 O5 < B1,S1,G1,O1,O2,P1 O5 < B2,S2,G2,O3,O4,P2 < P7,P8

D4M Standard Multi Edges Artist: Ann Pibal; Painting: “XCRS” O5x5 P3x3 (B1,S1,G1,O1,O2,P1)x2 (B2,S2,G2,O3,O4,P2)x4 P5x2 P6x2 P7x2 P8x2

D4M-27 Summary Observations Artist: Ann Pibal; Painting: “XCRS” Standard edge representation fragments hyper edges –Information is lost Digraph representation compresses multi-edges –Information is lost Matrix representation drops edge labels –Information is lost Standard graph representation drops edge order –Information is lost Need edge representation that preserves information

D4M-28 Solution: Incidence Matrix Artist: Ann Pibal; Painting: “XCRS” EdgeColorOrder V01V02V03V04V05V06V07V08V09V10V11V12V13V14V15V16V17V18V19V20 B1Blue2 111 S1Silver2 111 G1Green2 111 O1Orange2 111 O2Orange2 111 P1Pink2 111 B2Blue S2Silver G2Green O3Orange O4Orange P2Pink O5Orange P3Pink P4Pink2 11 P5Pink2 111 P6Pink2 111 P7Pink3 111 P8Pink3 111

D4M-29 Introduction Theory Results –Network monitoring example –Bioinformatics example Summary Outline

D4M-30 Graph Construction Using D4M: Explode Schema Distributed Database Raw Data CSV Files Assoc. Arrays log_idsrc_ipserver_ip Dense Table src_ip| src_ip| server_ip| server_ip| server_ip| log_id| log_id| log_id| Exploded Table Use as row indices Create columns for each unique type/value pair

D4M-31 Graph Construction Using D4M: Storing Exploded Data as Triples Distributed Database Raw Data CSV Files Assoc. Arrays src_ip| src_ip| server_ip| server_ip| server_ip| log_id| log_id| log_id| Exploded Table D4M stores the triple data representing both the exploded table and its transpose Table Triples Table Transpose Triples RowColumnValue log_id|001src_ip| log_id|001server_ip| log_id|002src_ip| log_id|002server_ip| log_id|003src_ip| log_id|003server_ip| RowColumnValue server_ip| log_id|0021 server_ip| log_id|0011 server_ip| log_id|0031 src_ip| log_id|0011 src_ip| log_id|0031 src_ip| log_id|0021

D4M-32 Graph Construction Using D4M: Construct Associative Arrays Distributed Database Raw Data CSV Files Assoc. Arrays D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); (‘log_id|001’,‘time_stamp|11/May/2011:09:52:53’,1) (‘log_id|002’,‘time_stamp|12/May/2011:13:24:11’,1) (‘log_id|003’,‘time_stamp|13/May/2011:11:05:12’,1)... (‘log_id|001’,‘time_stamp|11/May/2011:09:52:53’,1) (‘log_id|002’,‘time_stamp|12/May/2011:13:24:11’,1) (‘log_id|003’,‘time_stamp|13/May/2011:11:05:12’,1)...

D4M-33 Graph Construction Using D4M: Construct Associative Arrays Distributed Database Raw Data CSV Files Assoc. Arrays D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); D4M Query #2 data = T(Row(keys), :); D4M Query #2 data = T(Row(keys), :); (‘log_id|001’,‘server_ip| ’,1) (‘log_id|001’,‘src_ip| ’,1) (‘log_id|001’,‘time_stamp|11/May/2011:09:52:53’,1)... (‘log_id|002’,‘server_ip| ’,1) (‘log_id|002’,‘src_ip| ’,1) (‘log_id|002’,‘time_stamp|12/May/2011:13:24:11’,1)... (‘log_id|003’,‘server_ip| ’,1) (‘log_id|003’,‘src_ip| ’,1) (‘log_id|003’,‘time_stamp|13/May/2011:11:05:12’,1)... (‘log_id|001’,‘server_ip| ’,1) (‘log_id|001’,‘src_ip| ’,1) (‘log_id|001’,‘time_stamp|11/May/2011:09:52:53’,1)... (‘log_id|002’,‘server_ip| ’,1) (‘log_id|002’,‘src_ip| ’,1) (‘log_id|002’,‘time_stamp|12/May/2011:13:24:11’,1)... (‘log_id|003’,‘server_ip| ’,1) (‘log_id|003’,‘src_ip| ’,1) (‘log_id|003’,‘time_stamp|13/May/2011:11:05:12’,1)...

D4M-34 Graph Construction Using D4M: Construct Associative Arrays Distributed Database Raw Data CSV Files Assoc. Arrays D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); D4M Query #2 data = T(Row(keys), :); D4M Query #2 data = T(Row(keys), :); Associative Array Algebra G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’); Associative Array Algebra G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’); (‘src_ip| ’,‘server_ip| ’,1) (‘src_ip| ’,‘server_ip| ’,1) (‘src_ip| ’,‘server_ip| ’,1)... (‘src_ip| ’,‘server_ip| ’,1) (‘src_ip| ’,‘server_ip| ’,1) (‘src_ip| ’,‘server_ip| ’,1)... Distributed Database Raw Data CSV Files Assoc. Arrays

D4M-35 Graphs can be constructed with minimal effort using D4M queries and associative array algebra Graph Construction Using D4M: Construct Associative Arrays Distributed Database Raw Data CSV Files Assoc. Arrays D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); D4M Query #2 data = T(Row(keys), :); D4M Query #2 data = T(Row(keys), :); Associative Array Algebra G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’); Associative Array Algebra G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’); Adj(G);

D4M-36 Accumulo Ingestion Scalability Study LLGrid MapReduce With A Python Application Data #1: 5 GB of 200 files Data #2: 30 GB of 1000 files 4 Mil e/s Accumulo Database: 1 Master + 7 Tablet servers

D4M-37 Introduction Theory Results –Network monitoring example –Bioinformatics example Summary Outline

D4M-38 Relative Cost per DNA Sequence Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program Available at: Accessed 03/08/2012 High Volume Sequencer Big Data Energy Efficient Portable Sequencer

D4M-39 Example Disease Outbreak May-July Virulent E. Coli Outbreak Germany Sequencing and crowd source analysis showed promising potential -> Still too slow Conclusions: Identification of E. Coli source too late to have substantial impact on illnesses Publishing sequence data allowed for broad community to fully characterize pathogen DNA Sequence released Outbreak identified Spanish Cucumbers implicated Sprouts Identified Deaths EHEC final report diarrhea kidney

D4M-40 RNA Reference Set Collected Sample sequence word (10mer) reference sequence ID unknown sequence ID A1A1 A2A2 A 1 A 2 ' reference bacteria unknown bacteria sequence word (10mer) reference sequence ID unknown sequence ID Sequence Matching  Graph  Sparse Matrix Multiply in D4M Associative arrays provide a natural framework for sequence matching

D4M-41 Database Automatically Computes Reference 10mer Distribution 50% 5% 0.5%

D4M-42 Leveraging “Big Data” Technologies for High Speed Sequence Matching D4M D4M + Triple Store BLAST 100x faster 100x smaller

D4M-43 Big data is found across a wide range of areas –Document analysis –Computer network analysis –DNA Sequencing Currently there is a gap in big data analysis tools for algorithm developers D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation Summary