D4M-1 Transforming Big Data with D4M. Jeremy Kepner, MIT Lincoln Laboratory, 3 October 2012. This work is sponsored by the Department of the Air Force under Air Force Contract #FA C. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.
D4M-2 Acknowledgements: Nicholas Arcolano, Michelle Beard, Bob Bond, Josh Haines, Matthew Schmidt, Ben Miller, Benjamin O’Gwynn, Tamara Yu, Bill Arcand, Bill Bergeron, David Bestor, Chansup Byun, Matt Hubbell, Pete Michaleas, Julie Mullen, Andy Prout, Albert Reuther, Tony Rosa, Charles Yee, Dylan Hutchinson
D4M-3 Outline: Introduction; Theory; Results; Summary
D4M-4 Example Applications of Graph Analytics
- ISR: Graphs represent entities and relationships detected through multi-INT sources; 1,000s – 1,000,000s tracks and locations; GOAL: Identify anomalous patterns of life
- Social: Graphs represent relationships between individuals or documents; 10,000s – 10,000,000s individuals and interactions; GOAL: Identify hidden social networks
- Cyber: Graphs represent communication patterns of computers on a network; 1,000,000s – 1,000,000,000s network events; GOAL: Detect cyber attacks or malicious software
Cross-Mission Challenge: Detection of subtle patterns in massive multi-source noisy datasets
D4M-5 Four Ecosystems Dominate Cloud Computing
- Enterprise: Interactive, On-demand, Elastic
- Big Compute: High performance, Parallel Languages, Scientific computing
- Big Data: Java, Map/Reduce, Easy admin
- DBMS: Indexing, Search, Security
Each ecosystem is at the center of a multi-$B market. Pros/cons of each are numerous; hardware and software are diverging. Some missions can exist wholly in one ecosystem; some can’t.
D4M-6 Four Ecosystems Dominate Cloud Computing
- Diagram labels: LLGrid, MapReduce, Enterprise, Big Data, DBMS, Big Compute
- LLGrid MapReduce provides a map/reduce interface in a big compute environment
- D4M provides an interactive parallel scientific computing environment to databases
D4M-7 Big Data + Big Compute Challenge
- Database Worldview: “It’s the data!” Delivering data is the end
- Supercomputing Worldview: “It’s the computer!” Delivering data is the start
- Diagram labels: Shared Compute, Separate Data, Separate Compute, Shared Data
- Database and supercomputing views are fundamentally different
- Have never coexisted; do not know how to coexist
- Big Data “Analytics” are forcing them together
- Current standard practice duplicates hardware and data
D4M-8 Big Data + Big Compute Stack
- Novel Analytics for: Text, Cyber, Bio
- High Level Composable API: D4M (“Databases for Matlab”)
- Distributed Database: Accumulo (triple store)
- High Performance Computing: LLGrid + Hadoop
- Side labels: Weak Signatures, Noisy Data, Dynamics; Array Algebra; Distributed Database/Distributed File System; Interactive Supercomputing
Combining Big Compute and Big Data enables entirely new domains
D4M-9 High Level Language: D4M (Dynamic Distributed Dimensional Data Model)
- Distributed Database; Query: Alice, Bob, Cathy, David, Earl
- Associative Arrays in a Numerical Computing Environment
- A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB
D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
D4M-10 Outline: Introduction; Theory (Associative Arrays, Incidence Matrix); Results; Summary
D4M-11 What are Spreadsheets and Big Tables?
- Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?)
- Big Tables (Google, Amazon, …) store most of the analyzed data in the world (Exabytes?)
- Simultaneous diverse data: strings, dates, integers, reals, …
- Simultaneous diverse uses: matrices, functions, hash tables, databases, …
- No formal mathematical basis; zero papers in AMA or SIAM
D4M-12 D4M Key Concept: Associative Arrays Unify Four Abstractions
- Extends associative arrays to 2D and mixed data types: A('alice ','bob ') = 'cited ' or A('alice ','bob ') = 47.0
- Key innovation: 2D is 1-to-1 with triple store: ('alice ','bob ','cited ') or ('alice ','bob ',47.0)
- Figure labels: x, A^T, A^T x; alice, bob, carl; cited
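The 1-to-1 correspondence between a 2D associative array and a triple store can be illustrated with a minimal Python sketch. This is not D4M code: a plain dict keyed by (row, column) stands in for the associative array, and the function names and the carl entries are assumptions made for illustration.

```python
# Sketch only: a dict keyed by (row, col) stands in for a 2D associative array.
# triples_to_assoc and assoc_to_triples are hypothetical helper names.

def triples_to_assoc(triples):
    """Build {(row, col): value} from (row, col, value) triples."""
    return {(row, col): val for row, col, val in triples}

def assoc_to_triples(assoc):
    """Recover the (row, col, value) triples, sorted by row then column."""
    return sorted((row, col, val) for (row, col), val in assoc.items())

# Values may be strings or numbers, as on the slide.
triples = [
    ("alice", "bob", "cited"),
    ("alice", "carl", 47.0),
    ("bob", "carl", "cited"),
]

A = triples_to_assoc(triples)
round_trip = assoc_to_triples(A)
```

Because each (row, column) pair appears once, converting in either direction loses nothing, which is the property that lets an array be stored directly in a triple store.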
D4M-13 Composable Associative Arrays
- Key innovation: mathematical closure; all associative array operations return associative arrays
- Enables composable mathematical operations: A + B, A - B, A & B, A|B, A*B
- Enables composable query operations via array indexing: A('alice bob ',:), A('alice ',:), A('al* ',:), A('alice : bob ',:), A(1:2,:), A == 47.0
- Simple to implement in a library (~2000 lines) in programming environments with first-class support of 2D arrays, operator overloading, and sparse linear algebra
- Complex queries with ~50x less effort than Java/SQL
- Naturally leads to high performance parallel implementation
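Closure is what makes the operations chain: every function below takes dict-based arrays and returns another one, so a query can be applied directly to the result of an addition. This is a hedged Python sketch, not the D4M library; the function names and sample values are hypothetical, and the intersection marks matches with 1 as a simplifying assumption.

```python
# Sketch only: dicts keyed by (row, col) stand in for associative arrays.
# All function names are hypothetical, not D4M's API.

def add(A, B):
    """A + B: union of keys, summing values where both arrays have an entry."""
    out = dict(A)
    for key, val in B.items():
        out[key] = out.get(key, 0) + val
    return out

def intersect(A, B):
    """A & B: keep keys present in both arrays; 1 marks a match."""
    return {key: 1 for key in A.keys() & B.keys()}

def rows_starting_with(A, prefix):
    """Prefix query on row keys, in the spirit of A('al* ',:)."""
    return {(r, c): v for (r, c), v in A.items() if r.startswith(prefix)}

def rows_in_range(A, first, last):
    """Inclusive range query on row keys, in the spirit of A('alice : bob ',:)."""
    return {(r, c): v for (r, c), v in A.items() if first <= r <= last}

A = {("alice", "bob"): 1.0, ("bob", "carl"): 2.0}
B = {("alice", "bob"): 46.0, ("carl", "alice"): 1.0}

C = add(A, B)                          # still an associative array
al_rows = rows_starting_with(C, "al")  # so it can be queried directly
a_to_b = rows_in_range(C, "alice", "bob")
both = intersect(A, B)
```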
D4M-14 Associative Array Definitions
- High level usage dictated by these definitions
- Deeper algebraic properties set by the collision function f()
- Frequent switching between “algebras” (how spreadsheets are used)
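One way to picture the collision function f() is as the rule applied when two triples land on the same (row, column) key. The Python sketch below is an illustration under that assumption, with a hypothetical build_assoc helper and made-up values; swapping f changes the resulting "algebra" without changing the data.

```python
# Sketch only: build_assoc is a hypothetical helper, not part of D4M.

def build_assoc(triples, collide):
    """Build {(row, col): value}, resolving duplicate keys with collide(old, new)."""
    assoc = {}
    for row, col, val in triples:
        key = (row, col)
        assoc[key] = collide(assoc[key], val) if key in assoc else val
    return assoc

# Two triples share the key ('alice', 'bob'), so f() decides the stored value.
triples = [("alice", "bob", 3), ("alice", "bob", 5), ("bob", "carl", 2)]

summed = build_assoc(triples, lambda old, new: old + new)
smallest = build_assoc(triples, min)
largest = build_assoc(triples, max)
```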
D4M-15 Theory Questions
- Associative arrays can be constructed from a few definitions
- Similar to linear algebra, but applicable to a wider range of data
- Key questions:
  - Which linear algebra properties do apply to associative arrays (intuitive)
  - Which linear algebra properties do not apply to associative arrays (watch out)
  - Which associative array properties do not apply to linear algebra (new)
- Diagram labels: Linear Algebra (watch out), overlap (intuitive), Associative Arrays (new)
D4M-16 References
Book: “Graph Algorithms in the Language of Linear Algebra”
Editors: Kepner (MIT-LL) and Gilbert (UCSB)
Contributors: Bader (Ga Tech), Bliss (MIT-LL), Bond (MIT-LL), Dunlavy (Sandia), Faloutsos (CMU), Fineman (CMU), Gilbert (UCSB), Heitsch (Ga Tech), Hendrickson (Sandia), Kegelmeyer (Sandia), Kepner (MIT-LL), Kolda (Sandia), Leskovec (CMU), Madduri (Ga Tech), Mohindra (MIT-LL), Nguyen (MIT), Rader (MIT-LL), Reinhardt (Microsoft), Robinson (MIT-LL), Shah (UCSB)
D4M-17 Outline: Introduction; Theory (Associative Arrays, Incidence Matrix); Results; Summary
D4M-18 Digraphs are Black & White
D4M-19 The World is Color Artist: Ann Pibal; Painting: “XCRS”
D4M-20 5 Edge Colors: Blue, Silver, Green, Orange, Pink
D4M-21 20 Vertices: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20
D4M-22 1 Isolated Standard Edge: P4
D4M-23 12 Multi Edges, in two groups: (B1,S1,G1,O1,O2,P1) and (B2,S2,G2,O3,O4,P2)
D4M-24 Hyper Edges: O5; P3; (B1,S1,G1,O1,O2,P1); (B2,S2,G2,O3,O4,P2); P5; P6; P7; P8
D4M-25 Edge Orderings
- O5 < P3,P6,P7,P8
- O5 < B1,S1,G1,O1,O2,P1
- O5 < B2,S2,G2,O3,O4,P2 < P7,P8
D4M-26 Standard Multi Edges: O5 x5; P3 x3; (B1,S1,G1,O1,O2,P1) x2; (B2,S2,G2,O3,O4,P2) x4; P5 x2; P6 x2; P7 x2; P8 x2
D4M-27 Summary Observations
- Standard edge representation fragments hyper edges: information is lost
- Digraph representation compresses multi-edges: information is lost
- Matrix representation drops edge labels: information is lost
- Standard graph representation drops edge order: information is lost
Need edge representation that preserves information
D4M-28 Solution: Incidence Matrix
Columns: Edge, Color, Order, then one column per vertex V01 through V20 (a 1 marks each vertex the edge touches)
Edge  Color   Order  Vertex entries
B1    Blue    2      1 1 1
S1    Silver  2      1 1 1
G1    Green   2      1 1 1
O1    Orange  2      1 1 1
O2    Orange  2      1 1 1
P1    Pink    2      1 1 1
B2    Blue
S2    Silver
G2    Green
O3    Orange
O4    Orange
P2    Pink
O5    Orange
P3    Pink
P4    Pink    2      1 1
P5    Pink    2      1 1 1
P6    Pink    2      1 1 1
P7    Pink    3      1 1 1
P8    Pink    3      1 1 1
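A small Python sketch can show why the incidence matrix keeps information that an adjacency matrix drops. Each edge is a row listing the vertices it touches, so hyper edges and parallel edges stay distinct. The vertex assignments below are hypothetical (the slide's column positions are not recoverable), and the adjacency function is an illustrative projection, not D4M code.

```python
from collections import Counter
from itertools import permutations

# Sketch only: one row per edge, holding the set of vertices it touches.
# The vertex assignments are made up for illustration.
incidence = {
    "P4": {"V01", "V02"},          # standard edge: two vertices
    "O5": {"V02", "V03", "V04"},   # hyper edge: three vertices
    "P7": {"V02", "V03", "V05"},   # second hyper edge sharing V02 and V03
}

def adjacency(incidence):
    """Off-diagonal part of E' * E: number of edges shared by each vertex pair."""
    counts = Counter()
    for vertices in incidence.values():
        for u, v in permutations(sorted(vertices), 2):
            counts[(u, v)] += 1
    return counts

adj = adjacency(incidence)
```

The projection records that V02 and V03 share two edges but no longer says which edges, or that each of them also involved a third vertex; the incidence rows still do.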
D4M-29 Outline: Introduction; Theory; Results (Network monitoring example, Bioinformatics example); Summary
D4M-30 Graph Construction Using D4M: Explode Schema
Pipeline stages: Raw Data (CSV Files), Distributed Database, Assoc. Arrays
- Dense Table: columns log_id, src_ip, server_ip
- Exploded Table: rows keyed by log_id|, columns keyed by src_ip| and server_ip|
- Use log_id values as row indices
- Create columns for each unique type/value pair
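The explode step can be sketched in a few lines of Python. This is an illustration under assumptions, not the D4M ingest code: the explode function is hypothetical, and the IP addresses are placeholders from documentation address ranges because the slide's values are not preserved.

```python
# Sketch only: explode() is a hypothetical helper; the IP values are made up.

def explode(records, key_field):
    """Turn dense records into (row, 'field|value', 1) triples."""
    triples = []
    for rec in records:
        row = f"{key_field}|{rec[key_field]}"
        for field, value in rec.items():
            if field != key_field:
                # Each unique type/value pair becomes its own column key.
                triples.append((row, f"{field}|{value}", 1))
    return triples

dense_table = [
    {"log_id": "001", "src_ip": "192.0.2.1", "server_ip": "198.51.100.7"},
    {"log_id": "002", "src_ip": "192.0.2.1", "server_ip": "198.51.100.7"},
    {"log_id": "003", "src_ip": "192.0.2.2", "server_ip": "198.51.100.8"},
]

exploded = explode(dense_table, "log_id")
columns = sorted({col for _, col, _ in exploded})
```

Moving the value into the column key means every value is directly addressable as a column, at the cost of a very wide, very sparse table.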
D4M-31 Graph Construction Using D4M: Storing Exploded Data as Triples
Pipeline stages: Raw Data (CSV Files), Distributed Database, Assoc. Arrays
D4M stores the triple data representing both the exploded table and its transpose
Table Triples
Row          Column       Value
log_id|001   src_ip|
log_id|001   server_ip|
log_id|002   src_ip|
log_id|002   server_ip|
log_id|003   src_ip|
log_id|003   server_ip|
Table Transpose Triples
Row          Column       Value
server_ip|   log_id|002   1
server_ip|   log_id|001   1
server_ip|   log_id|003   1
src_ip|      log_id|001   1
src_ip|      log_id|003   1
src_ip|      log_id|002   1
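A likely motivation for storing the transpose is that a row-sorted key/value store answers row lookups quickly, so keeping both tables turns a column lookup into a row lookup on the transpose. The Python sketch below illustrates that idea with nested dicts; the helper names and IP values are hypothetical, and no real database is involved.

```python
from collections import defaultdict

# Sketch only: nested dicts stand in for row-indexed database tables.

def index_by_row(triples):
    """Group (row, col, value) triples into {row: {col: value}}."""
    table = defaultdict(dict)
    for row, col, val in triples:
        table[row][col] = val
    return dict(table)

def transpose(triples):
    """Swap the row and column of every triple."""
    return [(col, row, val) for row, col, val in triples]

# Placeholder IPs; the slide's values are not preserved.
triples = [
    ("log_id|001", "src_ip|192.0.2.1", 1),
    ("log_id|001", "server_ip|198.51.100.7", 1),
    ("log_id|002", "src_ip|192.0.2.1", 1),
    ("log_id|002", "server_ip|198.51.100.7", 1),
    ("log_id|003", "src_ip|192.0.2.2", 1),
    ("log_id|003", "server_ip|198.51.100.8", 1),
]

T = index_by_row(triples)              # look up by log_id
Tt = index_by_row(transpose(triples))  # look up by src_ip or server_ip

fields_of_log = sorted(T["log_id|003"])
logs_for_server = sorted(Tt["server_ip|198.51.100.7"])
```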
D4M-32 Graph Construction Using D4M: Construct Associative Arrays
Pipeline stages: Raw Data (CSV Files), Distributed Database, Assoc. Arrays
D4M Query #1:
keys = T(:,'time_stamp|10/May/2011:00:00:00,:,time_stamp|13/May/2011:23:59:59,');
Result:
('log_id|001','time_stamp|11/May/2011:09:52:53',1)
('log_id|002','time_stamp|12/May/2011:13:24:11',1)
('log_id|003','time_stamp|13/May/2011:11:05:12',1)
...
D4M-33 Graph Construction Using D4M: Construct Associative Arrays
Pipeline stages: Raw Data (CSV Files), Distributed Database, Assoc. Arrays
D4M Query #2 (using keys from D4M Query #1):
data = T(Row(keys), :);
Result:
('log_id|001','server_ip| ',1)
('log_id|001','src_ip| ',1)
('log_id|001','time_stamp|11/May/2011:09:52:53',1)
...
('log_id|002','server_ip| ',1)
('log_id|002','src_ip| ',1)
('log_id|002','time_stamp|12/May/2011:13:24:11',1)
...
('log_id|003','server_ip| ',1)
('log_id|003','src_ip| ',1)
('log_id|003','time_stamp|13/May/2011:11:05:12',1)
...
D4M-34 Graph Construction Using D4M: Construct Associative Arrays
Pipeline stages: Raw Data (CSV Files), Distributed Database, Assoc. Arrays
Associative Array Algebra (using data from D4M Query #2):
G = data(:,'src_ip|*,').' * data(:,'server_ip|*,');
Result:
('src_ip| ','server_ip| ',1)
('src_ip| ','server_ip| ',1)
('src_ip| ','server_ip| ',1)
...
D4M-35 Graph Construction Using D4M: Construct Associative Arrays
Pipeline stages: Raw Data (CSV Files), Distributed Database, Assoc. Arrays
Final step, after D4M Query #1, D4M Query #2, and the associative array algebra that produces G:
Adj(G);
Graphs can be constructed with minimal effort using D4M queries and associative array algebra
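The three steps (time-range query, row fetch, transpose-multiply) can be traced end to end in a hedged Python sketch. Dicts keyed by (row, column) stand in for the table T, all function names are hypothetical, and the IP addresses and the fourth log record are made up; only the first three time stamps come from the slides. The range test relies on plain string order, which works here only because all stamps fall in the same month.

```python
from collections import defaultdict

# Sketch only: plain-Python stand-ins for the D4M queries, not D4M's API.

def cols_in_range(T, first, last):
    """Entries whose column key lies in [first, last] by string order."""
    return {(r, c): v for (r, c), v in T.items() if first <= c <= last}

def rows_matching(T, rows):
    """All entries of T whose row key is in rows."""
    return {(r, c): v for (r, c), v in T.items() if r in rows}

def cols_starting_with(A, prefix):
    """Entries whose column key starts with prefix."""
    return {(r, c): v for (r, c), v in A.items() if c.startswith(prefix)}

def transpose_times(A, B):
    """A.' * B: sum over shared rows of A[row, a_col] * B[row, b_col]."""
    b_by_row = defaultdict(list)
    for (row, col), val in B.items():
        b_by_row[row].append((col, val))
    G = defaultdict(int)
    for (row, a_col), a_val in A.items():
        for b_col, b_val in b_by_row.get(row, ()):
            G[(a_col, b_col)] += a_val * b_val
    return dict(G)

# Placeholder records: IPs and log 004 are invented for illustration.
logs = [
    ("001", "192.0.2.1", "198.51.100.7", "11/May/2011:09:52:53"),
    ("002", "192.0.2.1", "198.51.100.7", "12/May/2011:13:24:11"),
    ("003", "192.0.2.2", "198.51.100.8", "13/May/2011:11:05:12"),
    ("004", "192.0.2.2", "198.51.100.7", "20/May/2011:08:00:00"),
]
T = {}
for log_id, src, server, stamp in logs:
    row = f"log_id|{log_id}"
    T[(row, f"src_ip|{src}")] = 1
    T[(row, f"server_ip|{server}")] = 1
    T[(row, f"time_stamp|{stamp}")] = 1

# Query #1: which logs fall in the time window?
keys = cols_in_range(T, "time_stamp|10/May/2011:00:00:00",
                     "time_stamp|13/May/2011:23:59:59")
# Query #2: fetch every column of those logs.
data = rows_matching(T, {row for row, _ in keys})
# Algebra: correlate source and server columns through shared log rows.
G = transpose_times(cols_starting_with(data, "src_ip|"),
                    cols_starting_with(data, "server_ip|"))
```

Each entry of G counts how many in-window log records connect a source address to a server address, which is the weighted edge list of the communication graph.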
D4M-36 Accumulo Ingestion Scalability Study
- LLGrid MapReduce with a Python application
- Accumulo database: 1 master + 7 tablet servers
- Data #1: 5 GB in 200 files
- Data #2: 30 GB in 1000 files
- Plot annotation: 4 Mil e/s
D4M-37 Outline: Introduction; Theory; Results (Network monitoring example, Bioinformatics example); Summary
D4M-38 Relative Cost per DNA Sequence
Source: Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Accessed 03/08/2012.
Figure labels: High Volume Sequencer; Big Data; Energy Efficient Portable Sequencer
D4M-39 Example Disease Outbreak: May-July Virulent E. coli Outbreak, Germany
Figure labels: Outbreak identified; Spanish Cucumbers implicated; Sprouts Identified; DNA Sequence released; EHEC final report; Deaths; diarrhea; kidney
Conclusions:
- Identification of E. coli source too late to have substantial impact on illnesses
- Publishing sequence data allowed the broad community to fully characterize the pathogen
Sequencing and crowd source analysis showed promising potential -> Still too slow
D4M-40 Sequence Matching Graph: Sparse Matrix Multiply in D4M
- RNA Reference Set (reference bacteria): A1, reference sequence ID by sequence word (10mer)
- Collected Sample (unknown bacteria): A2, unknown sequence ID by sequence word (10mer)
- A1 A2': reference sequence ID by unknown sequence ID
Associative arrays provide a natural framework for sequence matching
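The product A1 A2' can be sketched in Python to show what each entry means: the number of 10-mer words a reference sequence shares with an unknown sequence. This is an illustration with dict-based sparse matrices, hypothetical function names, and made-up toy sequences; it is not the D4M implementation and ignores real-world issues such as reverse complements and sequencing errors.

```python
from collections import defaultdict

# Sketch only: dicts keyed by (sequence ID, word) stand in for sparse matrices.

def words(seq, k=10):
    """Set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def word_matrix(seqs, k=10):
    """Sequence ID by word incidence: {(seq_id, word): 1}."""
    return {(sid, w): 1 for sid, seq in seqs.items() for w in words(seq, k)}

def times_transpose(A1, A2):
    """A1 * A2': count of shared words for each (A1 row, A2 row) pair."""
    a2_by_word = defaultdict(list)
    for (sid, word), val in A2.items():
        a2_by_word[word].append((sid, val))
    out = defaultdict(int)
    for (sid1, word), val1 in A1.items():
        for sid2, val2 in a2_by_word.get(word, ()):
            out[(sid1, sid2)] += val1 * val2
    return dict(out)

# Toy sequences, made up for illustration.
reference = {"ref1": "ACGTACGTACGTTGCA", "ref2": "TTTTTTTTTTTTTT"}
unknown = {"u1": reference["ref1"][2:14]}  # a 12-base fragment of ref1

A1 = word_matrix(reference)
A2 = word_matrix(unknown)
matches = times_transpose(A1, A2)
```

A 12-base fragment contains three 10-mers, so the fragment scores 3 against the reference it was cut from and has no entry at all against the unrelated reference, which is why the result stays sparse.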
D4M-41 Database Automatically Computes Reference 10mer Distribution
Plot annotations: 50%, 5%, 0.5%
D4M-42 Leveraging “Big Data” Technologies for High Speed Sequence Matching
Compared: BLAST, D4M, D4M + Triple Store
Plot annotations: 100x faster, 100x smaller
D4M-43 Summary
- Big data is found across a wide range of areas: document analysis, computer network analysis, DNA sequencing
- Currently there is a gap in big data analysis tools for algorithm developers
- D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation