Published by Jarred Wattles. Modified over 9 years ago.
1
D4M-1 Transforming Big Data with D4M
Jeremy Kepner, MIT Lincoln Laboratory, 3 October 2012
This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.
2
D4M-2 Acknowledgements
Nicholas Arcolano, Michelle Beard, Bob Bond, Josh Haines, Matthew Schmidt, Ben Miller, Benjamin O’Gwynn, Tamara Yu, Bill Arcand, Bill Bergeron, David Bestor, Chansup Byun, Matt Hubbell, Pete Michaleas, Julie Mullen, Andy Prout, Albert Reuther, Tony Rosa, Charles Yee, Dylan Hutchinson
3
D4M-3 Outline
- Introduction
- Theory
- Results
- Summary
4
D4M-4 Example Applications of Graph Analytics
Cross-Mission Challenge: Detection of subtle patterns in massive multi-source noisy datasets
- Cyber: Graphs represent communication patterns of computers on a network. 1,000,000s – 1,000,000,000s network events. GOAL: Detect cyber attacks or malicious software
- Social: Graphs represent relationships between individuals or documents. 10,000s – 10,000,000s individuals and interactions. GOAL: Identify hidden social networks
- ISR: Graphs represent entities and relationships detected through multi-INT sources. 1,000s – 1,000,000s tracks and locations. GOAL: Identify anomalous patterns of life
5
D4M-5 Four Ecosystems Dominate Cloud Computing
- Enterprise: Interactive, On-demand, Elastic
- Big Data: Java, Map/Reduce, Easy admin
- DBMS: Indexing, Search, Security
- Big Compute: High performance, Parallel Languages, Scientific computing
Each ecosystem is at the center of a multi-$B market
Pros/cons of each are numerous; diverging hardware/software
Some missions can exist wholly in one ecosystem; some can’t
6
D4M-6 Four Ecosystems Dominate Cloud Computing
Diagram labels: LLGrid, MapReduce, Enterprise, Big Data, DBMS, Big Compute
Ecosystem attributes: Interactive, On-demand, Elastic; High performance, Parallel Languages, Scientific computing; Java, Map/Reduce, Easy admin; Indexing, Search, Security
LLGrid MapReduce provides a map/reduce interface in a big compute environment
D4M provides an interactive parallel scientific computing environment to databases
7
D4M-7 Big Data + Big Compute Challenge
Database Worldview: “It’s the data!” Delivering data is the end
Supercomputing Worldview: “It’s the computer!” Delivering data is the start
Diagram labels: Shared Compute, Separate Data; Separate Compute, Shared Data
- Database and supercomputing views are fundamentally different
- Have never coexisted; do not know how to coexist
- Big Data “Analytics” are forcing them together
- Current standard practice duplicates hardware and data
8
D4M-8 Big Data + Big Compute Stack
Stack layers:
- Novel Analytics for: Text, Cyber, Bio
- High Level Composable API: D4M (“Databases for Matlab”)
- High Performance Computing: LLGrid + Hadoop
- Distributed Database: Accumulo (triple store)
Side labels: Weak Signatures, Noisy Data, Dynamics; Array Algebra; Interactive Supercomputing; Distributed Database/Distributed File System
Combining Big Compute and Big Data enables entirely new domains
9
D4M-9 High Level Language: D4M
http://www.mit.edu/~kepner/D4M
D4M: Dynamic Distributed Dimensional Data Model
Diagram labels: Distributed Database; Query: Alice, Bob, Cathy, David, Earl; Associative Arrays; Numerical Computing Environment
A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB
D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
10
D4M-10 Outline
- Introduction
- Theory
  –Associative Arrays
  –Incidence Matrix
- Results
- Summary
11
D4M-11 What are Spreadsheets and Big Tables?
- Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?)
- Big Tables (Google, Amazon, …) store most of the analyzed data in the world (Exabytes?)
- Simultaneous diverse data: strings, dates, integers, reals, …
- Simultaneous diverse uses: matrices, functions, hash tables, databases, …
- No formal mathematical basis; Zero papers in AMA or SIAM
12
D4M-12 D4M Key Concept: Associative Arrays Unify Four Abstractions
- Extends associative arrays to 2D and mixed data types
  A('alice ','bob ') = 'cited '  or  A('alice ','bob ') = 47.0
- Key innovation: 2D is 1-to-1 with triple store
  ('alice ','bob ','cited ')  or  ('alice ','bob ',47.0)
[Figure labels: x, A^T x, A^T; vertices alice, bob, carl; edge label cited]
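The 1-to-1 correspondence between a 2D associative array and a triple store can be sketched in a few lines. This is a minimal Python illustration, not the D4M API: the class name `Assoc` and its methods are hypothetical, and the second entry ('alice', 'carl', 47.0) is an invented example value.

```python
class Assoc:
    """Hypothetical 2D associative array stored as {(row, col): value}."""

    def __init__(self, triples=()):
        self.data = {}
        for row, col, val in triples:
            self.data[(row, col)] = val

    def __getitem__(self, key):
        return self.data[key]

    def __setitem__(self, key, val):
        self.data[key] = val

    def triples(self):
        # Each stored entry maps to exactly one (row, col, value) triple.
        items = sorted(self.data.items(), key=lambda kv: kv[0])
        return [(r, c, v) for (r, c), v in items]


A = Assoc()
A["alice", "bob"] = "cited"   # string value
A["alice", "carl"] = 47.0     # numeric value in the same array

# Round trip: array -> triples -> array gives back the same entries.
B = Assoc(A.triples())
print(A.triples())
```

Because every entry is exactly one triple, storing the array in a triple store and reading it back loses nothing.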
13
D4M-13 Composable Associative Arrays
- Key innovation: mathematical closure
  –All associative array operations return associative arrays
- Enables composable mathematical operations: A + B, A - B, A & B, A|B, A*B
- Enables composable query operations via array indexing: A('alice bob ',:), A('alice ',:), A('al* ',:), A('alice : bob ',:), A(1:2,:), A == 47.0
- Simple to implement in a library (~2000 lines) in programming environments with 1st-class support of 2D arrays, operator overloading, sparse linear algebra
- Complex queries with ~50x less effort than Java/SQL
- Naturally leads to high performance parallel implementation
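Closure is what makes these operations chain: every operation returns another associative array, so a sum can feed directly into a query. The sketch below illustrates the idea in Python under stated assumptions: `Assoc`, `rows`, and the choice of `min` as the collision function for `&` are hypothetical and not taken from the D4M library.

```python
class Assoc:
    """Hypothetical sparse 2D associative array with numeric values."""

    def __init__(self, data=None):
        self.data = dict(data or {})

    def __add__(self, other):
        # Union of keys; values at colliding keys are summed.
        out = dict(self.data)
        for key, val in other.data.items():
            out[key] = out.get(key, 0) + val
        return Assoc(out)

    def __and__(self, other):
        # Intersection of keys; min is one possible collision function.
        return Assoc({k: min(v, other.data[k])
                      for k, v in self.data.items() if k in other.data})

    def rows(self, prefix):
        # Query by row-key prefix, loosely analogous to A('al* ',:).
        return Assoc({k: v for k, v in self.data.items()
                      if k[0].startswith(prefix)})


A = Assoc({("alice", "bob"): 1, ("alfred", "bob"): 2, ("bob", "carl"): 3})
B = Assoc({("alice", "bob"): 10})

# Each step returns an Assoc, so operations and queries compose.
C = (A + B).rows("al")
D = A & B
print(C.data, D.data)
```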
14
D4M-14 Associative Array Definitions
- High level usage dictated by these definitions
- Deeper algebraic properties set by the collision function f()
- Frequent switching between “algebras” (how spreadsheets are used)
15
D4M-15 Theory Questions
- Associative arrays can be constructed from a few definitions
- Similar to linear algebra, but applicable to a wider range of data
- Key questions
  –Which linear algebra properties do apply to associative arrays (intuitive)
  –Which linear algebra properties do not apply to associative arrays (watch out)
  –Which associative array properties do not apply to linear algebra (new)
[Diagram labels: Linear Algebra (watch out), overlap (intuitive), Associative Arrays (new)]
16
D4M-16 References
Book: “Graph Algorithms in the Language of Linear Algebra”
Editors: Kepner (MIT-LL) and Gilbert (UCSB)
Contributors:
–Bader (Ga Tech)
–Bliss (MIT-LL)
–Bond (MIT-LL)
–Dunlavy (Sandia)
–Faloutsos (CMU)
–Fineman (CMU)
–Gilbert (UCSB)
–Heitsch (Ga Tech)
–Hendrickson (Sandia)
–Kegelmeyer (Sandia)
–Kepner (MIT-LL)
–Kolda (Sandia)
–Leskovec (CMU)
–Madduri (Ga Tech)
–Mohindra (MIT-LL)
–Nguyen (MIT)
–Rader (MIT-LL)
–Reinhardt (Microsoft)
–Robinson (MIT-LL)
–Shah (UCSB)
17
D4M-17 Outline
- Introduction
- Theory
  –Associative Arrays
  –Incidence Matrix
- Results
- Summary
18
D4M-18 Digraphs are Black & White
19
D4M-19 The World is Color Artist: Ann Pibal; Painting: “XCRS”
20
D4M-20 5 Edge Colors Artist: Ann Pibal; Painting: “XCRS” Blue Silver Green Orange Pink
21
D4M-21 20 Vertices Artist: Ann Pibal; Painting: “XCRS” V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
22
D4M-22 1 Isolated Standard Edge Artist: Ann Pibal; Painting: “XCRS” P4
23
D4M-23 12 Multi Edges Artist: Ann Pibal; Painting: “XCRS” B1,S1,G1,O1,O2,P1 B2,S2,G2,O3,O4,P2
24
D4M-24 18 Hyper Edges Artist: Ann Pibal; Painting: “XCRS” B1,S1,G1,O1,O2,P1 B2,S2,G2,O3,O4,P2 O5 P3 P5 P6 P7 P8
25
D4M-25 27 Edge Orderings Artist: Ann Pibal; Painting: “XCRS” O5 P3 B1,S1,G1,O1,O2,P1 B2,S2,G2,O3,O4,P2 P5 P6 P7 P8 O5 < P3,P6,P7,P8 O5 < B1,S1,G1,O1,O2,P1 O5 < B2,S2,G2,O3,O4,P2 < P7,P8
26
D4M-26 52 Standard Multi Edges Artist: Ann Pibal; Painting: “XCRS” O5x5 P3x3 (B1,S1,G1,O1,O2,P1)x2 (B2,S2,G2,O3,O4,P2)x4 P5x2 P6x2 P7x2 P8x2
27
D4M-27 Summary Observations
Artist: Ann Pibal; Painting: “XCRS”
- Standard edge representation fragments hyper edges
  –Information is lost
- Digraph representation compresses multi-edges
  –Information is lost
- Matrix representation drops edge labels
  –Information is lost
- Standard graph representation drops edge order
  –Information is lost
- Need edge representation that preserves information
28
D4M-28 Solution: Incidence Matrix
Artist: Ann Pibal; Painting: “XCRS”
Columns: Edge, Color, Order, then one column per vertex V01–V20 holding a 1 for each vertex on the edge
Edge  Color   Order  Number of 1s across V01–V20
B1    Blue    2      3
S1    Silver  2      3
G1    Green   2      3
O1    Orange  2      3
O2    Orange  2      3
P1    Pink    2      3
B2    Blue    2      5
S2    Silver  2      5
G2    Green   2      5
O3    Orange  2      5
O4    Orange  2      5
P2    Pink    2      5
O5    Orange  1      6
P3    Pink    2      4
P4    Pink    2      2
P5    Pink    2      3
P6    Pink    2      3
P7    Pink    3      3
P8    Pink    3      3
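A small sketch may help show why an incidence matrix keeps what an adjacency matrix drops. The Python below is illustrative only: the edge names and colors follow the slide, but the vertex memberships are invented for the example, and `adjacency` is a hypothetical helper that computes the off-diagonal part of E^T E for an incidence matrix E.

```python
# Hypothetical edges: name -> (color, order, vertices on the edge).
# Vertex memberships are invented for illustration.
edges = {
    "P4": ("Pink", 2, ["V01", "V02"]),           # standard edge
    "B1": ("Blue", 2, ["V02", "V03", "V04"]),    # hyper edge
    "S1": ("Silver", 2, ["V02", "V03", "V04"]),  # multi edge: same vertices as B1
    "O5": ("Orange", 1, ["V01", "V03"]),
}

# Incidence matrix as sparse entries: one row per edge, one column per vertex.
incidence = {(e, v): 1 for e, (_, _, vs) in edges.items() for v in vs}


def adjacency(inc):
    """Off-diagonal part of E^T E: counts edges shared by each vertex pair."""
    by_edge = {}
    for (edge, vertex) in inc:
        by_edge.setdefault(edge, []).append(vertex)
    adj = {}
    for vertices in by_edge.values():
        for a in vertices:
            for b in vertices:
                if a != b:
                    adj[(a, b)] = adj.get((a, b), 0) + 1
    return adj


adj = adjacency(incidence)
# B1 and S1 collapse into a single count; their labels, colors and order are gone.
print(adj[("V02", "V03")])
```

The incidence entries still identify B1 and S1 as separate three-vertex edges with their own color and order, while the adjacency form only records that V02 and V03 share two edges.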
29
D4M-29 Outline
- Introduction
- Theory
- Results
  –Network monitoring example
  –Bioinformatics example
- Summary
30
D4M-30 Graph Construction Using D4M: Explode Schema
Pipeline stages: Raw Data CSV Files, Distributed Database, Assoc. Arrays
Dense Table
log_id  src_ip       server_ip
001     128.0.0.1    208.29.69.138
002     192.168.1.2  157.166.255.18
003     128.0.0.1    74.125.224.72
Exploded Table (log_id values are used as row indices; a column is created for each unique type/value pair)
            src_ip|128.0.0.1  src_ip|192.168.1.2  server_ip|157.166.255.18  server_ip|208.29.69.138  server_ip|74.125.224.72
log_id|001  1                 0                   0                         1                        0
log_id|002  0                 1                   1                         0                        0
log_id|003  1                 0                   0                         0                        1
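The explode step can be sketched as follows. This is a minimal Python illustration rather than D4M code: the function name `explode` and the inline CSV text are assumptions, with the three rows copied from the dense table on the slide.

```python
import csv
import io

# The dense table from the slide, written as CSV text.
RAW = """log_id,src_ip,server_ip
001,128.0.0.1,208.29.69.138
002,192.168.1.2,157.166.255.18
003,128.0.0.1,74.125.224.72
"""


def explode(rows, key_field):
    """Turn each dense row into (row_key, 'field|value', 1) triples."""
    triples = []
    for row in rows:
        row_key = f"{key_field}|{row[key_field]}"
        for field, value in row.items():
            if field != key_field:
                # One column per unique type/value pair.
                triples.append((row_key, f"{field}|{value}", 1))
    return triples


triples = explode(csv.DictReader(io.StringIO(RAW)), "log_id")
columns = sorted({col for _, col, _ in triples})
print(len(triples), len(columns))
```

Three log lines with two fields each give six nonzero entries spread over five distinct columns, matching the exploded table.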
31
D4M-31 Graph Construction Using D4M: Storing Exploded Data as Triples
Pipeline stages: Raw Data CSV Files, Distributed Database, Assoc. Arrays
D4M stores the triple data representing both the exploded table and its transpose
Table Triples
Row         Column                    Value
log_id|001  src_ip|128.0.0.1          1
log_id|001  server_ip|208.29.69.138   1
log_id|002  src_ip|192.168.1.2        1
log_id|002  server_ip|157.166.255.18  1
log_id|003  src_ip|128.0.0.1          1
log_id|003  server_ip|74.125.224.72   1
Table Transpose Triples
Row                       Column      Value
server_ip|157.166.255.18  log_id|002  1
server_ip|208.29.69.138   log_id|001  1
server_ip|74.125.224.72   log_id|003  1
src_ip|128.0.0.1          log_id|001  1
src_ip|128.0.0.1          log_id|003  1
src_ip|192.168.1.2        log_id|002  1
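Storing the transpose alongside the table means a lookup by column value becomes a lookup by row key. The Python sketch below illustrates this with sorted lists and binary search standing in for a sorted key-value store; `scan` is a hypothetical helper, not an Accumulo or D4M call.

```python
from bisect import bisect_left

triples = [
    ("log_id|001", "src_ip|128.0.0.1", 1),
    ("log_id|001", "server_ip|208.29.69.138", 1),
    ("log_id|002", "src_ip|192.168.1.2", 1),
    ("log_id|002", "server_ip|157.166.255.18", 1),
    ("log_id|003", "src_ip|128.0.0.1", 1),
    ("log_id|003", "server_ip|74.125.224.72", 1),
]

# Both copies are kept sorted by their leading (row) key.
table = sorted(triples)
table_t = sorted((col, row, val) for row, col, val in triples)


def scan(sorted_triples, key):
    """Return all triples whose row key equals `key`, found by binary search."""
    i = bisect_left(sorted_triples, (key,))
    out = []
    while i < len(sorted_triples) and sorted_triples[i][0] == key:
        out.append(sorted_triples[i])
        i += 1
    return out


# Which log lines mention this source IP? A row scan on the transpose answers it.
hits = scan(table_t, "src_ip|128.0.0.1")
print(hits)
```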
32
D4M-32 Graph Construction Using D4M: Construct Associative Arrays
Pipeline stages: Raw Data CSV Files, Distributed Database, Assoc. Arrays
D4M Query #1
keys = T(:,['time_stamp|10/May/2011:00:00:00,:,' ...
            'time_stamp|13/May/2011:23:59:59,']);
Result:
('log_id|001','time_stamp|11/May/2011:09:52:53',1)
('log_id|002','time_stamp|12/May/2011:13:24:11',1)
('log_id|003','time_stamp|13/May/2011:11:05:12',1)
...
33
D4M-33 Graph Construction Using D4M: Construct Associative Arrays
Pipeline stages: Raw Data CSV Files, Distributed Database, Assoc. Arrays
D4M Query #1
keys = T(:,['time_stamp|10/May/2011:00:00:00,:,' ...
            'time_stamp|13/May/2011:23:59:59,']);
D4M Query #2
data = T(Row(keys), :);
Result:
('log_id|001','server_ip|208.29.69.138',1)
('log_id|001','src_ip|128.0.0.1',1)
('log_id|001','time_stamp|11/May/2011:09:52:53',1)
...
('log_id|002','server_ip|157.166.255.18',1)
('log_id|002','src_ip|192.168.1.2',1)
('log_id|002','time_stamp|12/May/2011:13:24:11',1)
...
('log_id|003','server_ip|74.125.224.72',1)
('log_id|003','src_ip|128.0.0.1',1)
('log_id|003','time_stamp|13/May/2011:11:05:12',1)
...
34
D4M-34 Graph Construction Using D4M: Construct Associative Arrays
Pipeline stages: Raw Data CSV Files, Distributed Database, Assoc. Arrays
D4M Query #1
keys = T(:,['time_stamp|10/May/2011:00:00:00,:,' ...
            'time_stamp|13/May/2011:23:59:59,']);
D4M Query #2
data = T(Row(keys), :);
Associative Array Algebra
G = data(:,'src_ip|*').' * data(:,'server_ip|*');
Result:
('src_ip|128.0.0.1','server_ip|208.29.69.138',1)
('src_ip|128.0.0.1','server_ip|74.125.224.72',1)
('src_ip|192.168.1.2','server_ip|157.166.255.18',1)
...
35
D4M-35 Graph Construction Using D4M: Construct Associative Arrays
Pipeline stages: Raw Data CSV Files, Distributed Database, Assoc. Arrays
D4M Query #1
keys = T(:,['time_stamp|10/May/2011:00:00:00,:,' ...
            'time_stamp|13/May/2011:23:59:59,']);
D4M Query #2
data = T(Row(keys), :);
Associative Array Algebra
G = data(:,'src_ip|*').' * data(:,'server_ip|*');
Adj(G);
Graphs can be constructed with minimal effort using D4M queries and associative array algebra
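The algebra step, the transpose of the src_ip columns multiplied by the server_ip columns, can be traced by hand on the three example log lines. The following Python is a sketch of that sparse product using plain dictionaries; `select` and `transpose_times` are hypothetical helper names, not D4M functions.

```python
# Exploded entries for the three example log lines, keyed by (row, column).
data = {
    ("log_id|001", "src_ip|128.0.0.1"): 1,
    ("log_id|001", "server_ip|208.29.69.138"): 1,
    ("log_id|001", "time_stamp|11/May/2011:09:52:53"): 1,
    ("log_id|002", "src_ip|192.168.1.2"): 1,
    ("log_id|002", "server_ip|157.166.255.18"): 1,
    ("log_id|002", "time_stamp|12/May/2011:13:24:11"): 1,
    ("log_id|003", "src_ip|128.0.0.1"): 1,
    ("log_id|003", "server_ip|74.125.224.72"): 1,
    ("log_id|003", "time_stamp|13/May/2011:11:05:12"): 1,
}


def select(A, prefix):
    """Keep only columns whose key starts with `prefix`."""
    return {k: v for k, v in A.items() if k[1].startswith(prefix)}


def transpose_times(A, B):
    """Sparse A^T * B: the shared row key (log_id) is summed out."""
    b_by_row = {}
    for (row, col), val in B.items():
        b_by_row.setdefault(row, []).append((col, val))
    out = {}
    for (row, col), val in A.items():
        for col2, val2 in b_by_row.get(row, []):
            out[(col, col2)] = out.get((col, col2), 0) + val * val2
    return out


G = transpose_times(select(data, "src_ip|"), select(data, "server_ip|"))
for edge, weight in sorted(G.items()):
    print(edge, weight)
```

Each nonzero entry of G is an edge from a source IP to a server IP, weighted by the number of log lines that contain both.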
36
D4M-36 Accumulo Ingestion Scalability Study
LLGrid MapReduce with a Python application
Accumulo Database: 1 Master + 7 Tablet servers
Data #1: 5 GB in 200 files
Data #2: 30 GB in 1000 files
Plot annotation: 4 Mil e/s
37
D4M-37 Outline
- Introduction
- Theory
- Results
  –Network monitoring example
  –Bioinformatics example
- Summary
38
D4M-38 Relative Cost per DNA Sequence
Plot labels: High Volume Sequencer; Big Data; Energy Efficient Portable Sequencer
Source: Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts. Accessed 03/08/2012
39
D4M-39 Example Disease Outbreak
May-July 2011: Virulent E. coli outbreak in Germany
Sequencing and crowd-sourced analysis showed promising potential, but were still too slow
Conclusions:
- Identification of the E. coli source came too late to have substantial impact on illnesses
- Publishing sequence data allowed the broad community to fully characterize the pathogen
Timeline labels: DNA Sequence released; Outbreak identified; Spanish Cucumbers implicated; Sprouts identified
Plot labels: Deaths; diarrhea; kidney
Source: www.rki.de, EHEC final report
40
D4M-40 Sequence Matching Graph Sparse Matrix Multiply in D4M
- RNA Reference Set (reference bacteria): associative array A1, reference sequence ID by sequence word (10mer)
- Collected Sample (unknown bacteria): associative array A2, unknown sequence ID by sequence word (10mer)
- Product A1 A2': reference sequence ID by unknown sequence ID
Associative arrays provide a natural framework for sequence matching
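The product A1 A2' counts, for every pair of reference and unknown sequences, how many 10-mers they share. The Python below is a small sketch of that idea: the sequence strings and IDs are invented, and `kmers`, `assoc`, and `times_transpose` are hypothetical helpers rather than D4M functions.

```python
def kmers(seq, k=10):
    """Set of all length-k words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assoc(seqs, k=10):
    """Sparse array keyed by (sequence ID, word) with value 1."""
    return {(sid, word): 1 for sid, s in seqs.items() for word in kmers(s, k)}


def times_transpose(A1, A2):
    """Sparse A1 * A2^T: the shared word dimension is summed out."""
    by_word = {}
    for (sid, word), val in A2.items():
        by_word.setdefault(word, []).append((sid, val))
    out = {}
    for (sid, word), val in A1.items():
        for sid2, val2 in by_word.get(word, []):
            out[(sid, sid2)] = out.get((sid, sid2), 0) + val * val2
    return out


# Invented example sequences; real inputs would be far longer.
reference = {"ref1": "ACGTACGTTAGC", "ref2": "TTTTGGGGCCCCAA"}
sample = {"unk1": "GTACGTTAGCAA"}

A1 = assoc(reference)
A2 = assoc(sample)
matches = times_transpose(A1, A2)
print(matches)
```

Here ref1 and unk1 share exactly one 10-mer (GTACGTTAGC), so the product has a single nonzero entry, and ref2 does not appear at all.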
41
D4M-41 Database Automatically Computes Reference 10mer Distribution 50% 5% 0.5%
42
D4M-42 Leveraging “Big Data” Technologies for High Speed Sequence Matching
Plot labels: BLAST; D4M; D4M + Triple Store
Plot annotations: 100x faster; 100x smaller
43
D4M-43 Summary
- Big data is found across a wide range of areas
  –Document analysis
  –Computer network analysis
  –DNA Sequencing
- Currently there is a gap in big data analysis tools for algorithm developers
- D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation