D4M – Signal Processing On Databases


1 D4M – Signal Processing On Databases
42 Sydney St Artarmon NSW 2064 Australia Virtualnation

2 Starting with Big Data Why care?
Within your reach: big data and big compute on a budget. Start with data and apply math. D4M with Accumulo: new technology from MIT and NSA that claims to require 100x less code and to be 100x faster than other approaches. Fundamentally mathematical analysis for big data.
Notes: Lift the lid. MIT number 20 paper reference. The MIT / NSA paper is of both academic and practical interest. Analysis is part of the big data landscape (NICTA big data). This is a new piece of technology we have come across and want to present to you.

3 Understand the world through data and math
How do you want to understand the world? IT approaches have evolved from a past where IT was expensive and controlled by the few. Problems were modeled and constrained not only to fit onto limited computers but also to fit in with the politics of the enterprise. If you could observe without built-in constraints and preconceived bias, how would you approach computing? Understand through the scientific method: data and math.
Notes: How do you want to understand the world? Through public opinion, what others tell you to think, or dogma? Do you want to be an information factory worker or a software scientist / engineer? In the past IT was expensive, and IT dollars and power were controlled by the few. Object databases were blasphemy. ORM was introduced on the premise that data must be in columns with narrow relations; this was only really done so that OO technology could be made politically acceptable to entrenched IT. Real data isn't columns in rows. We modeled to constrain problems so that we could apply computing technology. Discover and explore things that are not obvious, and find patterns at much lower cost. How did we get here? If you were going to look at the world without constraint, how would you approach computing?

4 The Primordial Web (92)
Diagram: Client, Server, Database. Labels: Browser (html), Server (http), Language, Database (sql). Flows: http put, http get, SQL, data. Gopher.
Browser GUI? HTTP for files? Perl for analysis? SQL for data? A lot of work just to view data. Won't catch on.

5 The Modern Web
Diagram: Client, Server, Database. Labels: Game (data), Server (http), Language, Database (triples). Flows: http put, http get, java, data.
Game GUI! HTTP for files? Perl for analysis? Triples for data! A lot of work to view a lot of data. Great view. Massive data.
Notes: Explain what a triple store is, how it differs from an RDBMS, and why it is important now. Why HBase / Accumulo are different from Sybase: a relational store versus a store for unstructured data. Schema on read is loose. Disadvantages. Why is the modern web not sufficient?
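To make the contrast with a relational table concrete, here is a minimal Python sketch of a triple store with schema on read. It assumes triples of the form (row, column, value) held in a plain list; the record names and fields are made up for illustration, and none of this is the Accumulo or HBase API.

```python
# Hypothetical records; note that they do not share a fixed set of fields.
triples = [
    ("doc1", "author", "alice"),
    ("doc1", "place", "sydney"),
    ("doc2", "author", "bob"),
    ("doc2", "org", "acme"),      # doc2 carries a field that doc1 lacks
]

def read_row(store, row):
    """Schema on read: assemble a record from whatever triples exist for the row."""
    return {col: val for r, col, val in store if r == row}

def rows_with(store, col, val):
    """Find the rows that carry a given column/value pair."""
    return sorted(r for r, c, v in store if c == col and v == val)

# A new kind of field needs no table change, only a new triple.
triples.append(("doc1", "person", "carol"))

doc1 = read_row(triples, "doc1")
doc2 = read_row(triples, "doc2")
```

The record shape is decided when the data is read, not when it is stored, which is what lets the store accept heterogeneous data without up-front modeling.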

6 Future Web?
Diagram: Client, Server, Database. Labels: Game (data), Server (http), Language, Database (triples). Flows: http put, http get, java, data.
Game GUI! Fileserver for files! D4M for analysis! Triples for data! A little work to view a lot of data. Securely. Great view. Massive data.

7 Big Data and Big Compute on a budget
~$9K: server with 256GB RAM, 32 CPU cores and 1.7TB SSD
~$26K: 270TB storage server
$199: 4TB USB drive
ZFS / SmartOS as a free virtualization technology
~68TB: the entire transactional corpus of a $45B Australian retailer
How big are your possible data sets?
Notes: Process. When we talk about big data we need fault tolerance, availability and real-time processing. It is not just a big data problem. Bigger calls for better approaches. Using a smaller cluster: you don't need 1000 cores or more. Applying this doesn't mean you need to be humongous.

8 Apache Accumulo
NSA's Bigtable implementation, now a top-level Apache project. Cell-level security to support privacy and need-to-know. Supports large-scale processing of sparse matrices…
Notes: Accumulo is the world's fastest open source database and supports triples. MIT's newest results show 100M inserts per second. Being fast also makes Accumulo cheaper to run. Almost all serious big data applications will inevitably bump into the problem of security and need-to-know. Sqrrl is a commercial packaging of Accumulo that adds support for graph processing and a document store. Apart from features, it significantly lowers the barriers to entry for developers. As we will discuss shortly, Accumulo also supports large-scale processing of sparse matrices (like a spreadsheet), which enables taking a different perspective on data sets.
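As an illustration of what cell-level security means, the following Python sketch attaches a set of required labels to every cell and filters a scan by the reader's authorizations. This is a deliberately simplified model: it assumes a cell is visible only when the reader holds all of its labels, whereas Accumulo's own visibility expressions are richer boolean expressions. The labels and data are hypothetical, and this is not the Accumulo API.

```python
# Each cell: (row, column, required_labels, value). All names are hypothetical.
cells = [
    ("cust42", "name",    frozenset(),                   "J. Smith"),
    ("cust42", "address", frozenset({"pii"}),            "42 Example St"),
    ("cust42", "balance", frozenset({"pii", "finance"}), 1200),
]

def scan(store, authorizations):
    """Return only the cells whose required labels are all held by the reader."""
    auths = frozenset(authorizations)
    return {
        (row, col): val
        for row, col, required, val in store
        if required <= auths   # subset test: reader holds every required label
    }

public_view = scan(cells, [])
support_view = scan(cells, ["pii"])
finance_view = scan(cells, ["pii", "finance"])
```

Because the check is applied per cell rather than per table, two readers scanning the same row can see different columns of it.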

9 Packaged into a secure production configuration

10 Parallel Architecture
Diagram: Parallel Warehouse Scale Computer. Parallel architecture: CPUs, each with RAM and disk, connected by a network switch. Memory hierarchy and the unit of memory exchanged between adjacent levels: Registers, (instruction operands), Cache, (blocks), Local Memory, (messages), Remote Memory, (pages), Disk / SSD. Implications: bandwidth, latency, programmability, capacity.

Operation                                Time (nsec)
L1 cache reference
Branch mispredict                        5
L2 cache reference                       7
Mutex lock/unlock                        25
Main memory reference                    100
Compress 1K bytes with Zippy             3,000
Send 2K bytes over 1 Gbps network        20,000
Read 1MB sequentially from memory        250,000
Roundtrip within same datacenter         500,000
Disk seek                                ,000,000
Read 1MB sequentially from disk          20,000,000
Send packet CA -> Netherlands -> CA      150,000,000
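A few ratios help in reading the latency figures on this slide. The short Python sketch below uses only the rows whose values survive intact in the transcript; the 1 TB extrapolation assumes a single disk streaming at the quoted sequential rate, which is an illustrative assumption rather than a measurement.

```python
# Latencies in nanoseconds, copied from the slide (complete rows only).
NS = {
    "main_memory_reference": 100,
    "read_1mb_memory": 250_000,
    "datacenter_roundtrip": 500_000,
    "read_1mb_disk": 20_000_000,
}

def ratio(slow, fast):
    """How many times longer the slow operation takes than the fast one."""
    return NS[slow] / NS[fast]

disk_vs_memory = ratio("read_1mb_disk", "read_1mb_memory")
roundtrips_per_disk_read = ratio("read_1mb_disk", "datacenter_roundtrip")

# Hours to stream 1 TB sequentially from one disk at the quoted rate (assumption).
MB_PER_TB = 1_000_000
hours_per_tb_disk = NS["read_1mb_disk"] * MB_PER_TB / 1e9 / 3600
```

On these figures, reading sequentially from disk is 80 times slower than reading from memory, which is one reason a single server with a large amount of RAM can stand in for a much larger cluster.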

11 Starting with Big Data
Now cheap to collect all data forever. Unconstrained approach to data acquisition. No analysis up front or modeling. Much of it involves graph analytics:
ISR. GOAL: Identify anomalous patterns of life
Social. GOAL: Identify hidden social networks
Cyber. GOAL: Detect cyber attacks or malicious software
Notes: Up-front modeling is costly and risky. You have to cope with the changing needs of the business, and the market is changing. You need tools that can adapt to change in the data model itself. You don't have to spend time figuring out the data model; the system can be put into production straight away.

12 D4M - Signal Processing on Database
Novel analytics for: text, cyber, bio. Weak signatures, noisy data, dynamics.
Stack:
High-level composable API: D4M ("Databases for Matlab")
Distributed database / distributed file system: Accumulo / HBase (triple store)
High performance computing: cluster + Hadoop
Interactive super-computing
Notes: While the paradigm of querying a database has enjoyed a long history, it breaks down at scale. Why? If a database can ingest 100M inserts per second, the results have changed by the time the query has executed. The query is an insufficient concept. A different approach to database use is that of signal processing: continually look for signals of interest in streams of data and bring these to the forefront. D4M is a complete stack to enable interactive big data supercomputing.
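The difference between a one-off query and continuous detection can be sketched in a few lines of Python. The detector below keeps running counts over a stream and raises an alert the first time a key reaches a threshold. The event stream and the threshold are hypothetical, and this simple counting rule is only a stand-in for a real detector; it is not how D4M itself is implemented.

```python
from collections import Counter

def detect(stream, threshold):
    """Yield (position, key) the first time a key's running count reaches the threshold."""
    counts = Counter()
    for position, key in enumerate(stream):
        counts[key] += 1
        if counts[key] == threshold:
            yield position, key

# Hypothetical event stream, e.g. source identifiers seen in arriving records.
events = ["a", "b", "a", "c", "a", "b", "a", "b"]
alerts = list(detect(events, threshold=3))
```

The detector never re-reads old data and never waits for the stream to finish, so its answer stays current while inserts keep arriving.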

13 Detection Theory
Subgraph detection challenges: longer integration period.
Knowledge extraction can be thought of as a detection problem.
- Many techniques can be applied (many techniques can be thought of in this context)
- One of the key challenges is finding the right model

14 Matlab Demo - Reuters Corpus V1 (NIST)
810,000 Reuters news items. The demonstration picked 70,000 of them and found 13,000 entities. A is a 70K x 13K associative array with 500K entries. The D4M demonstration comes with a sample set from Reuters.
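The figures on this slide show how sparse the array A is. A quick Python calculation, using only the rounded numbers quoted above:

```python
rows = 70_000        # news items picked for the demonstration
cols = 13_000        # entities found
entries = 500_000    # stored (nonzero) entries in A

cells = rows * cols                    # size of the equivalent dense matrix
density = entries / cells              # fraction of cells that hold a value
avg_entities_per_item = entries / rows
```

A dense 70K x 13K matrix would have 910 million cells, so roughly 0.05% of them are populated, or about seven entities per news item on average.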

15 7 Universal Constructs for Analytics
The CIA reports 7 universal constructs for analytics. These include People, Places, etc. The Reuters corpus is broken into these sections.

16 Multi-Dimensional Associative Array
Key innovation: mathematical closure. All associative array operations return associative arrays.
Enables composable mathematical operations: A + B, A - B, A & B, A|B, A*B
Enables composable query operations via array indexing
Complex queries with ~50x less effort than Java / SQL
Naturally leads to high performance parallel implementation
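Closure is easiest to see in code. The toy Python class below is a sketch, not D4M: it stores a sparse mapping from (row key, column key) to a number, and every operation returns another instance of the same class, so operations and queries can be chained. The class name `Assoc`, the exact meaning given to `&` and `|`, and the sample data are assumptions made for illustration; Python's `@` operator stands in for the matrix product the slide writes as A*B.

```python
class Assoc:
    """Toy associative array: sparse mapping from (row key, column key) to a number."""

    def __init__(self, entries=None):
        # Drop zeros so the array stays sparse.
        self.data = {k: v for k, v in (entries or {}).items() if v != 0}

    def __add__(self, other):
        keys = self.data.keys() | other.data.keys()
        return Assoc({k: self.data.get(k, 0) + other.data.get(k, 0) for k in keys})

    def __sub__(self, other):
        keys = self.data.keys() | other.data.keys()
        return Assoc({k: self.data.get(k, 0) - other.data.get(k, 0) for k in keys})

    def __and__(self, other):
        # Assumed semantics: 1 where both arrays have an entry.
        return Assoc({k: 1 for k in self.data.keys() & other.data.keys()})

    def __or__(self, other):
        # Assumed semantics: 1 where either array has an entry.
        return Assoc({k: 1 for k in self.data.keys() | other.data.keys()})

    def __matmul__(self, other):
        # Sparse matrix product: sum over the shared inner key.
        out = {}
        for (r, k), a in self.data.items():
            for (k2, c), b in other.data.items():
                if k == k2:
                    out[(r, c)] = out.get((r, c), 0) + a * b
        return Assoc(out)

    def transpose(self):
        return Assoc({(c, r): v for (r, c), v in self.data.items()})

    def col(self, name):
        """Query by column key; the result is again an Assoc."""
        return Assoc({k: v for k, v in self.data.items() if k[1] == name})


# Hypothetical document-by-entity arrays.
A = Assoc({("doc1", "alice"): 1, ("doc1", "sydney"): 1, ("doc2", "alice"): 1})
B = Assoc({("doc2", "alice"): 1, ("doc2", "acme"): 1})

total = A + B
both = A & B
cooccur = A.transpose() @ A          # entity-by-entity co-occurrence counts
alice_docs = (A + B).col("alice")    # a query composed with an arithmetic result
```

Because `col` and the arithmetic operators all return an `Assoc`, an expression such as `(A + B).col("alice")` needs no glue code between the "math" step and the "query" step.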

17 Universal Exploded Schema

18 Numerical Computing Environment
D4M stores giant sparse matrices in the Accumulo triple store database.
Diagram: Triple Store Distributed Database; D4M (Dynamic Distributed Dimensional Data Model) Associative Arrays; Numerical Computing Environment. Graph with vertices A, B, C, D, E. Query: T(:,ggaatctgcc)
A D4M query returns a sparse matrix or graph from a triple store for statistical signal processing or graph analysis in Matlab. Triple stores are high performance distributed databases for heterogeneous data.
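The query on this slide, T(:,ggaatctgcc), selects every row that has an entry in the column named ggaatctgcc. The Python sketch below mimics that selection over an in-memory list of triples and returns the result as a sparse matrix in dict-of-dicts form. The sequence ids, the other column keys and the counts are hypothetical, and the `query` function is an illustration rather than the D4M or Accumulo API.

```python
# Hypothetical triples: (sequence id, 10-mer, count).
T = [
    ("seqA", "ggaatctgcc", 1),
    ("seqA", "ttgacctagg", 2),
    ("seqB", "ggaatctgcc", 3),
    ("seqC", "acgtacgtac", 1),
]

def query(store, row=None, col=None):
    """Select triples by row and/or column key (None plays the role of ':')
    and return them as a sparse matrix in dict-of-dicts form."""
    result = {}
    for r, c, v in store:
        if (row is None or r == row) and (col is None or c == col):
            result.setdefault(r, {})[c] = v
    return result

hits = query(T, col="ggaatctgcc")   # analogue of T(:,ggaatctgcc)
```

Only the rows that contain the requested key appear in the result, so the returned matrix stays small however large the underlying store is.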

19 Big Data for High Speed Sequence Matching
Real world example.

