D4M – Signal Processing On Databases


D4M – Signal Processing On Databases
42 Sydney St, Artarmon NSW 2064, Australia
Virtualnation

Starting with Big Data
Why care? Big data and big compute are within your reach, on a budget. Start with the data and apply math. D4M with Accumulo is new technology from MIT and the NSA that claims to require 100x less code and to run 100x faster than other approaches: a fundamentally mathematical analysis for big data.
This talk lifts the lid on a new piece of technology we have come across and presents it to you. The MIT/NSA papers are of both academic and practical interest, and D4M is part of the broader big data landscape alongside efforts such as NICTA's big data work.

Understand the world through data and math
How do you want to understand the world? Through public opinion, what others tell you to think, or dogma? Do you want to be an information factory worker or a software scientist/engineer?
IT approaches have evolved from a past where IT was expensive and its dollars and power were controlled by the few. Problems were modelled and constrained not only to fit onto limited computers but to fit in with the politics of the enterprise. Object databases were treated as blasphemy; ORM was introduced on the premise that data must live in columns with narrow relations, largely so OO technology could be made politically acceptable to entrenched IT. But real data isn't rows and columns.
Discovery and exploration surface things that are not obvious, and finding patterns is now much cheaper. If you could observe without built-in constraints and preconceived bias, how would you approach computing? Understand through the scientific method: data and math.

The Primordial Web (1992)
Client / Server / Database: a browser (HTML) talks HTTP (put data, get data) to a server, alongside Gopher, with a scripting language in the middle and an SQL database behind it.
Browser GUI? HTTP for files? Perl for analysis? SQL for data? A lot of work just to view data. Won't catch on.

The Modern Web
Client / Server / Database: a game-style client (Java) talks HTTP (put data, get data) to a server, with a triple store behind it.
How does a triple store differ from an RDBMS, and why is it important now? Stores such as HBase and Accumulo differ from a relational store like Sybase: they hold unstructured data and apply a loose schema on read rather than a rigid schema on write. Why is the modern web not sufficient?
Game GUI! HTTP for files? Perl for analysis? Triples for data! A lot of work to view a lot of data. Great view. Massive data.
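The triple-store idea above can be sketched in a few lines of Python. This is a hypothetical illustration of the concept, not Accumulo's or HBase's actual API: every fact is a (row, column, value) triple, no schema is declared up front, and rows need not share columns ("schema on read").

```python
# Minimal triple-store sketch: every fact is a (row, column, value) triple.
from collections import defaultdict

class TripleStore:
    def __init__(self):
        self.triples = []                # flat list of (row, col, val)
        self.by_row = defaultdict(list)  # simple index for row lookups

    def put(self, row, col, val):
        self.triples.append((row, col, val))
        self.by_row[row].append((col, val))

    def get_row(self, row):
        """All (column, value) pairs for a row; columns may differ per row."""
        return self.by_row[row]

store = TripleStore()
store.put("doc1", "word|big", 1)       # no schema: new columns cost nothing
store.put("doc1", "word|data", 1)
store.put("doc2", "author|kepner", 1)  # a completely different column set

print(store.get_row("doc1"))  # [('word|big', 1), ('word|data', 1)]
```

Contrast with an RDBMS, where both rows would have to fit one declared table schema before any data could be loaded.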

Future Web?
Client / Server / Database: the same game-style client (Java) and HTTP server, still backed by a triple store.
Game GUI! Fileserver for files! D4M for analysis! Triples for data! A little work to view a lot of data, securely. Great view. Massive data.

Big Data and Big Compute on a budget
~$9K buys a server with 256 GB RAM, 32 CPU cores and 1.7 TB of SSD. ~$26K buys a 270 TB storage server. $199 buys a 4 TB USB drive. ZFS / SmartOS provide free virtualization. For scale: the entire transactional corpus of a $45B Australian retailer is roughly 68 TB. How big are your possible data sets?
Big data is not only a volume problem: fault tolerance, availability and real-time processing matter too. And applying big data techniques doesn't mean you need a humongous 1000-core cluster; a smaller cluster with better approaches often suffices.
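To put those prices in perspective, a quick back-of-the-envelope cost-per-terabyte calculation using the slide's own figures:

```python
# Rough cost-per-TB from the figures on the slide: (price USD, capacity TB).
options = {
    "270TB storage server": (26_000, 270),
    "4TB USB drive": (199, 4),
}

for name, (price, tb) in options.items():
    print(f"{name}: ~${price / tb:.0f}/TB")
```

Either way, storage lands in the tens of dollars per terabyte, which is what makes "collect all data forever" economically plausible.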

Apache Accumulo
NSA's BigTable implementation, now a top-level Apache project. Cell-level security supports privacy and need-to-know. Supports large-scale processing of sparse matrices.
Accumulo is the world's fastest open-source database and supports triple stores; MIT's newest results show 100M inserts per second. Being fast also makes Accumulo cheaper to run. Almost all serious big data applications inevitably bump into the problem of security and need-to-know. Sqrrl is a commercial packaging of Accumulo that adds support for graph processing and a document store; beyond the extra features, it significantly lowers the barriers to entry for developers. As we will discuss shortly, Accumulo also supports large-scale processing of sparse matrices (like a spreadsheet), enabling a different perspective on data sets.
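Cell-level security means every key/value pair carries its own visibility label, and a scan only returns cells the reader is authorized to see. A simplified sketch of the idea: Accumulo's real labels are boolean expressions over authorization tokens, but this toy version handles only a conjunction of tokens joined by "&".

```python
# Each cell carries a visibility label. Simplified: a label "hr&finance"
# means the reader needs both the "hr" and "finance" tokens.
cells = [
    ("alice", "phone", "555-0100", "public"),
    ("alice", "salary", "120000", "hr&finance"),
    ("bob", "salary", "95000", "hr&finance"),
]

def scan(cells, auths):
    auths = set(auths)
    for row, col, val, label in cells:
        required = set(label.split("&"))
        if required <= auths:  # reader holds every required token
            yield (row, col, val)

print(list(scan(cells, {"public"})))                   # only public cells
print(list(scan(cells, {"public", "hr", "finance"})))  # everything
```

The point is that need-to-know filtering happens per cell at scan time, rather than per table or per application.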

Packaged into a secure production configuration

Parallel Architecture
A parallel warehouse-scale computer has a memory hierarchy. CPU registers hold instruction operands; caches hold blocks; local RAM holds messages exchanged over the network switch; remote RAM holds pages; and disk/SSD sit at the bottom. Bandwidth, latency, programmability and capacity all change as you move down the hierarchy.

Operation                                  Time (nsec)
L1 cache reference                                 0.5
Branch mispredict                                    5
L2 cache reference                                   7
Mutex lock/unlock                                   25
Main memory reference                              100
Compress 1 KB with Zippy                         3,000
Send 2 KB over 1 Gbps network                   20,000
Read 1 MB sequentially from memory             250,000
Round trip within same datacenter              500,000
Disk seek                                   10,000,000
Read 1 MB sequentially from disk            20,000,000
Send packet CA -> Netherlands -> CA        150,000,000

See http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
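The implication of those numbers for a database is that sequential, batched I/O beats random access by orders of magnitude, which is why stores like Accumulo favour sorted, scan-friendly layouts. A rough illustration using the latency figures above:

```python
# Compare scanning 1 GB using the latency table's figures.
READ_1MB_MEM_NS = 250_000      # read 1 MB sequentially from memory
READ_1MB_DISK_NS = 20_000_000  # read 1 MB sequentially from disk
SEEK_NS = 10_000_000           # disk seek

gb = 1024  # MB in 1 GB
mem_scan = gb * READ_1MB_MEM_NS
disk_scan = gb * READ_1MB_DISK_NS
seek_heavy = gb * (SEEK_NS + READ_1MB_DISK_NS)  # pay a seek before every MB

print(f"scan 1 GB from memory:  {mem_scan / 1e9:.2f} s")
print(f"scan 1 GB from disk:    {disk_scan / 1e9:.2f} s")
print(f"seek-per-MB from disk:  {seek_heavy / 1e9:.2f} s")
```

Memory is 80x faster than a sequential disk scan, and scattering seeks through the read adds another 50 percent on top; designs that keep access sequential win.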

Starting with Big Data
It is now cheap to collect all data forever, so take an unconstrained approach to data acquisition: no up-front analysis or modelling. Much of the resulting work involves graph analytics, across domains such as ISR (goal: identify anomalous patterns of life), social (goal: identify hidden social networks) and cyber (goal: detect cyber attacks or malicious software).
Up-front modelling is costly and risky, and the business and the market keep changing. You want tools that can adapt to change in the data model itself, rather than spending time figuring out a data model you don't yet have; put the system into production straight away.

D4M – Signal Processing on Databases
Novel analytics for text, cyber and bio data: weak signatures, noisy data, dynamics. The stack: a high-level composable API, D4M ("Databases for Matlab"); a distributed database / distributed file system, with Accumulo or HBase as the triple store; and high-performance computing, a cluster plus Hadoop, for interactive supercomputing.
While the paradigm of querying a database has enjoyed a long history, it breaks down at scale. Why? If a database can ingest 100M entries per second, the results have changed by the time the query finishes executing; query is an insufficient concept. A different approach to database use is that of signal processing: continually look for signals of interest in streams of data and bring them to the forefront. D4M is a complete stack enabling interactive big data supercomputing.

Detection Theory
Knowledge extraction can be thought of as a detection problem, so many signal-detection techniques can be applied in this context. Subgraph detection poses its own challenges, such as the longer integration period required, and one of the key challenges is finding the right model.

Matlab Demo – Reuters Corpus V1 (NIST)
The corpus contains 810,000 Reuters news items. The demonstration picks 70,000 of them and finds 13,000 entities, giving A, a 70K x 13K associative array with 500K entries. The D4M distribution comes with a sample set from Reuters.

7 Universal Constructs for Analytics
The CIA reports seven universal constructs for analytics, including people, places and so on. The Reuters corpus is broken into these sections.

Multi-Dimensional Associative Array
Key innovation: mathematical closure. All associative array operations return associative arrays. This enables composable mathematical operations (A + B, A - B, A & B, A | B, A * B) and composable query operations via array indexing: complex queries take ~50x less effort than in Java/SQL, and the model leads naturally to a high-performance parallel implementation.
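The closure property can be sketched with a toy associative array in Python. This is a hypothetical illustration of the D4M idea, not its Matlab API: a string-keyed sparse array where every operation, including a query, returns another associative array, so expressions compose.

```python
# Toy associative array: sparse 2-D array keyed by string pairs.
# Every operation returns another Assoc, so operations and queries compose.
class Assoc:
    def __init__(self, entries):
        # entries: dict mapping (row_key, col_key) -> numeric value
        self.d = {k: v for k, v in entries.items() if v != 0}

    def __add__(self, other):
        keys = self.d.keys() | other.d.keys()
        return Assoc({k: self.d.get(k, 0) + other.d.get(k, 0) for k in keys})

    def __and__(self, other):  # element-wise min, one way to realise A & B
        keys = self.d.keys() & other.d.keys()
        return Assoc({k: min(self.d[k], other.d[k]) for k in keys})

    def row(self, r):  # query by indexing, like A(r, :) in D4M
        return Assoc({k: v for k, v in self.d.items() if k[0] == r})

A = Assoc({("doc1", "word|big"): 1, ("doc1", "word|data"): 1})
B = Assoc({("doc1", "word|big"): 1, ("doc2", "word|big"): 1})

print((A + B).d)        # union with summed counts
print((A & B).d)        # entries present in both
print(B.row("doc2").d)  # a query result is just another Assoc
```

Because a query returns the same type it consumes, expressions like `(A + B).row("doc1")` chain without any glue code, which is where the claimed effort savings over Java/SQL come from.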

Universal Exploded Schema
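The exploded schema turns every field/value pair of a record into its own column of the form field|value with entry 1, so any value becomes a directly indexable column key. A sketch of the transformation, with hypothetical record data:

```python
# Explode a dense record into sparse (row, "field|value", 1) triples.
def explode(row_key, record):
    return [(row_key, f"{field}|{value}", 1) for field, value in record.items()]

triples = explode("news-0001", {"person": "reuters", "place": "sydney"})
print(triples)
# [('news-0001', 'person|reuters', 1), ('news-0001', 'place|sydney', 1)]
```

Looking up every record mentioning a place is then a scan over columns starting with "place|", with no index design needed in advance.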

Numerical Computing Environment
D4M (the Dynamic Distributed Dimensional Data Model) stores giant sparse matrices as associative arrays in the Accumulo triple-store database. Triple stores are high-performance distributed databases for heterogeneous data. A D4M query such as T(:,ggaatctgcc) returns a sparse matrix or graph from the triple store, ready for statistical signal processing or graph analysis in Matlab.
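The column query T(:,ggaatctgcc) above can be sketched against a triple layout: selecting one column of the sparse matrix returns the sub-matrix of every row containing that sequence word. Illustrative Python with made-up sample data, not D4M's Matlab syntax:

```python
# T(:, col): select one column of a sparse matrix stored as triples.
# The result is itself a list of triples, i.e. another sparse matrix.
T = [
    ("sample1", "seq|ggaatctgcc", 1),
    ("sample2", "seq|ggaatctgcc", 1),
    ("sample2", "seq|ttacgtaa", 1),
]

def column(triples, col):
    return [(r, c, v) for r, c, v in triples if c == col]

hits = column(T, "seq|ggaatctgcc")
print([r for r, _, _ in hits])  # samples containing the sequence word
```

In Accumulo the same selection runs as a range scan over sorted keys rather than a full pass, but the algebraic view is identical: a query is just sub-matrix extraction.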

Big Data for High-Speed Sequence Matching
A real-world example.