Antoine Chambille Head of Research & Development, Quartet FS

Slides:



Advertisements
Similar presentations
1 Copyright © 2012 Oracle and/or its affiliates. All rights reserved. Convergence of HPC, Databases, and Analytics Tirthankar Lahiri Senior Director, Oracle.
Advertisements

Exploiting Distributed Version Concurrency in a Transactional Memory Cluster Kaloian Manassiev, Madalin Mihailescu and Cristiana Amza University of Toronto,
Copyright © 2012, Oracle and/or its affiliates. All rights reserved. 1.
On the Locality of Java 8 Streams in Real- Time Big Data Applications Yu Chan Ian Gray Andy Wellings Neil Audsley Real-Time Systems Group, Computer Science.
High Performance Analytical Appliance MPP Database Server Platform for high performance Prebuilt appliance with HW & SW included and optimally configured.
Dandy Weyn Sr. Technical Product Mkt.
Oracle Confidential – Highly Restricted1. Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | Safe Harbor Statement The following is.
Capacity Planning and Predicting Growth for Vista Amy Edwards, Ezra Freeloe and George Hernandez University System of Georgia 2007.
1 HYRISE – A Main Memory Hybrid Storage Engine By: Martin Grund, Jens Krüger, Hasso Plattner, Alexander Zeier, Philippe Cudre-Mauroux, Samuel Madden, VLDB.
NUMA Tuning for Java Server Applications Mustafa M. Tikir.
ParMarkSplit: A Parallel Mark- Split Garbage Collector Based on a Lock-Free Skip-List Nhan Nguyen Philippas Tsigas Håkan Sundell Distributed Computing.
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
Presented by Marie-Gisele Assigue Hon Shea Thursday, March 31 st 2011.
Meanwhile RAM cost continues to drop Moore’s Law on total CPU processing power holds but in parallel processing… CPU clock rate stalled… Because.
6/14/2015 How to measure Multi- Instruction, Multi-Core Processor Performance using Simulation Deepak Shankar Darryl Koivisto Mirabilis Design Inc.
1 6/29/2015 XLDB ‘09 Luke Lonergan
High Performance Communication using MPJ Express 1 Presented by Jawad Manzoor National University of Sciences and Technology, Pakistan 29 June 2015.
Computer Science Storage Systems and Sensor Storage Research Overview.
Mixing Low Latency with Analytical Workloads for Customer Experience Management Neil Ferguson, Development Lead, NICE Systems.
Copyright © 2010, Oracle and/or its affiliates. All rights reserved. Who’s Afraid of a Big Bad Lock Nir Shavit Sun Labs at Oracle Joint work with Danny.
Optimizing RAM-latency Dominated Applications
Copyright © 2013, Oracle and/or its affiliates. All rights reserved. 1 Preview of Oracle Database 12 c In-Memory Option Thomas Kyte
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | OFSAAAI: Modeling Platform Enterprise R Modeling Platform Gagan Deep Singh Director.
CLR: Garbage Collection Inside Out
31 January 2007Craig E. Ward1 Large-Scale Simulation Experimentation and Analysis Database Programming Using Java.
Task Scheduling for Highly Concurrent Analytical and Transactional Main-Memory Workloads Iraklis Psaroudakis (EPFL), Tobias Scheuer (SAP AG), Norman May.
Performance Tuning on Multicore Systems for Feature Matching within Image Collections Xiaoxin Tang*, Steven Mills, David Eyers, Zhiyi Huang, Kai-Cheung.
CERN - IT Department CH-1211 Genève 23 Switzerland t Tier0 database extensions and multi-core/64 bit studies Maria Girone, CERN IT-PSS LCG.
Data Warehousing 1 Lecture-24 Need for Speed: Parallelism Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
MonetDB/X100 hyper-pipelining query execution Peter Boncz, Marcin Zukowski, Niels Nes.
The Red Storm High Performance Computer March 19, 2008 Sue Kelly Sandia National Laboratories Abstract: Sandia National.
1 Wenguang WangRichard B. Bunt Department of Computer Science University of Saskatchewan November 14, 2000 Simulating DB2 Buffer Pool Management.
Log-structured Memory for DRAM-based Storage Stephen Rumble, John Ousterhout Center for Future Architectures Research Storage3.2: Architectures.
INNOV-10 Progress® Event Engine™ Technical Overview Prashant Thumma Principal Software Engineer.
Making Watson Fast Daniel Brown HON111. Need for Watson to be fast to play Jeopardy successfully – All computations have to be done in a few seconds –
Moore’s Law means more transistors and therefore cores, but… CPU clock rate stalled… Meanwhile RAM cost continues to drop.
Runtime System CS 153: Compilers. Runtime System Runtime system: all the stuff that the language implicitly assumes and that is not described in the program.
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
Hadoop implementation of MapReduce computational model Ján Vaňo.
2/14/01RightOrder : Telegraph & Java1 Telegraph Java Experiences Sam Madden UC Berkeley
Copyright © 2006, GemStone Systems Inc. All Rights Reserved. Increasing computation throughput with Grid Data Caching Jags Ramnarayan Chief Architect GemStone.
Big Data Engineering: Recent Performance Enhancements in JVM- based Frameworks Mayuresh Kunjir.
Sofia Event Center November 2013 Margarita Naumova SQL Master Academy.
Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.
ICOM 5016 – Introduction to Database Systems Lecture 13- File Structures Dr. Bienvenido Vélez Electrical and Computer Engineering Department Slides by.
Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 7 – Buffer Management.
CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.
TerarkDB Introduction Peng Lei Terark Inc ( ). All rights reserved 1.
Peter Idoine Managing Director Oracle New Zealand Limited.
Oracle Announced New In- Memory Database G1 Emre Eftelioglu, Fen Liu [09/27/13] 1 [1]
1© Copyright 2015 EMC Corporation. All rights reserved. NUMA(YEY) BY JACOB KUGLER.
SPIDAL Java High Performance Data Analytics with Java on Large Multicore HPC Clusters
Spark on Entropy : A Reliable & Efficient Scheduler for Low-latency Parallel Jobs in Heterogeneous Cloud Huankai Chen PhD Student at University of Kent.
© 2010 IBM Corporation Concurrency Demystified Ajith Ramanath | Senior Developer | JTC-ISL.
Oracle Exalytics Business Intelligence Machine Eshaanan Gounden – Core Technology Team.
Log-Structured Memory for DRAM-Based Storage Stephen Rumble and John Ousterhout Stanford University.
GeantV – Adapting simulation to modern hardware Classical simulation Flexible, but limited adaptability towards the full potential of current & future.
J. Xu et al, FAST ‘16 Presenter: Rohit Sarathy April 18, 2017
Java 9: The Quest for Very Large Heaps
CSE-291 Cloud Computing, Fall 2016 Kesden
Introduction to NewSQL
Software Architecture in Practice
Data Structures and Analysis (COMP 410)
NumaGiC: A garbage collector for big-data on big NUMA machines
20 Questions with Azure SQL Data Warehouse
Woojoong Kim Dept. of CSE, POSTECH
KISS-Tree: Smart Latch-Free In-Memory Indexing on Modern Architectures
Performance And Scalability In Oracle9i And SQL Server 2000
SQL Server 2016 High Performance Database Offer.
Presentation transcript:

Antoine Chambille Head of Research & Development, Quartet FS @activepivot Amir Javanshir Principal Software Engineer, Oracle @Amir_Javanshir

Credit Risk Use Case = 8 TB of raw data x 5000 x 1M trades Monte-Carlo simulation of future market data x 5000 Aggregate the valuation distribution vectors Valuation of financial instruments for each simulation x 1M trades x 200 time points RISK ENGINES Perform probabilistic calculation on aggregated vectors ACTIVEPIVOT = 8 TB of raw data

ActivePivot In-Memory Analytics In-Memory Database… Column Store High Cardinality Bitmap Index Multidimensional Aggregation Designed for Multicore … in Pure Java Strong customization capabilities Fast Safe Accessible to many

Oracle platform for in-memory computing M6 Big Memory Machine: Terabyte Scale Computing leveraging Open Standards 384 3072 32 High Availability Build in CORES COMPUTE THREADS TERABYTE SYSTEM MEMORY LOW Latency < 170ns 3 1.4 1 TERABYTES PER SECOND SYSTEM BANDWIDTH TERABYTES PER SECOND MEMORY BANDWIDTH TERABYTE PER SECOND I/O BANDWIDTH THE platform for in-memory computing 32 TERABYTE SYSTEM MEMORY

Off-Heap Memory (java.nio.Buffer) // Allocate an array of doubles (in the heap) double[] array = new double[1024*1024]; // ... double value54 = array[54]; array[500] = array[100] + array[200]; // Allocate an off-heap buffer of doubles DoubleBuffer buf = ByteBuffer.allocateDirect(8*1024*1024).asDoubleBuffer(); // ... double value54 = buf.get(54); buf.put(500, buf.get(100) + buf.get(200));

Off-Heap Memory (sun.misc.Unsafe) // You know what this is sun.misc.Unsafe UNSAFE = getUnsafe(); // Allocate 8MB of off-heap memory long pointer = UNSAFE.allocateMemory(8*1024*1024); UNSAFE.setMemory(pointer, 8*1024*1024, (byte) 0); // ... double value54 = UNSAFE.getDouble(pointer + 8*54); UNSAFE.putDouble( pointer + 8*500, UNSAFE.getDouble(pointer + 8*100) + UNSAFE.getDouble(pointer + 8*200) );

Off-Heap Memory Performance: Off-Heap Memory in the core data structures Columns of numbers Vectors of simulations Hash tables Indexes Serialized objects Usability: Standard Java in the query engine « Young » generation 1 TO 4 RATIO 12 TB dataset Off-Heap 3 TB Heap for queries and calculations 1 to 4 ratio 3 TB Heap for queries and calculations 12 TB dataset Off-Heap « Old » generation

Unleashing Multi-Core Parallelism Fork/Join Pool (©Doug Lea) Reentrant thread pool Work stealing Lock-Free Data Structures Dictionaries Indexes Queues Lockless Transactions Multiversion Concurrency Control … are not enough Scales on 4, 8, 16, 32 cores Not much further New approach required for Many-Cores architectures

Unleashing Many-Cores Parallelism Before After DATA PARTITION DATA CPU

Unleashing Many-Cores Parallelism 1 thread x 3072

Unleashing Many-Cores Parallelism Query Time (seconds) 10 20 30 24 s 17 s 13 s & 16x 6 TB & 32x 12 TB & 8x 3 TB

NUMA Architecture

NUMA Architecture 32 NUMA nodes

NUMA Architecture Allocated memory A Process A Process B Allocated memory B

NUMA System Library Linux protected interface NumaLibrary extends com.sun.jna.Library { /** @return -1 if NUMA is not available, … */ int numa_available(); /** @return the number of memory nodes in the system. */ int numa_num_configured_nodes(); /** @return the NUMA node the cpu currently belongs to. */ int numa_node_of_cpu(int cpu); int numa_run_on_node(int node); int numa_distance(int node1, int node2); //... }

NUMA System Library Linux Solaris NumaLibrary numaLib = (NumaLibrary) Native.loadLibrary("numa", NumaLibrary.class); PthreadLibrary pthreadLib = (PthreadLibrary) Native.loadLibrary("pthread", PthreadLibrary.class); int cpuId = pthreadLib.sched_getcpu(); int nodeCount = numaLib.numa_num_configured_nodes(); int currentNode = numaLib.numa_node_of_cpu(cpuId); LgrpLibrary lgrpLib = (LgrpLibrary) Native.loadLibrary("lgrp", LgrpLibrary.class); long cookie = lgrpLib.lgrp_init(lgrp_view.OS); int rootGroup = lgrpLib.lgrp_root(cookie); int nbGroups = lgrpLib.lgrp_nlgrps(cookie); // ... Solaris

NUMA System Library Linux NumaLibrary numaLib = (NumaLibrary) Native.loadLibrary("numa", NumaLibrary.class); PthreadLibrary pthreadLib = (PthreadLibrary) Native.loadLibrary("pthread", PthreadLibrary.class); int cpuId = pthreadLib.sched_getcpu(); int nodeCount = numaLib.numa_node_of_cpu(cpuId); int currentNode = numaLib.numa_node_of_cpu(pthreadLib.sched_getcpu()); // ... // Bind the current thread to NUMA node '0' numaLib.numa_run_on_node(0);

Unleashing NUMA Parallelism DATA PARTITIONS

Unleashing NUMA Parallelism Query Throughput 200 400 600 549 GB/s 399 GB/s 269 GB/s & 8x 3 TB & 16x 6 TB & 32x 12 TB

Garbage Collection (G1 and the 3TB Heap) -XX:+UseG1GC -XX:InitialHeapSize=3298534883328 -XX:MaxHeapSize=3298534883328 -XX:MaxDirectMemorySize=13194139533312 -XX:-UseAdaptiveSizePolicy -XX:+UnlockExperimentalVMOptions -XX:SurvivorRatio=1 -XX:ParallelGCThreads=960 -XX:ConcGCThreads=240 -XX:G1HeapWastePercent=5 -XX:G1MaxNewSizePercent=85 -XX:G1MixedGCLiveThresholdPercent=80 -XX:InitiatingHeapOccupancyPercent=10 -XX:G1NewSizePercent=70 -XX:G1ReservePercent=10 -XX:G1HeapRegionSize=33554432 -XX:LargePageSizeInBytes=268435456 -XX:MaxGCPauseMillis=1000 -XX:ReservedCodeCacheSize=251658240 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+SegmentedCodeCache -XX:G1LogLevel=finest

Garbage Collection (G1 and the 3TB Heap) -XX:+UseG1GC -XX:InitialHeapSize=3298534883328 -XX:MaxHeapSize=3298534883328 -XX:MaxDirectMemorySize=13194139533312 -XX:-UseAdaptiveSizePolicy -XX:+UnlockExperimentalVMOptions -XX:SurvivorRatio=1 -XX:ParallelGCThreads=960 -XX:ConcGCThreads=240 -XX:G1HeapWastePercent=5 -XX:G1MaxNewSizePercent=85 -XX:G1MixedGCLiveThresholdPercent=80 -XX:InitiatingHeapOccupancyPercent=10 -XX:G1NewSizePercent=70 -XX:G1ReservePercent=10 -XX:G1HeapRegionSize=33554432 -XX:LargePageSizeInBytes=268435456 -XX:MaxGCPauseMillis=1000 -XX:ReservedCodeCacheSize=251658240 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+SegmentedCodeCache -XX:G1LogLevel=finest

Garbage Collection (G1 and the 3TB Heap) -XX:+UseG1GC -XX:InitialHeapSize=3298534883328 -XX:MaxHeapSize=3298534883328 -XX:MaxDirectMemorySize=13194139533312 -XX:-UseAdaptiveSizePolicy -XX:+UnlockExperimentalVMOptions -XX:SurvivorRatio=1 -XX:ParallelGCThreads=960 -XX:ConcGCThreads=240 -XX:G1HeapWastePercent=5 -XX:G1MaxNewSizePercent=85 -XX:G1MixedGCLiveThresholdPercent=80 -XX:InitiatingHeapOccupancyPercent=10 -XX:G1NewSizePercent=70 -XX:G1ReservePercent=10 -XX:G1HeapRegionSize=33554432 -XX:LargePageSizeInBytes=268435456 -XX:MaxGCPauseMillis=1000 -XX:ReservedCodeCacheSize=251658240 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+SegmentedCodeCache -XX:G1LogLevel=finest

Garbage Collection (G1 and the 3TB Heap) -XX:+UseG1GC -XX:InitialHeapSize=3298534883328 -XX:MaxHeapSize=3298534883328 -XX:MaxDirectMemorySize=13194139533312 -XX:-UseAdaptiveSizePolicy -XX:+UnlockExperimentalVMOptions -XX:SurvivorRatio=1 -XX:ParallelGCThreads=960 -XX:ConcGCThreads=240 -XX:G1HeapWastePercent=5 -XX:G1MaxNewSizePercent=85 -XX:G1MixedGCLiveThresholdPercent=80 -XX:InitiatingHeapOccupancyPercent=10 -XX:G1NewSizePercent=70 -XX:G1ReservePercent=10 -XX:G1HeapRegionSize=33554432 -XX:LargePageSizeInBytes=268435456 -XX:MaxGCPauseMillis=1000 -XX:ReservedCodeCacheSize=251658240 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+SegmentedCodeCache -XX:G1LogLevel=finest

Garbage Collection (G1 and the 3TB Heap) -XX:+UseG1GC -XX:InitialHeapSize=3298534883328 -XX:MaxHeapSize=3298534883328 -XX:MaxDirectMemorySize=13194139533312 -XX:-UseAdaptiveSizePolicy -XX:+UnlockExperimentalVMOptions -XX:SurvivorRatio=1 -XX:ParallelGCThreads=960 -XX:ConcGCThreads=240 -XX:G1HeapWastePercent=5 -XX:G1MaxNewSizePercent=85 -XX:G1MixedGCLiveThresholdPercent=80 -XX:InitiatingHeapOccupancyPercent=10 -XX:G1NewSizePercent=70 -XX:G1ReservePercent=10 -XX:G1HeapRegionSize=33554432 -XX:LargePageSizeInBytes=268435456 -XX:MaxGCPauseMillis=1000 -XX:ReservedCodeCacheSize=251658240 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+SegmentedCodeCache -XX:G1LogLevel=finest

Garbage Collection Stress Test Garbage Collection Time Distribution 99%

What’s Next? Oracle and Quartet FS partnership Objectives Oracle Engineering will use ActivePivot as the reference use case for very large heap support Quartet FS will implement recommendations by Oracle Engineering Objectives Scale GC on many cores and NUMA Reduce GC pause times Accelerate Full GC Support Mixed-Workload Already working (Huge) read-only queries on (a lot of) static data The road to Operational Analytics Large data updates Real-time data updates Continuous queries on live data « Mixed Workload » operations

Do you want to be part of it? Please make the connection Antoine Chambille antoine.chambille@quartetfs.com Amir Javanshir amir.javanshir@oracle.com www.quartetfs.com