RAW A database for high-performance querying of raw data Miguel Branco.

Slides:



Advertisements
Similar presentations
Tag line, tag line Perforce Benchmark with PAM over NFS, FCP & iSCSI Bikash R. Choudhury.
Advertisements

Query Processing and Optimizing on SSDs Flash Group Qingling Cao
A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses George Candea (EPFL & Aster Data) Neoklis Polyzotis (UC Santa Cruz) Radek Vingralek.
1. Aim High with Oracle Real World Performance Andrew Holdsworth Director Real World Performance Group Server Technologies.
28. Januar, Zürich-Oerlikon. Switch/Update to Team Foundation Server 2012 André Hofmann Software Engineer bbv Software Services AG.
6.814/6.830 Lecture 8 Memory Management. Column Representation Reduces Scan Time Idea: Store each column in a separate file GM AAPL.
Database Systems on Virtual Machines: How Much Do We Lose? Kristin Travis March 2, 2011.
Shimin Chen Big Data Reading Group.  Energy efficiency of: ◦ Single-machine instance of DBMS ◦ Standard server-grade hardware components ◦ A wide spectrum.
Lightning Queries Miguel Branco. Obs. 1: Eating our own (dog) food Data Database Obs. 2: Data Deluge How many of you use databases to store your own data?
Accurate and Efficient Replaying of File System Traces Nikolai Joukov, TimothyWong, and Erez Zadok Stony Brook University (FAST 2005) USENIX Conference.
“GRID” Bricks. Components (NARA I brick) AIC RMC4E2-QI-XPSS 4U w/SATA Raid Controllers: 3ware- mirrored root disks (2) Areca- data disks, battery backed.
A Little About Us Who are we? Founded in 1989 by data mining experts from MIT, today MicroStrategy is the top independent analytic software vendor in the.
CERN IT Department CH-1211 Geneva 23 Switzerland t XLDB 2010 (Extremely Large Databases) conference summary Dawid Wójcik.
Buying a Laptop. 3 Main Components The 3 main components to consider when buying a laptop or computer are Processor – The Bigger the Ghz the faster the.
Digital Graphics and Computers. Hardware and Software Working with graphic images requires suitable hardware and software to produce the best results.
Acceleratio Ltd. is a software development company based in Zagreb, Croatia, founded in We create innovative software solutions for SharePoint,
TPB Models Development Status Report Presentation to the Travel Forecasting Subcommittee Ron Milone National Capital Region Transportation Planning Board.
Introduction to HP LoadRunner Getting Familiar with LoadRunner >>>>>>>>>>>>>>>>>>>>>>
Technology Expectations in an Aeros Environment October 15, 2014.
Working With Large Datasets in Corporate Settings Ed Bassin
A Multithreading C# Data Synchronization Program and Its Realization Course: ECE 1747H Parallel Programming Professor: Christiana Amza Student / Presenter:
Keyword Search on External Memory Data Graphs Bhavana Bharat Dalvi, Meghana Kshirsagar, S. Sudarshan PVLDB 2008 Reported by: Yiqi Lu.
Task Scheduling for Highly Concurrent Analytical and Transactional Main-Memory Workloads Iraklis Psaroudakis (EPFL), Tobias Scheuer (SAP AG), Norman May.
CS 111 – Aug – 1.3 –Information arranged in memory –Types of memory –Disk properties Commitment for next day: –Read pp , In other.
MSc. Miriel Martín Mesa, DIC, UCLV. The idea Installing a High Performance Cluster in the UCLV, using professional servers with open source operating.
University of Michigan Electrical Engineering and Computer Science 1 Dynamic Acceleration of Multithreaded Program Critical Paths in Near-Threshold Systems.
Goodbye rows and tables, hello documents and collections.
REQUIREMENTS The Desktop Team Raphael Perez MVP: Enterprise Client Management, MCT RFL Systems Ltd
CERN - IT Department CH-1211 Genève 23 Switzerland t Tier0 database extensions and multi-core/64 bit studies Maria Girone, CERN IT-PSS LCG.
Improving Efficiency of I/O Bound Systems More Memory, Better Caching Newer and Faster Disk Drives Set Object Access (SETOBJACC) Reorganize (RGZPFM) w/
Next-Generation Databases Miguel Branco on behalf of the RAW team.
£899 – Ultimatum Computers indiegogo.com/ultimatumcomputers The Ultimatum.
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. LogKV: Exploiting Key-Value.
Sandor Acs 05/07/
ICPP 2012 Indexing and Parallel Query Processing Support for Visualizing Climate Datasets Yu Su*, Gagan Agrawal*, Jonathan Woodring † *The Ohio State University.
Hardware. Make sure you have paper and pen to hand as you will need to take notes and write down answers and thoughts that you can refer to later on.
4 Dec 2006 Testing the machine (X7DBE-X) with 6 D-RORCs 1 Evaluation of the LDC Computing Platform for Point 2 SuperMicro X7DBE-X Andrey Shevel CERN PH-AID.
Module 1: Concepts of information Technology.  Central processing unit (CPU)  Hard disk  Common input and output devices  Types of memory Main Parts.
Indexing HDFS Data in PDW: Splitting the data from the index VLDB2014 WSIC、Microsoft Calvin
Using Bitmap Index to Speed up Analyses of High-Energy Physics Data John Wu, Arie Shoshani, Alex Sim, Junmin Gu, Art Poskanzer Lawrence Berkeley National.
NoDB: Querying Raw Data --Mrutyunjay. Overview ▪ Introduction ▪ Motivation ▪ NoDB Philosophy: PostgreSQL ▪ Results ▪ Opportunities “NoDB in Action: Adaptive.
YONSEI UNIVERSITY Korea Emme Users’ Conference 21 April 2010 Prof. Jin-Hyuk Chung.
Hyper Threading Technology. Introduction Hyper-threading is a technology developed by Intel Corporation for it’s Xeon processors with a 533 MHz system.
Scaling up analytical queries with column-stores Ioannis Alagiannis Manos Athanassoulis Anastasia Ailamaki École Polytechnique Fédérale de Lausanne.
LHC Physics Analysis and Databases or: “How to discover the Higgs Boson inside a database” Maaike Limper.
PC clusters in KEK A.Manabe KEK(Japan). 22 May '01LSCC WS '012 PC clusters in KEK s Belle (in KEKB) PC clusters s Neutron Shielding Simulation cluster.
Integration of the ATLAS Tag Database with Data Management and Analysis Components Caitriana Nicholson University of Glasgow 3 rd September 2007 CHEP,
1 L.Didenko Joint ALICE/STAR meeting HPSS and Production Management 9 April, 2000.
1 Worker Node Requirements TCO – biggest bang for the buck –Efficiency per $ important (ie cost per unit of work) –Processor speed (faster is not necessarily.
Query Optimization CMPE 226 Database Systems By, Arjun Gangisetty
High-Performance Querying on RAW data Anastasia Ailamaki EPFL.
PROOF tests at BNL Sergey Panitkin, Robert Petkus, Ofer Rind BNL May 28, 2008 Ann Arbor, MI.
Andrea Valassi (CERN IT-DB)CHEP 2004 Poster Session (Thursday, 30 September 2004) 1 HARP DATA AND SOFTWARE MIGRATION FROM TO ORACLE Authors: A.Valassi,
LSST Cluster Chris Cribbs (NCSA). LSST Cluster Power edge 1855 / 1955 Power Edge 1855 (*LSST1 – LSST 4) –Duel Core Xeon 3.6GHz (*LSST1 2XDuel Core Xeon)
Materials for Report about Computing Jiří Chudoba x.y.2006 Institute of Physics, Prague.
FroNtier Stress Tests at Tier-0 Status report Luis Ramos LCG3D Workshop – September 13, 2006.
Jérôme Jaussaud, Senior Product Manager
Lab Activities 1, 2. Some of the Lab Server Specifications CPU: 2 Quad(4) Core Intel Xeon 5400 processors CPU Speed: 2.5 GHz Cache : Each 2 cores share.
Unit 2 Assignment 2 M2 RECOMMENDED PC. ZOOSTORM Processor – Intel Core i Operating system – Windows 8 Memory – 8GB RAM Hard Drive – 1TB.
LHCb Computing 2015 Q3 Report Stefan Roiser LHCC Referees Meeting 1 December 2015.
Understanding and Improving Server Performance
Execution Planning for Success
Solid State Disks Testing with PROOF
40% More Performance per Server 40% Lower HW costs and maintenance
Database backed DNS.
Running virtualized Hadoop, does it make sense?
Using Oracle Database In-Memory feature to Speed-up CERN Applications
Oracle Storage Performance Studies
Reporting Solutions for Scheduler
Run time performance for all benchmarked software.
Presentation transcript:

RAW A database for high-performance querying of raw data Miguel Branco

RAW is a database that answers queries directly from the user’s files. What is RAW?

Databases make querying easy. SELECT event FROM root:/data1/mbranco/ATLAS/*.root WHERE ( event.EF_e24vhi_medium1 OR event.EF_e60_medium1 OR …) AND event.muon.mu_ptcone20 < 0.1 * event.muon.mu_pt AND event.muon.mu_pt > AND ABS(event.muon.mu_eta) < 2.4 AND …. Databases make querying fast. Column-stores & vectorized execution use h/w efficiently. Why use a database?

Plenty of files around! Flexible & easy to use. Own data format means no vendor lock-in. (And with many files you get to rewrite “ls” and “cp” ) RAW combines best of both worlds: High-performance querying capabilities…. …. while keeping your own data formats, files, and scripts. Why query files?

How does RAW work? CSV ROOT SELECT event.jet… FROM csv, root WHERE csv.RunNumber == root.RunNumber AND root. EF_2mu13 == TRUE AND … join scan root scan csv filter … containing “good” run numbers … containing physics events Code Generate the Access Paths Code Generate the Query Build Position and Data Caches

A Higgs Analysis in RAW ~900 GB of ROOT physics data (127 files) ATLAS Experiment (Thanks to ATLAS Info. Officer - Dario Barberis) 1 CSV file with “Good Runs” Compare handwritten C++ ROOT query (Thanks to Maaike Limper from CERN/IT) … with RAW ROOTRAW Cold Caches1499 s1431 s Warm Caches52 s575 ms 8 x 10-Core Intel Xeon CPU 2.13GHz 192GB memory ​2 x 1TB 2.5'' SASII 24x7 Disk Drive 7’200rpm

RAW team, EPFL: Manos Karpathiotakis Miguel Branco Anastasia Ailamaki