A Relational Algebra Processor 6.375 Final Project Ming Liu, Shuotao Xu.

Presentation transcript:

A Relational Algebra Processor Final Project Ming Liu, Shuotao Xu

Motivation
- Today’s Database Management Systems (DBMS): software running on a standard operating system on a general-purpose CPU
- DBMS are frequently used in analytics and scientific computing, but are bottlenecked by:
  - Processor speed, software overhead, latency and bandwidth
- Proposal: FPGA-based Relational Algebra Processor
[Diagram labels: Host PC (DBMS), FPGA Relational Algebra Processor, Physical Storage]

Background | Relational Algebra (RA)
- Many database queries are fundamentally decomposable into five basic RA operators
  - Although SQL is capable of much more
- Operators and their functions:
  - Selection: Filter rows based on a Boolean condition
  - Projection: Eliminate selected attributes (columns) of a table; remove duplicated results
  - Cartesian Product: Combine several tables with unique attributes
  - Union: Combine several tables with the same attributes
  - Difference: Select rows of several tables where the rows do not match
- Design dedicated processors on the FPGA for each operator
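The five operators listed above can be illustrated in software. The following is a minimal Python sketch, not the FPGA design: it assumes tables are lists of dicts (a representation chosen here purely for illustration), and it reads Difference as the standard set difference (rows of the first table that do not appear in the second). All function names are hypothetical.

```python
def _row_key(row):
    # Hashable form of a row; attribute names within a row are unique.
    return tuple(sorted(row.items()))

def _dedup(rows):
    # Drop duplicate rows while preserving first-seen order.
    seen, out = set(), []
    for row in rows:
        key = _row_key(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def selection(table, predicate):
    # Keep rows for which the Boolean predicate holds.
    return [row for row in table if predicate(row)]

def projection(table, columns):
    # Keep only the named columns, then remove duplicated results.
    return _dedup([{c: row[c] for c in columns} for row in table])

def cartesian_product(left, right):
    # Pair every left row with every right row (attribute names must not clash).
    return [{**l, **r} for l in left for r in right]

def union(left, right):
    # Rows from either table; both tables share the same attributes.
    return _dedup(left + right)

def difference(left, right):
    # Rows of `left` that have no matching row in `right`.
    right_keys = {_row_key(r) for r in right}
    return [row for row in left if _row_key(row) not in right_keys]
```

For example, `selection(people, lambda r: r["age"] < 40)` keeps only the rows whose `age` attribute is below 40.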

Project Goal
- Design and implement an in-memory relational algebra processor on the FPGA
- Explore the types of queries that can benefit from FPGA acceleration
- Secondary: Outperform SQLite!
- Some assumptions:
  - 32-bit wide table entries
  - Tables fit in memory
  - Max number of columns is 32
  - Read only

Microarchitecture | Host Software
[Diagram; surviving label: FPGA]

Microarchitecture | Top-Level
[Diagram labels: RAProcessor, Host PC (C++ functions), RA Processor, DRAM, PCIe, Host PC (DBMS), RA Processor, Physical Storage]

Microarchitecture | Row Marshaller
- Exposes a simple interface for operators to access tables in DRAM
- Address translation, burst aggregation, truncation and alignment
- Multiplexes requests
- Table values sent/received as 32-bit bursts
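The address-translation step can be pictured with a small sketch. The slides do not describe the DRAM layout, so this assumes a row-major table of 32-bit (4-byte) entries starting at a base address; `entry_address` and `row_burst_addresses` are hypothetical names, not part of the actual design.

```python
WORD_BYTES = 4  # 32-bit table entries, per the project assumptions

def entry_address(base, num_cols, row, col):
    # Byte address of one table entry, ASSUMING a row-major layout.
    if not 0 <= col < num_cols:
        raise ValueError("column index out of range")
    return base + (row * num_cols + col) * WORD_BYTES

def row_burst_addresses(base, num_cols, row):
    # Addresses of the 32-bit bursts that together make up one row.
    return [entry_address(base, num_cols, row, c) for c in range(num_cols)]
```

Under this assumed layout, an operator only names a table, row, and column; the translation to byte addresses stays hidden behind the marshaller-style interface.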

Microarchitecture | Selection
- Filters rows based on predicates (e.g. age < 40)
- 16 predicate evaluators
  - Internally comparators
- A tree of gates to qualify the predicates
  - Max: 4 ORs of 4 ANDs
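A software analogue of the predicate tree may help. This sketch assumes "4 ORs of 4 ANDs" means up to four AND-groups of up to four predicates each, with the groups ORed together (16 predicates in total); that reading, the tuple format, and all names are assumptions made for illustration.

```python
import operator

COMPARATORS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}
MAX_GROUPS = 4                # assumed: at most 4 AND-groups ORed together
MAX_PREDICATES_PER_GROUP = 4  # assumed: at most 4 predicates per AND-group

def evaluate_predicate(row, predicate):
    # predicate = (column, op, operand); a string operand names another column.
    column, op, operand = predicate
    rhs = row[operand] if isinstance(operand, str) else operand
    return COMPARATORS[op](row[column], rhs)

def row_qualifies(row, groups):
    # OR over groups, AND within each group.
    if len(groups) > MAX_GROUPS:
        raise ValueError("too many AND-groups")
    if any(len(g) > MAX_PREDICATES_PER_GROUP for g in groups):
        raise ValueError("too many predicates in one AND-group")
    return any(all(evaluate_predicate(row, p) for p in g) for g in groups)
```

In hardware the comparators would all evaluate in parallel; the nested `any`/`all` here only mirrors the logical structure of the gate tree.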

Microarchitecture | Projection
- Select columns of a table
- Column mask one-hot encoded
- Do not need to buffer row; operate directly on data bursts
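The column mask can be sketched as one bit per column, applied to a stream of 32-bit bursts without buffering a whole row. This assumes bursts arrive in column order, row after row; the function names are hypothetical.

```python
def make_column_mask(keep_indices):
    # Set bit i of the mask for every column index i that should be kept.
    mask = 0
    for i in keep_indices:
        mask |= 1 << i
    return mask

def project_bursts(bursts, num_cols, mask):
    # Pass through only the bursts whose column bit is set.
    # ASSUMES bursts arrive in column order, one row after another.
    for i, word in enumerate(bursts):
        if (mask >> (i % num_cols)) & 1:
            yield word
```

Because each burst is kept or dropped as it arrives, the only state needed is the position of the current burst within its row.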

Microarchitecture | Binary Operators
- Cartesian Product, Union, Difference and Deduplication
- Nested loop implementation
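A nested-loop formulation of these operators might look like the following sketch. Rows are modeled as tuples, and the loops compare rows pairwise instead of hashing, to mirror the nested-loop structure; this is an illustration under those assumptions, not the hardware implementation.

```python
def nested_loop_product(left, right):
    # Outer loop over the first table, inner loop over the second.
    return [l + r for l in left for r in right]

def nested_loop_difference(left, right):
    # Keep each left row only if the inner scan of `right` finds no match.
    out = []
    for l in left:
        matched = False
        for r in right:
            if l == r:
                matched = True
                break
        if not matched:
            out.append(l)
    return out

def nested_loop_dedup(table):
    # Compare each row against the rows already kept.
    out = []
    for row in table:
        if not any(row == kept for kept in out):
            out.append(row)
    return out

def nested_loop_union(left, right):
    # Concatenate both tables, then remove duplicates.
    return nested_loop_dedup(list(left) + list(right))
```

Each operator touches on the order of |left| x |right| row pairs, which is one reason such data-intensive operators depend heavily on memory bandwidth.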

Microarchitecture | Inter-operator Bypassing
- Operators enabled concurrently; data passed between operators
- No intermediate storage
- Conditions:
  1. A singly linked chain of unary operators
  2. Each operator has a single target output
  3. No structural hazard
- Software reorders and schedules the RA commands
- Data source/destination encoded in command

Microarchitecture | Inter-operator Bypassing
- Multiple 32-bit wide output FIFOs to other operators
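Bypassing between unary operators is loosely analogous to chaining Python generators: each stage consumes rows as the previous stage produces them, so no intermediate table is materialized. The sketch below is only that analogy; the stage functions and `run_chain` are hypothetical names, and rows are modeled as tuples.

```python
def select_stream(rows, predicate):
    # Unary stage: forward only the rows that satisfy the predicate.
    for row in rows:
        if predicate(row):
            yield row

def project_stream(rows, indices):
    # Unary stage: forward only the chosen column positions of each row.
    for row in rows:
        yield tuple(row[i] for i in indices)

def run_chain(source_rows, stages):
    # Link stages into a single chain; each stage has exactly one output.
    stream = iter(source_rows)
    for stage in stages:
        stream = stage(stream)
    return list(stream)  # only the final result is stored
```

For instance, passing `[lambda s: select_stream(s, lambda r: r[1] > 10), lambda s: project_stream(s, [0, 2])]` as `stages` filters and then projects in a single pass over the source rows.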

Implementation Evaluation
- Timing
  - Maximum Frequency: MHz
  - Critical Path: Row Marshaller mux
- Area
  - Slice Registers: 50%
  - LUTs: 85%
  - BRAM/FIFOs: 47%

Modules | Slice Registers | LUTs | BRAM/FIFOs
TOTAL | (50%) | 59328 (85%) | 71 (47%)
Module rows: Row Marshaller, Controller, Selection, Projection, Cartesian Product, Union, Difference, Deduplication

Performance Benchmark | Setup
- SQLite
  - Internal SQLite timer reports the execution time of the query
  - ThinkPad T430, Core 2.90GHz, 1x8GB DDR
- RA Processor
  - Performance counters: cycles from start to ack of an operator

Benchmark queries (table size, relational algebra query, SQL query):

1. 1 table, 100k x 30
   RA:  SELECT,starLong,tableOut,mass,>,80000,AND,pos_x,>,10,OR,pos_x,<,pos_z,OR,col12,>,col14,AND,col20,<,col21
   SQL: SELECT * FROM starLong WHERE mass > 80000 AND pos_x > 10 OR pos_x < pos_z OR col12 > col14 AND col20 < col21;

2. 1 table, 100k x 30
   RA:  PROJECT,starLong,tableOut,pos_x,col19,col25,col29
   SQL: SELECT pos_x, col19, col25, col29 FROM starLong;

3. 2 tables, 1k x 30
   RA:  UNION,starMed1,starMed2,starUnion
   SQL: SELECT * FROM starMed1 UNION SELECT * FROM starMed2;

4. 2 tables, 1k x 30
   RA:  XPROD,starMed1,starMed2,starXprod
        RENAME,starXprod,0,iOrder0,1,mass0,8,phi0
        SELECT,starXprod,starFiltered,iOrder0,=,iOrder,AND,phi0,>,1,AND,mass0,>,mass
        PROJECT,starFiltered,starOut,mass0
   SQL: SELECT s1.mass FROM starMed1 s1, starMed2 s2 WHERE s1.vx > s2.vx AND s1.phi > 1 AND s1.mass > s2.mass;
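The comma-separated SELECT command used in the benchmark can be parsed into OR-of-AND predicate groups. The grammar below is inferred only from the example commands on this slide (operation, source table, destination table, then `column,op,operand` triples joined by `AND`/`OR`, with `AND` binding tighter than `OR` as in SQL); the actual command format accepted by the processor may differ, and `parse_select` is a hypothetical helper.

```python
def parse_select(command):
    # Split a SELECT command into (source, destination, OR-of-AND groups).
    tokens = [t.strip() for t in command.split(",")]
    if len(tokens) < 6 or tokens[0] != "SELECT":
        raise ValueError("not a SELECT command")
    source, dest, rest = tokens[1], tokens[2], tokens[3:]
    groups, current, i = [], [], 0
    while i < len(rest):
        if rest[i] == "OR":        # OR closes the current AND-group
            groups.append(current)
            current = []
            i += 1
        elif rest[i] == "AND":     # AND stays within the current group
            i += 1
        else:
            column, op, operand = rest[i:i + 3]
            # Numeric operands become ints; anything else names a column.
            value = int(operand) if operand.lstrip("-").isdigit() else operand
            current.append((column, op, value))
            i += 3
    groups.append(current)
    return source, dest, groups
```

Applied to the first benchmark command, this yields three groups: `mass > 80000 AND pos_x > 10`, `pos_x < pos_z`, and `col12 > col14 AND col20 < col21`.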

Performance Benchmark | Results
- Limitation: Memory Bandwidth: 200MB/s vs 12.8GB/s

Performance Benchmark | Results
- Select operator most competitive with SQLite
- What happens with more predicates?

Improvements
- Increasing data burst width
  - 32-bit to 256-bit: potential 8x speedup
  - Area/critical path increase
- Maximizing memory bandwidth
  - Additional row buffers to buffer data from DDR2 memory
  - Larger, faster DRAM; higher clock speed
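A back-of-the-envelope calculation shows why memory bandwidth dominates. This sketch assumes that the time to scan a table is bounded below by table size divided by memory bandwidth, and that MB and GB are decimal units; neither assumption is stated on the slides. It uses the 100k x 30 benchmark table of 32-bit entries and the two bandwidth figures quoted on the results slide.

```python
def table_bytes(rows, cols, entry_bytes=4):
    # Size of a table of fixed-width entries (32-bit entries by default).
    return rows * cols * entry_bytes

def scan_time_seconds(num_bytes, bandwidth_bytes_per_s):
    # Lower bound on scan time if memory bandwidth is the only limit.
    return num_bytes / bandwidth_bytes_per_s

size = table_bytes(100_000, 30)              # 12,000,000 bytes
slow_scan = scan_time_seconds(size, 200e6)   # at 200 MB/s: about 60 ms
fast_scan = scan_time_seconds(size, 12.8e9)  # at 12.8 GB/s: under 1 ms
burst_speedup = 256 // 32                    # 32-bit to 256-bit bursts: 8x
```

The ratio between the two scan times equals the ratio of the two bandwidths (64x), which is far larger than the 8x available from wider bursts alone.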

Conclusion & Future Work
- Complex filtering operations perform well on the FPGA
  - Better than SQLite with sufficient memory bandwidth
- Data-intensive operators do not perform well
- Future opportunities:
  - An accelerator alongside SQLite
  - Integration with HDD/SSD controller