1 Implementing An Associative Processor on FPGAs.

Presentation transcript:

1 Implementing An Associative Processor on FPGAs

2 A Conceptual View of the KSU ASC Model
[Diagram: four cells, each pairing a Memory with a PE, attached to the Cell Interconnection Network and to the Instruction Stream Control Unit.]

3 An Example for the Data Memory Organization: Auto Information Stored in the PE Cells
PE    ID  Color     Model   State  Rebate
PE0   1   Burgundy  Focus   OH     170     …….
PE1   2   Blue      Taurus  OH     160     …….
PE2   3   Blue      Focus   OH     190     …….
PE3   4   Red       Focus   PA     180     …….

4 The Prototype of the Byte-serial ASC Processor
[Block diagram with these labelled parts:]
- IS Control Unit
- CPU (for sequential and parallel instructions)
- 32-bit Instruction Memory
- 16 8-bit Common Registers
- Associative Processing Array: Data Memory, PE Array, Responder Resolution Circuitry, MAX/MIN Circuitry

5 Prototype of the 4-PE Associative Processing Array
[Block diagram: PE Cell 0 to PE Cell 3, where PE Cell i pairs Data Memory i with PE i, together with the Responder Resolution Circuitry, the MAX/MIN Circuitry, and the At_Least_One_Responder signal.]

6 A Processing Element Overview
[Block diagram with these labelled parts:]
- 8-bit ALU (with CarryOut)
- 1-bit ALU
- MUX
- 16 8-bit General Purpose Registers
- 16 1-bit Logical Registers
- 16-deep 1-bit Mask Stack
- 1-bit Responder Register
- Find/Step/ResolveFirst
- Comparator
- Signal labels: Common Registers, General-Purpose Registers, Logical Registers

7 Instruction Set and Assembly Language (1)
Data Transfer Instructions
- LD address, dstreg
- LDI immediate, dstreg
- LDRR srcreg, dstreg
- LDRRSPD srcreg
- ST srcreg, address
Arithmetic and Logical Instructions (mnemonic srcreg1, srcreg2, dstreg)
- ADD SUB
- AND OR XOR NOT
- SLL SRL
- SLT SLE SGT SGE SEQ SNE

8 Instruction Set and Assembly Language (2)
Mask Stack and Responder Instructions
- SETMSK
- TOPMSK TOPMSKRSPD
- POPMSK POPMSKRSPD
- POPTHEM POPTHEMRSPD
- RPCMSK RPCMSKRSPD
- PUSHMSK PUSHMSKRSPD
- PUSHTHEM
- PUSHMSKTHEM
- STKTOMEM MEMTOSTK
- FIND
- STEP
- RESFST

9 Instruction Set and Assembly Language (3)
Maximum and Minimum Searching Instructions
- SETMXMI
- LDMXMI
- STMXMI
- MAX
- MIN
Branch/Jump Instructions
- BNR
- BRS
- J

10 Associative Operations
Related PE Components:
- The Responder Register: indicates whether or not a PE is a responder to a particular associative search
- The Step/Find/ResolveFirst Unit: supports processing multiple responders in various ways
- The Mask Stack: represents at most 16 levels of association. The top of the Mask Stack always represents the current status of the PE: masked (‘1’) or unmasked (‘0’)

11 Example of Associative Search: Find all Focus cars located in Ohio
- Perform the comparison model == “Focus”, and store the result (‘1’ or ‘0’) into $LR1
- Perform the comparison location == “Ohio”, and store the result into $LR2
- AND $LR1 with $LR2, and store the result into the Responder Register
(Note: the instructions above are executed by all PEs in parallel; such instructions are called unmasked instructions)

12 Unmasked and Masked Instructions
- Unmasked Instruction: Executed by all the PEs regardless of the state of the Mask Stack
- Masked Instruction: Executed only by those PEs with a ‘1’ on the top of their Mask Stack

13 Example of Associative Search Using Masked Instructions (Find all Focus cars located in Ohio)
- Initialize the top of the Mask Stack to ‘1’
- Perform the comparison model == “Focus”, and store the result (‘1’ or ‘0’) into $LR1
- Perform the comparison location == “Ohio”, and store the result into $LR2
- AND $LR1 with $LR2, and store the result into the Responder Register
- AND the Responder Register with the top of the Mask Stack, push the result onto the Mask Stack, and also store it into the Responder Register
- Increase the rebate of all Focus cars in Ohio by 10 (masked instruction)
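
The masked search above can be sketched in software. This is only an illustrative Python model, not the processor's assembly code: the `PE` class and its field names are hypothetical, the four records are the ones from the auto table on slide 3, and their assignment to PE0 through PE3 in ID order is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class PE:
    """Hypothetical software model of one processing element and its record."""
    model: str
    state: str
    rebate: int
    lr: dict = field(default_factory=dict)          # 1-bit logical registers
    responder: int = 0                               # 1-bit Responder Register
    mask_stack: list = field(default_factory=list)  # top of stack = last item

# Records from the slide 3 table (PE order is assumed).
pes = [
    PE("Focus", "OH", 170),
    PE("Taurus", "OH", 160),
    PE("Focus", "OH", 190),
    PE("Focus", "PA", 180),
]

# Initialize the top of every Mask Stack to 1.
for pe in pes:
    pe.mask_stack.append(1)

# Unmasked instructions: every PE executes them.
for pe in pes:
    pe.lr["LR1"] = int(pe.model == "Focus")
    pe.lr["LR2"] = int(pe.state == "OH")
    pe.responder = pe.lr["LR1"] & pe.lr["LR2"]

# AND the responder bit with the top of the Mask Stack, push the result,
# and store it back into the Responder Register.
for pe in pes:
    bit = pe.responder & pe.mask_stack[-1]
    pe.mask_stack.append(bit)
    pe.responder = bit

# Masked instruction: only PEs with 1 on top of their Mask Stack execute it.
for pe in pes:
    if pe.mask_stack[-1] == 1:
        pe.rebate += 10

responders = [pe.responder for pe in pes]
rebates = [pe.rebate for pe in pes]
print(responders)  # [1, 0, 1, 0]
print(rebates)     # [180, 160, 200, 180]
```

In hardware the per-PE loops run in parallel; the sequential loops here only imitate that behaviour.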

14 The MAX/MIN Circuitry, the Responder Resolution Circuitry, and PE
[Block diagram. Responder Resolution: inputs R0 to R3 from PE0 to PE3, outputs V0 to V3 to PE0 to PE3, and V4 to the CU. MAX/MIN: per-PE signals D0 to D3, MM0 to MM3, and R0 to R3, fed by the GPR and RPD of each PE, with outputs back to PE0 to PE3. PE detail (labelled PE 0, with PE 3 also labelled): Mask Stack, Responder, Step/Find/RslvFst, General Purpose Registers, clr.]

15 Using the Falkoff Algorithm for MAX/MIN Search
Maximum-Value Searching (the following steps are performed in parallel for all the data)
- Search bit slices of the data from the most significant bit to the least significant bit
- As each bit slice is processed, each bit is ANDed with a corresponding MM bit (a 1-bit register used to indicate whether or not a data item is the maximum after processing a bit)
- Check the results of the AND to ensure that at least one new maximum value remains:

16 Using the Falkoff Algorithm for MAX/MIN Search (continued)
- If this condition is true, then the MM bits are updated with the results of the AND; if all the results are 0, then the MM bits are not updated at this time
- Continue to process the remaining bit slices as above until all bits are processed
- After the least significant bit slice is processed: if only one MM bit is ‘1’, it marks the largest number; if more than one MM bit is ‘1’, those data are tied for the maximum value

17 Using the Falkoff Algorithm for MAX/MIN Search (continued)
Minimum-Value Searching: Similar to maximum-value searching, but complement each bit slice before ANDing it with the MM bits

18 Search For the Maximum Rebate in the Data Memories
[Table; the bit values were lost in extraction. Columns: Bit Slices (7..0) of Rebates; Values in MM bits During Processing (initial value, then after processing each bit from MSB to LSB). Rows: 170 (MM0), 160 (MM1), 190 (MM2, max), 180 (MM3).]
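
The Falkoff search described on slides 15 to 17 can be sketched as a short Python function and applied to the four rebate values on slide 18. The function name `falkoff_search` and its parameters are illustrative, not part of the ASC instruction set; the loop over values stands in for what the circuit does in parallel across all PEs.

```python
def falkoff_search(values, width=8, find_max=True):
    """Return one MM bit per value; 1 marks a value tied for the max (or min)."""
    mm = [1] * len(values)                      # every value starts as a candidate
    for bit in range(width - 1, -1, -1):        # MSB down to LSB
        bit_slice = [(v >> bit) & 1 for v in values]
        if not find_max:
            bit_slice = [b ^ 1 for b in bit_slice]   # complement for minimum search
        anded = [s & m for s, m in zip(bit_slice, mm)]
        if any(anded):                          # at least one candidate remains
            mm = anded                          # otherwise leave MM unchanged
    return mm

rebates = [170, 160, 190, 180]                  # MM0..MM3 on slide 18
max_mm = falkoff_search(rebates)
min_mm = falkoff_search(rebates, find_max=False)
print(max_mm)  # [0, 0, 1, 0] -> 190 is the maximum
print(min_mm)  # [0, 1, 0, 0] -> 160 is the minimum
```

With tied values, more than one MM bit stays at 1, which is the case ResolveFirst is meant to handle.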

19 MAX/MIN Circuit using the Falkoff Algorithm

20 The MAX/MIN Circuitry, the Responder Resolution Circuitry, and PE
[Same block diagram as slide 14.]

21 Functionality of Responder Resolution Circuit
- Responder resolution: Send an At-Least-One-Responder signal to the IS control unit
- Support responder selection: Send a corresponding Responder_Before_Me signal to each PE’s Find_Step_ResolveFirst unit

22 The Responder Resolution Circuitry for 4 PEs
- R0 to R3: from the responder registers
- V0 to V3: called Responder_Before_Me
- V4: called At_Least_One_Responder
[Circuit diagram: a chain through PE0 to PE3 with inputs R0 to R3, outputs V0 to V4, and a constant ‘0’ input.]
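
The slide does not give the gate-level structure, so the following Python sketch rests on an assumption drawn from the signal names: V0 is tied to ‘0’ and each later signal is the OR of the previous V and the previous R, so that V4 is 1 whenever any PE responds. The function name `responder_resolution` is hypothetical.

```python
def responder_resolution(r):
    """Assumed ripple-OR chain over the responder bits R0..R(n-1).

    Returns (before_me, at_least_one), where before_me[i] is 1 if some PE
    with a lower index than i is a responder.
    """
    v = [0]                          # V0 is tied to '0'
    for bit in r:
        v.append(v[-1] | bit)        # V(i+1) = V(i) OR R(i)
    return v[:-1], v[-1]             # V0..V(n-1), and Vn

before_me, at_least_one = responder_resolution([0, 1, 0, 1])
print(before_me)     # [0, 0, 1, 1]
print(at_least_one)  # 1
```

Under this assumption, the first responder is the PE whose own R bit is 1 while its Responder_Before_Me input is 0 (PE1 in the example).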

23 Responder Processing
Process responders in parallel:
- Use masked instructions
Process responders sequentially:
- Need some responder selection instructions
- Need a responder selection mechanism

24 Responder Selection Instructions
- Step: used repeatedly to pick one responding PE each time for further processing (“for” loop), e.g., to step through all the Focus cars in Ohio to list the features available on each car
- Find: selects a responding PE while still keeping all responders identifiable (“while” loop), e.g., retrieve the tax rate from one of the cars located in OH, increment the tax rate by a certain amount, then apply this new tax rate to all the cars located in OH

25 Responder Selection Instructions (continued)
- ResolveFirst: selects a responder and keeps only this responder identifiable, e.g., resolve one PE from several PEs whose values are tied for the maximum value
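
The difference between the three selection instructions can be illustrated with a small Python model that operates on a list of responder bits. The function names are hypothetical, and picking the lowest-numbered responder is an assumption suggested by the Responder_Before_Me signal; the slides do not state how the hardware stores the intermediate state.

```python
def first_responder(responders):
    """Index of the first PE whose responder bit is 1, or None if there is none."""
    for i, bit in enumerate(responders):
        if bit:
            return i
    return None

def step(responders):
    """Pick the first responder and clear its bit, so the next Step picks another."""
    i = first_responder(responders)
    remaining = list(responders)
    if i is not None:
        remaining[i] = 0
    return i, remaining

def find(responders):
    """Pick the first responder but leave every responder bit unchanged."""
    return first_responder(responders), list(responders)

def resolve_first(responders):
    """Pick the first responder and keep only that PE marked as a responder."""
    i = first_responder(responders)
    return i, [int(j == i) for j in range(len(responders))]

responders = [1, 0, 1, 0]

# Step: visit each responder once, like a "for" loop over the responders.
visited, current = [], responders
while any(current):
    i, current = step(current)
    visited.append(i)
print(visited)                    # [0, 2]

print(find(responders))           # (0, [1, 0, 1, 0])
print(resolve_first(responders))  # (0, [1, 0, 0, 0])
```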

26 The Responder Resolution Circuitry, MAX/MIN Circuitry, and PE
[Same block diagram as slide 14, with the PE detail labelled PE 3.]

27 Design Language: VHDL
A standard hardware description language used to model and design digital hardware
- Supports concurrent events
- Can be translated into hardware by some design tools
- Good for managing large design structures
- Supported by many CAD tool and programmable logic vendors

28 Altera MAX+PLUS II Development System
[Diagram of the tool flow:]
- Design Entry: Graphic Editor, Text Editor, Waveform Editor, Symbol Editor, Floorplan Editor, Other Design Entry Tools
- Design Compilation: MAX+PLUS II Compiler
- Design Verification: Simulator, Waveform Editor, Timing Analysis, Other Verification Tools
- Device Programming: Programmer, Data I/O, Other Programmers

29 Altera FLEX 10K FPLD
FLEX10K70 Device:
- 3,744 LEs
- 9 EABs
- 70,000 gates in total
[Figure: Partial FLEX10K20 FPLD Architecture, showing EABs, LABs, IOEs (I/O elements), and the FastTrack Interconnect.]

30 Simulation on FLEX 10K 70 Chip
- The ISCU runs at about 10 MHz using 50% of the logic gates
- One EAB is used as the local memory for one PE; 4 PEs and the support circuitry run at about 14 MHz using 82% of the logic cells
- The simulation results show that the FLEX 10K 70 chip isn’t large enough for the 4-PE processor, so our current work targets Altera APEX 20K devices with 1 million gates in one chip

31 Future Work
- Explore more arithmetic features and associative operations
- Develop the complete ASC assembly language and the ASC back-end compiler
- Implement the PE cell interconnection network
- Implement the whole ASC processor on bigger and faster FPGA chips
- Develop the multiple instruction stream MASC model