Lecture 18: Case Study of SoC Design. ECE 412: Microcomputer Laboratory.

Presentation transcript:


Outline
- Web server example
- MP3 example

Example: Embedded web server application
- Basic web server capable of responding to simple HTTP requests
- Simple CGI requests for dynamic HTML
- A timer peripheral is read before, during, and after servicing an HTTP request to log throughput calculations, which are then displayed on a dynamically generated web page
- A simple read-only file system was implemented in flash memory to store static web pages and JPEG images

Throughput calculations
- Transmission throughput
  – Reflects the latency from the start of sending the first TCP packet containing the HTTP response until the file is completely sent
  – Could theoretically reach a maximum of 10 Mbps
  – Indicates the raw network speed that the CPU and TCP/IP stack are capable of sustaining
- HTTP server throughput
  – Takes into account all delay between the incoming HTTP connection request and file send completion
  – Includes the transmission latency above
  – Also measures the time the HTTP server took to open a TCP connection to the host
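
The two throughput figures defined above can be illustrated with a short calculation. This is only a sketch: the timer clock rate, the snapshot values, and the interpretation of 60KB as 60 × 1024 bytes are assumptions made for illustration (the 156 ms transmission latency is the baseline figure quoted later in the lecture; the 20 ms connection-setup delay is invented).

```python
CLOCK_HZ = 33_000_000  # assumed timer clock (the lecture quotes a 33 MHz clock)

def throughput_mbps(num_bytes, start_ticks, end_ticks, clock_hz=CLOCK_HZ):
    """Throughput in Mbit/s for num_bytes moved between two timer snapshots."""
    elapsed_s = (end_ticks - start_ticks) / clock_hz
    if elapsed_s <= 0:
        raise ValueError("end snapshot must be later than start snapshot")
    return num_bytes * 8 / elapsed_s / 1e6

# Hypothetical timer snapshots (in ticks) for one HTTP request.
t_request = 0               # incoming HTTP connection request
t_first_packet = 660_000    # first TCP packet of the response (20 ms later, invented)
t_done = 5_808_000          # file completely sent (156 ms after the first packet)
file_bytes = 60 * 1024      # assumed size of the largest benchmark file

# Transmission throughput: first response packet until the file is sent.
transmission = throughput_mbps(file_bytes, t_first_packet, t_done)
# HTTP server throughput: connection request until the file is sent.
http_server = throughput_mbps(file_bytes, t_request, t_done)

print(f"transmission: {transmission:.2f} Mbps, HTTP server: {http_server:.2f} Mbps")
```

Because the HTTP server interval contains the transmission interval, the HTTP server throughput can never exceed the transmission throughput for the same file.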

Baseline system
- The web server was tested by serving JPEG images of varying sizes across the LAN to a host PC
- During each transfer, several snapshots of the timer peripheral were taken

Baseline system dataflow
[Figure: block diagram of the baseline system. Nios CPU (instruction master, data master), Avalon bus, UART/IO/timer peripherals, SRAM, flash, Ethernet MAC, with the data flow marked.]
The Nios CPU's data master port is used to read data memory (SRAM) and write to the Ethernet MAC. This occurs for each packet transmitted in the baseline system.

Performance optimization
- Using DMA to transfer data from incoming packets into memory without the intervention of the microprocessor
- Using a custom peripheral to do the checksum calculation
- Combining the two
- Optimizing the slave-arbitration priority for the memories to provide maximum data throughput

Dataflow enhancement with DMA
- DMA transfers packets between the Ethernet MAC and data memory
- The CPU has higher priority in any conflict with the DMA
- During DMA, the CPU is free to access other peripherals
- For access to the shared SRAM, arbitration is performed
[Figure: block diagram of the DMA-enhanced system. Nios CPU (instruction master, data master) and DMA controller (read master, write master) on the Avalon bus, with UART/IO/timer peripherals, SRAM, flash, Ethernet MAC, and an arbitrator on the shared SRAM; data flow marked.]
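
The priority rule above (the CPU wins any conflict with the DMA on the shared SRAM) can be modeled cycle by cycle. This is a toy fixed-priority model written for illustration, not the actual Avalon arbitration logic; the function names and the one-word-per-cycle assumption are hypothetical.

```python
def arbitrate(requests, priority=("cpu", "dma")):
    """Grant the shared slave to the highest-priority requesting master, or None."""
    for master in priority:
        if master in requests:
            return master
    return None

def simulate(cpu_request_cycles, dma_words):
    """Return the per-cycle grant list for a shared SRAM.

    The DMA requests the SRAM every cycle until it has moved dma_words words
    (assumed: one word per granted cycle). The CPU requests the SRAM only in
    the cycles listed in cpu_request_cycles.
    """
    grants = []
    remaining = dma_words
    last_cpu = max(cpu_request_cycles, default=-1)
    cycle = 0
    while remaining > 0 or cycle <= last_cpu:
        requests = set()
        if cycle in cpu_request_cycles:
            requests.add("cpu")
        if remaining > 0:
            requests.add("dma")
        winner = arbitrate(requests)
        if winner == "dma":
            remaining -= 1  # the DMA only makes progress when it is granted
        grants.append(winner)
        cycle += 1
    return grants

# The CPU touches the SRAM in cycles 1 and 2; the DMA has 4 words to move.
grants = simulate({1, 2}, dma_words=4)
print(grants)
```

In this model the DMA transfer is stretched by exactly the number of cycles in which the CPU also wanted the SRAM, which is the cost of giving the CPU priority.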

Performance improvement
- Transmission throughput is doubled compared to the baseline
- Overall HTTP server throughput is about 2.5X that of the baseline
- 36% increase in logic resource usage (3600 logic elements)

TCP checksum
- Checksum calculations can be regarded as a necessary evil in dataflow-sensitive applications
  – For a 1300-byte payload, the software checksum takes 33,000 clock cycles
  – At a 33 MHz clock, that is 1 ms of latency for each maximum-size packet
- In the benchmark, the largest file (60KB) breaks down into 46 maximum-size packets
  – The checksum accounts for 46 ms of the 156 ms transmission latency in the baseline
- The inner loop of the TCP/IP stack checksum performs a 16-bit one's complement checksum calculation
  – Adding up data repeatedly is a simple task for hardware
  – A Verilog implementation can be designed
  – The checksum peripheral operates by reading the payload contents directly out of data memory, performing the checksum calculation, and storing the result in a CPU-addressable register
  – The calculation now takes 386 clock cycles
  – Speedup of 90X over the software version
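
For reference, the 16-bit one's complement sum that the peripheral offloads can be sketched in software as follows. This follows the standard Internet checksum procedure (RFC 1071) and is not the code from the lecture's TCP/IP stack; a full TCP checksum would also cover the pseudo-header and TCP header, which this sketch leaves out.

```python
def ones_complement_checksum(data: bytes) -> int:
    """16-bit one's complement checksum over data (big-endian 16-bit words)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # accumulate 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF  # one's complement of the folded sum

# Example words taken from RFC 1071: 0001 f203 f4f5 f6f7
payload = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
checksum = ones_complement_checksum(payload)
print(hex(checksum))
```

The loop body is a single add per 16-bit word, which is why the operation maps well onto a small hardware adder that streams the payload out of memory.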

Checksum peripheral
- Again, for access to the shared SRAM, arbitration is performed
[Figure: block diagram of the system with the checksum peripheral. Nios CPU (instruction master, data master) and checksum peripheral (read master) on the Avalon bus, with UART/IO/timer peripherals, SRAM, flash, and an arbitrator on the shared SRAM; data flow marked.]

Performance boost
- Transmission latency decreased by 44 ms
- Average transmission throughput increased by 40% and average HTTP throughput increased by 25% over the baseline
- Resource usage increased by 22% over the baseline (3250 logic elements)
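
As a sanity check, the latency figures can be reproduced from the cycle counts quoted on the TCP checksum slide (33,000 software cycles, 386 hardware cycles, 33 MHz clock, 46 packets). The arithmetic below is a back-of-the-envelope sketch; the quoted cycle counts are presumably rounded, so the results only approximately match the measured numbers.

```python
CLOCK_HZ = 33_000_000   # 33 MHz clock from the lecture
SW_CYCLES = 33_000      # software checksum, per maximum-size packet
HW_CYCLES = 386         # checksum peripheral, per maximum-size packet
PACKETS = 46            # largest benchmark file (60KB)

sw_ms_per_packet = SW_CYCLES / CLOCK_HZ * 1e3    # software latency per packet
hw_us_per_packet = HW_CYCLES / CLOCK_HZ * 1e6    # hardware latency per packet
sw_total_ms = PACKETS * sw_ms_per_packet         # checksum time in the baseline
hw_total_ms = PACKETS * HW_CYCLES / CLOCK_HZ * 1e3
saved_ms = sw_total_ms - hw_total_ms             # estimated latency reduction
cycle_ratio = SW_CYCLES / HW_CYCLES              # speedup implied by cycle counts

print(f"software: {sw_ms_per_packet:.2f} ms/packet, {sw_total_ms:.1f} ms total")
print(f"hardware: {hw_us_per_packet:.1f} us/packet, {hw_total_ms:.2f} ms total")
print(f"estimated saving: {saved_ms:.1f} ms, cycle ratio: {cycle_ratio:.1f}x")
```

The estimate of roughly 45 ms saved is in the same range as the measured 44 ms reduction, and the cycle ratio of about 85 is in line with the quoted 90X speedup.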

Putting it all together

Embedded uP systems in Xilinx FPGA
- Traditional embedded microprocessor system as implemented on a platform FPGA
- Co-processor architecture with multiple hardware accelerators
1. Start by developing for the first architecture
2. Automatically generate the second architecture under the control of the user

Profiling results
- DCT32 and IMDCT36 perform the discrete cosine transform and inverse discrete cosine transform, respectively
- The other functions are multiply-accumulate functions of various sizes
- These functions comprise over 90% of the total application execution time on the host
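
The 90% figure bounds how much hardware acceleration of these functions can help. A quick way to see this is Amdahl's law, which is not mentioned in the lecture and is brought in here only as an illustration; the 10X accelerator speedup below is an invented value.

```python
def amdahl_speedup(accelerated_fraction, accelerator_speedup):
    """Overall speedup when a fraction of the run time is sped up by a given factor."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if accelerator_speedup <= 0:
        raise ValueError("accelerator_speedup must be positive")
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / accelerator_speedup)

# 90% of execution time is in the profiled functions (from the slide).
with_10x = amdahl_speedup(0.90, 10.0)             # hypothetical 10X accelerators
upper_bound = amdahl_speedup(0.90, float("inf"))  # accelerators take zero time

print(f"10X accelerators: {with_10x:.2f}X overall, upper bound: {upper_bound:.1f}X")
```

With exactly 90% of the time accelerated, the overall speedup cannot exceed 10X no matter how fast the accelerators are; since the slide says "over 90%", the real bound is somewhat higher.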

Design automation
- Implement co-processor accelerators to meet performance requirements
- Use the tagging facilities in the Xilinx design environment to mark the functions for hardware acceleration
- 'Compile for target'
  – The tool chain creates an implementation that includes a MicroBlaze processor and the same interfaces as before
  – The implementation is augmented with three hardware accelerators that implement the multiplications, DCT, and inverse DCT
- The creation of the hardware accelerator blocks is done automatically:
  – Using an advanced C-to-hardware compiler optimized for platform FPGAs
  – 'Stitching' the accelerators into the new co-processing architecture
  – Handling the movement of the appropriate data to and from the accelerators

New architecture

Final results
- Enables the MP3 application to run in real time at a system clock rate of 67.5 MHz

A simple summary
- Platform-based design involves hardware/software codesign
- The right design decisions can provide a significant performance improvement
- Careful tradeoffs are needed between performance, resource usage, cost, and design time
- Platform FPGAs are a convenient, low-cost platform for such a task

Overview of the Rest of the Semester
- This is the last formal lecture
  – If we haven't covered it already, we can't really expect you to use it on your projects
- Quiz 2 is next Thursday. No class next Tuesday.
- Final project proposals are on 4/13 and 4/15
  – 2 teams each day; each team has 20 minutes
  – Proposal presentations can be sent to me before class or brought in on a flash memory device
- Initial report due on 4/20 (new due date)
  – Three pages (four at most)
  – May contain: introduction, background, motivation, impact, block diagram, and workload partition among team members
  – Goal: give us enough information that we can provide feedback about project complexity and suggestions
- From now on, I'll hold office hours during class meeting times to discuss final project-related issues
- Final Project Presentation: 5/12
- Final Project Report/Demo: due 5/14
- For details, refer to Lecture 14

Next time
- Quiz 2 (next Thursday, 4/8)