CS433 Spring 2001 Introduction Laxmikant Kale
2 Course objectives and outline You will learn about: –Parallel programming models Emphasis on 3: message passing, shared memory, and shared objects Ongoing evaluation and comparison of models –Parallel application classes –Parallel architectures Message passing support, routing, interconnection networks Cache-coherent scalable shared memory, synchronization Relaxed consistency models Novel architectures: Tera, Blue Gene, Processors-in-memory –Commonly needed parallel algorithms/operations –Performance analysis of parallel applications –Parallel application case studies
3 Project and homeworks Significant (effort and grade percentage) course project –groups of 5 students Homeworks/machine problems: –weekly (sometimes biweekly) Parallel machines: –NCSA Origin 2000, PC/SUN clusters
4 Resources Much of the course will be run via the web –Lecture slides and assignments will be available on the course web page –Most of the reading material (papers, manuals) will be on the web –Projects will coordinate and submit information on the web Web pages for individual projects will be linked to the course web page –Newsgroup: uiuc.class.cs433 You are expected to read the newsgroup and web pages regularly
5 Advent of parallel computing “Parallel computing is necessary to increase speeds” –cry of the ‘70s –processors kept pace with Moore’s law: Doubling speeds every 18 months Now, finally, the time is ripe –uniprocessors are commodities (and processor speeds show signs of slowing down) –Highly economical to build parallel machines
6 Why parallel computing It is the only way to increase speed beyond uniprocessors –Except, of course, waiting for uniprocessors to become faster! –Several applications require orders of magnitude higher performance than feasible on uniprocessors Cost effectiveness: –older argument –in 1985, a supercomputer cost 2000 times more than a desktop, yet performed only 400 times faster. –So: combine microcomputers to get speed at lower costs –Incremental scalability: can get in-between performance points with 20, 50, 100,… processors –But: You may get speedup lower than 400 on 2000 processors! Microcomputers became faster, effectively killing supercomputers
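The cost-effectiveness argument on this slide can be checked with a few lines of arithmetic. The following is a minimal sketch using the slide's 1985 figures (2000x the cost, 400x the speed). The function names and the `efficiency` parameter are illustrative assumptions, not part of the original slides.

```python
# Figures from the slide: in 1985 a supercomputer cost about 2000 times
# as much as a desktop and ran about 400 times faster.
SUPER_COST_RATIO = 2000.0
SUPER_SPEED_RATIO = 400.0


def parallel_speedup(processors: int, efficiency: float) -> float:
    """Speedup of a machine built from `processors` desktop-class processors.

    `efficiency` (a hypothetical parameter) is the fraction of ideal linear
    speedup actually achieved, between 0 and 1.
    """
    return processors * efficiency


def break_even_efficiency() -> float:
    """Efficiency at which 2000 desktops just match the supercomputer's speed."""
    return SUPER_SPEED_RATIO / SUPER_COST_RATIO


if __name__ == "__main__":
    print(break_even_efficiency())        # 0.2
    print(parallel_speedup(2000, 0.5))    # 1000.0
    print(parallel_speedup(2000, 0.125))  # 250.0
```

Under these assumptions, a 2000-processor machine at the same cost beats the supercomputer only if the application sustains more than 20% parallel efficiency. This is the caveat behind "speedup lower than 400 on 2000 processors".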
7 Technology Trends The natural building block for multiprocessors is now also about the fastest!
8 General Technology Trends Microprocessor performance increases 50% - 100% per year Transistor count doubles every 3 years DRAM size quadruples every 3 years Huge investment per generation is carried by huge commodity market Not that single-processor performance is plateauing, but that parallelism is a natural way to improve it [Figure: integer and FP performance; labels include Sun, MIPS M/120, IBM RS, MIPS M2000, HP, DEC alpha]
9 Technology: A Closer Look Basic advance is decreasing feature size (λ) –Circuits become either faster or lower in power Die size is growing too –Clock rate improves roughly proportional to improvement in λ –Number of transistors improves like λ^2 (or faster) Performance > 100x per decade; clock rate 10x, rest transistor count How to use more transistors? –Parallelism in processing: multiple operations per cycle reduces CPI –Locality in data access: avoids latency and reduces CPI; also improves processor utilization –Both need resources, so tradeoff Fundamental issue is resource distribution, as in uniprocessors [Figure: Proc, $ (cache), Interconnect]
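The role of CPI (cycles per instruction) on this slide follows from the standard relation: execution time = instruction count x CPI / clock rate. The sketch below illustrates it. The function name and all numeric values are hypothetical and chosen only for illustration.

```python
def execution_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Seconds to run a program: instruction count x CPI / clock rate."""
    return instructions * cpi / clock_hz


if __name__ == "__main__":
    # Hypothetical workload: one billion instructions on a 500 MHz processor.
    baseline = execution_time(1e9, cpi=2.0, clock_hz=500e6)
    # Same clock, but extra transistors spent on parallel issue and caches
    # are assumed to halve the average CPI.
    improved = execution_time(1e9, cpi=1.0, clock_hz=500e6)
    print(baseline, improved)  # 4.0 2.0
```

At a fixed clock rate, halving CPI halves the run time. This is why extra transistors spent on parallelism and locality count as performance gains separate from clock-rate gains.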
10 Clock Frequency Growth Rate 30% per year
11 Transistor Count Growth Rate 100 million transistors on chip by early 2000’s A.D. Transistor count grows much faster than clock rate - 40% per year, order of magnitude more contribution in 2 decades
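The annual growth rates quoted on slides 10 and 11 compound over time. The following minimal sketch shows how 30% per year (clock rate) and 40% per year (transistor count) accumulate. The helper name `growth_factor` is an assumption made for illustration.

```python
def growth_factor(annual_rate: float, years: int) -> float:
    """Total multiplicative growth after `years` at a fixed annual rate."""
    return (1.0 + annual_rate) ** years


if __name__ == "__main__":
    for years in (10, 20):
        clock = growth_factor(0.30, years)        # 30% per year (slide 10)
        transistors = growth_factor(0.40, years)  # 40% per year (slide 11)
        print(f"{years} years: clock x{clock:.1f}, transistors x{transistors:.1f}")
```

At these rates the clock grows roughly 14x per decade and the transistor count roughly 29x, so the gap between the two widens every year.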
12 Similar Story for Storage Divergence between memory capacity and speed –Capacity increased by 1000x from , speed only 2x –Gigabit DRAM by c. 2000, but gap with processor speed greater Larger memories are slower, while processors get faster –Need to transfer more data in parallel –Need deeper cache hierarchies –How to organize caches? Parallelism increases effective size of each level of hierarchy, without increasing access time Parallelism and locality within memory systems too –New designs fetch many bits within memory chip; follow with fast pipelined transfer across narrower interface –Buffer caches most recently accessed data Disks too: Parallel disks plus caching
13 Architectural Trends Architecture translates technology’s gifts to performance and capability Resolves the tradeoff between parallelism and locality –Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect –Tradeoffs may change with scale and technology advances Understanding microprocessor architectural trends –Helps build intuition about design issues of parallel machines –Shows fundamental role of parallelism even in “sequential” computers Four generations of architectural history: –Vacuum tube, transistor, IC, VLSI –Here focus only on VLSI generation Greatest delineation in VLSI has been in type of parallelism exploited
14 Architectural Trends Greatest trend in VLSI generation is increase in parallelism –Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit slows after 32 bit adoption of 64-bit now under way, 128-bit far (not performance issue) great inflection point when 32-bit micro and cache fit on a chip –Mid 80s to mid 90s: instruction level parallelism pipelining and simple instruction sets, + compiler advances (RISC) on-chip caches and functional units => superscalar execution greater sophistication: out of order execution, speculation, prediction –to deal with control transfer and latency problems
15 Economics Commodity microprocessors not only fast but CHEAP Development cost is tens of millions of dollars (5-100 typical) BUT, many more are sold compared to supercomputers –Crucial to take advantage of the investment, and use the commodity building block –Exotic parallel architectures no more than special-purpose Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors Standardization by Intel makes small, bus-based SMPs commodity Desktop: few smaller processors versus one larger one? –Multiprocessor on a chip
16 What to Expect? Parallel Machine classes: –Cost and usage define a class! Architecture of a class may change. –Desktops, Engineering workstations, database/web servers, supercomputers Commodity (home/office) desktop: –less than $10,000 –possible to provide processors for that price! –Driver applications: games, video/signal processing, possibly “peripheral” AI: speech recognition, natural language understanding (?), smart spaces and agents New applications?
17 Engineering workstations Price: less than $100,000 (used to be): –new acceptable price level may be $50,000 –100+ processors, large memory, –Driver applications: CAD (Computer aided design) of various sorts VLSI Structural and mechanical simulations… Etc. (many specialized applications)
18 Commercial Servers Price range: variable ($10,000 - several hundreds of thousands) –defining characteristic: usage –Database servers, decision support (MIS), web servers, e-commerce High availability, fault tolerance are main criteria Trends to watch out for: –Likely emergence of specialized architectures/systems E.g. Oracle’s “No Native OS” approach Currently dominated by database servers, and TPC benchmarks –TPC: Transaction Processing Performance Council benchmarks, which report transaction throughput –But this may change to data mining and application servers, with corresponding impact on architecture.
19 Supercomputers “Definition”: expensive system?! –Used to be defined by architecture (vector processors,..) –More than a million US dollars? –Thousands of processors Driving applications –Grand challenges in science and engineering: –Global weather modeling and forecast –Rational Drug design / molecular simulations –Processing of genetic (genome) information –Rocket simulation –Airplane design (wings and fluid flow..) –Operations research?? Not recognized yet –Other non-traditional applications?
20 Consider Scientific Supercomputing Proving ground and driver for innovative architecture and techniques –Market smaller relative to commercial as MPs become mainstream –Dominated by vector machines starting in 70s –Microprocessors have made huge gains in floating-point performance high clock rates pipelined floating point units (e.g., multiply-add every cycle) instruction-level parallelism effective use of caches (e.g., automatic blocking) –Plus economics Large-scale multiprocessors replace vector supercomputers –Well under way already
21 Scientific Computing Demand
22 Engineering Computing Demand Large parallel machines a mainstay in many industries –Petroleum (reservoir analysis) –Automotive (crash simulation, drag analysis, combustion efficiency), –Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism), –Computer-aided design –Pharmaceuticals (molecular modeling) –Visualization in all of the above entertainment (films like Toy Story) architecture (walk-throughs and rendering) –Financial modeling (yield and derivative analysis) –etc.
23 Applications: Speech and Image Processing Also CAD, Databases, processors gets you 10 years, 1000 gets you 20 !
24 Learning Curve for Parallel Applications AMBER molecular dynamics simulation program Starting point was vector code for Cray MFLOP on Cray90, 406 for final version on 128-processor Paragon, 891 on 128-processor Cray T3D
25 Raw Uniprocessor Performance: LINPACK
Fastest Computers [Figure: number of systems by class (PVP, MPP, SMP) at 11/93, 11/94, 11/95, 11/…]