Building Scalable, High Performance Cluster and Grid Networks: The Role Of Ethernet Thriveni Movva CMPS 5433
Overview
- About Grids/Clusters
- Uses of Grid Computing
- Differences between Grids and Clusters
- Benefits of the Grid
- Grid Architecture
- Building Ethernet Networks for Grids/Clusters
- Examples of Ethernet Grids/Clusters
- Conclusion/Summary
What Is a Grid Computer?
- A hardware and software system that integrates a collection of distributed components: computer systems, storage, etc.
- Solves large-scale computation problems
- Appears to the user as a single, large "virtualized" computing system
- Consists of geographically dispersed computers
What Is a Cluster?
- A multiprocessor system consisting of co-located computers and storage, viewed as though it were a single computer
- Connected through fast local area networks (localized within a room or building)
- Provides more speed and/or reliability than a single computer
- More cost-effective than single computers of comparable speed or reliability
Uses of Grid Computing
- Computer systems and other resources need not be dedicated to individual users or applications
- They can be pooled and shared dynamically according to changing needs
- Over the Internet, Grid-based resource sharing and collaborative problem solving can be extended to multi-institutional "Virtual Organizations"
Differences between Grids and Clusters
Grids:
- dispersed over a local, metropolitan, or wide area network
- span administrative boundaries
- focus on problems in distributed computing and resource sharing
- distribute workloads among different machine types and operating systems
Clusters:
- localized within a room or building
- single administration
- focus on compute-intensive problems and HPC
- homogeneous (single type of processor and OS)
Benefits of the Grid
Grid computing offers a number of potential uses and benefits that can be broadly categorized as:
- High Performance Computing (HPC)
- Data Federation and Collaboration
- Resource Allocation and Optimization
High Performance Computing (HPC)
- Computationally intensive, parallelizable applications benefit most
- Uses arrays of numerous commodity or specialized systems
- Most Grid applications fall into the HPC classification
Advantages of HPC:
- Cost-effective solutions to critical problems
- High return on investment
- Solves problems that were previously unsolvable within a given time and cost
- Solves problems too large for conventional supercomputers
Fields in which the HPC Grid has successfully addressed a wide range of computational problems include: climate/weather/ocean modeling and simulation, Internet search engines, signal/image processing, pharmaceutical research, and military forces simulation
Data Federation and Collaboration
- Consolidates data from different sources into a single data service
- Hides data location, local ownership, and infrastructure from the application
- No disruption of data by local users, applications, or data management policies
- Facilitates a wide range of integrated applications, such as: corporate performance dashboards, marketing analysis tools, customer service applications, data mining applications
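A minimal sketch of the federation idea, assuming two hypothetical departmental databases (the class, source names, and records below are illustrative, not from any real product): one query interface hides which backend owns each record.

```python
class FederatedDataService:
    """Presents multiple data sources as one logical data service."""

    def __init__(self, sources):
        # sources: mapping of source name -> list of record dicts
        self.sources = sources

    def query(self, **filters):
        # Gather matching records from every source; the caller never
        # sees where a record physically lives.
        results = []
        for records in self.sources.values():
            for rec in records:
                if all(rec.get(k) == v for k, v in filters.items()):
                    results.append(rec)
        return results

# Two hypothetical departmental databases federated into one view
crm = [{"customer": "Acme", "region": "east"}]
erp = [{"customer": "Globex", "region": "west"},
       {"customer": "Initech", "region": "east"}]

service = FederatedDataService({"crm": crm, "erp": erp})
east = service.query(region="east")   # draws from both sources transparently
```

The application asks one service one question; the middleware layer decides where the answer actually lives.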
Resource Allocation and Optimization
- Sharing of computing and storage improves resource utilization
- For example, applications and batch jobs can be transferred to an idle server
Benefits of resource optimization:
- Reclaims much of the stranded capacity of the computing infrastructure
- Reduces the level of capital investment
- No modification of existing applications required
Grid Computing Architecture
The basic architecture of a Grid consists of:
- User Interface
- Applications
- Grid Middleware
- Computing Resources
- Grid Network
Applications
Classification of parallel applications:
Embarrassingly Parallel Computations (EPC)
- Divided into independent parts
- Allocated to multiple processors for simultaneous execution
- No communication is required between the processors
- Example: testing large integers to determine prime numbers
Parametric and Data Parallel Computations
- Also referred to as Nearly Embarrassingly Parallel Computations (NEPC)
- Each processor works on an independent subset of the data
- The data is later gathered by a single process
- Example: Internet search engines
Loosely Coupled Synchronous Parallel Computations
- Require inter-process communication among a small subset of processors before the computation can be completed
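The EPC case above can be sketched in a few lines: primality tests on a batch of integers are fully independent, so they can be farmed out to workers with no inter-worker communication. (A real Grid would dispatch the parts to separate nodes; here a local thread pool stands in for the worker pool.)

```python
from concurrent.futures import ThreadPoolExecutor

def is_prime(n):
    """Trial-division primality test -- each call is fully independent."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

candidates = [101, 102, 103, 104, 105, 107]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() partitions the work across workers; results are gathered
    # back in input order by the master process
    verdicts = list(pool.map(is_prime, candidates))

primes = [n for n, ok in zip(candidates, verdicts) if ok]  # [101, 103, 107]
```

Because no worker ever needs another worker's result, this class of application scales almost linearly with the number of processors.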
Grid Middleware
- Gives the Grid the semblance of a single computer system
- Provides coordination among the Grid's computing resources
- Provides location transparency
- Allows applications to run over a virtualized layer of networked resources
- Available from system vendors and independent software vendors
- Example: Globus Toolkit
Functions of Middleware
Discovery and monitoring
- Discovers what resources or services are available
- Monitors their status
Resource allocation and management
- Matches application requirements to the available computing resources
- Creates and schedules remote jobs as required
- Ensures optimum load balancing and resource utilization
Security
- Shared resources may contain sensitive information
- Secures communications and authenticates user identities, e.g. using SSL/TLS
Message passing
- Used by compute-intensive parallel applications for inter-process communication
- Examples: MPI (Message Passing Interface) and PVM (Parallel Virtual Machine)
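A highly simplified sketch of the resource-allocation function above (node names, fields, and the least-loaded policy are illustrative assumptions; real middleware such as the Globus Toolkit does far more): match a job's requirements against advertised resources and pick the least-loaded node that satisfies them.

```python
# Hypothetical resource directory, as a discovery service might report it
resources = [
    {"node": "n1", "cpus": 8,  "mem_gb": 16, "load": 0.7},
    {"node": "n2", "cpus": 16, "mem_gb": 64, "load": 0.2},
    {"node": "n3", "cpus": 16, "mem_gb": 32, "load": 0.1},
]

def allocate(job, pool):
    """Return the least-loaded node meeting the job's CPU/memory needs."""
    candidates = [r for r in pool
                  if r["cpus"] >= job["cpus"] and r["mem_gb"] >= job["mem_gb"]]
    if not candidates:
        return None   # no match: the job must wait or be rejected
    return min(candidates, key=lambda r: r["load"])

job = {"cpus": 12, "mem_gb": 48}
chosen = allocate(job, resources)   # only n2 has both 12+ CPUs and 48+ GB
```

Discovery keeps the `resources` list current; allocation consults it per job; load balancing falls out of the selection policy.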
Ethernet Networks for Clusters and Grids
- Single-switch Clusters
- Large Clusters
- Ethernet Grid Networks
Single-switch Clusters
- Built using a single high-availability Gigabit Ethernet switch/router as the cluster interconnect
- The maximum size of a single-switch Ethernet cluster is determined by the non-blocking port capacity of the switch
- Current switch/routers can interconnect more than 600 GbE-connected servers
- All server ports are configured to be in the same subnet
Large Clusters
- Built using meshes of federated Ethernet switches
- The switches are arranged in non-blocking, constant bisectional bandwidth (CBB) topologies
- CBB topologies scale to support thousands of cluster nodes and provide high-bandwidth connectivity to the network
- The core of the cluster gives each node switch an equal share of the load to avoid blocking of ports
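A back-of-the-envelope sizing sketch, assuming a generic two-tier leaf/spine mesh (not any particular vendor's topology): the CBB property means each leaf switch dedicates as many ports to uplinks as to servers, so the fabric never oversubscribes.

```python
def cbb_two_tier(ports_per_switch):
    """Max non-blocking cluster size for a two-tier CBB leaf/spine mesh."""
    p = ports_per_switch
    down = p // 2          # leaf ports facing servers
    up = p - down          # leaf ports facing the core (equal share: CBB)
    spines = up            # one uplink from every leaf to every spine switch
    leaves = p             # each spine port terminates one leaf switch
    return {"servers": leaves * down, "leaves": leaves, "spines": spines}

fabric = cbb_two_tier(48)   # e.g. 48-port GbE switches
# 48 leaves x 24 servers each = 1152 non-blocking server ports,
# well into the "thousands of nodes" range the slide describes
```

Halving each switch's server-facing ports is the price of the constant bisectional bandwidth guarantee; in exchange, the cluster size grows roughly with the square of the switch port count.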
Ethernet Grid Networks (campus Grid network based on Ethernet switching)
- Ethernet allows a cluster to participate in a broader campus or enterprise Grid structure
- Desktop computers and workstations are connected to the campus Grid network using GbE
- Server farms outside the cluster are connected to site switches using GbE
- The campus LAN gives high priority to general Grid traffic and ensures critical Grid traffic does not incur added latency
Grid Tools
Tools used to prioritize critical Grid traffic:
Priority queuing
- The forwarding capacity of a congested port is immediately allocated to any high-priority traffic that enters the queue
Rate limiting and policing
- Limits the amount of lower-priority traffic that enters the network
Weighted Random Early Discard (WRED)
- Packet loss can be eliminated if buffers are never allowed to fill to capacity and overflow
- Overflows can be avoided by applying WRED to the lower-priority traffic
- WRED eliminates the possibility of high-priority packets arriving at a buffer that is already overflowing with lower-priority packets
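The WRED mechanism above can be sketched as a drop-probability curve (threshold and probability values here are illustrative assumptions, not from any particular switch): as the average queue depth grows, low-priority packets are discarded with increasing probability, so the buffer never fills and high-priority packets always find room.

```python
import random

def wred_drop_probability(avg_queue, min_th=20, max_th=80, max_p=0.5):
    """Linear WRED drop curve between the min and max thresholds."""
    if avg_queue < min_th:
        return 0.0           # queue shallow: never drop
    if avg_queue >= max_th:
        return 1.0           # queue near capacity: drop all low-priority
    # ramp linearly from 0 to max_p between the two thresholds
    return max_p * (avg_queue - min_th) / (max_th - min_th)

def admit(avg_queue, high_priority):
    """High-priority packets bypass WRED; low-priority ones may be dropped."""
    if high_priority:
        return True
    return random.random() >= wred_drop_probability(avg_queue)
```

Dropping a few low-priority packets early signals TCP senders to back off before the buffer overflows, which is how WRED keeps critical Grid traffic out of tail-drop situations.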
Examples of Ethernet Clusters/Grids
TeraGrid
- A multi-institutional effort to build and deploy the world's most comprehensive computing infrastructure for open scientific research
NASA
- NASA uses the ESDCD "Grid of clusters" to help scientists increase their understanding of the Earth, the solar system, and the universe through computational modeling and processing of space-borne observations
Conclusion/Summary
- Ethernet continues to evolve as a highly cost-effective and flexible technology
- The majority of parallel and general Grid applications are well served by the performance characteristics of Ethernet as the cluster/Grid interconnect
- In the future, Ethernet end-to-end data transfer bandwidth, message latency, and CPU utilization will improve dramatically thanks to NIC enhancements and volume production leading to price declines
- These developments are expected to improve the overall performance of existing Ethernet clusters/Grids and extend cluster/Grid technology to a broader range of commercial enterprises