Slide 1: Chiba City: A Testbed for Scalability and Development
FAST-OS Workshop, July 10, 2002
Rémy Evard, evard@mcs.anl.gov, http://www.mcs.anl.gov/chiba/
Mathematics & Computer Science Division, Argonne National Laboratory
Slide 2: The Chiba City Project
The objective of the Chiba City Project is to provide a computing platform for development and testing of large-scale high-performance computing software while carrying out research in systems software.
Motivations include:
- Improving our collective understanding of scalability issues.
- Enabling development and testing at scale to facilitate software for future systems. It's very difficult for most people to develop and test at any kind of large scale.
The first Chiba City cluster was installed at ANL in late 1999. The system has 256 dual-processor user nodes, plus the visualization, storage, and management systems described on the next slide. Emphasis was on open source software and commodity components.
The next platform is in the works. Target: 1K nodes, 8K virtual nodes.
Slide 3: Chiba City: The Argonne Scalability Testbed (27 Sep 1999)
- 8 Computing Towns: 256 dual Pentium III systems
- 1 Visualization Town: 32 Pentium III systems with Matrox G400 cards
- 1 Storage Town: 8 Xeon systems with 300 GB of disk each
- Cluster Management: 12 PIII mayor systems, 4 PIII front-end systems, 2 Xeon file servers, 3.4 TB disk
- High-Performance Network: 64-bit Myrinet
- Management Network: Gigabit and Fast Ethernet, with a Gigabit external link
Slide 4: Chiba City Usage Types (in priority order)
1. Scalability Testing: system software, OS releases, scientific code, networking, ...
2. Development: cluster-, scalability-, or HPC-related computer science, visualization, algorithms, ...
3. Computational Science: must be users who are willing to put up with and test a variable environment.
Slide 5: Associated Requirements
1. Computational Usage: basic development support, queue-based job submission, frequently serious I/O needs, and all the usual expectations.
2. Basic Development: on-demand access, interactive access, permission to stress the system, debugging capability.
3. System Development: root access, specialized kernels, occasional hardware access.
4. Extreme Development: user-defined node software, dynamic system services.
5. Hardware Development and Testing: serious hardware access.
6. Hybrid Model: highly customized node software to support computational usage.
Slide 6: Supporting Testbed Activities
- ChibaDB: hardware and software configurations, state tracking.
- OS image deployment mechanism: network boot, DB-based OS delivery.
- Management systems (mayors): 1 for every 32 user nodes.
- Remote power control and remote serial console.
- Systems: user nodes, viz nodes, login nodes.
- Node software: full-featured Linux by default, dynamically rebuilt user space, static management infrastructure.
In effect, the user's OS, environment, and access level can be scheduled in with the user's application, as sketched below.
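To make that last point concrete, here is a minimal, hypothetical sketch of how a DB-driven deployment might pair an OS image with a scheduled job: the configuration database records the desired image per node, and each town's mayor re-provisions its nodes via network boot before the job starts. The names (ChibaDB, Mayor, schedule_job) and the image labels are illustrative assumptions, not the actual Chiba City tooling.

```python
# Hypothetical sketch of DB-driven OS image deployment: a job requests an OS
# image, the config DB records the desired image per node, and the town's
# mayor re-provisions its nodes via network boot. Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ChibaDB:
    """Stand-in for the config DB: tracks the desired OS image for each node."""
    desired_image: dict = field(default_factory=dict)

    def set_image(self, node: str, image: str) -> None:
        self.desired_image[node] = image

@dataclass
class Mayor:
    """Management node responsible for one town of 32 user nodes."""
    name: str
    nodes: list

    def deploy(self, db: ChibaDB) -> None:
        for node in self.nodes:
            image = db.desired_image.get(node, "full-linux-default")
            # In the real system this would trigger a network boot / reimage;
            # here we only report what would happen.
            print(f"{self.name}: netboot {node} with image '{image}'")

def schedule_job(db: ChibaDB, mayors: list, nodes: list, image: str) -> None:
    """Record the job's requested OS image, then let the mayors re-provision."""
    for node in nodes:
        db.set_image(node, image)
    for mayor in mayors:
        mayor.deploy(db)

if __name__ == "__main__":
    db = ChibaDB()
    mayor1 = Mayor("mayor1", [f"n{i}" for i in range(1, 33)])
    schedule_job(db, [mayor1], ["n1", "n2"], image="minimal-bproc")
```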
Slide 7: Perspectives … on the size of the OS
When each user has an isolated environment (i.e., their own node), there's no reason that the kernel or the runtime system must be the same for all users:
- Fully loaded Linux
- Minimal kernel
- Bproc-based environment
- Whatever
We use full-featured Linux as the default because:
- Someone else maintains it.
- People know what to expect from it.
- Some of our apps (viz, development, ...) tend to use a lot of the system.
- It's necessary for the management infrastructure.
Defensible position: a one-size-fits-all scenario is unlikely, so let's keep our options open.
Indefensible positions:
- Some apps of the future will need a fully featured OS (and interactivity).
- If it's really worth it for performance and reliability (...?), then tune each OS (and runtime) to each app: the "OS working set" (see the sketch after this slide).
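A small illustrative sketch of the "OS working set" idea: pick a kernel/runtime combination per application class rather than one environment for everyone. The application classes and image labels below are made up for illustration.

```python
# Illustrative "OS working set": map each application class to a tuned
# kernel/runtime pair, defaulting to full-featured Linux otherwise.
# All names here are assumptions, not actual Chiba City configuration.
OS_WORKING_SET = {
    "visualization": {"kernel": "full-linux", "runtime": "X11 + OpenGL"},
    "development":   {"kernel": "full-linux", "runtime": "compilers + debuggers"},
    "mpi-compute":   {"kernel": "minimal",    "runtime": "MPI only"},
    "bproc-jobs":    {"kernel": "bproc",      "runtime": "bproc process space"},
}

def select_environment(app_class: str) -> dict:
    """Return the OS/runtime tuned to the application, defaulting to full Linux."""
    return OS_WORKING_SET.get(app_class, {"kernel": "full-linux", "runtime": "default"})

print(select_environment("mpi-compute"))  # {'kernel': 'minimal', 'runtime': 'MPI only'}
```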
Slide 8: Perspectives … on fault tolerance
- Al is right: everything can fail, and everything does fail.
- We've found that a lot of system-wide software (PBS, file systems, etc.) gets pretty confused when nodes continually vanish.
- In general, failures don't seem to be caused by having an entire Linux installation on the nodes.
- Users aren't very fault tolerant (or adaptable) either.
- Minor note: in our environment, the system itself can handle node failures, but it is extremely intolerant of management system failures.
Slide 9: Perspectives … on future paths
The software improvement rate is much slower than the hardware improvement rate, for example in:
- Fault tolerance
- Overall scalability
- Advanced programming models
- ...
Therefore we should be investigating as many interesting software ideas as possible and fostering software development and exploration.
Slide 10: Chiba City: A Testbed for Scalability and Development
FAST-OS Workshop, July 10, 2002
Rémy Evard, evard@mcs.anl.gov, http://www.mcs.anl.gov/chiba/
Mathematics & Computer Science Division, Argonne National Laboratory
Slide 11: The Chiba Management Infrastructure
[Diagram: user nodes n1-n32 and n33-n64 are grouped into towns, managed by mayor1 and mayor2 respectively; the citymayor, scheduler, login1/login2, and file1/file2 servers sit above the towns.]
- Management services: City DB, automatic image deployment, automatic config management, remote power control, remote serial console.
- 1 mayor per town, each managed by the city mayor (a small sketch of the node-to-mayor mapping follows).
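The node-to-mayor mapping implied by the diagram is simple arithmetic: towns of 32 user nodes, one mayor per town. The naming scheme (n1..n256, mayor1..mayor8) is assumed from the diagram, not taken from the actual configuration files.

```python
# Sketch of the management hierarchy: nodes grouped into towns of 32, one
# mayor per town, with the city mayor managing the mayors. Node and mayor
# names are assumed from the slide's diagram.
TOWN_SIZE = 32

def mayor_for(node: int) -> str:
    """Map a user node number (1-based) to the mayor of its town."""
    town = (node - 1) // TOWN_SIZE + 1
    return f"mayor{town}"

assert mayor_for(1) == "mayor1"
assert mayor_for(33) == "mayor2"    # n33 opens the second town
assert mayor_for(256) == "mayor8"   # last node of the eighth computing town

# The citymayor, scheduler, login, and file servers sit above the towns.
print({f"n{n}": mayor_for(n) for n in (1, 32, 33, 64, 255, 256)})
```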