The RAPIDS Project (Israel Koren, C. Mani Krishna, ARTS)

1 The RAPIDS Project
Israel Koren, C. Mani Krishna
Architecture and Real-time Systems (ARTS) Lab, Dept. of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA

2 RAPIDS: Our Goal
To develop a tool that aids in the analysis and enhancement of the performability of real-time systems
Performability = performance + reliability
  Measure performance through detailed monitoring
  Measure reliability through fault injection and monitoring fault recovery
Provide a framework to explore various configurations within the scope of the resources available
Provide an experimental testing capability for REE
November 22, 2018 JPL kickoff meeting 2000

3 RAPIDS: Why Is It Important?
It can help us further understand the interaction between hardware and software
Real applications are run on real hardware
Monitoring the application closely will expose
  Performance bottlenecks
  Recovery bottlenecks
  Design errors
Better understanding the working of the application
Making the application more efficient

4 RAPIDS: Wait, There’s More!!
Performability validation helps build confidence
  Combined capabilities of fault injection and recovery monitoring help test applications thoroughly
  The designer can experiment with various configurations and system parameters until the required performance is obtained
  More aggressive designs can be implemented
  Investigators can be assured of the performance, dependability and availability of the REE flight system

5 RAPIDS: The Current Version
RAPIDS 3.0 - The Simulator
A simulation testbed for evaluating real-time algorithms and software
Users can specify
  topology and network protocol
  task set to be run on the system
  type of (fault) environment
  various algorithms (allocation, scheduling & fault recovery)
Output: The number of deadline misses (among many other results) for the duration of the mission
RAPIDS 3.0 has already been installed at JPL
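
The simulator's headline output, the number of deadline misses over a mission, can be illustrated with a toy model. The sketch below is not RAPIDS 3.0. It assumes a single processor, preemptive earliest-deadline-first scheduling, deadlines equal to periods, and a transient-fault model in which a fault restarts the running job. All names (`Task`, `count_deadline_misses`, `fault_prob`) are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    period: int  # release interval in ticks; also the relative deadline (assumption)
    wcet: int    # execution time of one job, in ticks


def count_deadline_misses(tasks, mission_ticks, fault_prob=0.0, seed=0):
    """Tick-by-tick preemptive EDF on one processor (illustrative model only).

    With probability fault_prob per executed tick, a fault forces the
    running job to restart from scratch.
    """
    rng = random.Random(seed)
    ready = []  # each job is [absolute_deadline, remaining_ticks, task]
    misses = 0
    for t in range(mission_ticks):
        # Release a new job of every task whose period divides t.
        for task in tasks:
            if t % task.period == 0:
                ready.append([t + task.period, task.wcet, task])
        # Unfinished jobs whose deadline has arrived are counted and dropped.
        misses += sum(1 for job in ready if job[0] <= t)
        ready = [job for job in ready if job[0] > t]
        if ready:
            job = min(ready, key=lambda j: j[0])  # earliest deadline first
            if rng.random() < fault_prob:
                job[1] = job[2].wcet              # fault: redo the whole job
            else:
                job[1] -= 1
                if job[1] == 0:
                    ready.remove(job)
    # Jobs still unfinished whose deadline falls at the end of the mission.
    misses += sum(1 for job in ready if job[0] <= mission_ticks)
    return misses


task_set = [Task("a", period=4, wcet=1), Task("b", period=6, wcet=2)]
fault_free = count_deadline_misses(task_set, mission_ticks=12)
always_faulty = count_deadline_misses(task_set, mission_ticks=12, fault_prob=1.0)
```

Sweeping `fault_prob` or swapping the scheduling rule gives the kind of configuration-versus-misses comparison the slide describes, under these simplified assumptions.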

6 RAPIDS: The Next Version
RAPIDS 4.0 = Emulator + RAPIDS 3.0
[Block diagram. Labels as extracted: Configuration Parameters, Simulator, Algorithms, Emulator Configuration, Additional Parameters, Emulator, Simulator Configuration]

7 RAPIDS: The Emulator Design
The tool should not interfere greatly with the working of the application
Two phases of development:
  Phase I: Monitoring
    The MPI Wrapper
    The Monitoring Modules
    The Graphical User Interface
  Phase II: Control
    Configurability (Allocation and Scheduling Algorithms)
    Fault Injector
    Synthetic Workload

8 RAPIDS: The Process Model
[Diagram labels: Application Node (three), Main Display Node, IGP, Main Control Module, GUI. Legend: Application Task, MPI Wrapper, Application Node, Local Control Module, Monitoring Channel, Control Channel, IGP (Info Gathering Process)]

9 RAPIDS: The MPI Wrapper
All applications are assumed to use MPI for communication
Important MPI calls are wrapped with a system call that sends relevant information to the display node
The lightweight wrapper minimizes
  system overhead due to extra system calls
  network overhead due to extra messages
Evaluation of the overhead is important
The applications will now use the RAPIDS-MPI library
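
The wrapping idea can be sketched independently of MPI. The example below is a minimal illustration, not the RAPIDS-MPI library: `send` is a stand-in for a real message-passing call, and `monitored`, `events` and the event fields are hypothetical names. Each wrapped call does its normal work and then reports one small record to a sink, which in RAPIDS would be a message to the display node.

```python
import time
from functools import wraps


def monitored(event_sink, call_name):
    """Wrap a communication call so each invocation also reports one event."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            # Keep the record small: the wrapper itself adds overhead.
            event_sink({
                "call": call_name,
                "elapsed_s": time.perf_counter() - start,
                "nargs": len(args),
            })
            return result
        return wrapper
    return decorate


events = []  # stand-in for the channel to the display node


@monitored(events.append, "send")
def send(dest, payload):
    """Stand-in for a real send; returns the number of bytes 'sent'."""
    return len(payload)


sent = send(1, b"hello")
```

Timing the same call with and without the decorator is one simple way to estimate the wrapper overhead the slide says must be evaluated.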

10 RAPIDS: Monitoring Modules
IGP: Information Gathering Process
  one IGP per application
  collects monitoring messages through MPI calls from subtasks
  forwards messages to the main module through IPC
MCM: Main Monitoring and Control Module
  accepts input from the user through GUI
  spawns the IGPs, executes appropriate “mpirun”s
  collects and displays all monitoring information
IGPs and MCM are located in the Main Display Node (MDN), separate from the nodes running the applications
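
The division of labour between the IGPs and the MCM can be sketched as follows. This is an in-process stand-in under stated assumptions: a `queue.Queue` replaces the real IPC channel, no processes are spawned, and every class and method name is hypothetical.

```python
import queue


class InfoGatheringProcess:
    """One per application: tags each monitoring message and forwards it."""

    def __init__(self, app_name, main_inbox):
        self.app_name = app_name
        self.main_inbox = main_inbox

    def collect(self, subtask_id, message):
        # In RAPIDS this message would arrive through an MPI call.
        self.main_inbox.put((self.app_name, subtask_id, message))


class MainControlModule:
    """Creates the IGPs and gathers everything they forward."""

    def __init__(self):
        self.inbox = queue.Queue()  # stand-in for the IPC channel
        self.igps = {}

    def spawn_igp(self, app_name):
        igp = InfoGatheringProcess(app_name, self.inbox)
        self.igps[app_name] = igp
        return igp

    def drain(self):
        """Return all monitoring records received so far, in arrival order."""
        records = []
        while True:
            try:
                records.append(self.inbox.get_nowait())
            except queue.Empty:
                return records


mcm = MainControlModule()
otis_igp = mcm.spawn_igp("OTIS")
ngst_igp = mcm.spawn_igp("NGST")
otis_igp.collect(0, "task start")
ngst_igp.collect(2, "message sent")
records = mcm.drain()
```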

11 RAPIDS: The GUI
Provides a detailed pictorial view showing:
  allocation of the subtasks of the application to nodes
  start and end of task instance
  messages sent and received during execution
  checkpointing epochs
  faults injected and recovery actions taken
User can choose from various levels of monitoring
Extra monitoring handlers allow user to display other important events or values of key variables
Extra handlers are part of the RAPIDS-MPI library

12 RAPIDS: Enhanced GUI & Apps
The Enhanced GUI will provide further flexibility
  User can select specific variables and events to be monitored, through an easy interface
  The display of certain events/variables can be user-defined
Two REE applications are being used:
  OTIS
  NGST
Both have been successfully ported and run

13 RAPIDS: Control Parameters
The user can analyze the impact of various system parameters and determine their appropriate values
Selectable System Configuration Parameters:
  Application(s) to be run
  Number of subtasks for each application
  Subset of nodes on which to run the applications
  Task Allocation -- manual or algorithm-based
  Scheduling of tasks on application nodes
    depending on the operating system used
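
As an illustration of what "algorithm-based" task allocation could look like, the sketch below places subtasks on the least-loaded node, largest subtask first. The slides do not say which allocation algorithms RAPIDS offers. This greedy rule, the utilization figures and the node names are assumptions chosen for the example.

```python
def allocate_least_loaded(subtasks, nodes):
    """Greedy allocation: biggest subtask first, always onto the lightest node.

    subtasks: dict mapping subtask name -> utilization (fraction of one node)
    nodes:    list of node names
    Returns (placement, load) dictionaries.
    """
    load = {node: 0.0 for node in nodes}
    placement = {}
    # Sort by decreasing utilization so large subtasks are placed early.
    for name, util in sorted(subtasks.items(), key=lambda kv: -kv[1]):
        target = min(nodes, key=lambda node: load[node])  # ties: first node wins
        placement[name] = target
        load[target] += util
    return placement, load


placement, load = allocate_least_loaded(
    {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.2},
    ["node0", "node1"],
)
```

A manual allocation is simply a hand-written `placement` dictionary, so both options from the slide can feed the same launch step.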

14 RAPIDS: Task Parameters
User can specify the period of each subtask
Synthetic workloads can be used to emulate applications that are unavailable
User can specify synthetic tasks through
  a detailed user interface
  a task trace generated earlier
Workload surges can be emulated
  Ability to handle load surges is another measure for dependability
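
A synthetic periodic task with a load surge can be described by its release times alone. The sketch below is one possible model, not the RAPIDS task editor: it assumes a surge is an interval during which the task is released `surge_factor` times as often, and all names are hypothetical.

```python
def release_times(period, horizon, surge=None, surge_factor=2):
    """Release instants of one synthetic periodic task over [0, horizon).

    surge: optional (start, end) interval; releases that fall inside it are
           followed by a shortened gap of period / surge_factor.
    """
    times = []
    t = 0.0
    while t < horizon:
        times.append(t)
        in_surge = surge is not None and surge[0] <= t < surge[1]
        t += period / surge_factor if in_surge else period
    return times


steady = release_times(period=10, horizon=50)
surging = release_times(period=10, horizon=50, surge=(20, 40))
```

Feeding both traces to the same configuration shows how many extra deadline misses the surge causes, which is the dependability measure the slide mentions.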

15 RAPIDS: Fault Parameters
Type of fault
  Register faults
  Memory faults
  I/O device failures
  Network faults
    message corruption
    message delaying/loss
Time and duration of fault
  Wall clock time
  Stochastic (based on a distribution)
Selectable parameters determined by the fault injector
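
The fault parameters above map naturally onto a small record plus a schedule generator. The slide says only that stochastic timing is "based on a distribution". The sketch assumes exponentially distributed gaps between faults (a Poisson process), and the type names and field names are hypothetical.

```python
import random
from dataclasses import dataclass

# Fault types taken from the slide; the identifiers themselves are made up.
FAULT_TYPES = (
    "register",
    "memory",
    "io_device",
    "network_corruption",
    "network_delay_loss",
)


@dataclass
class FaultSpec:
    kind: str
    start_s: float     # injection time, seconds from the start of the run
    duration_s: float  # how long the fault persists


def stochastic_schedule(rate_per_s, horizon_s, duration_s, seed=0):
    """Draw fault times with exponential gaps (assumed distribution)."""
    rng = random.Random(seed)
    t = 0.0
    plan = []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= horizon_s:
            return plan
        plan.append(FaultSpec(rng.choice(FAULT_TYPES), t, duration_s))


plan = stochastic_schedule(rate_per_s=0.5, horizon_s=100.0, duration_s=0.01)
```

A wall-clock schedule is the degenerate case: a hand-written list of `FaultSpec` entries with fixed `start_s` values.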

16 RAPIDS: Fault Injector
We start by integrating SWIFI into RAPIDS
SWIFI capabilities:
  fault injection into application’s virtual memory address space
    registers, code, data, heap, stack or user defined regions
  multiprocessor fault injection
  some rudimentary monitoring
  centrally controllable
The LCM (Local Control Module) houses SWIFI

17 RAPIDS: SWIFI status
Fault Injection:
  SWIFI4 ported to Linux
  Initial experiments have run successfully
  Initial results show that most faults can cause process to crash (dump core)
SWIFI primarily relies on the ptrace() system call
ptrace() was designed for debugging programs
  setting and clearing break points
  reading and writing to virtual space
  thus emulating faults

18 RAPIDS: SWIFI & ptrace
ptrace() drawbacks:
  The kernel must do four context switches for each fault injection
    this interference can slow down the application considerably
  ptrace() can only be used on child processes
    Requires a separate parent process for each MPI task
  ptrace() requires modifications to the source code
    A child process must call TRACE_ME (enter trace mode)

19 RAPIDS: Fault Injection Using the /proc file system
The /proc file system contains files for each process
  Information about the process status, memory, network statistics etc.
Location-specific faults can be injected by reading and writing to the appropriate offset of the file
  No context switches are involved
  Multiple faults in contiguous locations can be injected in one call
  A separate process for each MPI task is not required
  Source code of the application is not needed
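
The read-modify-write at a file offset described above can be sketched as a single bit flip. The example is a hedged illustration, not the RAPIDS injector: it runs against an ordinary temporary file as a stand-in. Pointing it at a process memory file (on Linux, `/proc/<pid>/mem`) is an assumption about how the approach would be applied, and that file needs suitable privileges and a valid mapped address. The function name is hypothetical.

```python
import os
import tempfile


def flip_bit_at(path, offset, bit):
    """Flip one bit of the byte at `offset` in the file at `path`.

    Returns (original_byte, corrupted_byte) as integers.
    """
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    with open(path, "r+b", buffering=0) as f:
        f.seek(offset)
        original = f.read(1)
        if len(original) != 1:
            raise ValueError("offset is past the end of the file")
        corrupted = bytes([original[0] ^ (1 << bit)])
        f.seek(offset)       # go back: the read advanced the position
        f.write(corrupted)
    return original[0], corrupted[0]


# Demonstration on a scratch file standing in for a memory image.
fd, scratch = tempfile.mkstemp()
os.write(fd, b"\x00\xff\x10")
os.close(fd)
before, after = flip_bit_at(scratch, offset=2, bit=0)
with open(scratch, "rb") as f:
    contents = f.read()
os.remove(scratch)
```

Writing several bytes in the one `write` call would correspond to the slide's point that multiple faults in contiguous locations can be injected at once.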

20 RAPIDS: Fault Injection (cont.)
Fault injection through the /proc file system
  Only one fault injector process needed per system
  Only the superuser can use this facility
Evaluating fault tolerance of the OS is important
  It is rarely swapped out of physical memory
  ptrace() cannot be used to debug the OS
  The /proc file system can be used to inject faults into the physical memory directly including the OS
The /proc file system approach seems a viable option!

21 Setup of the ARTS Lab Cluster
[Network diagram. Labels as extracted: Windows PC, Tintin, Nestor, DNS, DHCP, WWW, DeskJet, Bianca, ECS Gateway, Eric’s Laptop, Firewall, 100 Ethernet Hub, Haddock, Calculus, Thomson, Snowy, NFS, NIS, Myrinet Switch, LaserJet 2100M. Legend: Myrinet Host Card, Ethernet Card, Parallel Port, PC Chassis]

22 RAPIDS: Monitoring Alternatives
User can choose from these two alternatives
  Myrinet-only
    Monitoring messages also pass through Myrinet
    Produces both system as well as network overhead
  Myrinet-Ethernet
    Monitoring messages use only “EtherNetwork”
    Main overhead is from wrapper calls
These alternatives provide a way to assess network overhead

23 RAPIDS: Remote Experimentation
Restricted remote access to our lab cluster will be provided
User downloads the RAPIDS Java Applet (RAJA)
Only encrypted GUI update messages are passed
[Diagram labels: S, S, ARTS Cluster, JPL]

24 RAPIDS: Deliverables
The RAPIDS Emulator
  The MPI Wrapper
  The Monitoring and Control Modules
  The RAPIDS GUI
  SWIFI integrated into RAPIDS
  Synthetic workloads
Light-weight fault recovery techniques

25 RAPIDS: Conformance with Specs
Use of COTS components
  Current development on a Linux/Myrinet system using MPI for message passing
  Plans to use Linux-RT in the future
Portability of software - a design requirement
Use of REE fault model and REE Executive
Our timeline fits well into REE schedule
  Completion of Phase I: December 2000
  Completion of Phase II: September 2001

26 RAPIDS: Long-Term Plans
Integration of the emulator and simulator
Fault injection through the /proc filesystem
Light-weight application-specific checkpointing
  Exploitation of application-level information
Integrating our application-level fault recovery techniques (ALFT) into RAPIDS
Techniques for low-power fault recovery
Tools for evaluating power-aware techniques

27 RAPIDS: Application-Level Fault Tolerance
Ongoing work in ARTS Lab
Lightweight application-specific fault detection and fault recovery techniques
Key Idea: Exploit application semantics to implement low overhead fault tolerance
Initial results are very promising
  The RTHT benchmark requires 15% redundancy
  Redundancy can be tuned to the extent of fault-tolerance required
Guidelines to develop ALFT for applications

28 RAPIDS: Power-Aware Real-time Systems
Ongoing work in ARTS Lab
Design of power-aware algorithms
  Impact of high-level algorithms such as allocation algorithms, scheduling algorithms etc. on power
Voltage clock scaling - an alternative for reducing energy consumption in embedded systems
Initial results reveal that considerable energy savings can be obtained through voltage scaling and appropriate allocation and scheduling algorithms
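
Why voltage scaling saves energy can be shown with the standard CMOS dynamic-energy model, in which the energy per clock cycle is proportional to C·V². The sketch below is a textbook illustration, not the ARTS Lab's results. It additionally assumes that clock frequency scales linearly with supply voltage, which is a simplification, and all names and numbers are hypothetical.

```python
def dynamic_energy(cycles, voltage, capacitance=1.0):
    """Dynamic switching energy: capacitance * V^2 per cycle (textbook model)."""
    return capacitance * voltage ** 2 * cycles


def slowest_feasible_scale(cycles, deadline_s, f_max_hz):
    """Smallest frequency scale factor in (0, 1] that still meets the deadline."""
    needed = cycles / (deadline_s * f_max_hz)
    if needed > 1.0:
        raise ValueError("deadline cannot be met even at full speed")
    return needed


# A job of 1e6 cycles, a 20 ms deadline, and a 100 MHz maximum clock.
scale = slowest_feasible_scale(cycles=1e6, deadline_s=0.02, f_max_hz=1e8)
full_speed = dynamic_energy(1e6, voltage=1.0)
# Assumption: voltage can be lowered by the same factor as frequency.
scaled = dynamic_energy(1e6, voltage=1.0 * scale)
saving = 1.0 - scaled / full_speed
```

Under these assumptions, running at half speed uses about a quarter of the energy, so slack left by a good allocation or schedule translates directly into energy savings.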

29 Summary
RAPIDS: A multi-faceted tool
  An integrated platform for the launch, monitoring and validation of real applications on real hardware
  A framework to test different configurations (parameters, algorithms) to get the best out of the system
  Monitoring exposes bottlenecks and provides feedback for improvement
  Validation ensures the system meets the goals of the mission

30 RAPIDS: The Console

31 RAPIDS: Schedule Windows

32 RAPIDS: The Task Editor

33 RAPIDS: Enhanced GUI

