Interactive Tool Use Cases

Title

Interactive Tools

High Level Description

This set of use cases involves situations in which client code, running on the system front-end service, on a set of separate client machines, or on both, interacts with the target application code. Optionally, one or more people interact with the client machines.

The interactions are mediated by agent code, which runs on the same processing elements as the application at the leaves of the communication infrastructure, and by filter code, which runs as a plugin at the internal routing nodes of the communication infrastructure.

Environment Assumptions

  1. There exists a single node (possibly a workstation) that manages interactions with the user.
  2. Multiple frontends may interact with the infrastructure simultaneously.
  3. Authentication and authorization are consistent with the technique(s) employed by the non-debug launch mechanism, e.g. the tool can use the same credentials that were used to launch the original job.
  4. A concise description of the job (task ID, executables, node set, environment, etc.) can be determined.
  5. There must be a minimum level of system support on which the infrastructure can operate (e.g. remote process launch, push file to remote node).
  6. Heterogeneous configurations are allowed, assuming that the appropriate runtime environments are provided by the operating system.
  7. The runtime (OS) supports attaching to and detaching from a running context.
  8. The runtime (OS) supports thread level interactions.

STCI Infrastructure Assumptions

  1. The STCI infrastructure matches the architecture diagram started at Hawthorne.
  2. Requires the job node set as input. A job node is a network-addressable component having a unique copy of the operating system. That is, multiple job nodes can share the same hardware.
  3. Requires the topology as input. The topology is how the plugins and agents are connected.
  4. Establishes the interconnect between the front end and the job node set via the topology (including fan-out).
  5. Provides an enumeration of the job node set to allow the tool to specify and configure the agents on a per job node basis.
  6. Provides an enumeration of the plug-in node set including its location (level) in the hierarchy.
  7. Provides default group operations (e.g. broadcast, gather, etc.).
  8. Provides the ability to specify which agents are located on which nodes, and to pass configuration to sets of agents.
  9. Supports the configuration of sets of tool plugins or agents.
  10. Provides the ability to send/receive over subsets of nodes.
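
The assumptions above about the job node set, topology, enumeration, and group operations can be illustrated with a small in-memory sketch. Nothing here is an STCI API: the `Node` class, `build_topology`, `enumerate_nodes`, and `gather` are hypothetical names invented for illustration, and the balanced fan-out tree is only one possible topology.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """Hypothetical topology node; not part of any real STCI interface."""
    name: str
    kind: str  # "frontend", "plugin", or "agent"
    children: List["Node"] = field(default_factory=list)
    level: int = 0


def _assign_levels(node: Node, level: int) -> None:
    node.level = level
    for child in node.children:
        _assign_levels(child, level + 1)


def build_topology(job_nodes: List[str], fanout: int) -> Node:
    """Build a tree with one agent leaf per job node and plug-in interior nodes."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    layer = [Node(name, "agent") for name in job_nodes]
    count = 0
    # Group the current layer under new plug-in parents until it fits under the root.
    while len(layer) > fanout:
        parents = []
        for i in range(0, len(layer), fanout):
            parents.append(Node(f"plugin{count}", "plugin", layer[i:i + fanout]))
            count += 1
        layer = parents
    root = Node("frontend", "frontend", layer)
    _assign_levels(root, 0)
    return root


def enumerate_nodes(root: Node, kind: str) -> List[Tuple[str, int]]:
    """Return (name, level) for every node of the given kind, depth first."""
    found = [(root.name, root.level)] if root.kind == kind else []
    for child in root.children:
        found.extend(enumerate_nodes(child, kind))
    return found


def gather(root: Node, read: Callable[[str], int],
           combine: Callable[[List[int]], int]) -> int:
    """Read a value at each agent and combine the results at every interior node."""
    if root.kind == "agent":
        return read(root.name)
    return combine([gather(child, read, combine) for child in root.children])
```

For example, `enumerate_nodes(root, "plugin")` gives a tool the level information it would need to configure plug-ins per level, and `gather(root, read, sum)` reduces one value per agent up the tree.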

Specific Use Cases:

Long Running Application Monitoring

A set of stakeholders, including system operations staff, members of the code team, and end users of the calculation being done, monitors the progress of a long running application. Information available through the monitoring process can come from a variety of sensors, including system data (utilization, performance counter measurements) as well as application data.

Steering the Computation

This scenario adds controls or actuators to the monitoring scenario. The simplest form of steering is the ability to abort or pause (through a checkpoint) the computation. The general case allows the tool to interact with the application through an appropriate (non-infrastructure) API.

Debugging Scenarios

The debugging scenario involves attaching a scalable debugging tool to the application. All of the usual debugging operations would be supported. In addition, new debugging methods would be available. For example, conventional conditional breakpoints could be extended to support filtering, clustering, and other analytic methods to deal with the huge number of nodes on the capability systems.
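
One way to picture the clustering mentioned above is to group nodes that report the same state, so that the user sees a few equivalence classes instead of one entry per node. The sketch below is an assumption made for illustration, not a method defined by this page: the "signature" string (for instance a stack trace summary) and the functions `cluster_reports` and `merge_clusters` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_reports(reports: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group node names by the state signature each node reported."""
    clusters = defaultdict(list)
    for node, signature in reports:
        clusters[signature].append(node)
    return {sig: sorted(nodes) for sig, nodes in clusters.items()}


def merge_clusters(partials: Iterable[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Combine the partial clusterings produced by several child filters."""
    merged = defaultdict(list)
    for partial in partials:
        for sig, nodes in partial.items():
            merged[sig].extend(nodes)
    return {sig: sorted(nodes) for sig, nodes in merged.items()}
```

A filter at an internal routing node could apply `cluster_reports` to its own children and forward only the result, while filters higher in the tree apply `merge_clusters`, so the data volume reaching the front end depends on the number of distinct states rather than the number of nodes.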

There are (at least) three sub-cases of the debugging scenarios to consider:

Start an Application with the Debugger Attached.

  1. The application is compiled to include appropriate debug support.
  2. The application components are accessible for execution on the job nodes.
  3. A description of the interconnect topology node set is provided. This description is either prepared by the user or generated by the job scheduler/manager. The topology node set is always >= job node set. The specific details are outside the scope of STCI. The debugger front-end must accept the description and transform it into the input required to configure and initialize the STCI. Any additional resources, such as a high speed switch, are included.
  4. The user makes whatever arrangements are necessary to schedule the job. If this involves interacting with a job scheduler, then the resources requested reflect the resources required by the interconnect topology and not just the application.
  5. A description of the parallel job to be debugged is provided. Details in the description include:
    • the topology node set and the job node set. (Where)
    • the command(s) to be executed on a job node. (What)
    • the number of tasks involved (How many)
    • the user identity (Who)
    • the batch queue id (When)
  6. The debugger FE establishes an STCI session
  7. The debugger FE passes the job node set information and the topology node set information into the STCI.
  8. The STCI interconnects the nodes.
  9. The debugger FE requests an enumeration of the topology plug-in node set.
  10. The debugger FE configures the parallel debugger plug-ins by utilizing the level information and identification information in the enumeration.
  11. The debugger FE requests an enumeration of the job nodes.
  12. The debugger FE sends a request to each job node to connect the appropriate debugger agents.
  13. The connection of the debugger agents is acknowledged.
  14. The debugger FE sends a command to each debugger agent describing how to launch the debuggee.
  15. The launch operation (usually) includes the suspension of the debuggee(s).
  16. The debugger agent(s) acknowledge the launch operations.
  17. The debugger tool is now ready to use.
  18. The debugger tool is used to debug the parallel job. The details of this process are outside the scope of STCI.
  19. Go to Debugger Tool Shutdown.
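
The start-up sequence above can be sketched against an in-memory stand-in for the infrastructure. Everything named here is hypothetical: `JobDescription`, `FakeSession`, its methods, and the action strings are invented for illustration and do not describe a real STCI interface. The sketch also assumes that topology-only nodes host the plug-ins and that every request is acknowledged with a boolean.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobDescription:
    """Hypothetical container for the job description in step 5."""
    topology_nodes: List[str]  # Where: nodes hosting the interconnect
    job_nodes: List[str]       # Where: nodes running the application
    command: List[str]         # What: command executed on each job node
    num_tasks: int             # How many
    user: str                  # Who
    queue_id: str              # When: batch queue id

    def validate(self) -> None:
        # The topology node set must contain the job node set (step 3).
        missing = set(self.job_nodes) - set(self.topology_nodes)
        if missing:
            raise ValueError(f"job nodes outside topology: {sorted(missing)}")


class FakeSession:
    """In-memory stand-in for a session; it only records what it is asked to do."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.job_nodes: List[str] = []
        self.plugins: Dict[str, int] = {}

    def interconnect(self, job_nodes: List[str], topology_nodes: List[str]) -> None:
        self.job_nodes = list(job_nodes)
        # Assumption: topology-only nodes host plug-ins, all at level 1.
        self.plugins = {n: 1 for n in topology_nodes if n not in job_nodes}
        self.log.append("interconnect")

    def enumerate_plugins(self) -> Dict[str, int]:
        self.log.append("enumerate_plugins")
        return dict(self.plugins)

    def enumerate_job_nodes(self) -> List[str]:
        self.log.append("enumerate_job_nodes")
        return list(self.job_nodes)

    def request(self, node: str, action: str, payload=None) -> bool:
        self.log.append(f"{action}:{node}")
        return True  # stands in for the acknowledgement


def start_under_debugger(session: FakeSession, job: JobDescription) -> List[str]:
    """Walk through steps 7 to 16 and return the nodes that are ready."""
    job.validate()
    session.interconnect(job.job_nodes, job.topology_nodes)        # steps 7-8
    for plugin, level in session.enumerate_plugins().items():      # steps 9-10
        session.request(plugin, "configure_plugin", {"level": level})
    nodes = session.enumerate_job_nodes()                          # step 11
    if not all(session.request(n, "connect_agent") for n in nodes):  # steps 12-13
        raise RuntimeError("an agent failed to connect")
    launch = {"command": job.command, "suspend": True}             # steps 14-15
    if not all(session.request(n, "launch", launch) for n in nodes):  # step 16
        raise RuntimeError("a launch was not acknowledged")
    return nodes
```

The recorded `log` makes the ordering constraint visible: plug-ins are configured before agents are connected, and agents are connected before any debuggee is launched.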

Attach a Debugger Tool to a Running Job.

  1. Assume that the application components have been compiled with appropriate debug support.
  2. The debugger tool acquires a description of the job node set. This comes from the user or from the job scheduler/manager.
  3. The information in the parallel job description is similar to that required to start the job initially but includes details about the identity of the running application components associated with the parallel job.
  4. The debugger tool establishes an STCI session.
  5. The topology node set equals the complete job node set.
  6. The user can select a subset of the job node set for attachment.
  7. The debugger tool passes the job node (sub)set information and the topology node set into the STCI.
  8. The STCI interconnects the nodes.
  9. The debugger FE requests an enumeration of the topology plug-in node set.
  10. The debugger configures the parallel debugger plug-ins by utilizing the level information and identification information in the enumeration.
  11. The debugger FE requests an enumeration of the job node (sub)set.
  12. The debugger FE sends a request to each job node to connect the appropriate debugger agents.
  13. The connection of the debugger agents is acknowledged.
  14. The debugger sends a command to each debugger agent describing how to attach to the debuggee.
  15. The attach operation suspends the debuggee(s).
  16. The debugger agent(s) acknowledge the attach operations.
  17. The debugger tool is now ready to use.
  18. The debugger tool is used to debug the parallel job. The details of this process are outside the scope of STCI.
  19. Go to Debugger Tool Shutdown.
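
The attach case differs from the start case mainly in that the job description identifies components that are already running and that the user may select only a subset of the job nodes. A minimal sketch of those two points follows. The function `build_attach_requests`, the mapping from node name to process ids, and the request fields are assumptions made for illustration; the page does not define how running components are identified.

```python
from typing import Dict, Iterable, List, Optional


def build_attach_requests(running: Dict[str, List[int]],
                          selected: Optional[Iterable[str]] = None) -> List[dict]:
    """Build one attach request per chosen job node.

    running  maps a job node name to the ids of its running components.
    selected is the optional subset chosen by the user (default: all nodes).
    """
    nodes = sorted(running) if selected is None else sorted(selected)
    unknown = [n for n in nodes if n not in running]
    if unknown:
        raise ValueError(f"not part of the running job: {unknown}")
    # The attach operation suspends the debuggee(s), so every request says so.
    return [{"node": n, "action": "attach", "pids": list(running[n]), "suspend": True}
            for n in nodes]
```

Each request would then be sent to the debugger agent on the named node, which acknowledges once the attach has completed.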

Debugger Tool Shutdown

  1. The user requests that the debug session be terminated.
  2. The debugger FE sends a detach command to each debugger agent. The detach command specifies whether the debuggee(s) should be terminated or allowed to continue.
  3. The debugger agent(s) acknowledge the detach operation.
  4. The debugger FE sends a command to the debugger agents to disconnect from STCI and terminate.
  5. The debugger FE sends a command to the debugger plug-ins to disconnect from STCI and terminate.
  6. The STCI reports the disconnect of the debugger agent(s).
  7. The debugger FE closes the STCI session, causing the STCI to tear down the interconnect.
  8. If the application has been terminated, the debugger tool informs the job scheduler of the termination of the job.
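
The ordering of the shutdown steps can be written down as a plan that a front end would execute in sequence. This is a sketch only: `shutdown_plan` and the action names are hypothetical, and acknowledgements (steps 3 and 6) are left out for brevity.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, str, Optional[dict]]


def shutdown_plan(agents: List[str], plugins: List[str],
                  terminate_debuggee: bool) -> List[Step]:
    """Return the ordered (action, target, payload) steps for a shutdown."""
    plan: List[Step] = []
    # Step 2: detach every agent, stating whether the debuggee is terminated.
    for agent in agents:
        plan.append(("detach", agent, {"terminate": terminate_debuggee}))
    # Step 4: agents disconnect before the plug-ins that route their traffic.
    for agent in agents:
        plan.append(("disconnect", agent, None))
    # Step 5: plug-ins disconnect next.
    for plugin in plugins:
        plan.append(("disconnect", plugin, None))
    # Step 7: closing the session tears down the interconnect.
    plan.append(("close_session", "stci", None))
    # Step 8: the scheduler is told only if the application was terminated.
    if terminate_debuggee:
        plan.append(("notify_scheduler", "scheduler", None))
    return plan
```

Keeping the plan as data makes it easy to check that no plug-in is disconnected while an agent beneath it is still attached.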

Adaptive (and Interactive) Performance Monitoring

The client code and the filters work together to activate appropriate levels of low overhead monitoring in the agents. The filters are adaptive, depending either on parameters passed in from the client and/or dynamically on the data being collected. The client code runs adaptive analyses and presents both filtered data and analytic information to the user. User interactions modify the behavior of the client code.
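
As a rough illustration of a filter that adapts both to client parameters and to the data itself, the sketch below forwards a sample only when it deviates from a running average by more than a client-supplied threshold. The class `AdaptiveFilter`, its methods, and the exponentially weighted average are assumptions chosen for the example, not a design taken from this page.

```python
from typing import Optional


class AdaptiveFilter:
    """Forward only samples that differ noticeably from the recent average."""

    def __init__(self, threshold: float, alpha: float = 0.5) -> None:
        self.threshold = threshold  # set by the client
        self.alpha = alpha          # weight given to the newest sample
        self.mean: Optional[float] = None

    def set_threshold(self, threshold: float) -> None:
        """Client-driven adaptation: tighten or loosen the filter."""
        self.threshold = threshold

    def offer(self, value: float) -> bool:
        """Return True if the sample should be forwarded toward the client."""
        if self.mean is None:
            self.mean = value
            return True  # always forward the first sample
        forward = abs(value - self.mean) > self.threshold
        # Data-driven adaptation: the baseline follows the observed values.
        self.mean = self.alpha * value + (1 - self.alpha) * self.mean
        return forward
```

With a threshold of 5.0, the sequence 10, 11, 30, 20 forwards the first and third samples only; a user interaction that lowers the threshold through `set_threshold` makes the same filter pass smaller changes.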

Target Computer Systems

The target systems are implementations of a variety of scalable computational architectures. The tools need to be usable in a consistent way across a range going from laptop and desktop systems being used for code development up to peta- and exa-scale capability systems.

The first target application domain will be large scale computational science and engineering. Clusters used in other domains (e.g. graphical rendering) and server farms are also potential target areas. A constraint is that an instantiation of a tool works only within a single administrative domain.

Unresolved Questions

  1. What does the description of a parallel job look like and where does the data for it come from?
  2. How and when does the programming model identity (for example the rank under MPI) get allocated and later associated with a specific application component (task, process, thread)?
  3. Further, assuming that the debugger will want to present and exploit information about the underlying programming model of the parallel job, what are the requirements to map the STCI identity to the programming model identity?
  4. What constraints or dependencies, if any, does the enumeration of the programming model identities have on the sequencing of the STCI construction?

Contributors

  • Rob Fowler
  • Steve Cooper