Use Case Template

Title

Brief and descriptive, such as "Interactive Interrogation of Running Job"

High Level Description

Short description of the use scenario, such as "User starts up Interactive Tool, attaches to a running job, goes through several iterations of data collection and job steering, then disconnects from the job."

Environment Assumptions

Assumptions about the environment exterior to the tool such as:

  • Operating system assumptions
  • Security and job control assumptions
  • Network communications assumptions
  • File system assumptions
  • There exists a clear definition of what constitutes a job
  • Can identify a running job
  • Can get a "handle" on the job, which the tool can use for attaching to the job
  • Can attach to the running job
  • Can interact with the job
  • Can detach from the job without terminating it
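
The job-related assumptions above (identify, get a handle, attach, interact, detach without terminating) can be written down as a small in-memory model. This is only a sketch to make the sequence concrete: `Job`, `JobRegistry`, `attach`, `interact`, and `detach` are hypothetical names invented for illustration and do not correspond to any STCI or resource manager API.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for a running job (not a real scheduler object)."""
    job_id: str
    running: bool = True
    attached_tools: set = field(default_factory=set)


class JobRegistry:
    """Hypothetical registry modelling the environment outside the tool."""

    def __init__(self):
        self._jobs = {}

    def submit(self, job_id):
        self._jobs[job_id] = Job(job_id)

    def find_running(self):
        # Assumption: a running job can be identified.
        return [job.job_id for job in self._jobs.values() if job.running]

    def handle(self, job_id):
        # Assumption: the tool can obtain a handle on the job.
        return self._jobs[job_id]


def attach(job, tool_name):
    # Assumption: the tool can attach to the running job.
    job.attached_tools.add(tool_name)


def interact(job, tool_name, request):
    # Assumption: only an attached tool can interact with the job.
    if tool_name not in job.attached_tools:
        raise RuntimeError("tool is not attached to the job")
    return f"{job.job_id}: handled {request}"


def detach(job, tool_name):
    # Assumption: detaching leaves the job running.
    job.attached_tools.discard(tool_name)
```

A use case can then be checked against the model by walking the same steps: look up the job, attach, issue one or more requests, detach, and confirm that the job is still marked as running.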

STCI Infrastructure Assumptions

Assumptions about the STCI infrastructure, such as:

  1. STCI will provide scalable agent startup, and if need be front-end startup.
    • Performance expectations:
      • Agents do not need to be distributed:
        • X MB binary, on 10 nodes, in Y seconds.
        • X MB binary, on 1000 nodes, in Y seconds.
        • X MB binary, on 10000 nodes, in Y seconds.
      • Agents need to be distributed:
        • X MB binary, on 10 nodes, in Y seconds.
        • X MB binary, on 1000 nodes, in Y seconds.
        • X MB binary, on 10000 nodes, in Y seconds.
  2. Agents will be staged for execution, if need be.
  3. Agents running on different platform types (Heterogeneous) will be supported
  4. Shared objects will be staged, if need be
  5. The STCI infrastructure will interact with resource managers/schedulers to obtain the credentials required to run on a given platform.
  6. The STCI infrastructure will use native EA startup mechanisms, where required
  7. The STCI infrastructure will be responsible for initializing its communications network
  8. The STCI infrastructure will provide the means for the front end and Agents (EA) to implement their own communications protocol
  9. The STCI communications infrastructure will provide reliable, in-order network data delivery services. This may imply data reliability, and network communications fault-tolerance.
  10. The STCI communications infrastructure will be asynchronous in nature.
  11. The communication network used for Front-end <-> EA, and EA <-> EA is scalable
  12. The communication network used for Front-end <-> EA, and EA <-> EA has redundancy built in, where possible
  13. The communication network infrastructure will provide the means to define sub-groups of EAs and the front-end. These groups may be used in "collective" communications
  14. The network communications infrastructure will provide the following asynchronous "collective" communications operations:
    • synchronization
    • broadcast
    • all-to-all
    • reduction
    • gather/scatter
  15. The network communications infrastructure will support multiple, simultaneous "collective" operations.
  16. The communications infrastructure will provide support for noncontiguous, user-defined data
  17. The infrastructure will manage "network" data conversions in heterogeneous environments.
  18. The infrastructure provides services, but does not set policy. Policy is the responsibility of the front-end and the agents. For example, the infrastructure will monitor the health of the EAs and front-end, but will allow the EAs and front-end to ask for DA termination, some sort of notification, or some other action. However, there should be a default policy, with the ability to override it.
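
The separation between service and policy in the last item can be sketched as follows. The infrastructure detects a failure, and the response comes either from a default policy or from an override supplied by the front-end or an agent. `HealthMonitor`, `Action`, and the method names are hypothetical and chosen only for this illustration; the choice of `NOTIFY` as the default is also an assumption.

```python
from enum import Enum


class Action(Enum):
    TERMINATE = "terminate"
    NOTIFY = "notify"
    IGNORE = "ignore"


class HealthMonitor:
    """Hypothetical monitor: detecting failures is the service,
    choosing the response is the policy."""

    def __init__(self, default_action=Action.NOTIFY):
        # Default policy, used when no override has been registered.
        self._default = default_action
        self._overrides = {}
        self.log = []

    def set_policy(self, component, action):
        # The front-end or an agent overrides the default for one component.
        self._overrides[component] = action

    def report_failure(self, component):
        # Look up the override first, then fall back to the default.
        action = self._overrides.get(component, self._default)
        self.log.append((component, action))
        return action
```

With this shape, a tool that registers nothing still gets defined behaviour, while a debugger front-end could, for instance, request termination for specific agents without changing the monitoring code.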

Specific Use Cases

Such as:

  • User starts a 100,000-process job under the control of a parallel debugger ….
  • User starts a 100,000-process job, and at some later time attaches to 5 of the processes ….
  • User starts up a job with performance monitoring enabled. At some later time the user attaches to the job, turns on data collection, and monitors ….

For each Use Case

  • Detailed Tool (Agent + Frontend) startup description
  • Detailed Communications patterns (including any "filter" expectations from STCI)
  • Requirements from STCI
    • Life cycle
      • Startup
      • Monitoring
      • Response to errors
      • Termination
    • Infrastructure Communications requirements
      • Scalability (need to work well with X number of agents)
      • Communications latencies (need to distribute control data from the front-end to X number of Agents in Y microseconds)
      • Communications bandwidths (need to move X number of bytes from each Agent to the front-end in Y seconds, or need to move X number of bytes from each agent to a single file located at Y, in Z seconds, …)
      • Communications fault tolerance (if connection fails, try to recover - restart or take alternat
      • Is the infrastructure assumed to provide some sort of proxy services, such as I/O forwarding?
  • Dependencies on function from other working groups
  • Unresolved questions
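
When filling in the bandwidth requirement above ("X number of bytes from each Agent to the front-end in Y seconds"), it can help to convert it into the aggregate rate the front-end must sustain. The helper below is a minimal sketch of that arithmetic; the function names are hypothetical, and it assumes all agents send within the same time window.

```python
def required_aggregate_bandwidth(num_agents, bytes_per_agent, seconds):
    """Bytes per second the front-end must absorb if every agent
    delivers bytes_per_agent within the given window."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_agents * bytes_per_agent / seconds


def meets_requirement(measured_bytes_per_second, num_agents,
                      bytes_per_agent, seconds):
    """True if a measured aggregate rate satisfies the stated requirement."""
    needed = required_aggregate_bandwidth(num_agents, bytes_per_agent, seconds)
    return measured_bytes_per_second >= needed
```

As an illustration with made-up numbers, 1000 agents each sending 1 MiB (1,048,576 bytes) within 10 seconds require 104,857,600 bytes per second at the front-end.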

Target Computer Systems

The computer systems this use case will target.

Contributors

A list of the people who have contributed to this use case.
