Distributed Agent Use Case

Title

Distributed Agent (DA) - In Progress

High Level Description

The DA is started up to perform its work. It relies on the STCI infrastructure to:

  1. start it up, which is often initiated by the front-end. This typically assumes many execution threads (EA's) starting up at once.
  2. monitor the status of each of the Agent's execution threads
  3. deliver notification when certain events occur (such as EA failure). An event typically originates from a single EA, while the notification goes to a large number of the EA's.
  4. provide more EA's on request
  5. provide a means of communications between EA's (this need not be communications associated with the ULP, but typically is internal administrative traffic)
  6. provide clean job termination services that leave no unintended traces behind
  7. communicate with the front-end
  8. relay stdin from the front-end to each EA of the DA
  9. relay stdout/stderr from the EA's to the front-end
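Several of the services above (startup, status monitoring, failure notification, extra EA's on request, and stdin/stdout relay) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `ToyInfrastructure`, `EA`, `EAStatus`, and every method name are hypothetical and are not part of any STCI API; real EA's would be remote processes, not Python objects.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class EAStatus(Enum):
    RUNNING = "running"
    FAILED = "failed"


@dataclass
class EA:
    """Hypothetical stand-in for one execution thread of the DA."""
    ea_id: int
    status: EAStatus = EAStatus.RUNNING
    stdin_lines: List[str] = field(default_factory=list)
    events_seen: List[str] = field(default_factory=list)


class ToyInfrastructure:
    """In-memory model of services 1-4, 8 and 9 (assumed names throughout)."""

    def __init__(self) -> None:
        self.eas: Dict[int, EA] = {}
        self.frontend_output: List[str] = []

    def start(self, count: int) -> List[int]:
        # Services 1 and 4: start many EA's at once, or add more on request.
        first = len(self.eas)
        new_ids = list(range(first, first + count))
        for ea_id in new_ids:
            self.eas[ea_id] = EA(ea_id)
        return new_ids

    def status(self) -> Dict[int, EAStatus]:
        # Service 2: report the status of every EA.
        return {ea_id: ea.status for ea_id, ea in self.eas.items()}

    def report_failure(self, failed_id: int) -> int:
        # Service 3: a single EA fails; every surviving EA is notified.
        self.eas[failed_id].status = EAStatus.FAILED
        notified = 0
        for ea in self.eas.values():
            if ea.status is EAStatus.RUNNING:
                ea.events_seen.append(f"EA {failed_id} failed")
                notified += 1
        return notified

    def relay_stdin(self, line: str) -> None:
        # Service 8: front-end stdin is copied to each running EA.
        for ea in self.eas.values():
            if ea.status is EAStatus.RUNNING:
                ea.stdin_lines.append(line)

    def relay_output(self, ea_id: int, line: str) -> None:
        # Service 9: EA output is forwarded to the front-end, tagged by EA.
        self.frontend_output.append(f"[{ea_id}] {line}")
```

The model makes the fan-out pattern of item 3 visible: `report_failure` is triggered by one EA but touches every EA that is still running, which is where scalability pressure would be expected in a real implementation.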

Environment Assumptions

  1. The front-end may run in a different authentication domain than the EA's
  2. 3rd party schedulers and resource managers may manage the compute resources, and process startup on these resources
  3. Operating systems:
    • Full O/S per node: Linux flavors, Windows flavors (?)
      • Primarily Beowulf-style clusters
      • Compute nodes run the same operating system as the front-end, or a similar one
    • Full O/S per node + specialized hardware
      • The nodes run a full operating system, but also contain specialized computational hardware that requires additional steps to initiate computation
      • e.g. RoadRunner
    • Lightweight O/S
      • Compute nodes run a lightweight operating system that does not provide full O/S services
      • Front-end usually runs a full kernel
      • e.g. BG/P, Cray XT
    • Shared memory MP
      • Single operating system image for the whole machine
      • e.g. Cray X1E, SGI Altix
  4. Network services:
    • Some form of network service is available, but not necessarily TCP/IP
  5. File system services:
    • File systems may be:
      • global
      • local to a compute node
        • same layout across nodes
        • different layout across nodes
  6. User name/password may not be the same across all available resources
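One way to read the assumptions above is as a description of the target environment that a launcher inspects before starting EA's. The following is a minimal sketch of that idea; the type names, field names, and the `startup_concerns` helper are all hypothetical and only restate the list above in data form.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class NodeOS(Enum):
    FULL = "full O/S per node"
    FULL_WITH_SPECIAL_HW = "full O/S per node + specialized hardware"
    LIGHTWEIGHT = "lightweight O/S"
    SHARED_MEMORY = "shared memory MP"


class FileSystem(Enum):
    GLOBAL = "global"
    LOCAL_SAME_LAYOUT = "local, same layout across nodes"
    LOCAL_DIFFERENT_LAYOUT = "local, different layout across nodes"


@dataclass(frozen=True)
class Environment:
    """Hypothetical record of the environment assumptions for one system."""
    node_os: NodeOS
    file_system: FileSystem
    has_tcp_ip: bool
    external_resource_manager: bool
    same_auth_domain: bool
    same_credentials: bool


def startup_concerns(env: Environment) -> List[str]:
    """Return the assumptions a launcher would have to handle for env."""
    concerns = []
    if not env.same_auth_domain:
        concerns.append("front-end and EA's are in different authentication domains")
    if env.external_resource_manager:
        concerns.append("process startup goes through a 3rd party resource manager")
    if env.node_os is NodeOS.FULL_WITH_SPECIAL_HW:
        concerns.append("specialized hardware needs additional startup steps")
    if env.node_os is NodeOS.LIGHTWEIGHT:
        concerns.append("compute nodes lack full O/S services")
    if not env.has_tcp_ip:
        concerns.append("network transport is not TCP/IP")
    if env.file_system is FileSystem.LOCAL_DIFFERENT_LAYOUT:
        concerns.append("file system layout differs across nodes")
    if not env.same_credentials:
        concerns.append("user name/password differs across resources")
    return concerns
```

For a Beowulf-style cluster with a global file system, TCP/IP, and a single authentication domain, the function returns an empty list; a lightweight-kernel system behind a third-party scheduler yields several entries.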

STCI Infrastructure Assumptions

See template for details.

Specific Use Cases

Interactive run of Distributed Agent

Job life cycle:

  • User logs on to desktop system
  • User launches Distributed Agent on remote system
    • desktop and remote system may be in distinct authentication domains
  • User monitors running job
  • User shuts down desktop system, but remote job continues to run
  • User periodically monitors running job
  • Job terminates at some point, either normally or abnormally, and leaves no unintended traces on the system.
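The interactive life cycle above can be sketched as a small state model in which the job's state is independent of whether a front-end is attached. This is an illustrative sketch only: `InteractiveJob`, `JobState`, the method names, and the example trace names are hypothetical, not taken from STCI.

```python
from enum import Enum
from typing import Set


class JobState(Enum):
    RUNNING = "running"
    TERMINATED_NORMAL = "terminated normally"
    TERMINATED_ABNORMAL = "terminated abnormally"


class InteractiveJob:
    """Toy model: the remote job outlives the desktop front-end session."""

    def __init__(self) -> None:
        # Launching the DA starts the job with the front-end attached.
        self.state = JobState.RUNNING
        self.frontend_attached = True
        # Hypothetical examples of what a job could leave behind.
        self.traces: Set[str] = {"EA processes", "scratch files"}

    def detach_frontend(self) -> None:
        # User shuts down the desktop; the job state is untouched.
        self.frontend_attached = False

    def attach_frontend(self) -> None:
        # User comes back later to monitor the job.
        self.frontend_attached = True

    def monitor(self) -> JobState:
        if not self.frontend_attached:
            raise RuntimeError("attach a front-end before monitoring")
        return self.state

    def terminate(self, normal: bool) -> None:
        # Cleanup runs on both normal and abnormal termination.
        self.state = (
            JobState.TERMINATED_NORMAL if normal else JobState.TERMINATED_ABNORMAL
        )
        self.traces.clear()
```

Keeping `frontend_attached` separate from `state` reflects the requirement that shutting down the desktop system must not affect the remote job, and clearing `traces` in both termination paths reflects the cleanup requirement.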

Batch run of Distributed Agent

Job life cycle:

  • User submits batch Distributed Agent job
  • User may periodically monitor running job
  • Job terminates at some point, either normally or abnormally, and leaves no unintended traces on the system.

Target Computer Systems

Contributors

  • Rich Graham
  • Darius Buntinas