Distributed Agent Use Case

Title

Distributed Agent (DA) - In Progress

The DA is started up to perform its work. It relies on the STCI infrastructure to:

- start it up, which is often initiated by the front-end. This typically means many execution threads (EAs) starting up at once.
- monitor the status of each of the Agent's execution threads
- deliver notifications when certain events occur (such as EA failure). A given event typically originates from a single EA, while the notification goes to a large number of the EAs.
- provide more EAs on request
- provide a means of communication between EAs (this need not be communication associated with the ULP; typically it is internal administrative traffic)
- provide clean job termination services, leaving no unintended traces behind
- communicate with the front-end
  - relay stdin from the front-end to each EA of the DA
  - relay stdout/stderr from the EAs to the front-end
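
To make the notification and I/O relay requirements above concrete, here is a minimal in-memory sketch in Python. It is not the STCI API: the class `ToyInfrastructure`, its methods, and the `"EA_FAILED"` event tag are hypothetical names chosen for illustration only. It models a single EA failure being fanned out to all other EAs, stdin being broadcast to every EA, and EA output being tagged with its origin before it reaches the front-end.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ToyInfrastructure:
    """Hypothetical in-memory stand-in for the services listed above."""
    ea_ids: list
    # Notifications delivered to each EA, keyed by EA id.
    inbox: dict = field(default_factory=lambda: defaultdict(list))
    # stdin lines delivered to each EA, keyed by EA id.
    stdin_seen: dict = field(default_factory=lambda: defaultdict(list))
    # (ea_id, stream, line) tuples as received by the front-end.
    frontend_output: list = field(default_factory=list)

    def report_failure(self, failed_ea):
        """One event from a single EA, delivered to every other EA."""
        for ea in self.ea_ids:
            if ea != failed_ea:
                self.inbox[ea].append(("EA_FAILED", failed_ea))

    def relay_stdin(self, line):
        """Broadcast one line of front-end stdin to each EA of the DA."""
        for ea in self.ea_ids:
            self.stdin_seen[ea].append(line)

    def relay_output(self, ea, stream, line):
        """Forward EA output to the front-end, tagged with its origin."""
        self.frontend_output.append((ea, stream, line))


infra = ToyInfrastructure(ea_ids=[0, 1, 2, 3])
infra.report_failure(2)            # single source, many recipients
infra.relay_stdin("go")
infra.relay_output(1, "stderr", "warning: slow link")
```

A real implementation would replace the dictionaries with whatever transport the platform offers, since the environment does not guarantee TCP/IP.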

The front-end may run in a different authentication domain than the EAs
Third-party schedulers and resource managers may manage the compute resources and process startup on those resources
Operating systems:
- Full O/S per node: Linux flavors, Windows flavors (?)
  - Primarily Beowulf-style clusters
  - Compute nodes run same or similar operating system to the frontend
- Full O/S per node + specialized hardware
  - The nodes run a full operating system, but also contain specialized computational hardware that requires additional steps to initiate computation
  - e.g. RoadRunner
- Lightweight O/S
  - Compute nodes run a lightweight operating system that does not provide full O/S services
  - Frontend usually runs full kernel
  - e.g. BG/P, Cray XT
- Shared memory MP
  - Single operating system image for the whole machine
  - e.g. Cray X1E, SGI Altix
Network services:
- Some form of network service is available, but not necessarily TCP/IP
File system services:
- File systems may be:
  - global
  - local to a compute node
    - same layout across nodes
    - different layout across nodes
User name/password may not be the same across all available resources
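
The environment assumptions above could be captured in a small descriptor that tools consult before launching EAs. The following Python sketch is only an illustration: `Environment`, `FileSystemLayout`, and both helper methods are hypothetical names, and the rule that a path is portable unless node-local layouts differ is an assumption, not something this use case specifies.

```python
from dataclasses import dataclass
from enum import Enum


class FileSystemLayout(Enum):
    GLOBAL = "global"
    LOCAL_SAME_LAYOUT = "local, same layout across nodes"
    LOCAL_DIFFERENT_LAYOUT = "local, different layout across nodes"


@dataclass(frozen=True)
class Environment:
    """Hypothetical record of the assumptions listed above."""
    frontend_auth_domain: str
    ea_auth_domain: str
    has_tcp_ip: bool                  # network exists, but maybe not TCP/IP
    fs_layout: FileSystemLayout
    external_resource_manager: bool   # third-party scheduler starts processes

    def crosses_auth_domains(self) -> bool:
        """True when front-end and EAs live in different authentication domains."""
        return self.frontend_auth_domain != self.ea_auth_domain

    def same_path_layout_everywhere(self) -> bool:
        """Assumption: paths are portable unless node-local layouts differ."""
        return self.fs_layout is not FileSystemLayout.LOCAL_DIFFERENT_LAYOUT


env = Environment(
    frontend_auth_domain="desktop.example",
    ea_auth_domain="cluster.example",
    has_tcp_ip=False,
    fs_layout=FileSystemLayout.LOCAL_DIFFERENT_LAYOUT,
    external_resource_manager=True,
)
```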

See template for details.

Job life cycle (interactive launch):

User logs on to desktop system
User launches Distributed Agent on remote system
- desktop and remote system may be in distinct authentication domains
User monitors running job
User shuts down desktop system, but remote job continues to run
User periodically monitors running job
Job terminates at some point, with either normal or abnormal termination, and leaves no unintended traces on the system.

Job life cycle (batch submission):

User submits a batch Distributed Agent job
User may periodically monitor running job
Job terminates at some point, with either normal or abnormal termination, and leaves no unintended traces on the system.
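
The life cycles above share two properties: the job keeps running when the user's front-end goes away, and cleanup happens on both normal and abnormal termination. A minimal Python sketch of that behavior follows. The `Job` class, its state names, and the `leftovers` set are hypothetical and serve only to illustrate the requirement.

```python
from enum import Enum, auto


class JobState(Enum):
    LAUNCHED = auto()
    RUNNING = auto()
    TERMINATED_NORMAL = auto()
    TERMINATED_ABNORMAL = auto()


class Job:
    """Hypothetical model of a Distributed Agent job."""

    def __init__(self):
        self.state = JobState.LAUNCHED
        self.frontend_attached = True
        # Resources created on the system (temp files, processes, ...).
        self.leftovers = set()

    def start(self):
        self.state = JobState.RUNNING

    def create_resource(self, name):
        self.leftovers.add(name)

    def detach_frontend(self):
        # The user shuts down the desktop; the job state is unaffected.
        self.frontend_attached = False

    def attach_frontend(self):
        # The user comes back to monitor the job.
        self.frontend_attached = True

    def terminate(self, normal=True):
        # Cleanup runs on both paths so no unintended traces remain.
        self.leftovers.clear()
        self.state = (JobState.TERMINATED_NORMAL if normal
                      else JobState.TERMINATED_ABNORMAL)


job = Job()
job.start()
job.create_resource("/tmp/da-scratch")
job.detach_frontend()          # job continues to run
job.terminate(normal=False)    # abnormal exit still cleans up
```

In the batch case the front-end would simply never be attached at submission time; the termination path is the same.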