Distributed Agent Use Case
Title
Distributed Agent (DA) - In Progress
High Level Description
The DA is started up to perform its work. It relies on the STCI infrastructure to
- start it up, usually at the request of the front-end. Startup typically involves many execution threads (EA's) starting at once.
- monitor the status of each of the Agent's execution threads
- deliver notification when certain events occur (such as EA failure). Typically an event originates at a single EA, while the notification goes to a large number of the EA's.
- provide more EA's on request
- provide a means of communication between EA's (this need not be communication associated with the ULP; typically it is internal administrative traffic)
- provide clean job termination services, leaving no unintended traces behind
- communicate with the front end
- relay stdin from the front-end to each EA of the DA
- relay stdout/stderr from the EA's to the front-end
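As an illustration of the notification service listed above, here is a minimal sketch in Python of a single EA failure being reported to every other running EA. The class `ToyInfrastructure` and its methods are hypothetical names chosen for this example; they are not part of STCI, and a real implementation would be distributed rather than in-memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyInfrastructure:
    """Hypothetical in-memory stand-in for the infrastructure (not an STCI API)."""
    status: Dict[int, str] = field(default_factory=dict)  # EA id -> "running" / "failed"
    listeners: Dict[int, Callable[[str, int], None]] = field(default_factory=dict)

    def start_eas(self, count: int) -> List[int]:
        # Start `count` EA's at once; the same call can provide more on request.
        first = len(self.status)
        ids = list(range(first, first + count))
        for ea in ids:
            self.status[ea] = "running"
        return ids

    def subscribe(self, ea: int, callback: Callable[[str, int], None]) -> None:
        # Register the callback through which this EA receives notifications.
        self.listeners[ea] = callback

    def report_failure(self, failed: int) -> None:
        # One EA fails; every other running EA that subscribed is notified.
        self.status[failed] = "failed"
        for ea, callback in self.listeners.items():
            if ea != failed and self.status[ea] == "running":
                callback("EA_FAILED", failed)


infra = ToyInfrastructure()
eas = infra.start_eas(4)
seen = {ea: [] for ea in eas}
for ea in eas:
    infra.subscribe(ea, lambda event, src, ea=ea: seen[ea].append((event, src)))
infra.report_failure(2)
```

After the call to `report_failure(2)`, each surviving EA has recorded one `("EA_FAILED", 2)` event and the failed EA has recorded none, which matches the "one source, many recipients" pattern described in the list.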
Environment Assumptions
- The front-end may run in a different authentication domain than the EA's
- Third-party schedulers and resource managers may manage both the compute resources and process startup on these resources
- Operating systems:
- Full O/S per node: Linux flavors, Windows flavors (?)
- Primarily Beowulf-style clusters
- Compute nodes run the same operating system as the front-end, or a similar one
- Full O/S per node + specialized hardware
- The nodes run a full operating system, but also contain specialized computational hardware that requires additional steps to initiate computation
- e.g. RoadRunner
- Lightweight O/S
- Compute nodes run a lightweight operating system that does not provide full O/S services
- Front-end usually runs a full kernel
- e.g. BG/P, Cray XT
- Shared memory MP
- Single operating system image for the whole machine
- e.g. Cray X1E, SGI Altix
- Network services:
- Some sort of network service is available, but it may not be TCP/IP
- File system services:
- File systems may be:
- global
- local to a compute node
- same layout across nodes
- different layout across nodes
- User name/password may not be the same across all available resources
STCI Infrastructure Assumptions
See template for details.
Specific Use Cases
Interactive run of Distributed Agent
Job life cycle:
- User logs on to desktop system
- User launches Distributed Agent on remote system
- desktop and remote system may be in distinct authentication domains
- User monitors running job
- User shuts down desktop system, but remote job continues to run
- User periodically monitors running job
- Job terminates at some point, either normally or abnormally, and leaves no unintended traces on the system.
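The interactive life cycle above can be sketched as a small state model. This is only an illustration under stated assumptions: `ToyJob`, `JobState`, and the `residue` set are hypothetical names for this example and do not come from STCI. The sketch shows two properties from the list: the job keeps running when the front-end goes away, and cleanup happens for both normal and abnormal termination.

```python
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    ENDED_NORMALLY = "ended normally"
    ENDED_ABNORMALLY = "ended abnormally"


class ToyJob:
    """Hypothetical model of a remote DA job (not an STCI API)."""

    def __init__(self):
        self.state = JobState.RUNNING
        self.frontend_attached = True
        # Stand-ins for whatever the job leaves on the system while it runs.
        self.residue = {"scratch files", "EA processes"}

    def detach_frontend(self):
        # The desktop shuts down: only the connection goes away.
        self.frontend_attached = False

    def attach_frontend(self):
        # The user comes back later to monitor the job.
        self.frontend_attached = True

    def terminate(self, normal: bool):
        # Cleanup is identical for normal and abnormal termination.
        self.state = JobState.ENDED_NORMALLY if normal else JobState.ENDED_ABNORMALLY
        self.residue.clear()


job = ToyJob()
job.detach_frontend()
state_while_detached = job.state  # job state is unaffected by the detach
job.attach_frontend()
job.terminate(normal=False)
```

Keeping the front-end connection as a separate flag, rather than as a job state, is one way to express that the user's desktop session and the remote job have independent lifetimes.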
Batch run of Distributed Agent
Job life cycle:
- User submits batch Distributed Agent job
- User may periodically monitor running job
- Job terminates at some point, either normally or abnormally, and leaves no unintended traces on the system.
Target Computer Systems
Contributors
- Rich Graham
- Darius Buntinas





