Use Cases From Lifecycle Perspective

What is the STCI lifecycle?

STCI Deployment Lifecycle

  • Install (deploy)
  • Configure (setup)
  • Standby
  • Enable (start)
  • Status
    • available resources (nodes, …)
    • active sessions
  • Disable (stop)
  • Uninstall (remove)
STCI-Deployment-Lifecycle-2.png

The STCI is a meta-tool whose purpose is to facilitate the deployment of, and utilization by, other tools.

If we think of the STCI as a tool, then the most effective way to deploy it would be to use the STCI itself; however, you can't use the STCI until it is deployed. This is a classic chicken-and-egg problem.

The first state in the Session Lifecycle diagram says 'Stage agent, plug-in, etc..'. Doesn't this assume a pre-existing communication infrastructure? We need to specify what this pre-existing infrastructure looks like, when it is deployed and how tool specific utilization takes place.

The STCI needs to have the following characteristics to fulfill its role as the tool to deploy and configure the external tools.

  • One instance sharable by all tools
  • Facilitates the scalable deployment of other tools
    • A 'dedicated' stream that can be utilized in configuring tool specific infrastructure and components.
  • Scalable start-up
    • Hardcoded interconnect (topology) established during STCI install/setup/configure.
    • Interconnects all nodes within the administration domain.
  • Supports infrastructure staging
    • Tool specific topology setup.
    • Tool specific components (agents, plug-ins, etc.).
  • Supports data staging.
    • Includes libraries and executables.
    • On-demand
    • Caching
  • Interconnect that is intrinsic to the STCI tool
    • Includes STCI plug-ins and agents to support efficient external tool configuration.
    • Looks and operates a lot like a normal external tool stream/session.
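
The "on-demand" and "caching" items under data staging can be illustrated with a minimal sketch. It assumes each node keeps a local cache and pulls a library or executable from its parent only on first use. The names `StagingCache` and `fake_parent_fetch` are hypothetical and are not part of any STCI API.

```python
from typing import Callable, Dict


class StagingCache:
    """Hypothetical per-node cache for staged libraries and executables."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch              # assumed: pulls a file from the parent node
        self._cache: Dict[str, bytes] = {}
        self.fetch_count = 0             # how many times the parent was contacted

    def get(self, name: str) -> bytes:
        # On-demand: fetch only when a tool first asks for the file.
        if name not in self._cache:
            self._cache[name] = self._fetch(name)
            self.fetch_count += 1
        # Caching: later requests are served locally.
        return self._cache[name]


def fake_parent_fetch(name: str) -> bytes:
    """Stand-in for a transfer over the STCI interconnect."""
    return f"contents of {name}".encode()


cache = StagingCache(fake_parent_fetch)
first = cache.get("libtool_agent.so")
second = cache.get("libtool_agent.so")   # served from the cache, no second fetch
```

With this arrangement a file that many tools share crosses each link at most once per node, which is one way the "one instance sharable by all tools" goal could pay off.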

How does this relate to the STCI Deployment Lifecycle state diagram?

  • Install
    • Load all the STCI software onto the HPC controller node.
    • Run the configuration scripts.
  • Configure
    • Enumerate the nodes inside the administration domain to be included in the STCI.
    • Establish a topology to be used to interconnect said nodes.
    • Identify the job scheduler.
    • Establish the session manager
    • Configure the nodes
      • Distribute the STCI software including STCI plug-ins and STCI agents.
      • Distribute the configuration instructions. (who connects to whom)
    • Enter standby state.
  • Standby
    • Minimal footprint
    • Listen for STCI start request from external tool
    • Enable the session manager
    • Build STCI control channel
    • Enter enabled state.
  • Enabled
    • Respond to STCI API requests
      • Distribute tool agents, plugins, etc.
      • Build tool streams
      • Support grouping
      • etc…
  • Disabled
    • Disable the session manager
    • Tear down the STCI tool interconnect.
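
The state list above can be read as a small state machine. The following is a sketch only: the transition table is an assumption inferred from the bullets (for example, that a disabled STCI may return to standby or be uninstalled), and the state diagram may permit other edges. The names `State`, `TRANSITIONS` and `Deployment` are illustrative.

```python
from enum import Enum


class State(Enum):
    INSTALL = "install"
    CONFIGURE = "configure"
    STANDBY = "standby"
    ENABLED = "enabled"
    DISABLED = "disabled"
    UNINSTALLED = "uninstalled"


# Assumed transition table; not taken from the diagram itself.
TRANSITIONS = {
    State.INSTALL: {State.CONFIGURE},
    State.CONFIGURE: {State.STANDBY},
    State.STANDBY: {State.ENABLED, State.UNINSTALLED},
    State.ENABLED: {State.DISABLED},
    State.DISABLED: {State.STANDBY, State.UNINSTALLED},
    State.UNINSTALLED: set(),
}


class Deployment:
    """Tracks the deployment state and rejects transitions not in the table."""

    def __init__(self) -> None:
        self.state = State.INSTALL
        self.history = [self.state]

    def advance(self, target: State) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {target.name}"
            )
        self.state = target
        self.history.append(target)


d = Deployment()
for s in (State.CONFIGURE, State.STANDBY, State.ENABLED, State.DISABLED):
    d.advance(s)
```

Trying to jump from install straight to enabled raises `ValueError`, which matches the requirement that the control channel is built only from the standby state.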

Lifecycle w.r.t. utilization by a tool

  • Connect to STCI
  • Create an STCI session
    • Interact with resource managers/schedulers to obtain the credentials required to run on a given platform. (Session Management)
  • Configure the STCI session
    • Set policy (error recovery, error reporting, event response, etc…) options. What should the STCI do when something goes wrong?
    • Accept session configuration commands from tool FE, tool agents and tool plug-ins.
  • Build interconnect based upon supplied topology and job information. (S)
    • Responsible for initializing its communications network
    • network used for Front-end <-> EA, and EA <-> EA is scalable
    • network used for Front-end <-> EA, and EA <-> EA has redundancy built in, where possible
    • manage "network" data conversions in heterogeneous environments. (Data Exchange)
  • Provide enumeration of topology and job elements. (S)
  • Attach tool agents (S)
    • provide scalable agent startup
  • Attach tool plug-ins (S)
    • provide scalable plug-in startup
  • Support group subset formation and utilization (S) (Lifecycle/Data Exchange)
  • Support tool specific communication (FE <-> Agents, FE <-> plugins) (S, R) (Data Exchange)
    • means for the front end and Agents (EA) to implement their own communications protocol
    • support multiple, simultaneous "collective" operations
    • provide support for noncontiguous, user-defined data
  • Support asynchronous "collective" communications operations: synchronization, broadcast, all-to-all, reduction, gather/scatter/allgather (S) (Data Exchange)
  • Implement response policy w.r.t. unexpected agent (plug-in) termination.
  • Detach tool agents (S)
  • Detach tool plug-ins (S)
  • Dismantle interconnect (S)
  • Close STCI session
  • Disconnect from STCI
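
The ordering of the steps above, and the need to unwind cleanly when a step fails, can be sketched as follows. This is an illustration under assumptions: the step names are shortened from the list, `run_session` is a hypothetical helper, and the real work of each step is replaced by an entry in a trace list.

```python
from typing import Callable, List, Optional

SETUP_ORDER = [
    "connect", "create session", "configure session",
    "build interconnect", "attach agents", "attach plug-ins",
]
# Teardown steps in the order the list above gives them,
# each paired with the setup step it undoes.
TEARDOWN_ORDER = [
    ("attach agents", "detach agents"),
    ("attach plug-ins", "detach plug-ins"),
    ("build interconnect", "dismantle interconnect"),
    ("create session", "close session"),
    ("connect", "disconnect"),
]


def run_session(tool_body: Callable[[], None], trace: List[str],
                fail_at: Optional[str] = None) -> None:
    """Run the setup steps, the tool's own work, then the teardown steps."""
    completed = set()
    try:
        for step in SETUP_ORDER:
            if step == fail_at:          # simulate a failure for illustration
                raise RuntimeError(f"{step} failed")
            trace.append(step)
            completed.add(step)
        tool_body()
    finally:
        # Undo only what was actually set up, even after an error.
        for setup, teardown in TEARDOWN_ORDER:
            if setup in completed:
                trace.append(teardown)


trace: List[str] = []
run_session(lambda: trace.append("tool-specific communication"), trace)
```

If agent attachment fails, only the interconnect, session and connection are torn down, since no agents or plug-ins were ever attached.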

First cut of the session lifecycle diagram.
session-lifecycle-20070919-1.png

Slide2.png
Key to above diagram
Stream specification
A stream specification describes 1) which agents should be deployed on which nodes, 2) a topology, 3) which plugins should be deployed and 4) where they should be deployed in the topology.
Error policy
An error policy defines how the infrastructure must react to errors, e.g., through error recovery or error reporting.
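
The four parts of a stream specification map naturally onto a small data structure. The sketch below is one possible shape, not a defined STCI format: the field names, the edge-list representation of the topology, and the `validate` check are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StreamSpec:
    # 1) which agent is deployed on which node (node name -> agent name)
    agents: Dict[str, str]
    # 2) the topology, here as (parent, child) edges
    topology: List[Tuple[str, str]]
    # 3) and 4) which plug-ins are deployed, and where in the topology
    plugins: Dict[str, List[str]] = field(default_factory=dict)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the spec is consistent."""
        nodes = {name for edge in self.topology for name in edge}
        problems = [f"agent node {n!r} is not in the topology"
                    for n in self.agents if n not in nodes]
        problems += [f"plug-in node {n!r} is not in the topology"
                     for n in self.plugins if n not in nodes]
        return problems


spec = StreamSpec(
    agents={"n0": "debug_agent", "n1": "debug_agent"},
    topology=[("fe", "p0"), ("p0", "n0"), ("p0", "n1")],
    plugins={"p0": ["filter"]},
)
problems = spec.validate()
```

An error policy could be carried alongside such a specification in the same way, for instance as a mapping from error kinds to actions.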

Distributed Agent Requirements

From the Assumptions about STCI infrastructure

  • STCI will provide scalable agent startup, and if need be front-end startup. (Lifecycle)
  • Agents will be staged for execution, if need be. (Data Exchange/????)
  • Agents running on different platform types (Heterogeneous) will be supported (Data Exchange/Lifecycle/Session management)
  • Shared objects will be staged, if need be (Data Exchange/Lifecycle/Session management)
  • Interact with resource managers/schedulers to obtain the credentials required to run on a given platform. (Session Management)
  • Use native EA startup mechanisms, where required (Lifecycle)
  • Responsible for initializing its communications network (Lifecycle/Data Exchange)
  • Provide the means for the front end and Agents (EA) to implement their own communications protocol (Data Exchange)
  • Provide reliable, in-order network data delivery services. This may imply data reliability, and network communications fault-tolerance. (Data Exchange)
  • The STCI communications infrastructure will be asynchronous in nature. (Data Exchange)
  • The communication network used for Front-end <-> EA, and EA <-> EA is scalable (Data Exchange/Lifecycle)
  • The communication network used for Front-end <-> EA, and EA <-> EA has redundancy built in, where possible (Data Exchange/Lifecycle)
  • The communication network infrastructure will provide the means to define sub-groups of EAs and the front-end. These groups may be used in "collective" communications (Lifecycle/Data Exchange/Session?)
  • The network communications infrastructure will provide the following asynchronous "collective" communications operations: (Data Exchange)
    • synchronization
    • broadcast
    • all-to-all
    • reduction
    • gather/scatter/allgather
  • The network communications infrastructure will support multiple, simultaneous "collective" operations. (Data Exchange)
  • The communications infrastructure will provide support for noncontiguous, user-defined data (Data Exchange)
  • The infrastructure will manage "network" data conversions in heterogeneous environments. (Data Exchange)
  • The infrastructure provides services, but does not set policy; that is the responsibility of the front-end and the agents. For example, the infrastructure will monitor the health of the EAs and front-end, but will allow the EAs and front-end to ask for DA termination, some sort of notification, or some other action. However, there should be a default policy, with the ability to override it. (Lifecycle/Session management)
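
The last requirement, a default policy that the front-end or agents may override, can be sketched as a handler table. Everything here is hypothetical: the `HealthMonitor` class, the `agent_exit` event name and the handler signature are assumptions chosen to show the mechanism, not an STCI interface.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def default_on_agent_exit(agent: str) -> str:
    """Assumed default policy: report the event and take no further action."""
    return f"notify front-end: {agent} exited"


class HealthMonitor:
    """The infrastructure detects events; the installed handler decides the response."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {"agent_exit": default_on_agent_exit}

    def set_policy(self, event: str, handler: Handler) -> None:
        # Called by the front-end or an agent to override the default.
        self._handlers[event] = handler

    def report(self, event: str, agent: str) -> str:
        return self._handlers[event](agent)


monitor = HealthMonitor()
before = monitor.report("agent_exit", "ea3")
monitor.set_policy("agent_exit", lambda agent: f"terminate session: {agent} exited")
after = monitor.report("agent_exit", "ea3")
```

The monitor itself never decides what an unexpected exit means; it only routes the event to whichever policy is currently installed.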

Data Staging Requirements

From the Assumptions about STCI infrastructure

  • STCI infrastructure is up and running (Lifecycle/Data Exchange)
  • High-bandwidth data transfer (Data Exchange)
  • Multiple data transfers can be performed asynchronously (Data Exchange)
  • Performance requirements (Data Exchange)
    • X MB file can be staged on Y nodes in T seconds
    • X MBps data transfer rate between "stages", i.e., front-end to plugin, plugin to plugin and plugin to agent
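
To make the "X MB on Y nodes in T seconds" requirement concrete, the sketch below estimates T under a deliberately simple model that is an assumption, not a measured result: the file is forwarded down a tree with a fixed fan-out, each parent sends to its children one after another, and a level must finish before the next starts. The function names and the example numbers are illustrative only.

```python
def tree_depth(nodes: int, fanout: int) -> int:
    """Levels needed below the root for a fanout-ary tree to reach `nodes` nodes."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    depth, reached, level_width = 0, 0, 1
    while reached < nodes:
        level_width *= fanout
        reached += level_width
        depth += 1
    return depth


def staging_seconds(size_mb: float, nodes: int, fanout: int,
                    link_mbps: float) -> float:
    """Estimated time to stage one file on all nodes under the model above."""
    per_level = fanout * size_mb / link_mbps   # a parent sends to children serially
    return tree_depth(nodes, fanout) * per_level


# Example figures (assumed): 100 MB file, 1024 nodes, 100 MBps links.
tree_time = staging_seconds(100, 1024, fanout=4, link_mbps=100)
flat_time = staging_seconds(100, 1024, fanout=1024, link_mbps=100)
```

Under this model the fan-out of 4 needs 5 levels and about 20 seconds, while a single sender feeding all 1024 nodes directly needs about 1024 seconds, which is the kind of gap the scalability requirement is meant to close.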

Batch Instrumentation Requirements

From the Assumptions about STCI infrastructure

  • The infrastructure is assumed to be up and running before the analysis session begins. (Lifecycle/Data Exchange)
  • The infrastructure provides an API for agent code linked with an application to determine if it has successfully connected to the infrastructure. (Session Management/Data Exchange)
  • The infrastructure provides a means for instrumentation code to make an initial connection to the infrastructure to query the infrastructure for the proper infrastructure node to actually connect to. (Data Exchange/Session Management)
  • The infrastructure allows plugins to be loaded and unloaded at any time. (Data Exchange/Session Management)
  • The infrastructure provides a way to control the order of loading or unloading for plugins. (Data Exchange)
  • The infrastructure provides a way to enforce dependencies between plugins, where a plugin may require another plugin to be present, or a load request may be rejected if a specific plugin requires that another plugin not be present. (Data Exchange)
  • If there are endian or other data representation issues due to heterogeneous architecture, the infrastructure must deal with those issues transparently. (Data Exchange)
  • The security component may need to refresh user credentials for the infrastructure from the node where the analysis tool was invoked, since in some security implementations, credentials have a lifetime. (Session Management)
  • Plugins are assumed to be staged to the infrastructure nodes so they are present on that node when there is a request to load them. (Session Management)
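
The plugin dependency requirements above (a plugin may require another to be present, or may require that another not be present) can be sketched as a small registry that accepts or rejects load and unload requests. The `PluginRegistry` class and the plugin names are hypothetical. Checking conflicts in both directions is an assumption, since the requirement does not say which side declares the conflict.

```python
from typing import Dict, List, Set


class PluginRegistry:
    """Tracks loaded plugins on one infrastructure node and enforces dependencies."""

    def __init__(self, requires: Dict[str, Set[str]],
                 conflicts: Dict[str, Set[str]]) -> None:
        self.requires = requires      # plugin -> plugins that must already be loaded
        self.conflicts = conflicts    # plugin -> plugins that must not be loaded
        self.loaded: List[str] = []   # kept in load order

    def load(self, name: str) -> None:
        present = set(self.loaded)
        missing = self.requires.get(name, set()) - present
        if missing:
            raise RuntimeError(f"{name} requires {sorted(missing)}")
        clashes = self.conflicts.get(name, set()) & present
        clashes |= {p for p in present if name in self.conflicts.get(p, set())}
        if clashes:
            raise RuntimeError(f"{name} conflicts with {sorted(clashes)}")
        self.loaded.append(name)

    def unload(self, name: str) -> None:
        dependents = [p for p in self.loaded
                      if name in self.requires.get(p, set())]
        if dependents:
            raise RuntimeError(f"{name} is still required by {dependents}")
        self.loaded.remove(name)


registry = PluginRegistry(
    requires={"filter": {"transport"}},
    conflicts={"trace": {"filter"}},
)
registry.load("transport")
registry.load("filter")
```

Loading `trace` at this point is rejected because it conflicts with `filter`, and unloading `transport` is rejected while `filter` still depends on it.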

Interactive Tool Requirements

From the Assumptions about STCI infrastructure

  • The STCI infrastructure matches the architecture diagram started at Hawthorne. (all)
  • Requires the job node set as input. A job node is a network-addressable component having a unique copy of the operating system; that is, multiple job nodes can share the same hardware. (Lifecycle)
  • Requires the topology node set as input. The topology node set is the set of nodes used to construct the STCI. Topology node set >= job node set. (Lifecycle/Data Exchange)
  • Establishes the interconnect between the front end and the job node set via the topology node set (including fan-out). (Data Exchange)
  • Provides an enumeration of the job node set to allow the tool to specify and configure the agent on a per job node basis. (Lifecycle)
  • Provides an enumeration of the plug-in node set including its location (level) in the hierarchy. (Lifecycle)
  • Initializes with a default plug-in that acts as a pass through. All downstream data is duplicated on the fan-out. (Data Exchange/Lifecycle)
  • Supports the connection of specified tool agent(s) to the infrastructure on a job node basis. (Lifecycle/Session Management)
  • Supports the configuration of tool plugins (load/unload/ordering) on a plug-in node basis. (Lifecycle/Data Exchange)
  • The launch or attach (and subsequent detach) specific commands for the tool agent are passed as payload via the infrastructure. (Data Exchange)
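
The default pass-through plug-in and the fan-out from the front end to the job node set can be sketched as a walk over a tree. The tree layout, the node names and the `broadcast` helper below are assumptions made for illustration. Only the behaviour "all downstream data is duplicated on the fan-out" comes from the requirements above.

```python
from typing import Dict, List


def pass_through(message: bytes, children: List[str]) -> Dict[str, bytes]:
    """Default plug-in behaviour: duplicate downstream data to every child."""
    return {child: message for child in children}


def broadcast(tree: Dict[str, List[str]], root: str,
              message: bytes) -> Dict[str, bytes]:
    """Push a message down from the root; return what each leaf node receives."""
    delivered: Dict[str, bytes] = {}
    pending = [(root, message)]
    while pending:
        node, data = pending.pop()
        children = tree.get(node, [])
        if not children:
            delivered[node] = data       # a leaf, i.e. a job node in this sketch
            continue
        pending.extend(pass_through(data, children).items())
    return delivered


# Hypothetical topology: front end "fe", plug-in nodes "p0"/"p1", job nodes "n0".."n3".
tree = {"fe": ["p0", "p1"], "p0": ["n0", "n1"], "p1": ["n2", "n3"]}
delivered = broadcast(tree, "fe", b"attach 1234")
```

Here the launch or attach command travels as opaque payload, and a tool-supplied plug-in could replace `pass_through` to filter or route instead of duplicating.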

Legend:

  • S - scalable
  • R - reliable, in-order network data delivery services. This may imply data reliability, and network communications fault-tolerance.