Bootstrapping

Bootstrapping is the process of establishing the infrastructure and components necessary for a tool to operate. The infrastructure comprises a number of different types of agents. Agents are execution threads that perform certain activities. We specifically don't refer to agents as "processes" or "daemons", since those terms imply a particular type of implementation and may impose restrictions on the types of architectures that can be supported.

There are three types of agents:

  • Service agents - these provide services that relate specifically to the infrastructure itself. This includes component management, security services, event management, agent startup/cleanup/monitoring, administrative activities, etc.
  • Session agents - these provide services that relate to the creation and operation of sessions. This includes loading/unloading junctions, managing junction communication, forwarding network traffic, file staging, I/O forwarding, new process creation, etc.
  • Tool agents - these agents are supplied separately from the infrastructure as part of a tool. The infrastructure has no control over the activities of a tool agent, other than creation and termination.

Service agents can run in one of two modes: root-level or user-level. Session and tool agents can only ever run as user-level agents, although special types of tool agents may support elevated privilege levels.

When a tool is first started, the infrastructure may or may not be in operation on the machine. However, the assumption is that the infrastructure is installed on the machine. The process of creating a tool session causes the necessary infrastructure to be created so that the tool can operate. This bootstrapping of the infrastructure is transparent to the operation of the tool.

The following diagram shows a high-level view of this bootstrapping operation:

tool_startup.png

Starting Service Agents

Service agents can be started either at boot time (root-level) or when a tool session is created (user-level).

Assumptions:

  • Root-level agents do not require resource allocation (e.g., they are started at boot time)
  • Agent configuration topology is specified.
  • Agent bootstrap/startup topology is specified. This topology describes the scalable roll-out of the startup.

Fault Tolerance Considerations

Since the service agents are responsible for starting jobs on the system, it is important that they be configured so that the failure of a single agent (or of multiple agents) does not prevent continued use of the components of the system that still work. This implies that we need to:

  • keep track of the status of agents
  • have redundant communication routes
  • have multiple agents that can accept commands to be executed on the system.
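These requirements can be illustrated with a minimal heartbeat-based status table (a hypothetical Python sketch; the timeout value, agent identifiers, and class name are assumptions, not part of the infrastructure):

```python
import time

# Hypothetical sketch: track agent liveness from heartbeat timestamps.
# The 30-second timeout is an illustrative assumption.
HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before an agent is presumed failed

class AgentStatusTable:
    def __init__(self):
        self.last_seen = {}  # agent id -> timestamp of last heartbeat

    def heartbeat(self, agent_id, now=None):
        # Record that we heard from this agent.
        self.last_seen[agent_id] = time.time() if now is None else now

    def live_agents(self, now=None):
        now = time.time() if now is None else now
        return [a for a, t in self.last_seen.items() if now - t <= HEARTBEAT_TIMEOUT]

    def failed_agents(self, now=None):
        now = time.time() if now is None else now
        return [a for a, t in self.last_seen.items() if now - t > HEARTBEAT_TIMEOUT]
```

A monitor holding such a table could route commands around failed agents, which is why redundant routes and multiple command-accepting agents are listed above.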

Root-level Startup

Starting root-level service agents happens at system boot time or under the control of a privileged user. The exact mechanism for starting the agents is system specific, but must ultimately result in the instantiation of one or more agents. If more than one agent is started, then the agents must self-organize.

On distributed-memory architectures there are likely to be two main scenarios:

  • The service agents are started as part of the system boot process for each node
  • The service agents are started as part of the system boot process for the head node

User-level Startup

Startup steps:

  1. Check if root-level agents are already running
  2. If yes, no further action required
  3. If no, check to see if agents are already installed at the targets
  4. If not installed, stage files (e.g., agent binary, and configuration files)
  5. If files installed, but a force re-install is specified, stage files
  6. If necessary, request resource allocation for service agents
  7. Launch agents
  8. Setup service stream using configuration topology
stage_root_daemons.png
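The decision points in the steps above can be sketched as a simple function (illustrative only; the action names and flags are hypothetical):

```python
# Hypothetical sketch of the user-level startup decision flow (steps 1-8 above).
def user_level_startup(root_agents_running, agents_installed,
                       force_reinstall, needs_allocation):
    """Return the ordered list of actions the bootstrap would perform."""
    if root_agents_running:
        return []                              # steps 1-2: nothing to do
    actions = []
    if not agents_installed or force_reinstall:
        actions.append("stage files")          # steps 3-5
    if needs_allocation:
        actions.append("request allocation")   # step 6
    actions.append("launch agents")            # step 7
    actions.append("setup service stream")     # step 8
    return actions
```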

Startup Examples

Specific example of Root Daemon startup using ssh as the startup mechanism, without a shared file system:

  • Configuration file is supplied describing where Daemons will run, and what files need to be staged.
  • Bootstrap/startup topology is specified as binary-tree
  • Daemon configuration is mapped onto a binary-tree topology
  • Startup and stage Daemon binaries:

for C in "${CHILDREN[@]}"; do
    scp stci_config_file "$C:/stci_install_dir/etc"
    if ! ssh "$C" /stci_install_dir/bin/stci_daemon --bootstrap --version "$VERSION"; then
        # e.g., executable not found or incompatible version number
        scp stci_daemon "$C:/stci_install_dir/bin"
        ssh "$C" /stci_install_dir/bin/stci_daemon --bootstrap --version "$VERSION"
    fi
done
# "stci_daemon --bootstrap" performs the same action for the next level of children
  • From the configuration topology information (file), each Daemon can determine which agents are its parent and children, and establish connections with them.
  • Setup Daemon Control Stream over the established connections
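For a binary-tree bootstrap topology, each Daemon can derive its parent and children directly from its rank in the configuration. A sketch, assuming 0-indexed ranks in level order (the indexing scheme is an assumption, not something mandated by the configuration format):

```python
# Level-order binary-tree topology, 0-indexed ranks (an illustrative assumption).
def parent(rank):
    """Rank of a daemon's parent; the root (rank 0) has no parent."""
    return None if rank == 0 else (rank - 1) // 2

def children(rank, n_daemons):
    """Ranks of a daemon's children among n_daemons total."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < n_daemons]
```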

The user-initiated infrastructure scenario follows the same procedure as the root-level scenario, except that the root Daemons are owned by the user rather than root, and:

  • Nodes need to be allocated, possibly in co-ordination with the system resource manager.

Startup Scenario — Starting a Tool Session

  • Assumptions
    • Root Daemons (root-level Agents) are running on the relevant nodes, and are used for job startup.
    • A node allocation has already been granted.
    • Daemon Control Stream has been established
  • Front End establishes connection with a root-level Daemon Front-end
    • The technique used by the front-end to locate the STCI is implementation dependent.
      • A simple file can be used or a web-site page.
      • Ideally, some form of indirection is recommended. For instance, a well-known authority, accessible by the Front Ends and the STCI (is this the controller?), could be used to keep track of the status of the STCI infrastructure. When the STCI infrastructure has been bootstrapped, the root-level daemon front-end(s) would register with the authority.
  • The Tool requests Service Agents be created
    • The startup request, which consists of a list of Agent binaries and their location, is used to create the description of the User-Level Service Agents (aka, Service Agents) that need to be created
    • Root Daemon front-end forwards the request to all relevant Root-level Daemons
    • Root level Daemons start Service Agents, and establish monitoring of this service agent.
    • A Service Agent Stream is created between the Service Agents and the User-level (Service) Front End
    • The startup request is used to determine where the Service Agents should be started
    • A Service Agent is required at each Tool Agent node and each Tool Junction node.
    • The Tool Agent mappings and job parameters are sent out to the Service Agents, using the Service Agent Stream
    • The Service Agents start up the Tool Agents (aka, Agents)
    • Note: Do we want to use the Stream tag to label stdio traffic, i.e. create 3 streams for i/o forwarding, or use a header tag, and multiplex using the Service Agent Stream?
    • I/O forwarding is set up, using the appropriate Streams (which streams?)
    • A job-level management Stream is created between the User-Level Front end and the Tool Agents
  • Tool requests that a Tool Stream be created.
    • Broadcast tool stream-creation request on Service Agent Stream (this may be pruned as the request is propagated)
      • The tool stream-creation request includes a stream topology (who is connected to whom)
    • Upon receiving tool stream-creation request, Service Agents:
      • Start appropriate Tool Junctions
      • Forward stream-creation request to Tool Agents and Tool Junctions
    • Tool Agents and Junctions handle request and build Tool Stream
      • Setup stream communications channels
      • Call the create-stream callback function
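The pruning mentioned above (not forwarding the stream-creation request into subtrees with no stream members) can be sketched as follows; the tree layout and member names in the example are hypothetical:

```python
# Sketch of broadcast pruning: a node forwards the stream-creation request
# only to children whose subtrees contain at least one stream member.
def forward_targets(node, children, members):
    """children: dict mapping a node to its list of child nodes.
    members: set of nodes participating in the stream."""
    def subtree_has_member(n):
        return n in members or any(subtree_has_member(c)
                                   for c in children.get(n, []))
    return [c for c in children.get(node, []) if subtree_has_member(c)]
```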
agent_startup.png
process_startup_1.png

Startup Scenario — Connecting to pre-started Root daemons

  • Daemons have been started at boot time, but are not connected
  • TODO:
    • How do the daemons discover each other?
    • Need to do this in a fault-tolerant manner in case some nodes/daemons are down

Startup Scenario — Connecting Agents that are compiled into application started by tool

This scenario assumes the following

  • The instrumentation (agent) code compiled into the application has an initialization function that must be called before any other instrumentation call can be made on an individual thread. The instrumentation code could be structured such that each instrumentation call tests to determine if instrumentation has been initialized and calls the initialization function if required.

The steps followed in this scenario are

  • User application is started by the tool.
  • Each application task runs until it reaches the instrumentation initialization call
  • The initialization function attempts to connect to a monitoring daemon (part of the STCI infrastructure) at a known hostname and port. If the connection fails, then instrumentation is disabled.
    • If an instrumentation call is made on a second or subsequent thread before instrumentation initialization is complete, that thread must block at the instrumentation call until instrumentation initialization has completed.
  • Once connected, the task sends identifying information to the monitoring daemon so that the monitoring daemon can verify that the connection attempt is coming from an application task that is actually part of the monitored application.
    • This could be as simple as sending the userid, hostname and process id of the task so the monitoring daemon can match this information to the mapping file generated by the parallel runtime system, or it might involve a more secure exchange using security credentials.
  • The monitoring daemon determines the appropriate connection point in the STCI infrastructure for this application task.
  • The monitoring daemon sends the connection information and a session identification token to the application task.
  • The instrumentation initialization function can disconnect from the monitoring daemon now.
  • The instrumentation initialization function now attempts to connect to the STCI infrastructure at the location specified in the data sent by the monitoring daemon. If the connection attempt fails then instrumentation is disabled.
  • The instrumentation initialization function sends the session identification token to the STCI infrastructure.
  • The STCI infrastructure validates the token sent by the instrumentation initialization function.
  • The STCI infrastructure sends accept/reject status to the instrumentation initialization function in the application task.
  • If the status is reject then instrumentation in that task is disabled. If the status is accept, then instrumentation is enabled.
  • The instrumentation initialization function signals initialization complete and any threads blocked awaiting instrumentation initialization are allowed to resume.
  • Application and instrumentation execution continues normally from this point thru application task termination.
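The initialize-once, block-other-threads behaviour described above can be sketched with a lock and a completion flag (a minimal illustration; connect() stands in for the whole monitoring-daemon handshake and token exchange, and is an assumption):

```python
import threading

_init_done = threading.Event()          # set once initialization has completed
_init_lock = threading.Lock()
_instrumentation_enabled = False

def instrumentation_initialize(connect):
    """First caller performs the handshake; concurrent callers block on the
    lock until it completes, then all see the same enabled/disabled result."""
    global _instrumentation_enabled
    if _init_done.is_set():             # fast path after initialization
        return _instrumentation_enabled
    with _init_lock:
        if not _init_done.is_set():
            try:
                _instrumentation_enabled = bool(connect())
            except OSError:
                _instrumentation_enabled = False   # connection failed: disable
            _init_done.set()
    return _instrumentation_enabled
```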

The following diagram shows the major steps and the responsible components for this scenario

application_agent_1.png

Note: In the above scenario, there may be a scaling issue if all application tasks calling instrumentation_initialize() connect to a single host/port. This may be solvable by placing monitoring daemons at many nodes in the STCI infrastructure, with each application process attempting to connect to the monitoring daemon closest to it.

In the case where STCI infrastructure daemons are allowed to run on the same compute nodes as the application tasks, this scenario could be simplified.

  • The STCI infrastructure daemon running on each node implements a listener thread which listens for incoming connections. This thread fills the role of an STCI monitoring daemon.
  • The application is started as in the above scenario.
  • When an application task executes instrumentation_initialize(), it attempts to connect to the STCI daemon using port nnnn@localhost (a suitable shared memory scheme could also work)
  • Once connected, the application task sends identifying information to the STCI daemon as described above.
  • The STCI daemon validates the connection request and sends an accept or reject status back to the application task.
  • If the STCI daemon sends an accept back to the application task, it also properly hooks up the application task to the STCI infrastructure so that the instrumentation code can send data thru STCI.
  • If the application task receives an accept response, then instrumentation is enabled; otherwise instrumentation is disabled.
  • All application task threads waiting for instrumentation_initialize() to complete are signalled that they can resume execution.

Startup Scenario — Connecting Agents that are compiled into application running before the tool is started

This scenario assumes the following

  • The instrumentation (agent) code compiled into the application has an initialization function that must be called before any other instrumentation call can be made on an individual thread. The instrumentation code could be structured such that each instrumentation call tests to determine if instrumentation has been initialized and calls the initialization function if required.
  • The parallel runtime system provides a way to obtain the mapping from each application task to the hostname and process id for that task. This information can be provided in several ways, including
    • An explicit host list file created by the user before starting the application.
    • The parallel runtime system puts this mapping in a file in a known location once the application has been started.
    • The parallel runtime system provides an API that can be queried to obtain this mapping.
  • The application has been loaded and started before the tool attempts to connect to the instrumentation code (agents) compiled into the application. If the tool attempts to connect before the application has fully started, the mapping information provided by the parallel runtime system may be incomplete or not present.
  • Before the tool connects to the agents in the application tasks, those agents should treat instrumentation as inactive (nop) since there is no place to send the data. Alternatively, the instrumentation code could store data in memory (or in a file) with the intent that data would be sent once a connection to STCI infrastructure was completed.
  • The instrumentation code in the application must provide a method to listen for an incoming request from the STCI infrastructure. Several ways to do this are
    • Implement a listener thread that listens for connections on a specific port (would this even work with multiple application tasks per node?)
    • Implement a thread that watches a shared memory location for notification of a connection request. STCI code will attach to the shared memory segment and set this memory location to signal a connection request
    • Implement a signal handler that monitors for a specific signal (SIGUSR1, SIGUSR2, etc) as notification of a connection request

The processing in this scenario is as follows

  • The application process is started by some mechanism outside the scope of STCI (command line invocation, batch job, etc)
  • Instrumentation code is initially disabled/inactive (or storing data internally)
  • The first instrumentation call made by a task (from any thread) must start the listening mechanism. Instrumentation remains inactive.
  • The application runs normally.
  • The tool is started.
  • The user specifies the application that is to be connected to.
  • The STCI infrastructure uses this information to query the parallel runtime system for the application task to hostname/process id mapping for each application task.
  • The STCI infrastructure sends a connection request to each application task, where that request may be a socket connection, shared memory notification, signal, etc., that is understood by the instrumentation code.
  • The listening code in the application task sets a flag indicating that instrumentation is to be activated.
  • The next instrumentation call in the application is treated as if it were an instrumentation_initialize() call as described above. Any subsequent application thread that makes an instrumentation call before instrumentation_initialize() completes must block until instrumentation_initialize() completes.
  • When instrumentation_initialize() is called, it is treated exactly as above for the scenario where the tool started the application.
  • If the connection request status is accept then instrumentation is activated. Optionally, the instrumentation code may send any stored data thru the STCI infrastructure.
  • If the connection status is reject, then instrumentation remains inactive.
  • If a connection request fails, the instrumentation code should reset its state such that the STCI infrastructure can attempt to connect again (at user request)
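The store-and-flush variant mentioned in the assumptions (buffering data until a connection is accepted) can be sketched like this; all names are illustrative:

```python
# Illustrative sketch: buffer instrumentation events while inactive and
# flush them once the connection request is accepted.
class BufferedInstrumentation:
    def __init__(self):
        self.active = False
        self.buffer = []   # events recorded before a connection exists
        self.sent = []     # stands in for data sent thru the STCI infrastructure

    def record(self, event):
        if self.active:
            self.sent.append(event)
        else:
            self.buffer.append(event)  # no connection yet: store locally

    def on_connection_status(self, accepted):
        if accepted:
            self.active = True
            self.sent.extend(self.buffer)  # optionally send stored data
            self.buffer.clear()
        # on reject or failure: remain inactive, keeping state so that the
        # infrastructure can attempt to connect again at user request
```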

The following diagram shows the major steps and the responsible components for this scenario

application_agent_2a.PNG
application_agent_2b.PNG

Notes from Jan 8 ORNL Meeting (Rich and Darius)

Connecting running agents with STCI

Agents were started (e.g., as part of an application) by something other than STCI. The infrastructure is then started.

Questions:

  • In Dave's description above, the agent initiates the connection to the STCI when a tool message is sent. We see a few issues:
    • The agent needs to be part of a stream on which to send data
    • The agent may not know where the infrastructure/FE are, but the FE knows where the agents are
  • We propose that the STCI connects to the agents when the session is created by the FE. Once the agents are part of the session, streams can be created, and communication for instrumentation can begin.
  • How do we connect to the agents?
    • Assume that the front end starts a session then adds the agents to that session
    • Before an agent is added to a session, it does not communicate with the infrastructure
      • Can it add itself to a session? Can it create a session? Or is this something only the FE can do?
      • How would an agent create a session with the other agents of the job?
        • We would need a way to make sure that multiple sessions aren't created
        • We would need a way for the agents to discover each other
        • This is difficult if no infrastructure exists when the agents are started.
          • Do we want to allow agents to create an infrastructure? How do we make sure there is only one?
      • If we assume that sessions are only created by FEs, and that agents just wait until they are added (by the FE) to a session, these issues go away
        • The FE creates one session.
        • The FE creates one infrastructure.
        • The FE knows where to find the agents
    • Front end has access to the description of where the agents are located (hostname, pid)
      • The location info the FE has is only what would be provided by a scheduler
      • So we can't count on knowing an agent's "listener port" from the location info
    • How would the infrastructure know how to connect to the agent?
      • Example 1: agent opens a listener port on a well known port number
        • This would (hopefully) be a rare case, and might be used as a last resort
      • Example 2: agent opens a pipe-file in ${TMPDIR} with the name based on the pid
        • STCI daemon can be started on the node and connect to the agent
  • Assumption:
    • Agents can be started in an environment where there is no STCI infrastructure
    • An agent must wait until it is part of a session before it can communicate

Scenario: Agents are running (not started by STCI) and FE attaches to them.

  • Assumptions
    • compute nodes are allocated
    • agents are running on a subset of compute nodes
    • agents are not attached to infrastructure
  • FE creates a session
  • FE (i.e., the tool) indicates whether we need service nodes
    • The tool may decide not to use service nodes, and run junctions on compute nodes, or use no junctions at all
  • FE allocates service nodes
  • FE adds agents to session
  • FE creates streams to agents

Scenario: Nothing is running — need to allocate compute and service nodes

  • FE creates a session
  • FE specifies topologies
  • FE queries STCI library for number of junctions/levels/etc in a topology (e.g., for "built-in" topologies like binary tree, the tool may specify N agents, but it doesn't know how many junctions)
  • FE uses this info to determine how many service nodes and compute nodes it needs
  • FE allocates nodes
    • internally, compute and service nodes may need to be allocated separately
    • FE can request different classes of nodes (e.g., compute, service, or something else if we find a need for it) in one allocation call. This allows the possibility of making a single low-level allocation request.
    • FE can make multiple allocation requests
    • allocation request is nonblocking
    • allocation request function returns a handle for each class of requested nodes
  • FE stages and starts agents
  • FE or agents create streams
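The nonblocking allocation interface described in the bullets above might look like this (allocate_nodes(), the node-class names, and the use of a thread pool are all hypothetical; a real implementation would talk to the resource manager):

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def allocate_nodes(node_class, count):
    """Nonblocking allocation request: returns a handle (future) immediately,
    one handle per class of requested nodes."""
    def do_allocate():
        # Stand-in for the real resource-manager request.
        return ["%s-%d" % (node_class, i) for i in range(count)]
    return _pool.submit(do_allocate)

# The FE issues requests for each node class and continues working...
compute_handle = allocate_nodes("compute", 3)
service_handle = allocate_nodes("service", 1)
# ...and later waits on the handles when it needs the nodes.
compute_nodes = compute_handle.result()
service_nodes = service_handle.result()
```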

Dynamic process allocation

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License