Architecture

Overview

The goal of the STCI project is to provide an open source, scalable, portable, and extensible run-time and communication infrastructure for the deployment of developer tools on the emerging peta- and exa-scale high performance computing systems.

Traditionally, tool communication infrastructure has been developed by single organizations, either as frameworks specific to a single tool (e.g. TotalView) or as generic, tool-independent, scalable communication infrastructure (e.g. MRNet). The new class of peta- and exa-scale systems will demand significantly greater scalability from such infrastructure than has been required in the past. This will make it much more difficult for tool developers to absorb the cost of developing, porting, and maintaining both the infrastructure and the tool. The STCI initiative aims to amortize this cost across a broad community, while at the same time providing a common API for developing the next generation of high performance computing tools.

The main features of the infrastructure will include:

  • a layered, component architecture
  • system resource management services
  • primitives that support aggregate and point-to-point communication
  • low latency/high bandwidth asynchronous communication
  • fault tolerance
  • heterogeneous data support

The following diagram provides a high-level overview of the proposed architecture.

overview.png

The tool front end, tool junctions, and tool agents are tool-specific components that collectively implement the tool's functionality. The front end is the user interface component of the tool. Agents are components responsible for interaction with the application processes. Junctions are components that can process messages sent between the front end and the agents, for example by aggregating, filtering, modifying, or transforming them.
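To make the role of a junction concrete, the following is a minimal sketch of a junction that filters and then aggregates messages from agents before passing them toward the front end. It is illustrative only: `Message`, `Junction`, `drop_empty`, and `sum_payloads` are hypothetical names chosen for this example and are not part of any STCI API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Message:
    """Hypothetical message: who sent it and an integer payload."""
    source: str
    payload: int


# A stage takes the messages arriving at a junction and returns the
# messages that should continue toward the front end.
Stage = Callable[[List[Message]], List[Message]]


@dataclass
class Junction:
    """Hypothetical junction that applies its stages in order."""
    name: str
    stages: List[Stage]

    def process(self, incoming: List[Message]) -> List[Message]:
        messages = list(incoming)
        for stage in self.stages:
            messages = stage(messages)
        return messages


def drop_empty(messages: List[Message]) -> List[Message]:
    """Filter stage: discard messages whose payload is zero."""
    return [m for m in messages if m.payload != 0]


def sum_payloads(messages: List[Message]) -> List[Message]:
    """Aggregation stage: collapse all messages into a single total."""
    return [Message(source="junction", payload=sum(m.payload for m in messages))]


# Three agents report to one junction; the junction forwards one message.
from_agents = [Message("agent0", 3), Message("agent1", 0), Message("agent2", 4)]
junction = Junction(name="j0", stages=[drop_empty, sum_payloads])
to_front_end = junction.process(from_agents)
```

In this sketch the front end receives one combined message instead of three, which is the kind of traffic reduction junctions are intended to allow.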

The front end, junctions, and agents are not part of the STCI, but are supplied as part of the tool package. STCI is responsible for deploying the junctions and agents in a manner that enables communication between the agents and the front end, and between agents and the application processes. It also provides a range of other services, such as session management, interaction with system resources, I/O forwarding, file staging, etc.

Communication between the front end and agents takes place via streams. A stream is a logical grouping of agents and junctions using some routing topology. Two types of communication are supported: aggregate and point-to-point. Aggregate communication enables communication in a one-to-many or many-to-one pattern (i.e. multicast or reduction). Point-to-point communication allows messages to be addressed to individual end points in the infrastructure. Stream creation is a lightweight operation, and there can be many streams during the lifetime of a particular tool.
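The two communication types can be illustrated with a small sketch. It assumes a tree routing topology, which is only one possible choice, and models a stream as a root (the front end), interior junctions, and leaf agents. The `Stream` class and its `agents`, `multicast`, `reduce`, and `route` methods are hypothetical names for this example, not the STCI interface.

```python
from typing import Callable, Dict, List, Optional


class Stream:
    """Hypothetical stream over a tree: front end at the root, agents at the leaves."""

    def __init__(self, root: str, children: Dict[str, List[str]]):
        self.root = root
        self.children = children  # node name -> names of its child nodes

    def agents(self, node: Optional[str] = None) -> List[str]:
        """Return the leaf nodes (agents) below `node`, left to right."""
        node = self.root if node is None else node
        kids = self.children.get(node, [])
        if not kids:
            return [node]
        found: List[str] = []
        for kid in kids:
            found.extend(self.agents(kid))
        return found

    def multicast(self, message: str) -> Dict[str, str]:
        """Aggregate, one-to-many: deliver the same message to every agent."""
        return {agent: message for agent in self.agents()}

    def reduce(self, values: Dict[str, int],
               combine: Callable[[int, int], int],
               node: Optional[str] = None) -> int:
        """Aggregate, many-to-one: combine agent values at each junction on the way up."""
        node = self.root if node is None else node
        kids = self.children.get(node, [])
        if not kids:
            return values[node]
        partials = [self.reduce(values, combine, kid) for kid in kids]
        result = partials[0]
        for partial in partials[1:]:
            result = combine(result, partial)
        return result

    def route(self, target: str, node: Optional[str] = None) -> List[str]:
        """Point-to-point: path from the root to `target`, or [] if unreachable."""
        node = self.root if node is None else node
        if node == target:
            return [node]
        for kid in self.children.get(node, []):
            path = self.route(target, kid)
            if path:
                return [node] + path
        return []


stream = Stream("front_end", {
    "front_end": ["j1", "j2"],
    "j1": ["a1", "a2"],
    "j2": ["a3"],
})
delivered = stream.multicast("start")
total = stream.reduce({"a1": 1, "a2": 2, "a3": 3}, lambda x, y: x + y)
path = stream.route("a3")
```

Because a stream in this sketch is just a small description of a topology, creating many of them is cheap, which matches the intent that stream creation be a lightweight operation.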

Architecture Details

The following diagram shows the STCI component architecture. Elements in blue are included in the scope of the design. Elements in green are external to the design (although they may be included in the implementation).

STCI_Architecture2.pdf

Components

Each component is described in more detail in the following sections. For definitions of the terms used, see the Glossary.

New component hierarchy

Operation

This diagram shows the relationship between the parts of the STCI, including the physical nodes and the interconnect.

node-component-20080110.png

Here is a version of the same diagram showing what happens when the front end is disconnected:

node-component-20080110-disconnected.png

Session Operation

Session Operation

Data Structures

These are the global data structures that will be used by the infrastructure components.

Communication

Communication operation descriptions are here.

Bootstrapping

A description of the bootstrapping process is here.

Session and Stream Management

The following diagram shows the high level activities involved in session and stream creation.

job_flow_1.png

A detailed description of the session and stream initialization and termination process is here.
