Data Staging Use Case

Title: Data Staging

High Level Description

  • User has files accessible from the front-end and wants to distribute them to the compute nodes, either to local storage, or directly to memory.
  • There may be zero or more files to distribute.
  • The files may be distributed to a subset of nodes.
  • Files may need to be distributed at any time from before the app starts executing, while the app is executing, to after the app exits.

Environment Assumptions

  • OS/architecture need not be homogeneous across compute nodes.
  • OS/architecture need not be the same on front-end node as any compute node.
  • OS/architecture need not be homogeneous across the infrastructure nodes.
  • Compute nodes may or may not have local storage.
  • Compute nodes do not have a shared filesystem from which to access the files, or the shared filesystem is too slow to be used to distribute the files.
  • User must have write access to destination directory in local storage on compute nodes (only if files are staged to disk) and read access to files on front-end node.

STCI Infrastructure Assumptions

  • STCI infrastructure is up and running
  • High-bandwidth data transfer
  • Multiple data transfers can be performed asynchronously
  • Performance requirements
    • X MB file can be staged on Y nodes in T seconds
    • X MBps data transfer rate between "stages", i.e., front-end to plugin, plugin to plugin and plugin to agent

Specific Use Cases

I. Staging binary executables to homogeneous compute nodes.

A single binary executable file is staged to each compute node. The front-end can specify whether or not the file is deleted on application exit. The front-end can issue a request to delete the file at any time.

  • Detailed Tool (Agent + Front-end) startup description
    • Agent and Front-end are started after job is scheduled, but before app starts running.
  • Detailed Communications patterns (including any "filter" expectations from STCI)
    • Data flows from front-end (possibly directly from disk), through STCI and plugins, to an agent (again, possibly directly to disk or directly to memory) on every compute node.
    • Plugins (filters) receive data from parent and resend data verbatim to each child.
    • The file can be compressed by the front-end and decompressed by the agent.
  • Requirements from STCI
    • High-bandwidth data transfer (in the direction from front-end towards compute nodes)
      • to allow tool to perform a high-bandwidth broadcast
  • Life cycle
    • Agents and plugins must be deployed and started before the application starts
    • Agents and plugins must ensure that all active transfers complete before shutting down
  • Startup
  • Monitoring
    • Should be able to query progress of transfers
  • Response to errors
    • Errors in transferring files should not be fatal
  • Termination
    • Agents and plugins must ensure that all active transfers complete before a normal exit
    • In the event of abnormal termination (e.g., abort()) active transfers are killed, possibly without clean-up
    • Front-end can request to kill active transfers
      • Agents and plugins should clean up by removing partially transferred files
  • Infrastructure Communications requirements
  • Scalability
    • STCI must be able to support one agent per process (10,000's)
    • STCI must be able to support O(log(n)) plugins (n is # of agents)
    • Data-staging tool will configure a single tree topology connecting plugins for high-bandwidth distribution of files
  • Communications latencies
  • Communications bandwidths
    • Transfer X MBps between plugin stages / agents
  • Communications fault tolerance
    • Faults must be detected and reported by STCI
  • Is the infrastructure assumed to provide some sort of proxy services, such as I/O forwarding?
    • The ability for the front-end to transfer files directly from/to disk would improve performance
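The verbatim forwarding that the plugins (filters) perform in this pattern can be sketched as follows. This is an illustrative model only; `Plugin`, `Agent`, and `deliver` are hypothetical names, not part of STCI:

```python
class Plugin:
    """A filter in the tree: relays each chunk from its parent verbatim
    to every child, so large files stream through without full buffering."""
    def __init__(self, children):
        self.children = children        # child Plugin or Agent objects

    def deliver(self, chunk):
        for child in self.children:     # resend unchanged; no interpretation
            child.deliver(chunk)

class Agent:
    """A leaf on a compute node: reassembles the chunks it receives
    (in a real tool, writing to local disk or directly to memory)."""
    def __init__(self):
        self.buf = bytearray()

    def deliver(self, chunk):
        self.buf.extend(chunk)

# Usage: a two-level tree; stream a "file" to two agents in 4-byte chunks.
a1, a2 = Agent(), Agent()
root = Plugin([Plugin([a1]), Plugin([a2])])
data = b"example executable bytes"
for i in range(0, len(data), 4):
    root.deliver(data[i:i + 4])
assert bytes(a1.buf) == data and bytes(a2.buf) == data
```

Because each plugin forwards a chunk as soon as it arrives, the transfer is pipelined down the tree rather than store-and-forward per file.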

II. Staging binary executables to heterogeneous compute nodes.

Similar to above, except that the front-end has one executable file for each compute node architecture. Each compute node will receive one of those files.

  • Detailed Communications patterns (including any "filter" expectations from STCI)
    • Multiple topologies (streams) will be used
    • Each stream connects an exclusive subset of the compute nodes to the front-end
    • Transfers should be able to be made concurrently
      • i.e., the initiation of the transfer of one file should not have to wait for another transfer to complete
    • Agents, plugins and the front-end may need to perform data conversion (endian swapping, etc.) on the headers / metadata exchanged by the tool.
    • The plugins won't need to perform data conversion on the files themselves, because the plugins are not interpreting the data, just copying it verbatim.
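Under these assumptions, the exclusive per-architecture streams might be derived as in the following sketch. The node-to-architecture map and `build_streams` are hypothetical, shown only to make the partitioning concrete:

```python
from collections import defaultdict

def build_streams(node_arch):
    """Partition compute nodes into exclusive per-architecture streams.
    node_arch maps node id -> architecture name (hypothetical input)."""
    streams = defaultdict(list)
    for node, arch in sorted(node_arch.items()):
        streams[arch].append(node)
    return dict(streams)

# Each architecture's executable is then broadcast only over its own stream,
# and the per-stream transfers can proceed concurrently.
streams = build_streams({0: "x86_64", 1: "ppc64", 2: "x86_64", 3: "ppc64"})
assert streams == {"x86_64": [0, 2], "ppc64": [1, 3]}
```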

III. Staging dynamically-loaded objects during execution

The application makes a request to stage the binary for a dynamically-loaded object in the course of execution. The request is sent from the process to an agent, e.g., via an RPC.

IIIa. Applications request dynamically-loaded objects collectively (synchronous)

The request will be a collective request (i.e., if one process makes a request, then all processes will make a request), but each process need not request the same file and some may request no file at all. In this version the requests are blocked until all processes send the request.

  • Detailed Communications patterns (including any "filter" expectations from STCI)
    • The agents receive the requests from the processes and send them to the front-end via the plugins.
    • The plugins receive the requests from the agents and consolidate the requests
      • Because the processes make the requests collectively, a plugin will wait to receive one from each child before forwarding the consolidated request to its parent
        • Plugins will do a "blocking synchronization" for incoming requests
      • A consolidated request will consist of a list of files to be staged along with a list of nodes to which each file should be staged (note that this list may specify ranges of nodes rather than listing each node, e.g., "/lib/liba.so 0,3-14,17", "/lib/libb.so 1,2")
    • The actual staging of files should be the same as the previous examples
  • Monitoring
    • App should be able to query whether its requested file has been staged yet
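The consolidation step, including the node-range notation used in the request example above, could look like this sketch (function names are illustrative, not part of STCI):

```python
def compress_nodes(nodes):
    """Render a node list in the range notation used in requests,
    e.g. [0, 3, 4, 5, 17] -> "0,3-5,17"."""
    nodes = sorted(nodes)
    parts, start, prev = [], nodes[0], nodes[0]
    for n in nodes[1:]:
        if n == prev + 1:               # extend the current run
            prev = n
            continue
        parts.append(f"{start}-{prev}" if start != prev else f"{start}")
        start = prev = n
    parts.append(f"{start}-{prev}" if start != prev else f"{start}")
    return ",".join(parts)

def consolidate(child_requests):
    """Merge per-node requests (node id -> file path, or None for
    'no file requested') into one consolidated request:
    file path -> node-range string."""
    by_file = {}
    for node, path in child_requests.items():
        if path is not None:
            by_file.setdefault(path, []).append(node)
    return {path: compress_nodes(nodes) for path, nodes in by_file.items()}

# Usage: five processes request collectively; one requests no file at all.
req = consolidate({0: "/lib/liba.so", 1: "/lib/libb.so", 2: "/lib/libb.so",
                   3: "/lib/liba.so", 4: "/lib/liba.so", 5: None})
assert req == {"/lib/liba.so": "0,3-4", "/lib/libb.so": "1-2"}
```

A plugin would run `consolidate` over the requests gathered from its children and forward the merged result to its parent.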

IIIb. Applications request dynamically-loaded objects collectively (asynchronous, with file caching)

The request will be a collective request (i.e., if one process makes a request, then all processes will make a request), but each process need not request the same file and some may request no file at all. In this version, the requests are forwarded to the front-end, and the front-end starts staging the files as the requests come in. The plugins cache the files to satisfy additional requests for the same file, rather than having the front-end resend the file each time.

  • Detailed Communications patterns (including any "filter" expectations from STCI)
    • The agents receive the requests from the processes and send them to the front-end via the plugins.
    • A plugin receives a request from an agent and passes it to its parent, not waiting for additional requests to consolidate
    • The plugin will remember which requests it has forwarded, and not forward additional requests for the same file. Instead it will add the child to the list of children the file should be forwarded to when the file comes in.
    • When the plugin receives a file, it forwards the file to each child that requested it, and also keeps a copy of the file locally
    • If a plugin receives a request for a file which it has cached, it will send the file to the requestor from its cache.
    • The cached files can be removed when a request has been received from all children
  • Monitoring
    • App should be able to query whether its requested file has been staged yet
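A minimal sketch of this caching-plugin behavior, assuming hypothetical `send_up`/`send_down` transport hooks (the real transport would be provided by STCI):

```python
class CachingPlugin:
    """Forward the first request for a file upward, queue later requesters,
    serve repeats from the cache, and evict the cached copy once every
    child has received the file."""
    def __init__(self, n_children, send_up, send_down):
        self.n_children = n_children
        self.send_up = send_up      # forward a request to the parent
        self.send_down = send_down  # deliver (child, path, data) downward
        self.pending = {}           # path -> children awaiting the file
        self.cache = {}             # path -> (data, set of children served)

    def on_request(self, child, path):
        if path in self.cache:
            data, served = self.cache[path]
            self.send_down(child, path, data)
            served.add(child)
            if len(served) == self.n_children:  # all children have a copy
                del self.cache[path]            # safe to evict
        elif path in self.pending:
            self.pending[path].add(child)       # already forwarded upward
        else:
            self.pending[path] = {child}
            self.send_up(path)                  # first request: forward it

    def on_file(self, path, data):
        waiting = self.pending.pop(path, set())
        for child in waiting:
            self.send_down(child, path, data)
        if len(waiting) < self.n_children:
            self.cache[path] = (data, set(waiting))  # keep for late requests

# Usage: two children request the same file; only one request goes upward.
up, down = [], []
plugin = CachingPlugin(2, send_up=up.append,
                       send_down=lambda c, f, d: down.append((c, f, d)))
plugin.on_request(0, "/lib/liba.so")
plugin.on_request(1, "/lib/liba.so")
assert up == ["/lib/liba.so"]           # forwarded exactly once
plugin.on_file("/lib/liba.so", b"bits")
assert len(down) == 2                   # both children served
```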

IIIc. Applications request dynamically-loaded objects individually

Each individual process can make a request for a dynamically-loaded object.

  • Detailed Communications patterns (including any "filter" expectations from STCI)
    • An agent receives a request from a process and sends it to the front-end via the plugins.
    • When a plugin receives a request it forwards it to its parent. Because requests are not collective, the plugin should forward the request immediately. No consolidation is performed.
    • The actual staging of files is similar to cases I and II, except that only one file is sent, and it is sent to exactly one agent.
  • Monitoring
    • App should be able to query whether its requested file has been staged yet

IV. Distributing data files

Essentially as I-III above, except with data files rather than binaries.

For heterogeneous compute nodes, should the tool perform data conversion? This is an open question, since the tool would need to know the file format.

V. Distributing multiple files

More than one file is to be staged at each node.

  • Detailed Communications patterns (including any "filter" expectations from STCI)
    • Transfers of files to the same process should be pipelined. I.e., the front-end should initiate the next transfer as soon as it has locally completed the previous transfer, rather than waiting for the previous transfer to complete at the compute nodes.
  • Monitoring
    • Should be able to query the status of each transfer.
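A toy timing model illustrates why the pipelining described above helps; the function, the per-file costs, and the numbers are all assumptions for illustration, not measured requirements:

```python
def completion_times(files, local_cost, downstream_cost, pipelined=True):
    """Assumed cost model: the front-end spends local_cost per file to read
    and send it; the tree then takes downstream_cost more to deliver it.
    Pipelined: the next send starts when the previous local send finishes.
    Sequential: the next send waits for full delivery at the compute nodes.
    Returns the delivery-completion time of each file."""
    t, done = 0, []
    for _ in files:
        t += local_cost
        done.append(t + downstream_cost)
        if not pipelined:
            t += downstream_cost
    return done

files = ["a.bin", "b.bin", "c.bin"]
# Pipelined: last file done at 3*1 + 4 = 7; sequential: 3*(1 + 4) = 15.
assert completion_times(files, 1, 4, pipelined=True)[-1] == 7
assert completion_times(files, 1, 4, pipelined=False)[-1] == 15
```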

VI. Staging of agents and plugins

Staging binaries for plugins to the infrastructure nodes and for agents to the compute nodes.

VIa. Staging of agents and plugins: Agents are deployed on infrastructure nodes

This would be essentially the same as the above cases, except that

  • the file-staging agents and plugins would need to be deployed and started up with the infrastructure before any other plugins or agents can be staged
  • agents would be deployed on infrastructure nodes to allow staging of plugin binaries

VIb. Staging of agents and plugins: Plugins, but not agents, are deployed on infrastructure nodes

In this case the data staging tool plugins copy the binaries of the plugins being staged to local storage on the infrastructure nodes.

  • As in VIa, the file-staging agents and plugins would need to be deployed and started up with the infrastructure before any other plugins or agents can be staged
  • Agents would be deployed on all compute nodes, and plugins are deployed on all infrastructure nodes
  • As a plugin forwards a file to be staged, it will save a copy of the file to disk, just as an agent would in the cases above. Essentially, the plugin now has agent capabilities.
  • Note that not all files being staged will be saved by every plugin
    • when a binary for an agent, as opposed to a plugin, is being staged
    • when not all infrastructure nodes will receive the same plugin binary, e.g.:
      • heterogeneous infrastructure nodes
      • different plugins at different stages of the tree

VII. Collecting data from compute nodes back to front-end

One or more files are sent from any subset of compute nodes to the front-end. When more than one compute node is involved in the same transfer operation, there is an opportunity to combine/coalesce that set of files in order to compress them.

  • Detailed Communications patterns (including any "filter" expectations from STCI)
    • The agent copies data from the disk (or memory), and forwards it to the front-end via the plugins.
    • The plugins forward a file from a child to the parent
    • When more than one compute node is sending a file as part of the same operation, there is an opportunity to coalesce the files to reduce the amount of data being sent. E.g., if the files are largely the same, such as in a checkpoint image:
      • The first time a plugin receives a file of an operation, it keeps a copy
      • For each subsequent file it receives from the same operation, it performs a diff with the original and only forwards the diff with information on which file it was diffed with, etc.
      • The front-end can either reconstruct the original set of files, or store the files in this compressed form
    • In the above example, the transfer operation is initiated by the front-end
      • A request message to initiate the operation is sent from the front-end to the set of agents.
      • The transfer operation can also be initiated by the application via the agent, e.g., if one or more files are being transferred from a single process
  • Requirements from STCI
    • High-bandwidth data transfer (in the direction from compute nodes toward front-end)
      • To allow tool to perform a high-bandwidth gather/reduction
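The checkpoint-image coalescing idea can be sketched with a sparse byte-level diff. This is a simplification that assumes equal-length files; a real tool would use a general delta encoding, and all names here are illustrative:

```python
def coalesce(files):
    """Plugin side: keep the first file of an operation as the reference
    and forward later files as sparse diffs (offset -> byte) against it."""
    ref = files[0]
    out = [("full", ref)]
    for f in files[1:]:
        diff = {i: b for i, (a, b) in enumerate(zip(ref, f)) if a != b}
        out.append(("diff", diff))
    return out

def reconstruct(coalesced):
    """Front-end side: rebuild the original set of files, applying each
    diff to a fresh copy of the reference."""
    ref = bytearray(coalesced[0][1])
    result = [bytes(ref)]
    for kind, payload in coalesced[1:]:
        img = bytearray(ref)
        for i, b in payload.items():
            img[i] = b
        result.append(bytes(img))
    return result

# Usage: three largely-identical "checkpoint images" from three nodes.
imgs = [b"checkpointAAAA", b"checkpointAABA", b"checkpointBAAA"]
assert reconstruct(coalesce(imgs)) == imgs
```

Only the reference travels in full; each later image costs just its differing bytes, which is where the bandwidth savings come from.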

VIII. Issues with heterogeneous infrastructure nodes

The nodes of the infrastructure on which the plugins will execute may be heterogeneous.

  • A separate plugin binary will be needed for each architecture
  • Plugins, agents and the front-end will need to perform data conversion on headers and metadata.

Target Computer Systems

  • Large clusters w/o scalable shared filesystems (10,000's of nodes or more)
  • Large parallel machines w/o scalable shared filesystems (10,000's of nodes or more)

Contributors

  • Darius Buntinas
  • Rich Graham