Reno, November 12, 2007

The third STCI meeting was held in conjunction with SC07 in Reno, NV, on Monday, November 12, from 1 to 5 pm in Tradewinds II on the convention level of the Atlantis Hotel.

Attendees:

Rob Fowler (RENCI)
Tim Mattox (IU)
Steve Cooper (IBM)
Darius Buntinas (ANL)
Michael Brim (UWISC)
George Bosilca (UTK)
Greg Watson (IBM)
John DelSignore (TotalView)

License

The license issue was again raised, but no resolution was reached. Key points were:

  • OpenMPI is licensed under BSD, so a BSD-style license would be preferred
  • IBM cannot contribute to a project licensed under GPL or LGPL
  • MRNet is licensed under LGPL, so LGPL is its preferred license

Currently, the CPL, EPL, and Apache licenses have the most support.

Timeline

The timeline for v1.0 was discussed, including whether any deadlines need to be met. The only requirement identified came from ORNL: mid-2008.

Design

We need to split the start-up use cases into two separate cases and add more detail:

  • preexisting infrastructure running as root
  • per-user infrastructure startup

For the user-instantiated infrastructure, when is it torn down? When the last session terminates? What if you want to keep it around because you know you'll be starting another session afterwards? Tim and George proposed that, if the infrastructure is torn down after the last session exits and you want it to remain between sessions, you can start a dummy session when the infrastructure is started, so that there is always at least one session running. When you're ready to tear down the infrastructure, you terminate the dummy session.
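The dummy-session proposal above can be illustrated with a minimal sketch. It assumes the infrastructure simply tracks its open sessions and tears itself down when none remain; the class and method names are hypothetical and not part of any STCI design.

```python
class Infrastructure:
    """Hypothetical per-user infrastructure, torn down when its last session exits."""

    def __init__(self):
        self.sessions = set()
        self.running = True

    def start_session(self, name):
        if not self.running:
            raise RuntimeError("infrastructure has been torn down")
        self.sessions.add(name)

    def end_session(self, name):
        self.sessions.discard(name)
        # Assumed teardown rule: no sessions left means the infrastructure exits.
        if not self.sessions:
            self.running = False


infra = Infrastructure()
infra.start_session("dummy")       # placeholder session started with the infrastructure
infra.start_session("debug-1")
infra.end_session("debug-1")
still_up_between_sessions = infra.running   # the dummy session keeps it alive
infra.start_session("debug-2")
infra.end_session("debug-2")
infra.end_session("dummy")         # ending the dummy session triggers teardown
torn_down = not infra.running
```

With this approach the infrastructure needs no separate "keep alive" flag; the dummy session is just another entry in the session set.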

The session-terminate use-case diagram shows that agents are terminated. In some cases agents shouldn't be terminated, e.g., if the agent was started outside of the session or is still being used by another session.
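One way to express the two exceptions above is as a check performed when a session ends. This is only a sketch under assumed bookkeeping: each agent records which session started it (or none, if started externally) and which sessions currently use it. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Agent:
    name: str
    started_by: Optional[str]            # session that launched it; None if started externally
    used_by: set = field(default_factory=set)


def agents_to_terminate(session, agents):
    """Return names of agents that may be terminated when `session` ends."""
    doomed = []
    for agent in agents:
        other_users = agent.used_by - {session}
        # Terminate only agents this session started and that no other session uses.
        if agent.started_by == session and not other_users:
            doomed.append(agent.name)
    return doomed


agents = [
    Agent("a1", started_by="s1", used_by={"s1"}),
    Agent("a2", started_by=None, used_by={"s1"}),         # started outside the session
    Agent("a3", started_by="s1", used_by={"s1", "s2"}),   # still used by session s2
]
result = agents_to_terminate("s1", agents)
```

Here only `a1` is selected; `a2` and `a3` survive the termination of session `s1`.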

Front-end disconnect. It was raised previously by email that when the front-end (FE) disconnects, a tool may want to continue processing messages (e.g., to log messages or perform autonomous analysis). At that time, it was suggested that the tool implementor could add a junction just before the FE, through which all streams flow, to act as a proxy for the FE, rather than explicitly adding a proxy component. Now, however, a junction can only be part of one stream, and so cannot act effectively as a proxy for an FE that has many streams. At the SC07 meeting it was suggested to "move the disconnect line" so that when the FE disconnects, part of the FE remains connected to the infrastructure, running on an infrastructure node: perhaps the bulk of the FE's work is done by the persistent part on an infrastructure node, and the "disconnectable" part is just the UI. Since we believe that disconnecting will be a common tool feature, the STCI should provide support for it rather than leaving it to the tool developer to implement disconnect/reconnect.
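A rough sketch of the "move the disconnect line" idea follows. It assumes the persistent part of the FE keeps receiving messages and holds a backlog while no UI is attached, then replays it on reconnect; the backlog behavior and all names are illustrative assumptions, not something decided at the meeting.

```python
class PersistentFrontEnd:
    """Hypothetical persistent FE part that stays on an infrastructure node."""

    def __init__(self):
        self.ui = None        # callback for the disconnectable UI, if attached
        self.backlog = []     # messages received while no UI was attached

    def on_message(self, msg):
        # Messages keep arriving whether or not a UI is connected.
        if self.ui is not None:
            self.ui(msg)
        else:
            self.backlog.append(msg)

    def connect_ui(self, callback):
        self.ui = callback
        # Replay everything the UI missed while it was away.
        for msg in self.backlog:
            callback(msg)
        self.backlog.clear()

    def disconnect_ui(self):
        self.ui = None


fe = PersistentFrontEnd()
seen = []
fe.connect_ui(seen.append)
fe.on_message("m1")
fe.disconnect_ui()
fe.on_message("m2")
fe.on_message("m3")
held_while_away = list(fe.backlog)
fe.connect_ui(seen.append)
```

A real tool might log or analyze messages in `on_message` instead of only buffering them; the point is that the streams terminate at the persistent part, not at the UI.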

The documentation should note that junctions can run on the same nodes as compute processes, i.e., on compute nodes.

Reliability and message ordering should be selectable on a per-stream basis. Some tools may be willing to sacrifice reliability or in-order message delivery for better performance. If a tool doesn't need these guarantees, the infrastructure may be able to better optimize message transfer.
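As an illustration of per-stream selection, the sketch below models the two guarantees as options on a stream and derives which transport mechanisms would be needed. The option names and the mechanism names are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamOptions:
    """Hypothetical per-stream delivery guarantees."""
    reliable: bool = True
    ordered: bool = True


def required_mechanisms(opts):
    """Return the (illustrative) mechanisms a stream with these options needs."""
    mechanisms = set()
    if opts.reliable:
        mechanisms.update({"acknowledgements", "retransmission"})
    if opts.ordered:
        mechanisms.add("reorder_buffer")
    if opts.reliable or opts.ordered:
        mechanisms.add("sequence_numbers")
    return mechanisms


control_stream = StreamOptions()                                # both guarantees on
sampling_stream = StreamOptions(reliable=False, ordered=False)  # trades both for speed
control_needs = required_mechanisms(control_stream)
sampling_needs = required_mechanisms(sampling_stream)
```

A stream that opts out of both guarantees needs none of these mechanisms, which is where the infrastructure could save work.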

We still need to make diagrams for:

  • stream startup details
  • security manager
  • connect/disconnect of front-end
  • multiple front-ends connecting
  • agent connection/lifecycle

We need to add an MPI process manager use case (in the use case section) so that we have something to point to when justifying features. E.g., the "spawn additional processes" diagram seemed MPI-specific. STCI, as defined so far, does not seem to prevent a tool from simply starting a new agent on a compute node and creating a stream that includes it and the existing agents. This would be the more likely use case for tools like debuggers.

The following organizations indicated they were willing to assist in completing the design document and block diagrams: ORNL, UTK, IBM, ANL. It was agreed to have the design substantially completed by the January meeting.

Implementation Plan

It was agreed that an implementation plan will be necessary; however, the architecture and design document needs to be completed first.

Resources

A discussion on resourcing the project was held. The following potential resources were identified:

  • ORNL - 1 FTE + 1 student
  • UTK - 1 post doc + 1 student
  • IBM - 1 FTE + 1 intern
  • IU - 1 post doc
  • ANL - 1 FTE + 1 intern
  • RENCI - as possible

Next meeting

The exact dates of the next meetings are to be determined via email. The goal is to have at least a teleconference in December and a face-to-face meeting in January.
