Fault Manager
Provides an API for managing fault policy. Fault policy can be provided on a per-stream basis.
Fault types
- Infrastructure bootstrap faults
- Startup faults
- Link failures (temporary or permanent)
- Data loss
- Data corruption
- Process failure
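The fault types and the per-stream policy lookup described above could be modeled roughly as follows. This is only a sketch: `FaultType`, `FaultPolicyTable`, the action strings, and the stream id are hypothetical names chosen for illustration, not part of any existing API.

```python
from enum import Enum, auto


class FaultType(Enum):
    """One member per fault type listed in these notes."""
    BOOTSTRAP = auto()        # infrastructure bootstrap faults
    STARTUP = auto()          # startup faults
    LINK_FAILURE = auto()     # temporary or permanent
    DATA_LOSS = auto()
    DATA_CORRUPTION = auto()
    PROCESS_FAILURE = auto()


class FaultPolicyTable:
    """Hypothetical registry: per-stream policy with a default fallback."""

    def __init__(self, default_action):
        self._default = default_action
        self._per_stream = {}  # (stream_id, FaultType) -> action name

    def set_policy(self, stream_id, fault, action):
        self._per_stream[(stream_id, fault)] = action

    def lookup(self, stream_id, fault):
        # Fall back to the default when the stream has no specific policy.
        return self._per_stream.get((stream_id, fault), self._default)


table = FaultPolicyTable(default_action="kill_stream")
table.set_policy("stream-7", FaultType.LINK_FAILURE, "retry")
```

A stream without its own entry falls back to the default, so policy only has to be stated where it differs.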
Notes from Jan 8 ORNL Meeting
Tool agent failure (non-fatal response)
- kill all streams associated with the failed agent
- remove the agent from all streams and continue
- this is similar to adding and removing agents dynamically in a non-fault situation
- May result in data loss depending on communication reliability policy
- restart agent and "fix" streams
- (may be implemented as remove followed by add, asynchronously)
- May result in data loss depending on communication reliability policy
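The three responses to a tool agent failure can be sketched as operations on stream membership. The names here (`AgentFailureResponse`, `handle_agent_failure`, the `restart` callable) are assumptions made for illustration, and the restart case is done synchronously for brevity even though the notes allow an asynchronous remove followed by add.

```python
from enum import Enum


class AgentFailureResponse(Enum):
    KILL_STREAMS = "kill_streams"
    REMOVE_AGENT = "remove_agent"
    RESTART_AGENT = "restart_agent"


def handle_agent_failure(streams, failed_agent, response, restart=None):
    """Return new stream membership after `failed_agent` fails.

    `streams` maps a stream id to the set of agent ids on that stream.
    `restart` is a hypothetical callable returning the replacement agent id.
    """
    replacement = None
    if response is AgentFailureResponse.RESTART_AGENT:
        if restart is None:
            raise ValueError("restart response needs a restart callable")
        replacement = restart(failed_agent)

    updated = {}
    for stream_id, members in streams.items():
        if failed_agent not in members:
            updated[stream_id] = set(members)  # unaffected stream
            continue
        if response is AgentFailureResponse.KILL_STREAMS:
            continue  # the whole stream is dropped
        remaining = set(members) - {failed_agent}
        if replacement is not None:
            remaining.add(replacement)  # "fix" the stream with the new agent
        updated[stream_id] = remaining
    return updated
```

Whether in-flight data survives any of these responses still depends on the communication reliability policy; the sketch only tracks membership.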
Service Agent or Session Agent fails
Same as tool agent, but the policy is to always restart.
- restart
- same location if possible
- internal state information may have to be reconstructed
Tool junction fails
Similar to tool agent
- kill all streams
- restart junction
- May result in data loss depending on communication reliability policy
- Can't just remove the junction from the stream
- What would it mean to remove the junction? What would the topology be?
Front-end fails
- Continue running (i.e., treat as a FE disconnect)
- May result in data loss depending on communication reliability policy
- Kill session
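The responses discussed for each component type can be summarized as a small lookup table that rejects combinations the notes rule out, such as removing a junction from a stream. The component and response names below are hypothetical labels, not identifiers from the infrastructure.

```python
# Responses the notes allow for each kind of failed component.
ALLOWED_RESPONSES = {
    "tool_agent": {"kill_streams", "remove_from_streams", "restart"},
    "service_agent": {"restart"},             # policy is to always restart
    "session_agent": {"restart"},             # policy is to always restart
    "junction": {"kill_streams", "restart"},  # removal leaves the topology undefined
    "front_end": {"continue", "kill_session"},
}


def validate_response(component, response):
    """Return `response` if it is allowed for `component`, else raise ValueError."""
    allowed = ALLOWED_RESPONSES[component]
    if response not in allowed:
        raise ValueError(f"{response!r} is not allowed for {component}")
    return response
```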
Communication Failures
- Connection timeouts
- Data
- corruption
- dropped packets
- Policy
- unreliable, unordered
- unreliable, ordered
- reliable
- data is not lost as long as the connection or a junction/agent doesn't fail
- data is not lost even if the connection or a junction/agent fails
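To make the policy levels concrete, the sketch below encodes them as an enumeration and answers the question "may data be lost?" for each level. The enum and function names are assumptions; the two reliable levels correspond to the two sub-cases listed above.

```python
from enum import Enum


class Reliability(Enum):
    UNRELIABLE_UNORDERED = 1
    UNRELIABLE_ORDERED = 2
    RELIABLE = 3                 # no loss while connections, junctions, and agents stay up
    RELIABLE_FAULT_TOLERANT = 4  # no loss even if one of them fails


def may_lose_data(policy, component_failed):
    """Return True if data loss is permitted under `policy`.

    `component_failed` is True when a connection, junction, or agent has failed.
    """
    if policy is Reliability.RELIABLE_FAULT_TOLERANT:
        return False
    if policy is Reliability.RELIABLE:
        return component_failed
    return True  # unreliable levels tolerate dropped packets at any time
```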
Shutdown
Normal
- Shutdown infrastructure
- Shutdown Session
- Stream
- Junction
- Shut down agents that were started by this session (unless they are part of another session).
- Stream
- Shutdown Session
The shutdown is done asynchronously, e.g., all streams are shut down at the same time.
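A minimal sketch of an asynchronous session shutdown, using Python's `asyncio` as a stand-in for the real runtime. It assumes the order streams, then junctions, then agents owned by this session; the function names and the `log` list are hypothetical, and `asyncio.sleep(0)` only marks where real teardown work would go.

```python
import asyncio


async def shut_down(name, log):
    await asyncio.sleep(0)  # placeholder for the real teardown work
    log.append(name)


async def shut_down_session(streams, junctions, owned_agents, log):
    # Components of the same kind are shut down concurrently.
    await asyncio.gather(*(shut_down(s, log) for s in streams))
    await asyncio.gather(*(shut_down(j, log) for j in junctions))
    # Only agents started by this session and not shared with another one.
    await asyncio.gather(*(shut_down(a, log) for a in owned_agents))
    log.append("session")


log = []
asyncio.run(shut_down_session(["s1", "s2"], ["j1"], ["a1"], log))
```

Each `gather` call acts as a barrier, so no junction is torn down until every stream has finished shutting down.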
Abnormal Shutdown of Session (Infrastructure still running)
Preserve:
- std out/err
Scenarios:
- Agent dies
- Agent/FE calls abort
- Agent calls stci_abort()
- bcast "terminate" on control channel
- to all agents in session
- to FE
- session agents
- When agents get "terminate" signal
- flush std out/err to session agent
- Session agents wait for agents to shut down, then kill any that do not
- then flush std out/err stream
- When IO stream is flushed, tear down session agents/streams/etc
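The abnormal shutdown sequence above can be walked through with a toy model: broadcast "terminate", let responsive agents flush their output to the session agent, kill the ones that never exit, and hand back the preserved output. Everything here (`Agent`, `abnormal_shutdown`, the `responsive` flag) is a hypothetical simplification, and the wait with timeout is collapsed into a single state check.

```python
class Agent:
    def __init__(self, name, responsive=True):
        self.name = name
        self.responsive = responsive
        self.buffered = [f"{name}: pending output"]
        self.state = "running"

    def on_terminate(self, session_out):
        if not self.responsive:
            return  # a hung agent never reacts to the signal
        session_out.extend(self.buffered)  # flush std out/err to the session agent
        self.buffered.clear()
        self.state = "exited"


def abnormal_shutdown(agents):
    session_out = []
    # Broadcast "terminate" on the control channel.
    for agent in agents:
        agent.on_terminate(session_out)
    # A real session agent would wait here with a timeout; in this sketch
    # any agent that has not exited is treated as having timed out.
    for agent in agents:
        if agent.state != "exited":
            agent.state = "killed"
    # Flush the session's std out/err stream before tearing everything down.
    return list(session_out)
```

Output buffered inside an agent that had to be killed is not recovered in this model; only what reached the session agent is preserved.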
Agents should time out after being disconnected from the infrastructure. The action taken on timeout depends on policy:
- Kill agent
- Stay around
- This may result in orphaned processes (is this needed?)
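The disconnect timeout can be expressed as a small decision function. The names (`DisconnectPolicy`, `action_after_disconnect`) and the returned action strings are assumptions made for illustration.

```python
from enum import Enum


class DisconnectPolicy(Enum):
    KILL_AGENT = "kill"
    STAY_AROUND = "stay"


def action_after_disconnect(disconnected_for, timeout, policy):
    """Decide what a disconnected agent does.

    `disconnected_for` and `timeout` are in seconds.
    Returns "wait", "exit", or "keep_running".
    """
    if disconnected_for < timeout:
        return "wait"  # still inside the grace period
    if policy is DisconnectPolicy.KILL_AGENT:
        return "exit"
    return "keep_running"  # may leave an orphaned process behind
```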