Fault Manager

Provides an API for managing fault policy. Fault policy can be provided on a per-stream basis.

Fault types

  • Infrastructure bootstrap faults
  • Startup faults
  • Link failures (temporary or permanent)
  • Data loss
  • Data corruption
  • Process failure
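
As a rough illustration of a per-stream fault policy, the sketch below maps each fault type above to an action, with an optional override per stream. The names `FaultManager`, `Action`, `set_policy` and `action_for`, and the fallback to `KILL_STREAM`, are assumptions made for this example, not part of any existing API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class FaultType(Enum):
    """Fault types taken from the list above."""
    BOOTSTRAP = auto()         # infrastructure bootstrap fault
    STARTUP = auto()
    LINK_FAILURE = auto()      # temporary or permanent
    DATA_LOSS = auto()
    DATA_CORRUPTION = auto()
    PROCESS_FAILURE = auto()


class Action(Enum):
    """Hypothetical responses; the notes do not fix this set."""
    IGNORE = auto()
    KILL_STREAM = auto()
    RESTART = auto()


@dataclass
class FaultManager:
    default: dict = field(default_factory=dict)     # FaultType -> Action
    per_stream: dict = field(default_factory=dict)  # stream id -> {FaultType: Action}

    def set_policy(self, fault, action, stream_id=None):
        """Set the default policy, or a per-stream override if stream_id is given."""
        if stream_id is None:
            self.default[fault] = action
        else:
            self.per_stream.setdefault(stream_id, {})[fault] = action

    def action_for(self, stream_id, fault):
        """Per-stream policy wins; otherwise the default; otherwise KILL_STREAM (assumed)."""
        stream_policy = self.per_stream.get(stream_id, {})
        if fault in stream_policy:
            return stream_policy[fault]
        return self.default.get(fault, Action.KILL_STREAM)
```

A stream with no override of its own falls back to the default policy, so only streams with special requirements need explicit entries.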

Notes from Jan 8 ORNL Meeting

Tool agent failure (non-fatal response)

  • kill all streams associated with the failed agent
  • remove the agent from all streams and continue
    • this is similar to adding and removing agents dynamically in a non-fault situation
    • May result in data loss depending on communication reliability policy
  • restart agent and "fix" streams
    • May be implemented as a remove followed by an add, performed asynchronously
    • May result in data loss depending on communication reliability policy
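
The three responses above can be sketched as operations on a stream membership table. This is only a model: streams are represented as sets of agent ids, and `handle_agent_failure`, `AgentFailurePolicy` and the `restart` callback are hypothetical names introduced here.

```python
from enum import Enum, auto


class AgentFailurePolicy(Enum):
    KILL_STREAMS = auto()    # kill all streams associated with the failed agent
    REMOVE_AGENT = auto()    # remove the agent from all streams and continue
    RESTART_AGENT = auto()   # restart the agent and "fix" the streams


def handle_agent_failure(streams, failed, policy, restart=None):
    """Return a new {stream_id: set(agent_ids)} table; the input is not modified.

    restart is a callable taking the failed agent id and returning the id of
    its replacement. It is only used for RESTART_AGENT.
    """
    affected = {sid for sid, members in streams.items() if failed in members}

    if policy is AgentFailurePolicy.KILL_STREAMS:
        return {sid: set(m) for sid, m in streams.items() if sid not in affected}

    # Both remaining policies start by removing the failed agent.
    updated = {sid: set(m) - {failed} for sid, m in streams.items()}

    if policy is AgentFailurePolicy.RESTART_AGENT:
        # Remove followed by add: the replacement rejoins every affected stream.
        replacement = restart(failed)
        for sid in affected:
            updated[sid].add(replacement)
    return updated
```

In this model the restart case reuses the remove path, which mirrors the note that a restart may be implemented as a remove followed by an add.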

Service Agent or Session Agent fails

Same as tool agent, but the policy is to always restart.

  • restart
    • same location if possible
  • internal state information may have to be reconstructed

Tool junction fails

Similar to tool agent

  • kill all streams
  • restart junction
    • May result in data loss depending on communication reliability policy
  • Can't just remove the junction from the stream
    • What would it mean to remove the junction? What would the topology be?

Front-end fails

  • Continue running (i.e., treat as a FE disconnect)
    • May result in data loss depending on communication reliability policy
  • Kill session

Communication Failures

  • Connection timeouts
  • Data
    • corruption
    • dropped packets
  • Policy
    • unreliable, unordered
    • unreliable, ordered
    • reliable
      • data is not lost as long as the connection or a junction/agent doesn't fail
      • data is not lost even if the connection or a junction/agent fails
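
One possible reading of the policy list above is four reliability levels, with the two sub-items under "reliable" treated as separate guarantees. The sketch below encodes that reading and answers the recurring question "may this result in data loss?". The enum and function names are assumptions for illustration.

```python
from enum import Enum


class Reliability(Enum):
    UNRELIABLE_UNORDERED = "unreliable, unordered"
    UNRELIABLE_ORDERED = "unreliable, ordered"
    # Data is not lost as long as the connection or a junction/agent doesn't fail.
    RELIABLE_WHILE_CONNECTED = "reliable, no component failure"
    # Data is not lost even if the connection or a junction/agent fails.
    RELIABLE_ACROSS_FAILURE = "reliable, survives component failure"


def may_lose_data(policy, component_failed):
    """True if data loss is possible under this policy.

    component_failed: whether a connection, junction or agent has failed.
    """
    if policy in (Reliability.UNRELIABLE_UNORDERED, Reliability.UNRELIABLE_ORDERED):
        return True                   # no delivery guarantee at all
    if policy is Reliability.RELIABLE_WHILE_CONNECTED:
        return component_failed       # guarantee holds only without failures
    return False                      # RELIABLE_ACROSS_FAILURE
```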

Shutdown

Normal

  • Shut down infrastructure
    • Shut down session
      • Stream
        • Junction
      • Shut down agents that were started by this session (unless they are part of another session).

The shutdown is done asynchronously, e.g., all streams are shut down at the same time.
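
A minimal sketch of this ordering using Python's `asyncio`, assuming the nesting above means that a session's streams (and each stream's junctions) are shut down first, then the agents the session started. Sessions and streams are plain dictionaries, and the `log` list stands in for the real shutdown work. All of these names are hypothetical.

```python
import asyncio


async def shutdown_junction(junction, log):
    await asyncio.sleep(0)                 # placeholder for real teardown work
    log.append(("junction", junction))


async def shutdown_stream(stream, log):
    # All junctions of a stream are shut down concurrently.
    await asyncio.gather(*(shutdown_junction(j, log) for j in stream["junctions"]))
    log.append(("stream", stream["id"]))


async def shutdown_session(session, agents_in_other_sessions, log):
    # All streams of the session are shut down at the same time.
    await asyncio.gather(*(shutdown_stream(s, log) for s in session["streams"]))
    # Only agents started by this session are shut down, and only if no
    # other session is still using them.
    for agent in session["started_agents"]:
        if agent not in agents_in_other_sessions:
            log.append(("agent", agent))
    log.append(("session", session["id"]))
```

Running `asyncio.run(shutdown_session(session, shared_agents, log))` drives the whole sequence. `asyncio.gather` is what lets sibling streams proceed concurrently while each stream still waits for its own junctions.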

Abnormal Shutdown of Session (Infrastructure still running)

Preserve:

  • std out/err

Scenarios:

  • Agent dies
  • Agent/FE calls abort
    • Agent calls stci_abort()
    • bcast "terminate" on control channel
      • to all agents in session
      • to FE
      • session agents
  • When agents get "terminate" signal
    • flush std out/err to session agent
  • Session agents wait for agents to shut down then kill if they don't
    • then flush std out/err stream
  • When IO stream is flushed, tear down session agents/streams/etc
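
The abort sequence above can be modelled as a small simulation. This is a sketch under simplifying assumptions: the `Agent` and `SessionAgent` classes are hypothetical, the "terminate" broadcast is a direct method call, and the timed wait is collapsed into a check of whether each agent has exited.

```python
class Agent:
    def __init__(self, name, output, responsive=True):
        self.name = name
        self.output = list(output)     # buffered std out/err lines
        self.responsive = responsive   # False models a hung agent
        self.state = "running"

    def on_terminate(self, session_agent):
        if not self.responsive:
            return                     # a hung agent ignores the signal
        # Flush std out/err to the session agent, then exit.
        session_agent.collected.extend(self.output)
        self.output.clear()
        self.state = "exited"


class SessionAgent:
    def __init__(self, agents):
        self.agents = agents
        self.collected = []            # output received from agents

    def abort(self):
        """Run the abort sequence and return the flushed std out/err."""
        for agent in self.agents:      # broadcast "terminate"
            agent.on_terminate(self)
        for agent in self.agents:      # wait (elided here), then kill stragglers
            if agent.state != "exited":
                agent.state = "killed"
        flushed, self.collected = self.collected, []
        return flushed                 # after this, the session can be torn down
```

Output buffered by an agent that has to be killed is lost in this model, so only output flushed before the kill is preserved.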

Agents should timeout after being disconnected from the infrastructure. Action on timeout is dependent on policy:

  • Kill agent
  • Stay around
    • This may result in orphaned processes (is this needed?)
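
The timeout decision can be sketched as a pure function of the disconnect time and the configured policy. `OnTimeout`, `agent_state` and the returned state strings are illustrative names only.

```python
from enum import Enum, auto


class OnTimeout(Enum):
    KILL_AGENT = auto()
    STAY_AROUND = auto()


def agent_state(disconnected_at, now, timeout_s, policy):
    """Return the agent's state given when it lost contact with the infrastructure.

    disconnected_at: time of disconnect in seconds, or None if still connected.
    """
    if disconnected_at is None or now - disconnected_at < timeout_s:
        return "running"               # connected, or timeout not yet reached
    if policy is OnTimeout.KILL_AGENT:
        return "killed"
    return "orphaned"                  # STAY_AROUND leaves the process alive
```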

Abnormal Shutdown of Infrastructure
