9 March 2015
CITEC 1.015

Murphy Was An Optimist

Safety-critical systems rarely fail in a way their designers
anticipated, such as redundancy exhaustion.  NASA's C. Michael
Holloway observed: "To a first approximation, we can say that accidents
are almost always the result of incorrect estimates of the likelihood of
one or more things."  This presentation explores the factors that lead
designers to underestimate the probability of various failures, and it
offers help in identifying failure modes through examples of rare
failures that have actually occurred.  The examples include Byzantine
faults, component transmogrifications, "evaporating" software, and
exhaustively tested software that still failed.

Several examples will be given for Byzantine failures because these
failures are not known or not understood by many designers, but occur
often enough that they must be considered for any safety-critical system
design.  A Byzantine failure is one in which a fault produces different
symptoms to different observers and the observers must agree on their
observations.  The latter condition holds for all known systems that use
redundancy.  Non-Byzantine examples will include scenarios that most
designers would dismiss with "that can't happen" until hindsight
explanations show that they not only can happen, but are likely to
happen with probabilities higher than safety-critical system
requirements allow.
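
The definition of a Byzantine failure given above can be made concrete
with a small sketch.  It is an illustration only and is not taken from
the presentation: it assumes that one degraded driver holds a shared
wire at a marginal voltage, and that three redundant receivers have
slightly different input thresholds.  The receiver names, thresholds,
and voltages are all hypothetical.

```python
# Hypothetical illustration: names, thresholds, and voltages are assumptions.

def read_bit(voltage, threshold):
    """Interpret an analog voltage as a bit using one receiver's threshold."""
    return 1 if voltage >= threshold else 0

def observe(voltage, thresholds):
    """Return the bit each receiver reads from the same voltage on the wire."""
    return {name: read_bit(voltage, t) for name, t in thresholds.items()}

def agree(observations):
    """True if every receiver read the same bit."""
    return len(set(observations.values())) == 1

# Three redundant receivers whose thresholds differ slightly (assumed values).
THRESHOLDS = {"A": 1.4, "B": 1.5, "C": 1.6}

healthy = observe(3.3, THRESHOLDS)    # clean logic high: all receivers read 1
marginal = observe(1.55, THRESHOLDS)  # single fault: marginal level on the wire

print("healthy :", healthy, "agree:", agree(healthy))
print("marginal:", marginal, "agree:", agree(marginal))
```

With the marginal voltage, receivers A and B read 1 while C reads 0, so
a single fault shows different symptoms to different observers.  If the
three receivers must then agree on what was sent, simply exchanging and
voting on their readings is not guaranteed to help, because the same
kind of fault can also corrupt the exchange.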