Introduction To Real-time Systems And Fault-tolerance

When a component of a computer system fails, it will usually produce some undesirable effects and it can be said to no longer behave according to its specification. Such a breakdown of a component is called a fault and its consequence is called a failure. A fault may occur sporadically, or it may be stable and cause the component to fail permanently. Even when a fault occurs instantaneously, a fault such as a memory fault may have consequences that manifest themselves after a considerable time.

Fault-tolerance is the ability of a system to function correctly despite the occurrence of faults. Faults caused by errors (or ‘bugs’) in software are systematic and can be reproduced in the right conditions. The formal methods described in previous chapters address the problem of errors in software and, while their use does not guarantee the absence of software errors, they do provide the means ofmaking a rigorous, additional check. Hardware errors may also be systematic but in addition they can have random causes. The fact that a hardware component functions correctly at some time is no guarantee of flawless future behaviour;  Note that hardware faults often affect the correct behaviour of software. One of the reasons for introducing dynamic scheduling is to deal with the unexpected computational load imposed when faults do occur.

Of course, it is not possible to tolerate every fault. A failure hypothesis stipulates how faults affect the behaviour of a system. An example of a failure hypothesis is the assumption that a communication medium might corrupt messages. With triple modular redundancy, a single component is replaced by three replicas and a voter that determines the outcome, and the failure hypothesis is that at any time at most one replica is affected by faults. Afailure hypothesis divides abnormal behaviour, i.e. behaviour that does not conform to the specification, into exceptional and catastrophic behaviours. Exceptional behaviour conforms to the failure hypothesis and must be tolerated, but no attempt need be made to handle catastrophic behaviour (and, indeed, no attemptmay be possible). For example, if the communication medium mentioned earlier repeatedly sends the same message, then this may be catastrophic for a given fault-tolerance scheme. It is important to
note that ‘normal’ behaviour does not mean ‘perfect’ behaviour: after a time-out occurs, the retransmission of a message by a sender is normal but it may result in two copies of the same message reaching its destination. Exceptional and normal behaviours together form the acceptable behaviour that the system must tolerate.

We shall use the compositional proof method of Chapter 5 for reasoning about acceptable behaviour, and the failure hypothesis of a system will be formalized as a relation between its normal and acceptable behaviour. Such a relation will allow us to abstract from the precise nature and occurrence of faults and focus on the abnormal behaviour that might be caused. This will lead us to a proof rule by which a specification of the acceptable behaviour can be obtained from the specification of the normal behaviour and a predicate characterizing the failure hypothesis. Given a failure hypothesis c, Poc stands for ‘P under c’ and means execution of process P under the assumption c. The acceptable behaviour of process P under the failure hypothesis c is the normal behaviour of the failure prone process Poc.

Use of the method will be demonstrated on the mine pump problem. We shall describe how each component can be affected by malfunctions and then devise ways to tolerate the failures. Because we have to be particularly careful, shifts might be missed unnecessarily. However, we shall prove that the resulting system is safe, i.e. it will not cause an explosion.