What Does Fault Tolerance Mean? Get Its Information Now [MiniTool Wiki]
Fault Tolerance Definition
What is fault tolerance? Fault tolerance is the ability of a system (computer, network, cloud cluster, etc.) to continue operating without interruption even if one or more of its components fail. A fault-tolerant design lets the system keep performing its intended function, possibly at a reduced level, rather than failing completely when some part of it fails.
Within the scope of an individual system, fault tolerance can be achieved by anticipating abnormal conditions and building the system to cope with them. Generally speaking, the goal is self-stabilization, so that the system converges toward an error-free state.
However, if the consequences of a system failure are catastrophic, or the cost of making the system sufficiently reliable is very high, a better solution may be to use some form of replication. In any case, if the consequences of a failure are catastrophic, the system must be able to use reversion to fall back to a safe mode. This is similar to rollback recovery, but it can be a human action if people are in the loop.
Fault-tolerant computing may include several levels of fault tolerance:
- The lowest level: The ability to respond to power failures.
- Intermediate level: When the system fails, a backup system can be used immediately.
- Enhanced level: If a disk fails, a mirrored disk immediately takes over for it. The system keeps providing functionality despite partial failure, or degrades gracefully, instead of crashing immediately and losing functionality.
- High level: Multiple processors cooperate to scan data and output in order to detect errors, and then immediately correct them.
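One common way to realize the high level is to run the same computation on several replicas and take a majority vote on the results. The sketch below is only an illustration under that assumption; the function names (`majority_vote`, `run_redundantly`, `healthy`, `faulty`) are hypothetical and not taken from any particular product.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(outputs: Sequence[T]) -> T:
    """Return the value produced by a strict majority of replicas.

    Raises ValueError when no value has a strict majority, because the
    error can then be detected but not corrected.
    """
    if not outputs:
        raise ValueError("no replica outputs to compare")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("replicas disagree and no majority exists")
    return value


def run_redundantly(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def healthy(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    # Simulates a processor that returns a corrupted result.
    return x * x + 1


# Two healthy replicas outvote the faulty one, so the error is masked.
result = run_redundantly([healthy, faulty, healthy], 7)
print(result)
```

With three replicas, a single wrong output is both detected and corrected. With only two replicas a disagreement can be detected but not resolved, which is why the sketch raises an error in that case.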
A fault-tolerant system avoids service interruptions by using backup components that automatically take the place of failed components. These may include:
- Hardware systems that are backed up by identical or equivalent systems. For example, a server can be made fault-tolerant by running an identical server in parallel that mirrors all of its operations. By eliminating single points of failure, this redundant form of hardware fault tolerance can make any component or system more reliable.
- Software systems that are backed up by other software instances. For example, if a customer database is continuously replicated, operations on the primary database can be automatically redirected to the second database when the first one fails.
- Power sources that are backed up by alternative sources. If a backup power supply automatically takes over during a power failure, the system keeps running and no service is lost.
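The software case in the list above, where requests are redirected to a second instance, can be sketched as follows. This is a minimal illustration, not a production design: the instances are modeled as plain functions, and `query_with_failover`, `primary_db`, and `replica_db` are hypothetical names.

```python
from typing import Callable, Sequence


class AllInstancesFailed(Exception):
    """Raised when no instance could serve the request."""


def query_with_failover(
    instances: Sequence[Callable[[str], str]], request: str
) -> str:
    """Try each instance in order and return the first successful answer."""
    errors = []
    for instance in instances:
        try:
            return instance(request)
        except ConnectionError as exc:
            # This instance is unavailable; remember why and try the next one.
            errors.append(exc)
    raise AllInstancesFailed(f"{len(errors)} instance(s) failed")


def primary_db(request: str) -> str:
    # Simulates a primary database that is currently down.
    raise ConnectionError("primary is down")


def replica_db(request: str) -> str:
    return f"replica answered: {request}"


answer = query_with_failover([primary_db, replica_db], "SELECT 1")
print(answer)
```

The caller never sees the failure of the primary. It only sees an error when every instance in the list is unavailable.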
Fault Tolerance Requirements
Here are some basic requirements of fault tolerance:
- No single point of failure - If a component fails, the system must continue to operate without interruption during the repair process.
- Fault isolation to the failing component - When a failure occurs, the system must be able to isolate the failure to the offending component. This requires dedicated failure detection mechanisms that exist only for the purpose of fault isolation. Recovery from a fault condition requires classifying the fault or the failing component.
- Fault containment to prevent propagation of the failure - Some failure mechanisms can cause the whole system to fail by propagating the failure to the rest of the system. A firewall or other mechanism is needed to isolate a rogue transmitter or malfunctioning component and so protect the system.
- Availability of reversion modes.
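As an illustration of fault containment combined with a reversion mode, the sketch below uses a simple circuit breaker. The circuit breaker is a technique chosen here for the example, not one named above, and all class and function names are hypothetical. After a fixed number of consecutive failures the malfunctioning component is no longer called, and a fallback supplies a degraded response instead.

```python
from typing import Callable, List


class CircuitBreaker:
    """Stops calling a component after too many consecutive failures."""

    def __init__(
        self,
        call: Callable[[], str],
        fallback: Callable[[], str],
        max_failures: int = 3,
    ) -> None:
        self._call = call
        self._fallback = fallback
        self._max_failures = max_failures
        self._failures = 0

    @property
    def is_open(self) -> bool:
        """True once the component has been isolated."""
        return self._failures >= self._max_failures

    def request(self) -> str:
        if self.is_open:
            # Containment: the failing component is not touched at all.
            return self._fallback()
        try:
            result = self._call()
        except RuntimeError:
            self._failures += 1
            return self._fallback()  # reversion mode for this request
        self._failures = 0  # a success clears the failure streak
        return result

    def reset(self) -> None:
        """Close the breaker again, for example after a repair."""
        self._failures = 0


call_log: List[str] = []


def broken_component() -> str:
    call_log.append("called")
    raise RuntimeError("component malfunction")


def safe_mode() -> str:
    return "degraded response"


breaker = CircuitBreaker(broken_component, safe_mode, max_failures=3)
responses = [breaker.request() for _ in range(5)]
print(responses, len(call_log))
```

Five requests are answered, but the broken component is only invoked three times. After that the breaker keeps it isolated until `reset()` is called.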
Fault Tolerance Disadvantages
Although fault-tolerant design possesses many obvious advantages, there are some disadvantages:
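- Interference with fault detection in the same component.
- Interference with fault detection in another component.
- Reduced priority of fault correction.
- Testing difficulty.
- Inferior components: a fault-tolerant design may permit the use of lower-quality components that would otherwise make the system inoperable.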
Fault Tolerance Examples
Hardware fault tolerance sometimes requires removing damaged parts and replacing them with new parts while the system is still running (referred to as hot swapping in computing). A system implemented with a single backup is called single-point tolerant, and such systems represent the vast majority of fault-tolerant systems.
In such a system, the mean time between failures (MTBF) should be long enough for operators to repair the damaged equipment (the mean time to repair, MTTR) before the backup also fails. A longer time between failures helps, but it is not specifically required in a fault-tolerant system.
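The relationship between the average time between failures (often abbreviated MTBF) and the average repair time (MTTR) can be made concrete with a rough estimate. The sketch below assumes that the backup's lifetime is exponentially distributed with mean MTBF and that the repair takes exactly MTTR hours. Both are simplifying assumptions made for this example, and the numbers are hypothetical.

```python
import math


def backup_failure_probability(mtbf_hours: float, mttr_hours: float) -> float:
    """Probability that the backup fails before the repair is finished.

    Assumes an exponential lifetime with mean `mtbf_hours` and a fixed
    repair duration of `mttr_hours`.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return 1.0 - math.exp(-mttr_hours / mtbf_hours)


# Hypothetical figures: the backup fails on average every 10,000 hours.
quick_repair = backup_failure_probability(mtbf_hours=10_000, mttr_hours=8)
slow_repair = backup_failure_probability(mtbf_hours=10_000, mttr_hours=240)

print(f"8-hour repair:   {quick_repair:.4%}")
print(f"240-hour repair: {slow_repair:.4%}")
```

Under these assumptions, the risk of losing both the primary and the backup grows with the ratio of repair time to time between failures, which is why a short repair time matters as much as a long time between failures.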
Fault tolerance has been particularly successful in computer applications. Tandem Computers built its entire business on such machines, using single-point tolerance to create NonStop systems with uptimes measured in years.
Fail-safe architectures may also encompass the computer software, for example through process replication. Data formats can also be designed to degrade gracefully. For example, HTML is designed to be forward compatible, allowing web browsers to ignore new HTML elements they do not understand without rendering the document unusable.
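The HTML behavior described above can be imitated with Python's standard `html.parser` module. In this sketch, `KNOWN_TAGS` is a hypothetical set standing in for the tags an older browser understands, and `TolerantRenderer` and `render` are illustrative names. Unknown tags are skipped, but the text inside them is kept, so the document remains usable.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Hypothetical set of tags that the "old" renderer understands.
KNOWN_TAGS = {"p", "b", "i"}


class TolerantRenderer(HTMLParser):
    """Keeps known tags and all text, and silently skips unknown tags."""

    def __init__(self) -> None:
        super().__init__()
        self.output: List[str] = []
        self.ignored: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in KNOWN_TAGS:
            self.output.append(f"<{tag}>")
        else:
            self.ignored.append(tag)  # unknown tag: record it and move on

    def handle_endtag(self, tag):
        if tag in KNOWN_TAGS:
            self.output.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always preserved, even inside unknown tags.
        self.output.append(data)


def render(document: str) -> Tuple[str, List[str]]:
    """Return the rendered document and the list of ignored tag names."""
    renderer = TolerantRenderer()
    renderer.feed(document)
    renderer.close()
    return "".join(renderer.output), renderer.ignored


text, ignored = render("<p>Hello <newtag>brave</newtag> <b>world</b></p>")
print(text)
print(ignored)
```

The unknown `<newtag>` element is dropped while its text content survives, which is the kind of graceful degradation that forward compatibility aims for.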