Man has more than twice the power
that he needs to support himself
—Leonardo da Vinci

In this final chapter, we briefly review techniques and concepts in fault-tolerant computing. Then we sketch the design of a fault-tolerant parallel computer, the HPC ("hypercube parallel computer"), based on the results and ideas of the previous chapters.
Introduction
A fault-free computer, or indeed any fault-free human artifact, has never been built and never will be. No matter how reliable each component is, there is always a possibility, however small, that it will go wrong. Statistical principles dictate that, other things being equal, this possibility increases as the number of components increases. Such an event, if not anticipated and safeguarded against, will eventually make the computer malfunction, leading to anything from minor annoyance and inconvenience to disaster.
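To make the statistical point concrete, consider a simple series model (the figures below are purely illustrative and not taken from the text): if each of n components works correctly with independent probability r, and all of them must work for the system to work, then the system reliability is

\[
R_{\text{sys}} = r^{\,n},
\qquad\text{e.g.}\quad r = 0.999,\ n = 1000
\;\Longrightarrow\; R_{\text{sys}} = 0.999^{1000} \approx 0.37,
\]

so a system built from even highly reliable parts, with enough of them, fails more often than not over the same interval unless redundancy is introduced.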
The same enormous decrease in hardware cost that has made parallel computers economically feasible has also made fault tolerance more affordable [297]. In other words, the low cost of hardware makes possible both a high degree of fault tolerance through redundancy and high performance. Indeed, most fault-tolerant computers today employ multiple processors; see [241, 254, 317] for good surveys.
It is against this background that we take the extra step of designing a hypercube parallel computer (HPC for short). In the HPC, processors are grouped into logical clusters of physically close processors, and each program execution is replicated at all members of a cluster; clusters overlap, however. The concept of a cluster, whether logical or physical, introduces a two-level rather than flat organization and can be found in, for example, the Cm* [344], Cedar [187], and FTPP ("Fault Tolerant Parallel Processor") [148] computers.
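As a rough illustration of how overlapping clusters of physically close processors might look on a hypercube, the C sketch below takes a cluster to be a processor together with its d nearest neighbours (those whose addresses differ in one bit) and resolves the replicated results of one cluster by majority vote. The cluster rule, the voting step, and all names here are assumptions made for illustration only; they are not the HPC design, which is developed in the rest of the chapter.

/* Sketch: overlapping clusters on a d-dimensional hypercube.
 * Assumption (illustrative only): a cluster is a node together with
 * its d Hamming-distance-1 neighbours, so every node belongs to
 * d + 1 overlapping clusters.
 */
#include <stdio.h>

#define D 3                  /* hypercube dimension (assumed)        */
#define N (1 << D)           /* number of processors                 */
#define CSIZE (D + 1)        /* cluster size: node plus d neighbours */

/* Fill cluster[] with the members of the cluster centred at node. */
static void cluster_members(int node, int cluster[CSIZE])
{
    cluster[0] = node;                     /* the node itself        */
    for (int i = 0; i < D; i++)
        cluster[i + 1] = node ^ (1 << i);  /* flip bit i: a neighbour */
}

/* Majority vote over the (possibly faulty) replicated results of one
 * cluster: return the value reported by more than half the members,
 * or -1 if no majority exists.                                      */
static int majority(const int result[CSIZE])
{
    for (int i = 0; i < CSIZE; i++) {
        int votes = 0;
        for (int j = 0; j < CSIZE; j++)
            if (result[j] == result[i])
                votes++;
        if (2 * votes > CSIZE)
            return result[i];
    }
    return -1;
}

int main(void)
{
    int cluster[CSIZE];
    cluster_members(5, cluster);            /* cluster around node 101 */
    printf("cluster(5):");
    for (int i = 0; i < CSIZE; i++)
        printf(" %d", cluster[i]);
    printf("\n");

    int result[CSIZE] = { 42, 42, 7, 42 };  /* one member misbehaved   */
    printf("voted result: %d\n", majority(result));
    return 0;
}

Because every node lies in d + 1 such clusters, the failure of a single processor leaves each cluster it belongs to with a working majority, which is the intuition behind replicating execution across cluster members.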