Design Issues in Distributed Systems
The designers of a distributed system must take a number of design challenges into account. The system should be robust so that it can withstand failures. The system should also be transparent to users in terms of both file location and user mobility. Finally, the system should be scalable to allow the addition of more computation power, more storage, or more users. We briefly introduce these issues here. In the next section, we put them in context when we describe the designs of specific distributed file systems.
Robustness
A distributed system may suffer from various types of hardware failure. The failure of a link, a host, or a site and the loss of a message are the most common types. To ensure that the system is robust, we must detect any of these failures, reconfigure the system so that computation can continue, and recover when the failure is repaired.
A system can be fault tolerant in that it can tolerate a certain level of failure and continue to function normally. The degree of fault tolerance depends on the design of the distributed system and the specific fault. Obviously, more fault tolerance is better.
We use the term fault tolerance in a broad sense. Communication faults, certain machine failures, storage-device crashes, and decays of storage media should all be tolerated to some extent. A fault-tolerant system should continue to function, perhaps in a degraded form, when faced with such failures. The degradation can affect performance, functionality, or both. It should be proportional, however, to the failures that caused it. A system that grinds to a halt when only one of its components fails is certainly not fault tolerant.
Unfortunately, fault tolerance can be difficult and expensive to implement. At the network layer, multiple redundant communication paths and network devices such as switches and routers are needed to avoid a communication failure. A storage failure can cause loss of the operating system, applications, or data. Storage units can include redundant hardware components that automatically take over from each other in case of failure. In addition, RAID systems can ensure continued access to the data even in the event of one or more storage device failures (Section 11.8).
Failure Detection
In an environment with no shared memory, we generally cannot differentiate among link failure, site failure, host failure, and message loss. We can usually detect only that one of these failures has occurred. Once a failure has been detected, appropriate action must be taken. What action is appropriate depends on the particular application.
To detect link and site failure, we use a heartbeat procedure. Suppose that sites A and B have a direct physical link between them. At fixed intervals, the sites send each other an I-am-up message. If site A does not receive this message within a predetermined time period, it can assume that site B has failed, that the link between A and B has failed, or that the message from B has been lost. At this point, site A has two choices. It can wait for another time period to receive an I-am-up message from B, or it can send an Are-you-up? message to B.
If time goes by and site A still has not received an I-am-up message, or if site A has sent an Are-you-up? message and has not received a reply, the procedure can be repeated. Again, the only conclusion that site A can draw safely is that some type of failure has occurred.
Site A can try to differentiate between link failure and site failure by sending an Are-you-up? message to B by another route (if one exists). If and when B receives this message, it immediately replies positively. This positive reply tells A that B is up and that the failure is in the direct link between them. Since we do not know in advance how long it will take the message to travel from A to B and back, we must use a time-out scheme. At the time A sends the Are-you-up? message, it specifies a time interval during which it is willing to wait for the reply from B. If A receives the reply message within that time interval, then it can safely conclude that B is up. If not, however (that is, if a time-out occurs), then A may conclude only that one or more of the following situations has occurred:
• Site B is down.
• The direct link (if one exists) from A to B is down.
• The alternative path from A to B is down.
• The message has been lost. (The use of a reliable transport protocol such as TCP should, however, eliminate this concern.)
Site A cannot, however, determine which of these events has occurred.
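The detection logic just described amounts to a timeout-driven monitor. The sketch below illustrates it in Python under simplifying assumptions: UDP transport, a single peer, and hypothetical local addresses and intervals. Note that on a timeout the monitor records only that some failure has occurred; as discussed, it cannot tell which.

    import socket
    import threading
    import time

    PEER = ("127.0.0.1", 9999)   # hypothetical address of site B
    INTERVAL = 1.0               # seconds between I-am-up messages
    TIMEOUT = 3 * INTERVAL       # silence longer than this raises suspicion

    last_heard = time.monotonic()

    def send_heartbeats(sock):
        # Site A's half of the heartbeat procedure: announce "I am up"
        # to site B at fixed intervals.
        while True:
            sock.sendto(b"I-am-up", PEER)
            time.sleep(INTERVAL)

    def receive_heartbeats(sock):
        # Record the arrival time of each heartbeat received from B.
        global last_heard
        while True:
            data, _ = sock.recvfrom(64)
            if data == b"I-am-up":
                last_heard = time.monotonic()

    def monitor():
        # On a timeout we may conclude only that some failure occurred:
        # B is down, a link is down, or the message was lost.
        while True:
            if time.monotonic() - last_heard > TIMEOUT:
                print("suspect failure: site, link, or message loss")
            time.sleep(INTERVAL)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 9998))   # hypothetical address of site A
    threading.Thread(target=send_heartbeats, args=(sock,), daemon=True).start()
    threading.Thread(target=receive_heartbeats, args=(sock,), daemon=True).start()
    threading.Thread(target=monitor, daemon=True).start()
    time.sleep(10)                   # run the demonstration briefly

A production detector would also retry over an alternate route, as described above, before reporting a suspected failure.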
Reconfiguration
Suppose that site A has discovered, through the mechanism just described, that a failure has occurred. It must then initiate a procedure that will allow the system to reconfigure and to continue its normal mode of operation.
• If a direct link from A to B has failed, this information must be broadcast to every site in the system, so that the various routing tables can be updated accordingly.
• If the system believes that a site has failed (because that site can no longer be reached), then all sites in the system must be notified, so that they will no longer attempt to use the services of the failed site. The failure of a site that serves as a central coordinator for some activity (such as deadlock detection) requires the election of a new coordinator. Note that, if the site has not failed (that is, if it is up but cannot be reached), then we may have the undesirable situation in which two sites serve as the coordinator. When the network is partitioned, the two coordinators (each for its own partition) may initiate conflicting actions. For example, if the coordinators are responsible for implementing mutual exclusion, we may have a situation in which two processes are executing simultaneously in their critical sections.
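To make the coordinator problem concrete, here is a minimal election sketch in the style of the bully algorithm, in which the highest-numbered site that can still be reached becomes the new coordinator. Reachability is simulated with a set, and the site identifiers are hypothetical; a real implementation would probe sites over the network. Running it on the two halves of a partition reproduces exactly the conflicting-coordinators hazard just described.

    SITES = [1, 2, 3, 4, 5]          # hypothetical site identifiers

    def elect(reachable):
        # Bully-style rule: the highest-numbered reachable site
        # becomes the new coordinator.
        candidates = [s for s in SITES if s in reachable]
        return max(candidates) if candidates else None

    # Site 5 (the old coordinator) fails; the survivors elect site 4.
    print(elect({1, 2, 3, 4}))       # -> 4

    # A network partition: each half elects its own coordinator, so
    # two coordinators now exist, as described above.
    print(elect({1, 2}))             # -> 2
    print(elect({3, 4}))             # -> 4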
Recovery from Failure
When a failed link or site is repaired, it must be integrated into the system gracefully and smoothly.
• Suppose that a link between A and B has failed. When it is repaired, both A and B must be notified. We can accomplish this notification by continuously repeating the heartbeat procedure described in Section 19.5.1.1.
• Suppose that site B has failed. When it recovers, it must notify all other sites that it is up again. Site B then may have to receive information from the other sites to update its local tables. For example, it may need routing-table information, a list of sites that are down, undelivered messages, a transaction log of unexecuted transactions, and mail. If the site has not failed but simply cannot be reached, then it still needs this information.
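A recovering site's rejoin step might look like the following sketch. The Peer class and its two methods are hypothetical stand-ins for whatever remote-invocation mechanism the system provides; the point is the two-phase shape of recovery: announce that we are up, then pull the state that accumulated in our absence.

    class Peer:
        # Hypothetical stand-in for a remote site; a real system
        # would make these calls over RPC.
        def notify_up(self, site_id):
            print(f"learned that site {site_id} is up again")

        def snapshot_for(self, site_id):
            return {"routing": {site_id: "direct"},
                    "down_sites": set(),
                    "mail": []}

    def rejoin(site_id, peers):
        # Phase 1: tell every other site that we are up again.
        for peer in peers:
            peer.notify_up(site_id)
        # Phase 2: refresh our local tables from the others.
        state = {"routing": {}, "down_sites": set(), "mail": []}
        for peer in peers:
            snapshot = peer.snapshot_for(site_id)
            state["routing"].update(snapshot["routing"])
            state["down_sites"] |= snapshot["down_sites"]
            state["mail"].extend(snapshot["mail"])
        return state

    print(rejoin("B", [Peer(), Peer()]))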
Transparency
Making the multiple processors and storage devices in a distributed system transparent to the users has been a key challenge to many designers. Ideally, a distributed system should look to its users like a conventional, centralized system. The user interface of a transparent distributed system should not distinguish between local and remote resources. That is, users should be able to access remote resources as though these resources were local, and the distributed system should be responsible for locating the resources and for arranging for the appropriate interaction.
Another aspect of transparency is user mobility. It would be convenient to allow users to log into any machine in the system rather than forcing them to use a specific machine. A transparent distributed system facilitates user mobility by bringing over a user’s environment (for example, home directory) to wherever he logs in. Protocols like LDAP provide an authentication system for local, remote, and mobile users. Once the authentication is complete, facilities like desktop virtualization allow users to see their desktop sessions at remote facilities.
Scalability
Still another issue is scalability—the capability of a system to adapt to increased service load. Systems have bounded resources and can become completely saturated under increased load. For example, with respect to a file system, saturation occurs either when a server’s CPU runs at a high utilization rate or when disks’ I/O requests overwhelm the I/O subsystem. Scalability is a relative property, but it can be measured accurately. A scalable system reacts more gracefully to increased load than does a nonscalable one. First, its performance degrades more moderately; and second, its resources reach a saturated state later. Even a perfect design, however, cannot accommodate an ever-growing load. Adding new resources might solve the problem, but it might generate additional indirect load on other resources (for example, adding machines to a distributed system can clog the network and increase service loads). Even worse, expanding the system can call for expensive design modifications. A scalable system should have the potential to grow without these problems. In a distributed system, the ability to scale up gracefully is of special importance, since expanding a network by adding new machines or interconnecting two networks is commonplace. In short, a scalable design should withstand high service load, accommodate growth of the user community, and allow simple integration of added resources.
Scalability is related to fault tolerance, discussed earlier. A heavily loaded component can become paralyzed and behave like a faulty component. In addition, shifting the load from a faulty component to that component’s backup can saturate the latter. Generally, having spare resources is essential for ensuring reliability as well as for handling peak loads gracefully. Thus, the multiple resources in a distributed system represent an inherent advantage, giving the system a greater potential for fault tolerance and scalability. However, inappropriate design can obscure this potential. Fault-tolerance and scalability considerations call for a design demonstrating distribution of control and data.
Scalability can also be related to efficient storage schemes. For example, many cloud storage providers use compression or deduplication to cut down on the amount of storage used. Compression reduces the size of a file. For example, a zip archive file can be generated out of a file (or files) by executing a zip command, which runs a lossless compression algorithm over the data specified. (Lossless compression allows original data to be perfectly reconstructed from compressed data.) The result is a file archive that is smaller than the uncompressed file. To restore the file to its original state, a user runs some sort of unzip command over the zip archive file. Deduplication seeks to lower data storage requirements by removing redundant data. With this technology, only one instance of data is stored across an entire system (even across data owned by multiple users). Both compression and deduplication can be performed at the file level or the block level, and they can be used together. These techniques can be automatically built into a distributed system to compress information without users explicitly issuing commands, thereby saving storage space and possibly cutting down on network communication costs without adding user complexity.
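The following sketch combines the two techniques at the block level, assuming fixed-size blocks, SHA-256 content hashes, and zlib for lossless compression. Real systems add variable-size chunking, hash-collision handling, and durable metadata, but the core idea is just this: store each unique block once, compressed, and keep a per-file recipe of block hashes.

    import hashlib
    import zlib

    BLOCK_SIZE = 4096
    store = {}        # hash -> compressed block, shared across all users

    def put(data):
        # Split into blocks; store each unique block only once, compressed.
        # Returns the file's "recipe": the ordered list of block hashes.
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in store:
                store[digest] = zlib.compress(block)
            recipe.append(digest)
        return recipe

    def get(recipe):
        # Lossless: the original data is reconstructed perfectly.
        return b"".join(zlib.decompress(store[d]) for d in recipe)

    first = put(b"hello world" * 2000)
    second = put(b"hello world" * 2000)   # a duplicate adds no new blocks
    assert get(first) == b"hello world" * 2000
    print(len(store), "unique compressed blocks stored")

Because put and get sit inside the storage service, users never issue compression or deduplication commands themselves, matching the transparency goal described above.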