Section notes for CS162 Section #12, April 23, 2001
Barbara Hohlt

--------------------------------------------------------------------------
Distributed Systems Challenges - Matt Welsh
--------------------------------------------------------------------------

- What's hard about distributed systems?
  - Components can fail independently
  - Network link between components can drop/reorder/corrupt/delay messages
  - Network link between two components of a system can go down
  - All nodes may not be trusted: may have nodes that are actively trying
    to corrupt and confuse the rest of the system
  - Can't synchronize events in time (General's Paradox)
  - Difficult to maintain a consistent view of shared state across all nodes
  --> Note that none of these things is really an issue in a centralized
      system!

- A practical example: Using the ATM machine
  - You go to the bank machine in Jakarta and try to withdraw 1 million
    rupiah (about $90 US).
  - Long distance charges are excessive, so the ATM machine gives you your
    money on the spot.
  - Later on, the ATM machine calls up your bank and says "I need to
    withdraw a million rupiah."
  - The machine at the bank has to talk to another machine to figure out
    the current exchange rate.
  - At the same time, UC Berkeley is depositing your monthly salary via EFT.
  - The bank actually has a large number of different computers that each
    have some idea of what your bank account balance is.
  - The ATM machine is talking to machine A, EFT is talking to B.
  - How to make sure that both transactions go through without leaving you
    $1,000,000 US in the hole??

- Event ordering
  - Basic question: Given events A and B in a distributed system, can you
    tell whether A happened before B?
  - In general: NO! Why not?
    - A and B may have taken place on different machines.
    - It's like the theory of relativity: from the point of view of a
      third machine, you can't tell whether A or B happened "first", since
      observation is limited by messages between the machines.
    - In general, networks can drop/reorder/delay messages indefinitely,
      making an absolute notion of time infeasible.
  - In distributed-systems nomenclature we talk about "processes" rather
    than "machines". Each "process" is a distributed entity, very much
    like a standard OS "process", but processes can only interact by
    sending messages to one another. Of course, the network can do all
    kinds of nasty things to messages, like reorder and delay them.

- Leslie Lamport (better known for LaTeX): "Time, Clocks, and the Ordering
  of Events in a Distributed System" (1978) defined the "happens-before"
  relation (usually written as an arrow "-->"):

    A --> B if:
      - A and B are in the same process, and A occurs temporally before B,
        or
      - A is the sending of a message by one process and B is the receipt
        of the same message by another process.

  Note that A --> B and B --> C implies that A --> C (the relation is
  transitive).

  - This is a partial order: not all events in the system can be related
    by "-->". Write "did not happen before" as "-/->". Then if A -/-> B
    and B -/-> A, A and B are "concurrent" events and we can say nothing
    about their "happens before" relationship.
  - In this world, it is not important whether A really happened before B
    in "real time" -- all that is important is that every node in the
    system agrees on the logical order in which A and B occur. To put it
    another way, A --> B can be thought of as "A causes B", since network
    messages are the only way for processes to affect each other (in this
    very simplistic model, of course -- what if messages could go "out of
    band"?). A small sketch of this relation follows.
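- Aside (not in the original notes): a minimal Python sketch of the
  happens-before relation. The event names (a1, a2, b1, b2, c1) and the
  reachability check are made up for illustration; the two kinds of edges
  correspond exactly to the two rules above, and "A --> B?" becomes "is B
  reachable from A?".

      # Edges encode the two happens-before rules: successive events in
      # one process, and the send/receive pair of a single message.
      edges = {
          ("a1", "a2"),   # a1 precedes a2 within process A
          ("a2", "b1"),   # a2 is the send, b1 the receipt, of one message
          ("b1", "b2"),   # b1 precedes b2 within process B
      }

      def happens_before(x, y):
          """True iff x --> y, i.e. y is reachable from x along the edges."""
          frontier, seen = {x}, set()
          while frontier:
              cur = frontier.pop()
              seen.add(cur)
              for (u, v) in edges:
                  if u == cur and v not in seen:
                      if v == y:
                          return True
                      frontier.add(v)
          return False

      def concurrent(x, y):
          """x and y are concurrent iff x -/-> y and y -/-> x."""
          return not happens_before(x, y) and not happens_before(y, x)

      assert happens_before("a1", "b2")  # transitivity: a1-->a2-->b1-->b2
      assert concurrent("c1", "a1")      # no message chain relates them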
- Logical clocks
  - In order to determine whether "A --> B" holds for two events A and B,
    we assign a LOGICAL CLOCK to each event that occurs in the system. The
    basic idea is that if A --> B, then the logical clock of A (written
    "LC(A)") is less than LC(B).
  - Note that the reverse DOES NOT hold: LC(A) < LC(B) does not imply that
    A --> B, because A and B might be totally independent events!
  - The idea is simple: every process has a local clock which starts out
    at 0 and is incremented on every local action (for example, on every
    machine instruction).
  - Within a process it's clear that A --> B implies LC(A) < LC(B), since
    A must have occurred before B in that process.
  - Across processes, if a process receives a message with LC(message) >=
    its local timestamp, then that process sets its local timestamp to
    LC(message)+1. In some sense this causes the process to
    "resynchronize" its logical clock with the process it received the
    message from. The important thing is to maintain the property that
    A --> B implies LC(A) < LC(B).

- Our ATM example using logical clocks
  - Let's have the following machines:
    - A is the ATM machine in Jakarta
    - B is the Wells-Fargo Bank machine
    - C is the government Currency Conversion machine
    - D is the machine into which the EFT is being Deposited
    - E is the Berkeley EFT machine
  - Say that A starts with a logical clock of 0, B with 10, C with 0, D
    with 20, and E with 30.
  - Now we have the following messages:
      1) A to B: "Withdraw $90"
      2) B to C: "What is the currency conversion rate?"
      3) C to B: "1 USD = 11000 IDR"
      4) E to D: "Deposit $1500"
      5) D to B: "Deposit $1500"
    In "real time" these events occur in the order 1, 4, 2, 5, 3. (Of
    course the system does not know that!)
  - Let's calculate the LC value at each machine for each event:

      Event      A     B     C     D     E     LC (msg)
      -------------------------------------------------
      Start      0    10     0    20    30
      Msg 1      1    10     0    20    30     1
      Msg 4      1    10     0    32    31     31
      Msg 2      1    11    12    32    31     11
      Msg 5      1    34    12    33    31     33
      Msg 3      1    34    13    33    31     13

  - It is clear from looking at the LCs of the messages that
        1 --> 2 --> 3   and   4 --> 5
    But of course it also looks like 1, 2, 3 all happened before 4 and 5,
    when in fact they were intermixed. This is the meaning of "logical" in
    "logical clocks" -- all we know is that 1 --> 2 implies LC(1) < LC(2),
    not the other way around! (A sketch that replays this table appears
    below.)
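- Aside (not in the original notes): a minimal Python sketch of the clock
  rules above. It follows the simplified rule from these notes (a receive
  only bumps the clock when the message's stamp is at least the local one),
  replays the five ATM messages in "real time" order, and ends with the
  last row of the table:

      class Process:
          """Lamport logical clock, per the update rules in the notes."""
          def __init__(self, lc=0):
              self.lc = lc

          def send(self):
              # Sending is a local event: tick the clock, stamp the message.
              self.lc += 1
              return self.lc

          def receive(self, msg_lc):
              # "Resynchronize": jump past the sender's stamp if it is ahead.
              if msg_lc >= self.lc:
                  self.lc = msg_lc + 1

      A, B = Process(0), Process(10)
      C, D, E = Process(0), Process(20), Process(30)

      # Messages 1, 4, 2, 5, 3 in "real time" order: (sender, receiver)
      for sender, receiver in ((A, B), (E, D), (B, C), (D, B), (C, B)):
          receiver.receive(sender.send())

      assert (A.lc, B.lc, C.lc, D.lc, E.lc) == (1, 34, 13, 33, 31)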
- Transactions and ACID Semantics
  - In complex distributed systems, network links and machines can fail
    independently of one another. Logical clocks only give us a way of
    talking about a (logical) partial ordering of events: they do not help
    us deal with failures in the system that can cause state to become
    inconsistent.
  - Classic example: What if the ATM machine in Jakarta crashes
    immediately after asking Wells-Fargo to withdraw $90, is rebooted, and
    sends the same withdrawal request again? You lose another $90!
  - The basic building block of solutions to this sort of distributed
    problem is a TRANSACTION: an action that may affect state in multiple
    distributed resources, but is guaranteed to happen just ONCE and to
    leave the system in a consistent state. Getting transactions right is
    hard: Larry Ellison isn't a multibillionaire for nothing, you know!
  - In the database world it is often stated that transactions should have
    "ACID semantics". ACID stands for:
    - Atomicity:   The transaction should appear to be a single,
                   indivisible action.
    - Consistency: The transaction leaves the distributed system in a
                   consistent state, with all processes agreeing on what
                   the state is.
    - Isolation:   Execution of the transaction is isolated from that of
                   others; that is, transactions do not interfere with one
                   another.
    - Durability:  If the transaction completes successfully, the new
                   state is durable: its effects persist, and system
                   crashes etc. do not cause the system to revert to the
                   old state.
  - Getting ACID semantics isn't always easy, and some systems have
    explored alternatives that are easier to implement but laxer about the
    guarantees. One system (Armando Fox's TACC system from Berkeley) used
    the tongue-in-cheek term "BASE" to mean "Basically available,
    soft-state, eventual consistency".

- Two Phase Commit (2PC)
  - The classic solution to distributed transactions is the so-called
    "two-phase commit" protocol, often abbreviated "2PC". The basic idea
    is that every process participating in a transaction maintains a
    persistent log (e.g., on disk) that will retain its state even if the
    process crashes. The processes use this log to store information about
    the state of the transaction, so that even if they crash they can
    recover it.
  - Every 2PC transaction has a "coordinator" which is responsible for
    driving the protocol. The protocol is as follows (a toy sketch of one
    round appears after these steps):
    1) The coordinator C writes a log entry "prepare T" and commits the
       log to disk.
    2) C sends a "prepare T" message to every participant in the
       transaction.
    3) Each process P decides whether it can commit the transaction T. The
       idea here is that a node may not be able to commit a transaction if
       the state of the transaction is out of whack with its view of
       reality, or if the request is disallowed (e.g., a request to
       withdraw more money than is in the bank account). If P can commit,
       it writes "ready T" to its log and sends an "OK T" message to C. If
       not, it writes "no T" to its log and sends an "abort T" message
       to C.
    4) If C receives an "OK T" message from every process P, it writes
       "commit T" to its log and sends a "commit T" message to every
       process P. If C receives an "abort T" from *any* process P, or
       times out, it writes "abort T" to its log and sends an "abort T"
       message to every process P.
    5) When P receives "commit T" or "abort T", it performs the
       commit/abort action and writes that message to its log.
  - In 2PC, processes can crash at any time and the entire transaction
    will either be committed or aborted at every process -- eventually!
    The idea is that if a process crashes, it can read its log, find out
    what it was doing before it crashed, and find out from the coordinator
    what the result of the transaction was.
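- Aside (not in the original notes): a toy Python sketch of one
  failure-free 2PC round. Message passing is modeled as direct method
  calls, and each process's "log" is just an in-memory list standing in
  for the persistent log on disk -- the step numbers refer to the protocol
  above:

      class Participant:
          def __init__(self, can_commit):
              self.can_commit, self.log = can_commit, []

          def prepare(self, t):
              # Step 3: decide, log the decision, then vote.
              self.log.append(("ready " if self.can_commit else "no ") + t)
              return "OK" if self.can_commit else "abort"

          def finish(self, t, outcome):
              # Step 5: log the coordinator's decision and perform it.
              self.log.append(outcome + " " + t)

      def two_phase_commit(t, coordinator_log, participants):
          coordinator_log.append("prepare " + t)           # step 1
          votes = [p.prepare(t) for p in participants]     # steps 2-3
          outcome = "commit" if all(v == "OK" for v in votes) else "abort"
          coordinator_log.append(outcome + " " + t)        # step 4
          for p in participants:                           # step 5
              p.finish(t, outcome)
          return outcome

      ps = [Participant(True), Participant(True)]
      assert two_phase_commit("T1", [], ps) == "commit"
      ps.append(Participant(False))                  # one "no" vote...
      assert two_phase_commit("T2", [], ps) == "abort"   # ...aborts for all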
  - Recovery cases for a participant P (summarized in a sketch at the end
    of these notes):
    - If P recovers and sees the message "commit T" in its log, it knows
      that the transaction was completed and performs a local "redo" of
      whatever the transaction was. For example, if T was "deduct $90 from
      bank account A", the local redo will perform that operation on the
      local state for the account.
    - If P recovers and sees "abort T", it knows that the transaction was
      aborted and does whatever is needed to locally "undo" the
      transaction.
    - If P recovers and sees "ready T", it made a promise to commit the
      transaction locally -- but some other process may have caused an
      abort. P asks C (or some other node that participated in the
      transaction) for the state of the transaction and either commits or
      aborts based on C's answer.
    - If P recovers and sees no log message for T, it must have crashed
      before responding to the prepare message from C. So it simply aborts
      the transaction locally (assuming that C will time out the
      transaction, causing an abort).
  - What happens if the coordinator crashes? Various cases can occur -- on
    recovery, C can check with the participating processes to tell whether
    the transaction committed or aborted. In some cases, however,
    processes have to wait for C to recover before the state of the
    transaction can be decided -- this is called the BLOCKING PROBLEM.
  - What if network links fail? This makes it appear as though the sites
    cut off from the rest of the world have failed, so the 2PC recovery
    mechanisms kick in and (eventually) cause the transaction to reach a
    consistent state. A special case of this is a NETWORK PARTITION: when
    network link failures split the set of nodes into two or more disjoint
    sets. If the coordinator and all participants are in one partition,
    there is no problem -- 2PC can complete. If the coordinator and
    participants are spread across several partitions, then this reduces
    to a case of link failure (on multiple links), and the 2PC recovery
    mechanisms avoid there being two "parallel universes" where some nodes
    think that the state of the transaction is X, while others think that
    it is Y.

- Since 2PC is so great, why doesn't every system in the world use it?
  - To a large extent 2PC *is* very popular, and is used in many "mission
    critical" systems where distributed transactions must absolutely obey
    ACID semantics.
  - However, 2PC does not scale well: as the number of processes grows,
    having to rely upon a central coordinator for transactions can become
    a serious bottleneck. Also, as the number of transient failures
    increases, the overhead of 2PC in terms of messages and logging can
    become problematic.
  - For example, would you want 2PC running every time you do a write() to
    a file over a network filesystem? Even in the absence of failures, the
    number of messages exchanged can be large (especially as the number of
    nodes grows), leading to serious performance problems.
  - Another way to deal with this is to engineer part of the problem away.
    For example, in a tightly-coupled cluster environment, where all of
    the systems are in the same well-engineered machine room, it may be
    possible to reduce the possibility of network partitions to
    essentially nil. In such a case it is possible to use cheaper
    alternatives to 2PC which assume partitions don't occur!
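- Aside (not in the original notes): the participant recovery cases above,
  written as a small Python decision function. The strings are just
  stand-ins for real log records and recovery actions:

      def recover(last_record_for_T):
          """A participant's action on reboot, keyed on the last log
          record it finds for transaction T (None if no record exists)."""
          if last_record_for_T == "commit T":
              return "redo T locally"
          if last_record_for_T in ("abort T", "no T"):
              return "undo/abort T locally"
          if last_record_for_T == "ready T":
              return "ask coordinator (or another participant) for outcome"
          return "abort T locally"   # no record: crashed before voting

      assert recover(None) == "abort T locally"
      assert recover("ready T").startswith("ask")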