Section notes for CS162 Section #12, April 23, 2001
Barbara Hohlt

--------------------------------------------------------------------------
Distributed Systems Challenges - Matt Welsh
--------------------------------------------------------------------------

- What's hard about distributed systems?
  - Components can fail independently
  - Network link between components can drop/reorder/corrupt/delay messages
  - Network link between two components of a system can go down
  - All nodes may not be trusted: may have nodes that are actively trying
    to corrupt and confuse the rest of the system
  - Can't synchronize events in time (General's Paradox)
  - Difficult to maintain a consistent view of shared state across all nodes
  --> Note that none of these things is really an issue in a centralized
      system!

- A practical example: Using the ATM machine
  - You go to the bank machine in Jakarta and try to withdraw 1 million
    rupiah (about $90 US).
  - Long distance charges are excessive, so the ATM machine gives you your
    money on the spot.
  - Later on, the ATM machine calls up your bank and says "I need to
    withdraw a million rupiah."
  - The machine at the bank has to talk to another machine to figure out
    the current exchange rate.
  - At the same time, UC Berkeley is depositing your monthly salary via EFT.
  - The bank actually has a large number of different computers that each
    have some idea of what your bank account balance is.
  - The ATM machine is talking to machine A, EFT is talking to B.
  - How to make sure that both transactions go through without leaving you
    $1,000,000 US in the hole??

- Event ordering
  - Basic question: Given events A and B in a distributed system, can you
    tell whether A happened before B?
  - In general: NO! Why not?
    - A and B may have taken place on different machines.
    - It's like the theory of relativity: from the point of view of a
      third machine, you can't tell whether A or B happened "first", since
      observation is limited by messages between the machines.
    - In general, networks can drop/reorder/delay messages indefinitely,
      making an absolute notion of time infeasible.
  - In distributed-systems nomenclature we talk about "processes" rather
    than "machines". Each "process" is a distributed entity, very much
    like a standard OS "process", but processes can only interact by
    sending messages to one another. Of course, the network can do all
    kinds of nasty things to messages, like reorder and delay them.

- Leslie Lamport (better known for LaTeX): "Time, Clocks, and the Ordering
  of Events in a Distributed System" (1978) defined the "happens-before"
  relation (usually written as an arrow "-->"):

    A --> B if:
      - A and B are in the same process, and A occurs temporally before B,
        or
      - A is the sending of a message by one process and B is the receipt
        of the same message by another process.

  Note that A --> B and B --> C implies that A --> C (the relation is
  transitive).

  - This is a partial order: not all events in the system can be related
    by "-->". Write "did not happen before" as "-/->". Then if A -/-> B
    and B -/-> A, A and B are "concurrent" events and we can say nothing
    about their "happens before" relationship.
  - In this world, it is not important whether A really happened before B
    in "real time" -- all that is important is that every node in the
    system agrees on the logical order in which A and B occur. To put it
    another way, A --> B can be thought of as "A causes B", since network
    messages are the only way for processes to affect each other (in this
    very simplistic model, of course -- what if messages could go "out of
    band"?). A small sketch of this relation follows.
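- Aside (not in the original notes): a minimal Python sketch of the
  happens-before relation. The event names (a1, a2, b1, b2, c1) and the
  reachability check are made up for illustration; the two kinds of edges
  correspond exactly to the two rules above, and "A --> B?" becomes "is B
  reachable from A?".

      # Edges encode the two happens-before rules: successive events in
      # one process, and the send/receive pair of a single message.
      edges = {
          ("a1", "a2"),   # a1 precedes a2 within process A
          ("a2", "b1"),   # a2 is the send, b1 the receipt, of one message
          ("b1", "b2"),   # b1 precedes b2 within process B
      }

      def happens_before(x, y):
          """True iff x --> y, i.e. y is reachable from x along the edges."""
          frontier, seen = {x}, set()
          while frontier:
              cur = frontier.pop()
              seen.add(cur)
              for (u, v) in edges:
                  if u == cur and v not in seen:
                      if v == y:
                          return True
                      frontier.add(v)
          return False

      def concurrent(x, y):
          """x and y are concurrent iff x -/-> y and y -/-> x."""
          return not happens_before(x, y) and not happens_before(y, x)

      assert happens_before("a1", "b2")  # transitivity: a1-->a2-->b1-->b2
      assert concurrent("c1", "a1")      # no message chain relates them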
- Logical clocks
  - In order to determine whether "A --> B" holds for two events A and B,
    we assign a LOGICAL CLOCK to each event that occurs in the system. The
    basic idea is that if A --> B, then the logical clock of A (written
    "LC(A)") is less than LC(B).
  - Note that the reverse DOES NOT hold: LC(A) < LC(B) does not imply that
    A --> B, because A and B might be totally independent events!
  - The idea is simple: every process has a local clock which starts out
    at 0 and is incremented on every local action (for example, on every
    machine instruction).
  - Within a process it's clear that A --> B implies LC(A) < LC(B), since
    A must have occurred before B in that process.
  - Across processes, if a process receives a message with LC(message) >=
    its local timestamp, then that process sets its local timestamp to
    LC(message)+1. In some sense this causes the process to
    "resynchronize" its logical clock with the process it received the
    message from. The important thing is to maintain the property that
    A --> B implies LC(A) < LC(B).

- Our ATM example using logical clocks
  - Let's have the following machines:
    - A is the ATM machine in Jakarta
    - B is the Wells-Fargo Bank machine
    - C is the government Currency Conversion machine
    - D is the machine into which the EFT is being Deposited
    - E is the Berkeley EFT machine
  - Say that A starts with a logical clock of 0, B with 10, C with 0, D
    with 20, and E with 30.
  - Now we have the following messages:
      1) A to B: "Withdraw $90"
      2) B to C: "What is the currency conversion rate?"
      3) C to B: "1 USD = 11000 IDR"
      4) E to D: "Deposit $1500"
      5) D to B: "Deposit $1500"
    In "real time" these events occur in the order 1, 4, 2, 5, 3. (Of
    course the system does not know that!)
  - Let's calculate the LC value at each machine for each event:

      Event      A     B     C     D     E     LC (msg)
      -------------------------------------------------
      Start      0    10     0    20    30
      Msg 1      1    10     0    20    30     1
      Msg 4      1    10     0    32    31     31
      Msg 2      1    11    12    32    31     11
      Msg 5      1    34    12    33    31     33
      Msg 3      1    34    13    33    31     13

  - It is clear from looking at the LCs of the messages that
        1 --> 2 --> 3   and   4 --> 5
    But of course it also looks like 1, 2, 3 all happened before 4 and 5,
    when in fact they were intermixed. This is the meaning of "logical" in
    "logical clocks" -- all we know is that 1 --> 2 implies LC(1) < LC(2),
    not the other way around! (A sketch that replays this table appears
    below.)
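- Aside (not in the original notes): a minimal Python sketch of the clock
  rules above. It follows the simplified rule from these notes (a receive
  only bumps the clock when the message's stamp is at least the local one),
  replays the five ATM messages in "real time" order, and ends with the
  last row of the table:

      class Process:
          """Lamport logical clock, per the update rules in the notes."""
          def __init__(self, lc=0):
              self.lc = lc

          def send(self):
              # Sending is a local event: tick the clock, stamp the message.
              self.lc += 1
              return self.lc

          def receive(self, msg_lc):
              # "Resynchronize": jump past the sender's stamp if it is ahead.
              if msg_lc >= self.lc:
                  self.lc = msg_lc + 1

      A, B = Process(0), Process(10)
      C, D, E = Process(0), Process(20), Process(30)

      # Messages 1, 4, 2, 5, 3 in "real time" order: (sender, receiver)
      for sender, receiver in ((A, B), (E, D), (B, C), (D, B), (C, B)):
          receiver.receive(sender.send())

      assert (A.lc, B.lc, C.lc, D.lc, E.lc) == (1, 34, 13, 33, 31)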
- Transactions and ACID Semantics
  - In complex distributed systems, network links and machines can fail
    independently of one another. Logical clocks only give us a way of
    talking about a (logical) partial ordering of events: they do not help
    us deal with failures in the system that can cause state to become
    inconsistent.
  - Classic example: What if the ATM machine in Jakarta crashes
    immediately after asking Wells-Fargo to withdraw $90, is rebooted, and
    sends the same withdrawal request again? You lose another $90!
  - The basic building block of solutions to this sort of distributed
    problem is a TRANSACTION: an action that may affect state in multiple
    distributed resources, but is guaranteed to happen just ONCE and to
    leave the system in a consistent state. Getting transactions right is
    hard: Larry Ellison isn't a multibillionaire for nothing, you know!
  - In the database world it is often stated that transactions should have
    "ACID semantics". ACID stands for:
    - Atomicity:   The transaction should appear to be a single,
                   indivisible action.
    - Consistency: The transaction leaves the distributed system in a
                   consistent state, with all processes agreeing on what
                   the state is.
    - Isolation:   Execution of the transaction is isolated from that of
                   others; that is, transactions do not interfere with one
                   another.
    - Durability:  If the transaction completes successfully, the new
                   state is durable: its effects persist, and system
                   crashes etc. do not cause the system to revert to the
                   old state.
  - Getting ACID semantics isn't always easy, and some systems have
    explored alternatives that are easier to implement but laxer about the
    guarantees. One system (Armando Fox's TACC system from Berkeley) used
    the tongue-in-cheek term "BASE" to mean "Basically available,
    soft-state, eventual consistency".

- Two Phase Commit (2PC)
  - The classic solution to distributed transactions is the so-called
    "two-phase commit" protocol, often abbreviated "2PC". The basic idea
    is that every process participating in a transaction maintains a
    persistent log (e.g., on disk) that will retain its state even if the
    process crashes. The processes use this log to store information about
    the state of the transaction, so that even if they crash they can
    recover it.
  - Every 2PC transaction has a "coordinator" which is responsible for
    driving the protocol. The protocol is as follows (a toy sketch of one
    round appears after these steps):
    1) The coordinator C writes a log entry "prepare T" and commits the
       log to disk.
    2) C sends a "prepare T" message to every participant in the
       transaction.
    3) Each process P decides whether it can commit the transaction T. The
       idea here is that a node may not be able to commit a transaction if
       the state of the transaction is out of whack with its view of
       reality, or if the request is disallowed (e.g., a request to
       withdraw more money than is in the bank account). If P can commit,
       it writes "ready T" to its log and sends an "OK T" message to C. If
       not, it writes "no T" to its log and sends an "abort T" message
       to C.
    4) If C receives an "OK T" message from every process P, it writes
       "commit T" to its log and sends a "commit T" message to every
       process P. If C receives an "abort T" from *any* process P, or
       times out, it writes "abort T" to its log and sends an "abort T"
       message to every process P.
    5) When P receives "commit T" or "abort T", it performs the
       commit/abort action and writes that message to its log.
  - In 2PC, processes can crash at any time and the entire transaction
    will either be committed or aborted at every process -- eventually!
    The idea is that if a process crashes, it can read its log, find out
    what it was doing before it crashed, and find out from the coordinator
    what the result of the transaction was.
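- Aside (not in the original notes): a toy Python sketch of one
  failure-free 2PC round. Message passing is modeled as direct method
  calls, and each process's "log" is just an in-memory list standing in
  for the persistent log on disk -- the step numbers refer to the protocol
  above:

      class Participant:
          def __init__(self, can_commit):
              self.can_commit, self.log = can_commit, []

          def prepare(self, t):
              # Step 3: decide, log the decision, then vote.
              self.log.append(("ready " if self.can_commit else "no ") + t)
              return "OK" if self.can_commit else "abort"

          def finish(self, t, outcome):
              # Step 5: log the coordinator's decision and perform it.
              self.log.append(outcome + " " + t)

      def two_phase_commit(t, coordinator_log, participants):
          coordinator_log.append("prepare " + t)           # step 1
          votes = [p.prepare(t) for p in participants]     # steps 2-3
          outcome = "commit" if all(v == "OK" for v in votes) else "abort"
          coordinator_log.append(outcome + " " + t)        # step 4
          for p in participants:                           # step 5
              p.finish(t, outcome)
          return outcome

      ps = [Participant(True), Participant(True)]
      assert two_phase_commit("T1", [], ps) == "commit"
      ps.append(Participant(False))                  # one "no" vote...
      assert two_phase_commit("T2", [], ps) == "abort"   # ...aborts for all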
  - Recovery cases for a participant P (summarized in a sketch at the end
    of these notes):
    - If P recovers and sees the message "commit T" in its log, it knows
      that the transaction was completed and performs a local "redo" of
      whatever the transaction was. For example, if T was "deduct $90 from
      bank account A", the local redo will perform that operation on the
      local state for the account.
    - If P recovers and sees "abort T", it knows that the transaction was
      aborted and does whatever is needed to locally "undo" the
      transaction.
    - If P recovers and sees "ready T", it made a promise to commit the
      transaction locally -- but some other process may have caused an
      abort. P asks C (or some other node that participated in the
      transaction) for the state of the transaction and either commits or
      aborts based on C's answer.
    - If P recovers and sees no log message for T, it must have crashed
      before responding to the prepare message from C. So it simply aborts
      the transaction locally (assuming that C will time out the
      transaction, causing an abort).
  - What happens if the coordinator crashes? Various cases can occur -- on
    recovery, C can check with the participating processes to tell whether
    the transaction committed or aborted. In some cases, however,
    processes have to wait for C to recover before the state of the
    transaction can be decided -- this is called the BLOCKING PROBLEM.
  - What if network links fail? This makes it appear as though the sites
    cut off from the rest of the world have failed, so the 2PC recovery
    mechanisms kick in and (eventually) cause the transaction to reach a
    consistent state. A special case of this is a NETWORK PARTITION: when
    network link failures split the set of nodes into two or more disjoint
    sets. If the coordinator and all participants are in one partition,
    there is no problem -- 2PC can complete. If the coordinator and
    participants are spread across several partitions, then this reduces
    to a case of link failure (on multiple links), and the 2PC recovery
    mechanisms avoid there being two "parallel universes" where some nodes
    think that the state of the transaction is X, while others think that
    it is Y.

- Since 2PC is so great, why doesn't every system in the world use it?
  - To a large extent 2PC *is* very popular, and is used in many "mission
    critical" systems where distributed transactions must absolutely obey
    ACID semantics.
  - However, 2PC does not scale well: as the number of processes grows,
    having to rely upon a central coordinator for transactions can become
    a serious bottleneck. Also, as the number of transient failures
    increases, the overhead of 2PC in terms of messages and logging can
    become problematic.
  - For example, would you want 2PC running every time you do a write() to
    a file over a network filesystem? Even in the absence of failures, the
    number of messages exchanged can be large (especially as the number of
    nodes grows), leading to serious performance problems.
  - Another way to deal with this is to engineer part of the problem away.
    For example, in a tightly-coupled cluster environment, where all of
    the systems are in the same well-engineered machine room, it may be
    possible to reduce the possibility of network partitions to
    essentially nil. In such a case it is possible to use cheaper
    alternatives to 2PC which assume partitions don't occur!
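- Aside (not in the original notes): the participant recovery cases above,
  written as a small Python decision function. The strings are just
  stand-ins for real log records and recovery actions:

      def recover(last_record_for_T):
          """A participant's action on reboot, keyed on the last log
          record it finds for transaction T (None if no record exists)."""
          if last_record_for_T == "commit T":
              return "redo T locally"
          if last_record_for_T in ("abort T", "no T"):
              return "undo/abort T locally"
          if last_record_for_T == "ready T":
              return "ask coordinator (or another participant) for outcome"
          return "abort T locally"   # no record: crashed before voting

      assert recover(None) == "abort T locally"
      assert recover("ready T").startswith("ask")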