This section describes the distributed transaction commit problem, the two-phase commit protocol, and the recent, fault-tolerant, Paxos commit protocol.
A distributed transaction consists of a set of subtransactions each executing on a separate node of a network. The (distributed) transaction commits if each of its component transactions commits and abots otherwise.
The distributed commit problem is to ensure that either all component transactions commit (if they all wish to commit) or that all of them abort if any one of them wishes to abort.
There are two aspects to the problem:
These properties can be specified formally. E.g., at any time, a resource manager (RM) may be working, prepared (to commit), committed, or aborted. The RM's possible transitions are from working to prepared or aborted, and from prepared to committed or aborted.
The two-phase commit protocol is a standard solution to this problem.
It requires a particular node to act as the transaction manager (TM).
When an RM is prepared to commit, it sends a ``Prepared'' message to the TM.
Phase 1: The TM sends a ``Prepare'' message to each RM. When each RM receives this message, it (eventually) sends a ``Prepared'' message back to the TM indicating ``Commit'' or ``Abort''.
Phase 2: When the TM has received a ``Prepared'' message from every RM, if every message stated ``Commit'', it sends a ``Commit'' message to every RM (which must them commit). If any message stated ``Abort'', it sends an ``Abort'' message to every RM (which must them abort). Initially, if an RM is about to abort, it sends an ``Abort'' message to the TM, which forwards this message to every RM (which must then abort). Details are described by Bernstein et al. and by Gray and Lamport. The TM and every RM must record its state on stable storage before sending a message describing this state. Then, if it every fails and restarts, it can retrieve its state fromo stable storage and send the appropriate message describing this state. If the TM does not receive a response from an RM, it continues sending this message until a response is received. Thus, failure of an RM is not a problem. But if the TM fails, especially after every RM has sent a ``Prepared'' message, but before the TM has broadcast a ``Commit'' or ``Abort'' message, the RMs never discover what they should do. It is this problem that the Paxos commit algorithm addresses (as do various three-phase commit algorithms, but apparently with less success).
The Paxos commit algorithm is based on an older Paxos consensus algorithm which is a protocol that allows a distributed system to reach a concensus, i.e., to agree on a value, such as a leader, despite the failure of some of them.
The Paxos commit algorithm is described as a special case of the Paxos concensus algorithm. It may be possible to give a simpler derivation of the algorithm without going through the concsensus algorithm.
Instead of a single TM, the Paxos commit algorithm uses 2N+1 TMs. If any N of these TMs fail, the remaining N+1 can agree whether to commit or abort. The details are complex and are omitted here. However, the overhead in terms of the total delay and the total number of messages is small. Moreover, when N=0, the Paxos commit algorithm reduces to the classical two-phase commit algorithm.
Problem 1 Present a simple, direct derivation, description and proof of the Paxos commit algorithm.
Problem 2 Implement and test the Paxos commit algorithm in a distributed programming language such as Erlang.
The definitive description, including a formal specification. But harder to read than necessary (?), as based on the following more complex papers.
(Both these papers are available from Lamport's home page.)