1.4 Distributed concurrency control

This section describes the distributed transaction commit problem, the two-phase commit protocol, and the recent, fault-tolerant, Paxos commit protocol.

The distributed commit problem

A distributed transaction consists of a set of subtransactions each executing on a separate node of a network. The (distributed) transaction commits if each of its component transactions commits and abots otherwise.

The distributed commit problem is to ensure that either all component transactions commit (if they all wish to commit) or that all of them abort if any one of them wishes to abort.

There are two aspects to the problem:

Note that transactions, or more precisely the resource managers perfoming the transaction at each node, may either fail or, equivalently, as far as the protocol is concerned, become isolated fromo the rest of the network.

These properties can be specified formally. E.g., at any time, a resource manager (RM) may be working, prepared (to commit), committed, or aborted. The RM's possible transitions are from working to prepared or aborted, and from prepared to committed or aborted.

Two-phase commit

The two-phase commit protocol is a standard solution to this problem.

It requires a particular node to act as the transaction manager (TM).

When an RM is prepared to commit, it sends a ``Prepared'' message to the TM.

Phase 1: The TM sends a ``Prepare'' message to each RM. When each RM receives this message, it (eventually) sends a ``Prepared'' message back to the TM indicating ``Commit'' or ``Abort''.

Phase 2: When the TM has received a ``Prepared'' message from every RM, if every message stated ``Commit'', it sends a ``Commit'' message to every RM (which must them commit). If any message stated ``Abort'', it sends an ``Abort'' message to every RM (which must them abort). Initially, if an RM is about to abort, it sends an ``Abort'' message to the TM, which forwards this message to every RM (which must then abort). Details are described by Bernstein et al. and by Gray and Lamport. The TM and every RM must record its state on stable storage before sending a message describing this state. Then, if it every fails and restarts, it can retrieve its state fromo stable storage and send the appropriate message describing this state. If the TM does not receive a response from an RM, it continues sending this message until a response is received. Thus, failure of an RM is not a problem. But if the TM fails, especially after every RM has sent a ``Prepared'' message, but before the TM has broadcast a ``Commit'' or ``Abort'' message, the RMs never discover what they should do. It is this problem that the Paxos commit algorithm addresses (as do various three-phase commit algorithms, but apparently with less success).

Paxos commit

The Paxos commit algorithm is based on an older Paxos consensus algorithm which is a protocol that allows a distributed system to reach a concensus, i.e., to agree on a value, such as a leader, despite the failure of some of them.

The Paxos commit algorithm is described as a special case of the Paxos concensus algorithm. It may be possible to give a simpler derivation of the algorithm without going through the concsensus algorithm.

Instead of a single TM, the Paxos commit algorithm uses 2N+1 TMs. If any N of these TMs fail, the remaining N+1 can agree whether to commit or abort. The details are complex and are omitted here. However, the overhead in terms of the total delay and the total number of messages is small. Moreover, when N=0, the Paxos commit algorithm reduces to the classical two-phase commit algorithm.

Problem 1 Present a simple, direct derivation, description and proof of the Paxos commit algorithm.

Problem 2 Implement and test the Paxos commit algorithm in a distributed programming language such as Erlang.