Implementing Practical Byzantine Fault Tolerance - Part 1
Part 1: The Foundation
Practical Byzantine Fault Tolerance is one of the seminal papers in the world of Distributed Systems. It is one of the first practical solutions proposed to the Byzantine Fault Problem (also known as the Byzantine Generals problem).
NOTE: To keep this post focused instead of describing everything from the ground up I will assume readers’ familiarity with some of those base concepts, or being comfortable to move on despite not knowing them. If you are not familiar with some or all, do not worry, everyone starts somewhere! If you like reading papers I can send you straight to the source. If you prefer more casual articles there is plenty out there too with some good videos on top. Or… You can just ask Chat GPT to explain it.
Since this paper is very fundamental to the world of Distributed Systems, especially when we venture beyond the walls of trusted systems running in our data center to things like blockchain where nodes are operated by different parties, and so many later algorithms are built on top of it, I reckoned that understanding it well will help me a lot with understanding others. And what is the better way to understand something than to implement it?
In this post, I will lay some groundwork for understanding the protocol, make some high-level design choices to make implementation fairly simple, and try to model basic data structures.
Since this is a learning project, do not expect optimal design, handling all the edge cases, or production-grade code. The goal is to implement a complete protocol for a more or less happy path.
If you are already familiar with pBFT you can probably quickly skim this part. Otherwise, I strongly encourage you to look at the paper, follow along, and read up on any details that I omit.
A good start to understanding pBFT is the base case – achieving consensus on a single value. Let’s ignore unresponsive leaders, view changes just for now, and try to understand the core.
NOTE: Following the paper we will be mostly looking at sections “4.1 The Client’ and “4.2 Normal-Case Operation”. We will also peak at the basics as they unfold.
The paper features a simple diagram, that pretty much illustrates the whole protocol for a single message (see “Figure 1: Normal Case Operation”), here’s my recreation of it with a bit more details:
To understand it, however, we need some background knowledge about how pBFT works, what are the constraints, etc:
pBFT assumes that at a given time (or in each view) there is exactly one replica that acts as a leader.
Since the whole purpose of pBFT is to achieve consensus in distributed system while tolerating the Byzantine Fault it can do so as long as at most
((n-1) / 3)(less than one third) of the replicas are misbehaving. In other words, if we want to tolerate
ffaulty replicas, we need at least
2f + 1honest ones.
NOTE: From Section 3 of the paper “Service Properties”: “The algorithm provides both safety and liveness assuming no more than ((n-1) / 3) replicas are faulty. Safety means that the replicated service satisfies linearizability […]: it behaves like a centralized implementation that executes operations atomically one at a time. Safety requires the bound on the number of faulty replicas because a faulty replica can behave arbitrarily, e.g., it can destroy its state.”
From the above formula we can see that we need at least 4 replicas in the pBFT system to tolerate 1 dishonest one. For non-Byzantite Fault-tolerant consensus protocols such as Raft or Paxos the minimum number of replicas is usually 3, and the number needs to be odd so that they cannot be split into two even groups. Since in basic pBFT, a leader is predetermined for each view, there is no need for an election, hence we do not need a majority for it.
We dipped our toes in the theory enough to establish some ground rules, now let’s come back to our diagram, look at each “Phase” in a bit more detail, and start writing some code.
NOTE: This and following post will include some snippets of code but it is far from the full implementation, so trying to copy-paste from it is rather pointless. For the full code base refer to the repository.
As many great adventures through layers of code begin with someone wanting to do something let’s start this one here as well, with the client request.
As we see on the diagram, the client sends the request to a single replica. That replica ideally is a leader, but if that is not the case, the request should be forwarded to the leader by any backup node.
In pBFT there is no leader election such as in non-BFT protocols like RAFT or Paxos. Instead, replicas become leader in a predetermined order based on their ID and current view. The leader remains in charge for the whole duration of the view, in other words: as long as protocol progresses (requested operations are being executed). The simplest algorithm for selecting the leader would be:
leader_id = view_number % number_of_replicas.
After the client request successfully reaches the node, the client’s watch begins. For it to consider the request executed, it expects to receive matching responses from at least
2f + 1 different replicas directly (the last section on the diagram: reply).
If the client gets tired of waiting and smells that something might be fishy, it will send the request again, this time to all of the replicas.
NOTE: This part is later important for triggering the view changes…
You might have already spotted a common problem that may occur here: multiple executions of the same request. Fear not, however, as the solution for that is rather simple. The client will simply include a sequence number (or timestamp) with each request, which together with some client ID, will uniquely identify each request, hence will prevent multiple executions.
To keep things more straightforward, my implementation is using request UUID as a measure of identifying requests.
With all that I can propose an initial representation of the
ClientRequest. The service behind pBFT will be a simple key-value store, but I will at least try to separate the consensus layer from the application just a bit.
NOTE: To spare screen real estate code in this article will be stripped to the minimum required to get the main points across. I omit things like attributes, helper methods, boilerplate, and other details. Again, the complete code can be found in the GitHub repository.
For the key-value store operations can be defined as a simple enum:
The last thing we need here is a response that the client receives from honest replicas that processed the request:
The response consist of:
- Operation result.
- Sequence number (or simply sequence) - assigned as part of processing request through pBFT, determines the exact order of operation execution.
- Client request - it could include just the request ID, but for this implementation, it might as well be included as a whole.
The client also has to identify which replica sends the response and verify its authenticity, which I am going to cover later (in part 2) with message signatures and API implementation.
Let’s wrap up this part by modeling protocol messages.
Coming back to the diagram, we can see that the core of the protocol is composed of three phases:
Let’s zoom in on those one by one.
When the leader receives a client request it initiates the consensus for the operation. This process consists of:
- Computing the digest of the client request.
- Assigning a sequence number to it.
- Storing the request, digest, and newly composed
PrePrepareand the client request to all replicas.
ProtocolMessage can be represented as an enum with all its possible variants:
PrePrepare includes metadata about the associated request:
As mentioned in the paper,
PrePrepare does not include the client request itself to keep it smaller. This is important later on during view changes when a new leader might need to broadcast a whole bunch of
PrePrepares in a new view for previously sent requests.
The authors also mention the possibility of using separate protocols for smaller pBFT messages and bigger client requests indicating that those are sent completely separately - we are not going that far.
Since initially both
PrePrepare and the client request are easier to process together, they are going to be broadcasted together as well:
With this done, prepare yourself for the next phase…
Upon receiving the
PrePrepare backup performs some validation such as:
- Verify signature and message digest.
- Check that the message is in the current view.
- Ensure it did not previously accept a different
PrePreparefor the same sequence and view.
- Check that the sequence number is between high and low watermarks.
If it is all good replica decides to accept the message and enters the prepare phase by:
- Storing received
PrePrepareand the client request.
- Creating and storing its own
Preparemessage for the same request.
- Broadcasting the
Prepareto all replicas.
Prepare message will include the same metadata as
PrePrepare with a single addition of replica_id. This is because we only expect one
PrePrepare for the consensus round coming from the leader, but each replica is going to create, and send its own
All (or at least most) replicas will be doing it at more or less the same time – bombarding each other with
Prepare messages. The replica is considered prepared for a given message when its log contains:
- Client request.
PrePreparemessage for a given request.
2f + 1
Preparemessages from different replicas (including itself) matching the
NOTE: In the paper we see it state more precisely as: prepared(m, v, n, i), where m is a message, v is a view, n is a sequence number, and i is replica id.
Whether different protocol messages are “matching” is determined by the metadata - view, sequence number, and digest - in other words, if they are associated with the same client request.
This process ensures that the majority of honest replicas will agree on the same message order (determined by sequence number) in a given view. Or to put it simply: agree on the same message for a given combination of view and sequence number.
NOTE: sequence numbers increase independently of the view, meaning that if we achieved consensus for sequences 1-10 in view 1 and transition to view 2, the sequence numbers will not reset but will continue to grow, e.g. 11, 12, 13, etc.
When the replica is prepared it is time to move to the commit phase.
When the replica reaches the prepared state for a given message, it, again, broadcasts a message to all other nodes, this time it is
The process here is very similar to the prepare phase in that, replicas will validate received messages, and insert them into the log.
Similarly to the prepared condition of the previous phase, in this one, we define a committed-local property, which is true for the replica if it got
2f + 1
Commit messages from different peers.
With all the similarities the
Commit message consists of the exact same properties as
There is another cluster-wide property described by the paper - committed(m, v, n) - which is true when at least
f + 1 non-faulty replicas reached prepared state for a given message (this essentially amounts to
2f + 1 all replicas, given the assumption that at most
f of them are faulty). The commit phase ensures that if at least one non-faulty replica is committed-local(m, v, n, i) then committed(m, v, n) is true. In other words, if one honest replica reaches committed-local property for the message, the message can be considered set in stone.
As a result, as soon as committed-local is true for a given message, the replica is ready to execute it (apply the operation to its state). After successful execution, the replica sends the reply to the client thereby finishing the flow represented in the diagram.
The client can be certain that the operation went through if it received
2f + 1 replies from different replicas.
There are still a few gotchas to watch out for. Since replicas can commit the requests out of order (e.g. one with
sequence = 5 before the one with
sequence = 3) depending on message order delivery, and therefore reach committed-local(…, n+1) before committed-local(…, n). The whole point of the protocol is to agree on ordering, hence replicas cannot execute operations out of order.
The paper assumes partial synchrony - that each message from an honest replica will get delivered eventually - so with this assumption we should not get stuck on any message indefinitely as long as
2f + 1 replicas are honest (and operating).
This wraps up a very basic overview of the protocol base case. There are more details to it but if you are interested in a really deep dive the paper is your go-to. Hopefully, this primer gives some good ground to get through it.
This, however, is not the end, as in the following parts I am going to cover the high-level overview of the implementation of both pBFT as well as Key-Value store on top of it and look into the remaining parts of the protocol: checkpoints and view changes.