MIT 6.824 Lab 2: Raft
I recently wanted to redo 6.824 and see whether it would feel easier now. The last time I did it was in my sophomore year, and I remember almost nothing about the old code. It was probably not very good anyway. Lab 1, MapReduce, is not too difficult, so I skipped it this time and started directly from Lab 2.
Following the course requirements, I will not share implementation code here. If you run into problems, feel free to contact me and discuss.
One small complaint: the course suggests not using time.Ticker, but I think time.Sleep is easier to get wrong. If you have written Go before, I suggest ignoring some of the language-specific advice from the course.
Lab 2A: Elections
The goal of 2A is to implement Raft elections:
- Implement RequestVote.
- Add AppendEntries, but not the full behavior from the paper yet.
- Implement role transitions among Follower, Candidate, and Leader.
- Implement the election logic.
Triggering an Election
An election is triggered when a node is a follower and does not receive a valid RPC request within ElectionTimeout. After triggering an election, it becomes a candidate and repeatedly runs elections. Each election should be separated by ElectionTimeout.
ElectionTimeout is not a fixed value. In the paper, it is 150 ms to 300 ms. In this lab, it needs to be larger. A reasonable range is around 650 ms to 900 ms.
Election Flow
When a node starts an election:
- It becomes a candidate, increments the term, and votes for itself.
- It sends RequestVote to all other nodes.
- If it gets a majority of votes, it becomes leader.
If it discovers that another node has a larger term during the election, it gives up and becomes a follower.
Without optimization, elections can take too long when some nodes have network failures. Nodes may vote for one another in turns and no one wins. Implementation details that matter:
- Requests must be sent concurrently.
- There is no need to wait for all RPCs to return.
- If a majority is already reached, the election can finish immediately.
- If it is already impossible to get a majority, it can also stop early.
Heartbeats
Once a leader exists, it must send heartbeats so followers do not start new elections. The heartbeat interval should be much smaller than ElectionTimeout. The lab tests limit heartbeat frequency to at most 10 times per second, so 100 ms is a natural interval.
Leader heartbeat sending should be optimized in the same way as elections. It should not wait synchronously for all RPCs to return. If one node is offline, a heartbeat round would otherwise take too long and followers may start elections.
Lab 2B: Log Replication
Lab 2B implements log replication and handles log consistency when nodes go offline, come back, or the cluster changes leader. This stage requires a complete AppendEntries implementation and leader-side log replication logic. It is harder than 2A, although the amount of code is not large. The key is to understand several Raft concepts.
Commit and Apply
The log replication process is roughly:
- When the leader receives a write request, it first appends the log entry to its own logs.
- The leader tracks the largest replicated index on each node as matchIndex.
- The leader also tracks the next log index to send to each node as nextIndex.
- On every heartbeat, the leader sends logs starting at nextIndex[id] to each follower.
- When a majority of nodes have received a log entry, that entry can be committed, and the leader updates commitIndex.
- Once a log is committed, it can be applied.
Important details:
- It may look like nextIndex is always matchIndex + 1, and in most cases it is. But when a leader is just elected, this condition may not hold. matchIndex is the maximum log index known to be replicated successfully, while nextIndex is the guessed first missing log index on that follower.
- nextIndex is a guess. If the guess is wrong, the follower rejects the AppendEntries request. The leader then needs to adjust nextIndex. Simply doing nextIndex -= 1 may not pass the tests. A simple algorithm that decreases it faster helps find the matching point sooner.
- Commit and apply are independent execution flows. commitIndex is the maximum log index that can be applied. Each node can use a goroutine to compare lastApplied and commitIndex and decide whether to apply logs.
#### Edge Case
Consider a case where S1 is leader in term 1, writes X, later writes Y, then observes that S3 has a larger term and becomes follower. Because S1 has longer logs, it may become leader again and replicate Y to followers. Since Y was not written in S1's current term, S1 cannot commit Y directly.
To handle this case, when a node becomes leader, it should write an empty log entry so earlier logs can be committed correctly. In this lab, tests check command types and indexes, so writing an empty command directly is not convenient. You can ignore this case for the lab or use a more complicated test-specific workaround.
In the lab, apply means sending the command into applyCh. Tests read commands from applyCh to determine whether logs were replicated correctly.
Follower Log Replication
When a follower receives AppendEntries:
- It checks whether prevLogIndex and prevLogTerm match its own log.
- It writes the request's logs into its own logs.
- It updates commitIndex and applies logs.
When writing logs, the follower starts from prevLogIndex + 1. This differs slightly from the paper because in the lab we can ignore log-write cost and simplify by writing directly from prevLogIndex + 1. A mature implementation would check whether existing logs from that position match the request and only overwrite from the first mismatch.
commitIndex is updated from leaderCommit in the request. This value may be smaller than the starting index of the logs in this request. In other words, a log entry may require at least two AppendEntries RPCs before it is committed: one to replicate the log, and another to update commitIndex.
When leaders change or nodes go offline and come back, a far-behind node may receive a request whose leaderCommit is no smaller than the replicated log index, because commit requires only that a majority has replicated the log.
Log Matching Property
The log matching property says: if two logs on different nodes have the same index and term, then they are the same log entry, and all previous log entries are also the same.
This property is the correctness guarantee behind the replication mechanism. The key is that followers check prevLogIndex and prevLogTerm. A proof by contradiction:
- Assume nodes A and B have one log entry with the same index and term but different content. Let that be the first such log.
- The same term means this log was replicated by the same leader.
- When the leader replicated to A and B, prevLogIndex and prevLogTerm both matched. Therefore the leader knew that the next log to copy was its own logs[prevLogIndex + 1].
- That contradicts the assumption that A and B have different entries at that index.
Lab 2C: Persistence
The Lab 2C tests are much stronger. Persistence itself does not require much code and has no particularly tricky part. More often, 2C exposes incomplete details from 2A and 2B. Common mistakes:
- A follower should not suppress elections after receiving just any request within ElectionTimeout. It should remain follower only after receiving a valid request.
- votedFor means the node voted for someone in the current term. When the term changes, this value must be reset.
- One voting condition is that the follower's log must not be newer than the candidate's log. Compare the term of the last log first. A larger term is newer. If terms are equal, the longer log is newer.
- A leader can only commit logs from its own term.
You can repeat a test several times with:
go test -failfast -race -timeout=15m -count 5 -run [TestCase] > test.log
The command stops after the requested count or fails early when a test fails.
Lab 2D: Log Compaction
Lab 2D implements log compaction. Raft uses snapshots for log compaction.
Trim Log
When creating a snapshot at index i, all logs before i should be discarded. The Raft paper says log indexes start from 1. Initialization can keep an empty log in the slice. For snapshots, one option is to put log i at the front of the log slice so the metadata of log_i is preserved.
All indexes in Raft, such as commitIndex, lastApplied, and nextIndex, are logical indexes, not direct slice indexes. After snapshotting, we need to record the snapshot log index. Later, when accessing a logical index, convert it to a physical slice index:
idx = index - snapshotIndex
Each time a snapshot is created, snapshotIndex is updated to the snapshot index.
The applyCh provided by the lab is an unbuffered channel. If the apply goroutine keeps holding the lock while applying a batch of data, triggering Snapshot can deadlock with the Snapshot method. Therefore, the apply process should use select to attempt applying. If it fails, it should release the lock and retry later.
This change affects the Lab 2B tests. When multiple servers are applying, apply failures become very likely and apply efficiency can become too low. Therefore, use context or a timer so apply fails only after a timeout. Do not use a default case directly.