Replication And Recovery In GFS
The Google File System (GFS) is a distributed file system designed to handle Google’s data processing needs. I had to read the paper more than once before it stuck, and that is something you can do as well. In this article I will explore how GFS replicates data and recovers from failures.
To appreciate how GFS works, we should first consider how it is used. MapReduce workloads use GFS to store the outputs of their map and reduce phases. The diagram below shows a MapReduce overview.
graph LR
map_worker_1 -->|write| gfs["out1.txt
out2.txt"]
map_worker_2 -->|write| gfs
map_worker_3 -->|write| gfs
gfs -->|read out1.txt| reduce_worker_1
gfs -->|read out2.txt| reduce_worker_2
reduce_worker_1 -->|write| reduce_out1.txt
reduce_worker_2 -->|write| reduce_out2.txt
In the map phase, workers read an input file and write to an output file, e.g. out1.txt; multiple workers can write to the same output file concurrently. In the reduce phase, workers read the map phase’s output files. Note that the writes are append-only, that is, there are no random writes. The key ideas to note are that workers write to output files concurrently and that writes are append-only.
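The append-only, multi-writer behaviour can be sketched with a toy in-memory file that serializes concurrent appends. This is a hypothetical illustration of the semantics, not GFS code; the class and names are made up.

```python
import threading

# Toy append-only file: concurrent writers never overwrite each other,
# because every append is serialized and lands at its own offset.
class AppendOnlyFile:
    def __init__(self):
        self._lock = threading.Lock()
        self._records = []

    def append(self, data: bytes) -> int:
        """Append a record and return the offset it landed at."""
        with self._lock:
            offset = sum(len(r) for r in self._records)
            self._records.append(data)
            return offset

f = AppendOnlyFile()
offsets = []
workers = [threading.Thread(target=lambda d: offsets.append(f.append(d)),
                            args=(b"record",)) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
# Each writer got a distinct offset; no record was lost or overwritten.
```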
block-beta
columns 4
client:1
space
master:3
block
columns 1
chunksvr1
block
a["chunk1"]
b["chunk3"]
end
end
block
columns 1
chunksvr2
block
c["chunk2"]
d["chunk4"]
end
end
block
columns 1
chunksvr3
block
f["chunk1"]
g["chunk3"]
end
end
master --> chunksvr1
master --> chunksvr2
master --> chunksvr3
client --> master
A GFS cluster has two main components: a master and chunk servers. The master server stores metadata about the cluster: file namespaces, file-to-chunk mappings, and chunk locations, whereas chunk servers store the actual file data. The table below shows this information. The file namespace is stored as a lookup table mapping full pathnames to metadata, with prefix compression.
| File | Chunks | Locations |
|---|---|---|
| /var/foo | ff45 | svr1, svr2 |
| /var/bar | ff46 | svr3, svr4 |
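The three kinds of metadata in the table above can be sketched as in-memory maps. This is a hypothetical illustration; the class and field names are mine, not from the paper.

```python
# Toy model of the metadata the master keeps in memory.
class Master:
    def __init__(self):
        self.namespace = set()     # file namespace: known paths
        self.file_chunks = {}      # file -> ordered list of chunk ids
        self.chunk_locations = {}  # chunk id -> chunkservers holding a replica

m = Master()
m.namespace.add("/var/foo")
m.file_chunks["/var/foo"] = ["ff45"]
m.chunk_locations["ff45"] = ["svr1", "svr2"]

# Resolving a file to the servers holding its first chunk:
servers = m.chunk_locations[m.file_chunks["/var/foo"][0]]
```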
When a MapReduce worker wants to write to some output file out1.txt, it sends a request to the master: Write /map/out1.txt. If the file doesn’t exist, the path is added to the file namespace and a chunk id is associated with the file. The master then responds to the client with the current chunk. Each chunk is 64 MB in size.
| /map/out1.txt |
|---|
| chunk ff45 |
| chunk ff46 |
The master itself doesn’t store chunks, only their metadata, i.e. which chunks belong to which files. The master selects a chunk primary and gives it a chunk lease. It also selects secondary servers which will hold the chunk’s replicas. It then responds to the MapReduce worker with the payload below.
{
  "chunk": "ff45",
  "primary": "chunksvr1",
  "secondary": ["chunksvr2", "chunksvr3"]
}
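One way the master might assemble that reply is sketched below. This is a hypothetical illustration: `handle_write`, the dict-based state, and the reuse of chunk id ff45 are assumptions for the example, not the paper’s API. The 60-second lease timeout is the paper’s default.

```python
import time

LEASE_SECONDS = 60  # the paper's default lease timeout

# Toy master state: files -> chunk ids, chunk -> replica locations, leases.
def handle_write(master, path):
    if path not in master["files"]:
        master["files"][path] = ["ff45"]   # illustrative chunk id
    chunk = master["files"][path][-1]      # current (last) chunk of the file
    replicas = master["locations"][chunk]
    # Grant the lease to the first replica, which becomes the primary.
    master["leases"][chunk] = (replicas[0], time.time() + LEASE_SECONDS)
    return {"chunk": chunk,
            "primary": replicas[0],
            "secondary": replicas[1:]}

master = {"files": {},
          "locations": {"ff45": ["chunksvr1", "chunksvr2", "chunksvr3"]},
          "leases": {}}
reply = handle_write(master, "/map/out1.txt")
```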
The client then pushes the data to the chunk primary; the chunk primary propagates the data to chunksvr2 and chunksvr2 to chunksvr3 in a pipeline. The data is then held in each server’s LRU cache.
flowchart LR
client --> primary
primary --> chunksvr2
chunksvr2 --> chunksvr3
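The pipelined push can be sketched as each server caching the data and then forwarding it to the next replica in the chain. The function name and the list-as-cache stand-in are illustrative assumptions, not GFS’s actual API.

```python
# Toy pipelined data push: cache locally, then forward down the chain.
def push_data(chain, data, caches):
    if not chain:
        return
    server, rest = chain[0], chain[1:]
    caches.setdefault(server, []).append(data)  # hold in this server's cache
    push_data(rest, data, caches)               # forward to the next replica

caches = {}
push_data(["primary", "chunksvr2", "chunksvr3"], b"payload", caches)
```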
Remember that we can have multiple clients writing to the same chunk, and in this case the clients will all be writing data to the chunk primary. The primary serializes the mutations and tells the secondaries to apply them in the same order. When a chunk is full, the master creates a new chunk and clients start writing to that chunk.
T1: Worker 1 appends 10MB → goes to chunk ff45
T2: Worker 2 appends 15MB → goes to chunk ff45
T3: Worker 3 appends 45MB → chunk ff45 full, create ff46
T4: Worker 1 appends 5MB → goes to chunk ff46
T5: Worker 2 appends 20MB → chunk ff46 full, create ff47
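The rollover logic above can be sketched as follows. This is a simplified illustration (in the real system a record that doesn’t fit causes the primary to pad the chunk and the client to retry on the new one); the function and the next-id scheme are assumptions.

```python
CHUNK_SIZE = 64  # MB, the fixed chunk size in GFS

# Toy append: fill the current chunk, allocate a new one when it would overflow.
def append(chunks, size_mb):
    """chunks: list of [chunk_id, used_mb]; returns the chunk appended to."""
    cid, used = chunks[-1]
    if used + size_mb > CHUNK_SIZE:
        cid = "ff%02x" % (int(cid[2:], 16) + 1)  # illustrative next chunk id
        chunks.append([cid, 0])
    chunks[-1][1] += size_mb
    return chunks[-1][0]

chunks = [["ff45", 0]]
append(chunks, 10)            # fits in ff45
append(chunks, 15)            # fits in ff45 (25 MB used)
new_chunk = append(chunks, 45)  # 25 + 45 would exceed 64 MB -> new chunk
```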
A chunkserver can hold the chunk lease indefinitely by requesting extensions from the master. This is especially important when a chunk lease would otherwise expire while the chunkserver is still applying mutations to a chunk. Chunk leases also ensure data consistency across the GFS cluster.
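Lease timing can be modelled with a toy class; the 60-second timeout is the paper’s default, but the class and method names here are made up for illustration.

```python
# Toy lease: valid for a fixed window, renewable before it expires.
class Lease:
    TIMEOUT = 60  # seconds, the paper's default lease timeout

    def __init__(self, now):
        self.expires = now + self.TIMEOUT

    def extend(self, now):
        """Renew the lease, e.g. while mutations are still in flight."""
        self.expires = now + self.TIMEOUT

    def valid(self, now):
        return now < self.expires

lease = Lease(now=0)
lease.extend(now=59)  # renewed just before expiry -> valid until t=119
```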
The master server holds crucial information about the cluster in its memory: file namespaces, file-to-chunk mappings, and chunk locations. If the master goes down, all this data would be lost. To avoid this scenario, the master maintains an operation log that records all mutations to its state. The operation log is stored on disk and replicated remotely. When a master starts, it replays the operation log to bring itself back to the state it was in before it went down.
CREATE_DIR /map/
CREATE_FILE /map/out1.txt
ALLOC_CHUNK file=/map/out1.txt
GRANT_LEASE primary=chunksvr1 chunk=ff45
The operation log can grow very large, which increases the master’s startup time. To avoid this, GFS checkpoints the master’s state by writing it to disk. Therefore, when the master starts it loads the latest checkpoint and replays only the operation log entries recorded after it, instead of replaying from the beginning of time. Shadow masters enhance read availability for files when the master goes down; they apply a replica of the operation log to their own state.
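Checkpoint-plus-replay recovery can be sketched like this. The `recover` function, the sequence numbers, and the tuple-based log format are assumptions for illustration, not the paper’s on-disk format.

```python
# Toy master recovery: load the checkpoint, then replay only the
# operation-log entries recorded after it.
def recover(checkpoint, op_log):
    state = {"files": dict(checkpoint["state"]["files"])}
    for seq, op, arg in op_log:
        if seq <= checkpoint["seq"]:
            continue  # already reflected in the checkpoint
        if op == "CREATE_FILE":
            state["files"][arg] = []
        elif op == "ALLOC_CHUNK":
            path, chunk = arg
            state["files"][path].append(chunk)
    return state

# Checkpoint taken after log entry 2; entry 3 happened after it.
checkpoint = {"seq": 2, "state": {"files": {"/map/out1.txt": ["ff45"]}}}
log = [(1, "CREATE_FILE", "/map/out1.txt"),
       (2, "ALLOC_CHUNK", ("/map/out1.txt", "ff45")),
       (3, "ALLOC_CHUNK", ("/map/out1.txt", "ff46"))]
state = recover(checkpoint, log)
```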