Replication And Recovery In GFS

The Google File System (GFS) is a distributed file system designed to handle Google’s data processing needs. I had to read the paper more than once before it stuck, but it is well worth the effort. In this article I will explore how GFS replicates data and recovers from failures.

To appreciate how GFS works, we should first consider how it is used. MapReduce workloads use GFS to store the outputs of their map and reduce phases. The diagram below shows a MapReduce overview.

      graph LR
    map_worker_1 -->|write| gfs["out1.txt
                                out2.txt"]
    map_worker_2 -->|write| gfs
    map_worker_3 -->|write| gfs

    gfs -->|read out1.txt| reduce_worker_1
    gfs -->|read out2.txt| reduce_worker_2

    reduce_worker_1 -->|write| reduce_out1.txt
    reduce_worker_2 -->|write| reduce_out2.txt

In the map phase, workers read an input file and append to an output file, e.g. out1.txt; multiple workers can write to the same output file concurrently. In the reduce phase, workers read the map phase’s output files. Note that the writes are append only, that is, there are no random writes. The key ideas to note are that workers write to output files concurrently and that writes are append only.

      block-beta
columns 4
client:1
space
master:3
    block
        columns 1
        chunksvr1
        block
            a["chunk1"]
            b["chunk3"]
        end
    end
    block
        columns 1
        chunksvr2
        block
            c["chunk2"]
            d["chunk4"]
        end
    end
    block
        columns 1
        chunksvr3
         block
            f["chunk1"]
            g["chunk3"]
        end
    end

master -->  chunksvr1
master -->  chunksvr2
master -->  chunksvr3
client --> master

A GFS cluster has two main components: a master and chunk servers. The master stores metadata about the cluster, such as the file namespace, the file-to-chunk mapping, and chunk locations, whereas chunk servers store the actual file data. The table below shows this information. The file namespace is kept in memory as a prefix-compressed lookup table of full pathnames.

    File       Chunks   Locations
    /var/foo   ff45     svr1, svr2
    /var/bar   ff46     svr3, svr4
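The master’s three tables can be sketched as plain in-memory maps. This is a hypothetical Python model (the class and method names are mine, not from the paper, and the real master’s structures are far more involved):

```python
# Toy in-memory model of the master's metadata tables.
# All names are illustrative; the real master is not structured this way.

class Master:
    def __init__(self):
        self.namespace = set()     # file namespace: known full pathnames
        self.file_chunks = {}      # file -> ordered list of chunk handles
        self.chunk_locations = {}  # chunk handle -> servers holding replicas

    def add_file(self, path, chunk, servers):
        self.namespace.add(path)
        self.file_chunks.setdefault(path, []).append(chunk)
        self.chunk_locations[chunk] = list(servers)

master = Master()
master.add_file("/var/foo", "ff45", ["svr1", "svr2"])
master.add_file("/var/bar", "ff46", ["svr3", "svr4"])
print(master.file_chunks["/var/foo"])     # ['ff45']
print(master.chunk_locations["ff46"])     # ['svr3', 'svr4']
```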

When a MapReduce worker wants to write to some output file out1.txt, it sends a request to the master: Write /map/out1.txt. If the file doesn’t exist, the path is added to the file namespace and a chunk id is associated with the file. The master then responds to the client with the current chunk. Each chunk is 64 MB in size.

    /map/out1.txt
    chunk ff45
    chunk ff46

The master itself doesn’t store chunks, only their metadata, i.e. which chunks belong to which files. The master selects a chunk primary and grants it a chunk lease. It also selects secondary servers which will hold the chunk’s replicas. It then responds to the MapReduce worker with a payload like the one below.

    {
        "chunk": "ff45",
        "primary": "chunksvr1",
        "secondary": ["chunksvr2", "chunksvr3"]
    }
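The master-side logic behind that response can be sketched as follows (a hypothetical model: the lease bookkeeping is reduced to a dictionary, and real leases carry timeouts):

```python
# Hypothetical master-side handling of a client write request.
# The master picks one replica holder as primary, grants it the
# chunk lease, and returns primary plus secondaries to the client.

chunk_locations = {"ff45": ["chunksvr1", "chunksvr2", "chunksvr3"]}
leases = {}  # chunk handle -> server currently holding the lease

def handle_write_request(chunk):
    replicas = chunk_locations[chunk]
    primary = leases.setdefault(chunk, replicas[0])  # grant lease if none held
    secondaries = [s for s in replicas if s != primary]
    return {"chunk": chunk, "primary": primary, "secondary": secondaries}

print(handle_write_request("ff45"))
# {'chunk': 'ff45', 'primary': 'chunksvr1', 'secondary': ['chunksvr2', 'chunksvr3']}
```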

The client then pushes the data to the chunk primary; the primary forwards it to chunksvr2, and chunksvr2 to chunksvr3, in a pipeline. At this point the data is only buffered in each server’s LRU cache; it has not yet been written to the chunk.

      flowchart LR
    client --> primary
    primary --> chunksvr2
    chunksvr2 --> chunksvr3
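The pipeline can be sketched like this (illustrative Python; the server names mirror the diagram, and a plain list stands in for the LRU buffer cache):

```python
# Toy sketch of the data-push pipeline: each server buffers the
# data, then forwards it to the next server in the chain.

caches = {}   # server -> buffered data blocks, not yet committed

def push(data, chain):
    if not chain:
        return
    head, rest = chain[0], chain[1:]
    caches.setdefault(head, []).append(data)  # buffer locally
    push(data, rest)                          # forward down the pipeline

push(b"record-1", ["primary", "chunksvr2", "chunksvr3"])
print(sorted(caches))   # ['chunksvr2', 'chunksvr3', 'primary']
```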

Remember that we can have multiple clients writing to the same chunk; in that case they all push their data to the chunk primary. The primary serializes the mutations and tells the secondaries to apply them in that exact order. When a chunk is full, the master creates a new chunk and clients start writing to that chunk.

    T1: Worker 1 appends 10 MB → goes to chunk ff45 (10/64 MB used)
    T2: Worker 2 appends 15 MB → goes to chunk ff45 (25/64 MB used)
    T3: Worker 3 appends 45 MB → would overflow ff45, so ff46 is created and the append goes there
    T4: Worker 1 appends 5 MB → goes to chunk ff46 (50/64 MB used)
    T5: Worker 2 appends 10 MB → goes to chunk ff46 (60/64 MB used)
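A toy model of this allocation, assuming the 64 MB chunk size (illustrative only; at a chunk boundary real GFS record append pads out the old chunk and retries the record on the new one, and the handle names are made up):

```python
CHUNK_SIZE = 64                       # MB; GFS's fixed chunk size
chunks = {"ff45": 0}                  # chunk handle -> MB used
current = "ff45"
new_handles = iter(["ff46", "ff47"])  # handles the master would hand out

def append(size_mb):
    """Toy record append: an overflowing record pads the old chunk
    and lands at the start of a freshly allocated chunk."""
    global current
    if chunks[current] + size_mb > CHUNK_SIZE:
        chunks[current] = CHUNK_SIZE   # pad the old chunk to full size
        current = next(new_handles)    # master allocates a fresh chunk
        chunks[current] = 0
    chunks[current] += size_mb
    return current

placements = [append(s) for s in (10, 15, 45, 5)]
print(placements)   # ['ff45', 'ff45', 'ff46', 'ff46']
```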

A chunk server can hold a chunk lease indefinitely by requesting extensions from the master; these requests are piggybacked on its regular heartbeat messages. This is especially important when a lease is about to expire while the chunk server is still handling mutations to the chunk: the primary extends the lease rather than letting it lapse mid-write. Because a lease guarantees a single primary that decides mutation order, leases also keep a chunk’s replicas consistent across the GFS cluster.
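A minimal sketch of lease renewal, assuming the paper’s 60-second lease timeout (the class and the plain-number clock are illustrative):

```python
LEASE_SECONDS = 60   # the paper's initial lease timeout

class Lease:
    """Toy chunk lease; times are plain numbers, not wall clocks."""
    def __init__(self, now):
        self.expires = now + LEASE_SECONDS

    def extend(self, now):
        # The primary piggybacks extension requests on its regular
        # heartbeats to the master while mutations are in flight.
        self.expires = now + LEASE_SECONDS

    def valid(self, now):
        return now < self.expires

lease = Lease(now=0)         # granted at t=0, expires at t=60
lease.extend(now=50)         # renewed before expiry; now expires at t=110
print(lease.valid(now=100))  # True: still the primary at t=100
```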

The master server holds crucial information about the cluster in its memory: the file namespace, the file-to-chunk mapping and chunk locations. If the master went down, all of this state would be lost. To avoid this scenario, the master maintains an operation log that records all mutations to its state. The operation log is stored on local disk and replicated remotely. When a master starts, it replays the operation log to bring itself back to the state it was in before it went down. (Chunk locations are the exception: they are not logged; the master rebuilds them at startup by asking each chunk server which chunks it holds.)

    CREATE_DIR /map/
    CREATE_FILE /map/out1.txt
    GRANT_LEASE primary=chunksvr1 chunk=ff45
    ALLOC_CHUNK file=/map/out1.txt
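Replay can be sketched as re-applying each record in order (a hypothetical model; the field names mirror the example log, while the real log format is internal to GFS):

```python
# Hypothetical replay of the operation log: on startup the master
# re-applies each record, in order, to rebuild its in-memory state.

LOG = [
    "CREATE_DIR /map/",
    "CREATE_FILE /map/out1.txt",
    "GRANT_LEASE primary=chunksvr1 chunk=ff45",
    "ALLOC_CHUNK file=/map/out1.txt",
]

def replay(log):
    state = {"dirs": set(), "files": set(), "leases": {}}
    for record in log:
        op, *args = record.split()
        kv = dict(a.split("=") for a in args if "=" in a)
        if op == "CREATE_DIR":
            state["dirs"].add(args[0])
        elif op == "CREATE_FILE":
            state["files"].add(args[0])
        elif op == "GRANT_LEASE":
            state["leases"][kv["chunk"]] = kv["primary"]
        # ALLOC_CHUNK etc. would update the file-chunk mapping here
    return state

print(replay(LOG)["leases"])   # {'ff45': 'chunksvr1'}
```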

The operation log can grow very large, which increases the master’s startup time. To avoid this, the master periodically checkpoints its state by writing it to disk. When the master starts, it loads the latest checkpoint and replays only the operation log records written after that checkpoint, instead of replaying from the beginning of time. Shadow masters enhance read availability when the master goes down: they keep themselves up to date by applying a replica of the operation log to their own state, and can serve read-only requests in the meantime.
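A toy sketch of recovery with a checkpoint: load the checkpoint, then replay only the log suffix recorded after it (all names and the log format here are illustrative):

```python
# Toy recovery: the checkpoint already folds in the first records,
# so replay starts from the log index the checkpoint covers.

checkpoint = {"files": {"/map/out1.txt", "/map/out2.txt"}, "log_index": 2}
log = [
    "CREATE_FILE /map/out1.txt",   # already folded into the checkpoint
    "CREATE_FILE /map/out2.txt",   # already folded into the checkpoint
    "CREATE_FILE /map/out3.txt",   # written after the checkpoint
]

def recover(checkpoint, log):
    state = set(checkpoint["files"])              # load the checkpoint
    for record in log[checkpoint["log_index"]:]:  # replay only the suffix
        state.add(record.split()[1])
    return state

print(sorted(recover(checkpoint, log)))
# ['/map/out1.txt', '/map/out2.txt', '/map/out3.txt']
```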