Design differences between HDFS and GFS

We know that HDFS was originally designed and implemented following the conceptual model of GFS (the Google File System).

The writing model

HDFS simplifies the writing model: only one writer or appender is allowed at a time. Under this model, only one client may write or append to a given file at any moment.
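To make this concrete, here is a minimal sketch against the Hadoop FileSystem API, assuming a running HDFS with append enabled; the exact exception type raised (such as AlreadyBeingCreatedException) varies by Hadoop version. While one client holds an open output stream on a file, HDFS's per-file lease causes a second append attempt to fail:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleWriterDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/single-writer-demo.log");

        // First writer: creates the file and holds the write lease.
        FSDataOutputStream out = fs.create(file, true);
        out.writeBytes("writer 1\n");
        out.hflush(); // make the data visible to readers

        // Second append attempt (a second stream, for brevity, in the same process):
        // the NameNode rejects it because the file's lease is still held.
        try {
            FSDataOutputStream out2 = fs.append(file);
            out2.close();
        } catch (Exception e) {
            System.out.println("Concurrent append rejected: " + e.getClass().getSimpleName());
        }

        out.close(); // releasing the lease makes the file writable again
    }
}
```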
GFS, by contrast, allows multiple clients to write or append to the same file concurrently.
    Allowing concurrent writes brings more complex consistency issues.
When multiple clients write concurrently, no order is guaranteed between them, and even records appended successively by a single client may be interleaved with records from other clients. This means that when a client writes file data continuously, that data may end up distributed discontinuously in the file.
Consistency here means that for the same file, all clients see the same data, no matter which replica they read from.
   If multiple clients are allowed to write the same file at the same time, how can the written data be kept consistent across the replicas?
As discussed earlier, HDFS allows only one writer, which writes the replicas in a pipelined manner, so the write order is the same on every replica and the data becomes eventually consistent once the write completes. With multiple clients, all concurrent writers would have to write through the same pipeline to guarantee a consistent write order.
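This reasoning can be made concrete with a toy simulation (purely illustrative, not HDFS code): replicas that apply the same appends in one agreed order converge byte-for-byte, while replicas that each apply concurrent appends in their own arrival order can diverge.

```java
import java.util.List;

// Toy illustration: replicas that apply appends in the same order converge;
// replicas that apply them in different orders diverge.
public class OrderMatters {
    public static void main(String[] args) {
        List<String> appends = List.of("A1", "A2", "B1");

        // Pipelined writes: every replica sees the one agreed order.
        StringBuilder r1 = new StringBuilder(), r2 = new StringBuilder();
        for (String rec : appends) { r1.append(rec); r2.append(rec); }
        System.out.println(r1.toString().equals(r2.toString())); // true: replicas match

        // Unordered concurrent writes: each replica applies its own arrival order.
        StringBuilder r3 = new StringBuilder(), r4 = new StringBuilder();
        for (String rec : List.of("A1", "A2", "B1")) r3.append(rec);
        for (String rec : List.of("A1", "B1", "A2")) r4.append(rec);
        System.out.println(r3.toString().equals(r4.toString())); // false: replicas diverge
    }
}
```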
   Write Process
GFS uses a lease mechanism to guarantee a consistent write order across multiple replicas.
  The GFS Master grants a chunk lease to one of the replicas, which is called the primary replica; the other replicas are called secondary replicas. The primary replica determines the write order for the chunk, and the secondaries follow that order, which guarantees a globally consistent order. The lease mechanism is designed mainly to reduce the burden on the Master: the chunkserver holding the primary replica, rather than the Master itself, is responsible for arranging the write order in the pipeline.
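The Master-side bookkeeping can be sketched roughly as follows. All names here are hypothetical, since GFS's internals are not public; only the 60-second default lease timeout comes from the paper.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the Master's per-chunk lease bookkeeping.
class LeaseTable {
    static final long LEASE_DURATION_MS = 60_000; // GFS paper: 60s initial timeout

    record Lease(String primaryServer, long expiresAtMs) {}

    private final Map<Long, Lease> leases = new ConcurrentHashMap<>();

    /** Return the current primary for a chunk, granting a new lease if none is live. */
    Lease primaryFor(long chunkId, List<String> replicaServers) {
        long now = System.currentTimeMillis();
        return leases.compute(chunkId, (id, lease) -> {
            if (lease != null && lease.expiresAtMs() > now) {
                return lease; // an unexpired lease already designates a primary
            }
            // No live lease: designate one replica as primary (selection policy simplified).
            return new Lease(replicaServers.get(0), now + LEASE_DURATION_MS);
        });
    }
}
```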
The write process proceeds as follows:

1. The client asks the Master which chunkserver holds the lease for the chunk and where the other replicas are. If no chunkserver holds a lease, the chunk has not been written recently, and the Master grants a lease to one of the chunkservers.
2. The Master returns the locations of the primary and secondary replicas to the client. The client caches this information for future use and does not need to contact the Master again unless the chunkserver holding the primary replica becomes unreachable or reports that its lease has expired.
3. The client pushes the data to the replicas in whatever network order is optimal, and each chunkserver first caches the data in an internal LRU cache. GFS separates the data flow from the control flow, so data transmission can be scheduled according to the network topology.
4. Once all replicas have acknowledged receipt of the data, the client sends a write-request control message to the primary replica. The primary determines the final write order by assigning consecutive sequence numbers.
5. The primary replica forwards the write request to all secondary replicas, and the secondaries perform the writes in the order the primary assigned.
6. After a secondary replica finishes writing, it replies to the primary to confirm that the operation is complete.
7. Finally, the primary replica responds to the client. If an error occurred at any replica during the write, it is reported to the client, and the client initiates a retry.
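Steps 4 to 6 can be sketched as follows (a hypothetical illustration, not real GFS code): the primary serializes incoming write requests by assigning sequence numbers, applies them locally, and forwards the same order to the secondaries.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the primary replica's role in steps 4-6:
// serialize concurrent write requests with sequence numbers, apply locally,
// and forward the same order to every secondary replica.
class PrimaryReplica {
    private final AtomicLong nextSeq = new AtomicLong();
    private final List<SecondaryStub> secondaries;

    PrimaryReplica(List<SecondaryStub> secondaries) {
        this.secondaries = secondaries;
    }

    /** Data was already pushed and cached (step 3); this is the control message (step 4). */
    void write(byte[] cachedData) {
        long seq = nextSeq.getAndIncrement(); // the order every replica will follow
        applyLocally(seq, cachedData);
        for (SecondaryStub s : secondaries) {
            s.apply(seq, cachedData); // secondaries apply strictly in seq order (step 5)
        }
    }

    private void applyLocally(long seq, byte[] data) { /* write to the chunk file */ }

    interface SecondaryStub {
        void apply(long seq, byte[] data);
    }
}
```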
Both GFS and HDFS write through a pipeline, but HDFS does not separate the data flow from the control flow.
In HDFS, the order in which data travels through the write pipeline over the network is the same as the order in which it is finally written to the file.
In GFS, the order in which data travels over the network may differ from the order in which it is finally written to the file.
GFS thus strikes a good compromise between supporting concurrent writes and optimizing network data transfer.
Summary

The GFS paper was published in 2003, and most distributed file systems designed and implemented since then draw on its design ideas to a greater or lesser extent.
Among open-source distributed file systems, HDFS is the most complete implementation of the conceptual model described in the GFS paper.
However, HDFS simplifies GFS's support for concurrent writes. This article has compared the writing models and write processes of the two systems, in the hope of prompting some further thought.
