Tutorial 4: HDFS Read and Write Operation using Java API

Introduction

In the previous chapters, we learned about HDFS, the Hadoop Distributed File System, which is bundled with Hadoop and used to store its data. It is a distributed file system with the capacity to store very large data files, including audio and video files. HDFS runs on clusters of commodity hardware, which keeps the cost of data storage low. Such hardware is scalable and fault-tolerant, and its storage capacity can be increased easily over time. The HDFS architecture spreads data storage from a single machine across multiple separate machines, which is what makes it a distributed file system. In this chapter, we will discuss the HDFS architecture and how read and write operations are carried out in HDFS by HDFS clients.

 

HDFS Architecture

An HDFS cluster consists of two types of elements, known as the NameNode and the DataNodes. The former manages the file system metadata, while the latter store the actual data. In HDFS, a file is split into multiple data blocks that are distributed across these nodes throughout the cluster of commodity computers, and each block is replicated on several DataNodes. This replication makes the overall system robust and ensures high scalability and availability of data even in the event of a node failure.

  1. NameNode: It represents every file and directory in the namespace. In other words, it maintains the file system tree, which holds the metadata for all the directories and files in the system. This metadata is persisted in two files, the ‘namespace image’ and the ‘edit log’. Note that the NameNode does not permanently store the locations of data blocks on the DataNodes; this information is reconstructed from DataNode block reports every time the system starts.
  2. DataNodes: They manage the state of an HDFS storage node and handle its data blocks. A DataNode runs as a slave on each machine in the cluster and holds the data in data blocks. DataNodes perform the actual read and write operations for clients. These operations take place at the block level: each block is a chunk of 64 MB by default (128 MB from Hadoop 2.x onward) and serves as an independent storage unit. A small code sketch after this list shows how to inspect these block details from a client.
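
To see these block-level details from client code, the Hadoop FileSystem API exposes per-file metadata. The following is a minimal sketch; the NameNode URI (hdfs://localhost:9000) and the file path are placeholder assumptions, so replace them with values from your own cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; in practice this is read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // getFileStatus() returns the metadata the NameNode keeps for a file,
        // including its block size and replication factor.
        Path file = new Path("/user/hadoop/sample.txt"); // placeholder path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size (MB): " + status.getBlockSize() / (1024 * 1024));
        System.out.println("Replication factor: " + status.getReplication());

        fs.close();
    }
}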
 

Read operation in HDFS

In the HDFS distributed file system, a data read request is served by the HDFS client, the NameNode, and the DataNodes. Let’s understand this in a step-by-step manner; a Java sketch of the full sequence follows the list.

  1. The client obtains a FileSystem object, which for HDFS is an instance of DistributedFileSystem. Its ‘open()’ method initiates the read request from the HDFS client.
  2. The FileSystem object uses RPC (Remote Procedure Call) to connect to the NameNode and gather the metadata about the locations of the data blocks of the file.
  3. The NameNode returns the addresses of the DataNodes in its response. Upon receiving these addresses, an FSDataInputStream object is returned to the client. This object wraps a DFSInputStream, which manages the interactions with the DataNodes and the NameNode.
  4. Next, invoking the ‘read()’ method causes the DFSInputStream to open a connection to the first DataNode holding the file’s first block.
  5. The data is read as a stream: the ‘read()’ method is invoked repeatedly until the end of the block is reached.
  6. Upon reaching the end of a block, DFSInputStream closes that connection and moves on to locate the next DataNode for the following block.
  7. Finally, the ‘close()’ method is called once the HDFS client has finished the read operation.
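
Putting these steps together, here is a minimal sketch of a read using the Hadoop Java API. The NameNode URI and the file path are placeholder assumptions for illustration only.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder NameNode URI

        // For an HDFS URI, FileSystem.get() returns a DistributedFileSystem (step 1).
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for block locations over RPC and returns an
        // FSDataInputStream wrapping a DFSInputStream (steps 2 and 3).
        Path file = new Path("/user/hadoop/sample.txt"); // placeholder path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            // Repeated reads stream the data from the DataNode holding the
            // current block, moving on block by block (steps 4 to 6).
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } // try-with-resources calls close(), ending the read (step 7)

        fs.close();
    }
}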
 

Write operation in HDFS

In the HDFS distributed file system, a data write request is likewise served by the HDFS client, the NameNode, and the DataNodes. Let’s understand this in a step-by-step manner; a Java sketch follows the list.

  1. The DistributedFileSystem object has a ‘create()’ method, which initiates the write request from the HDFS client and creates a new file.
  2. The DistributedFileSystem object uses RPC (Remote Procedure Call) to connect to the NameNode and initiate the creation of the new file. The NameNode verifies that the file does not already exist and that the HDFS client holds the permissions needed to create it. An IOException is thrown if the file already exists or the client does not have sufficient privileges to write the file; otherwise, a new record for the file is created in the NameNode.
  3. Upon creation of the new record in the NameNode, an FSDataOutputStream object is returned to the HDFS client, which uses it to write data into HDFS.
  4. The FSDataOutputStream object wraps a DFSOutputStream object, which looks after the interaction between the DataNodes and the NameNode. The DFSOutputStream splits the data written by the client into packets, which are enqueued into the DataQueue.
  5. Another component, the DataStreamer, consumes the packets from the DataQueue. It asks the NameNode to allocate new blocks and to pick suitable DataNodes for the required replication. These DataNodes form a pipeline.
  6. The DataStreamer pushes the packets to the first DataNode in the pipeline, which forwards them to the next DataNode, and so on down the pipeline.
  7. DFSOutputStream also maintains an ‘Ack Queue’ (acknowledgement queue) of packets that are awaiting acknowledgement from the DataNodes. A packet is removed from the Ack Queue only after acknowledgements have been received from all the DataNodes in the pipeline. In the event of a failure, the packets remaining in the queue are used in the recovery process.
  8. Finally, the ‘close()’ method is called once the HDFS client has finished the write operation. Upon the final acknowledgement, the NameNode is notified that the file write operation has completed successfully.
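
As with the read, here is a minimal sketch of a write using the Hadoop Java API; the NameNode URI and the output path are placeholder assumptions for illustration only.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder NameNode URI

        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode over RPC to record the new file; it throws
        // an IOException if the file exists or the client lacks permission
        // (steps 1 to 3), and returns an FSDataOutputStream.
        Path file = new Path("/user/hadoop/output.txt"); // placeholder path
        try (FSDataOutputStream out = fs.create(file);
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out))) {
            // Written bytes are packaged into packets by the underlying
            // DFSOutputStream and pushed through the DataNode pipeline (steps 4 to 7).
            writer.write("Hello, HDFS!");
            writer.newLine();
        } // close() flushes remaining packets and notifies the NameNode (step 8)

        fs.close();
    }
}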
 

Conclusion

In this chapter, we discussed the HDFS architecture and how read and write operations are carried out in HDFS. HDFS is a robust component of Hadoop that enables quick storage and retrieval of bulk data files.




