Tuesday, 15 November 2016

HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

HDFS is an important component of Hadoop. It is a file system designed specifically for storing huge data sets on a cluster of commodity hardware, with streaming access patterns. Here, commodity hardware means cheap, ordinary hardware. HDFS uses a default block size of 64 MB, which can be raised to 128 MB depending on the needs and type of the application. Ordinary file systems typically use a block size of about 4 KB, which is inefficient for very large files. Another reason for the 64 MB block is that the amount of metadata would grow enormously if 4 KB blocks were used. For example, if we want to store 200 MB of data, it is split into 4 blocks: three blocks of 64 MB and one block of 8 MB.
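The arithmetic in the example above can be checked with a small sketch. This is plain Python, not HDFS code; the function name split_into_blocks is made up for illustration. It also shows how many blocks the metadata would have to track if the same 200 MB were stored in 4 KB blocks.

```python
KB = 1024
MB = 1024 * KB


def split_into_blocks(file_size, block_size):
    """Return the sizes (in bytes) of the blocks a file is cut into.

    Hypothetical helper for illustration only, not part of any Hadoop API.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds whatever data is left over.
        sizes.append(remainder)
    return sizes


# 200 MB with 64 MB blocks: three full blocks and one 8 MB block.
blocks = split_into_blocks(200 * MB, 64 * MB)
print([size // MB for size in blocks])  # [64, 64, 64, 8]

# The same data with 4 KB blocks needs far more entries in the metadata.
print(len(split_into_blocks(200 * MB, 4 * KB)))  # 51200
```

With 64 MB blocks there are only 4 entries to track for this file, against 51200 with 4 KB blocks.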

Hadoop uses five types of services (Job Tracker and Task Tracker belong to the MapReduce layer that runs on top of HDFS):
1. Name Node
2. Secondary Name Node
3. Job Tracker
4. Data Node
5. Task Tracker

Name Node, Secondary Name Node and Job Tracker are also called Master Services, Master Daemons or Master Nodes. Data Node and Task Tracker are called Slave Services, Slave Daemons or Slave Nodes.

Master Services can talk to each other, and likewise Slave Services can talk to each other.
Between masters and slaves, Name Node talks to Data Node and Job Tracker talks to Task Tracker; no other combinations are possible.

A Data Node runs on cheap commodity hardware. We do not need high-quality hardware for Data Nodes, because HDFS by default keeps 3 replicas of each block of a file, so the failure of a single Data Node does not mean the data is lost. The Name Node, by contrast, runs on highly reliable hardware, because it acts as the master and manages all the Data Nodes.
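To see why 3 replicas make cheap Data Nodes acceptable, here is a toy sketch. It assumes a simple round-robin placement, which is not the real HDFS placement policy; the node names and the helpers place_replicas and surviving_replicas are hypothetical. It checks that whichever single Data Node fails, every block still has at least 2 copies.

```python
def place_replicas(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct data nodes, round-robin.

    Toy placement for illustration; real HDFS uses its own placement policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + i) % len(data_nodes)]
            for i in range(replication)
        ]
    return placement


def surviving_replicas(placement, failed_node):
    """Return, per block, the nodes that still hold a copy after one failure."""
    return {
        block_id: [node for node in nodes if node != failed_node]
        for block_id, nodes in placement.items()
    }


nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical node names
placement = place_replicas(4, nodes)         # the 4 blocks of a 200 MB file
print(placement[0])  # ['dn1', 'dn2', 'dn3']

# Whichever single node fails, every block keeps at least 2 replicas.
for failed in nodes:
    survivors = surviving_replicas(placement, failed)
    assert all(len(copies) >= 2 for copies in survivors.values())
```

The price of this safety is storage: a 200 MB file with 3 replicas takes about 600 MB of raw disk space across the cluster.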


When a client needs to store data in HDFS, it approaches the Name Node and asks for space. The Name Node maintains metadata containing all the information about the data: the space allotted to the client for storage, which replica is stored on which Data Node, file sizes and so on. This metadata plays a central role in HDFS. The Name Node then assigns Data Nodes for storage, keeping 3 replicas by default, and records the complete information in its metadata. Each Data Node regularly sends a block report and a heartbeat to the Name Node, so the Name Node knows that the Data Node is alive and working properly. If a Data Node stops sending heartbeats, it is considered dead; its data is then maintained on other Data Nodes and the metadata is updated accordingly. If the Name Node fails, the whole system becomes unusable. That is why highly reliable hardware is used for the Name Node, and why it is called a single point of failure.
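The bookkeeping described above can be sketched as a toy model. This is not the real Name Node implementation: the class ToyNameNode, its method names, the 30-second timeout and the block id "blk_1" are all assumptions made for illustration. Time is passed in explicitly so the example is deterministic.

```python
class ToyNameNode:
    """Toy model of Name Node metadata and liveness tracking (illustration only)."""

    def __init__(self, heartbeat_timeout=30.0):
        self.heartbeat_timeout = heartbeat_timeout  # made-up value, in seconds
        self.last_heartbeat = {}   # data node -> time of its last heartbeat
        self.block_locations = {}  # block id -> set of data nodes holding it

    def heartbeat(self, node, now):
        """Record that `node` was alive at time `now`."""
        self.last_heartbeat[node] = now

    def block_report(self, node, block_ids, now):
        """Record which blocks `node` holds; a report also proves it is alive."""
        self.heartbeat(node, now)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(node)

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return {
            node
            for node, seen in self.last_heartbeat.items()
            if now - seen > self.heartbeat_timeout
        }

    def under_replicated(self, now, replication=3):
        """Blocks with fewer than `replication` copies on live nodes."""
        dead = self.dead_nodes(now)
        result = {}
        for block_id, holders in self.block_locations.items():
            live = holders - dead
            if len(live) < replication:
                result[block_id] = live
        return result


nn = ToyNameNode(heartbeat_timeout=30.0)
for node in ("dn1", "dn2", "dn3"):
    nn.block_report(node, ["blk_1"], now=0.0)

# dn1 and dn2 keep sending heartbeats; dn3 goes silent.
nn.heartbeat("dn1", now=40.0)
nn.heartbeat("dn2", now=40.0)

print(sorted(nn.dead_nodes(now=45.0)))                 # ['dn3']
print(sorted(nn.under_replicated(now=45.0)["blk_1"]))  # ['dn1', 'dn2']
```

In this model, a block that shows up in under_replicated is one the Name Node would need to copy to another live Data Node to get back to 3 replicas.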

