HDFS is an important component of Hadoop. It is a file system specially designed for storing huge data sets on a cluster of commodity hardware with streaming access patterns. Here commodity hardware means cheap, off-the-shelf machines. HDFS uses a default block size of 64 MB, which can be extended to 128 MB depending on the needs and type of the application. Normal file systems use a block size of around 4 KB; if HDFS used 4 KB blocks, a large file would produce an enormous number of blocks, and the Name Node's metadata, which grows with the block count, would waste memory. This is why HDFS defaults to 64 MB. For example, if we want to store 200 MB of data, the data is split into 4 blocks: three blocks of 64 MB and one block of 8 MB.
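
As a quick illustration of that arithmetic, the following sketch (the 200 MB file size is just the example above) computes how a file is split into 64 MB blocks:

    public class BlockSplit {
        public static void main(String[] args) {
            final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, the HDFS default
            long fileSize = 200L * 1024 * 1024;        // the 200 MB file from the example

            long fullBlocks = fileSize / BLOCK_SIZE;   // 3 full blocks of 64 MB
            long remainder  = fileSize % BLOCK_SIZE;   // one partial block of 8 MB

            System.out.println("Full 64 MB blocks: " + fullBlocks);
            if (remainder > 0) {
                System.out.println("Last block: " + remainder / (1024 * 1024) + " MB");
            }
        }
    }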
HDFS uses five types of services:
1. Name Node
2. Secondary Name Node
3. Job Tracker
4. Data Node
5. Task Tracker
Name Node, Secondary Name Node, and Job Tracker are also called Master Services, Master Daemons, or Master Nodes, while Data Node and Task Tracker are called Slave Services, Slave Nodes, or Slave Daemons. Every Master Service can talk to every other Master Service, and every Slave Service can talk to every other Slave Service. Across the two groups, only two combinations are possible: the Name Node talks to the Data Nodes, and the Job Tracker talks to the Task Trackers.
A Data Node runs on commodity hardware, that is, cheap hardware; we need not build Data Nodes from high-quality machines, because HDFS by default keeps 3 replicas of each block, so there is no need to worry about data loss. The Name Node, by contrast, runs on highly reliable hardware, as it acts as the master and manages all the Data Nodes.
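
The replication factor is also visible to clients. Below is a minimal sketch using Hadoop's Java FileSystem API that sets the replication of one file to 3; the cluster address hdfs://namenode:9000 and the file path are assumptions made for the example:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Connect to the Name Node (the address is an assumption for this sketch)
            FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf);

            // Ask HDFS to keep 3 replicas of every block of this file
            fs.setReplication(new Path("/user/demo/data.txt"), (short) 3);
            fs.close();
        }
    }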
When a client needs to store data in HDFS, it approaches the Name Node and asks for space. The Name Node maintains metadata that contains all the information about the data: the space allotted to the client for storage, which replica is stored on which Data Node, the file size, and so on. This metadata has a wide role to play in HDFS. The Name Node assigns Data Nodes for storage, maintains 3 replicas by default, and records the complete information in the metadata file. Each Data Node sends a block report and a heartbeat to the Name Node to show that it is alive and working properly. If a Data Node stops sending block reports, the Name Node considers it dead, re-creates the lost data on other Data Nodes, and records the change in the metadata. If the Name Node fails, the whole system is damaged; that is why highly reliable hardware is used for the Name Node, and why it is called a single point of failure.
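
To make the client side of this flow concrete, here is a minimal sketch of writing a file through Hadoop's Java FileSystem API. The client contacts the Name Node when the file is created, and the bytes then stream to the Data Nodes the Name Node assigned; the cluster URI and path are assumptions for the example:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The client first approaches the Name Node at this (assumed) address
            FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf);

            // create() asks the Name Node to allot space and choose Data Nodes;
            // the data itself is streamed to those Data Nodes, not to the Name Node
            FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"));
            out.writeBytes("hello HDFS");
            out.close();
            fs.close();
        }
    }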