What are the nodes of HDFS (Hadoop Distributed File System)?
1. NameNode and DataNode
To discuss the architecture of such a system we need to introduce two terms: NameNode and DataNode. HDFS is a master-slave system.
NameNode is "the master" of the storage system. It manages the file system namespace and knows where every file's data can be found – the file-to-block mapping. The NameNode does not store file data itself; it only maintains this mapping, knowing at any moment where the blocks of each file are stored. Once the NameNode has resolved a file name, it redirects the client to the appropriate DataNodes.
DataNode is "the slave" that stores the actual content of the files. Clients contact DataNodes directly to read and write data.
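The division of labor above can be sketched in a few lines. This is an illustrative model, not the real HDFS API: the class names, the address format, and the shape of the block map are all assumptions made for the example.

```python
# Illustrative sketch (not the real HDFS API): the NameNode maps file paths
# to the DataNodes holding their blocks; file bytes live only on DataNodes.

class NameNode:
    def __init__(self):
        # path -> list of (block_id, [datanode addresses])
        self.block_map = {}

    def add_file(self, path, blocks):
        self.block_map[path] = blocks

    def locate(self, path):
        """Resolve a path to its block locations; no file data flows here."""
        return self.block_map[path]

class DataNode:
    def __init__(self, address):
        self.address = address
        self.blocks = {}          # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

# The client asks the NameNode *where*, then reads *from* a DataNode.
name_node = NameNode()
dn = DataNode("dn1:50010")
dn.blocks["blk_0"] = b"hello"
name_node.add_file("/logs/app.log", [("blk_0", [dn.address])])

block_id, locations = name_node.locate("/logs/app.log")[0]
data = DataNode  # the client would now connect to locations[0] and read
```

Note how `locate` returns only metadata; the bytes themselves never pass through the NameNode.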
Since the system must survive the failure of a component, in addition to the NameNode there is a SecondaryNameNode. This component periodically creates checkpoints of the NameNode's state, and if the NameNode fails these checkpoints can be used to restore its state from before the failure. Note that the SecondaryNameNode never takes over the NameNode's role: it does not resolve file locations. Its only purpose is to create checkpoints for the NameNode.
2. Data storage
All data is stored as files. From the client's point of view a file is never divided into parts, even though this happens internally: each file is split into blocks that end up stored on one or more DataNodes. A large file can be spread across 2, 3 or even 20 nodes. The NameNode controls this procedure and may require blocks to be replicated in several locations.
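The internal splitting can be sketched as a simple chunking function. The tiny block size here is only to keep the example readable; real HDFS blocks default to 128 MB in modern Hadoop.

```python
# Sketch: splitting a file's bytes into fixed-size blocks, as HDFS does
# internally. A 4-byte block is used purely for illustration.

def split_into_blocks(data: bytes, block_size: int):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"abcdefghij", 4)
# → [b"abcd", b"efgh", b"ij"]  (the last block may be shorter)
```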
Figure 1 displays the architecture of the system.
What is interesting about this architecture is how it handles file access. The data read or written by clients never passes through the NameNode. Therefore, even though there is only one NameNode in the whole system, once a file's location is resolved no client request needs to go through it.
3. File structure
The way files are stored and accessed by clients is very simple. The client can define a structure of directories and files. All this metadata is stored by the NameNode; it is the only component that knows how the client's folders and files are defined. Features such as hard links or soft links are not supported by HDFS.
Because stored data is very important, HDFS allows us to set the number of copies (the replication factor) we want for each file. This can be set when the file is created or at any time afterwards. The NameNode knows how many copies must exist for each file and makes sure they do.
For example, when we increase the replication factor of a file, the NameNode makes sure all of its blocks are replicated again. The NameNode's job doesn't end here: it receives a periodic "I'm alive" signal – a heartbeat – from each DataNode. If a DataNode stops sending heartbeats, the NameNode automatically starts the recovery procedure and re-replicates the blocks that node was holding.
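The heartbeat check can be sketched as follows, assuming a simplified model in which we only track the time of each node's last heartbeat; the node names and timeout value are invented for the example.

```python
# Sketch: DataNodes send periodic heartbeats; any node silent longer than
# the timeout is considered dead, and its blocks get re-replicated.

def find_dead_nodes(last_heartbeat, now, timeout):
    """Return nodes whose last heartbeat is older than `timeout` seconds."""
    return [node for node, t in last_heartbeat.items() if now - t > timeout]

last_seen = {"dn1": 100.0, "dn2": 40.0, "dn3": 99.0}
dead = find_dead_nodes(last_seen, now=105.0, timeout=30.0)
# dn2 has been silent for 65 s → considered dead
```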
The way data replicates is quite complex, and HDFS must take many factors into account. Creating a new copy consumes bandwidth, so a balancer handles data distribution in the cluster and tries to choose copy locations that keep the load balanced. There are several placement policies; the default one places the first replica on the writer's own node. In terms of distributing replicas across racks, two thirds end up on the same rack and the remaining third on a separate rack.
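The default rack-aware placement described above can be sketched for a replication factor of 3. This is a deliberately simplified model: the rack topology, node names, and selection logic (first available nodes rather than a load-aware choice) are assumptions of the example.

```python
# Sketch of rack-aware placement for replication factor 3: first replica on
# the writer's node, the other two together on a single different rack, so
# 2/3 of the replicas share one rack and 1/3 sits on another.

def place_replicas(writer, racks):
    """racks: dict rack_name -> list of node names; writer is in some rack."""
    local_rack = next(r for r, nodes in racks.items() if writer in nodes)
    remote_rack = next(r for r in racks if r != local_rack)
    return [writer] + racks[remote_rack][:2]

racks = {"rackA": ["dn1", "dn2"], "rackB": ["dn3", "dn4"]}
replicas = place_replicas("dn1", racks)
# → ["dn1", "dn3", "dn4"]: one replica local, two on one remote rack
```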
Data may also be moved automatically from one DataNode to another if the balancer detects that data is not evenly distributed.
All copies that exist for a file are usable. Depending on where the data is requested from, the client is directed to the closest copy. HDFS is therefore a system that knows what the internal network looks like – including every DataNode, every rack and the connections between them.
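"Closest" can be sketched with a simple network-distance metric. The three-level metric below (same node, same rack, different rack) is an assumption chosen to illustrate the idea, not the exact distance function HDFS uses.

```python
# Sketch: pick the closest replica using a toy distance metric:
# 0 = same node, 1 = same rack, 2 = different rack.

def distance(reader, replica):
    """reader and replica are (rack, node) pairs."""
    if reader[1] == replica[1]:
        return 0
    return 1 if reader[0] == replica[0] else 2

def closest_replica(reader, replicas):
    return min(replicas, key=lambda rep: distance(reader, rep))

replicas = [("rackB", "dn3"), ("rackA", "dn2")]
best = closest_replica(("rackA", "dn1"), replicas)
# → ("rackA", "dn2"): a same-rack copy beats one on a different rack
```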
The namespace kept by the NameNode in RAM can be accessed very quickly. It is copied to disk at precisely defined intervals – the image written to disk is called FsImage. Because the copy on disk lags behind the one in memory, there is also a file in which every change made to the file and folder structure is logged – the EditLog. This way, if something happens to the NameNode or its memory, recovery is simple and includes the latest changes.
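The checkpoint-plus-log recovery idea can be sketched as restoring the last image and replaying the logged operations on top. The dictionary representation and the `create`/`delete` operation names are illustrative, not the real on-disk formats.

```python
# Sketch: recover the namespace from the last checkpoint (FsImage) and
# replay the changes recorded since then (EditLog).

def recover(fsimage, edit_log):
    namespace = dict(fsimage)            # start from the last checkpoint
    for op, path in edit_log:            # replay logged changes in order
        if op == "create":
            namespace[path] = "file"
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

fsimage = {"/a.txt": "file", "/b.txt": "file"}
edit_log = [("delete", "/b.txt"), ("create", "/c.txt")]
state = recover(fsimage, edit_log)
# → {"/a.txt": "file", "/c.txt": "file"}
```

The shorter the interval between checkpoints, the less log there is to replay on restart; that is exactly the trade-off the SecondaryNameNode's checkpointing addresses.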
4. Data manipulation
An interesting detail is how a client creates a file. Data is not written directly to a DataNode at first, but staged in a temporary local location. Only when enough data for a write operation has accumulated is the NameNode notified, and the data is then copied to the DataNodes.
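This staging behavior can be sketched as a buffer that ships data only in full-block units. The class name and the `flush_to_datanode` callback are stand-ins invented for the example; the real write pipeline is considerably more involved.

```python
# Sketch of client-side staging: writes accumulate in a local buffer and
# are shipped out only when a full block's worth of data is available.

class StagedWriter:
    def __init__(self, block_size, flush_to_datanode):
        self.block_size = block_size
        self.flush = flush_to_datanode   # stand-in for the write pipeline
        self.buffer = b""

    def write(self, data: bytes):
        self.buffer += data
        while len(self.buffer) >= self.block_size:
            self.flush(self.buffer[:self.block_size])
            self.buffer = self.buffer[self.block_size:]

    def close(self):
        if self.buffer:                  # ship the final partial block
            self.flush(self.buffer)
            self.buffer = b""

shipped = []
w = StagedWriter(4, shipped.append)
w.write(b"abcdef")   # only the first full block is shipped immediately
w.close()            # the remainder goes out on close
# shipped == [b"abcd", b"ef"]
```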
When a client deletes a file, it is not physically removed from the system right away. The file is only marked for deletion and moved into a trash directory. Only the latest copy of the file is kept in the trash, and the client can go into that folder to retrieve it. Files in the trash are permanently deleted after a certain period of time.
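The trash-with-expiry behavior can be sketched as follows; the data structures and the retention value are assumptions made for illustration.

```python
# Sketch: deleting moves a file into a trash map with a timestamp; a sweep
# permanently removes entries older than the retention period.

def delete(files, trash, path, now):
    trash[path] = (files.pop(path), now)   # only the latest copy is kept

def purge_trash(trash, now, retention):
    for path in [p for p, (_, t) in trash.items() if now - t > retention]:
        del trash[path]

files = {"/old.txt": b"data"}
trash = {}
delete(files, trash, "/old.txt", now=0)
purge_trash(trash, now=10, retention=60)   # too early: still recoverable
assert "/old.txt" in trash
purge_trash(trash, now=100, retention=60)  # expired: gone for good
```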
This article covered how Hadoop appeared, its main properties and how data is stored. We saw that HDFS is a system built to work with large amounts of data, and it does its job extremely well at minimal cost.
The next article will explain how we can process this information.