How Hadoop stores data
Trends and Big Data
We have all heard about trends: there are trends in music, in fashion and, of course, in IT. Some trends announced for 2013 are already part of our lives. Who hasn't heard about the cloud, machine-to-machine (M2M) or NoSQL? All of them have become part of our ordinary day. Big Data is a trend that existed last year as well, and it has remained one of the strongest.
In the rest of this article I would like to talk about Hadoop, because Big Data doesn't exist without Hadoop. Big Data can mean an amount of bytes so large that the client doesn't even know how to process it. Clients started asking for a scalable way of processing data a long time ago: processing 50 TB of data is a problem on an ordinary system, and computer systems for indexing and processing that amount of data are extremely costly, not just financially but also in terms of time.
At this point Hadoop is one of the best (maybe the best) solutions for processing large amounts of data. First of all, let's see how it appeared and how it ended up being a system that can run on 40,000 distributed instances without a problem.
A little bit of history
Several years ago (2003-2004) Google published papers describing how it handles large amounts of data, explaining the solutions it uses to store and process them. Because the number of sites on the internet was growing so fast, the Apache Software Foundation started creating Apache Hadoop based on those papers. We could say that they became the standard for storing, processing and analysing data.
Two important features that make Hadoop a system that many companies adopt for processing Big Data are scalability and the unique way in which data is processed and stored. We will talk more about these features a little later in the article.
Hadoop has been and will remain an open source project. It was supported by Yahoo at the beginning, because they needed an indexing system for their search engine. Yahoo later ended up using it for advertising as well, because the system worked so well.
A very interesting thing is that Hadoop didn't appear overnight, and it wasn't always as robust as it is today. At the beginning there were scalability problems when it had to scale to 10-20 nodes, as well as performance issues. Companies such as Yahoo, IBM, Cloudera and Hortonworks saw the value that Hadoop could bring to their business and invested in it; each of them had a similar in-house system that tried to solve the same problems. Hadoop is now a robust system that can be used successfully, and companies such as Yahoo, Facebook, IBM, eBay, Twitter and Amazon run it without any problems.
Since data can be stored in Hadoop very easily and the processed information occupies very little space, any legacy or large system can keep its data for a long time, easily and at minimum cost.
Data storage – HDFS
The way Hadoop is built is very interesting. Each part, from file storage to processing and scalability, was designed for something big. One of its most important and interesting components is the file storage system: the Hadoop Distributed File System (HDFS).
Generally, when we talk about high-capacity storage systems, we think of custom hardware that is extremely costly to buy and maintain. HDFS doesn't require special hardware: it runs smoothly on ordinary configurations and can even be used on our home or office computers.
1. Sub-system management
Perhaps one of the most important properties of this system is the way it views hardware problems. It was designed to run on tens or hundreds of instances, so a hardware failure is treated not as an exceptional error but as a normal, expected event. HDFS knows that not all of its registered components will be working at any given time. Because the system is aware of this fact, it is always ready to detect a problem as soon as it appears and to start the recovery procedure.
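The idea can be illustrated with a minimal sketch (hypothetical code, not the actual HDFS implementation): a master tracks the last heartbeat received from each storage node, treats a stale heartbeat as an expected event, and schedules re-replication of the blocks the silent node held. The class and node names are invented for illustration; real HDFS uses much longer timeouts and a far richer protocol.

```python
# Hypothetical sketch of HDFS-style failure detection: a stale
# heartbeat is part of the normal flow, and recovery starts from it.
HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative value only

class Master:
    def __init__(self):
        self.last_heartbeat = {}   # node id -> timestamp of last heartbeat
        self.blocks_on_node = {}   # node id -> set of block ids it stores

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def dead_nodes(self, now):
        # A node whose heartbeat is older than the timeout is presumed dead.
        return [n for n, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]

    def blocks_to_recover(self, now):
        # Blocks held by dead nodes must be re-replicated elsewhere.
        lost = set()
        for node in self.dead_nodes(now):
            lost |= self.blocks_on_node.get(node, set())
        return lost

master = Master()
master.blocks_on_node = {"node1": {"b1", "b2"}, "node2": {"b2", "b3"}}
master.heartbeat("node1", now=0.0)
master.heartbeat("node2", now=0.0)
master.heartbeat("node1", now=12.0)            # node2 has gone silent
print(master.dead_nodes(now=12.0))             # ['node2']
print(sorted(master.blocks_to_recover(now=12.0)))  # ['b2', 'b3']
```

Note that losing node2 does not lose block b2, because a replica of it also lives on node1; only the replication count needs to be restored.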
Each component of the system stores a part of the files, and each stored block is replicated in one or more locations. HDFS is meant to store files that are several gigabytes in size and can even reach several terabytes. The system is prepared to split a file into blocks and distribute them across one or more instances.
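A toy sketch of that splitting and replication, with invented sizes and node names (real HDFS defaults to 128 MB blocks and a replication factor of 3, and places replicas rack-aware rather than round-robin):

```python
# Hypothetical sketch: a large file is split into fixed-size blocks,
# and each block is replicated on several nodes.
BLOCK_SIZE = 4      # bytes, for illustration only
REPLICATION = 2     # illustrative; HDFS defaults to 3

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    # Simple round-robin placement; real HDFS is rack-aware.
    return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i in range(len(blocks))}

blocks = split_into_blocks(b"hello hdfs world")
placement = place_replicas(blocks, ["node1", "node2", "node3"])
print(len(blocks))     # 4
print(placement[0])    # ['node1', 'node2']
```

Because every block lives on more than one node, losing any single node never loses data, and a large file never has to fit on any single machine.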
2. Data access
A normal file system's main purpose is to store data and send it back to us for processing. HDFS is totally different. Because Hadoop works with large amounts of data, transferring that data for processing would be a problem regardless of the system used, so HDFS solves it in a very innovative way: it allows us to send the processing logic to the components where the data is kept. This means that the data needed for processing is never transferred; only the final result must be passed on, and only when needed.
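This "move the computation to the data" idea can be sketched in a few lines (hypothetical data and node names, invented for illustration): each node runs the processing function on its local data, and only the small per-node results travel over the network.

```python
# Hypothetical sketch of data locality: each node holds its own log
# lines; the counting function is "shipped" to the nodes, and only the
# small counts are transferred back, never the raw data.
data_on_node = {
    "node1": ["error: disk", "ok", "error: net"],
    "node2": ["ok", "ok"],
    "node3": ["error: disk"],
}

def count_errors(lines):
    # The processing logic that would be sent to each node.
    return sum(1 for line in lines if line.startswith("error"))

# Each node computes locally; only one integer per node moves.
partial = {node: count_errors(lines) for node, lines in data_on_node.items()}
total = sum(partial.values())
print(partial)   # {'node1': 2, 'node2': 0, 'node3': 1}
print(total)     # 3
```

Shipping the few bytes of `count_errors` to three nodes is vastly cheaper than pulling all the raw lines to one machine, which is exactly the trade HDFS and MapReduce make at scale.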
You would expect such a system to have a highly complex versioning mechanism that allows multiple writers on the same file. In fact, HDFS is a storage system that allows only one writer at a time, alongside multiple readers. It is designed this way because of the type of data it contains: this data rarely changes, so it doesn't need modification. The logs of an application, for example, will never change, and the same is true of data obtained from an audit. Very often, data stored after processing ends up being erased or is never changed.
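A minimal sketch of that single-writer, multiple-reader contract (hypothetical class, not the HDFS API): a file accepts appends from at most one client at a time, while any number of readers can see its contents.

```python
# Hypothetical sketch of the single-writer / multiple-reader model:
# opening a file for writing while another writer holds it fails,
# but reading is always allowed.
class SingleWriterFile:
    def __init__(self):
        self.chunks = []
        self.writer = None   # at most one writer at any moment

    def open_for_append(self, client):
        if self.writer is not None:
            raise IOError(f"already being written by {self.writer}")
        self.writer = client

    def append(self, client, data):
        if client != self.writer:
            raise IOError("not the current writer")
        self.chunks.append(data)

    def close_writer(self, client):
        if client == self.writer:
            self.writer = None

    def read(self):
        # Any number of readers may call this concurrently.
        return b"".join(self.chunks)

f = SingleWriterFile()
f.open_for_append("client-a")
f.append("client-a", b"log line 1\n")
try:
    f.open_for_append("client-b")   # a second writer is rejected
except IOError as e:
    print(e)                        # already being written by client-a
print(f.read())                     # b'log line 1\n'
```

Giving up concurrent writers removes the need for any versioning or merge machinery, which is a sensible trade for append-mostly data such as logs.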
Before talking about the architecture of the system and how it works, I would like to mention another property it has: portability. HDFS is used not only as part of Hadoop but also as a standalone storage system, and I think this property has helped it become widespread.
From a software point of view, it is written in Java and can run on any system with a Java runtime.