MapReduce on Hadoop – Big Data in Action
The previous article showed us Hadoop’s secret when it needs to store hundredths of TB. Hadoop is a system that can store and manipulate large amounts of data very easily, based on a simple master-slave architecture.
Hadoop contains two types of nodes for data storage. The NameNode is the node that plays the role of master. It knows the name and locations of each file that Hadoop stores. It is the only node that can identify the location of a file based on the file name. We can have 1 to n nodes that store file contents around this node. The name of this type of node is DataNode.
Hadoop stores big data without any kind of problems. But it became known as the system that can process big data in a simple, fast and stable way. It is a system that can process and extract the information that we want from hundreds of TB of data. This is why Hadoop is the king of big data. In this post we will discover the secret of data processing – how Hadoop manages to do this.
The secret that gives us the ability to process data is called Hadoop. This paradigm was not invented by Hadoop, but it manages to implement it very well. The first meeting with MapReduce will be hard for us. It will be pretty complicated to understand it. Each person that wants to use MapReduce needs to understand the MapReduce paradigm first.
Without understanding MapReduce we will not be able to know if Hadoop is the solution for our problem and what kind of data we should expect from Hadoop.
MapReduce and Tuples
Don’t expect Hadoop to be a system that stores data on tables. This system doesn’t have the concept of tables. It only works with tuples that are formed by a key and a value. This is the only thing that Hadoop uses to extract data. Each task that is executed in this system will accept these tuples as input. Of course the output of a task will be formed by (key, values) pairs. Each pair can contain one or more values.
Even if this tuple seems to be trivial, we will see that this is the only thing that we need if we want to process data.
The MapReduce process is made up of two differents steps – Map and Reduce. The Map is the process used to convert the input data into a new set of data. The data obtained after this step is only intermediate data that will be used in the next step. We have the option to persist this data, but this information is generally not relevant for the end user.
The Map action is not executed on only one node. This action is executed on 1 to m DataNode type nodes. Each DataNode on which this action is executed will contain the input data – which is why we execute the Map over a part of input data on each node. The size resulting from this action is smaller than the input data. This data can be processed more easily. This step gives us results in the memory. The result is not written to the disk.
We can image that the output we have attained from this step is like a summary of our data. We will obtain different results based on the input and how we want to map the input data. The output data attained after this step doesn’t need to have the same format as the input data. The result is partition based on the function that uses the key of the tuple. A hash function is applied in general, but we can define any kind of partitioning mechanism.
The intermediate result can be used by Hadoop for different operations. This step allows us to execute actions like sorting or shuffle. These small steps can prepare the data for the next step. These operations can and are also executed after the Reduce step.
We can have 10 to 100-150 operations at the same time on each node where the map reduce is executed from the parallelise point of view. The number of concurrent operations is dictated by the hardware performance and the complexity of the Map action.
Once we have the intermediate results, we can start the next step of processing – Reduce. In comparison with the Map operation, the Reduce step operation cannot be executed on each node of Hadoop. This operation will be executed on only a small part of the nodes. This is happening because the size of data that we need to process was already reduced. Each data is partitioned for each Reducer.
If the Map reduce was formed by only one step, we will see that Reduce contains 3 main steps:
In the moment when the Shuffle step is executed, each DataNode that was involved in the Map operation starts to send the results to the nodes that will run the Reduce operation. The data is send over an HTTP connection. Because Hadoop runs in a private network, we don’t have any kind of security problems.
All the key value pairs are sent and sorted based on the key. This needs to be done because there are cases when we can have the same key from different nodes. In general, this step is done in parallel with the shuffle process.
Once the shuffle step ends, the Hadoop system will start to carry out another sort. In this moment Hadoop can control how the data is sorted and how the results will be grouped. This sort step gives us the possibility to sort items not only by key, but also based on different parameters. This operation is executed on disk and also on memory.
The last step that needs to be executed is the Reduce. In the moment when this operation is executed, the final results will be written on disk. During this step, each tuple is formed from a key and a collection of values. From this tuple, the Reduce operation will select a key and only one value – the value will represent the final value.
Even if the Reduce step is very important, there are cases when this step is not necessary. In these cases the intermediate data is the final result for the end user.
The MapReduce operation requires two types of services – JobTracker and TaskTracker. These two types of services are in a master-slave relationship that is very similar to the one that we saw earlier on how the data is stored – NameNode and DataNode.
The main scope of the JobTracker is to schedule and monitor each action that is executed. If one of the operations fails, then the JobTracker is capable of rerunning the action.
JobTracker discusses with the NameNode and programs the actions in a way that enables each job to be executed on the DataNode that has the input data – this way no input data is sent over the wire.
TaskTracker is a node that accepts Map, Reduce and Suffle operations. This is usually the DataNode where the input data can be found, but there can also be exceptions to this rule. Each TaskTracker has a limited number of jobs that can be executed – slots. Because of this the JobTracker will try to execute jobs on the TaskTracker that has free slots.
It is interesting how jobs are executed from the execution model. Each job is executed on a separate JVM process. Because of this if something happens (an exception appears), only one job will be affected. The rest of the jobs will run without problems.
Until now we have discovered how MapReduce works. The theory is very good, but we also need to practice. I propose a small example that will help us understand how MapReduce works. In this way we will be able to understand in a simple way how MapReduce is doing its magic.
We will start from the next problem. We have hundreds of files that contain the number of accidents from each city of Europe that happened every month. Because the EU is made up of different countries that have different systems we will end up with a lot of files. Because of this, we have files that contain information from the cities of a country, others that contain information for only one city and so on. Let’s assume that we have the following file format:
London, 2013 January, 120
Berlin, 2013 January, 300
Roma, 2013 February, 110
Berlin, 2013 March, 200
Based on the input data we need to calculate the maximum number of accidents that took place in each city during a month. This simple problem can become a pretty complicated one when we have 10 TB of input data. In this case Hadoop is the perfect solution for us.
The first operation from MapReduce process is Map. In this moment each file will be processed and a key value collection will be obtained. In our case the key will be represented by the name of the city and the value will be the number of accidents. We will extract the maximum number of accidents from each city during a month from each file. This would represent the Map operation and the output would be something like this:
(London, 120), (Berlin, 300), (Rome, 110), (London, 100), (Rome, 210), …
This intermediate result has no value for us (yet). We need to extract the maximum number of accidents for each city. The Reduce operation will be applied now. The final output would be:
A similar mechanism is used by Hadoop. The power of this mechanism is its simplicity. Because it is such a simple mechanism, it can be duplicated and controlled very easily over the network.
In this article we found out how Hadoop processes the data using MapReduce. We discovered that the core of the mechanism is very simple, but is duplicated on all the available nodes. We can have more than one job on each node that can run at the same time. In conclusion we could say that all the tasks that are executed use parallelism to their advantage. Hadoop tries to use all available resources.
As a remark, don’t forget that Hadoop’s native language is Java, but it has support for other languages such as Python, C# or PHP.