An Efficient MapReduce Scheduling Algorithm in Hadoop

Authors: R. Thangaselvi; S. Ananthbabu; R. Aruna
DIN
IJOER-DEC-2015-30
Abstract

Hadoop is a free, Java-based programming framework that supports the processing of large datasets in a distributed computing environment. MapReduce is the technique Hadoop uses for processing and generating large datasets with a parallel, distributed algorithm on a cluster. A key benefit of MapReduce is that it automatically handles failures and hides the complexity of fault tolerance from the user. By default, Hadoop uses the FIFO (First In, First Out) scheduling algorithm, in which jobs are executed in the order of their arrival. This method works well in a homogeneous cloud but results in poor performance in a heterogeneous cloud. Later, the LATE (Longest Approximate Time to End) algorithm was developed, which reduces FIFO's response time by a factor of 2 and gives better performance in a heterogeneous environment. The LATE algorithm is based on three principles: i) prioritising tasks to speculate, ii) selecting fast nodes to run on, and iii) capping speculative tasks to prevent thrashing. Although it takes action on slow tasks, it cannot compute the remaining time of tasks correctly and so cannot find the truly slow tasks. Finally, the SAMR (Self-Adaptive MapReduce) scheduling algorithm was introduced, which finds slow tasks dynamically by using the historical information recorded on each node to tune its parameters. SAMR reduces execution time by 25% compared with FIFO and by 14% compared with LATE.
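The LATE heuristic summarised above can be illustrated with a small sketch. The remaining-time estimate below follows the published LATE formulation (time left is progress remaining divided by progress rate); the task record fields and the 25th-percentile slow-task threshold are illustrative defaults, not code from this paper:

```python
def estimate_time_left(progress_score, elapsed_seconds):
    """LATE-style estimate: progress_rate = progress / elapsed,
    time_left = (1 - progress) / progress_rate."""
    if progress_score <= 0:
        return float("inf")  # no progress yet; remaining time is unknown
    progress_rate = progress_score / elapsed_seconds
    return (1.0 - progress_score) / progress_rate

def pick_speculation_candidate(tasks, slow_task_threshold=0.25):
    """Among tasks whose progress rate falls below the SlowTaskThreshold
    percentile, pick the one with the longest approximate time to end --
    the task LATE would speculatively re-execute first."""
    rates = sorted(t["progress"] / t["elapsed"] for t in tasks)
    cutoff = rates[int(len(rates) * slow_task_threshold)]
    slow = [t for t in tasks if t["progress"] / t["elapsed"] <= cutoff]
    return max(slow, key=lambda t: estimate_time_left(t["progress"], t["elapsed"]))
```

For example, given three tasks that have all run for 10 seconds with progress scores 0.9, 0.5, and 0.2, the task at 0.2 has the slowest rate and the longest estimated time to end, so it becomes the speculation candidate.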

Keywords
Hadoop; MapReduce; Cloud Computing; Scheduling; SAMR; Tuning.
Introduction

Hadoop is a software library framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model [1]. MapReduce is the data-processing framework that automatically handles failures. It deals with the implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster [2]. MapReduce is used in cloud computing because it hides the complexity of fault tolerance from the programmer [3]. The input data is split and fed to each node in the map phase. The results generated in this phase are shuffled, sorted, and then fed to the nodes in the reduce phase [4]. By default, Hadoop schedules tasks using the FIFO technique, which is static [5]. Later, several techniques were developed that support homogeneous tasks. LATE, a dynamic scheduling technique, was introduced to schedule jobs in a heterogeneous environment [6]. Then the SAMR MapReduce scheduling technique was developed, which uses historical information to find slow nodes and launch backup tasks. The historical information is stored on each node in XML format. SAMR adjusts the time weight of each stage of the map and reduce tasks according to this historical information [7]. It decreases the execution time of a MapReduce job and improves overall MapReduce performance in a heterogeneous environment. In this paper, we tune the parameters using the k-means clustering technique, also known as Lloyd's algorithm [8], and then assign tasks to each node, thus improving the performance of Hadoop in a heterogeneous environment. In Hadoop 1, a single NameNode managed the entire namespace for a Hadoop cluster [9].
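The parameter-tuning step described above relies on Lloyd's algorithm, which alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its cluster. A minimal one-dimensional sketch is shown below; the per-node historical task times are illustrative data, not measurements from this paper:

```python
import random

def lloyd_kmeans(values, k=2, iters=100, seed=0):
    """Plain Lloyd's algorithm on 1-D data: repeat (1) assign each
    point to its nearest centroid, (2) recompute centroids as cluster
    means, until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters

# Illustrative historical average task times (seconds) per node.
node_times = [12.0, 11.5, 13.1, 30.2, 28.7, 12.4]
centroids, clusters = lloyd_kmeans(node_times, k=2)
```

With k = 2 the nodes separate into a fast group (centroid near 12 s) and a slow group (centroid near 29 s); the scheduler can then treat nodes in the slow cluster as candidates for backup tasks when tuning SAMR's parameters.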
With HDFS federation, multiple NameNode servers manage namespaces, which allows for horizontal scaling, performance improvements, and multiple namespaces. YARN, the other major advance in Hadoop 2, brings significant performance improvements for some applications, supports additional processing models, and implements a more flexible execution engine. YARN is a resource manager that was created by separating the processing engine and the resource-management capabilities of MapReduce as it was implemented in Hadoop 1 [18]. YARN is often called the operating system of Hadoop because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing the high-availability features of Hadoop [10].

Conclusion

In this paper we proposed a method to improve the efficiency of MapReduce scheduling algorithms. It works better than existing MapReduce scheduling algorithms, requiring less computation while giving higher accuracy. We used the proposed k-means clustering algorithm together with the Self-Adaptive MapReduce (SAMR) algorithm. Although this technique works well, it can assign only one task to each DataNode. In future work, we plan to improve its efficiency by allocating more tasks to the DataNodes.
