MapReduce is one of the most popular standard algorithms in distributed processing. This paper contributes to performance improvement in MapReduce, which otherwise performs poorly on social networking and web-based data because such workloads require iterative processing. The work adds iterative processing features to MapReduce, coining the new term iMapReduce. With plain MapReduce, users must chain several jobs and pay the cost of creating, scheduling, and destroying those jobs in every iteration, which incurs a performance penalty; iMapReduce eliminates this per-iteration job overhead, avoids shuffling unchanged data, and executes map tasks asynchronously. The authors claim the new framework performs 1.2 to 5 times better than MapReduce.
Serial jobs in MapReduce require the first job to finish before the next one starts, whereas iMapReduce allows a map task to start asynchronously as soon as its input data becomes available, without waiting for the preceding reduce cycle to complete. Map and reduce tasks are persistent, and reduce output is fed directly back to the map. This is accomplished by assigning the corresponding map and reduce input partitions to the same slave worker.
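The feedback loop described above can be sketched in a few lines of single-process Python. This is not the authors' code, only a toy illustration of the control flow: the reduce output of iteration i becomes the map input of iteration i + 1, with no job teardown and re-creation in between. All function and variable names here are hypothetical.

```python
def iterate(data, map_fn, reduce_fn, iterations):
    """Toy sketch of the iMapReduce-style feedback loop: reduce output of
    one iteration is handed straight back to map for the next iteration,
    instead of launching a fresh MapReduce job each time."""
    for _ in range(iterations):
        grouped = {}
        for key, value in data.items():
            for k, v in map_fn(key, value):        # map phase
                grouped.setdefault(k, []).append(v)  # group by key
        data = {k: reduce_fn(k, vs) for k, vs in grouped.items()}  # reduce
    return data

# Toy example: repeatedly halve a value (stands in for an iterative update).
result = iterate({"a": 8.0}, lambda k, v: [(k, v / 2)], lambda k, vs: sum(vs), 3)
# result == {"a": 1.0}
```

In the real framework the loop body runs on persistent distributed tasks, so the savings come from avoiding per-iteration job creation and scheduling, not from the loop itself.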
Further, iMapReduce migrates tasks periodically for load balancing. To keep paired map and reduce tasks on the same processor, they are migrated together along with their state and static data. For fault tolerance, iMapReduce keeps data in the local file system while making it accessible in a distributed file system (DFS) manner, so that after a failure it resumes from the last completed iteration instead of starting afresh.
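The iteration-granularity recovery idea can be sketched as follows. This is a hypothetical single-process illustration, not the paper's mechanism: state is persisted after each completed iteration, and a restart picks up from the last checkpoint rather than from scratch. The file name and helper names are made up for the example.

```python
import json
import os
import tempfile

def run_with_checkpoints(state, step_fn, iterations, ckpt_dir):
    """Toy sketch of iteration-level fault tolerance: persist the state
    after each completed iteration so a restart resumes from the last
    checkpoint instead of repeating all earlier iterations."""
    path = os.path.join(ckpt_dir, "checkpoint.json")
    start = 0
    if os.path.exists(path):  # recovering after a failure/restart
        with open(path) as f:
            saved = json.load(f)
        start, state = saved["iteration"], saved["state"]
    for i in range(start, iterations):
        state = step_fn(state)
        with open(path, "w") as f:  # record the completed iteration
            json.dump({"iteration": i + 1, "state": state}, f)
    return state

# Toy run: count up, "crash" after 3 iterations, then resume to 5
# without redoing the first 3 iterations.
with tempfile.TemporaryDirectory() as d:
    partial = run_with_checkpoints(0, lambda s: s + 1, 3, d)
    final = run_with_checkpoints(0, lambda s: s + 1, 5, d)
# partial == 3, final == 5
```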
Performance is compared using the single-source shortest path (SSSP) and PageRank algorithms, on both a local commodity cluster and an Amazon Elastic Compute Cloud (EC2) cluster, using the Hadoop DFS (HDFS). For SSSP, a DBLP author cooperation graph is used: “each node represents an author and a link between two nodes represents the cooperation relationship between [them]”; “link weight is set according to the cooperation frequency of the two linked authors.” Similarly, a Facebook user interaction graph is created and evaluated, where interaction frequency is used to weight the friendship links. Two much larger synthetic graphs are generated using “the power-law parameters on the link weight and the node out-degree ... extracted from the two real graphs.” The PageRank data is prepared similarly: two real graphs, the Google web graph and the Berkeley-Stanford web graph, are used, and much larger synthetic graphs are generated from them.
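For context on why these benchmarks stress iteration, a minimal PageRank power-iteration sketch is shown below. This is the standard textbook formulation, not the paper's implementation; the tiny graph and parameter values are illustrative, and the sketch assumes every node has at least one out-link (no dangling-node handling).

```python
def pagerank(links, damping=0.85, iterations=20):
    """Minimal PageRank power iteration: every pass redistributes rank
    along out-links, which is why a plain MapReduce implementation needs
    one full job per iteration."""
    n = len(links)
    rank = {node: 1.0 / n for node in links}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in links}
        for node, outs in links.items():
            for dest in outs:  # assumes no dangling nodes (outs non-empty)
                new_rank[dest] += damping * rank[node] / len(outs)
        rank = new_rank
    return rank

# Tiny 3-node cycle: the symmetric structure yields equal ranks (~1/3 each).
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Each of the 20 passes here would be a separate MapReduce job in the baseline, which is exactly the per-iteration overhead iMapReduce targets.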
Overall, the paper demonstrates interesting and useful work. Since HDFS and MapReduce can be installed easily on Linux, the work is also potentially useful for computer science (CS) graduate projects.