Robert Dyer

Assistant Professor
School of Computing
University of Nebraska–Lincoln

Email · GitHub · LinkedIn · CV · Google Scholar

Maximizing Parallelization Opportunities by Automatically Inferring Optimal Container Memory for Asymmetrical Map Tasks

Published: August 1, 2016
A master's thesis at Bowling Green State University

The MapReduce programming model provides efficient parallel processing via two user-defined tasks: map and reduce. Input files are split into individual records, and each record is processed by a single map task. Implementations of the model, such as Apache Hadoop, typically assume each record is similarly sized and thus allocate the same memory for every map task. However, certain data sets, such as Wikipedia or the project data from SourceForge, do not follow this assumption and instead contain very asymmetrically sized records. To handle such data sets, the current state-of-the-art solution is to determine the memory required by the largest record and then configure the system to allocate that much memory for all map tasks. This simple solution, however, decreases the maximum number of tasks running in parallel, leading to lower overall CPU utilization on the cluster. In this work, we propose three solutions that maintain as much parallelism as possible while processing such asymmetric data sets.
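For concreteness, the state-of-the-art workaround amounts to sizing every map container for the single largest record. A minimal sketch, assuming Hadoop 2.x property names; the 8 GiB figure is illustrative, not from the thesis:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UniformMemoryJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // State-of-the-art workaround: size EVERY map container for the
        // single largest record (8 GiB here is illustrative). Most records
        // would fit in a far smaller container, so this over-allocation
        // reduces how many map tasks can run in parallel on the cluster.
        conf.setInt("mapreduce.map.memory.mb", 8192);
        conf.set("mapreduce.map.java.opts", "-Xmx7372m"); // JVM heap kept below the container limit

        Job job = Job.getInstance(conf, "asymmetric-records");
        // ... set mapper/reducer classes, input/output paths, and submit ...
    }
}
```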

Our first solution is a fall-back approach: initially, all map tasks run with the standard memory allocation. If a task attempt fails for memory reasons, each subsequent attempt of that task runs with twice the memory/heap of the previous attempt. This solution is fully automatic and requires no configuration of Hadoop. Our second solution lets users manually designate specific map tasks to receive more memory. Thus, if users know a priori which records are larger, they can avoid letting those map tasks run, fail, and retry as in the first approach. Our third solution automatically infers which map tasks have higher memory requirements by learning, from prior runs over the same input data, which tasks failed and had to fall back under our first solution. If multiple jobs are run on the same input, this solution avoids the need for users to manually specify the tasks requiring more memory. We evaluate these approaches by measuring performance on several data sets and comparing against the state-of-the-art solution. Our evaluation shows a 37–48% performance improvement over the state of the art, due to increased parallelism in the system, and with minimal overhead.
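As a rough illustration of the fall-back solution, the memory requested for a retry can be computed as base × 2^attempt, capped at the cluster's per-container ceiling. The sketch below is hypothetical (the thesis realizes this inside Hadoop's scheduling machinery, not as a user-facing class, and the class and method names here are invented for illustration):

```java
// Hypothetical sketch of the fall-back memory policy; not the thesis code.
public final class FallbackMemoryPolicy {
    private final int baseMemoryMb; // standard allocation given to attempt 0
    private final int maxMemoryMb;  // cluster's per-container ceiling

    public FallbackMemoryPolicy(int baseMemoryMb, int maxMemoryMb) {
        this.baseMemoryMb = baseMemoryMb;
        this.maxMemoryMb = maxMemoryMb;
    }

    /**
     * Memory (MB) to request for the given 0-based attempt of a map task.
     * Each retry after a memory-related failure doubles the previous
     * allocation: base * 2^attempt, capped at the cluster maximum.
     */
    public int memoryForAttempt(int attempt) {
        long mb = (long) baseMemoryMb << attempt;
        return (int) Math.min(mb, maxMemoryMb);
    }

    public static void main(String[] args) {
        FallbackMemoryPolicy policy = new FallbackMemoryPolicy(1024, 16384);
        // Attempts 0..4 request 1024, 2048, 4096, 8192, 16384 MB.
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                    + policy.memoryForAttempt(attempt) + " MB");
        }
    }
}
```

The third solution can reuse the same doubling policy: by recording which task indices failed for memory reasons in earlier runs over the same input, later jobs can start those tasks at the previously successful allocation instead of at the base amount.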




