PhD Defense: Minimizing Resource Consumption through Workload Consolidation in Large-Scale Distributed Data Platforms

Talk
Ashwin Kumar Kayyoor
Time: 11.03.2014 11:00 to 13:00
Location: AVW 3450

The rapid increase in the data volumes encountered in many application domains has led to widespread adoption of parallel and distributed data management systems like parallel databases and MapReduce-based frameworks (e.g., Hadoop) in recent years. Use of such parallel and distributed frameworks is expected to accelerate in the coming years, putting further strain on already-scarce resources like compute power, network bandwidth, and energy. To reduce total execution times, there is a trend towards increasing execution parallelism by spreading out data across a large number of machines. However, this often significantly increases total resource consumption, especially energy consumption, because of process startup costs and other overheads (e.g., communication overheads). In this dissertation, we develop several data management techniques to minimize resource consumption through workload consolidation.
We introduce a key metric called 'query span', i.e., the number of machines involved in the execution of a query or a job. To minimize per-query resource consumption, we propose to minimize query span. To that end, we develop several workload-driven data partitioning and replica selection algorithms that attempt to minimize the average query span by exploiting the fact that most distributed environments need to use replication for fault tolerance. Extensive experiments on various datasets show that judicious data placement and replication can dramatically reduce average query spans, resulting in significant reductions in resource consumption. We show our results primarily on two applications: a distributed data warehouse system and distributed information retrieval. In the first case, we show that minimizing average query spans can minimize overall resource consumption for a given workload and can also improve the performance of complex analytical queries. In the second case, our approach minimizes the overall search cost and effectively trades off search cost against load imbalance.
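To make the query span metric concrete, here is a minimal Python sketch. It is not the dissertation's algorithm: it assumes replica selection can be approximated with the standard greedy set-cover heuristic, and the names `placement`, `workload`, `select_machines`, and `average_span` are hypothetical. Given a placement in which replicated items appear on several machines, it picks a small set of machines that covers everything a query reads and reports the average span over a workload.

```python
from typing import Dict, List, Set


def select_machines(query_items: Set[str], placement: Dict[str, Set[str]]) -> List[str]:
    """Greedily pick machines until every item the query reads is covered.

    placement maps a machine name to the data items stored on it; an item
    that is replicated appears under more than one machine. Greedy set
    cover is a heuristic, so the result is small but not always minimal.
    """
    remaining = set(query_items)
    chosen: List[str] = []
    while remaining:
        # Machine covering the most uncovered items; ties go to the
        # alphabetically first name so the result is deterministic.
        best = max(sorted(placement), key=lambda m: len(placement[m] & remaining))
        covered = placement[best] & remaining
        if not covered:
            raise ValueError(f"items not stored on any machine: {sorted(remaining)}")
        chosen.append(best)
        remaining -= covered
    return chosen


def average_span(workload: List[Set[str]], placement: Dict[str, Set[str]]) -> float:
    """Average number of machines touched per query (the average query span)."""
    return sum(len(select_machines(q, placement)) for q in workload) / len(workload)


# Hypothetical placement: items "a", "b", and "c" are replicated.
placement = {
    "m1": {"a", "b"},
    "m2": {"c", "d"},
    "m3": {"b", "c"},
    "m4": {"a", "b", "c"},
}
workload = [{"a", "b", "c"}, {"a", "d"}]
print(average_span(workload, placement))
```

In this toy placement the first query can be answered entirely from `m4` (span 1), while the second needs two machines because no single machine holds both `a` and `d`. A workload-driven placement would try to co-locate items that are frequently read together so that more queries fall into the first category.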
The best case of resource efficiency for any underlying data processing system is achieved when the job or the query can be run efficiently on a single machine (i.e., query span = 1). In the final part of the dissertation, we discuss an in-memory MapReduce system optimized for performing complex analytics tasks on input data sizes that fit in a single machine's memory.
We argue that systems like Hadoop, which are designed to operate across a large number of machines, do not perform optimally on small- and medium-sized complex analytics tasks because of high startup costs, heavy disk activity, and wasteful checkpointing. We have developed a prototype runtime called Hone that is API-compatible with standard (distributed) Hadoop. In other words, we can take existing Hadoop code and run it, without modification, on a multi-core shared-memory machine. This allows us to take existing Hadoop algorithms and find the most suitable runtime environment for execution on datasets of varying sizes.
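The idea of running a MapReduce job entirely in one machine's memory can be illustrated with a short Python sketch. This is not Hone and does not use the Hadoop API (Hone is described above as compatible with existing Hadoop code); the function names `run_in_memory`, `wc_mapper`, and `wc_reducer` are hypothetical. It only shows that the map, shuffle, and reduce phases can run in a single process with no job startup, no intermediate files on disk, and no checkpointing.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def run_in_memory(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run map, shuffle, and reduce in one process, keeping all data in memory."""
    groups: Dict[str, List[int]] = defaultdict(list)
    # Map phase, with the shuffle folded in: group emitted values by key.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}


def wc_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every whitespace-separated word in the line."""
    for word in line.split():
        yield word.lower(), 1


def wc_reducer(word: str, counts: List[int]) -> int:
    """Sum the partial counts for one word."""
    return sum(counts)


counts = run_in_memory(["to be or not to be", "To err is human"], wc_mapper, wc_reducer)
print(counts["to"], counts["be"], counts["human"])
```

A real single-machine runtime would additionally parallelize the map and reduce phases across cores; this sketch stays sequential to keep the data flow visible.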
Overall, this dissertation focuses on resource inefficiency in scale-out architectures and demonstrates that workload consolidation can minimize overall resource consumption in these architectures. For multiple machines, we show that workload-aware data placement and replication can reduce resource consumption drastically. We also note that the underlying system achieves its highest resource efficiency when a job is executed efficiently on a single machine, and we propose several techniques to scale down the execution of medium-sized jobs to a single machine. Our thorough experiments on real and synthetic datasets demonstrate the efficacy of our proposed approaches.
Examining Committee:
Committee Chair: Dr. Amol Deshpande
Dean's Representative: Dr. Richard Marciano
Committee Members: Dr. Jimmy Lin
Dr. Alan Sussman
Dr. Atif Memon