PhD Proposal: Reducing the Cost of Big Data Analytics in the Cloud

Talk
Abdul Quamar
Time: 05.01.2014 10:00 to 11:30
Location: AVW 4172

Massive amounts of data are being generated and stored in the cloud due to economies of scale. This includes data generated by a variety of sources such as information networks, social networks, communication, IP traffic and messaging networks, mobile sensors, financial systems, and many others. This data is generated at high volume, velocity, and variety, depending on its sources. There is increasing interest in complex analytics to derive value from this big data. However, doing so in the pay-as-you-go environment of the cloud can be very expensive, as the cost of analytics grows linearly with time and with the resources required, which in turn depend on the size of the data and the complexity of the analytics performed. Reducing the cost of data analytics in the cloud thus remains a primary challenge. In this dissertation research, I plan to develop techniques and build cost-effective systems for complex analytics over big data in the cloud.
In the first part of the proposal, I focus on progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has traditionally been used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing size (progressive samples) using domain-specific sampling strategies for exploratory querying. This provides data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose Now!, a new progressive data-parallel computation framework that provides support for progressive analytics over big data. In particular, Now! enables progressive relational (SQL) query support in the cloud using unique progress semantics that (1) allow users to communicate progressive samples to the system; (2) allow efficient and deterministic query processing over samples, providing meaningful early results; and (3) provide repeatable semantics and provenance to data scientists. Now! delivers early results using significantly fewer resources, thereby substantially reducing the cost of such analytics.
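As a rough illustration of the idea of reusing work across progressive samples, the sketch below keeps a running aggregate as each larger sample arrives and reports an early result after every increment. It is not the Now! framework or its API; the function name `progressive_mean` and the sample data are hypothetical, and a simple running mean stands in for a real SQL query.

```python
from typing import Iterable, Iterator, List, Tuple


def progressive_mean(batches: Iterable[List[float]]) -> Iterator[Tuple[int, float]]:
    """Yield (rows_seen, running_mean) after each progressive sample increment."""
    count = 0
    total = 0.0
    for batch in batches:
        # Reuse the state built from earlier samples instead of recomputing
        # the aggregate from scratch for every larger sample.
        count += len(batch)
        total += sum(batch)
        if count:
            yield count, total / count


# Hypothetical progressive samples: each batch extends the previous sample.
samples = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
results = list(progressive_mean(samples))
```

Because the batches are consumed in a fixed order, rerunning the sketch on the same samples yields the same sequence of early results, which is the kind of repeatability the abstract describes.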
In the second part, I propose NSCALE, a system for reducing the cost of complex analytics on large-scale graph-structured data in the cloud. The system is based on the key observation that a wide range of complex analysis tasks over graph data process and reason about multi-hop local neighborhoods or subgraphs around a large number of nodes in the graph; examples include ego network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, anomaly detection, and analyzing influence cascades. These tasks are not well served by existing vertex-centric approaches, which do not scale to large graphs and incur a high cost of analytics in the cloud. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the distributed execution of these neighborhood-centric complex analysis tasks on subgraphs of interest over large-scale graphs, while utilizing novel techniques for extracting the relevant portions of the graph for analysis and for minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of data analytics in the cloud.
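To make the neighborhood-centric style concrete, the following minimal single-machine sketch extracts the k-hop neighborhood around each node of interest and runs a user-supplied function on that subgraph. It is not NSCALE's API or its extraction technique; the names `k_hop_neighborhood`, `run_on_neighborhoods`, and `edge_count`, as well as the toy graph, are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Iterable, Set

Graph = Dict[Hashable, Set[Hashable]]


def k_hop_neighborhood(graph: Graph, center: Hashable, k: int) -> Graph:
    """Return the subgraph induced by all nodes within k hops of center."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond k hops
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    nodes = set(dist)
    # Keep only the edges whose endpoints both lie inside the neighborhood.
    return {n: graph.get(n, set()) & nodes for n in nodes}


def run_on_neighborhoods(
    graph: Graph,
    centers: Iterable[Hashable],
    k: int,
    user_fn: Callable[[Graph], int],
) -> Dict[Hashable, int]:
    """Apply a user program to the k-hop subgraph around each chosen node."""
    return {c: user_fn(k_hop_neighborhood(graph, c, k)) for c in centers}


def edge_count(sub: Graph) -> int:
    """Example user program: count undirected edges in the subgraph."""
    return sum(len(nbrs) for nbrs in sub.values()) // 2


# Hypothetical undirected graph with edges a-b, a-c, b-c, c-d.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
counts = run_on_neighborhoods(graph, ["a", "d"], 1, edge_count)
```

The user program here sees a whole subgraph at once rather than a single vertex and its messages, which is the contrast with vertex-centric models drawn above.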
Finally, I conclude with the future work that I plan to pursue to extend the applicability of this line of research. As part of the proposed work, I intend to: (a) develop a query language for the declarative specification of complex subgraphs and provide support for their efficient extraction from the underlying graph-structured data; and (b) build a model for providing platform support for progressive analytics for neighborhood-centric complex iterative analytical tasks over graph-structured data.
Examining Committee:
Committee Chair: Dr. Amol Deshpande
Dept. Representative: Dr. Alan Sussman
Committee Members: Dr. Jimmy Lin