Architecture-Tailored Parallelization for Accessible Large Model Era

Talk
Xupeng Miao
Time: 04.01.2024, 13:00 to 14:00

In this talk, I will introduce my work on machine learning (ML) parallelization, a critical endeavor to bridge the significant gap between diverse ML programs and multi-tiered computing architectures. Specifically, I will explore ML parallelization at three distinct yet interconnected levels. First, I will show that by exploiting the previously unexplored space of model partitioning strategies, distributed ML training can be made up to 20x faster than existing systems through improved communication efficiency. I will highlight two distributed ML systems that embody this idea: HET for sparse embedding models and Galvatron for dense Transformer models. Second, I will discuss how ML parallelization can improve GPU utilization. I will present SpecInfer, a system that reduces large language model (LLM) serving latency by 1.5-3.5x compared to existing systems by leveraging a novel tree-based speculative inference and verification mechanism. Third, I will demonstrate how ML parallelization broadens access to LLMs by extending its reach to inter-cloud environments. I will describe SpotServe, the first LLM serving system on spot instances, which handles preemptions through dynamic reparallelization, maintains relatively low tail latency, and reduces monetary cost by 54%. Finally, I will conclude with a discussion of how I plan to push this research toward a holistic, unified infrastructure for democratizing ML.
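
For readers unfamiliar with tree-based speculative inference, the sketch below illustrates the general idea in a deliberately simplified form: small draft models propose a tree of candidate token sequences, and the large target model verifies them, accepting the longest branch that agrees with its own predictions so that several tokens can be committed per verification step. This is only a greedy-matching toy with a stand-in target model; the names (TreeNode, verify_token_tree, toy_target) are hypothetical and it is not SpecInfer's actual tree-parallel verification algorithm.

# Minimal, self-contained sketch of tree-based speculative verification.
# Illustrative simplification, not the SpecInfer implementation: the
# "target model" is a stand-in callable, and verification uses simple
# greedy matching over a token tree proposed by small draft models.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TreeNode:
    token: int                      # speculated token at this node
    children: List["TreeNode"] = field(default_factory=list)

def verify_token_tree(
    prefix: List[int],
    roots: List[TreeNode],
    target_next_token: Callable[[List[int]], int],
) -> List[int]:
    """Walk the speculated token tree, keeping the longest branch whose
    tokens agree with the target model's greedy predictions. Returns the
    accepted tokens plus one token produced by the target model itself,
    so at least one new token is generated per verification step."""
    accepted: List[int] = []
    frontier = roots
    while True:
        expected = target_next_token(prefix + accepted)
        match = next((n for n in frontier if n.token == expected), None)
        if match is None:
            # No speculated branch matches; emit the target model's token.
            accepted.append(expected)
            return accepted
        accepted.append(match.token)
        frontier = match.children
        if not frontier:
            # Ran out of speculation; append one more target-model token.
            accepted.append(target_next_token(prefix + accepted))
            return accepted

# Toy usage with a deterministic stand-in for the target model.
if __name__ == "__main__":
    def toy_target(tokens: List[int]) -> int:
        return (sum(tokens) + 1) % 7   # hypothetical "model"

    tree = [TreeNode(1, [TreeNode(3), TreeNode(2, [TreeNode(5)])]),
            TreeNode(4)]
    print(verify_token_tree([0], tree, toy_target))

In the real system, verifying a whole token tree against the target model in a single batched pass is what lets the LLM amortize one expensive forward pass over many speculated tokens, which is the source of the reported latency reduction.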