Presentation


Mirage: Toward Low-interruption Services on Batch GPU Clusters with Reinforcement Learning

Description

Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers such as Slurm. Because each job has a fixed wall-clock time limit, DL researchers usually need to run a sequence of batch jobs and experience long interruptions between them on overloaded machines. Such interruptions significantly lower research productivity and degrade the quality of service (QoS) of services deployed in production. To mitigate these interruptions, we explore a set of machine learning and reinforcement learning techniques to design a proactive provisioner. We examine the method using production job traces from three GPU clusters and validate its effectiveness and generality on the validation trace of each cluster. Our experiments show that the proposed resource provisioner safeguards 23%-76% of jobs with zero interruption across varying load levels on the three clusters.
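To make the notion of interruption concrete, the following is a minimal sketch under a simplified, hypothetical model; it is not the authors' provisioner. It assumes a long-running workload is split into a chain of batch jobs, that each successor job waits some time in the queue, and that a successor may be submitted a fixed `lead_time` before its predecessor reaches the wall-clock limit. The function names, the `lead_time` parameter, and the example wait times are all illustrative assumptions.

```python
def interruption(queue_wait: float, lead_time: float = 0.0) -> float:
    """Idle gap (hours) between one job ending and its successor starting.

    queue_wait: hours the successor spends queued after submission.
    lead_time:  hours before the predecessor's end at which the
                successor is submitted (0.0 means reactive submission).
    """
    # If the successor clears the queue before the predecessor ends,
    # there is no gap; this toy model ignores any overlap cost.
    return max(0.0, queue_wait - lead_time)


def total_interruption(queue_waits, lead_time: float = 0.0) -> float:
    """Sum of idle gaps across a chain of successor jobs."""
    return sum(interruption(w, lead_time) for w in queue_waits)


# Hypothetical queue waits (hours) for three successor jobs.
waits = [6.0, 2.0, 10.0]

reactive = total_interruption(waits)                   # submit when the job ends
proactive = total_interruption(waits, lead_time=4.0)   # submit 4 hours early

print(reactive, proactive)  # 18.0 8.0
```

A fixed lead time only removes gaps whose queue wait is shorter than the lead; choosing when to submit under varying load is the decision the description assigns to the learned provisioner.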

Authors

Qiyang Ding

University of Texas

Pengfei Zheng

University of Wisconsin

Shreyas Kudari

University of Texas

Shivaram Venkataraman

University of Wisconsin