Presentation



A Reinforcement Learning-Based Backfilling Strategy for HPC Batch Jobs

Session: PMBS23: The 14th International Workshop on Performance Modeling, Benchmarking, and Simulation of High-Performance Computer Systems

Description: HPC systems employ a scheduling technique called “backfilling”, in which low-priority jobs are scheduled earlier to use available resources that would otherwise sit idle while waiting for pending high-priority jobs. Backfilling relies on job runtime to calculate the start time of the ready-to-schedule jobs and avoid delaying them. It is a common belief that better estimates of job runtime lead to better backfilling and more effective scheduling. However, our experiments show a different conclusion: there is a missing trade-off between prediction accuracy and backfilling opportunities. To learn how to achieve the best trade-off, we believe reinforcement learning (RL) can be effectively leveraged. Based on this idea, we designed RLBackfilling, a reinforcement learning-based backfilling algorithm. Our evaluation results show up to 17x better scheduling performance compared with EASY backfilling using user-provided job runtime, and 4.7x better performance compared with EASY using the ideal predicted job runtime (the actual job runtime).
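
For readers unfamiliar with the EASY baseline mentioned above, the following is a minimal Python sketch of one EASY backfilling pass. It is not the authors' code and does not implement RLBackfilling; the names `Job` and `easy_backfill`, the node-count resource model, and the input format are assumptions made for illustration. It shows how the runtime estimate enters the decision: a lower-priority job may start now only if it ends before the head job's reserved start (the "shadow time") or fits in the nodes left over once the head job starts.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int      # nodes requested
    runtime: float  # estimated runtime (user-provided or predicted)


def easy_backfill(now, free_nodes, running, queue):
    """Return the names of queued jobs that may start at time `now`.

    running: list of (estimated_end_time, nodes) for executing jobs.
    queue:   list of Job in priority order; queue[0] is the head job.
    """
    started = []
    queue = list(queue)

    # Start jobs in priority order for as long as they fit.
    while queue and queue[0].nodes <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job.nodes
        running = running + [(now + job.runtime, job.nodes)]
        started.append(job.name)
    if not queue:
        return started

    # Reserve a start for the blocked head job: walk the running jobs in
    # order of estimated end time until enough nodes have been released.
    head = queue[0]
    avail = free_nodes
    shadow_time = None
    for end, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            shadow_time = end
            break
    if shadow_time is None:
        return started  # head job can never fit; this sketch stops here
    extra = avail - head.nodes  # nodes still unused once the head job starts

    # Backfill lower-priority jobs that cannot delay the head job.
    for job in queue[1:]:
        if job.nodes > free_nodes:
            continue
        if now + job.runtime <= shadow_time:
            # Estimated to finish before the head job's reserved start.
            free_nodes -= job.nodes
            started.append(job.name)
        elif job.nodes <= extra:
            # Runs past the shadow time but only on leftover nodes.
            free_nodes -= job.nodes
            extra -= job.nodes
            started.append(job.name)
    return started
```

In this sketch, with 4 free nodes, one running 6-node job expected to end at t=100, and a blocked 8-node head job, a 4-node job estimated at 200 time units is skipped, while the same job estimated at 90 time units is backfilled. The decision depends entirely on the estimate supplied, which is the sensitivity the abstract examines.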

Author/Presenters

Elliot Kolker-Hicks

University of North Carolina at Charlotte