Lightning Talk: Toward Efficient Asynchronous Checkpointing for Large-Language Models
Description
Large language models (LLMs) have been rapidly and widely adopted across research, academia, and enterprises for endeavors ranging from scientific and educational pursuits to financial and legal assistance. Unsurprisingly, training such sophisticated LLMs requires large-scale infrastructure, typically consisting of accelerators such as GPUs, and spans multiple months depending on the size of the model and the training data. Unfortunately, GPU memory is in the range of tens of GBs and cannot hold multi-billion-parameter models, which are typically hundreds of GBs in size. Therefore, a combination of data, model, and tensor parallelism is applied to enable training such LLMs, sharding and distributing the model and its associated states across different GPUs.

In this context, there is a frequent need to roll back the training of such LLMs to past stable states. This can happen for various reasons: failure of components when running at scale, the need to fine-tune the model and try a different training direction, the need to inspect the evolution of the training to understand how it converges, etc. To this end, state-of-the-art LLM training runtimes such as DeepSpeed and PyTorch use synchronous or partially synchronous checkpointing strategies, which lead to runtime overheads of up to 41% due to I/O bottlenecks.

In this talk, we discuss the challenges of adopting existing multi-level checkpointing libraries for distributed LLM training and present novel strategies to perform efficient asynchronous multi-level checkpointing of distributed LLMs that minimizes checkpointing overheads. In particular, our approach is driven by key design ideas such as (1) blocking training only when attempting to overwrite unflushed tensors; (2) using pinned host memory for faster device-to-host transfers via the GPU copy engines; (3) consistently capturing and serializing model states distributed across device and host memory; and (4) selectively flushing checkpoints to minimize storage and I/O bottlenecks.
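To illustrate how design ideas (1), (2), and (4) could fit together, the sketch below shows one possible shape of asynchronous checkpointing with pinned host buffers in PyTorch. The function names (snapshot_async, flush) and the single-process, single-model setup are assumptions made for this example, not the interface of the approach presented in the talk.

# Minimal sketch, assuming PyTorch on a CUDA device; not the talk's actual implementation.
import threading
import torch

def snapshot_async(model, step, copy_stream, out_dir="ckpt"):
    # Snapshot GPU-resident model state into pinned host buffers on a side
    # CUDA stream, then flush to disk on a background thread so the training
    # loop is not blocked on storage I/O.
    state = model.state_dict()
    # Pinned (page-locked) host memory lets the GPU copy engines perform the
    # device-to-host transfers asynchronously with respect to compute.
    host_copy = {k: torch.empty(v.shape, dtype=v.dtype, pin_memory=True)
                 for k, v in state.items()}

    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        for k, v in state.items():
            host_copy[k].copy_(v, non_blocking=True)
    copied = torch.cuda.Event()
    copied.record(copy_stream)

    def flush():
        copied.synchronize()  # wait only for the device-to-host copies
        torch.save(host_copy, f"{out_dir}/step_{step}.pt")

    worker = threading.Thread(target=flush, daemon=True)
    worker.start()
    # The caller waits on `copied` (or joins this thread) only before the
    # next update would overwrite tensors whose snapshot is still in flight.
    return worker

# Hypothetical usage inside a training loop:
#   copy_stream = torch.cuda.Stream()
#   pending = snapshot_async(model, step, copy_stream)
#   ...  # continue training
#   pending.join()  # only if the checkpoint must be durable before proceeding

The side stream plus pinned buffers plus background flush corresponds to ideas (1), (2), and (4); idea (3), consistently capturing states spread across device and host memory, would additionally require coordination across ranks in a distributed, sharded run.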
Event Type
Workshop
Time
Sunday, 12 November 2023, 4:50pm - 5pm MST
Location
710
Tags
Fault Handling and Tolerance
Registration Categories
W