Presentation

NVMe-Backed GNN Training on GPU Leveraging a Paged UVM Memory System
Description
Graph Neural Networks (GNNs) are powerful machine learning models that learn on graph data by extracting embeddings that represent vertex and edge features, as well as graph topology. As graph data grows in scale and GNN feature data generates high memory pressure, out-of-core training methods become necessary for many real-world graphs. Current state-of-the-art methods for large-graph GNN training leverage mini-batches, distributed or parallel environments, and memory-aware partitioning and sampling. These methods, however, require custom training architectures and pipelines. Here, we propose Kirin, a framework for large-graph out-of-core training on a single machine with a single GPU on pre-sampled graphs. Kirin leverages Dragon-direct, which provides NVMe-backed tensors for out-of-core training through driver-managed allocations. Building on UVM, Dragon-direct uses a page-based unified memory system, making memory management largely invisible to the user. We showcase Kirin and analyze its performance and effectiveness for GNN workloads.
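The abstract describes feeding mini-batch GNN training from NVMe-backed tensors, with a paged memory system faulting in only the data actually touched. The sketch below is a hypothetical CPU-side analogy of that access pattern using a memory-mapped NumPy array; Kirin and Dragon-direct operate at the GPU driver level via UVM, which is not reproduced here, and all names in the snippet are illustrative.

```python
# Hypothetical CPU-side analogy of out-of-core, page-backed feature access.
# (Kirin/Dragon-direct do this with driver-managed UVM pages on the GPU.)
import os
import tempfile

import numpy as np

# A disk-backed feature matrix standing in for an NVMe-backed tensor:
# 1M vertices x 64 float32 features (~256 MB on disk).
path = os.path.join(tempfile.mkdtemp(), "features.bin")
n_vertices, n_feats = 1_000_000, 64
feats = np.memmap(path, dtype=np.float32, mode="w+",
                  shape=(n_vertices, n_feats))
feats[:] = 0.5  # initialize; the OS flushes dirty pages to disk lazily

# Mini-batch gather: only the pages holding the sampled rows are faulted
# into RAM, mirroring how a paged UVM system migrates pages on demand.
batch_ids = np.random.default_rng(0).integers(0, n_vertices, size=1024)
batch = np.asarray(feats[batch_ids])  # materialize the mini-batch in memory
print(batch.shape)  # (1024, 64)
```

The appeal of the paged approach, as the abstract notes, is that training code indexes the tensor as if it were resident; paging, eviction, and NVMe traffic happen underneath rather than in a custom pipeline.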
Event Type
Workshop
Time
Sunday, 12 November 2023, 9:56am - 10:01am MST
Location
505
Tags
Accelerators
Edge Computing
Heterogeneous Computing
Registration Categories
W