BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250316T231111Z
LOCATION:B308
DTSTART;TZID=America/New_York:20241121T133000
DTEND;TZID=America/New_York:20241121T140000
UID:submissions.supercomputing.org_SC24_sess395_pap215@linklings.com
SUMMARY:Optimizing Distributed ML Communication with Fused Computation-Col
 lective Operations
DESCRIPTION:Paper\n\nKishore Punniyamurthy, Khaled Hamidouche, and Bradfor
 d Beckmann (AMD Research)\n\nMachine learning models are distributed acros
 s multiple nodes using numerous parallelism strategies. The resulting coll
 ective communication is often on the critical path due to a lack of indepe
 ndent coarse-grain computation kernels available to execute.\n\nIn this wo
 rk, we propose fusing computation with its subsequent collective communica
 tion and leverage GPUs' massive parallelism, along with GPU-initiated comm
 unication, to overlap communication and computation. Specifically, thread-
 blocks/workgroups (WGs) immediately communicate their results to remote GP
 Us after completing their computation, while other WGs within th
 e same kerne
 l perform computation. We developed three prototype fused operators (embed
 ding+All-to-All, GEMV+AllReduce, and GEMM+All-to-All) to address the commu
 nication overheads in DLRM, Transformers and MoE model architectures. We e
 xpose the fused kernels as new PyTorch operators and extend the Triton fra
 mework to demonstrate their practicality. Our evaluations show that our ap
 proach effectively overlaps communication with computation, reducing thei
 r combined execution time by 12% to 31% across all three operators.\n\nTag
 : Artificial Intelligence/Machine L
 earning, Distributed Computing, Heterogeneous Computing, Performance Optim
 ization\n\nRegistration Category: Tech Program Reg Pass\n\nSession Chair: 
 Nikoli Dryden (Lawrence Livermore National Laboratory (LLNL))
END:VEVENT
END:VCALENDAR
