Presentation
LLAMP: Assessing Network Latency Tolerance of HPC Applications with Linear Programming
Session: Performance Modeling
Description: The shift towards high-bandwidth networks driven by AI workloads in data centers and HPC clusters has unintentionally aggravated network latency, adversely affecting the performance of communication-intensive HPC applications. Given the significant difference in network latency tolerance among MPI applications, accurately determining an application's latency resilience is crucial. Traditional methods for assessing this metric, relying on specialized hardware or simulators, tend to be inflexible and time-consuming. In response, we introduce LLAMP, a novel toolchain utilizing the LogGPS model and linear programming to analytically evaluate HPC applications' network latency tolerance. LLAMP equips software developers and network architects with essential insights for optimizing HPC infrastructures and strategically deploying applications to minimize latency impacts. We validate our toolchain across various applications, such as MILC, LULESH, and LAMMPS. Additionally, we include a case study with the ICON climate model, underscoring LLAMP's utility in improving the design and optimization of future HPC systems and applications.


