Presentation

ZeroSum: User Space Monitoring of Resource Utilization and Contention on Heterogeneous HPC Systems
Description
Heterogeneous High Performance Computing (HPC) systems are highly specialized, complex, powerful, and expensive. Using them efficiently requires monitoring tools that confirm users have configured their jobs, workflows, and applications correctly to consume the limited allocations they have been awarded. Historically, system monitoring tools have been designed for, and available only to, system administrators and facilities personnel, who use them to ensure that the system is healthy, utilized, and operating within acceptable parameters. However, there is demand for user space monitoring capabilities that address the configuration validation and optimization problem. We describe a prototype tool, ZeroSum, designed to provide user space monitoring of application processes, lightweight processes (threads), and hardware resources on heterogeneous, distributed HPC systems. ZeroSum is designed to be used either as a limited-use porting tool or as an always-on monitoring library.
Event Type
Workshop
Time
Sunday, 12 November 2023, 10:30am - 10:50am MST
Location
605
Tags
Programming Frameworks and System Software
Registration Categories
W