Close

Presentation

This content is available for: Tech Program Reg Pass. Upgrade Registration
Prodigy: Toward Unsupervised Anomaly Detection in Production HPC Systems
DescriptionPerformance variations caused by anomalies in modern High Performance Computing (HPC) systems lead to decreased efficiency, impaired application performance, and increased operational costs. While machine learning (ML)-based frameworks for automated anomaly detection (often based on time series telemetry data) are gaining popularity in the literature, practical deployment challenges are often overlooked. Some ML-based frameworks require extensive customization, while others need a rich set of labeled samples, none of which are feasible for a production HPC system.

This paper introduces a variational autoencoder-based anomaly detection framework, Prodigy, that outperforms the state-of-the-art alternatives by achieving a 0.95 F1-score when detecting performance anomalies. The paper also provides a real system implementation of Prodigy that enables easy integration with monitoring frameworks and rapid deployment. We deploy Prodigy on a production HPC system and demonstrate 88% accuracy in detecting anomalies. Prodigy involves an interface to provide job- and node-level analysis and explanations for anomaly predictions.
Event Type
Paper
TimeTuesday, 14 November 20232pm - 2:30pm MST
Location401-402
Tags
Architecture and Networks
Performance Measurement, Modeling, and Tools
Resource Management
Registration Categories
TP
Reproducibility Badges