Scalable Data Warehouse for the Operations Monitoring and Notification Infrastructure at NERSC
Description: The Operations Monitoring and Notification Infrastructure (OMNI) is a data collection system and warehouse that collects heterogeneous operational data at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory, helping the center provide highly available high-performance computing (HPC) resources to its science users. OMNI assists NERSC in monitoring the health of most areas of the facility that operate on a 24/7 basis. The data gives the team a holistic view of the HPC data center, including building management system data, sensor data, and computer syslog data. The data is used to plan, procure, build, and remodel next-generation systems.

With the delivery of the new HPC systems, and because of the scale of the new machine, the available data rate is expected to be 100 to 1000 times higher. Exascale volumes of data are therefore anticipated to be sent to OMNI. To support the collection of more data, the team developed and instrumented a scalable, integrated strategy for automating network data collection to accommodate OMNI’s growth.

Using OMNI, the operations team is able to use real-time data to keep the HPC system highly available. The data has been used to lower costs, save hardware, assist with business decisions, and influence collaborations. The OMNI system has collected and stored years of data that can be used as training datasets, eventually enabling machine learning and automated optimization.
