Optimizations for the Himeno Benchmark on Vector Computing System SX-Aurora TSUBASA
Time: Tuesday, June 23rd, 4:30pm - 4:35pm
Description: Vector processing has been widely adopted in many applications to achieve high sustained performance. This poster clarifies the performance of the vector computing system SX-Aurora TSUBASA by optimizing the Himeno benchmark. SX-Aurora TSUBASA consists of Vector Hosts (VHs) and Vector Engines (VEs). A VH is an x86 processor and is responsible for OS-related tasks such as system calls. A VE has a high-performance vector processor with High Bandwidth Memory (HBM2) modules, which provide high memory bandwidth, and is responsible for executing the main parts of applications at high sustained performance. The Himeno benchmark solves Poisson's equation using 19-point stencil calculations. This poster applies three optimizations to it. First, highly reusable data are allocated to the last-level cache (LLC) on a priority basis to utilize the LLC effectively. Next, loop unrolling is applied to reduce the overhead of branch conditions. Finally, the domain decomposition for MPI execution is tuned to achieve a long vector length and a high LLC hit ratio. With all of the above optimizations applied, the performance of a single node reaches 329.4 GFLOPS, which is 7.7% of the peak performance. This performance is superior to those of CPUs, a GPU, and KNL. Moreover, SX-Aurora TSUBASA achieves good scalability in multi-node execution: the parallelization efficiency with 8 VEs reaches 76%. These results clarify the high potential of vector processing on SX-Aurora TSUBASA.
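For readers unfamiliar with the kernel, the 19-point stencil update can be sketched roughly as follows. This is a simplified illustration, not the benchmark's source and not the poster's optimized code: the grid size `N`, the function names `jacobi_sweep` and `fill`, and the constant coefficients `a`, `b`, `c`, and `diag` are assumptions made here for brevity (the benchmark itself reads per-point coefficient arrays). The innermost loop runs over the contiguous index `k`, which is the loop a vectorizing compiler would normally target and whose length relates to the vector length discussed above.

```c
#include <assert.h>

#define N 8  /* hypothetical grid edge length, kept small for illustration */

/* Fill every grid point with the same value. */
void fill(float g[N][N][N], float v)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                g[i][j][k] = v;
}

/* One Jacobi sweep of a simplified 19-point stencil over the interior of p.
 * Writes the relaxed values to out and returns the sum of squared residuals.
 * Coefficients are constants here (an assumption for this sketch). */
float jacobi_sweep(float p[N][N][N], float out[N][N][N], float omega)
{
    assert(omega > 0.0f);

    const float a = 1.0f;           /* weight of the +1 axis neighbours   */
    const float b = 0.0f;           /* weight of the 12 diagonal points   */
    const float c = 1.0f;           /* weight of the -1 axis neighbours   */
    const float diag = 1.0f / 6.0f; /* inverse of the centre coefficient  */
    float gosa = 0.0f;              /* accumulated squared residual       */

    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            /* Innermost loop over the contiguous index k. */
            for (int k = 1; k < N - 1; k++) {
                float s0 =
                    a * (p[i+1][j][k] + p[i][j+1][k] + p[i][j][k+1])
                  + b * (p[i+1][j+1][k] - p[i+1][j-1][k]
                       - p[i-1][j+1][k] + p[i-1][j-1][k])
                  + b * (p[i][j+1][k+1] - p[i][j-1][k+1]
                       - p[i][j+1][k-1] + p[i][j-1][k-1])
                  + b * (p[i+1][j][k+1] - p[i-1][j][k+1]
                       - p[i+1][j][k-1] + p[i-1][j][k-1])
                  + c * (p[i-1][j][k] + p[i][j-1][k] + p[i][j][k-1]);

                float ss = s0 * diag - p[i][j][k]; /* local residual */
                gosa += ss * ss;
                out[i][j][k] = p[i][j][k] + omega * ss;
            }
        }
    }
    return gosa;
}
```

Each update reads the centre point, its 6 axis neighbours, and 12 in-plane diagonal neighbours, which gives the 19 points. A caller would typically allocate two grids, call `jacobi_sweep` repeatedly while swapping the input and output grids, and stop once the returned residual is small enough.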