E.T.: Re-Thinking Self-Attention for Transformer Models on GPUs
Description
Transformer-based deep learning models have become a ubiquitous vehicle for pushing a variety of natural language processing (NLP)-related tasks beyond their accuracy ceiling. These models, however, suffer from two pronounced challenges: gigantic model size and prolonged turnaround time. To this end, we introduce E.T., which re-thinks self-attention computation for transformer models on GPUs with the following contributions. First, we introduce a novel self-attention architecture, which encompasses two tailored self-attention operators with corresponding sequence-length-aware optimizations, as well as operation-reordering optimizations. Second, we achieve tensor-core-aware weight pruning by revamping existing pruning algorithms and designing new ones for transformers. This work goes further by introducing an attention-aware adaptive pruning design. Taken together, we evaluate E.T. across a variety of benchmarks for Transformer, BERT-Base and DistilBERT, where E.T. outperforms mainstream projects, including the popular NVIDIA enterprise solutions TensorRT and FasterTransformer.
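For readers unfamiliar with the computation the abstract refers to, the following is a minimal sketch of standard scaled dot-product self-attention in plain Python. It is the textbook baseline only, not E.T.'s tailored operators, and it contains no GPU, tensor core, or pruning logic. The helper names (`matmul`, `softmax`, `self_attention`) and the tiny inputs are illustrative assumptions, not taken from the paper.

```python
import math

def matmul(a, b):
    # Dense matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    # Subtract the row maximum for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len x d_model); w_q, w_k, w_v: (d_model x d_k) projections.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    # Attention scores: Q K^T / sqrt(d_k), then a row-wise softmax.
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted average of the value rows.
    return matmul(weights, v), weights

# Hypothetical toy input: three tokens, model width 2, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, attn = self_attention(tokens, identity, identity, identity)
```

The score matrix has one entry per pair of tokens, so its cost grows quadratically with sequence length, which is one reason sequence-length-aware optimization is a natural target.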
Event Type
Time
Tue, 16 Nov, 1:30pm - 2pm CST
Registration Categories
Machine Learning and Artificial Intelligence