Presentation

Chameleon: A Disaggregated CPU, GPU, and FPGA System for Retrieval-Augmented Language Models
Description
A Retrieval-Augmented Language Model (RALM) augments a generative language model by retrieving context-specific knowledge from an external database via vector search. This strategy achieves impressive text-generation quality even with smaller models, saving orders of magnitude of computational resources compared to large language models such as GPT-4. However, RALMs introduce significant system-design challenges due to the diverse workload characteristics of the different RALM components. In this presentation, we present Chameleon, a heterogeneous system that combines CPUs, GPUs, and FPGAs in a disaggregated manner for efficient RALM serving. While GPUs still handle the computationally intensive model inference, we design a distributed CPU-FPGA engine for large-scale vector search, which requires substantial memory capacity and rapid quantized-vector decoding: the CPU server manages the vector index, while FPGA-based disaggregated memory nodes scan database vectors using near-memory accelerators. Chameleon's vector search achieves 8.6-29.4x lower latency than CPU-based systems and 1.6-57.9x lower latency than GPU-based systems.
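The retrieval step the abstract describes can be illustrated with a minimal sketch: a brute-force cosine-similarity scan over a toy in-memory vector database. This is only an illustration of the RALM retrieval idea, not Chameleon's quantized, disaggregated CPU-FPGA engine; the vectors, passages, and function names below are all hypothetical.

```python
import math

# Toy in-memory "vector database": each entry pairs an embedding with a passage.
# The 3-d embeddings are hand-made; a real RALM would use a learned encoder
# and quantized vectors scanned by near-memory accelerators.
DATABASE = [
    ([1.0, 0.0, 0.0], "Chameleon disaggregates vector search across CPU and FPGA nodes."),
    ([0.0, 1.0, 0.0], "GPUs handle the computationally intensive model inference."),
    ([0.7, 0.7, 0.0], "Near-memory accelerators scan quantized database vectors."),
]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, k=1):
    """Return the k passages whose embeddings are most similar to the query."""
    ranked = sorted(DATABASE, key=lambda entry: cosine(query_vec, entry[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

# A query embedding close to the first database vector retrieves its passage;
# a RALM would prepend that passage to the language model's prompt.
context = retrieve([0.9, 0.1, 0.0], k=1)
```

At scale this brute-force scan is the memory- and bandwidth-bound component that Chameleon offloads to FPGA-based disaggregated memory nodes, while the CPU server handles index lookups.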
Event Type
Workshop
Time
Friday, 17 November 2023, 9:20am - 9:40am MST
Location
403-404
Tags
Architecture and Networks
Registration Categories
W