Presentation

Chameleon: A Disaggregated CPU, GPU, and FPGA System for Retrieval-Augmented Language Models
Description
A Retrieval-Augmented Language Model (RALM) augments a generative language model by retrieving context-specific knowledge from an external database via vector search. This strategy achieves impressive text-generation quality even with smaller models, saving orders of magnitude of computational resources compared to large language models such as GPT-4. However, RALMs introduce significant system-design challenges due to the diverse workload characteristics of the different RALM components. In this presentation, we present Chameleon, a heterogeneous system that combines CPUs, GPUs, and FPGAs in a disaggregated manner for efficient RALM serving. While GPUs still handle the computationally intensive model inference, we design a distributed CPU-FPGA engine for large-scale vector search, which requires substantial memory capacity and rapid quantized-vector decoding: the CPU server manages the vector index, while FPGA-based disaggregated memory nodes scan database vectors using near-memory accelerators. Chameleon's vector search achieves 8.6-29.4x lower latency than CPU-based systems and 1.6-57.9x lower latency than GPU-based systems.
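The retrieval step the abstract describes can be illustrated with a minimal sketch: a brute-force cosine-similarity scan over a toy in-memory vector database. This is only an illustration of the RALM retrieval idea, not Chameleon's quantized, disaggregated CPU-FPGA engine; the vectors, passages, and function names below are all hypothetical.

```python
import math

# Toy in-memory "vector database": each entry pairs an embedding with a passage.
# The 3-d embeddings are hand-made; a real RALM would use a learned encoder
# and quantized vectors scanned by near-memory accelerators.
DATABASE = [
    ([1.0, 0.0, 0.0], "Chameleon disaggregates vector search across CPU and FPGA nodes."),
    ([0.0, 1.0, 0.0], "GPUs handle the computationally intensive model inference."),
    ([0.7, 0.7, 0.0], "Near-memory accelerators scan quantized database vectors."),
]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, k=1):
    """Return the k passages whose embeddings are most similar to the query."""
    ranked = sorted(DATABASE, key=lambda entry: cosine(query_vec, entry[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

# A query embedding close to the first database vector retrieves its passage;
# a RALM would prepend that passage to the language model's prompt.
context = retrieve([0.9, 0.1, 0.0], k=1)
```

At scale this brute-force scan is the memory- and bandwidth-bound component that Chameleon offloads to FPGA-based disaggregated memory nodes, while the CPU server handles index lookups.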
Event Type
Workshop
Time
Friday, 17 November 2023, 9:20am - 9:40am MST
Location
403-404
Tags
Architecture and Networks
Registration Categories
W