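⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use this in serious academic settings. It is only suitable for initial screening before reading a paper.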
Updated 2025-01-02
KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management
Authors: Rongxin Cheng, Yifan Peng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen
The stateful nature of large language model (LLM) serving can easily throttle precious GPU memory under load burst or long-generation requests like chain-of-thought reasoning, causing latency spikes due to queuing incoming requests. However, state-of-the-art KVCache centric approaches handle load spikes by dropping, migrating, or swapping KVCache, which faces an essential tradeoff between the performance of ongoing vs. incoming requests and thus still severely violates SLO. This paper makes a key observation such that model parameters are independent of the requests and are replicated across GPUs, and thus proposes a parameter-centric approach by selectively dropping replicated parameters to leave precious memory for requests. However, LLM requires KVCache to be saved in bound with model parameters and thus dropping parameters can cause either huge computation waste or long network delay, affecting all ongoing requests. Based on the observation that attention operators can be decoupled from other operators, this paper further proposes a novel remote attention mechanism through pipeline parallelism so as to serve upcoming requests with the additional memory borrowed from parameters on remote GPUs. This paper further addresses several other challenges including lively exchanging KVCache with incomplete parameters, generating an appropriate plan that balances memory requirements with cooperative execution overhead, and seamlessly restoring parameters when the throttling has gone. Evaluations show that KunServe reduces the tail TTFT of requests under throttling by up to 27.3x compared to the state-of-the-art.
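- The stateful nature of large language model (LLM) serving can exhaust GPU memory under load bursts or long-generation requests, causing latency spikes.
- Existing KVCache-centric approaches face a tradeoff between the performance of ongoing requests and that of incoming requests.
- The paper proposes a parameter-centric approach that selectively drops replicated parameters to free memory for requests, although dropping parameters risks wasted computation or long network delays.
- Observing that attention operators can be decoupled from other operators, the paper proposes a remote attention mechanism built on pipeline parallelism.
- The paper also addresses challenges such as live KVCache exchange with incomplete parameters and generating a plan that balances memory requirements against cooperative execution overhead.
- KunServe reduces the tail TTFT of requests under throttling by up to 27.3x compared with the state of the art.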
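
The memory effect of dropping replicated parameters can be illustrated with simple arithmetic. The sketch below is not KunServe's planner. It assumes a hypothetical setup in which each GPU in a group of `group_size` replicas keeps an equal share of the layers and the group then cooperates as a pipeline, and it ignores activation memory and execution overhead. All function names and numbers are invented for illustration.

```python
from typing import Optional


def kv_capacity_gib(gpu_gib: float, param_gib: float, group_size: int) -> float:
    """Memory left for KVCache on one GPU after it keeps only
    1/group_size of the model parameters (assumed equal split)."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return gpu_gib - param_gib / group_size


def smallest_group_size(
    gpu_gib: float, param_gib: float, kv_demand_gib: float, max_group: int
) -> Optional[int]:
    """Return the smallest group size whose per-GPU KVCache capacity
    covers the demand, or None if no size up to max_group suffices.

    Preferring the smallest group is a stand-in for "less cooperative
    execution overhead"; the real tradeoff is more involved.
    """
    for group_size in range(1, max_group + 1):
        if kv_capacity_gib(gpu_gib, param_gib, group_size) >= kv_demand_gib:
            return group_size
    return None


if __name__ == "__main__":
    # Hypothetical numbers: 80 GiB GPUs, 60 GiB of parameters per replica.
    for demand in (15.0, 45.0, 65.0, 79.0):
        print(demand, smallest_group_size(80.0, 60.0, demand, max_group=4))
```

With these made-up numbers, a full replica leaves 20 GiB per GPU for KVCache, while pairing two replicas and dropping half of the parameters on each leaves 50 GiB per GPU. Larger groups free more memory but would involve more GPUs in every request, which is the kind of cost a real plan has to weigh.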