Daily | Yixun Hong


HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators

Zhixiang Wei, Yun Wang, James Yen, Mingyuan Xia 2026-06-30

Inference LLM Quantization Accelerator GPU Hardware Serving

LLM inference has two phases with opposite bottlenecks: prefill is compute-bound and leaves HBM bandwidth idle, while decode is memory-bound, so serving both phases on costly HBM-based GPUs uses them inefficiently. HMA-Serve disaggregates serving across memory-heterogeneous accelerators (MemHA), pairing GDDR-based accelerators for prefill with HBM-based GPUs for decode. It combines phase-wise quantization, a compute-transfer pipeline, and deferred dequantization. Across four Qwen3 models and three production traces, HMA-Serve achieves up to 3.2× higher goodput than state-of-the-art memory-homogeneous methods and 4.8× higher goodput-per-dollar, with no measurable loss on generation-quality benchmarks. The result enables cost-effective, cross-vendor LLM serving that exploits heterogeneous memory technologies and breaks single-vendor assumptions about KV format and software stack.
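The compute-bound versus memory-bound split described above can be illustrated with a rough roofline estimate. The sketch below is not taken from the paper: the function names, the single square matmul used as a stand-in for a transformer layer, and the device figures (400 TFLOP/s peak compute, 2 TB/s memory bandwidth) are all hypothetical assumptions chosen only to show the reasoning.

```python
def arithmetic_intensity(tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one (tokens x d_model) @ (d_model x d_model) matmul."""
    flops = 2 * tokens * d_model * d_model
    weight_bytes = d_model * d_model * bytes_per_elem
    # Activations: read the input once and write the output once.
    act_bytes = 2 * tokens * d_model * bytes_per_elem
    return flops / (weight_bytes + act_bytes)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Intensity (FLOPs/byte) above which a device is limited by compute."""
    return peak_flops / mem_bandwidth


def bound(tokens: int, d_model: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a matmul as 'compute' or 'memory' bound on a given device."""
    intensity = arithmetic_intensity(tokens, d_model)
    return "compute" if intensity >= ridge_point(peak_flops, mem_bandwidth) else "memory"


# Hypothetical device figures, for illustration only.
PEAK_FLOPS = 400e12      # FLOP/s (assumed)
MEM_BANDWIDTH = 2e12     # bytes/s (assumed)

prefill = bound(tokens=2048, d_model=4096, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH)
decode = bound(tokens=1, d_model=4096, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH)
print(prefill, decode)
```

Under these assumptions, a 2048-token prefill step reuses each weight many times and lands far above the ridge point, while a single-token decode step reads every weight for roughly one multiply-add and lands far below it. This is the asymmetry that motivates placing the two phases on accelerators with different memory technologies.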


Energy-Aware Scheduling for Serverless LLM Serving on Shared GPUs

Tianyu Wang, Gourav Rattihalli, Aditya Dhakal, Longfei Shangguan 2026-06-30

Inference LLM GPU Runtime Scheduling Serving

LLM inference has a growing energy footprint in cloud clusters. Serverless serving makes this worse: elastic GPU sharing places workloads with conflicting resource demands under a single device-wide operating point. Festina is a profiling-guided, power-aware control plane that jointly coordinates request placement, SM partitioning, and GPU operating points to minimize cluster-wide energy while meeting TTFT/TBT SLOs. In experiments, Festina reduces energy consumption by up to 56% compared with four state-of-the-art LLM serving systems and one DVFS-augmented system, while keeping SLO attainment within a 2% margin. The results show that energy-first scheduling can deliver substantial energy savings without sacrificing performance, which addresses a critical need for sustainable cloud infrastructure.
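One ingredient of the approach summarized above, choosing a GPU operating point from profiling data subject to a latency SLO, can be sketched as follows. This is a simplified illustration and not Festina's algorithm: it ignores request placement and SM partitioning, and the `OperatingPoint` type, the `pick_operating_point` helper, and every number in the profile are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OperatingPoint:
    freq_mhz: int     # GPU clock frequency
    tbt_ms: float     # profiled time between tokens at this frequency
    power_w: float    # profiled average power draw at this frequency

    @property
    def energy_mj(self) -> float:
        # Energy per generated token: watts * milliseconds = millijoules.
        return self.power_w * self.tbt_ms


def pick_operating_point(
    profile: Sequence[OperatingPoint], tbt_slo_ms: float
) -> Optional[OperatingPoint]:
    """Return the lowest-energy point that meets the TBT SLO, or None if none does."""
    feasible = [p for p in profile if p.tbt_ms <= tbt_slo_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.energy_mj)


# Made-up profile, for illustration only.
PROFILE = [
    OperatingPoint(freq_mhz=900, tbt_ms=60.0, power_w=150.0),
    OperatingPoint(freq_mhz=1200, tbt_ms=45.0, power_w=220.0),
    OperatingPoint(freq_mhz=1500, tbt_ms=38.0, power_w=320.0),
]

choice = pick_operating_point(PROFILE, tbt_slo_ms=50.0)
print(choice)
```

With this invented profile, the slowest clock uses the least energy per token but misses a 50 ms SLO, so the selector settles on the middle frequency. When co-located workloads have different SLOs, no single device-wide choice suits all of them, which is the conflict the summary describes.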
