Daily | Yixun Hong


CHIA: An open-source framework for principled, agentic AI-driven hardware/software co-design research

Angela Cui, Ferran Hermida-Rivera, Jack Toubes, Raghav Gupta 2026-06-28

Computer Architecture Agentic AI Microarchitectural Gem5 Champsim Microarchitecture

Problem: Existing AI-driven hardware/software co-design research is limited to isolated, small-scale demonstrations due to the difficulty of designing and deploying complex AI-infused workflows. Method: CHIA introduces an open-source framework that models agentic AI-driven co-design flows as directed cyclic graphs (CHIA loops) with node implementations for tools like Chipyard, gem5, and Vivado, and provides features for isolation, profiling, and fault-tolerant execution. Finding: Five case studies demonstrate CHIA's capability, including automatic RTL-to-gem5 alignment, LLM-driven RTL microarchitecture implementation, and evolutionary architectural discovery. Why it matters: CHIA enables principled, scalable, and reproducible research on AI-driven hardware/software co-design, accelerating innovation across computer architecture, systems, compilers, and VLSI.
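As a rough illustration of what modeling a co-design flow as a directed cyclic graph can look like, the sketch below runs a tiny propose, simulate, evaluate loop with a simple retry on transient failures. It is not CHIA's API: the `run_loop` function, the `Node` type, the node names, and the scoring rule are all hypothetical, and the `simulate` node is a stand-in for a real tool such as gem5.

```python
from typing import Callable, Dict, Optional

# Hypothetical node type (not from CHIA): a node reads and updates the shared
# state, then returns the name of the next node, or None to stop the loop.
Node = Callable[[dict], Optional[str]]


def run_loop(nodes: Dict[str, Node], start: str, state: dict,
             max_steps: int = 100, max_retries: int = 2) -> dict:
    """Walk a graph that may contain cycles, retrying nodes that fail."""
    current: Optional[str] = start
    steps = 0
    while current is not None and steps < max_steps:
        for attempt in range(max_retries + 1):
            try:
                nxt = nodes[current](state)
                break
            except RuntimeError:
                # Re-raise once the retry budget for this node is used up.
                if attempt == max_retries:
                    raise
        state.setdefault("trace", []).append(current)
        current = nxt
        steps += 1
    return state


def propose(state: dict) -> Optional[str]:
    # Made-up design parameter, widened by one on each pass.
    state["width"] = state.get("width", 0) + 1
    return "simulate"


def simulate(state: dict) -> Optional[str]:
    # Stand-in for a simulator call; fails once to exercise the retry path.
    if not state.get("failed_once"):
        state["failed_once"] = True
        raise RuntimeError("transient simulator failure")
    state["score"] = state["width"] * 10  # made-up scoring rule
    return "evaluate"


def evaluate(state: dict) -> Optional[str]:
    # The back edge to "propose" is what makes the graph cyclic.
    return None if state["score"] >= 30 else "propose"


result = run_loop(
    {"propose": propose, "simulate": simulate, "evaluate": evaluate},
    start="propose",
    state={},
)
print(result["width"], result["score"], len(result["trace"]))
```

The `max_steps` bound is one simple way to keep a cyclic flow from running forever; a real framework would presumably offer richer termination and isolation controls.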


Cache-Resident LLM Inference in GB-Scale Last-Level Caches

Wanning Zhang, Tongzhou Gu, Marco Canini, Ceyu Xu 2026-06-28

Memory Hierarchy Microarchitecture Simulation Cache LLM Inference

The problem is that LLM inference is dominated by data movement across the memory hierarchy, and cache-resident execution is hard to achieve because of deeper pipelining, a larger KV-cache footprint, and synchronization bottlenecks at operator boundaries. The method is a cache-resident execution model that separates weight-centric operators from attention and KV-cache management into dedicated resource domains and relaxes synchronization from operator boundaries to sub-operator dependencies. It is instantiated on a multi-socket CPU cluster with a weight-attention decoupled architecture. Experiments show the prototype achieves a 2.04x to 11.51x speedup in time per output token for deployed Llama models, and up to a 13.9x speedup under a validated analytical model, substantially outperforming an equally provisioned llama.cpp. This matters because it demonstrates that commodity CPUs with GB-scale last-level caches can efficiently support LLM inference through cache residency, decoupled state management, and dependency-aware coordination.
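To see why relaxing synchronization to sub-operator dependencies can help, consider the toy timing model below. It is a sketch under stated assumptions, not the paper's scheduler: two operators run in separate resource domains, each is split into the same number of tiles, tile i of the second operator depends only on tile i of the first, and the tile times are made-up numbers.

```python
from typing import List


def barrier_finish(producer: List[float], consumer: List[float]) -> float:
    """Operator-level barrier: the consumer starts after every producer tile."""
    return sum(producer) + sum(consumer)


def tiled_finish(producer: List[float], consumer: List[float]) -> float:
    """Tile-level dependencies: consumer tile i waits only for producer tile i.

    Assumes each domain processes its own tiles in order, one at a time.
    """
    assert len(producer) == len(consumer), "assumes one-to-one tile dependence"
    producer_done = 0.0
    consumer_done = 0.0
    for t_prod, t_cons in zip(producer, consumer):
        producer_done += t_prod
        # A consumer tile starts once its input tile is ready and the
        # consumer domain has finished its previous tile.
        consumer_done = max(producer_done, consumer_done) + t_cons
    return consumer_done


# Hypothetical per-tile times in arbitrary units.
producer_tiles = [2.0, 2.0, 2.0, 2.0]
consumer_tiles = [1.0, 1.0, 1.0, 1.0]

with_barrier = barrier_finish(producer_tiles, consumer_tiles)
with_tiles = tiled_finish(producer_tiles, consumer_tiles)
print(with_barrier, with_tiles)
```

In this toy case the consumer overlaps with all but the last producer tile, so only the final consumer tile adds to the critical path. How much overlap a real system gets depends on the actual dependency structure and on contention that this model ignores.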
