CHIA: An open-source framework for principled, agentic AI-driven hardware/software co-design research

Angela Cui, Ferran Hermida-Rivera, Jack Toubes, Raghav Gupta 2026-06-28

Problem: Existing AI-driven hardware/software co-design research is limited to isolated, small-scale demonstrations due to the difficulty of designing and deploying complex AI-infused workflows. Method: CHIA introduces an open-source framework that models agentic AI-driven co-design flows as directed cyclic graphs (CHIA loops) with node implementations for tools like Chipyard, gem5, and Vivado, and provides features for isolation, profiling, and fault-tolerant execution. Finding: Five case studies demonstrate CHIA's capability, including automatic RTL-to-gem5 alignment, LLM-driven RTL microarchitecture implementation, and evolutionary architectural discovery. Why it matters: CHIA enables principled, scalable, and reproducible research on AI-driven hardware/software co-design, accelerating innovation across computer architecture, systems, compilers, and VLSI.

EGG: An Expert-Guided Agent Framework for Kernel Generation

Yaochen Han, Ke Fan, Hongxu Jiang, Wanqi Xu 2026-06-28

EGG addresses the problem of automating high-performance GPU kernel generation for LLMs, which currently requires manual expert tuning. The method decomposes kernel generation into two hierarchical stages—algorithmic structure design and hardware-specific tuning—guided by expert optimization principles and a stage-aware multi-agent collaboration mechanism. Experimental results on KernelBench and real-world workloads demonstrate a 2.13x average speedup over PyTorch, outperforming existing agent-based and RL-based approaches. This matters because it significantly reduces the reliance on manual optimization, enabling scalable and efficient kernel generation to combat the growing computational costs of LLMs.

Moebius: Serving Mixture-of-Expert Models with Seamless Runtime Parallelism Switch

Shaoyu Wang, Yizhuo Liang, Jaeyong Song, Chong Li 2026-06-28

Moebius addresses the problem that serving Mixture-of-Expert (MoE) models requires choosing between tensor parallelism (TP) and expert parallelism (EP), but the optimal choice depends on concurrency, which varies in production workloads. The method introduces a runtime parallelism switch that transitions between EP and TP without restarting the engine or dropping in-flight requests, by moving only the owner-changed slices of expert weights and KV cache using fused GPU-to-GPU transfer kernels. On 8x H200 GPUs serving Qwen3-235B-A22B, Moebius matches the better static parallelism at every operating point, achieves 1.16-1.25x speedup on RL rollouts, and completes each switch in 215-434 ms with only 2.4% memory overhead. This matters because it eliminates the performance penalty of pinning a single parallelism layout, enabling efficient serving under bursty and decaying concurrency patterns in production and reinforcement-learning workloads.

RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning

Rongjian Chen, Jianmin Hu, Kejiang Ye, Minxian Xu 2026-06-28

Problem: Existing synchronous on-policy GRPO RLVR systems leave trainer GPUs idle during rollout, while asynchronous systems train on stale data. Method: RolloutPipe introduces complete-group pipelining (CGP) and frontier-group dispatch (FGD) to overlap rollout and training in disaggregated architectures while maintaining on-policy correctness. Finding: Evaluated on Qwen3-1.7B across four benchmarks and twelve rollout settings, RolloutPipe reduces rollout-to-train-end time by 30.7%-42.3% and lowers trainer waiting ratio by 37%-76% versus Slime. Why it matters: This enables efficient, on-policy LLM reinforcement learning post-training without idle GPU resources or stale data, critical for scaling reasoning tasks.

Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

Jiading Gai, Shuai Zhang, Kaj Bostrom, Jin Huang 2026-06-28

KernelPro addresses the challenge of automated GPU kernel optimization by introducing a closed-loop multi-agent system that integrates LLM code generation with hardware profiler feedback and pluggable micro-profiling tools. The method employs a two-stage tool invocation architecture with roofline-based bottleneck classification, domain-adapted MCTS search, and direct CuTe source-level code generation from the CUTLASS/CuTe codebase. On KernelBench, KernelPro achieves geometric mean speedups of 2.42x, 4.69x, and 5.30x on Levels 1, 2, and 3, respectively, and a 1.23x improvement over hand-tuned Triton on VeOmni's MoE kernels; ablation studies confirm significant contributions from each design component. This matters because KernelPro is the first CUDA kernel coding agent to optimize energy efficiency in addition to speed, achieving an 11.6% measured energy reduction at matched speed while establishing state-of-the-art performance across all three difficulty levels.
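For readers unfamiliar with roofline-based bottleneck classification, the following is a minimal sketch of the textbook roofline test, not KernelPro's actual classifier. The `Device` type, the function names, and the device numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device description (numbers below are illustrative)."""
    peak_flops: float       # peak compute throughput, FLOP/s
    peak_bandwidth: float   # peak memory bandwidth, bytes/s


def classify_bottleneck(flops: float, bytes_moved: float, device: Device) -> str:
    """Classify a kernel under the basic roofline model.

    A kernel whose arithmetic intensity (FLOP per byte) falls below the
    device's ridge point is limited by memory traffic; otherwise it is
    limited by compute.
    """
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    intensity = flops / bytes_moved
    ridge = device.peak_flops / device.peak_bandwidth
    return "memory-bound" if intensity < ridge else "compute-bound"


def attainable_flops(flops: float, bytes_moved: float, device: Device) -> float:
    """Upper bound on achievable FLOP/s for this kernel on this device."""
    intensity = flops / bytes_moved
    return min(device.peak_flops, intensity * device.peak_bandwidth)


# Assumed device: 100 TFLOP/s and 2 TB/s, so the ridge point is 50 FLOP/byte.
gpu = Device(peak_flops=1.0e14, peak_bandwidth=2.0e12)

# An elementwise-style kernel: little compute per byte moved.
elementwise = classify_bottleneck(flops=1e9, bytes_moved=8e9, device=gpu)

# A matmul-style kernel: a lot of compute per byte moved.
matmul = classify_bottleneck(flops=2e12, bytes_moved=1e10, device=gpu)

print(elementwise, matmul)
```

A classification like this would plausibly decide which optimizations are worth trying next (for example, fusion for memory-bound kernels), though how KernelPro uses the result is not stated in the summary above.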

Agentic evolution of physically constrained foundation models

Jiangwei Zhang, Wen Sun, Chong Wang, Shiyao Li 2026-06-28

The problem is that contemporary generalist AI agents lack physical grounding, which leads to hallucinated, hardware-incompatible designs. The method introduces a physically grounded, multi-agent discovery engine that uses an Evolutionary Knowledge Graph and algorithmic Chain-of-Thought to direct structural evolution. Experimental evidence shows the engine evolved two compression methods (Q-Enhance and MoE-Salient-AQ) that surpass human heuristics, and deployed a 235-billion-parameter model on a dual-A100 server with 75% memory reduction and only 0.64% accuracy loss. This matters because it establishes a scalable hardware-software co-design paradigm for machine-driven discovery within strict physical constraints.

Reading AI Model Compilation in MLIR Through the Lens of Formal Theories

Javed Absar 2026-06-28

The problem is that MLIR's design principles, such as match-and-rewrite and staged lowering, are typically derived from engineering intuition rather than formal theory. The method maps these principles to established formal theories, including term-rewriting systems, refinement calculus, and abstract interpretation. The abstract reports no experimental results, since the work is a conceptual argument rather than an empirical study. This matters because formal theories provide a precise vocabulary for evaluating abstraction completeness and ideal design, which becomes critical as coding agents automate implementation but still rely on well-structured semantics.
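To make the term-rewriting reading of match-and-rewrite concrete, here is a toy sketch of a rewrite system driven to a fixpoint. It does not use MLIR or its APIs; the tuple-based term encoding and the two rules are assumptions made purely for illustration.

```python
# Terms are nested tuples: ("var", name), ("const", value),
# ("add", lhs, rhs), ("mul", lhs, rhs). This encoding is hypothetical.

ZERO = ("const", 0)
ONE = ("const", 1)


def rewrite_once(term):
    """Apply the rules bottom-up in a single pass over the term."""
    if not isinstance(term, tuple):
        return term  # leaf payload such as a name or a number
    op, *args = term
    args = [rewrite_once(a) for a in args]
    # Rule 1: x + 0 -> x
    if op == "add" and args[1] == ZERO:
        return args[0]
    # Rule 2: x * 1 -> x
    if op == "mul" and args[1] == ONE:
        return args[0]
    return (op, *args)


def normalize(term):
    """Rewrite until no rule applies (a normal form).

    Termination is guaranteed here because every rule strictly
    shrinks the term, so only finitely many rewrites are possible.
    """
    while True:
        new_term = rewrite_once(term)
        if new_term == term:
            return term
        term = new_term


# (x + 0) * 1 normalizes to x.
expr = ("mul", ("add", ("var", "x"), ZERO), ONE)
normal_form = normalize(expr)
print(normal_form)
```

Questions such as whether `normalize` always terminates, or whether the result depends on rule order, are exactly the kind that term-rewriting theory gives vocabulary for.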

GPUSparse: GPU-Accelerated Learned Sparse Retrieval with Parallel Inverted Indices

Ashutosh Sharma 2026-06-28

GPUSparse addresses the CPU bottleneck in learned sparse retrieval by introducing a GPU-accelerated inverted index with parallel scoring. The system uses block-aligned posting lists, batched scatter-add algorithms, and fused Triton kernels to process hundreds of queries simultaneously. On MS MARCO passage ranking, GPUSparse matches exact CPU scoring (MRR@10 = 0.383) while achieving a 235x speedup over Pyserini and a throughput of 787 QPS, whereas Seismic sacrifices 25% recall for speed. This matters because it enables real-time, exact sparse retrieval at scale and reveals a fundamental tradeoff between work efficiency and bandwidth efficiency in GPU-based search systems.
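The scatter-add scoring pattern over an inverted index can be sketched sequentially as below. This is a plain-Python reference for the scoring semantics only, not GPUSparse's GPU kernels; the index layout, the function name `score_batch`, and the toy weights are assumptions.

```python
def score_batch(index, queries, num_docs):
    """Exact sparse dot-product scoring for a batch of queries.

    index:   term -> list of (doc_id, doc_weight) postings
    queries: list of dicts mapping term -> query_weight
    Returns one dense score list per query.
    """
    scores = [[0.0] * num_docs for _ in queries]
    for qi, query in enumerate(queries):
        for term, q_weight in query.items():
            # Each posting scatters one partial product into the
            # accumulator slot of its document (the "scatter-add").
            for doc_id, d_weight in index.get(term, ()):
                scores[qi][doc_id] += q_weight * d_weight
    return scores


def top1(score_row):
    """Index of the highest-scoring document for one query."""
    return max(range(len(score_row)), key=score_row.__getitem__)


# Toy inverted index over three documents (weights are made up).
index = {
    "gpu": [(0, 1.0), (2, 0.5)],
    "index": [(1, 2.0), (2, 1.0)],
}
queries = [{"gpu": 2.0}, {"gpu": 1.0, "index": 1.0}]

scores = score_batch(index, queries, num_docs=3)
print(scores, [top1(row) for row in scores])
```

On a GPU the inner accumulation would typically run in parallel across postings and queries, which is where atomic or batched scatter-add comes in; the loop above only fixes what result such a kernel must reproduce.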

TileMaxSim: IO-Aware GPU MaxSim Scoring with Dimension Tiling and Fused Product Quantization

Ashutosh Sharma 2026-06-28

The problem is that existing GPU implementations of MaxSim scoring for multi-vector retrieval models achieve only 5-18% of peak HBM bandwidth due to materializing the full similarity matrix. The method, TileMaxSim, introduces IO-aware Triton kernels with multi-query SRAM tiling, dimension tiling for embeddings exceeding 128 dimensions, and fused product-quantization scoring via shared-memory lookup tables. On NVIDIA H100 GPUs, TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second, achieving a 220x speedup over loop-based scoring and cutting ColBERTv2/PLAID scoring latency from 268 ms to 1.2 ms. This matters because it provides a drop-in replacement that preserves exact retrieval quality while dramatically reducing end-to-end latency and enabling efficient GPU utilization for state-of-the-art multi-vector retrieval models.
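As a reference for what MaxSim computes, the sketch below gives the standard definition (for each query token, take the maximum dot product over document tokens, then sum) alongside a blocked variant that keeps only a running maximum per query token. The blocked variant only illustrates the general idea of avoiding the full similarity matrix; it is not TileMaxSim's Triton kernel, and the function names and toy vectors are assumptions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """Reference MaxSim: sum over query tokens of the best document match.

    Conceptually this evaluates the full |query| x |doc| similarity matrix.
    Assumes doc_vecs is non-empty.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def maxsim_tiled(query_vecs, doc_vecs, tile=2):
    """Same score, computed one block of document tokens at a time.

    Only one running maximum per query token is kept, so the full
    similarity matrix is never held at once. Assumes doc_vecs is non-empty.
    """
    running = [float("-inf")] * len(query_vecs)
    for start in range(0, len(doc_vecs), tile):
        block = doc_vecs[start:start + tile]
        for i, q in enumerate(query_vecs):
            best = max(dot(q, d) for d in block)
            if best > running[i]:
                running[i] = best
    return sum(running)


# Toy 2-dimensional token embeddings (values are made up).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b, maxsim_tiled(query, doc_b, tile=2))
```

Because the maximum is associative, processing document tokens block by block yields the same score as the reference, which is consistent with the summary's point that retrieval quality is preserved exactly.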

Cache-Resident LLM Inference in GB-Scale Last-Level Caches

Wanning Zhang, Tongzhou Gu, Marco Canini, Ceyu Xu 2026-06-28

The problem is that LLM inference is dominated by data movement across the memory hierarchy, and achieving cache-resident execution is complicated by deeper pipelining, increased KV-cache footprint, and synchronization bottlenecks at operator boundaries. The method introduces a cache-resident execution model that separates weight-centric operators from attention and KV-cache management into dedicated resource domains, relaxes synchronization to sub-operator dependencies, and is instantiated on a multi-socket CPU cluster with a weight-attention decoupled architecture. Experimental evidence shows the prototype achieves 2.04x-11.51x speedup on time-per-output-token for deployed Llama models and up to 13.9x speedup under a validated analytical model, substantially outperforming equally provisioned llama.cpp. This matters because it demonstrates that commodity CPUs with GB-scale last-level caches can efficiently support LLM inference through cache residency, decoupled state management, and dependency-aware coordination.
