To support the surging demand for local AI inference, modern edge platforms have evolved into heterogeneous System-on-Chips (SoCs) integrating diverse Processing Units (PUs), including CPUs, GPUs, and NPUs. These architectures offer unique potential to accelerate Graph Neural Networks (GNNs), which are critical for local applications such as code assistants, by matching diverse PUs to the model's hybrid compute- and memory-intensive phases. However, existing frameworks typically execute on a single PU or rely on rigid, coarse-grained partitioning. Such limitations hinder the effective exploitation of hardware heterogeneity, leading to suboptimal resource utilization, particularly for complex GNN models. In this paper, we propose FUSE (Fine-grained Unified Scheduling Engine), an automated compilation flow that abstracts and decomposes GNNs into fine-grained functional primitives, enabling better exploitation of hardware heterogeneity. Unlike prior works that treat data and hardware separately, FUSE explores a unified design space that jointly optimizes data partitioning and fine-grained hardware mapping. To explore this unified design space efficiently, FUSE implements a systematic online search engine guided by an offline-profiled hardware cost model. This engine prunes the design space and performs a hierarchical search to balance workloads across heterogeneous PUs, ultimately generating an efficient execution plan tailored to the input. We evaluate FUSE on Intel AI PC SoCs using three widely used GNN models and three datasets. FUSE outperforms single-device baselines by up to 2.67× and achieves speedups of up to 1.67× over state-of-the-art coarse-grained schedulers, while maintaining higher PU utilization.