Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink it: small LMs (≤20B active parameters) now achieve performance competitive with frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) can run these models at interactive latencies. This raises the question: can local inference viably redistribute demand away from centralized infrastructure? To answer it, we propose Intelligence Per Watt (IPW), defined as task accuracy per unit power, as a metric for assessing the capability and efficiency of local inference across model–accelerator pairs. We conduct a large-scale empirical study spanning 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis yields three key findings: (1) local LMs accurately answer 88.7% of single-turn chat and reasoning queries, with accuracy varying by domain; (2) from 2023 to 2025, IPW improved 5.3×, while local query coverage increased from 23.2% to 71.3%; (3) local accelerators achieve at least 1.4× lower IPW than cloud accelerators running identical models, indicating significant optimization headroom. These results demonstrate that local inference can meaningfully redistribute demand away from centralized infrastructure, with IPW serving as a critical metric for tracking this transition. We also release our IPW profiling harness to enable systematic IPW benchmarking.
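The IPW metric defined above (task accuracy per unit power) can be sketched concretely. The following is a minimal illustration, not the released profiling harness: the record fields, aggregation, and units (joules, seconds, watts) are assumptions about how per-query measurements of accuracy, energy, and latency might be combined.

```python
from dataclasses import dataclass

@dataclass
class QueryRecord:
    """One profiled query; field names are illustrative, not the harness API."""
    correct: bool           # whether the model answered the query accurately
    energy_joules: float    # energy consumed while serving the query
    latency_seconds: float  # wall-clock time to serve the query

def intelligence_per_watt(records: list[QueryRecord]) -> float:
    """Accuracy per unit average power over a batch of profiled queries."""
    accuracy = sum(r.correct for r in records) / len(records)
    total_energy = sum(r.energy_joules for r in records)
    total_time = sum(r.latency_seconds for r in records)
    avg_power_watts = total_energy / total_time  # P = E / t
    return accuracy / avg_power_watts

# Hypothetical measurements for three queries:
records = [
    QueryRecord(True, 120.0, 4.0),
    QueryRecord(False, 90.0, 3.0),
    QueryRecord(True, 150.0, 5.0),
]
# accuracy = 2/3, average power = 360 J / 12 s = 30 W, IPW = (2/3) / 30
print(round(intelligence_per_watt(records), 4))  # → 0.0222
```

Under this framing, the reported 5.3× IPW improvement can come from either side of the ratio: higher accuracy from better small models, or lower average power from more efficient accelerators.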