Portable embedded SoC processor architects face exponentially increasing demand for new functionality, faster real-time communication, stronger security, and higher reliability, while constraints on energy, feature size, NRE cost, and time-to-market (TTM) grow tighter than ever. Existing approaches to these mutually conflicting design goals rely heavily on special-purpose accelerators (SPAs) to handle the heavy lifting in the target embedded SoC designs. These SPAs, synthesized as ASICs or on FPGAs, are usually attached to the base processor as co-processors that execute the performance-critical regions of applications. ASIC-based SPAs achieve performance and energy efficiency at the expense of post-manufacturing programmability while incurring large NRE cost and long TTM; FPGA-based SPAs retain programmability at the cost of significant energy and area increases. Furthermore, attaching these SPAs as co-processors adds considerable communication and synchronization overhead, severely compromising their promised benefits. This paper proposes a design paradigm that moves away from the common scheme of adding co-processing ASIC/FPGA SPAs toward an integrated and reconfigurable design. Specifically, we propose a new class of embedded processor in which the conventional ALU is replaced with a more powerful and flexible Versatile Processing Unit (VPU). The VPU enables multiple interdependent instructions to be fused and processed together as a single atomic VPU instruction by exploiting the dataflow dependencies of the application code. Instruction fusion is performed automatically by a VPU-aware compiler. The optimized VPU code reduces code size and amplifies the effective processor bandwidth and capacity by eliminating transient computation and register-spill code. Experimental results show up to 400% and an average of 150% speedup on MediaBench with negligible area increase.
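To make the fusion idea concrete, the following is a minimal illustrative sketch — not the paper's actual VPU-aware compiler pass — of dataflow-based instruction fusion. It greedily chains three-address instructions whose result is a transient value consumed by exactly one successor, so each chain could in principle execute as one atomic VPU instruction without writing intermediates to the register file. All instruction tuples and the heuristic are hypothetical.

```python
def fuse_chains(instrs):
    """Greedily fuse instructions whose result feeds exactly one consumer.

    instrs: list of (dest, op, src1, src2) tuples in program order.
    Returns a list of chains; each chain is a list of fused tuples that
    could become a single atomic VPU instruction.
    """
    # Count how many times each value appears as a source operand.
    uses = {}
    for _, _, s1, s2 in instrs:
        for s in (s1, s2):
            uses[s] = uses.get(s, 0) + 1

    fused, current = [], []
    for ins in instrs:
        if current:
            prev_dest = current[-1][0]
            # Extend the chain only if the previous result is a transient
            # value consumed solely by this instruction.
            if prev_dest in (ins[2], ins[3]) and uses.get(prev_dest, 0) == 1:
                current.append(ins)
                continue
            fused.append(current)
        current = [ins]
    if current:
        fused.append(current)
    return fused


# Example: t1 and t2 are transient values that never need to reach
# the register file once the chain executes as one fused instruction.
code = [
    ("t1", "mul", "a", "b"),
    ("t2", "add", "t1", "c"),
    ("r0", "sub", "t2", "d"),
]
chains = fuse_chains(code)
# Three dependent instructions collapse into one fused chain.
```

A real pass would additionally respect the VPU's operand-count and depth limits when deciding where to cut a chain; this sketch only captures the single-consumer dataflow criterion.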