University of Texas at Austin

Although one of the main appeals of Stochastic Computing is that complicated arithmetic operations can be performed with remarkably simple circuitry, it suffers from high latency due to the required bit-stream length and from the area cost of generating those streams. Using deterministic bit streams addresses the area problem to some degree, but latency remains an issue. A major contributor to this latency is that each operation is performed serially, and we argue that, given the uniform nature of deterministic stochastic representations, this is not necessary. Using the multiply-accumulate (MAC) operation as the target application, our research addresses the latency issue by exploiting data-level parallelism: the bottleneck of a single arithmetic unit is broken by splitting the computation across multiple "parallel datapaths," which reduces overall latency to 1/2^(N-1) of the serial design at the cost of area. We demonstrate how adjusting the degree of parallelism provides a means of trading off some of these performance gains for area reduction. Additionally, we show that this design provides large performance benefits for an operation with broad applications in Machine Learning: the inner product. Finally, we show that the drastic reduction in computation time also pays dividends in energy consumption per operation, with the fully parallelized multiply-accumulate circuit consuming nearly 4x less energy than the serialized version.
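The parallel-datapath idea can be illustrated with a small software model. The sketch below is our own, not the paper's implementation: values are encoded as deterministic unary bit streams, a product is computed by ANDing the streams under clock division (one stream cycles fast while the other is held), and the cycle range of each product is partitioned across independent chunks whose partial counts are summed. All function names here are hypothetical.

```python
def to_unary(value, length):
    # Deterministic unary encoding: the first round(value * length) bits are 1.
    ones = round(value * length)
    return [1] * ones + [0] * (length - ones)

def mac_chunk(a_bits, b_bits, start, stop):
    # One "datapath": count AND outputs over cycles [start, stop).
    # Clock-division pairing ensures every bit of a meets every bit of b.
    n, m = len(a_bits), len(b_bits)
    count = 0
    for t in range(start, stop):
        count += a_bits[t % n] & b_bits[(t // n) % m]
    return count

def parallel_mac(pairs, length, datapaths):
    # Inner product of value pairs; each product's n*m cycles are split
    # across `datapaths` chunks that could run concurrently in hardware.
    total_cycles = length * length
    acc = 0
    for a, b in pairs:
        a_bits, b_bits = to_unary(a, length), to_unary(b, length)
        bounds = [total_cycles * k // datapaths for k in range(datapaths + 1)]
        acc += sum(mac_chunk(a_bits, b_bits, bounds[k], bounds[k + 1])
                   for k in range(datapaths))
    return acc / total_cycles  # recovered inner-product value
```

Because the deterministic streams are uniform, each chunk's partial count is independent of the others, so the result is identical for any number of datapaths; only the (modeled) wall-clock latency changes.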