Deep learning using recurrent neural networks has broadened the horizon of artificial intelligence. It can process massive amounts of multimodal natural data (video, audio) and learn useful joint representations for various applications. However, implementing recurrent neural networks in hardware and learning such representations demand high throughput and memory bandwidth from the hardware platform. This work presents a 3D hardware architecture with an application-specific instruction set processor, 3D-stacked memory, and appropriately sized on-chip memory for training and inference of recurrent neural networks. It also implements a compact instruction set, derived by analyzing different complex, time-consuming special operations and grouping them into high-level function blocks. The accelerator further performs state-of-the-art mixed-precision training using custom instructions. A high-level programming environment is developed to generate Very Long Instruction Word (VLIW) instructions for this accelerator, and a popular and successful variant of the recurrent neural network is processed with it. At 28 nm, this work achieves an 8.5× processing speedup, 47.5× better energy efficiency per sequence, and a 2.71× reduction in silicon area compared with a GPU.
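
To make the mixed-precision training claim concrete, here is a minimal NumPy sketch of the standard technique (FP32 master weights, FP16 compute, and loss scaling); this is an illustration of the general method, not the paper's custom instructions, and all names and values are hypothetical:

```python
import numpy as np

# Hypothetical sketch of mixed-precision training:
# keep an FP32 "master" copy of the weights, do the arithmetic-heavy
# passes in FP16, and scale the loss so small gradients do not
# underflow in half precision.

rng = np.random.default_rng(0)
w_master = rng.standard_normal((4, 4)).astype(np.float32)  # FP32 master weights
x = rng.standard_normal((8, 4)).astype(np.float16)         # FP16 inputs
target = rng.standard_normal((8, 4)).astype(np.float16)

loss_scale = 1024.0  # scales gradients up into FP16's representable range
lr = 0.01

for _ in range(10):
    w16 = w_master.astype(np.float16)            # cast weights down for compute
    y = x @ w16                                  # FP16 forward pass
    err = (y - target).astype(np.float32)
    grad = (x.T.astype(np.float32) @ err) * loss_scale
    grad16 = grad.astype(np.float16)             # gradient as stored in FP16
    # unscale in FP32 before updating the master copy
    w_master -= lr * (grad16.astype(np.float32) / loss_scale)
```

The FP32 master copy prevents small weight updates from being rounded away, while loss scaling keeps FP16 gradients above the half-precision underflow threshold.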