We propose an efficient implementation of Monte Carlo based statistical static timing analysis (MC-SSTA) on FPGAs. MC-SSTA, which repeatedly executes ordinary STA using sets of randomly generated gate delay samples, is widely accepted as an accuracy reference because it can handle any timing distribution and correlation. However, MC-SSTA requires extremely long CPU times, which has prevented its adoption as a mainstream timing analyzer. Motivated by its inherent parallelism, we propose a hardware acceleration of MC-SSTA. In our approach, the timing graph of a target netlist is translated into RTL code that can be mapped onto an FPGA as a dedicated STA engine. Each delay arc is realized as a random delay generator with specified parameters, followed by a register, which enables fully pipelined operation across the logic gates in a path. Our implementation exploits both path-level and gate-level parallelism, achieving an 87-fold speedup over a software implementation for a 6-bit multiplier. We have also verified experimentally that the analysis accuracy is comparable to that obtained with the Mersenne Twister and Box-Muller methods, which are well-known methods for generating high-quality normally distributed random numbers.
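To make the MC-SSTA loop concrete, the following is a minimal software sketch in Python, not the FPGA engine proposed here. It draws one normally distributed delay per arc and runs an ordinary longest-path STA pass for each sample. The timing graph `ARCS`, the node names, the delay parameters, and the assumption of independent Gaussian arc delays are all hypothetical and chosen only for illustration. Python's `random.Random` uses the Mersenne Twister as its core generator, and `box_muller` applies the Box-Muller transform to its uniform output.

```python
import math
import random


def box_muller(rng):
    """Return one standard-normal sample using the Box-Muller transform."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so that log(u1) is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


# Hypothetical timing graph: (source, destination, mean delay, sigma).
# Arcs are listed in topological order, so every arc entering a node
# appears before any arc leaving it.
ARCS = [
    ("in", "g1", 1.0, 0.10),
    ("in", "g2", 1.2, 0.10),
    ("g1", "g3", 0.8, 0.05),
    ("g2", "g3", 0.7, 0.05),
    ("g3", "out", 0.5, 0.05),
]


def sta_once(arcs, delays, source="in", sink="out"):
    """One ordinary STA pass: latest arrival time at the sink."""
    arrival = {source: 0.0}
    for (src, dst, _, _), delay in zip(arcs, delays):
        t = arrival[src] + delay
        if t > arrival.get(dst, float("-inf")):
            arrival[dst] = t  # keep the latest arrival at each node
    return arrival[sink]


def mc_ssta(arcs, n_samples, seed=0):
    """Repeat STA over randomly sampled arc delays; return all sink delays."""
    rng = random.Random(seed)  # Mersenne Twister uniform generator
    results = []
    for _ in range(n_samples):
        # Assumption: arc delays are independent Gaussians.
        delays = [mu + sigma * box_muller(rng) for (_, _, mu, sigma) in arcs]
        results.append(sta_once(arcs, delays))
    return results


if __name__ == "__main__":
    samples = mc_ssta(ARCS, 10000, seed=1)
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
    print(f"mean delay = {mean:.3f}, std = {math.sqrt(var):.3f}")
```

In this sketch the samples are processed one after another, and each sample visits the arcs sequentially. These are the two loops that the path-level and gate-level parallelism described above would overlap in hardware.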