3D integration enables multiple devices to be stacked directly on top of a microprocessor, significantly improving both power efficiency and inter-device latency. For 3D synchronous digital systems, the circuits and architecture of the clock distribution/generation network (CDN) are among the key design considerations. However, previous studies have examined the performance benefits by considering only 2D trees in every tier of a 3D CDN, together with techniques such as reducing the size of the clock trees and improving the algorithms for the CDN flip-flops. In this work, we explore more aggressive 3D CDN circuits and structures that improve both energy efficiency and latency through 3D stacking, as well as through an additional reduction of the CDN power supply. Our results show that combining a novel 3D clock receiver with a 3D CDN stacking organization achieves a 2.29× energy-efficiency improvement over conventional H-tree structures in 40 nm CMOS. Our 3D TSV and on-chip CDN channel models are based on a highly accurate 3D electromagnetic (EM) solver (HFSS) and 2D EM Momentum models, respectively.