In current out-of-order architectures, the critical path lies in the memory
access. Since the memory access delay consists mainly of wire delays, it
may not shrink as feature sizes are reduced. Therefore,
the performance of the out-of-order architecture may no longer improve with
future technologies. To solve this problem, we present a novel architecture,
called the Cascade ALU architecture. In a processor with the Cascade ALU
architecture, the critical path lies in the ALU latency. Since the ALU latency
mainly consists of gate delays, the cycle time can be reduced with feature
size reduction. An asynchronous implementation is shown to be suitable for the
Cascade ALU because the instruction execution latency varies with the
instruction being executed. In order for the Cascade ALU architecture to enhance
processor performance effectively, this paper presents a method for hiding
asynchronous handshake overhead based on fine-grain pipelining. Finally,
we present evaluation results demonstrating that the Cascade ALU architecture
can outperform current out-of-order processors.