CPU Architecture · C++ · Performance

Building an Out-of-Order Superscalar CPU Simulator

November 10, 2024

How I built a configurable superscalar CPU simulator in C++ to explore the design space of modern processors — and what the bottleneck analysis revealed.

Motivation

Modern CPUs are superscalar and out-of-order: they fetch, rename, issue, and retire multiple instructions per cycle, executing them out of program order while maintaining the illusion of sequential execution. Understanding *why* a particular microarchitectural configuration performs the way it does requires more than textbook knowledge — it requires building and measuring.

Architecture of the Simulator

The simulator models the key stages of an out-of-order pipeline:

Fetch & Decode

Up to WIDTH instructions are fetched per cycle from an instruction trace. Each instruction carries an opcode, destination register, and source registers.

Register Renaming

A rename map table (RMT) maps architectural registers to physical registers. The rename stage allocates entries in the Reorder Buffer (ROB) and updates the RMT. This eliminates false dependencies (WAR and WAW hazards).

Issue Queue & Wakeup/Select

Instructions enter the issue queue and wait for their source operands to become ready. The wakeup logic broadcasts completing tags each cycle; the select logic picks up to WIDTH ready instructions for execution, respecting functional unit availability.

Execute

The simulator models multi-latency functional units: integer ALU (1 cycle), multiplier (3 cycles), and memory (variable latency depending on cache hit/miss).

Retire

Instructions retire in program order from the ROB head, freeing physical registers and ROB entries.

Performance Optimization

The initial implementation was correct but slow. Profiling revealed:

  • Issue queue search was O(n) per wakeup — I switched to a tag-indexed wakeup list for O(1) broadcast.
  • ROB operations used a linked list — switching to a circular buffer with head/tail indices eliminated pointer chasing.
  • Cache simulation had poor locality — restructuring the tag arrays to be contiguous in memory improved cache performance of the simulator itself.

These optimizations enabled running design-space sweeps across thousands of configurations in reasonable time.

What the Bottleneck Analysis Showed

By sweeping WIDTH (1–8), ROB size (32–256), and IQ size (16–128), I found:

  1. ROB pressure dominates at narrow widths: a WIDTH=2 machine with ROB=32 sustains only 1.4 IPC because the ROB fills up with long-latency instructions.
  2. IQ pressure dominates at wide widths: a WIDTH=8 machine with IQ=16 can't find enough ready instructions to fill the pipeline.
  3. FU contention becomes the bottleneck once ROB and IQ are sufficiently large: adding a second multiplier unit improved IPC by 12% on multiply-heavy traces.

Key Insight

The simulator confirmed a fundamental principle of computer architecture: the bottleneck shifts as you scale different resources. There's no single "right" configuration — it depends on the workload mix and the area/power budget.