Designing a CNN Accelerator in Verilog: Lessons from RTL
What I learned building a complete CNN inference pipeline in Verilog — from DRAM interfacing to timing closure, and why hardware design forces you to think differently.
Why Build a CNN Accelerator in RTL?
Software engineers can implement a convolutional neural network in a few lines of PyTorch. But when you need inference at the edge — low power, low latency, deterministic timing — you need custom hardware. Building a CNN accelerator in Verilog teaches you to think about computation in terms of data flow, resource sharing, and cycle-accurate timing.
The Pipeline Architecture
My design implements a 4×4 convolution → Leaky ReLU → 2×2 average pooling pipeline. The key design decisions and challenges were:
1. Streaming from DRAM
Rather than loading the entire feature map into on-chip SRAM (expensive in area), the pipeline streams data directly from DRAM. This required implementing:
- CMD/ADDR/DQ_oe signals for the DRAM interface
- Burst read/write operations to amortize row activation overhead
- Modulo-8 row padding to align data for efficient burst access
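The modulo-8 padding is just a round-up to the next burst boundary. A minimal sketch of that alignment step, with hypothetical module, parameter, and signal names (the original RTL is not shown here):

```verilog
// Hypothetical helper: round a row length up to the next multiple of 8
// so every row starts on an 8-beat burst boundary. Names and widths
// are illustrative, not taken from the original design.
module row_pad #(
    parameter W = 12                  // bits in the row-length counter
)(
    input  wire [W-1:0] row_len,      // pixels in one feature-map row
    output wire [W-1:0] padded_len    // 8 * ceil(row_len / 8)
);
    // Add 7, then clear the low 3 bits: (row_len + 7) & ~7.
    // Assumes row_len + 7 still fits in W bits.
    localparam [W-1:0] MASK = {{(W-3){1'b1}}, 3'b000};
    assign padded_len = (row_len + 7) & MASK;
endmodule
```

With padded rows, every burst returns a whole number of row segments, so the controller never has to split a burst across a row boundary.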
2. The Convolution Engine
The 4×4 convolution uses a systolic-style multiply-accumulate (MAC) array. Each clock cycle, a new input pixel shifts in while partial products accumulate. The key challenge is data reuse: with a stride-1 4×4 window, each interior input pixel contributes to up to 16 different outputs, one per window position that covers it.
I used a line buffer approach: four shift registers hold one row of pixels each, feeding the 4×4 window into the MAC array. When the window slides horizontally, only one new pixel enters per cycle.
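One way to sketch that window generator is with three full-row delay lines plus the live input stream acting as the fourth row, and a 4-pixel horizontal tap per row. All names, widths, and the row-width default below are assumptions, not the original RTL:

```verilog
// Illustrative line buffer + 4x4 window generator.
module window_4x4 #(
    parameter PW    = 8,     // bits per pixel (assumed)
    parameter IMG_W = 64     // pixels per feature-map row (assumed)
)(
    input  wire             clk,
    input  wire             en,        // advance one pixel per cycle
    input  wire [PW-1:0]    pixel_in,  // streaming input, raster order
    output wire [16*PW-1:0] window     // 4x4 window, row-major, top row first
);
    // Three row-deep delay lines; the live input is the fourth row.
    reg [IMG_W*PW-1:0] row0, row1, row2;
    // Four-pixel horizontal taps, one per window row.
    reg [4*PW-1:0] tap0, tap1, tap2, tap3;

    always @(posedge clk) if (en) begin
        // Shift each row buffer by one pixel; the pixel falling out
        // of one buffer feeds the next (older) row buffer.
        row0 <= {row0[(IMG_W-1)*PW-1:0], pixel_in};
        row1 <= {row1[(IMG_W-1)*PW-1:0], row0[IMG_W*PW-1 -: PW]};
        row2 <= {row2[(IMG_W-1)*PW-1:0], row1[IMG_W*PW-1 -: PW]};
        // Each tap holds the 4 most recent pixels of its row, so the
        // four taps together form a vertically aligned 4x4 window.
        tap3 <= {tap3[3*PW-1:0], pixel_in};                // newest row
        tap2 <= {tap2[3*PW-1:0], row0[IMG_W*PW-1 -: PW]};
        tap1 <= {tap1[3*PW-1:0], row1[IMG_W*PW-1 -: PW]};
        tap0 <= {tap0[3*PW-1:0], row2[IMG_W*PW-1 -: PW]};  // oldest row
    end

    assign window = {tap0, tap1, tap2, tap3};
endmodule
```

Once the buffers are primed, every enabled cycle produces a fresh 4×4 window from a single new pixel, which is exactly the reuse the MAC array needs.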
3. Leaky ReLU in Hardware
Leaky ReLU is simple in software: y = x > 0 ? x : alpha * x. In hardware, multiplying by alpha (typically 0.01) is expensive if done with a full multiplier. Instead, I approximated it with an arithmetic right shift: x >>> 7 multiplies by 1/128 ≈ 0.0078, close enough to 0.01 for inference accuracy.
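In RTL, that approximation is just the sign bit steering a mux between x and its arithmetic right shift. A minimal sketch, with an assumed pixel width:

```verilog
// Shift-based leaky ReLU: y = x for x >= 0, y = x >>> 7 for x < 0.
// ">>>" on a signed operand is an arithmetic shift, so the sign bit
// is replicated for negative inputs. PW is an assumed width.
module leaky_relu #(
    parameter PW = 16
)(
    input  wire signed [PW-1:0] x,
    output wire signed [PW-1:0] y
);
    // x[PW-1] is the sign bit: 1 means x < 0 in two's complement
    assign y = x[PW-1] ? (x >>> 7) : x;
endmodule
```

No multiplier and no DSP block: synthesis reduces this to wiring plus one PW-bit 2:1 mux.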
4. Timing Closure
The entire pipeline had to finish within a budget of 1.25 × 1024² clock cycles, and the achievable clock period was limited by the critical path through the MAC array's carry chain. I resolved this by:
- Pipelining the adder tree with a register stage between the multipliers and the final accumulator
- Using the start/ready handshake to cleanly handle pipeline stalls when DRAM wasn't ready
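The pipelined reduction can be sketched as a two-stage registered adder tree over the 16 products; the width and names below are assumptions, not the original code:

```verilog
// Two-stage registered adder tree: stage 1 cuts the carry chain after
// four-way sums, stage 2 finishes the reduction one cycle later.
module adder_tree16 #(
    parameter W = 32                     // product width (assumed)
)(
    input  wire                clk,
    input  wire [16*W-1:0]     prod_flat, // 16 signed products, packed
    output reg  signed [W+3:0] sum        // 16 terms need 4 growth bits
);
    integer i;
    reg signed [W+1:0] partial [0:3];     // stage 1: four sums of four

    always @(posedge clk) begin
        // Stage 1 (registered): four-way sums break the long carry chain
        for (i = 0; i < 4; i = i + 1)
            partial[i] <= $signed(prod_flat[(4*i+0)*W +: W])
                        + $signed(prod_flat[(4*i+1)*W +: W])
                        + $signed(prod_flat[(4*i+2)*W +: W])
                        + $signed(prod_flat[(4*i+3)*W +: W]);
        // Stage 2 (registered): final four-way accumulation
        sum <= partial[0] + partial[1] + partial[2] + partial[3];
    end
endmodule
```

Because both stages use nonblocking assignments, a new set of products can enter every cycle; the extra cycle of latency is absorbed by the start/ready handshake rather than by stalling the MAC array.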
Lessons Learned
- Hardware makes you respect data movement. In software, memory access feels free. In RTL, every byte moved costs wires, muxes, and cycles.
- Handshakes are everything. The start/ready protocol between the DRAM controller and the compute pipeline was the most debugged part of the design.
- Synthesis numbers don't lie. Clock period, area, and power from the synthesis tool are the ultimate arbiter of design quality — not simulation waveforms.