RTL Design · Verilog · FPGA

Designing a CNN Accelerator in Verilog: Lessons from RTL

December 20, 2024 · 10 min read

What I learned building a complete CNN inference pipeline in Verilog — from DRAM interfacing to timing closure, and why hardware design forces you to think differently.

Why Build a CNN Accelerator in RTL?

Software engineers can implement a convolutional neural network in a few lines of PyTorch. But when you need inference at the edge — low power, low latency, deterministic timing — you need custom hardware. Building a CNN accelerator in Verilog teaches you to think about computation in terms of data flow, resource sharing, and cycle-accurate timing.

The Pipeline Architecture

My design implements a 4×4 convolution → Leaky ReLU → 2×2 average pooling pipeline. The key design decisions were:

1. Streaming from DRAM

Rather than loading the entire feature map into on-chip SRAM (expensive in area), the pipeline streams data directly from DRAM. This required implementing the following (a burst-read sketch follows the list):

  • CMD/ADDR/DQ_oe signals for the DRAM interface
  • Burst read/write operations to amortize row activation overhead
  • Modulo-8 row padding to align data for efficient burst access
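To make the interface concrete, here is a minimal sketch of a burst-read state machine. The CMD/ADDR/DQ_oe names come from the interface above, but the command encoding, burst length, and all module and parameter names are my own illustration — real DRAM timing (tRCD, CAS latency, refresh) is omitted:

```verilog
// Hypothetical burst-read FSM. Command encoding, addressing, and timing
// are illustrative only; a real controller needs tRCD/CAS-latency counters.
module dram_burst_reader #(
    parameter ROW_W = 13,
    parameter COL_W = 10,
    parameter BURST = 8            // 8-beat bursts -> modulo-8 row padding
) (
    input  wire             clk,
    input  wire             rst_n,
    input  wire             start,           // request one burst
    input  wire [ROW_W-1:0] row,
    input  wire [COL_W-1:0] col,             // must be 8-aligned
    output reg  [2:0]       CMD,             // NOP / ACT / RD / PRE
    output reg  [ROW_W-1:0] ADDR,
    output reg              DQ_oe,           // 0 = DRAM drives DQ (read)
    output reg              ready
);
    localparam NOP = 3'd0, ACT = 3'd1, RD = 3'd2, PRE = 3'd3;
    localparam S_IDLE = 2'd0, S_ACT = 2'd1, S_READ = 2'd2, S_PRE = 2'd3;

    reg [1:0] state;
    reg [3:0] beat;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            state <= S_IDLE; CMD <= NOP; ADDR <= 0;
            DQ_oe <= 1'b0;   ready <= 1'b1; beat <= 0;
        end else case (state)
            S_IDLE: if (start) begin             // open the target row
                CMD <= ACT; ADDR <= row; ready <= 1'b0; state <= S_ACT;
            end else CMD <= NOP;
            S_ACT: begin                         // issue the column read
                CMD   <= RD;
                ADDR  <= {{(ROW_W-COL_W){1'b0}}, col};
                DQ_oe <= 1'b0;                   // release DQ to the DRAM
                beat  <= 0; state <= S_READ;
            end
            S_READ: begin                        // BURST beats stream in on DQ
                CMD <= NOP; beat <= beat + 1;
                if (beat == BURST-1) begin CMD <= PRE; state <= S_PRE; end
            end
            S_PRE: begin                         // close the row, accept next
                CMD <= NOP; ready <= 1'b1; state <= S_IDLE;
            end
        endcase
    end
endmodule
```

Padding each row to a multiple of the burst length (the modulo-8 padding above) guarantees a burst never straddles two rows, so every burst costs exactly one activate/precharge pair.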

2. The Convolution Engine

The 4×4 convolution uses a systolic-style multiply-accumulate array. Each clock cycle, a new input pixel shifts in while partial products accumulate. The key challenge is data reuse — each input pixel participates in multiple output computations.

I used a line buffer approach: four shift registers hold one row of pixels each, feeding the 4×4 window into the MAC array. When the window slides horizontally, only one new pixel enters per cycle.
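Here is a minimal sketch of that front end, assuming an 8-bit pixel stream and a parameterized row width; the module and signal names are my own, and edge handling is omitted:

```verilog
// Four row-long shift registers feeding a 4x4 window, one pixel per cycle.
// Widths and names are illustrative; boundary handling is not shown.
module line_buffer_4x4 #(
    parameter PIX_W = 8,
    parameter IMG_W = 32                    // pixels per row
) (
    input  wire                clk,
    input  wire                valid_in,
    input  wire [PIX_W-1:0]    pixel_in,
    output wire [16*PIX_W-1:0] window       // 4x4 window, row-major
);
    // rows[3] receives the live stream; pixels cascade up into older rows
    reg [PIX_W-1:0] rows [0:3][0:IMG_W-1];
    integer r, c;

    always @(posedge clk) begin
        if (valid_in) begin
            for (r = 0; r < 4; r = r + 1)
                for (c = IMG_W-1; c > 0; c = c - 1)
                    rows[r][c] <= rows[r][c-1];
            rows[3][0] <= pixel_in;
            rows[2][0] <= rows[3][IMG_W-1];
            rows[1][0] <= rows[2][IMG_W-1];
            rows[0][0] <= rows[1][IMG_W-1];
        end
    end

    // The window taps the oldest four pixels of each row
    genvar gr, gc;
    generate
        for (gr = 0; gr < 4; gr = gr + 1) begin : g_row
            for (gc = 0; gc < 4; gc = gc + 1) begin : g_col
                assign window[(gr*4 + gc + 1)*PIX_W - 1 -: PIX_W] =
                       rows[gr][IMG_W - 4 + gc];
            end
        end
    endgenerate
endmodule
```

This is where the data reuse comes from: each pixel is fetched from DRAM once, then participates in up to 16 window positions as it drifts through the four rows.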

3. Leaky ReLU in Hardware

Leaky ReLU is simple in software: y = x > 0 ? x : alpha * x. In hardware, the multiplication by alpha (typically 0.01) is expensive if done with a full multiplier. Instead, I approximated it with an arithmetic right shift: x >>> 7 multiplies by 2⁻⁷ ≈ 0.0078, close enough to 0.01 for inference accuracy.
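In RTL the whole activation collapses to a sign-bit mux plus a shift. A minimal sketch, assuming 16-bit signed fixed-point values (the width and names are my choice, not the original design's):

```verilog
// Leaky ReLU without a multiplier: sign-bit mux plus arithmetic shift.
// 16-bit signed fixed point is an assumption, not the original width.
module leaky_relu #(
    parameter W = 16
) (
    input  wire signed [W-1:0] x,
    output wire signed [W-1:0] y
);
    // x >= 0: pass through; x < 0: x >>> 7, i.e. roughly 0.0078 * x
    assign y = x[W-1] ? (x >>> 7) : x;
endmodule
```

Because it is purely combinational, it slots between the MAC array and the pooling stage without adding a cycle of latency.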

4. Timing Closure

The entire pipeline had to finish within a budget of 1.25 × 1024² cycles. The critical path ran through the MAC array's carry chain. I resolved this by:

  • Pipelining the adder tree with a register stage between the multipliers and the final accumulator
  • Using the start/ready handshake to cleanly handle pipeline stalls when DRAM wasn't ready (both fixes are sketched below)
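Here is a sketch of the datapath with both fixes. The valid/stall plumbing stands in for the start/ready handshake; the widths, names, and single-cycle adder tree are my simplifications:

```verilog
// Pipelined MAC datapath: a register stage between the 16 multipliers
// and the adder tree, plus backpressure. All names are illustrative.
module mac16_pipelined #(
    parameter PIX_W = 8,
    parameter W_W   = 8,
    parameter ACC_W = 24
) (
    input  wire                        clk,
    input  wire                        rst_n,
    input  wire                        in_valid,   // upstream "start"
    output wire                        in_ready,   // backpressure upstream
    input  wire signed [16*PIX_W-1:0]  pixels,     // flattened 4x4 window
    input  wire signed [16*W_W-1:0]    weights,
    input  wire                        out_ready,  // downstream "ready"
    output reg                         out_valid,
    output reg  signed [ACC_W-1:0]     sum
);
    wire stall = out_valid && !out_ready;   // hold the pipe when blocked
    assign in_ready = !stall;

    // Stage 1: 16 products, registered to break the carry-chain path
    reg signed [PIX_W+W_W-1:0] prod [0:15];
    reg                        s1_valid;
    integer i;
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) s1_valid <= 1'b0;
        else if (!stall) begin
            s1_valid <= in_valid;
            for (i = 0; i < 16; i = i + 1)
                prod[i] <= $signed(pixels[i*PIX_W +: PIX_W]) *
                           $signed(weights[i*W_W  +: W_W]);
        end
    end

    // Stage 2: adder tree collapses the 16 products in one cycle
    integer j;
    reg signed [ACC_W-1:0] acc;
    always @(*) begin
        acc = 0;
        for (j = 0; j < 16; j = j + 1)
            acc = acc + prod[j];
    end
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) out_valid <= 1'b0;
        else if (!stall) begin
            out_valid <= s1_valid;
            sum       <= acc;
        end
    end
endmodule
```

The register on prod[] cuts the multipliers out of the adder tree's carry chain, and the single stall signal freezes both stages together, so no data is dropped when the downstream side deasserts ready.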

Lessons Learned

  1. Hardware makes you respect data movement. In software, memory access feels free. In RTL, every byte moved costs wires, muxes, and cycles.
  2. Handshakes are everything. The start/ready protocol between the DRAM controller and the compute pipeline was the most debugged part of the design.
  3. Synthesis numbers don't lie. Clock period, area, and power from the synthesis tool are the ultimate arbiter of design quality — not simulation waveforms.