DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

Xiaofan Zhang, Junsong Wang, Chao Zhu, Yonghua Lin, Jinjun Xiong, Wen-mei Hwu, Deming Chen

University of Illinois at Urbana-Champaign, IBM Research

Abstract

Building a high-performance FPGA accelerator for Deep Neural Networks (DNNs) often requires RTL programming, hardware verification, and precise resource allocation, all of which are time-consuming and challenging even for seasoned FPGA developers. To bridge the gap between fast DNN construction in software (e.g., Caffe, TensorFlow) and slow hardware implementation, we propose DNNBuilder, a tool that automatically builds high-performance DNN hardware accelerators on FPGAs. We demonstrate DNNBuilder on four DNNs (AlexNet, ZF, VGG16, and YOLO) and two FPGAs (XC7Z045 and KU115), which correspond to edge and cloud computing, respectively. The proposed fine-grained layer-based pipeline architecture and column-based cache scheme reduce latency by 7.7x and BRAM utilization by 43x compared to conventional designs. Compared to published FPGA-based classification-oriented DNN accelerators, we achieve the best performance (up to 5.15x faster) and efficiency (up to 5.88x more efficient) for both edge and cloud computing.
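The following minimal sketch illustrates why caching columns instead of whole feature maps saves on-chip memory. For illustration only, it assumes that a convolution layer with kernel size K and stride S keeps K + S input columns on chip: K columns feed the current window and S new columns stream in. The function names are hypothetical. This is not DNNBuilder's generated code, only a toy buffer-size model.

```python
"""Toy model of on-chip buffering for one convolution layer's input.

Assumption (illustrative, not taken from DNNBuilder's source): the
column-based scheme keeps only kernel + stride input columns on chip,
while a conventional design buffers the entire input feature map.
"""


def full_map_buffer(height: int, width: int, channels: int) -> int:
    """Elements needed to hold the whole input feature map on chip."""
    return height * width * channels


def column_buffer(height: int, channels: int, kernel: int, stride: int) -> int:
    """Elements needed for a sliding window of kernel + stride columns.

    `kernel` columns feed the current output column while `stride`
    new columns stream in for the next one.
    """
    return height * channels * (kernel + stride)


def buffer_reduction(height: int, width: int, channels: int,
                     kernel: int, stride: int) -> float:
    """Full-map buffer size divided by column buffer size (= width / (kernel + stride))."""
    return (full_map_buffer(height, width, channels)
            / column_buffer(height, channels, kernel, stride))


if __name__ == "__main__":
    # Hypothetical VGG-style layer: 224x224x64 input, 3x3 kernel, stride 1.
    print(full_map_buffer(224, 224, 64))               # 3211264
    print(column_buffer(224, 64, kernel=3, stride=1))  # 57344
    print(buffer_reduction(224, 224, 64, 3, 1))        # 56.0
```

In this toy model the saving depends only on W / (K + S), so layers with wide feature maps benefit most. The 43x BRAM figure above was measured on the complete designs, and this sketch does not reproduce it.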

Results

DNNBuilder's performance peaks at 4218 GOPS when running object-detection DNNs, which is the highest throughput reported to the best of our knowledge. It provides millisecond-scale real-time performance for processing HD video and delivers up to 4.35x higher efficiency than GPU-based solutions.
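Throughput in GOPS, frame rate, and efficiency are linked by simple arithmetic. The sketch below makes those relationships explicit. The only figure taken from this page is the 4218 GOPS peak. The per-frame workload (100 GOP) and power draw (20 W) in the example are hypothetical placeholders, not DNNBuilder measurements.

```python
def frames_per_second(gops: float, gop_per_frame: float) -> float:
    """Sustained frame rate when each frame costs `gop_per_frame` giga-operations."""
    return gops / gop_per_frame


def gops_per_watt(gops: float, watts: float) -> float:
    """Energy efficiency: throughput divided by power draw."""
    return gops / watts


def efficiency_gain(ours_gops_per_watt: float, baseline_gops_per_watt: float) -> float:
    """How many times more efficient one design is than a baseline."""
    return ours_gops_per_watt / baseline_gops_per_watt


if __name__ == "__main__":
    peak_gops = 4218.0    # peak throughput reported above
    workload_gop = 100.0  # hypothetical: giga-operations per detection frame
    power_w = 20.0        # hypothetical: board power in watts
    print(f"{frames_per_second(peak_gops, workload_gop):.2f} FPS")  # 42.18 FPS
    print(f"{gops_per_watt(peak_gops, power_w):.1f} GOPS/W")        # 210.9 GOPS/W
```

An efficiency comparison such as "up to 4.35x" is a ratio of two `gops_per_watt` values, which is what `efficiency_gain` computes.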

  • Figure: Comparison using embedded FPGAs for edge-AI solutions
  • Figure: Comparison using mid-range FPGAs for cloud-AI solutions
  • Figure: Comparison to GPU solutions

  • DNNBuilder won the IEEE/ACM William J. McCalla ICCAD Best Paper Award 


Demo

Following the DNNBuilder design flow, we generated an FPGA-based YOLO accelerator that performs real-time pedestrian, cyclist, and car detection in real-life scenarios. With 1280×384 inputs, the accelerator achieves 22.1 FPS on an embedded FPGA board (Xilinx ZC706).
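As a quick sanity check, the demo's 1280×384 input size and 22.1 FPS frame rate imply the per-frame time budget and pixel throughput computed below. This is plain arithmetic on the reported figures, and the helper names are hypothetical. In a pipelined accelerator, the interval between output frames is not necessarily the end-to-end latency of a single frame.

```python
def frame_interval_ms(fps: float) -> float:
    """Time between consecutive output frames, in milliseconds."""
    return 1000.0 / fps


def pixel_rate(width: int, height: int, fps: float) -> float:
    """Input pixels processed per second."""
    return width * height * fps


if __name__ == "__main__":
    width, height, fps = 1280, 384, 22.1  # demo figures on the ZC706
    print(f"{frame_interval_ms(fps):.1f} ms per frame")            # 45.2 ms per frame
    print(f"{pixel_rate(width, height, fps) / 1e6:.2f} Mpixel/s")  # 10.86 Mpixel/s
```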

Invited Talks about DNNBuilder

Conference / Seminar; Location; Date

  • IBM Research AI Horizons Colloquium; IBM Research (Cambridge); Oct. 2018
  • C3SR Seminar; University of Illinois at Urbana-Champaign; Nov. 2018
  • IBM Workshop on Architectures for Secure, Cognitive, and Datacenter Computing; IBM Research (Yorktown); Nov. 2018
  • Seminar; Google Brain; Nov. 2019

Citation

If you find DNNBuilder useful, please cite the DNNBuilder paper: